Read Data from HTML Files

Learn how to read data from the HTML file format.

We'll cover the following

Markup language files
HTML file format
Read from HTML files

Markup language files

A markup language is a computer language that uses tags to separate document elements, giving the information a clear, sectioned structure. Markup files are plain text, so they are human-readable and can be opened with most text editors. Of the many markup languages available, we'll cover the two most popular: HTML and XML.

HTML file format

HTML stands for HyperText Markup Language and is the standard markup language for creating web pages. The elements in an HTML file describe the page's structure so that the browser can display its contents correctly.
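To make the idea of elements and tags concrete, here is a minimal sketch that prints the nesting of tags in a small HTML document using Python's standard-library html.parser module. The page content and the TagOutline class are hypothetical, made up purely for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page used only for illustration
page = """<html>
<body>
<h1>Continents</h1>
<table>
<tr><th>Continent</th><th>Area</th></tr>
</table>
</body>
</html>"""


class TagOutline(HTMLParser):
    """Records each opening tag, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + f"<{tag}>")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


parser = TagOutline()
parser.feed(page)
print("\n".join(parser.outline))
```

The indented outline shows that the table element sits inside body, and that each row (tr) holds its header cells (th). This nesting is the structure a browser, or a parser, relies on.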

Read from HTML files

The pandas read_html() function reads HTML tables by searching for <table> tags and returns their contents as a list of pandas DataFrames. It works on local HTML files as well as on URLs.

For example, say we have a local HTML file saved from Wikipedia called continents.html, which contains a table of area and population estimates for the seven continents. By using read_html(), we can parse the HTML table data into pandas DataFrames. Because the output for this example is a list with only one element (the continents table), we can retrieve the table we want directly by accessing index 0 of the list.
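The following is a minimal sketch of that workflow. Because the original continents.html file isn't available here, the code first writes a small stand-in file with placeholder rows (the continent names and numbers are invented, not real estimates) and then reads it back. It also assumes that an HTML parser such as lxml, or beautifulsoup4 with html5lib, is installed, since read_html() depends on one.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for continents.html: a tiny page with one <table>.
# The values are placeholders, not real area or population estimates.
sample_html = """
<html>
  <body>
    <h1>Continents</h1>
    <table>
      <tr><th>Continent</th><th>Area</th><th>Population</th></tr>
      <tr><td>Continent A</td><td>100</td><td>1000</td></tr>
      <tr><td>Continent B</td><td>200</td><td>2000</td></tr>
    </table>
  </body>
</html>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    html_path = Path(tmp_dir) / "continents.html"
    html_path.write_text(sample_html, encoding="utf-8")

    # read_html() returns a list with one DataFrame per <table> it finds
    tables = pd.read_html(str(html_path))

# The file contains a single table, so it is at index 0 of the list
df = tables[0]
print(len(tables))
print(df)
```

If a page contains several tables, the returned list has several DataFrames, and we'd pick the one we need by its index.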
