Read Data from HTML Files
Learn how to read data from the HTML file format.
Markup language files
A markup language is a computer language that uses tags to separate document elements, giving the document a clear structure that divides information into sections. Markup files are plain text, so they are human-readable and can be opened with most text editors. While there are numerous markup languages, we'll cover the two most popular ones: HTML and XML.
HTML file format
HTML stands for HyperText Markup Language and is the standard markup language for creating web pages. The elements in an HTML file describe the page's structure so that the browser can display its contents correctly.
<!DOCTYPE html>
<html>
  <head>
    <title>Welcome to Educative</title>
  </head>
  <body>
    <h1>Advanced Pandas - Going Beyond the Basics</h1>
    <p>You are currently on Chapter 2 of the course</p>
  </body>
</html>
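To make the idea of structure concrete, here is a minimal sketch that walks the sample page above and prints its elements by nesting depth. It uses Python's built-in html.parser module; the OutlineParser class and the SAMPLE_HTML variable are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser

# The sample page from above, stored as a string.
SAMPLE_HTML = (
    "<!DOCTYPE html><html><head><title>Welcome to Educative</title></head>"
    "<body><h1>Advanced Pandas - Going Beyond the Basics</h1>"
    "<p>You are currently on Chapter 2 of the course</p></body></html>"
)


class OutlineParser(HTMLParser):
    """Hypothetical helper: records each opening tag with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Store the tag at the current depth, then descend one level.
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag moves back up one level.
        self.depth -= 1


parser = OutlineParser()
parser.feed(SAMPLE_HTML)
parser.close()

# Print the element tree as an indented outline.
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

The indentation of the printed outline mirrors the nesting of the tags, which is the structure a browser relies on to lay out the page.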
Read from HTML files
The pandas read_html() function reads HTML tables by searching for <table> tags and returns their contents as a list of pandas DataFrames. Besides URLs, it also works on local HTML files.
For example, say we have a local HTML file saved from Wikipedia called continents.html containing tabular data of area and population estimates of seven continents. By using read_html(), we can transcribe the HTML table data into pandas DataFrames. Because the output for this example is a list with only one element (the continents table), we can directly retrieve the table we want by accessing index 0 of the list as shown below:
import pandas as pd

# Define path to HTML file
html_path = '../usr/local/data/html/continents.html'

# Retrieve first element from list of HTML tables
continents_df = pd.read_html(html_path)[0]

# Display table contents as HTML
print(continents_df.to_html())
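Because continents.html is only available inside the course environment, the following is a minimal, self-contained sketch of the same pattern. It assumes a made-up HTML string (html_doc, with placeholder values rather than real data) wrapped in StringIO, and that an HTML parser such as lxml is installed alongside pandas. It illustrates that read_html() returns a list with one DataFrame per table found.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with two tables; the values are placeholders, not real data.
html_doc = """
<html><body>
<table>
  <tr><th>name</th><th>value</th></tr>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
</table>
<table>
  <tr><th>code</th><th>score</th></tr>
  <tr><td>x</td><td>1.5</td></tr>
</table>
</body></html>
"""

# read_html() returns a list of DataFrames, one per <table> element.
tables = pd.read_html(StringIO(html_doc))
print(type(tables), len(tables))

# Index into the list to pick the table we want.
first_df = tables[0]
print(first_df)
```

When a page contains several tables, inspecting the length of the list and the first few rows of each entry is a quick way to find the right index.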
Note: We can expect to do some manual data cleanup after using the read_html() function, such as assigning column names, converting column data types, etc.
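As a rough illustration of that cleanup, the sketch below parses a small made-up table that has no header row and stores percentages as text. The HTML string, the column names, and the values are all assumptions chosen for the example; it also assumes an HTML parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Made-up table without a header row; values are placeholders.
raw_html = """
<table>
  <tr><td>Item A</td><td>1200</td><td>12.5%</td></tr>
  <tr><td>Item B</td><td>950</td><td>7.0%</td></tr>
</table>
"""

raw_df = pd.read_html(StringIO(raw_html))[0]

# Work on a copy so the raw table stays available for comparison.
clean_df = raw_df.copy()

# Assign descriptive column names (hypothetical names for this example).
clean_df.columns = ["item", "count", "share_pct"]

# Strip the percent sign and convert the text column to floats.
clean_df["share_pct"] = clean_df["share_pct"].str.rstrip("%").astype(float)

print(clean_df.dtypes)
print(clean_df)
```

Real tables scraped from the web often need more steps than this, such as dropping footnote markers or flattening multi-level headers.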
If we know that a table has specific HTML attributes, we can use the attrs parameter to retrieve it specifically. For example, the contents of a table with class name wikitable can be read with the following code:
# Specify attributes of table and index of table to retrieve
df = pd.read_html(html_path, attrs={'class': 'wikitable'})[0]
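To see the effect of attrs without the course file, here is a small self-contained sketch. It assumes a made-up page string with two tables, only one of which has the class wikitable; the other class name (navbox) and all values are placeholders. As before, it assumes an HTML parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Made-up page with two tables; only the second has class "wikitable".
page = """
<html><body>
<table class="navbox">
  <tr><th>link</th></tr>
  <tr><td>Home</td></tr>
</table>
<table class="wikitable">
  <tr><th>name</th><th>value</th></tr>
  <tr><td>a</td><td>1</td></tr>
</table>
</body></html>
"""

# Without attrs, every table on the page is returned.
all_tables = pd.read_html(StringIO(page))

# With attrs, only tables whose attributes match are returned.
wiki_tables = pd.read_html(StringIO(page), attrs={"class": "wikitable"})
print(len(all_tables), len(wiki_tables))

wiki_df = wiki_tables[0]
print(wiki_df)
```

Filtering by attribute is usually more robust than relying on a table's position in the list, since the position can change when the page layout changes.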