Read Data from the Web
Learn to read data from websites on the internet.
Read from online files (CSV and JSON)
The exploding volumes of data and activity on the internet mean that websites are a fantastic source of data for us to utilize. If we have the Uniform Resource Locator (URL) to the website hosting the remote CSV or JSON files we want, the read_csv() and read_json() functions will do the trick. Let's say we'd like to read a raw CSV file uploaded to a public GitHub repository. We can do so by passing the URL to the remote CSV file as an argument to read_csv(), as shown in the code below:
# Import the pandas library
import pandas as pd

# Define GitHub URL containing the CSV file
url = 'https://raw.githubusercontent.com/kennethleungty/Simulated-Annealing-Feature-Selection/main/data/raw/train.csv'

# Parse URL into pandas function
df = pd.read_csv(url)

# Display first 5 rows
print(df.head())
Besides HTTP URLs, read_csv() (or read_json() if dealing with JSON files) is also capable of handling other valid URL schemes, such as ftp, s3, and gs. Let's suppose we wish to read a CSV file from a public AWS S3 bucket for the NOAA Water-Column Sonar Data Archive. We next specify the S3 bucket URL to the CSV file while establishing an anonymous connection via the storage_options parameter because we aren't using any credentials for bucket access:
# Ensure that we have run: pip install fsspec s3fs
# Define S3 bucket URL
s3_url = 's3://noaa-wcsd-pds/data/processed/SH1305/bottom/SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv'

# Pass URL into pandas function
df = pd.read_csv(
    s3_url,
    storage_options={"anon": True}  # Anonymous connection to S3
)

# Display first 5 rows
print(df.head())
Note: To access data in private S3 buckets, we must provide additional AWS credentials, along with the use of the handy boto3 package. boto3 is a popular Python software development kit (SDK) for interacting with AWS services using Python code.
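Alternatively, as a minimal sketch, credentials can also be passed through the same storage_options parameter, which pandas forwards to s3fs behind the scenes. The bucket name, file path, and credential values below are hypothetical placeholders:

# Minimal sketch: reading from a private S3 bucket via storage_options
# The bucket, path, and credentials below are hypothetical placeholders;
# in practice, prefer environment variables or IAM roles over inline keys
import pandas as pd

private_url = 's3://my-private-bucket/data/sales.csv'  # hypothetical path

df = pd.read_csv(
    private_url,
    storage_options={
        "key": "YOUR_AWS_ACCESS_KEY_ID",         # placeholder
        "secret": "YOUR_AWS_SECRET_ACCESS_KEY",  # placeholder
    }
)
print(df.head())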
Beyond CSV files, we can easily read JSON files online with the read_json() function. For example, say we want to import the information on the latest blockchain block on Bitcoin, which comes in JSON format on the Blockchain.info website. Here is the code to read the data stored online in JSON format:
# Define URL with the JSON data
json_url = 'https://blockchain.info/latestblock'

# Parse URL into function
df = pd.read_json(json_url)

# Display first 5 rows
print(df.head())
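If the JSON payload is deeply nested, read_json() may not arrange it the way we want. As a minimal sketch, assuming the requests package is available, we can fetch the raw JSON ourselves and flatten it with pd.json_normalize():

# Minimal sketch: fetch JSON manually and flatten nested fields
# Assumes the requests package is installed
import pandas as pd
import requests

# Fetch the raw JSON payload from the URL
response = requests.get('https://blockchain.info/latestblock')
data = response.json()

# Flatten nested dictionaries into DataFrame columns
df = pd.json_normalize(data)
print(df.head())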
Read from HTML tables on websites
Sometimes, we may want to retrieve data from HTML tables available on public websites. For example, we may wish to obtain data on Amazon’s financial performance from the tables displayed on Google Finance.
We can retrieve this tabular data directly with read_html(). After parsing the website URL, the function searches for <table>-related HTML tags and returns the corresponding content within those tags. Note that read_html() relies on an HTML parser such as lxml or html5lib under the hood, so one of these needs to be installed:
# URL to Amazon page on Google Finance
url = 'https://www.google.com/finance/quote/AMZN:NASDAQ'

# Retrieve HTML tables directly from website
output = pd.read_html(url)

# Display output
print(output)
After executing the code above, we'll notice that the output is a list of the table contents. We can then use square brackets to access each table in the list. For example, to access the first identified table from the Google Finance page (i.e., the quarterly income statement), we run the following code:
# Access contents of first table
url = 'https://www.google.com/finance/quote/AMZN:NASDAQ'
table1 = pd.read_html(url)[0]

# Display table contents
print(table1)
To retrieve data from the other tables, we access the corresponding item with the appropriate index of the list. For example, we can retrieve the next table (i.e., the quarterly balance sheet) by incrementing the index by one, that is, pd.read_html(url)[1].
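A short snippet along these lines would do it (keeping in mind that index positions depend on the page's current layout):

# Access contents of second table (here, the quarterly balance sheet)
url = 'https://www.google.com/finance/quote/AMZN:NASDAQ'
table2 = pd.read_html(url)[1]

# Display table contents
print(table2)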
We can see that the output above matches the actual table displayed on the Google Finance page.
From the output displayed, it’s clear that we have to process the extracted data to resolve several data quality issues, such as unrecognized characters and concatenated column text. It’s a good reminder that the data extracted from the web won’t always come in the perfect shape or form for immediate data analysis, and further data cleaning is necessary.
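As a rough sketch of what such cleaning might involve, assuming hypothetical column labels and a simple rule for stripping non-ASCII characters (both would need to be adapted to the actual extracted table):

# Rough cleaning sketch; column labels and rules are hypothetical
# Assign clearer names to the extracted columns
table1.columns = ['item', 'latest_quarter', 'yoy_change']

# Strip non-ASCII characters from a text column
table1['item'] = table1['item'].astype(str).str.replace(r'[^\x00-\x7F]+', '', regex=True)

print(table1.head())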
If read_html() returns too many tables, we can use the match parameter to find tables that contain a specific string. For example, we can define match='Revenue' to retrieve tables containing the text 'Revenue', as shown below:
# Retrieve only the tables that contain the text 'Revenue'
url = 'https://www.google.com/finance/quote/AMZN:NASDAQ'
print(pd.read_html(url, match='Revenue'))
Read from clipboard
If we want to quickly copy and paste tabular data directly from a website without using the read_html() function, we can do so with the read_clipboard() function. Instead of copying data and pasting it into an Excel spreadsheet, read_clipboard() directly reads the data we have copied (saved onto our clipboard) and then passes it into the read_csv() function for us. Let's say we want to retrieve information about English Premier League managers from Wikipedia.
We can highlight the table contents on the website and then press "Ctrl (or Command) + C" on our keyboard to copy the data onto the clipboard. From there, we execute the read_clipboard() function (without any arguments) to transcribe the copied data into a pandas DataFrame:
# Read saved tabular data from clipboard
pd.read_clipboard()
The output from read_clipboard() will show the clipboard data now successfully stored in a DataFrame.