Loading a CSV Dataset From a URL

Learn to import a CSV dataset from a URL.

Loading CSV files

The CSV format is popular for storing and transferring data. Files with a .csv extension are plain text files containing data records with comma-separated values.

Let’s see how we can analyze data from a CSV file using Python by loading the file from a URL.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/CourseMaterial/DataWrangling/main/flowerdataset.csv')
print(df.head())

Let’s review the code line by line:

  • Line 1: We import the pandas library using import pandas as pd.

  • Line 2: We pass the URL of the dataset, enclosed in quotes, to the read_csv() function and save the result in the df variable.

Note: The read_csv() function returns a DataFrame, which we store in the df variable. A DataFrame is a tabular data structure that holds data in rows and columns.

  • Line 3: We print the first five records of df using the head() method.

Note: We can print a different number of rows by passing that number as an argument to head(), e.g., df.head(10).
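The effect of the argument to head() can be tried offline. The following is a minimal sketch that assumes a small inline sample (the hypothetical sample_csv string, with illustrative values) in place of the course dataset, so it runs without network access.

```python
import io

import pandas as pd

# Hypothetical inline sample standing in for the flower dataset
# (the values are illustrative, not taken from the course file).
sample_csv = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""

# read_csv() accepts a file-like object as well as a URL or a path.
df = pd.read_csv(io.StringIO(sample_csv))

first_five = df.head()    # no argument: the first five rows
first_three = df.head(3)  # explicit row count

print(first_five)
print(first_three)
```

Swapping io.StringIO(sample_csv) for the dataset URL should behave the same way, assuming the URL is reachable.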

We observe the following facts from the output:

  • The output contains the first five records of the DataFrame, df. These records help us understand how the rest of the data looks.

  • The first column is called the index column and contains the values 0, 1, 2, 3, and 4.

    • Each row within the DataFrame is assigned a unique index value.

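    • We usually don't use the index column in our analysis or when providing recommendations because pandas generates it to label the rows; it isn't part of the original data.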

  • Besides the index, the DataFrame, df, contains five columns: sepal_length, sepal_width, petal_length, petal_width, and species.
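The index and the column names can also be inspected directly rather than read off the printed output. This is a small sketch, again assuming a hypothetical inline sample (sample_csv, with illustrative values) instead of the course dataset.

```python
import io

import pandas as pd

# Hypothetical inline sample standing in for the flower dataset.
sample_csv = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
"""

df = pd.read_csv(io.StringIO(sample_csv))
preview = df.head()

# The index is generated by pandas; it is not a column of the CSV file.
index_labels = list(preview.index)
column_names = list(df.columns)

print(index_labels)
print(column_names)
```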

Parameters

The read_csv() function has many parameters that control how data is retrieved from the data source. Three popular parameters are usecols, nrows, and skiprows.

We'll now see how we can apply these three parameters.

The usecols parameter

To save memory, we can load only the dataset columns we want to work with by setting the usecols parameter to a list of those columns.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/CourseMaterial/DataWrangling/main/flowerdataset.csv', usecols = ['sepal_length', 'sepal_width'])
print(df.head())

Let’s review the code line by line:

  • Line 1: We import the pandas library.

  • Line 2: While reading the dataset, we set the usecols parameter inside the read_csv() function and assign it a list containing the desired dataset columns, such as ['sepal_length', 'sepal_width'].

Note: We can also pass zero-based column numbers instead of names, e.g., pd.read_csv('http://bit.ly/flowerdataset', usecols = [0, 1]).

  • Line 3: We preview the dataset.

From the output, we see that only the desired columns, sepal_length and sepal_width, were selected. If we want to select other columns, we add them to the list assigned to usecols.
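To check that selecting columns by name and by position gives the same result, here is a minimal sketch. It assumes a hypothetical inline sample (sample_csv, with illustrative values) whose first two columns are sepal_length and sepal_width, matching the column order listed earlier in this lesson.

```python
import io

import pandas as pd

# Hypothetical inline sample standing in for the flower dataset.
sample_csv = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# Select the same two columns, first by name and then by zero-based position.
by_name = pd.read_csv(io.StringIO(sample_csv), usecols=["sepal_length", "sepal_width"])
by_position = pd.read_csv(io.StringIO(sample_csv), usecols=[0, 1])

print(by_name)
print(by_name.equals(by_position))
```

Names are usually the safer choice because positions silently select different columns if the file's column order changes.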

The nrows parameter

Another useful parameter for analysis is nrows. We use this parameter to set how many rows of data to load from the data source instead of loading the entire dataset.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/CourseMaterial/DataWrangling/main/flowerdataset.csv', nrows = 15)
print(df)

Let’s review the code line by line:

  • Line 1: We import the pandas library.

  • Line 2: When reading data from the data source using the read_csv() function, we use the nrows parameter to set the number of rows we want to work with.

  • Line 3: We preview the dataset.

From the output, we can see 15 records from our dataset. This is because we set the nrows parameter to 15.
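The same behavior can be confirmed by checking the shape of the result. This sketch assumes a hypothetical inline sample (sample_csv, with illustrative values) and a smaller row count than the lesson's example.

```python
import io

import pandas as pd

# Hypothetical inline sample standing in for the flower dataset.
sample_csv = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
"""

full = pd.read_csv(io.StringIO(sample_csv))

# nrows counts data rows only; the header line is not included in the count.
partial = pd.read_csv(io.StringIO(sample_csv), nrows=3)

print(full.shape)     # (rows, columns) for the whole sample
print(partial.shape)  # (rows, columns) after limiting the rows
```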

The skiprows parameter

Sometimes we might want to skip the first row of a dataset if it's irrelevant to our analysis. To do this, we use the skiprows parameter as shown below.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/CourseMaterial/DataWrangling/main/flowerdataset.csv', skiprows = 1)
print(df)

Let’s review the code line by line:

  • Line 1: We import the pandas library.

  • Line 2: While reading the dataset with the read_csv() function, we pass the skiprows parameter and set its value to the number of rows we want to skip from the beginning of the file.

  • Line 3: We preview the resulting dataset.

As we can see from the output, the resulting dataset doesn't have the original column names. Setting skiprows to 1 skips the first row of the CSV file, which contains the column names. As a result, pandas treats the second row of the file as the header, so its values become the column names, and index 0 is assigned to the third row of the file, which is now the first data row of the DataFrame.

Dropping the header row this way turns the first data record into column names, so the DataFrame loses one record and its column labels become misleading, which can affect the accuracy of any analysis. Skipping the first row is only appropriate when that row doesn't hold the real header, for example, when the first row of a CSV file contains null values and the second row contains the actual column names.
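Both situations can be reproduced with small inline files. The following is a minimal sketch that assumes two hypothetical samples with illustrative values: clean_csv starts with the real header, while padded_csv has an empty first row above the header.

```python
import io

import pandas as pd

# Hypothetical sample whose first row is the real header.
clean_csv = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# Hypothetical sample with an empty first row above the real header.
padded_csv = ",,,,\n" + clean_csv

# Skipping the header of a clean file promotes the first record to column names.
lost_header = pd.read_csv(io.StringIO(clean_csv), skiprows=1)

# Skipping the empty row of the padded file restores the real column names.
fixed_header = pd.read_csv(io.StringIO(padded_csv), skiprows=1)

print(list(lost_header.columns))
print(list(fixed_header.columns))
```

In the first case, one record is consumed as the header and only two data rows remain; in the second, all three records are kept.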