Exercise: Verifying Basic Data Integrity
Learn how to verify the integrity of the data using basic pandas functions.
We'll cover the following
In this exercise, we will perform a basic check on whether our dataset contains what we expect and verify whether there is the correct number of samples.
Data consistency assessment
The data is supposed to have observations for 30,000 credit accounts. While there are 30,000 rows, we should also check whether there are 30,000 unique account IDs. It’s possible that, if the SQL query used to generate the data was run on an unfamiliar schema, values that are supposed to be unique are in fact not unique. To examine this, we can check if the number of unique account IDs is the same as the number of rows. Perform the following steps to complete the exercise:
-
Import pandas, load the data, and examine the column names by running the following command in a cell, using “Shift + Enter”:
import pandas as pd df = pd.read_excel('default_of_credit_card_clients'\ '__courseware_version_1_21_19.xls') df.columns
The
columns
method of the DataFrame is employed to examine all the column names. You will obtain the following output once you run the cell:Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'default payment next month'], dtype='object')
We can see that all column names are listed in the output. The account
ID
column is referenced asID
. The remaining columns appear to be our features, with the last column being the response variable. Let’s quickly review the dataset information that was given to us by the client:-
LIMIT_BAL
: Amount of credit provided (in New Taiwanese (NT) dollar) including individual consumer credit and the family (supplementary) credit. -
SEX
: Gender (1 = male; 2 = female)> Note: We will not be using the gender data to decide credit worthiness owing to ethical considerations.
-
EDUCATION
: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). -
MARRIAGE
: Marital status (1 = married; 2 = single; 3 = others). -
AGE
: Age (year). -
PAY_1
–PAY_6
: A record of past payments. Past monthly payments, recorded from April to September, are stored in these columns.PAY_1
represents the repayment status in September;PAY_2
is the repayment status in August; and so on up toPAY_6
, which represents the repayment status in April.The measurement scale for the repayment status is as follows: -1 = pay duly; 1 = payment delay for 1 month; 2 = payment delay for 2 months; and so on up to 8 = payment delay for 8 months; 9 = payment delay for 9 months and above.
-
BILL_AMT1
–BILL_AMT6
: Bill statement amount (in NT dollar).BILL_AMT1
represents the bill statement amount in September;BILL_AMT2
represents the bill statement amount in August; and so on up toBILL_AMT6
, which represents the bill statement amount in April. -
PAY_AMT1
–PAY_AMT6
: Amount of previous payment (NT dollar).PAY_AMT1
represents the amount paid in September;PAY_AMT2
represents the amount paid in August; and so on up toPAY_AMT6
, which represents the amount paid in April.
Let’s now use the
head()
method in the next step to observe the first few rows of data. By default, this will return the first 5 rows. -
-
Run the following command in the subsequent cell:
df.head()
Here is a portion of the output you should see:
Get hands-on with 1200+ tech skills courses.