Exercise: Verifying Basic Data Integrity

Learn how to verify the integrity of the data using basic pandas functions.

We'll cover the following

Data consistency assessment
Try it yourself

In this exercise, we will perform a basic check on whether our dataset contains what we expect and verify whether there is the correct number of samples.

Data consistency assessment

The data is supposed to have observations for 30,000 credit accounts. While there are 30,000 rows, we should also check whether there are 30,000 unique account IDs. It’s possible that, if the SQL query used to generate the data was run on an unfamiliar schema, values that are supposed to be unique are in fact not unique. To examine this, we can check if the number of unique account IDs is the same as the number of rows. Perform the following steps to complete the exercise:

Import pandas, load the data, and examine the column names by running the following command in a cell, using “Shift + Enter”:
```
import pandas as pd
   
df = pd.read_excel('default_of_credit_card_clients'\
'__courseware_version_1_21_19.xls')
   
df.columns
```
The columns method of the DataFrame is employed to examine all the column names. You will obtain the following output once you run the cell:
```
Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'default payment next month'], dtype='object')
```
We can see that all column names are listed in the output. The account ID column is referenced as ID. The remaining columns appear to be our features, with the last column being the response variable. Let’s quickly review the dataset information that was given to us by the client:
- LIMIT_BAL: Amount of credit provided (in New Taiwanese (NT) dollar) including individual consumer credit and the family (supplementary) credit.
- SEX: Gender (1 = male; 2 = female)
  
  > Note: We will not be using the gender data to decide credit worthiness owing to ethical considerations.
- EDUCATION: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
- MARRIAGE: Marital status (1 = married; 2 = single; 3 = others).
- AGE: Age (year).
- PAY_1–PAY_6: A record of past payments. Past monthly payments, recorded from April to September, are stored in these columns. PAY_1 represents the repayment status in September; PAY_2 is the repayment status in August; and so on up to PAY_6, which represents the repayment status in April.
  
  The measurement scale for the repayment status is as follows: -1 = pay duly; 1 = payment delay for 1 month; 2 = payment delay for 2 months; and so on up to 8 = payment delay for 8 months; 9 = payment delay for 9 months and above.
- BILL_AMT1–BILL_AMT6: Bill statement amount (in NT dollar). BILL_AMT1 represents the bill statement amount in September; BILL_AMT2 represents the bill statement amount in August; and so on up to BILL_AMT6, which represents the bill statement amount in April.
- PAY_AMT1–PAY_AMT6: Amount of previous payment (NT dollar). PAY_AMT1 represents the amount paid in September; PAY_AMT2 represents the amount paid in August; and so on up to PAY_AMT6, which represents the amount paid in April.
Let’s now use the head() method in the next step to observe the first few rows of data. By default, this will return the first 5 rows.
Run the following command in the subsequent cell:
```
df.head()
```
Here is a portion of the output you should see:

Get hands-on with 1400+ tech skills courses.

Introduction

Data Exploration and Cleaning

(Challenge) Exploring Remaining Financial Features in Dataset

Introduction to scikit-learn and Model Evaluation

Fake News Detection Using Scikit-learn

(Challenge) Logistic Regression and Precision-Recall Curve

Details of Logistic Regression and Feature Extraction

(Challenge) Logistic Regression Model and Coefficients

The Bias-Variance Trade-Off

(Challenge) Cross-Validation and Feature Engineering

Decision Trees and Random Forests

(Challenge) Cross-Validation Grid Search with Random Forest

Gradient Boosting, XGBoost, and SHAP Values

(Challenge) XGBoost and SHAP Explanation for Case Study Data

Predict Frog Toxicity with Python and XGBoost

Test Set Analysis, Financial Insights, and Delivery to the Client

(Challenge) Deriving Financial Insights

Appendix

Exercise: Verifying Basic Data Integrity

Data consistency assessment