The Response Variable and Concluding the Initial Exploration

Learn about binary classification and how to prepare data for it.

We'll cover the following

Binary classification and proportions of classes
Methods for dealing with imbalanced data
Data exploration: A continuous process for successful projects
Try it yourself

We have now examined all the features, both to check for missing data and to get a general sense of their contents. The features matter because they are the inputs to our machine learning algorithm. On the other side of the model lies the output, which is a prediction of the response variable. For our problem, this is a binary flag indicating whether or not a credit account will default next month.

Binary classification and proportions of classes

The key task for the case study project is to come up with a predictive model for this target. Because the response variable is a yes/no flag, this problem is called a binary classification task. In our labeled data, the samples (accounts) that defaulted (that is, 'default payment next month' = 1) are said to belong to the positive class, while those that didn’t default belong to the negative class.

The main piece of information to examine regarding the response of a binary classification problem is this: what is the proportion of the positive class? This is an easy check.

Before we perform this check, we load the packages we need with the following code:

import numpy as np  # numerical computation
import pandas as pd  # data wrangling
import matplotlib.pyplot as plt  # plotting package
# Next line helps with rendering plots in a Jupyter notebook
%matplotlib inline
import matplotlib as mpl  # additional plotting functionality
mpl.rcParams['figure.dpi'] = 400  # high-resolution figures

Now we load the cleaned version of the case study data like this:

df = pd.read_csv('Chapter_1_cleaned_data.csv')

To find the proportion of the positive class, all we need to do is take the average of the response variable over the whole dataset. Because the response is coded as 0 and 1, this average equals the fraction of accounts that defaulted, so it can be interpreted as the default rate. It’s also worthwhile to check the number of samples in each class, using groupby and count in pandas.
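The check described above can be sketched as follows. This is a minimal illustration, not necessarily the lesson's exact code: the small DataFrame is made up so that the example runs on its own, the 'ID' column is an assumed identifier column, and class_balance is a hypothetical helper name. With the case study data, you would pass df in place of toy_df.

```python
import pandas as pd

RESPONSE = 'default payment next month'  # response column named in the text


def class_balance(data, response=RESPONSE, id_col='ID'):
    """Return the positive-class fraction and the sample count per class.

    id_col is an assumed identifier column; any column without missing
    values would serve equally well for counting rows.
    """
    # The mean of a 0/1 column is the fraction of ones (the default rate)
    positive_fraction = data[response].mean()
    # Number of samples in each class of the response variable
    class_counts = data.groupby(response)[id_col].count()
    return positive_fraction, class_counts


# Made-up stand-in for the case study data (hypothetical values)
toy_df = pd.DataFrame({
    'ID': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8'],
    RESPONSE: [0, 0, 0, 1, 0, 1, 0, 0],
})

positive_fraction, class_counts = class_balance(toy_df)
print(positive_fraction)
print(class_counts)
```

In this toy frame, two of the eight accounts belong to the positive class, so the mean is 0.25 and the counts are 6 and 2. The numbers for the real dataset will differ.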
