Filling In Data

A. Finding the missing values

We previously noted that both the 'CPI' and 'Unemployment' features contain 585 missing values. We’ll find the row indexes containing these missing values by first converting the feature columns in the na_values boolean DataFrame to integers, i.e. 0 and 1.

We then use the nonzero function to find the locations of the 1’s, which correspond to the True values.

Press + to interact

print(na_indexes[0])  # first missing value row index
print()

first_missing_row = merged_features.iloc[169]
print(first_missing_row[['Date','CPI','Unemployment']])
print()

final_val_row = merged_features.iloc[168]
print(final_val_row[['Date','CPI','Unemployment']])
print()

cpi_final_val = merged_features.at[168, 'CPI']
une_final_val = merged_features.at[168, 'Unemployment']
merged_features.at[169, 'CPI'] = cpi_final_val
merged_features.at[169, 'Unemployment'] = une_final_val

new_row = merged_features.iloc[169]
print(new_row[['Date','CPI','Unemployment']])
print()

C. Imputation variants

Filling in missing data with substituted values is known as imputation. The method we used for our dataset replaced missing values with values from closely related observations. However, there are many other forms of imputation, such as replacing the feature’s missing data with the feature’s mean, median, or mode.

There are also more advanced imputation methods such as k-Nearest Neighbors (filling in missing values based on similarity scores from the kNN algorithm) and MICE (applying multiple chained imputations, assuming the missing values are randomly distributed across observations).

In most industry cases these advanced methods are not required, since the data is either perfectly cleaned or the missing values are scarce. Nevertheless, the advanced methods could be useful when dealing with open source datasets, since these tend to be more incomplete.

Time to Code!

In this chapter you’ll be completing the impute_data function. You’ll use final non-missing value to fill in the missing feature values for 'CPI' and 'Unemployment' (as demonstrated in the chapter example).

The na_indexes_cpi object represents an array containing the row indexes of the 'CPI' feature’s missing values.

Since these indexes are in ascending sorted order, we can iterate through them and set the missing values equal to the previous row’s value in merged_features.

Create a for loop that iterates through na_indexes_cpi using variable i.

To ensure that we’re updating the actual merged_features DataFrame, and not just a copy of the 'CPI' column, we need to use the at function.

Inside the for loop, set the value at row i in the 'CPI' column of merged_features equal to the value at row i - 1 in the 'CPI' column. Make sure to use the at function for both the setting and getting operations.

We’ll now perform the same steps to fill in the missing data for the 'Unemployment' column.

Repeat the same process for the 'Unemployment' column using the na_indexes_une array.

Press + to interact

What you'll learn from this course

Preliminary Data Analysis

Data Processing

Model Predictions

Chapter Goals:

A. Finding the missing values

B. Filling in the values

C. Imputation variants

Time to Code!