Exercise: Continuing Verification of Data Integrity
Learn to verify the integrity of the data using boolean masks.
We'll cover the following
Data integrity verification using Boolean masks
In this exercise, with our knowledge of Boolean arrays, we will examine some of the duplicate IDs we discovered. In Exercise: Verifying Basic Data Integrity, we learned that no ID appears more than twice. We can use this learning to locate duplicate IDs and examine them. Then we take action to remove rows of dubious quality from the dataset. Perform the following steps to complete the exercise:
-
Continuing where we left off in previous exercise, we need to get the locations of the
id_counts
Series, where the count is 2, to locate the duplicates. First, we load the data and get the value counts of IDs to bring us to where we left off in the last exercise lesson, then we create a Boolean mask locating the duplicated IDs with a variable calleddupe_mask
and display the first five elements. Use the following commands:import pandas as pd df = pd.read_excel('default_of_credit_card_clients'\ '__courseware_version_1_21_19.xls') id_counts = df['ID'].value_counts() id_counts.head() dupe_mask = id_counts == 2 dupe_mask[0:5]
You will obtain the following output (note the ordering of IDs may be different in your output, as
value_counts
sorts on frequency, not the index of IDs):# 52bcd5ae-72d3 True # f5e3478e-cf68 True # 5deff6b6-62ff True # cb18af1f-3b53 True # ac821a7b-b399 True # Name: ID, dtype: bool
Note that in the preceding output, we are displaying only the first five entries using
dupe_mask
to illustrate the contents of this array. You can edit the integer indices in the square brackets ([]
) to change the number of entries displayed in the output.Our next step is to use this logical mask to select the IDs that are duplicated. The IDs themselves are contained as the index of the
id_count
Series. We can access the index in order to use our logical mask for selection purposes. -
Access the index of
id_count
and display the first five rows as context using the following command:id_counts.index[0:5]
With this, you will obtain the following output:
# Index(['52bcd5ae-72d3', 'f5e3478e-cf68', '5deff6b6-62ff', 'cb18af1f-3b53', 'ac821a7b-b399'], dtype='object')
-
Select and store the duplicated IDs in a new variable called
dupe_ids
using the following command:dupe_ids = id_counts.index[dupe_mask]
-
Convert
dupe_ids
to a list and then obtain the length of the list using the following commands:dupe_ids = list(dupe_ids) len(dupe_ids)
You should obtain the following output:
# 313
We changed the
dupe_ids
variable to alist
, as we will need it in this form for future steps. The list has a length of313
, as can be seen in the preceding output, which matches our knowledge of the number of duplicate IDs from the value count. -
We verify the data in
dupe_ids
by displaying the first five entries using the following command:dupe_ids[0:5]
We obtain the following output:
# ['52bcd5ae-72d3', # 'f5e3478e-cf68', # '5deff6b6-62ff', # 'cb18af1f-3b53', # 'ac821a7b-b399']
We can observe from the preceding output that the list contains the required entries of duplicate IDs. We’re now in a position to examine the data for the IDs in our list of duplicates. In particular, we’d like to look at the values of the features, to see what, if anything, might be different between these duplicate entries. We will use the
isin
andloc
methods of the DataFramedf
for this purpose.Using the first three IDs on our list of dupes,
dupe_ids[0:3]
, we will plan to first find the rows containing these IDs. If we pass this list of IDs to theisin
method of the ID Series, this will create another logical mask we can use on the larger DataFrame to display the rows that have these IDs. Theisin
method is nested in aloc
statement indexing the DataFrame in order to select the location of all rows containingTrue
in the Boolean mask. The second argument of theloc
indexing statement is:
, which implies that all columns will be selected. By performing the following steps, we are essentially filtering the DataFrame in order to view all the columns for the first three duplicate IDs. -
Run the following command in your notebook to execute the plan we formulated in the previous step:
df.loc[df['ID'].isin(dupe_ids[0:3]),:]
Get hands-on with 1200+ tech skills courses.