Exercise: Continuing Verification of Data Integrity

Learn to verify the integrity of data using Boolean masks.

Data integrity verification using Boolean masks

In this exercise, we will use our knowledge of Boolean arrays to examine some of the duplicate IDs we discovered. In Exercise: Verifying Basic Data Integrity, we learned that no ID appears more than twice. We can use this knowledge to locate the duplicate IDs and examine them, and then remove rows of dubious quality from the dataset. Perform the following steps to complete the exercise:
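Before working with the course dataset, the core mechanic can be sketched on a toy Series (the values here are made up purely for illustration):

```python
import pandas as pd

# Toy stand-in for an ID column (not the course dataset)
ids = pd.Series(['a', 'b', 'b', 'c', 'c', 'd'])

# value_counts() returns counts indexed by the unique values
counts = ids.value_counts()

# Comparing a Series to a scalar produces a Boolean mask
mask = counts == 2

# The mask selects entries of the counts' index: the duplicated values
print(mask.sum())                 # how many values appear exactly twice
print(sorted(counts.index[mask])) # the duplicated values themselves
```

The same pattern, applied to the ID column of the real dataset, is what the steps below walk through.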

  1. Continuing where we left off in the previous exercise, we need to find the locations in the id_counts Series where the count is 2, in order to locate the duplicates. First, we load the data and compute the value counts of the IDs to bring us back to where we left off. Then we create a Boolean mask locating the duplicated IDs, store it in a variable called dupe_mask, and display its first five elements. Use the following commands:

    import pandas as pd

    # Load the dataset (filename split across two strings for line length)
    df = pd.read_excel('default_of_credit_card_clients'\
    '__courseware_version_1_21_19.xls')
     
    # Count how many times each ID appears
    id_counts = df['ID'].value_counts()
    id_counts.head()
        
    # Boolean mask: True where an ID appears exactly twice
    dupe_mask = id_counts == 2
    dupe_mask[0:5]
    

    You will obtain the following output (note that the ordering of IDs may differ in your output, since value_counts sorts by frequency, not by ID):

    # 52bcd5ae-72d3    True
    # f5e3478e-cf68    True
    # 5deff6b6-62ff    True
    # cb18af1f-3b53    True
    # ac821a7b-b399    True
    # Name: ID, dtype: bool
    

    Note that in the preceding output, we are displaying only the first five entries of dupe_mask to illustrate the contents of this array. You can edit the integer indices in the square brackets ([]) to change the number of entries displayed in the output.

    Our next step is to use this logical mask to select the IDs that are duplicated. The IDs themselves are contained in the index of the id_counts Series. We can access the index in order to use our logical mask for selection purposes.

  2. Access the index of id_counts and display the first five rows as context using the following command:

    id_counts.index[0:5]
    

    With this, you will obtain the following output:

    # Index(['52bcd5ae-72d3', 'f5e3478e-cf68', '5deff6b6-62ff', 'cb18af1f-3b53', 'ac821a7b-b399'], dtype='object')
    
  3. Select and store the duplicated IDs in a new variable called dupe_ids using the following command:

    dupe_ids = id_counts.index[dupe_mask]
    
  4. Convert dupe_ids to a list and then obtain the length of the list using the following commands:

    dupe_ids = list(dupe_ids)
    len(dupe_ids)
    

    You should obtain the following output:

    # 313
    

    We changed the dupe_ids variable to a list, as we will need it in this form for future steps. The list has a length of 313, as can be seen in the preceding output, which matches our knowledge of the number of duplicate IDs from the value count.

  5. Verify the data in dupe_ids by displaying the first five entries using the following command:

    dupe_ids[0:5]
    

    We obtain the following output:

    # ['52bcd5ae-72d3', 
    #  'f5e3478e-cf68',
    #  '5deff6b6-62ff',
    #  'cb18af1f-3b53',
    #  'ac821a7b-b399']
    

    We can observe from the preceding output that the list contains the required entries of duplicate IDs. We’re now in a position to examine the data for the IDs in our list of duplicates. In particular, we’d like to look at the values of the features, to see what, if anything, might be different between these duplicate entries. We will use the isin and loc methods of the DataFrame df for this purpose.

    Using the first three IDs in our list of duplicates, dupe_ids[0:3], we will first find the rows containing these IDs. If we pass this list of IDs to the isin method of the ID Series, it creates another Boolean mask that we can use on the larger DataFrame to display the rows that have these IDs. The isin method is nested in a loc statement indexing the DataFrame, which selects the location of all rows containing True in the Boolean mask. The second argument of the loc indexing statement is :, meaning that all columns will be selected. In other words, we are filtering the DataFrame to view all the columns for the first three duplicate IDs.

  6. Run the following command in your notebook to execute the plan we formulated in the previous step:

    df.loc[df['ID'].isin(dupe_ids[0:3]), :]
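    The whole exercise can also be sketched end to end on a small synthetic DataFrame (the IDs and the single feature column here are invented for illustration; they are not from the course dataset):

```python
import pandas as pd

# Synthetic data: id2 and id3 each appear twice, mimicking duplicate IDs
df = pd.DataFrame({
    'ID': ['id1', 'id2', 'id2', 'id3', 'id3', 'id4'],
    'FEATURE_1': [1000, 2000, 0, 3000, 0, 4000],
})

# Step 1: count IDs and build the Boolean mask for count == 2
id_counts = df['ID'].value_counts()
dupe_mask = id_counts == 2

# Steps 2-4: use the mask on the counts' index to extract the duplicate IDs
dupe_ids = list(id_counts.index[dupe_mask])

# Step 6: filter all columns for rows whose ID is in the duplicate list
dupe_rows = df.loc[df['ID'].isin(dupe_ids), :]
print(dupe_rows)
```

    Running this on the real dataset simply swaps the synthetic DataFrame for the one loaded from the Excel file.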
    
