Filtering

Filter DataFrames for values that fit certain conditions.

Chapter Goals:

  • Understand how to filter a DataFrame based on filter conditions
  • Write code to filter a dataset of MLB statistics

A. Filter conditions

In the Data Manipulation section, we used relation operations on NumPy arrays to create filter conditions. These filter conditions returned boolean arrays, which represented the locations of the elements that pass the filter.

In pandas, we can also create filter conditions for DataFrames. Specifically, we can use relation operations on a DataFrame's column features, which will return a boolean Series representing the DataFrame rows that pass the filter.

The code below demonstrates how to use relation operations as filter conditions.

Press + to interact
df = pd.DataFrame({
'playerID': ['bettsmo01', 'canoro01', 'cruzne02', 'ortizda01', 'cruzne02'],
'yearID': [2016, 2016, 2016, 2016, 2017],
'teamID': ['BOS', 'SEA', 'SEA', 'BOS', 'SEA'],
'HR': [31, 39, 43, 38, 39]})
print('{}\n'.format(df))
cruzne02 = df['playerID'] == 'cruzne02'
print('{}\n'.format(cruzne02))
hr40 = df['HR'] > 40
print('{}\n'.format(hr40))
notbos = df['teamID'] != 'BOS'
print('{}\n'.format(notbos))

In the code above, we created filter conditions for df based on the columns labeled 'playerID', 'HR', and 'teamID'. The boolean Series outputs have True for the rows that pass the filter, and False for the rows that don't.

B. Filters from functions

Apart from relation operations, pandas provides various functions for creating specific filter conditions. For columns with string values, we can use str.startswith, str.endswith, and str.contains to filter for specific strings. These functions work the exact same as their namesakes from the Python standard library.

The code below shows various examples of string filter conditions. In the final example using str.contains, we prepend the ~ operation, which negates the filter condition. This means our final filter condition checked for player IDs that do not contain 'o'.

Press + to interact
df = pd.DataFrame({
'playerID': ['bettsmo01', 'canoro01', 'cruzne02', 'ortizda01', 'cruzne02'],
'yearID': [2016, 2016, 2016, 2016, 2017],
'teamID': ['BOS', 'SEA', 'SEA', 'BOS', 'SEA'],
'HR': [31, 39, 43, 38, 39]})
print('{}\n'.format(df))
str_f1 = df['playerID'].str.startswith('c')
print('{}\n'.format(str_f1))
str_f2 = df['teamID'].str.endswith('S')
print('{}\n'.format(str_f2))
str_f3 = ~df['playerID'].str.contains('o')
print('{}\n'.format(str_f3))

We can also create filter conditions that check for values in a specific set, by using the isin function. The function only takes in one argument, which is a list of values that we want to filter for.

The code below demonstrates how to use the isin function for filter conditions.

Press + to interact
df = pd.DataFrame({
'playerID': ['bettsmo01', 'canoro01', 'cruzne02', 'ortizda01', 'cruzne02'],
'yearID': [2016, 2016, 2016, 2016, 2017],
'teamID': ['BOS', 'SEA', 'SEA', 'BOS', 'SEA'],
'HR': [31, 39, 43, 38, 39]})
print('{}\n'.format(df))
isin_f1 = df['playerID'].isin(['cruzne02',
'ortizda01'])
print('{}\n'.format(isin_f1))
isin_f2 = df['yearID'].isin([2015, 2017])
print('{}\n'.format(isin_f2))

In pandas, when a Series or DataFrame has a missing value at a location, it is represented by NaN. The NaN value in pandas is equivalent to np.nan in NumPy.

Similar to Numpy, we cannot use a relation operation to create a filter condition for NaN values. Instead, we use the isna and notna functions.

Press + to interact
df = pd.DataFrame({
'playerID': ['bettsmo01', 'canoro01', 'doejo01'],
'yearID': [2016, 2016, 2017],
'teamID': ['BOS', 'SEA', np.nan],
'HR': [31, 39, 99]})
print('{}\n'.format(df))
isna = df['teamID'].isna()
print('{}\n'.format(isna))
notna = df['teamID'].notna()
print('{}\n'.format(notna))

The isna function returns True in the locations that contain NaN and False in the locations that don't, while the notna function does the opposite.

C. Feature filtering

It is really easy to filter a DataFrame's rows based on filter conditions. Similar to direct indexing of a DataFrame, we use square brackets. However, the inside of the square brackets will now contain a filter condition.

When applying filter conditions within square brackets, we retrieve the rows of the DataFrame that pass the filter condition (i.e. the rows for which the filter condition is True).

The code below shows how to filter using square brackets and filter conditions.

Press + to interact
df = pd.DataFrame({
'playerID': ['bettsmo01', 'canoro01', 'cruzne02', 'ortizda01', 'bettsmo01'],
'yearID': [2016, 2016, 2016, 2016, 2015],
'teamID': ['BOS', 'SEA', 'SEA', 'BOS', 'BOS'],
'HR': [31, 39, 43, 38, 18]})
print('{}\n'.format(df))
hr40_df = df[df['HR'] > 40]
print('{}\n'.format(hr40_df))
not_hr30_df = df[~(df['HR'] > 30)]
print('{}\n'.format(not_hr30_df))
str_df = df[df['teamID'].str.startswith('B')]
print('{}\n'.format(str_df))

Time to Code!

In this chapter's code exercises, we'll apply various filters to a predefined DataFrame, mlb_df, which contains MLB statistics.

We'll first filter mlb_df for the top MLB hitting seasons in history, which we define as having a batting average above .300.

Set top_hitters equal to mlb_df[] applied with mlb_df['BA'] > .300 as the filter condition.

Press + to interact
# CODE HERE

Next we filter for the players whose player ID does not start with the letter a.

Set exclude_a equal to mlb_df[] applied with the negation of mlb_df['playerID'].str.startswith('a') as the filter condition.

Press + to interact
# CODE HERE

We'll now retrieve the statistics for two specific players. Their player IDs are 'bondsba01' and 'troutmi01'.

Set two_ids equal to a list containing the two specified player IDs.

Set two_players equal to mlb_df[] applied with mlb_df['playerID'].isin(two_ids) as the filter condition.

Press + to interact
# CODE HERE

Get hands-on with 1300+ tech skills courses.