
Use pandas to obtain statistical metrics for data.

Chapter Goals:

  • Understand the common metrics used to summarize numeric data
  • Learn how to describe categorical data using frequency counts

A. Numeric metrics

When working with numeric features, we usually want to calculate metrics such as the mean and standard deviation. These metrics summarize the center and spread of each feature, which helps guide our overall analysis of the dataset.

Rather than calculating several different metrics separately, pandas provides the describe function to obtain a summary of a DataFrame's numeric data.

The metrics included in the output summary of describe are:

Metric   Description
count    The number of non-null values in a feature
mean     The mean value for a feature
std      The standard deviation for a feature
min      The minimum value in a feature
25%      The 25th percentile of a feature
50%      The 50th percentile of a feature (identical to the median)
75%      The 75th percentile of a feature
max      The maximum value in a feature
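
To make the rows of the summary concrete, the sketch below compares a few of them against the same metrics computed one at a time. The DataFrame here is hypothetical sample data made up for illustration, not the chapter's dataset; it includes one missing value to show that count reflects non-null entries rather than the total number of rows.

```python
import pandas as pd

# Hypothetical sample data (not the chapter's df); HR has one missing value.
df = pd.DataFrame({
    "HR": [10, 20, 30, 40, float("nan")],
    "RBI": [50, 60, 80, 90, 100],
})

summary = df.describe()
hr = df["HR"]

# Each row of the summary corresponds to a metric we could compute separately.
print(summary.loc["count", "HR"], hr.count())        # non-null values only
print(summary.loc["mean", "HR"], hr.mean())
print(summary.loc["std", "HR"], hr.std())
print(summary.loc["25%", "HR"], hr.quantile(0.25))
print(summary.loc["50%", "HR"], hr.median())
print(len(df))                                       # total rows, including the missing one
```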

The code below shows how to use the describe function.

# df is predefined
metrics1 = df.describe()
hr_rbi = df[['HR','RBI']]
metrics2 = hr_rbi.describe()

Using describe with a DataFrame will return a summary of metrics for each of the DataFrame's numeric features. In our example, df had three features with numerical values: yearID, HR, and RBI.

Since we normally treat yearID as a categorical feature, the second time we used describe was with the hr_rbi DataFrame, which only included the HR and RBI features.
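
As a rough illustration of the behavior described above, the following sketch builds a small stand-in DataFrame (the rows and ID values are invented for this example) and checks which columns appear in each summary. With default arguments and at least one numeric column present, describe leaves the string column out.

```python
import pandas as pd

# Hypothetical stand-in for the chapter's df; values are made up.
df = pd.DataFrame({
    "playerID": ["p1", "p2", "p3"],
    "yearID": [2001, 2001, 2002],
    "HR": [5, 12, 30],
    "RBI": [20, 45, 90],
})

# Summary of every numeric column; playerID (strings) is skipped.
metrics_all = df.describe()
print(list(metrics_all.columns))

# Restrict the summary to the columns we treat as numeric features.
hr_rbi = df[["HR", "RBI"]]
metrics_hr_rbi = hr_rbi.describe()
print(list(metrics_hr_rbi.columns))
```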

To have describe return specific percentiles, we can use the percentiles keyword argument. The percentiles argument takes in a list of decimals between 0 and 1, representing the percentiles we want returned in the summary (e.g. .1 for the 10th percentile).

metrics1 = hr_rbi.describe(percentiles=[.5])
metrics2 = hr_rbi.describe(percentiles=[.1])
metrics3 = hr_rbi.describe(percentiles=[.2,.8])

Note that the 50th percentile, i.e. the median, is always returned. The values specified in the percentiles list will replace the default 25th and 75th percentiles.
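
To see how the custom percentiles show up in the result, here is a minimal sketch using made-up numbers in place of the chapter's hr_rbi data. It prints the row labels of the summary and cross-checks one percentile against the quantile method.

```python
import pandas as pd

# Hypothetical values standing in for the chapter's hr_rbi DataFrame.
hr_rbi = pd.DataFrame({
    "HR": [0, 5, 10, 20, 40],
    "RBI": [3, 20, 40, 70, 110],
})

low_high = hr_rbi.describe(percentiles=[0.2, 0.8])

# The requested percentiles appear as row labels in place of 25% and 75%.
print(low_high.index.tolist())

# A percentile row matches the quantile computed directly on the column.
print(low_high.loc["20%", "HR"], hr_rbi["HR"].quantile(0.2))
```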

B. Categorical features

With categorical features, we don't calculate metrics like mean, standard deviation, etc. Instead, we use frequency counts to describe a categorical feature.

The frequency count for a specific category of a feature refers to how many times that category appears in the dataset. In pandas, we use the value_counts function to obtain the frequency counts for each category in a column.

The code below uses the value_counts function to get frequency counts of the 'playerID' feature.

p_ids = df['playerID']
counts = p_ids.value_counts()

Using value_counts without any keyword arguments will return the frequency counts for each category, sorted in descending order.

Setting normalize=True returns the frequency proportions, rather than counts, for each category (note that the sum of all the proportions is 1). We can also set ascending=True to get the frequencies sorted in ascending order.
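
The three variations described above can be sketched on a small, made-up Series. The player IDs below are placeholders invented for the example rather than values from the chapter's dataset.

```python
import pandas as pd

# Hypothetical player IDs; "p1" appears 3 times, "p2" twice, "p3" once.
p_ids = pd.Series(["p1", "p2", "p1", "p3", "p1", "p2"], name="playerID")

# Default: raw counts, most frequent category first.
counts = p_ids.value_counts()

# normalize=True: proportions instead of counts.
proportions = p_ids.value_counts(normalize=True)

# ascending=True: least frequent category first.
ascending = p_ids.value_counts(ascending=True)

print(counts)
print(proportions)
print(proportions.sum())
print(ascending)
```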

If we just want the names of each unique category in a column, rather than the frequencies, we use the unique function.

unique_players = df['playerID'].unique()
unique_teams = df['teamID'].unique()
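
For a quick feel of what unique returns, the sketch below runs it on invented data (the IDs are placeholders). The categories come back in order of first appearance rather than sorted, and the related nunique function gives just the number of distinct categories.

```python
import pandas as pd

# Hypothetical data; the team and player IDs are made up.
df = pd.DataFrame({
    "playerID": ["p1", "p2", "p1", "p3"],
    "teamID": ["BBB", "AAA", "BBB", "BBB"],
})

# Array of distinct categories, in order of first appearance.
unique_teams = df["teamID"].unique()
print(list(unique_teams))

# Number of distinct categories, without listing them.
n_players = df["playerID"].nunique()
print(n_players)
```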

So far we've focused on categorical features with string values. However, categorical features can also have integer values. For example, we can use yearID as a categorical feature with each unique year as a separate category.

y_ids = df['yearID']
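
As one possible way to treat an integer column as categorical, the sketch below applies value_counts to a made-up yearID Series. Sorting the result with sort_index is an optional extra step, shown here as an assumption about what a reader might want (years in chronological order) rather than something the chapter requires.

```python
import pandas as pd

# Hypothetical years; 2001 appears 3 times, 2002 twice, 2003 once.
y_ids = pd.Series([2001, 2002, 2001, 2003, 2001, 2002], name="yearID")

# Each distinct year is treated as its own category.
year_counts = y_ids.value_counts()
print(year_counts)

# Optional: order the categories by year instead of by frequency.
by_year = year_counts.sort_index()
print(by_year)
```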

Time to Code!

The coding exercises for this chapter involve getting metrics from a DataFrame of MLB players, player_df.

First, we'll get a summary of all the statistics in player_df.

Set summary_all equal to player_df.describe with no arguments.

Next, we want to get summaries specifically for the home run totals. The first summary will contain the default metrics from describe, while the second summary will contain the 10th and 90th percentiles.

Set hr_df equal to player_df directly indexed with 'HR'.

Set summary_hr equal to hr_df.describe with no arguments.

Set low_high_10 equal to hr_df.describe with [.1,.9] as the percentiles keyword argument.

Finally, we'll treat the 'HR' feature as a categorical variable, with each unique home run total as a separate category. We then get the frequency counts for each category.

Set hr_counts equal to hr_df.value_counts with no arguments.
