Boxplots

Explore boxplots and how to plot them in R.

Faceted histograms are one way to compare the distribution of a numerical variable across the values of another variable. A side-by-side boxplot is another. A boxplot is constructed from the five-number summary of a numerical variable.
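As a rough illustration of a side-by-side boxplot, the sketch below uses base R's `boxplot()` with a formula. The data frame `df`, its columns `month` and `temp`, and the simulated values are all assumptions made for this example; they are not the lesson's dataset, and the lesson may well use a different plotting package.

```r
# Simulated temperatures for three months (hypothetical data, not the lesson's)
set.seed(1)
df <- data.frame(
  month = factor(rep(c("Sep", "Oct", "Nov"), each = 50),
                 levels = c("Sep", "Oct", "Nov")),
  temp  = c(rnorm(50, mean = 68, sd = 6),
            rnorm(50, mean = 58, sd = 7),
            rnorm(50, mean = 45, sd = 8))
)

# One box per level of `month`, drawn side by side
bp <- boxplot(temp ~ month, data = df,
              xlab = "Month", ylab = "Temperature")

# boxplot() invisibly returns the numbers it drew: one column per group,
# rows are lower whisker end, lower hinge, median, upper hinge, upper whisker end
bp$stats
```

Placing the boxes next to each other on a shared vertical axis is what makes the medians and spreads of the groups directly comparable.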

Five-number summary

The five-number summary consists of five summary statistics: the minimum, the first quartile (25th percentile), the second quartile (median or 50th percentile), the third quartile (75th percentile), and the maximum.
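In R, one way to obtain these five values is the base function `fivenum()`. The vector `temps` below is a made-up sample used only for illustration.

```r
# Hypothetical sample of nine temperature readings (not from the lesson's data)
temps <- c(35, 38, 41, 43, 45, 47, 52, 56, 60)

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(temps)
names(five) <- c("min", "Q1", "median", "Q3", "max")
five
```

`summary(temps)` reports similar values plus the mean, although it computes the quartiles with `quantile()`, which can differ slightly from the hinges returned by `fivenum()`.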

The quartiles are calculated as:

  • The first quartile ($Q_1$): The median of the first half of the sorted data

  • The third quartile ($Q_3$): The median of the second half of the sorted data
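The "median of each half" rule can be sketched directly in R. The example assumes an even number of observations so that the data splits cleanly into two halves; with an odd count, conventions differ on whether the overall median belongs to both halves. The vector `x` is hypothetical.

```r
# Hypothetical data with an even number of values, sorted first
x <- sort(c(7, 1, 9, 3, 12, 5, 15, 10))
n <- length(x)

lower_half <- x[1:(n / 2)]        # smallest n/2 values
upper_half <- x[(n / 2 + 1):n]    # largest n/2 values

q1 <- median(lower_half)  # first quartile
q3 <- median(upper_half)  # third quartile
c(Q1 = q1, Q3 = q3)
```

For this sample the result agrees with the hinges from `fivenum(x)`. R's `quantile()` uses an interpolation method by default and may return slightly different quartiles.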

The interquartile range (IQR) is defined as $Q_3 - Q_1$ and is a measure of how spread out the middle 50% of values are. The IQR corresponds to the length of the box in a boxplot.
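A minimal sketch of the IQR in R, computed both by subtracting quartiles and with the built-in `IQR()` function. The vector `x` is a hypothetical sample, and the quartiles here come from `quantile()` with its default interpolation method, which may differ slightly from the "median of each half" rule.

```r
# Hypothetical sample
x <- c(1, 3, 5, 7, 9, 10, 12, 15)

# First and third quartiles using quantile()'s default method
q <- quantile(x, probs = c(0.25, 0.75))

iqr_manual  <- unname(q[2] - q[1])  # Q3 - Q1 by hand
iqr_builtin <- IQR(x)               # same quantity via the built-in function

c(manual = iqr_manual, builtin = iqr_builtin)
```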

Unlike the mean and standard deviation, the median and the IQR are barely affected by the presence of outliers. In other words, the median and IQR are more robust to outliers, and they are therefore the recommended summaries for skewed datasets.
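To see this robustness in action, the sketch below compares the four statistics on a small made-up sample before and after adding one extreme value. The vectors `clean` and `with_outlier` and the helper `summarize_spread()` are hypothetical names introduced for this example.

```r
# Hypothetical readings, then the same readings plus one extreme value
clean        <- c(40, 42, 44, 45, 46, 48, 50)
with_outlier <- c(clean, 120)

# Helper (hypothetical) collecting the four statistics of interest
summarize_spread <- function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v), IQR = IQR(v))
}

# One row per sample, one column per statistic
comparison <- rbind(clean        = summarize_spread(clean),
                    with_outlier = summarize_spread(with_outlier))
comparison
```

The mean and standard deviation shift substantially once the extreme value is included, while the median and IQR move only a little.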

To keep things simple for now, let’s only consider the 2,141 hourly temperature recordings for the month of November, each represented as a jittered point in the following figure.
