The Grammar of Graphics

Look at the basics of the grammar of graphics and how to get started with visualization in R.

We start with a discussion of a theoretical framework for data visualization known as the grammar of graphics. This framework serves as the foundation for the ggplot2 package, which we’ll use extensively in this chapter. Think of how we construct sentences in English by combining different elements, like nouns, verbs, articles, subjects, and objects. We can’t combine these elements in an arbitrary order. We must combine them following a set of rules known as linguistic grammar. Similarly, the grammar of graphics defines a set of rules for constructing statistical graphics by combining different types of layers. This grammar was created by Leland Wilkinson (Wilkinson, 2005) and has been implemented in various data visualization software platforms like R.

Components of grammar

In short, the grammar tells us that a statistical graphic is a mapping of data variables to aesthetic attributes of geometric objects.

Specifically, we can break a graph into the following three essential components:

  • data: This is the dataset that contains the variables of interest.

  • geom: This is the geometric object in question, and refers to the type of object we can observe in a plot. Examples include points, lines, and bars.

  • aes: These are the aesthetic attributes of the geometric object. Examples include x/y position, color, shape, and size. Aesthetic attributes are mapped to variables in the dataset.
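
The three components above can be sketched in a few lines of R. This is only a minimal illustration, assuming the ggplot2 package is installed; the toy data frame and its column names (day, sales) are invented for this example and aren't part of the lesson's data.

```r
library(ggplot2)

# Hypothetical toy data, invented purely for illustration
toy <- data.frame(
  day   = c(1, 2, 3, 4),
  sales = c(10, 14, 9, 17)
)

# data + aes: the data frame, with day mapped to x and sales mapped to y
base <- ggplot(data = toy, mapping = aes(x = day, y = sales))

# geom: the same data and mapping drawn with two different geometric objects
points_plot <- base + geom_point()
lines_plot  <- base + geom_line()

points_plot
lines_plot
```

Only the geom changes between the two plots. The data and the aesthetic mapping stay the same, which is what makes the components independent building blocks.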

Example: The Gapminder data

In February 2006, a Swedish physician and data advocate named Hans Rosling gave a TED Talk titled “The best stats you’ve ever seen.” In this TED Talk, he presented global economic, health, and development data from the website gapminder.org. For example, the data for 2007 covers 142 countries. The following table shows the first three of them as a peek into the data:

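Gapminder 2007 Data: First 3 of 142 Countries

| Country     | Continent | Life Expectancy | Population | GDP Per Capita |
|-------------|-----------|-----------------|------------|----------------|
| Afghanistan | Asia      | 43.8            | 31,889,923 | 975            |
| Albania     | Europe    | 76.4            | 3,600,523  | 5,937          |
| Algeria     | Africa    | 72.3            | 33,333,216 | 6,223          |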

Each row in this table corresponds to a country in 2007. For each row, we have five columns:

  • Country: This is the name of the country.

  • Continent: This is the continent, one of five, that the country is part of. We should note that “Americas” here includes countries in both North and South America, and that Antarctica is excluded.

  • Life expectancy: This is the life expectancy in years.

  • Population: This is the number of people living in the country.

  • GDP per capita: This is the gross domestic product per person (in US dollars).
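
One way to inspect this data in R is sketched below. It assumes the gapminder R package is installed, which bundles an excerpt of the Gapminder data. The column names used here (country, continent, year, lifeExp, pop, gdpPercap) come from that package, and the course may load the data differently.

```r
library(gapminder)

# Keep only the 2007 observations and the five columns described above
gapminder_2007 <- subset(
  gapminder,
  year == 2007,
  select = c(country, continent, lifeExp, pop, gdpPercap)
)

nrow(gapminder_2007)     # one row per country in 2007
head(gapminder_2007, 3)  # the first three rows, as in the table
```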

Now consider the following figure, where the data above is plotted for all 142 countries.

Figure: Life expectancy over GDP per capita in 2007

Let’s view this plot through the grammar of graphics. Four data variables are mapped to aesthetics of the points:

  • GDP per capita gets mapped to the x-position aesthetic of the points.

  • Life expectancy gets mapped to the y-position aesthetic of the points.

  • Population gets mapped to the size aesthetic of the points.

  • Continent gets mapped to the color aesthetic of the points.

We’ll shortly see that data corresponds to the particular data frame where our data is saved, and that data variables correspond to particular columns in the data frame. Furthermore, the type of geometric object considered in this plot is points. While in this example we’re considering points, graphics aren’t just limited to points. We can also use lines, bars, and other geometric objects.

Summary of the Grammar of Graphics for this Plot

| Data Variable   | aes   | geom  |
|-----------------|-------|-------|
| GDP per capita  | x     | Point |
| Life expectancy | y     | Point |
| Population      | Size  | Point |
| Continent       | Color | Point |
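
The mapping summarized above could be written in ggplot2 roughly as follows. This is a sketch rather than the exact code behind the figure: it assumes the gapminder R package as the data source, and the column names gdpPercap, lifeExp, pop, and continent come from that package.

```r
library(ggplot2)
library(gapminder)

# Assumed data source: the 2007 rows of the gapminder package's data frame
gapminder_2007 <- subset(gapminder, year == 2007)

# One aesthetic per data variable, all drawn with the point geom
life_exp_plot <- ggplot(
  data = gapminder_2007,
  mapping = aes(x = gdpPercap, y = lifeExp, size = pop, color = continent)
) +
  geom_point()

life_exp_plot
```

Each row of the summary corresponds to one argument inside aes(), and the geom column corresponds to the single geom_point() layer.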

Other components

There are other components of the grammar of graphics we can control as well. As we explore the grammar of graphics, we’ll start to encounter these topics more frequently. In this course, we’ll keep things simple and only work with these two additional components:

  • Faceting: This breaks up a plot into several plots split by the values of another variable.

  • Position adjustments: These control how bars are placed relative to each other in bar plots.
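
Both components can be previewed with a short sketch. It assumes the gapminder R package as the data source. The above_70 column is a hypothetical grouping variable created only for this illustration.

```r
library(ggplot2)
library(gapminder)

gapminder_2007 <- subset(gapminder, year == 2007)

# Faceting: one scatterplot panel per continent
faceted_plot <- ggplot(
  data = gapminder_2007,
  mapping = aes(x = gdpPercap, y = lifeExp)
) +
  geom_point() +
  facet_wrap(~ continent)

# Hypothetical grouping variable, invented for this illustration only
gapminder_2007$above_70 <- gapminder_2007$lifeExp > 70

# Position adjustment: bars placed side by side ("dodge") instead of stacked
dodged_bars <- ggplot(
  data = gapminder_2007,
  mapping = aes(x = continent, fill = above_70)
) +
  geom_bar(position = "dodge")

faceted_plot
dodged_bars
```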

The ggplot2 package

In this course, we’ll use the ggplot2 package for data visualization, which is an implementation of the grammar of graphics for R (Wickham et al., 2019a). Various components of the grammar of graphics are specified in the ggplot() function included in the ggplot2 package. For the purposes of this course, we’ll always provide the ggplot() function with the following arguments (i.e., inputs) at a minimum:

  • The data argument: This is the data frame where the variables exist.

  • The mapping argument: This is the mapping of the variables to aesthetic attributes, specified with the aes() function.

After we’ve specified these components, we add layers to the plot using the + sign. The most essential layer to add to a plot is the layer that specifies which type of geometric object we want the plot to involve. These geometric objects can include points, lines, bars, and others. Other layers that we can add to a plot include the plot title, axis labels, visual themes for the plots, and facets.
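
The layering described above might look like the following sketch. It assumes the gapminder R package as the data source; the title and axis label text are illustrative choices, not taken from the course.

```r
library(ggplot2)
library(gapminder)

gapminder_2007 <- subset(gapminder, year == 2007)

# Minimum inputs to ggplot(): the data frame and the aesthetic mapping
base_plot <- ggplot(
  data = gapminder_2007,
  mapping = aes(x = gdpPercap, y = lifeExp)
)

# Layers are added with +, which must end a line rather than start the next one
layered_plot <- base_plot +
  geom_point() +                         # geometric object layer
  labs(
    title = "Life expectancy over GDP per capita in 2007",
    x = "GDP per capita (US dollars)",
    y = "Life expectancy (years)"
  ) +                                    # plot title and axis labels
  theme_minimal()                        # visual theme

layered_plot
```

Printing base_plot on its own draws only empty axes, because no geometric object layer has been added to it yet.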