forked from hadley/ggplot2-book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
mastery.Rmd
261 lines (181 loc) · 19.2 KB
/
mastery.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
```{r mastery, include = FALSE}
source("common.R")
```
# (PART) The Grammar {-}
# Mastering the grammar {#mastery}
## Introduction
In order to unlock the full power of ggplot2, you'll need to master the underlying grammar. By understanding the grammar, and how its components fit together, you can create a wider range of visualizations, combine multiple sources of data, and customise to your heart's content.
This chapter describes the theoretical basis of ggplot2: the layered grammar of graphics. The layered grammar is based on Wilkinson's grammar of graphics [@wilkinson:2006], but adds a number of enhancements that help it to be more expressive and fit seamlessly into the R environment. The differences between the layered grammar and Wilkinson's grammar are described fully in @wickham:2008. In this chapter you will learn a little bit about each component of the grammar and how they all fit together. The next chapters discuss the components in more detail, and provide more examples of how you can use them in practice. \index{Grammar!theory}
The grammar makes it easier for you to iteratively update a plot, changing a single feature at a time. The grammar is also useful because it suggests the high-level aspects of a plot that *can* be changed, giving you a framework to think about graphics, and hopefully shortening the distance from mind to paper. It also encourages the use of graphics customised to a particular problem, rather than relying on specific chart types.
This chapter begins by describing in detail the process of drawing a simple plot. Section \@ref(simple-plot) starts with a simple scatterplot, then Section \@ref(complex-plot) makes it more complex by adding a smooth line and facetting. While working through these examples you will be introduced to all six components of the grammar, which are then defined more precisely in Section \@ref(components).
## Building a scatterplot {#simple-plot}
How are engine size and fuel economy related? We might create a scatterplot of engine displacement and highway mpg with points coloured by number of cylinders:
`r columns(1, 2/3)`
```{r}
ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
geom_point()
```
You can create plots like this easily, but what is going on underneath the surface? How does ggplot2 draw this plot? \index{Scatterplot!principles of}
### Mapping aesthetics to data
What precisely is a scatterplot? You have seen many before and have probably even drawn some by hand. A scatterplot represents each observation as a point, positioned according to the value of two variables. As well as a horizontal and vertical position, each point also has a size, a colour and a shape. These attributes are called **aesthetics**, and are the properties that can be perceived on the graphic. Each aesthetic can be mapped to a variable, or set to a constant value. In the previous graphic, `displ` is mapped to horizontal position, `hwy` to vertical position and `cyl` to colour. Size and shape are not mapped to variables, but remain at their (constant) default values. \index{Aesthetics!mapping}
Once we have these mappings we can create a new dataset that records this information:
```{r mapping, echo = FALSE}
scatter <- with(mpg, data.frame(x = displ, y = hwy, colour = cyl))
knitr::kable(head(scatter, 8))
```
This new dataset is a result of applying the aesthetic mappings to the original data. We can create many different types of plots using this data. The scatterplot uses points, but were we instead to draw lines we would get a line plot. If we used bars, we'd get a bar plot. Neither of those examples makes sense for this data, but we could still draw them (I've omitted the legends to save space):
`r columns(2, 2/3)`
```{r other-geoms}
ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
geom_line() +
theme(legend.position = "none")
ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
geom_bar(stat = "identity", position = "identity", fill = NA) +
theme(legend.position = "none")
```
In ggplot, we can produce many plots that don't make sense, yet are grammatically valid. This is no different than English, where we can create senseless but grammatical sentences like the angry rock barked like a comma.
Points, lines and bars are all examples of geometric objects, or **geoms**. Geoms determine the ``type'' of the plot. Plots that use a single geom are often given a special name:
|Named plot |Geom |Other features |
|:--------------------|:-------|:-------------------------|
|scatterplot |point | |
|bubblechart |point |size mapped to a variable |
|barchart |bar | |
|box-and-whisker plot |boxplot | |
|line chart |line | |
More complex plots with combinations of multiple geoms don't have a special name, and we have to describe them by hand. For example, this plot overlays a per group regression line on top of a scatterplot:
`r columns(1, 2/3)`
```{r complex-plot}
ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
geom_point() +
geom_smooth(method = "lm")
```
What would you call this plot? Once you've mastered the grammar, you'll find that many of the plots that you produce are uniquely tailored to your problems and will no longer have special names. \index{Named plots}
### Scaling
The values in the previous table have no meaning to the computer. We need to convert them from data units (e.g., litres, miles per gallon and number of cylinders) to graphical units (e.g., pixels and colours) that the computer can display. This conversion process is called **scaling** and performed by scales. Now that these values are meaningful to the computer, they may not be meaningful to us: colours are represented by a six-letter hexadecimal string, sizes by a number and shapes by an integer. These aesthetic specifications that are meaningful to R are described in `vignette("ggplot2-specs")`. \index{Scales!introduction}
In this example, we have three aesthetics that need to be scaled: horizontal position (`x`), vertical position (`y`) and `colour`. Scaling position is easy in this example because we are using the default linear scales. We need only a linear mapping from the range of the data to $[0, 1]$. We use $[0, 1]$ instead of exact pixels because the drawing system that ggplot2 uses, **grid**, takes care of that final conversion for us. A final step determines how the two positions (x and y) are combined to form the final location on the plot. This is done by the coordinate system, or **coord**. In most cases this will be Cartesian coordinates, but it might be polar coordinates, or a spherical projection used for a map.
The process for mapping the colour is a little more complicated, as we have a non-numeric result: colours. However, colours can be thought of as having three components, corresponding to the three types of colour-detecting cells in the human eye. These three cell types give rise to a three-dimensional colour space. Scaling then involves mapping the data values to points in this space. There are many ways to do this, but here since `cyl` is a categorical variable we map values to evenly spaced hues on the colour wheel, as shown in Figure \@ref(fig:colour-wheel). A different mapping is used when the variable is continuous. \index{Colour!wheel}
```{r colour-wheel, echo = FALSE, out.width = "50%", fig.cap="A colour wheel illustrating the choice of five equally spaced colours. This is the default scale for discrete variables."}
knitr::include_graphics("diagrams/colour-wheel.png", dpi = 300)
```
The result of these conversions is below. As well as aesthetics that have been mapped to variable, we also include aesthetics that are constant. We need these so that the aesthetics for each point are completely specified and R can draw the plot. The points will be filled circles (shape 19 in R) with a 1-mm diameter:
```{r scaled, echo = FALSE}
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
p <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
geom_point()
b <- ggplot_build(p)
scaled <- b$data[[1]][c("x", "y", "colour")]
scaled$x <- rescale01(scaled$x)
scaled$y <- rescale01(scaled$y)
scaled$size <- 1
scaled$shape <- 19
knitr::kable(head(scaled, 8), digits = 3, align = "l")
```
Finally, we need to render this data to create the graphical objects that are displayed on the screen. To create a complete plot we need to combine graphical objects from three sources: the *data*, represented by the point geom; the *scales and coordinate system*, which generate axes and legends so that we can read values from the graph; and *plot annotations*, such as the background and plot title.
## Adding complexity {#complex-plot}
With a simple example under our belts, let's now turn to look at this slightly more complicated example:
`r columns(1, 1 / 2, 0.75)`
```{r complex, message = FALSE}
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth() +
facet_wrap(~year)
```
This plot adds three new components to the mix: facets, multiple layers and statistics. The facets and layers expand the data structure described above: each facet panel in each layer has its own dataset. You can think of this as a 3d array: the panels of the facets form a 2d grid, and the layers extend upwards in the 3rd dimension. In this case the data in the layers is the same, but in general we can plot different datasets on different layers.
The smooth layer is different to the point layer because it doesn't display the raw data, but instead displays a statistical transformation of the data. Specifically, the smooth layer fits a smooth line through the middle of the data. This requires an additional step in the process described above: after mapping the data to aesthetics, the data is passed to a statistical transformation, or **stat**, which manipulates the data in some useful way. In this example, the stat fits the data to a loess smoother, and then returns predictions from evenly spaced points within the range of the data. Other useful stats include 1 and 2d binning, group means, quantile regression and contouring.
As well as adding an additional step to summarise the data, we also need some extra steps when we get to the scales. This is because we now have multiple datasets (for the different facets and layers) and we need to make sure that the scales are the same across all of them. Scaling actually occurs in three parts: transforming, training and mapping. We haven't mentioned transformation before, but you have probably seen it before in log-log plots. In a log-log plot, the data values are not linearly mapped to position on the plot, but are first log-transformed.
* Scale transformation occurs before statistical transformation so that
statistics are computed on the scale-transformed data. This ensures that a
plot of $\log(x)$ vs. $\log(y)$ on linear scales looks the same as $x$ vs.
$y$ on log scales. There are many different transformations that can be used,
including taking square roots, logarithms and reciprocals.
See Section \@ref(scale-position) for more details.
* After the statistics are computed, each scale is trained on every dataset
from all the layers and facets. The training operation combines the ranges
of the individual datasets to get the range of the complete data. Without
this step, scales could only make sense locally and we wouldn't be able to
overlay different layers because their positions wouldn't line up. Sometimes
we do want to vary position scales across facets (but never across layers),
and this is described more fully in
Section \@ref(controlling-scales).
* Finally the scales map the data values into aesthetic values. This is a
local operation: the variables in each dataset are mapped to their aesthetic
values, producing a new dataset that can then be rendered by the geoms.
Figure \@ref(fig:schematic) illustrates the complete process schematically.
```{r schematic, echo = FALSE, out.width = "75%", fig.cap="Schematic description of the plot generation process. Each square represents a layer, and this schematic represents a plot with three layers and three panels. All steps work by transforming individual data frames except for training scales, which doesn't affect the data frame and operates across all datasets simultaneously."}
knitr::include_graphics("diagrams/mastery-schema.png", dpi = 300, auto_pdf = TRUE)
```
## Components of the layered grammar {#components}
In the examples above, we have seen some of the components that make up a plot: data and aesthetic mappings, geometric objects (geoms), statistical transformations (stats), scales, and facetting. We have also touched on the coordinate system. One thing we didn't mention is the position adjustment, which deals with overlapping graphic objects. Together, the data, mappings, stat, geom and position adjustment form a **layer**. A plot may have multiple layers, as in the example where we overlaid a smoothed line on a scatterplot. All together, the layered grammar defines a plot as the combination of: \index{Grammar!components}
* A default dataset and set of mappings from variables to aesthetics.
* One or more layers, each composed of a geometric object, a statistical
transformation, a position adjustment, and optionally, a dataset and
aesthetic mappings.
* One scale for each aesthetic mapping.
* A coordinate system.
* The facetting specification.
The following sections describe each of the higher-level components more precisely, and point you to the parts of the book where they are documented.
### Layers {#mastering-layers}
**Layers** are responsible for creating the objects that we perceive on the plot. A layer is composed of five parts:
1. Data
1. Aesthetic mappings.
1. A statistical transformation (stat).
1. A geometric object (geom).
1. A position adjustment.
The properties of a layer are described in Chapter \@ref(layers) and their uses for data visualisation in are outlined in Chapters \@ref(individual-geoms) to \@ref(annotations).
### Scales {#mastering-scales}
A **scale** controls the mapping from data to aesthetic attributes, and we need a scale for every aesthetic used on a plot. Each scale operates across all the data in the plot, ensuring a consistent mapping from data to aesthetics. Some examples are shown in Figure \ref{fig:scale-legends}.
`r columns(1, 1 / 4, 1)`
```{r scale-legends, echo = FALSE, fig.cap = "Examples of legends from four different scales. From left to right: continuous variable mapped to size, and to colour, discrete variable mapped to shape, and to colour. The ordering of scales seems upside-down, but this matches the labelling of the $y$-axis: small values occur at the bottom."}
x <- 1:10
y <- factor(letters[1:5])
draw_legends(
qplot(x, x, size = x),
qplot(x, x, colour = x),
qplot(y, y, shape = y),
qplot(y, y, colour = y)
)
```
A scale is a function and its inverse, along with a set of parameters. For example, the colour gradient scale maps a segment of the real line to a path through a colour space. The parameters of the function define whether the path is linear or curved, which colour space to use (e.g., LUV or RGB), and the colours at the start and end.
The inverse function is used to draw a guide so that you can read values from the graph. Guides are either axes (for position scales) or legends (for everything else). Most mappings have a unique inverse (i.e., the mapping function is one-to-one), but many do not. A unique inverse makes it possible to recover the original data, but this is not always desirable if we want to focus attention on a single aspect.
For more details, see Chapter \@ref(scales).
### Coordinate system {#coordinate-systems}
A coordinate system, or **coord** for short, maps the position of objects onto the plane of the plot. Position is often specified by two coordinates $(x, y)$, but potentially could be three or more (although this is not implemented in ggplot2). The Cartesian coordinate system is the most common coordinate system for two dimensions, while polar coordinates and various map projections are used less frequently.
Coordinate systems affect all position variables simultaneously and differ from scales in that they also change the appearance of the geometric objects. For example, in polar coordinates, bar geoms look like segments of a circle. Additionally, scaling is performed before statistical transformation, while coordinate transformations occur afterward. The consequences of this are shown in Section \@ref(coord-non-linear).
Coordinate systems control how the axes and grid lines are drawn. Figure \ref{fig:coord} illustrates three different types of coordinate systems. Very little advice is available for drawing these for non-Cartesian coordinate systems, so a lot of work needs to be done to produce polished output. See Chapter \@ref(coord) for more details.
`r columns(3)`
```{r coord, echo = FALSE, fig.cap = "Examples of axes and grid lines for three coordinate systems: Cartesian, semi-log and polar. The polar coordinate system illustrates the difficulties associated with non-Cartesian coordinates: it is hard to draw the axes well."}
df <- data.frame(x1 = c(1, 10), y1 = c(1, 5))
p <- ggplot(df, aes(x1, y1)) +
scale_x_continuous(NULL) +
scale_y_continuous(NULL) +
theme_linedraw()
p
p + coord_trans(y = "log10")
p + coord_polar()
```
### Facetting {#intro-facetting}
There is also another thing that turns out to be sufficiently useful that we should include it in our general framework: facetting, a general case of conditioned or trellised plots. This makes it easy to create small multiples, each showing a different subset of the whole dataset. This is a powerful tool when investigating whether patterns hold across all conditions. The facetting specification describes which variables should be used to split up the data, and whether position scales should be free or constrained. Facetting is described in Section \@ref(position).
## Exercises
1. One of the best ways to get a handle on how the grammar works is to apply
it to the analysis of existing graphics. For each of the graphics listed
below, write down the components of the graphic. Don't worry if you don't
know what the corresponding functions in ggplot2 are called (or if they even
exist!), instead focussing on recording the key elements of a plot so you
could communicate it to someone else.
1. "Napoleon's march" by Charles John Minard:
<http://www.datavis.ca/gallery/re-minard.php>
1. "Where the Heat and the Thunder Hit Their Shots", by Jeremy White,
Joe Ward, and Matthew Ericson at The New York Times.
<http://nyti.ms/1duzTvY>
1. "London Cycle Hire Journeys", by James Cheshire. <http://bit.ly/1S2cyRy>
1. The Pew Research Center's favorite data visualizations of 2014:
<http://pewrsr.ch/1KZSSN6>
1. "The Tony's Have Never Been so Dominated by Women", by
Joanna Kao at FiveThirtyEight: <http://53eig.ht/1cJRCyG>.
1. "In Climbing Income Ladder, Location Matters" by the Mike Bostock,
Shan Carter, Amanda Cox, Matthew Ericson, Josh Keller, Alicia
Parlapiano, Kevin Quealy and Josh Williams at the New York Times:
<http://nyti.ms/1S2dJQT>
1. "Dissecting a Trailer: The Parts of the Film That Make the Cut",
by Shan Carter, Amanda Cox, and Mike Bostock at the New York Times:
<http://nyti.ms/1KTJQOE>