```{r echo = FALSE, message = FALSE}
# run setup script
source("_common.R")
library(tidyr)
library(ggridges) # for geom_density_line()
```
# Visualizing distributions: Histograms and density plots {#histograms-density-plots}
We frequently encounter the situation where we would like to understand how a particular variable is distributed in a dataset. To give a concrete example, we will consider the passengers of the Titanic, a data set we encountered already in Chapter \@ref(visualizing-amounts). There were approximately 1300 passengers on the Titanic (not counting crew), and we have reported ages for 756 of them. We might want to know how many passengers of what ages there were on the Titanic, i.e., how many children, young adults, middle-aged people, seniors, and so on. We call the relative proportions of different ages among the passengers the *age distribution* of the passengers.
## Visualizing a single distribution
We can obtain a sense of the age distribution among the passengers by grouping all passengers into bins with comparable ages and then counting the number of passengers in each bin. This procedure results in a table such as Table \@ref(tab:titanic-ages).
```{r titanic-ages}
titanic <- titanic_all
age_counts <- hist(titanic$age, breaks = (0:15) * 5 + .01, plot = FALSE)$counts
age_hist <- data.frame(
`age range` = c("0--5", "6--10", "11--15", "16--20", "21--25", "26--30", "31--35", "36--40", "41--45", "46--50", "51--55", "56--60", "61--65", "66--70", "71--75"),
count = age_counts,
check.names = FALSE
)
age_hist_display <- rename(
age_hist,
`Age range` = `age range`,
Count = count
)
knitr::kable(
list(
age_hist_display[1:6,], age_hist_display[7:12,], age_hist_display[13:15,]
),
caption = 'Numbers of passengers with known age on the Titanic.', booktabs = TRUE,
row.names = FALSE
)
```
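The binning step described above can be sketched in a few lines of base R. This is a minimal illustration, not the code used for the table: the `ages` vector is hypothetical and merely stands in for the passenger ages, and the names `breaks`, `bins`, and `age_table` are chosen for this example.

```r
# Hypothetical ages, for illustration only (not the Titanic data).
ages <- c(2, 4, 7, 19, 22, 23, 24, 28, 31, 35, 36, 41, 47, 58, 63)

# Bin edges every 5 years; right = TRUE gives bins (0,5], (5,10], ...
breaks <- seq(0, 75, by = 5)
bins <- cut(ages, breaks = breaks, right = TRUE)

# Count how many ages fall into each bin, keeping empty bins.
age_counts <- as.vector(table(bins))

age_table <- data.frame(
  age_range = levels(bins),
  count = age_counts
)
print(age_table)
```

Because `cut()` returns a factor with one level per bin, `table()` reports empty bins as zero counts, which is what we want when the counts are later drawn as bars of equal width.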
We can visualize this table by drawing filled rectangles whose heights correspond to the counts and whose widths correspond to the width of the age bins (Figure \@ref(fig:titanic-ages-hist1)). Such a visualization is called a histogram. (Note that all bins must have the same width for the visualization to be a valid histogram.)
(ref:titanic-ages-hist1) Histogram of the ages of Titanic passengers.
```{r titanic-ages-hist1, fig.cap='(ref:titanic-ages-hist1)'}
age_hist <- cbind(age_hist, age = (1:15) * 5 - 2.5)
h1 <- ggplot(age_hist, aes(x = age, y = count)) + geom_col(width = 4.7, fill = "#56B4E9") +
scale_y_continuous(expand = c(0, 0), breaks = 25 * (0:5)) +
scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid() +
theme(
axis.line.x = element_blank(),
plot.margin = margin(3, 7, 3, 1.5)
)
h1
```
Because histograms are generated by binning the data, their exact visual appearance depends on the choice of the bin width. Most visualization programs that generate histograms will choose a bin width by default, but chances are that bin width is not the most appropriate one for any histogram you may want to make. It is therefore critical to always try different bin widths to verify that the resulting histogram reflects the underlying data accurately. In general, if the bin width is too small, then the histogram becomes overly peaky and visually busy and the main trends in the data may be obscured. On the other hand, if the bin width is too large, then smaller features in the distribution of the data, such as the dip around age 10, may disappear.
For the age distribution of Titanic passengers, we can see that a bin width of one year is too small and a bin width of fifteen years is too large, whereas bin widths of three to five years work fine (Figure \@ref(fig:titanic-ages-hist-grid)).
(ref:titanic-ages-hist-grid) Histograms depend on the chosen bin width. Here, the same age distribution of Titanic passengers is shown with four different bin widths: (a) one year; (b) three years; (c) five years; (d) fifteen years.
```{r titanic-ages-hist-grid, fig.width=5.5*6/4.2, fig.cap='(ref:titanic-ages-hist-grid)'}
age_hist_1 <- data.frame(
age = (1:75) - 0.5,
count = hist(titanic$age, breaks = (0:75) + .01, plot = FALSE)$counts
)
age_hist_3 <- data.frame(
age = (1:25) * 3 - 1.5,
count = hist(titanic$age, breaks = (0:25) * 3 + .01, plot = FALSE)$counts
)
age_hist_15 <- data.frame(
age = (1:5) * 15 - 7.5,
count = hist(titanic$age, breaks = (0:5) * 15 + .01, plot = FALSE)$counts
)
h2 <- ggplot(age_hist_1, aes(x = age, y = count)) +
geom_col(width = .85, fill = "#56B4E9") +
scale_y_continuous(expand = c(0, 0), breaks = 10 * (0:5)) +
scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid(12) +
theme(
axis.line.x = element_blank(),
plot.margin = margin(3, 1.5, 3, 1.5)
)
h3 <- ggplot(age_hist_3, aes(x = age, y = count)) + geom_col(width = 2.75, fill = "#56B4E9") +
scale_y_continuous(expand = c(0, 0), breaks = 25 * (0:5)) +
scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid(12) +
theme(
axis.line.x = element_blank(),
plot.margin = margin(3, 1.5, 3, 1.5)
)
h4 <- ggplot(age_hist_15, aes(x = age, y = count)) + geom_col(width = 14.5, fill = "#56B4E9") +
scale_y_continuous(expand = c(0, 0), breaks = 100 * (0:4)) +
scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid(12) +
theme(
axis.line.x = element_blank(),
plot.margin = margin(3, 1.5, 3, 1.5)
)
plot_grid(
h2, NULL, h3,
NULL, NULL, NULL,
h1 + theme_dviz_hgrid(12) + theme(axis.line.x = element_blank(), plot.margin = margin(3, 1.5, 3, 1.5)),
NULL, h4,
align = 'hv',
labels = c("a", "", "b", "", "", "", "c", "", "d"),
rel_widths = c(1, .04, 1),
rel_heights = c(1, .04, 1)
)
```
```{block type='rmdtip', echo=TRUE}
When making a histogram, always explore multiple bin widths.
```
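One way to explore bin widths before settling on a figure is to compute the counts for several widths and compare them. The following sketch uses base R's `hist()` with `plot = FALSE`; the simulated `ages` and the helper `count_bins()` are assumptions made for this example and are not part of the chapter's own code.

```r
set.seed(1)

# Simulated ages (an assumption; these stand in for a real data set).
# Values are clamped so that every age lies inside the bin range 0 to 75.
ages <- pmin(pmax(rnorm(500, mean = 30, sd = 14), 0.5), 74.5)

# Hypothetical helper: bin counts for a given bin width.
count_bins <- function(x, width, upper = 75) {
  hist(x, breaks = seq(0, upper, by = width), plot = FALSE)$counts
}

counts_1  <- count_bins(ages, 1)   # 75 narrow bins, noisy counts
counts_5  <- count_bins(ages, 5)   # 15 bins
counts_15 <- count_bins(ages, 15)  # 5 wide bins, fine detail is lost

# Number of bins for each choice of width.
c(length(counts_1), length(counts_5), length(counts_15))
```

Every choice of width accounts for the same 500 observations; only the way they are grouped changes, and with it the visual impression.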
Histograms have been a popular visualization option since at least the 18th century, in part because they are easily generated by hand. More recently, as extensive computing power has become available in everyday devices such as laptops and cell phones, we see them increasingly being replaced by density plots. In a density plot, we attempt to visualize the underlying probability distribution of the data by drawing an appropriate continuous curve (Figure \@ref(fig:titanic-ages-dens1)). This curve needs to be estimated from the data, and the most commonly used method for this estimation procedure is called *kernel density estimation.* In kernel density estimation, we draw a continuous curve (the kernel) with a small width (controlled by a parameter called *bandwidth*) at the location of each data point, and then we add up all these curves to obtain the final density estimate. The most widely used kernel is a Gaussian kernel (i.e., a Gaussian bell curve), but there are many other choices.
(ref:titanic-ages-dens1) Kernel density estimate of the age distribution of passengers on the Titanic. The height of the curve is scaled such that the area under the curve equals one. The density estimate was performed with a Gaussian kernel and a bandwidth of 2.
```{r titanic-ages-dens1, fig.cap='(ref:titanic-ages-dens1)'}
ggplot(titanic, aes(x = age)) +
geom_density_line(fill = "#56B4E9", color = darken("#56B4E9", 0.5), bw = 2, kernel = "gaussian") +
scale_y_continuous(limits = c(0, 0.046), expand = c(0, 0), name = "density") +
scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid() +
theme(
axis.line.x = element_blank(),
plot.margin = margin(3, 7, 3, 1.5)
)
```
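To make the kernel density procedure concrete, the following sketch builds a Gaussian kernel density estimate by hand and compares it with the result of R's built-in `density()` function. The `ages` vector is hypothetical, and the evaluation grid is an arbitrary choice for this example. Note that the summed kernels are divided by the number of data points (here via `mean()`), so that the area under the final curve equals one.

```r
# Hypothetical ages, for illustration only.
ages <- c(2, 4, 7, 19, 22, 23, 24, 28, 31, 35, 36, 41, 47, 58, 63)
bw <- 2  # bandwidth = standard deviation of each Gaussian kernel

# Points at which we evaluate the density estimate.
grid <- seq(-10, 80, length.out = 512)

# Place one Gaussian bump at every data point and average the bumps.
kde_manual <- sapply(grid, function(g) mean(dnorm(g, mean = ages, sd = bw)))

# Built-in estimate on the same grid, for comparison.
kde_builtin <- density(
  ages, bw = bw, kernel = "gaussian",
  from = min(grid), to = max(grid), n = length(grid)
)

# Largest pointwise difference between the two estimates.
max_diff <- max(abs(kde_manual - kde_builtin$y))

# Area under the hand-built curve (simple Riemann sum).
area <- sum(kde_manual) * (grid[2] - grid[1])
```

The two estimates should agree closely but not exactly, because `density()` uses a fast binned approximation rather than summing the kernels directly.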
Just as is the case with histograms, the exact visual appearance of a density plot depends on the kernel and bandwidth choices (Figure \@ref(fig:titanic-ages-dens-grid)). The bandwidth parameter behaves similarly to the bin width in histograms. If the bandwidth is too small, then the density estimate can become overly peaky and visually busy and the main trends in the data may be obscured. On the other hand, if the bandwidth is too large, then smaller features in the distribution of the data may disappear. In addition, the choice of the kernel affects the shape of the density curve. For example, a Gaussian kernel will have a tendency to produce density estimates that look Gaussian-like, with smooth features and tails. By contrast, a rectangular kernel can generate the appearance of steps in the density curve (Figure \@ref(fig:titanic-ages-dens-grid)d). In general, the more data points there are in the data set, the less the choice of the kernel matters. Therefore, density plots tend to be quite reliable and informative for large data sets but can be misleading for data sets of only a few points.
(ref:titanic-ages-dens-grid) Kernel density estimates depend on the chosen kernel and bandwidth. Here, the same age distribution of Titanic passengers is shown for four different combinations of these parameters: (a) Gaussian kernel, bandwidth = 0.5; (b) Gaussian kernel, bandwidth = 2; (c) Gaussian kernel, bandwidth = 5; (d) Rectangular kernel, bandwidth = 2.
```{r titanic-ages-dens-grid, fig.width=5.5*6/4.2, fig.cap='(ref:titanic-ages-dens-grid)'}
pdens1 <- ggplot(titanic, aes(x = age)) +
geom_density_line(fill = "#56B4E9", color = darken("#56B4E9", 0.5), bw = .5, kernel = "gaussian") +
scale_y_continuous(limits = c(0, 0.046), expand = c(0, 0), name = "density") +
scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid(12) +
theme(
axis.line.x = element_blank(),
plot.margin = margin(3, 1.5, 3, 1.5)
)
pdens2 <- ggplot(titanic, aes(x = age)) +
geom_density_line(fill = "#56B4E9", color = darken("#56B4E9", 0.5), bw = 2, kernel = "gaussian") +
scale_y_continuous(limits = c(0, 0.046), expand = c(0, 0), name = "density") +
scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid(12) +
theme(
axis.line.x = element_blank(),
plot.margin = margin(3, 1.5, 3, 1.5)
)
pdens3 <- ggplot(titanic, aes(x = age)) +
geom_density_line(fill = "#56B4E9", color = darken("#56B4E9", 0.5), bw = 5, kernel = "gaussian") +
scale_y_continuous(limits = c(0, 0.046), expand = c(0, 0), name = "density") +
scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid(12) +
theme(
axis.line.x = element_blank(),
plot.margin = margin(3, 1.5, 3, 1.5)
)
pdens4 <- ggplot(titanic, aes(x = age)) +
geom_density_line(fill = "#56B4E9", color = darken("#56B4E9", 0.5), bw = 2, kernel = "rectangular") +
scale_y_continuous(limits = c(0, 0.046), expand = c(0, 0), name = "density") +
scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid(12) +
theme(
axis.line.x = element_blank(),
plot.margin = margin(3, 1.5, 3, 1.5)
)
plot_grid(
pdens1, NULL, pdens2,
NULL, NULL, NULL,
pdens3, NULL, pdens4,
align = 'hv',
labels = c("a", "", "b", "", "", "", "c", "", "d"),
rel_widths = c(1, .04, 1),
rel_heights = c(1, .04, 1)
)
```
Density curves are usually scaled such that the area under the curve equals one. This convention can make the *y* axis scale confusing, because it depends on the units of the *x* axis. For example, in the case of the age distribution, the data range on the *x* axis goes from 0 to approximately 75. Therefore, we expect the mean height of the density curve to be about 1/75, or roughly 0.013. Indeed, when looking at the age density curves (e.g., Figure \@ref(fig:titanic-ages-dens-grid)), we see that the *y* values range from 0 to approximately 0.04, with an average of somewhere close to 0.01.
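The dependence of the *y* axis on the units of the *x* axis can be checked numerically. The sketch below uses simulated, uniformly distributed ages (an assumption made purely for illustration) and base R's `density()`. It verifies that the area under the curve is close to one, that the mean height over the data range is close to 1/75, and that expressing the same ages in months instead of years lowers the curve by a factor of about 12.

```r
set.seed(42)

# Simulated ages in years, uniform between 0 and 75 (illustration only).
ages <- runif(1000, min = 0, max = 75)

d <- density(ages, bw = 2, from = -10, to = 85, n = 1024)

# Area under the curve via the trapezoid rule.
dx <- diff(d$x)
area <- sum(dx * (head(d$y, -1) + tail(d$y, -1)) / 2)

# Mean height of the curve over the data range 0 to 75.
inside <- d$x >= 0 & d$x <= 75
mean_height <- mean(d$y[inside])

# Same data expressed in months; the bandwidth is rescaled accordingly.
d_months <- density(ages * 12, bw = 2 * 12, from = -120, to = 1020, n = 1024)

# Ratio of peak heights: density in years vs. density in months.
height_ratio <- max(d$y) / max(d_months$y)
```

The curve for ages in months is wider by a factor of 12, so it has to be lower by the same factor for the area to remain one.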
Kernel density estimates have one pitfall that we need to be aware of: They have a tendency to produce the appearance of data where none exists, in particular in the tails. As a consequence, careless use of density estimates can easily lead to figures that make nonsensical statements. For example, if we don't pay attention, we might generate a visualization of an age distribution that includes negative ages (Figure \@ref(fig:titanic-ages-dens-negative)).
(ref:titanic-ages-dens-negative) Kernel density estimates can extend the tails of the distribution into areas where no data exist and no data are even possible. Here, the density estimate has been allowed to extend into the negative age range. This is clearly nonsensical and should be avoided.
```{r titanic-ages-dens-negative, fig.cap='(ref:titanic-ages-dens-negative)'}
pdens_neg <- ggplot(titanic, aes(x = age)) +
geom_density_line(fill = "#56B4E9", color = darken("#56B4E9", 0.5)) +
scale_y_continuous(limits = c(0, 0.046), expand = c(0, 0), name = "density") +
scale_x_continuous(name = "age (years)", limits = c(-10, 79), expand = c(0, 0)) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid() +
theme(
axis.line.x = element_blank(),
plot.margin = margin(3, 7, 3, 1.5)
)
stamp_wrong(pdens_neg)
```
```{block type='rmdtip', echo=TRUE}
Always verify that your density estimate does not predict the existence of nonsensical data values.
```
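The following sketch shows how such a check might look in base R, together with one possible remedy. The data are simulated and hypothetical. The remedy shown, reflecting the data at the boundary, is one common boundary correction; it is not the method used for the figures in this chapter, and other corrections exist.

```r
set.seed(7)

# Simulated nonnegative ages with many values near zero (illustration only).
ages <- pmin(rexp(300, rate = 1 / 20), 75)
bw <- 2

# By default, density() extends the curve 3 bandwidths beyond the data range.
d_default <- density(ages, bw = bw)
lowest_x <- min(d_default$x)  # negative: the curve reaches below age zero

# Probability mass that the estimate assigns to negative ages.
step <- d_default$x[2] - d_default$x[1]
mass_below_zero <- sum(d_default$y[d_default$x < 0]) * step

# Reflection at zero: estimate the density of the data together with its
# mirror image, keep only x >= 0, and double the height.
d_ref <- density(c(ages, -ages), bw = bw, from = 0, to = max(ages) + 3 * bw, n = 512)
y_ref <- 2 * d_ref$y

# Area under the corrected curve via the trapezoid rule.
area_ref <- sum(diff(d_ref$x) * (head(y_ref, -1) + tail(y_ref, -1)) / 2)
```

Simply cutting off the curve at zero would discard the mass below zero, leaving an area smaller than one. The reflected estimate keeps the area close to one, at the cost of forcing the curve to be flat right at the boundary.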
So should you use a histogram or a density plot to visualize a distribution? Heated discussions can be had on this topic. Some people are vehemently against density plots and believe that they are arbitrary and misleading. Others realize that histograms can be just as arbitrary and misleading. I think the choice is largely a matter of taste, but sometimes one or the other option may more accurately reflect the specific features of interest in the data at hand. There is also the possibility of using neither and instead choosing empirical cumulative distribution functions or q-q plots (Chapter \@ref(ecdf-qq)). Finally, I believe that density estimates have an inherent advantage over histograms as soon as we want to visualize more than one distribution at a time (see next section).
## Visualizing multiple distributions at the same time {#multiple-histograms-densities}
In many scenarios we have multiple distributions we would like to visualize simultaneously. For example, let's say we'd like to see how the ages of Titanic passengers are distributed between men and women. Were men and women passengers generally of the same age, or was there an age difference between the genders? One commonly employed visualization strategy in this case is a stacked histogram, where we draw the histogram bars for women on top of the bars for men, in a different color (Figure \@ref(fig:titanic-age-stacked-hist)).
(ref:titanic-age-stacked-hist) Histogram of the ages of Titanic passengers stratified by gender. This figure has been labeled as "bad" because stacked histograms are easily confused with overlapping histograms (see also Figure \@ref(fig:titanic-age-overlapping-hist)). In addition, the heights of the bars representing female passengers cannot easily be compared to each other.
```{r titanic-age-stacked-hist, fig.cap='(ref:titanic-age-stacked-hist)'}
data.frame(
age = (1:25)*3 - 1.5,
male = hist(filter(titanic, sex == "male")$age, breaks = (0:25)*3 + .01, plot = FALSE)$counts,
female = hist(filter(titanic, sex == "female")$age, breaks = (0:25)*3 + .01, plot = FALSE)$counts
) %>%
gather(gender, count, -age) -> gender_counts
gender_counts$gender <- factor(gender_counts$gender, levels = c("female", "male"))
p_hist_stacked <- ggplot(gender_counts, aes(x = age, y = count, fill = gender)) +
geom_col(position = "stack") +
scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
scale_y_continuous(limits = c(0, 89), expand = c(0, 0), name = "count") +
scale_fill_manual(values = c("#D55E00", "#0072B2")) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid() +
theme(
axis.line.x = element_blank(),
legend.position = c(.9, .87),
legend.justification = c("right", "top"),
legend.box.background = element_rect(fill = "white", color = "white"),
plot.margin = margin(3, 7, 3, 1.5)
)
stamp_bad(p_hist_stacked)
```
In my opinion, this type of visualization should be avoided. There are two key problems here: First, from just looking at the figure, it is never entirely clear where exactly the bars begin. Do they start where the color changes or are they meant to start at zero? In other words, are there about 25 females of age 18--20 or are there almost 80? (The former is the case.) Second, the bar heights for the female counts cannot be directly compared to each other, because the bars all start at a different height. For example, the men were on average older than the women, and this fact is not at all visible in Figure \@ref(fig:titanic-age-stacked-hist).
We could try to address these problems by having all bars start at zero and making the bars partially transparent (Figure \@ref(fig:titanic-age-overlapping-hist)).
(ref:titanic-age-overlapping-hist) Age distributions of male and female Titanic passengers, shown as two overlapping histograms. This figure has been labeled as "bad" because there is no clear visual indication that all blue bars start at a count of 0.
```{r titanic-age-overlapping-hist, fig.cap='(ref:titanic-age-overlapping-hist)'}
p_hist_overlapped <- ggplot(gender_counts, aes(x = age, y = count, fill = gender)) +
geom_col(position = "identity", alpha = 0.7) +
scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
scale_y_continuous(limits = c(0, 56), expand = c(0, 0), name = "count") +
scale_fill_manual(
values = c("#D55E00", "#0072B2"),
guide = guide_legend(reverse = TRUE)
) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid() +
theme(
axis.line.x = element_blank(),
legend.position = c(.9, .87),
legend.justification = c("right", "top"),
legend.box.background = element_rect(fill = "white", color = "white"),
plot.margin = margin(3, 7, 3, 1.5)
)
stamp_bad(p_hist_overlapped)
```
However, this approach generates new problems. Now it appears that there are actually three different groups, not just two, and we're still not entirely sure where each bar starts and ends. Overlapping histograms don't work well because a semi-transparent bar drawn on top of another tends to not look like a semi-transparent bar but instead like a bar drawn in a different color.
Overlapping density plots don't typically have the problem that overlapping histograms have, because the continuous density lines help the eye keep the distributions separate. However, for this particular dataset, the age distributions for male and female passengers are nearly identical up to around age 17 and then diverge, so that the resulting visualization is still not ideal (Figure \@ref(fig:titanic-age-overlapping-dens)).
(ref:titanic-age-overlapping-dens) Density estimates of the ages of male and female Titanic passengers. To highlight that there were more male than female passengers, the density curves were scaled such that the area under each curve corresponds to the total number of male and female passengers with known age (468 and 288, respectively).
```{r titanic-age-overlapping-dens, fig.cap='(ref:titanic-age-overlapping-dens)'}
titanic2 <- titanic
titanic2$sex <- factor(titanic2$sex, levels = c("male", "female"))
ggplot(titanic2, aes(x = age, y = ..count.., fill = sex, color = sex)) +
geom_density_line(bw = 2, alpha = 0.7) +
scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
scale_y_continuous(limits = c(0, 19), expand = c(0, 0), name = "scaled density") +
scale_fill_manual(values = c("#0072B2", "#D55E00"), name = "gender") +
scale_color_manual(values = darken(c("#0072B2", "#D55E00"), 0.5), name = "gender") +
guides(fill = guide_legend(override.aes = list(linetype = 0))) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid() +
theme(
axis.line.x = element_blank(),
legend.position = c(.9, .87),
legend.justification = c("right", "top"),
legend.box.background = element_rect(fill = "white", color = "white"),
plot.margin = margin(3, 7, 3, 1.5)
)
```
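The scaling used in this figure amounts to multiplying each density curve by the number of observations in its group, so that the area under each curve equals the group size rather than one. A minimal base R sketch follows. The group sizes 468 and 288 are taken from the caption above, but the ages themselves are simulated, and the helper functions `scaled_density()` and `area_under()` are hypothetical names introduced for this example.

```r
set.seed(3)

# Simulated ages (illustration only), clamped to a plausible range.
ages_male   <- pmin(pmax(rnorm(468, mean = 31, sd = 14), 1), 74)
ages_female <- pmin(pmax(rnorm(288, mean = 29, sd = 14), 1), 74)

# Hypothetical helper: density curve multiplied by the group size.
scaled_density <- function(x, bw = 2) {
  d <- density(x, bw = bw, from = -10, to = 85, n = 512)
  data.frame(age = d$x, scaled = d$y * length(x))
}

# Hypothetical helper: area under a curve via the trapezoid rule.
area_under <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

male   <- scaled_density(ages_male)
female <- scaled_density(ages_female)

area_male   <- area_under(male$age, male$scaled)      # close to 468
area_female <- area_under(female$age, female$scaled)  # close to 288
```

With this scaling, the relative heights of the two curves reflect how many passengers of each gender there were at a given age, not just the shape of each distribution.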
A solution that works well for this dataset is to show the age distributions of male and female passengers separately, each as a proportion of the overall age distribution (Figure \@ref(fig:titanic-age-fractional-dens)). This visualization shows intuitively and clearly that there were many fewer women than men in the 20--50-year age range on the Titanic.
(ref:titanic-age-fractional-dens) Age distributions of male and female Titanic passengers, shown as proportion of the passenger total. The colored areas show the density estimates of the ages of male and female passengers, respectively, and the gray areas show the overall passenger age distribution.
```{r titanic-age-fractional-dens, fig.width = 5.5*6/4.2, fig.asp = .45, fig.cap='(ref:titanic-age-fractional-dens)'}
ggplot(titanic2, aes(x = age, y = ..count..)) +
geom_density_line(
data = select(titanic, -sex), aes(fill = "all passengers"),
color = "transparent"
) +
geom_density_line(aes(fill = sex), bw = 2, color = "transparent") +
scale_x_continuous(limits = c(0, 75), name = "passenger age (years)", expand = c(0, 0)) +
scale_y_continuous(limits = c(0, 26), name = "scaled density", expand = c(0, 0)) +
scale_fill_manual(
values = c("#b3b3b3a0", "#D55E00", "#0072B2"),
breaks = c("all passengers", "male", "female"),
labels = c("all passengers ", "males ", "females"),
name = NULL,
guide = guide_legend(direction = "horizontal")
) +
coord_cartesian(clip = "off") +
facet_wrap(~sex, labeller = labeller(sex = function(sex) paste(sex, "passengers"))) +
theme_dviz_hgrid() +
theme(
axis.line.x = element_blank(),
strip.text = element_text(size = 14, margin = margin(0, 0, 0.2, 0, "cm")),
legend.position = "bottom",
legend.justification = "right",
legend.margin = margin(4.5, 0, 1.5, 0, "pt"),
legend.spacing.x = grid::unit(4.5, "pt"),
legend.spacing.y = grid::unit(0, "pt"),
legend.box.spacing = grid::unit(0, "cm")
)
```
Finally, when we want to visualize exactly two distributions, we can also make two separate histograms, rotate them by 90 degrees, and have the bars in one histogram point in the opposite direction from those in the other. This trick is commonly employed when visualizing age distributions, and the resulting plot is usually called an *age pyramid* (Figure \@ref(fig:titanic-age-pyramid)).
(ref:titanic-age-pyramid) The age distributions of male and female Titanic passengers visualized as an age pyramid.
```{r titanic-age-pyramid, fig.cap='(ref:titanic-age-pyramid)'}
ggplot(gender_counts, aes(x = age, y = ifelse(gender == "male",-1, 1)*count, fill = gender)) +
geom_col() +
scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
scale_y_continuous(name = "count", breaks = 20*(-2:1), labels = c("40", "20", "0", "20")) +
scale_fill_manual(values = c("#D55E00", "#0072B2"), guide = "none") +
draw_text(x = 70, y = -39, "male", hjust = 0) +
draw_text(x = 70, y = 21, "female", hjust = 0) +
coord_flip() +
theme_dviz_grid() +
theme(axis.title.x = element_text(hjust = 0.61))
```
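The key step in building an age pyramid is to flip the sign of the counts for one of the two groups, so that its bars extend in the opposite direction, and then to label the axis with absolute values. The following base R sketch illustrates the idea with simulated ages; it is an assumption-laden illustration and not the code used for the figure above.

```r
set.seed(5)

# Simulated ages (illustration only), clamped to the bin range.
ages_male   <- pmin(pmax(rnorm(468, mean = 31, sd = 14), 1), 74)
ages_female <- pmin(pmax(rnorm(288, mean = 29, sd = 14), 1), 74)

# Identical three-year bins for both groups.
breaks <- seq(0, 75, by = 3)
mid <- head(breaks, -1) + 1.5
count_male   <- hist(ages_male, breaks = breaks, plot = FALSE)$counts
count_female <- hist(ages_female, breaks = breaks, plot = FALSE)$counts

# Flip the sign of the male counts so their bars point to the left.
signed_male <- -count_male

# Draw both sets of horizontal bars on a shared, symmetric axis.
lim <- max(count_male, count_female)
barplot(
  signed_male, horiz = TRUE, xlim = c(-lim, lim), names.arg = mid,
  col = "#0072B2", border = NA, las = 1, axes = FALSE, xlab = "count"
)
barplot(count_female, horiz = TRUE, add = TRUE, col = "#D55E00", border = NA, axes = FALSE)

# Label the count axis with absolute values on both sides of zero.
ticks <- pretty(c(-lim, lim))
axis(1, at = ticks, labels = abs(ticks))
```

Using the same bins and a symmetric axis for both groups keeps the two halves of the pyramid directly comparable.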
Importantly, this trick does not work when there are more than two distributions we want to visualize at the same time. For multiple distributions, histograms tend to become highly confusing, whereas density plots work well as long as the distributions are somewhat distinct and contiguous. For example, to visualize the distribution of butterfat percentage among cows from four different cattle breeds, density plots are fine (Figure \@ref(fig:butterfat-densitites)).
(ref:butterfat-densitites) Density estimates of the butterfat percentage in the milk of four cattle breeds. Data source: Canadian Record of Performance for Purebred Dairy Cattle.
```{r butterfat-densitites, fig.cap='(ref:butterfat-densitites)'}
cows %>%
mutate(breed = as.character(breed)) %>%
filter(breed != "Canadian") -> cows_filtered
# compute densities for butterfat percentages
cows_dens <- group_by(cows_filtered, breed) %>%
do(ggplot2:::compute_density(.$butterfat, NULL)) %>%
rename(butterfat = x)
# get the maximum values
cows_max <- filter(cows_dens, density == max(density)) %>%
ungroup() %>%
mutate(
hjust = c(0, 0, 0, 0),
vjust = c(0, 0, 0, 0),
nudge_x = c(-0.2, -0.2, 0.1, 0.23),
nudge_y = c(0.03, 0.03, -0.2, -0.06)
)
cows_p <- ggplot(cows_dens, aes(x = butterfat, y = density, color = breed, fill = breed)) +
geom_density_line(stat = "identity") +
geom_text(
data = cows_max,
aes(
label = breed, hjust = hjust, vjust = vjust,
color = breed,
x = butterfat + nudge_x,
y = density + nudge_y
),
inherit.aes = FALSE,
size = 12/.pt
) +
scale_color_manual(
values = darken(c("#56B4E9", "#E69F00", "#D55E00", "#009E73"), 0.3),
breaks = c("Ayrshire", "Guernsey", "Holstein-Friesian", "Jersey"),
guide = "none"
) +
scale_fill_manual(
values = c("#56B4E950", "#E69F0050", "#D55E0050", "#009E7350"),
breaks = c("Ayrshire", "Guernsey", "Holstein-Friesian", "Jersey"),
guide = "none"
) +
scale_x_continuous(
expand = c(0, 0),
labels = scales::percent_format(accuracy = 1, scale = 1),
name = "butterfat contents"
) +
scale_y_continuous(limits = c(0, 1.99), expand = c(0, 0)) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid() +
theme(axis.line.x = element_blank())
cows_p
```
```{block type='rmdtip', echo=TRUE}
To visualize several distributions at once, kernel density plots will generally work better than histograms.
```