# Reporting data results #1
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week4.pdf) a pdf of the lecture slides covering this topic.
## Guidelines for good plots
There are a number of very thoughtful books and articles about creating graphics that effectively communicate information. Some of the authors I highly recommend (and from whose work I've pulled the guidelines for good graphics we'll talk about this week) are:
- Edward Tufte
- Howard Wainer
- Stephen Few
- Nathan Yau
You should plan, in particular, to read *The Visual Display of Quantitative Information* by Edward Tufte before you graduate.
This week, we'll focus on six guidelines for good graphics, based on the writings of these and other specialists in data display. The guidelines are:
1. Aim for high data density.
2. Use clear, meaningful labels.
3. Provide useful references.
4. Highlight interesting aspects of the data.
5. Make order meaningful.
6. When possible, use small multiples.
For the examples, I'll use `dplyr` for data cleaning and, for plotting, the packages `ggplot2`, `gridExtra`, and `ggthemes`.
```{r message = FALSE}
library(tidyverse) ## Loads `dplyr` and `ggplot2`
library(gridExtra)
library(ggthemes)
```
You can load the data for today's examples with the following code:
```{r message = FALSE}
library(faraway)
data(nepali)
data(worldcup)
library(dlnm)
data(chicagoNMMAPS)
chic <- chicagoNMMAPS
chic_july <- chic %>%
  filter(month == 7 & year == 1995)
```
## High data density
> Guideline 1: **Aim for high data density.**
You should try to increase, as much as possible, the **data-to-ink ratio** in your graphs. This is the ratio of "ink" providing information to all ink used in the figure. Put another way, the only graphs that use up a lot of your printer's ink should be ones that are packed with information.
The two graphs in Figure \@ref(fig:datainkratio1) show the same information, but use very different amounts of ink. Each shows the number of players in each of four positions in the `worldcup` dataset. Notice how, in the plot on the right, a single dot for each category shows the same information that a whole filled bar is showing on the left. Further, the plot on the right has removed the gridded background, removing even more "ink".
```{r datainkratio1, echo = FALSE, fig.height = 3, fig.width = 8, fig.align = "center", fig.cap = "Example of plots with lower (left) and higher (right) data-to-ink ratios. Each plot shows the number of players in each position in the worldcup dataset from the faraway package."}
a <- ggplot(worldcup, aes(Position)) +
geom_bar() + coord_flip() +
ylab("Number of players") +
ggtitle("1. Lower data density")
ex <- group_by(worldcup, Position) %>%
summarise(n = n())
b <- ggplot(ex, aes(x = n, y = Position)) +
geom_point() +
xlab("Number of players") + ylab("") +
theme_few() +
xlim(0, 250) +
ggtitle("2. Higher data density")
grid.arrange(a, b, ncol = 2)
```
Figure \@ref(fig:datainkratio2) gives another example of two plots that show the same information but with very different data densities. This figure uses the `chicagoNMMAPS` data from the `dlnm` package, which includes daily mortality, weather, and air pollution data for Chicago, IL. Both plots show daily mortality counts during July 1995, when a very severe heat wave hit Chicago. Notice how many of the elements in the plot on the left, including the shading under the mortality time series and the colored background and grid lines, are unnecessary for interpreting the message from the data.
```{r datainkratio2, echo = FALSE, fig.height = 3, fig.width = 8, fig.align = "center", fig.cap = "Example of plots with lower (left) and higher (right) data-to-ink ratios. Each plot shows daily mortality in Chicago, IL, in July 1995 using the chicagoNMMAPS data from the dlnm package."}
a <- ggplot(chic_july, aes(x = date, y = death)) +
geom_area(fill = "black") +
xlab("Day in July 1995") +
ylab("All-cause deaths") +
ylim(0, 450) +
theme_excel() +
ggtitle("1. Lower data density")
b <- ggplot(chic_july, aes(x = as.POSIXlt(date)$mday,
y = death)) +
geom_line() +
xlab("Day in July 1995") +
ylab("All-cause deaths") +
ylim(0, 450) +
theme_tufte() +
ggtitle("2. Higher data density")
grid.arrange(a, b, ncol = 2)
```
By increasing the data-to-ink ratio in a plot, you can help viewers see the message of the data more quickly. A cluttered plot is harder to interpret. Further, you leave room to add some of the other elements I'll talk about, including highlighting interesting data and adding useful references. Notice how the plots on the left in Figures \@ref(fig:datainkratio1) and \@ref(fig:datainkratio2) are already cluttered and leave little room for adding extra elements, while the plots on the right of those figures have much more room for additions.
One quick way to increase data density in `ggplot2` is to change the *theme* for the plot. The theme specifies a number of the "background" elements of a plot, including the plot grid, the background color, and the font used for labeling. Some themes come with `ggplot2`, including:
- `theme_bw`
- `theme_minimal`
- `theme_void`
You can find more themes in packages that extend `ggplot2`. The `ggthemes` package, in particular, has some excellent additional themes.
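As a minimal sketch of how a theme is applied, you add the theme function to a plot object just as you would add a geom. The data frame `toy_deaths` below is made up for illustration and is not one of the course datasets.

```{r eval = FALSE}
library(ggplot2)

## Hypothetical example data (not from the course datasets)
toy_deaths <- data.frame(day = 1:10,
                         deaths = c(110, 105, 120, 118, 130,
                                    210, 340, 260, 150, 125))

## Base plot, which uses the default theme
toy_plot <- ggplot(toy_deaths, aes(x = day, y = deaths)) +
  geom_line()

## The same plot with two of the themes that come with `ggplot2`
toy_plot_bw <- toy_plot + theme_bw()
toy_plot_minimal <- toy_plot + theme_minimal()
```

Printing `toy_plot_bw` or `toy_plot_minimal` draws the plot. Because the theme is added as a separate element, you can save one base plot and try several themes on it without rewriting the rest of the code.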
Figure \@ref(fig:themeexamples) shows some examples of the effects of using different themes. All show the same information: a plot of daily deaths in Chicago in July 1995. The top left graph uses the default theme. The other plots show the effects of adding different themes, including the black-and-white theme that comes with `ggplot2` (top right) and various themes from the `ggthemes` package. You can even use themes to add some questionable choices for different elements, like the Excel theme (bottom left).
```{r themeexamples, echo = FALSE, fig.height = 9, fig.width = 8, fig.align = "center", fig.cap = "Daily mortality in Chicago, IL, in July 1995. This figure gives an example of the plot using different themes."}
chic_plot <- ggplot(chic_july, aes(x = date, y = death)) +
geom_point(color = "red")
a <- chic_plot + ggtitle("Default theme")
b <- chic_plot + theme_bw() + ggtitle("`theme_bw`")
c <- chic_plot + theme_few() + ggtitle("`theme_few`")
d <- chic_plot + theme_tufte() + ggtitle("`theme_tufte`")
e <- chic_plot + theme_excel() + ggtitle("`theme_excel`")
grid.arrange(a, b, c, d, e, ncol = 2)
```
## Meaningful labels
> Guideline 2: **Use clear, meaningful labels.**
Graphs often default to using abbreviations for axis labels and other labeling. For example, `ggplot2` plots use column names by default for the x- and y-axes of a scatterplot. While this is convenient for exploratory plots, it's often not adequate for plots for presentations and papers. You'll want to use short and easy-to-type column names in your dataframe to make coding easier, but you should use longer and more meaningful labeling in plots and tables that others need to interpret.
Furthermore, text labels can sometimes be aligned in a way that makes them hard to read. For example, when plotting a categorical variable along the x-axis, it can be difficult to fit labels for each category that are long enough to be meaningful.
Figure \@ref(fig:labelsexample) gives an example of the same information shown with labels that are harder to interpret (left) versus with clear, meaningful labels (right). Notice how the graph on the left uses abbreviations for the categorical variable ("DF" for "Defender"), uses abbreviations for axis labels ("Pos" for "Position" and "Pls" for "Number of players"), and has the player position labels in a vertical alignment. In the right graph, I have made the graph easier to read and interpret quickly by spelling out all labels and switching the x- and y-axes. This leaves room to fully spell out each position while keeping the text horizontal, so the reader doesn't have to turn the page (or their head) to read the values.
```{r labelsexample, echo = FALSE, fig.height = 3, fig.width = 8, fig.align = "center", fig.cap = "The number of players in each position in the worldcup data from the faraway package. Both graphs show the same information, but the left graph has murkier labels, while the right graph has labels that are easier to read and interpret."}
ex <- worldcup
ex$Position <- factor(ex$Position,
levels = c("Defender",
"Forward",
"Goalkeeper",
"Midfielder"),
labels = c("DF", "FW",
"GK", "MF"))
a <- ggplot(ex, aes(Position)) +
geom_bar() +
ylab("Pls") +
xlab("Pos") +
ggtitle("1. Murkier labels") +
theme(axis.text.x =
element_text(angle = 90,
vjust = 0.5,
hjust=1))
b <- ggplot(worldcup, aes(Position)) +
geom_bar(fill = "lightgray") + coord_flip() +
ylab("Number of players") + xlab("") +
theme_tufte() +
ggtitle("2. Clearer labels")
grid.arrange(a, b, ncol = 2)
```
There are a few strategies you can use to make labels clearer when plotting with `ggplot2`:
- Add `xlab` and `ylab` elements to the plot, rather than relying on the column names in the original data. You can also relabel x- and y-axes with `scale` elements (e.g., `scale_x_continuous`), and the `scale` functions give you more power to also make other changes to the x- and y-axes (e.g., changing break points for the axis ticks). However, if you only need to change axis labels, `xlab` and `ylab` are often quicker.
- Include units of measurement in axis titles when relevant. If units are dollars or percent, check out the `scales` package, which allows you to add labels directly to axis elements by including arguments like `labels = percent` in `scale` elements. See the helpfile for `scale_x_continuous` for some examples.
- If the x-variable requires longer labels, as is often the case with categorical data (for example, player positions in Figure \@ref(fig:labelsexample)), consider flipping the coordinates, rather than abbreviating or rotating the labels. You can use `coord_flip` to do this.
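The strategies above can be combined in a single plot. Here is a rough sketch, assuming a made-up data frame (`toy_positions`, with short column names `pos` and `prop`) rather than the course data. It uses `scale_y_continuous` to set both the axis title and percent labels, and `coord_flip` to keep the category labels horizontal.

```{r eval = FALSE}
library(ggplot2)

## Hypothetical example data: share of players in each position
toy_positions <- data.frame(pos = c("Defender", "Forward",
                                    "Goalkeeper", "Midfielder"),
                            prop = c(0.32, 0.24, 0.06, 0.38))

toy_label_plot <- ggplot(toy_positions, aes(x = pos, y = prop)) +
  geom_point() +
  ## Replace the short column name with a meaningful title and
  ## show the values as percents
  scale_y_continuous(name = "Percent of players",
                     labels = scales::percent) +
  ## The category names speak for themselves, so drop that axis title
  xlab("") +
  ## Flip the axes so the position names are written horizontally
  coord_flip()
```

Note that `scales::percent` comes from the `scales` package, which is installed along with `ggplot2`.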
## References
> Guideline 3: **Provide useful references.**
Data is easier to interpret when you add references. For example, if you show what is typical, it helps viewers interpret how unusual outliers are.
Figure \@ref(fig:referenceexample1) shows daily mortality during July 1995 in Chicago, IL. The graph on the right has added shading showing the range of daily death counts in July in Chicago for neighboring years (1990--1994 and 1996--2000). This added reference helps clarify for viewers how unusual the number of deaths during the July 1995 heat wave was.
```{r referenceexample1, echo = FALSE, fig.height = 3, fig.width = 8, fig.align = "center", fig.cap = "Daily mortality during July 1995 in Chicago, IL. In the graph on the right, I have added a shaded region showing the range of daily mortality counts for neighboring years, to show how unusual this event was."}
chic_july <- subset(chic, month == 7 & year == 1995)
chic_july_ref <- filter(chic, month == 7 &
year %in% c(1990:1994,
1996:2000)) %>%
summarise(mean = mean(death),
min = min(death),
max = max(death))
ref_points <- data.frame(date = c(-2, 33, 33, -2),
death = c(rep(chic_july_ref$max, 2),
rep(chic_july_ref$min, 2)))
a <- ggplot(chic_july, aes(x = as.POSIXlt(date)$mday,
y = death)) +
geom_line() +
xlab("Day in July 1995") +
ylab("All-cause deaths") +
ylim(0, 450) +
theme_tufte() +
ggtitle("1. No reference")
b <- ggplot(chic_july, aes(x = as.POSIXlt(date)$mday,
y = death)) +
xlab("Day in July 1995") +
ylab("All-cause deaths") +
ylim(0, 450) +
theme_tufte() +
geom_polygon(aes(x = date, y = death),
data = ref_points,
color = "lightgray",
alpha = 0.1) +
geom_line() +
ggtitle("2. Reference")
grid.arrange(a, b, ncol = 2)
```
Another useful way to add references is to add a linear or smooth fit to the data, to help clarify trends in the data. Figure \@ref(fig:referenceexample2) shows the relationship between passes and shots for Forwards in the `worldcup` dataset. The plot on the right has added a smooth function of the relationship between these two variables.
```{r referenceexample2, echo = FALSE, message = FALSE, fig.width = 8, fig.height = 4, fig.align = "center", fig.cap = "Relationship between passes and shots taken among Forwards in the worldcup dataset from the faraway package. The plot on the right has a smooth function added to help show the relationship between these two variables."}
ex <- filter(worldcup, Position == "Forward")
a <- ggplot(ex, aes(x = Passes, y = Shots)) +
geom_point(size = 1.5) +
theme_few() +
ggtitle("1. No reference")
b <- ggplot(ex, aes(x = Passes, y = Shots)) +
geom_point(size = 1.5) +
theme_few() +
geom_smooth() +
ggtitle("2. Reference")
grid.arrange(a, b, ncol = 2)
```
For scatterplots created with `ggplot2`, you can use the function `geom_smooth` to add a smooth or linear reference line. Here is the code that produces the right-hand plot in Figure \@ref(fig:referenceexample2) (minus the title):
```{r eval = FALSE}
ggplot(filter(worldcup, Position == "Forward"),
       aes(x = Passes, y = Shots)) +
  geom_point(size = 1.5) +
  theme_few() +
  geom_smooth()
```
The most useful `geom_smooth` parameters to know are:
- `method`: The default is to add a loess curve if the data includes fewer than 1,000 points and a generalized additive model for 1,000 points or more. However, you can change to show the fitted line from a linear model using `method = "lm"` or from a generalized linear model using `method = "glm"`.
- `span`: How wiggly or smooth the smooth line should be (smaller value: more wiggly; larger value: more smooth)
- `se`: TRUE or FALSE, indicating whether to include shading for 95% confidence intervals.
- `level`: Confidence level for confidence interval (e.g., `0.90` for 90% confidence intervals)
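To see how these parameters fit together, here is a small sketch using simulated data (the `toy_players` data frame and its values are invented for illustration, not taken from `worldcup`):

```{r eval = FALSE}
library(ggplot2)

## Hypothetical example data (simulated)
set.seed(1)
toy_players <- data.frame(passes = 1:50)
toy_players$shots <- 2 + 0.1 * toy_players$passes + rnorm(50)

toy_base <- ggplot(toy_players, aes(x = passes, y = shots)) +
  geom_point()

## Linear fit with shading for a 90% confidence interval
toy_lm <- toy_base +
  geom_smooth(method = "lm", level = 0.90)

## Loess fit that is wigglier than the default, with no shading
toy_loess <- toy_base +
  geom_smooth(method = "loess", span = 0.3, se = FALSE)
```

The `span` parameter only applies to loess fits, so it is paired with `method = "loess"` here.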
Lines and polygons can also be useful for adding references, as in Figure \@ref(fig:referenceexample1). Useful geoms for such shapes include:
- `geom_hline`, `geom_vline`: Add a horizontal or vertical line
- `geom_abline`: Add a line with an intercept and slope
- `geom_polygon`: Add a filled polygon
- `geom_path`: Add an unfilled polygon
You want these references to support the main data shown in the plot, but not overwhelm it. When adding these references:
- Add reference elements first, so they will be plotted under the data, instead of on top of it.
- Use `alpha` to add transparency to these elements.
- Use colors that are unobtrusive (e.g., grays).
- For lines, consider using non-solid line types (e.g., `linetype = 3`).
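As a sketch of these tips in practice, the code below adds a horizontal reference line before the data layer, so the line is drawn underneath the time series. The data frame `toy_deaths` and the "typical" value of 115 are both made up for illustration.

```{r eval = FALSE}
library(ggplot2)

## Hypothetical example data (not from the course datasets)
toy_deaths <- data.frame(day = 1:10,
                         deaths = c(110, 105, 120, 118, 130,
                                    210, 340, 260, 150, 125))

## Hypothetical "typical" daily count to use as a reference
typical_deaths <- 115

toy_ref_plot <- ggplot(toy_deaths, aes(x = day, y = deaths)) +
  ## Reference first: gray, dotted, and slightly transparent
  geom_hline(yintercept = typical_deaths,
             color = "gray", linetype = 3, alpha = 0.8) +
  ## Data second, so it is plotted on top of the reference
  geom_line()
```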
## Highlighting
> Guideline 4: **Highlight interesting aspects.**
Consider adding elements to highlight noteworthy aspects of the data. For example, in the graph on the right of Figure \@ref(fig:highlightexample1), the days of the heat wave (based on temperature measurements) have been highlighted over the mortality time series by using a thick red line.
```{r highlightexample1, echo = FALSE, fig.height = 3, fig.width = 8, fig.align = "center", fig.cap = "Mortality in Chicago, July 1995. In the plot on the right, a thick red line has been added to show the dates of a heat wave."}
chic_july <- subset(chic, month == 7 & year == 1995)
chic_july_ref <- filter(chic, month == 7 &
year %in% c(1990:1994,
1996:2000)) %>%
summarise(mean = mean(death),
min = min(death),
max = max(death))
ref_points <- data.frame(date = c(-2, 33, 33, -2),
death = c(rep(chic_july_ref$max, 2),
rep(chic_july_ref$min, 2)))
hw <- data.frame(date = c(12, 16, 16, 12),
death = c(425, 425, 0, 0))
a <- ggplot(chic_july, aes(x = as.POSIXlt(date)$mday,
y = death)) +
xlab("Day in July 1995") +
ylab("All-cause deaths") +
ylim(0, 450) +
theme_tufte() +
geom_polygon(aes(x = date, y = death),
data = ref_points,
color = "lightgray",
alpha = 0.1) +
geom_line() +
ggtitle("1. No highlighting")
b <- ggplot(chic_july, aes(x = as.POSIXlt(date)$mday,
y = death)) +
xlab("Day in July 1995") +
ylab("All-cause deaths") +
ylim(0, 450) +
theme_tufte() +
geom_polygon(aes(x = date, y = death),
data = ref_points,
color = "lightgray",
alpha = 0.1) +
geom_line(aes(x = date, y = death),
data = hw[1:2, ],
color = "red",
size = 2) +
geom_line() +
ggtitle("2. With highlighting")
grid.arrange(a, b, ncol = 2)
```
In the graphs below, the names of the players with the most shots and passes have been added to highlight these unusual points.
```{r echo = FALSE, message = FALSE, fig.width = 8, fig.height = 4}
ex <- subset(worldcup, Position == "Forward")
a <- ggplot(ex, aes(x = Passes, y = Shots)) +
geom_point(size = 1.5, alpha = 0.5) +
theme_few() +
ggtitle("1. No highlighting")
most_shots <- ex[which.max(ex$Shots), ]
most_passes <- ex[which.max(ex$Passes), ]
b <- ggplot(ex, aes(x = Passes, y = Shots)) +
geom_point(size = 1.5, alpha = 0.5) +
theme_few() +
ggtitle("2. Highlighting") +
geom_text(data = most_shots,
label = paste(rownames(most_shots), ",",
most_shots$Team, " "),
colour = "blue", size = 3,
hjust = 1, vjust = 0.4) +
geom_text(data = most_passes,
label = paste(rownames(most_passes), ",",
most_passes$Team, " "),
colour = "blue", size = 3,
hjust = 1, vjust = 0.4)
grid.arrange(a, b, ncol = 2)
```
One helpful way to annotate is with text, using `geom_text()`. For this, you'll first need to create a dataframe with the hottest day in the data:
```{r}
hottest_day <- chic_july %>%
  filter(temp == max(temp))
hottest_day[ , 1:6]
```
```{r fig.height = 3, fig.width = 4, out.width = "0.7\\textwidth", fig.align = "center"}
chic_plot + geom_text(data = hottest_day,
                      label = "Max",
                      size = 3)
```
With `geom_text`, you'll often want to shift the text so it isn't right on top of the data points. One way to do this is with the `hjust` and `vjust` parameters, which set the horizontal and vertical justification of the text relative to its position:
```{r fig.height = 3, fig.width = 4, out.width = "0.5\\textwidth", fig.align = "center", message = FALSE}
chic_plot + geom_text(data = hottest_day,
                      label = "Max",
                      size = 3, hjust = 0, vjust = -1)
```
You can also use lines to highlight. For this, it is often useful to create a new dataframe with data for the reference. To add a line for the Chicago heat wave, I've added a dataframe called `hw` with the relevant date range. I'm setting the y-value to be high enough (425) to ensure the line will be placed above the mortality data.
```{r}
hw <- data.frame(date = c(as.Date("1995-07-12"),
                          as.Date("1995-07-16")),
                 death = c(425, 425))
b <- chic_plot +
  geom_line(data = hw,
            aes(x = date, y = death),
            size = 2)
```
```{r fig.height = 3, fig.width = 4, out.width = "0.7\\textwidth", fig.align = "center"}
b
```
## Order
> Guideline 5: **Make order meaningful.**
You can make the ranking of data clearer in a graph by ordering the categories by rank. Often, factor or categorical variables are ordered by something that is not interesting, like alphabetical order.
```{r echo = FALSE, fig.width = 8, fig.height = 5}
ex <- group_by(worldcup, Team) %>%
summarise(mean_time = mean(Time))
a <- ggplot(ex, aes(x = mean_time, y = Team)) +
geom_point() +
theme_few() +
xlab("Mean time per player (minutes)") + ylab("") +
ggtitle("1. Alphabetical order")
ex2 <- arrange(ex, mean_time) %>%
mutate(Team = factor(Team, levels = Team))
b <- ggplot(ex2, aes(x = mean_time, y = Team)) +
geom_point() +
theme_few() +
xlab("Mean time per player (minutes)") + ylab("") +
ggtitle("2. Meaningful order")
grid.arrange(a, b, ncol = 2)
```
You can re-order factor variables in a graph by resetting the factor using the `factor` function and changing the order that levels are included in the `levels` parameter.
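For instance, here is a minimal sketch of re-ordering team names by a summary value instead of alphabetically. The data frame `toy_teams` and its values are invented for illustration.

```{r eval = FALSE}
## Hypothetical example data
toy_teams <- data.frame(team = c("Brazil", "Chile", "Ghana", "Spain"),
                        mean_time = c(210, 180, 240, 195),
                        stringsAsFactors = FALSE)

## Get the team names sorted from lowest to highest mean time
team_order <- toy_teams$team[order(toy_teams$mean_time)]

## Reset the factor so its levels follow that order
toy_teams$team <- factor(toy_teams$team, levels = team_order)

levels(toy_teams$team)
```

When a factor like this is mapped to an axis, `ggplot2` lays out the categories in the order of the factor levels, so the plot follows the ranking rather than the alphabet.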
## Small multiples
> Guideline 6: **When possible, use small multiples.**

*Small multiples* are graphs that use many small plots showing the same thing for different facets of the data. For example, instead of using color in a single plot to show data for males and females, you could use two small plots, one each for males and females.
Typically, in small multiples, all plots use the same x- and y-axes. This makes it easier to compare across plots, and it also allows you to save room by limiting axis annotation.
```{r echo = FALSE, message = FALSE, fig.height = 6, fig.width = 8}
ex <- subset(worldcup, Position %in% c("Forward",
"Midfielder"))
ex2 <- group_by(ex, Team) %>%
summarise(mean = mean(Shots)) %>%
arrange(desc(mean))
ex$Team <- factor(ex$Team,
levels = ex2$Team)
a <- ggplot(ex, aes(x = Time, y = Shots)) +
geom_point() +
theme_few() +
facet_wrap(~ Team, ncol = 8) +
geom_smooth(method = "lm", se = FALSE)
a
```
You can use the `facet` functions to create small multiples. This separates the graph into several small graphs, one for each level of a factor.
The `facet` functions are:
- `facet_grid()`
- `facet_wrap()`
For example, to create small multiples by sex for the Nepali dataset, when plotting height versus weight, you can call:
```{r warning = FALSE, fig.width = 8, fig.height = 3}
ggplot(nepali, aes(ht, wt)) +
  geom_point() +
  facet_grid(. ~ sex)
```
The `facet_grid` function can facet by one or two variables. One will be shown by rows, and one by columns:
```{r eval = FALSE}
## Generic code
facet_grid([factor for rows] ~ [factor for columns])
```
The `facet_wrap()` function can only facet by one variable, but it can "wrap" the small graphs for that variable, so they don't all have to be in one row or column:
```{r eval = FALSE}
## Generic code
facet_wrap(~ [factor for faceting], ncol = [number of columns])
```
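As a concrete version of the generic `facet_wrap` call, here is a short sketch with simulated data (the `toy_groups` data frame is hypothetical). With six groups and `ncol = 3`, the small plots wrap into two rows.

```{r eval = FALSE}
library(ggplot2)

## Hypothetical example data: six groups with five points each
set.seed(2)
toy_groups <- data.frame(group = rep(paste("Group", 1:6), each = 5),
                         x = rep(1:5, times = 6),
                         y = rnorm(30))

## One small plot per group, wrapped into three columns
toy_facets <- ggplot(toy_groups, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ group, ncol = 3)
```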
Often, when you do faceting, you'll want to rename your factor levels or re-order them. For this, you'll need to use the `factor()` function on the original vector. For example, to rename the `sex` factor levels from "1" and "2" to "Male" and "Female", you can run:
```{r}
nepali <- nepali %>%
  mutate(sex = factor(sex, levels = c(1, 2),
                      labels = c("Male", "Female")))
```
Notice that the labels for the two graphs have now changed:
```{r warning = FALSE, fig.width = 8, fig.height = 3}
ggplot(nepali, aes(ht, wt)) +
  geom_point() +
  facet_grid(. ~ sex)
```
To re-order the factor, and show the plot for "Female" first, you can use `factor` to change the order of the levels:
```{r}
nepali <- nepali %>%
  mutate(sex = factor(sex, levels = c("Female", "Male")))
```
Now notice that the order of the plots has changed:
```{r warning = FALSE, fig.width = 8, fig.height = 3}
ggplot(nepali, aes(ht, wt)) +
  geom_point() +
  facet_grid(. ~ sex)
```
## Advanced customization
### Scales
There are a number of different functions for adjusting scales. Their names follow this convention:
```{r eval = FALSE}
## Generic code
scale_[aesthetic]_[vector type]
```
For example, to adjust the x-axis scale for a continuous variable, you'd use `scale_x_continuous`. You can use a `scale` function for an axis to change things like the axis label (which you could also change with `xlab` or `ylab`) as well as position and labeling of breaks.
For example, here is the default for plotting time versus passes for the `worldcup` dataset, with the number of shots taken shown by size and position shown by color:
```{r fig.width = 7, fig.height = 4, out.width = "0.8\\textwidth", fig.align = "center"}
ggplot(worldcup, aes(x = Time, y = Passes,
                     color = Position, size = Shots)) +
  geom_point(alpha = 0.5)
```
And here is the same plot with the x-axis label and breaks customized using `scale_x_continuous`:
```{r fig.width = 7, fig.height = 4, out.width = "0.8\\textwidth", fig.align = "center"}
ggplot(worldcup, aes(x = Time, y = Passes,
                     color = Position, size = Shots)) +
  geom_point(alpha = 0.5) +
  scale_x_continuous(name = "Time played (minutes)",
                     breaks = 90 * c(2, 4, 6),
                     minor_breaks = 90 * c(1, 3, 5))
```
Parameters you might find useful in `scale` functions include:
```{r echo = FALSE}
scale_params <- data.frame(param = c("name",
"breaks",
"minor_breaks",
"labels",
"limits"),
desc = c("Label or legend name",
"Vector of break points",
"Vector of minor break points",
"Labels to use for each break",
"Limits to the range of the axis"))
knitr::kable(scale_params, col.names = c("Parameter", "Description"))
```
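Here is a brief sketch that uses several of these parameters at once, including `labels` and `limits`. The data frame `toy_time` and its values are made up for illustration. Note that `labels` needs one entry for each value in `breaks`.

```{r eval = FALSE}
library(ggplot2)

## Hypothetical example data
toy_time <- data.frame(minutes = c(45, 90, 180, 270, 360, 540),
                       passes = c(10, 35, 80, 120, 150, 260))

toy_scale_plot <- ggplot(toy_time, aes(x = minutes, y = passes)) +
  geom_point() +
  scale_x_continuous(name = "Time played",
                     ## Put breaks at every two full games (90 minutes each)
                     breaks = 90 * c(0, 2, 4, 6),
                     ## One label for each break
                     labels = c("0 games", "2 games",
                                "4 games", "6 games"),
                     ## Range of the axis
                     limits = c(0, 540))
```

One thing to keep in mind is that points outside `limits` are dropped from the plot, so the limits here were chosen to cover all of the toy data.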
For dates, you can use `scale` functions like `scale_x_date` and `scale_x_datetime`. For example, here's a plot of deaths in Chicago in July 1995 using default values for the x-axis:
```{r fig.width = 5, fig.height = 2, out.width = "0.9\\textwidth", fig.align = "center"}
ggplot(chic_july, aes(x = date, y = death)) +
  geom_line()
```
And here's an example of changing the formatting and name of the x-axis:
```{r fig.width = 5, fig.height = 2, out.width = "0.9\\textwidth", fig.align = "center"}
ggplot(chic_july, aes(x = date, y = death)) +
  geom_line() +
  scale_x_date(name = "Date in July 1995",
               date_labels = "%m-%d")
```
You can also use the `scale` functions to transform an axis. For example, to show the Chicago plot with "deaths" on a log scale, you can run:
```{r fig.width = 5, fig.height = 2, out.width = "0.9\\textwidth", fig.align = "center"}
ggplot(chic_july, aes(x = date, y = death)) +
  geom_line() +
  scale_y_log10()
```
For colors and fills, the conventions for the names of the `scale` functions can vary. For example, to adjust the color scale when you're mapping a discrete variable (i.e., categorical, like gender or animal breed) to color, you'd use `scale_color_hue`. To adjust the color scale for a continuous variable, like age, you'll use `scale_color_gradient`.
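As a quick sketch of the difference, the code below maps color first to a discrete variable and then to a continuous one. The data frame `toy_people` and its values are invented for illustration.

```{r eval = FALSE}
library(ggplot2)

## Hypothetical example data
toy_people <- data.frame(height = c(150, 160, 165, 172, 180),
                         weight = c(50, 58, 63, 70, 82),
                         group = c("a", "b", "a", "b", "a"),
                         age = c(12, 18, 25, 40, 63))

## Discrete variable mapped to color: adjust with `scale_color_hue`
toy_discrete <- ggplot(toy_people,
                       aes(x = height, y = weight, color = group)) +
  geom_point() +
  scale_color_hue(name = "Group", l = 40)  # `l` sets the luminance

## Continuous variable mapped to color: adjust with `scale_color_gradient`
toy_continuous <- ggplot(toy_people,
                         aes(x = height, y = weight, color = age)) +
  geom_point() +
  scale_color_gradient(name = "Age (years)",
                       low = "lightblue", high = "darkblue")
```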
For any color scale, consider starting with the `brewer` scale functions (e.g., `scale_color_brewer`, `scale_color_distiller`). These functions allow you to set colors using different palettes. You can explore these palettes at http://colorbrewer2.org/.
The Brewer palettes fall into three categories: sequential, diverging, and qualitative. You should use sequential or diverging palettes for continuous data and qualitative palettes for categorical data. Use `display.brewer.pal` from the `RColorBrewer` package to show the palette for a given number of colors.
```{r out.width = "0.32\\textwidth", fig.show='hold', fig.height = 3, fig.width = 4}
library(RColorBrewer)
display.brewer.pal(name = "Set1", n = 8)
display.brewer.pal(name = "PRGn", n = 8)
display.brewer.pal(name = "PuBuGn", n = 8)
```
Use the `palette` argument within a `scale` function to customize the palette:
```{r fig.width = 8, fig.height = 2, out.width = "\\textwidth"}
a <- ggplot(data.frame(x = 1:5, y = rnorm(5),
                       group = letters[1:5]),
            aes(x = x, y = y, color = group)) +
  geom_point()
b <- a + scale_color_brewer(palette = "Set1")
c <- a + scale_color_brewer(palette = "Pastel2") +
  theme_dark()
grid.arrange(a, b, c, ncol = 3)
```
```{r fig.width = 7, fig.height = 4, out.width = "0.8\\textwidth", fig.align = "center"}
ggplot(worldcup, aes(x = Time, y = Passes,
                     color = Position, size = Shots)) +
  geom_point(alpha = 0.5) +
  scale_color_brewer(palette = "Dark2",
                     name = "Player position")
```
You can also set colors manually:
```{r fig.width = 7, fig.height = 4, out.width = "0.8\\textwidth", fig.align = "center"}
ggplot(worldcup, aes(x = Time, y = Passes,
                     color = Position, size = Shots)) +
  geom_point(alpha = 0.5) +
  scale_color_manual(values = c("blue", "red",
                                "darkgreen", "darkgray"))
```
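When colors are given as an unnamed vector, they are matched to the factor levels in order. If you'd rather spell out which color goes with which level, one option is to pass a named vector to `values`. Here is a minimal sketch of that, using a made-up data frame (`toy_passes`) and a hypothetical `position_colors` vector:

```{r eval = FALSE}
library(ggplot2)

## Hypothetical example data
toy_passes <- data.frame(time = c(90, 180, 270, 360),
                         passes = c(20, 75, 110, 160),
                         position = c("Goalkeeper", "Defender",
                                      "Midfielder", "Forward"))

## Named vector: each name is a level of `position`
position_colors <- c(Defender = "blue",
                     Forward = "red",
                     Goalkeeper = "darkgreen",
                     Midfielder = "darkgray")

toy_manual <- ggplot(toy_passes,
                     aes(x = time, y = passes, color = position)) +
  geom_point() +
  scale_color_manual(values = position_colors)
```

With names on the vector, the color for each position stays the same even if the order of the factor levels changes later.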
## To find out more
Some excellent further references for plotting are:
- R Graphics Cookbook (book and website)
- Google images
For more technical details about plotting in R:
- ggplot2: Elegant Graphics for Data Analysis, Hadley Wickham
- R Graphics, Paul Murrell
## In-course exercise
### Designing a plot
For today's exercise, you'll be building a plot using the `worldcup` data from the `faraway` package. First, load in that data. The name of each player is in the rownames of this data. Use the `tibble::rownames_to_column()` function to move those rownames into a new column named `Player`. Also install and load the `ggplot2` and `ggthemes` packages.
Next, say you want to look at the relationship between the number of minutes that a player played in the 2010 World Cup (`Time`) and the number of shots the player took on goal (`Shots`). On a sheet of paper, and talking with your partner, decide how the two of you would design a plot to explore and present this relationship. How would you incorporate some of the principles of creating good graphs?
#### Example R code
For this section, the only code needed is code to load the required packages, load the data, and move the rownames to a column named `Player`.
```{r}
library(faraway)
data(worldcup)
head(worldcup, 2)
```
This dataset has the players' names as rownames, rather than in a column. Once we start using `dplyr` functions, we'll lose these rownames. Therefore, start by converting the rownames to a column called `Player`:
```{r message = FALSE}
library(dplyr)
worldcup <- worldcup %>%
tibble::rownames_to_column(var = "Player")
head(worldcup, 2)
```
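To see what this step does in isolation, here is a small sketch on a made-up data frame (`toy_df` and its values are invented for illustration). It moves the row names into a `Player` column and uses `tibble::has_rownames()` to confirm that the result no longer carries them:

```{r}
library(tibble)

# Hypothetical toy data with names stored as row names
toy_df <- data.frame(Shots = c(3, 0, 7),
                     row.names = c("Player A", "Player B", "Player C"))
has_rownames(toy_df)

# Move the row names into a regular column called Player
toy_with_names <- rownames_to_column(toy_df, var = "Player")
toy_with_names
has_rownames(toy_with_names)
```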
Install and load the `ggplot2` and `ggthemes` packages:
```{r}
# install.packages("ggplot2")
library(ggplot2)
# install.packages("ggthemes")
library(ggthemes)
```
### Implementing plot guidelines #1
In this section, we'll work on creating a plot like this:
```{r, fig.width = 8, fig.height = 2.5, echo = FALSE}
most_shots <- worldcup[which.max(worldcup$Shots), ]
top_four <- c("Spain", "Germany", "Uruguay", "Netherlands")
worldcup$top_four <- factor(worldcup$Team %in% top_four,
levels = c(TRUE, FALSE),
labels = c("Top 4", "Other"))
worldcup$Position <- factor(worldcup$Position,
levels = c("Goalkeeper",
"Defender",
"Midfielder",
"Forward"))
ggplot(worldcup, aes(x = Time, y = Shots, color = top_four)) +
geom_vline(xintercept = 90 * 3, color = "lightgray",
linetype = 2, alpha = 0.5) +
geom_point(size = 1.8, alpha = 0.7) +
facet_grid(. ~ Position) +
scale_x_continuous("Time played in World Cup (minutes)",
breaks = 180 * 0:7) +
theme_few() +
geom_text(data = most_shots,
aes(label = paste0(Player, ", ", Team, " ")),
colour = "black", size = 3,
hjust = 1, vjust = 0.4) +
scale_color_discrete(name = "Team's final\nranking")
data(worldcup)
worldcup <- worldcup %>%
mutate(Player = rownames(worldcup))
```
Do the following tasks:
- Create a simple scatterplot of Time versus Shots for the World Cup data. It should look like this:
```{r echo = FALSE, fig.width = 3, fig.height = 3}
worldcup %>%
ggplot(aes(x = Time, y = Shots)) +
geom_point()
```
- Next, before any more coding, talk with your group members about how the finished graph shown at the start of this section differs from the simple scatterplot above. Also discuss what you can figure out from the finished graph that is less clear from the simple scatterplot of Time versus Shots for this data.
- Often, in graphs with a lot of points, it's hard to see some of the points, because they overlap other points. Two strategies to address this are: (a) make the points smaller; and (b) make the points somewhat transparent. Try both with the scatterplot you're creating. At this point, the plot should look something like this:
```{r echo = FALSE, fig.width = 3, fig.height = 3}
worldcup %>%
ggplot(aes(x = Time, y = Shots)) +
geom_point(alpha = 0.5, size = 1)
```
- Create a new column in the `worldcup` data called `top_4` that specifies whether or not the `Team` for that observation was one of the top four teams in the tournament (Netherlands, Uruguay, Spain, and Germany). Make the colors of the points correspond to whether the team was a top-four team. At this point, the plot should look something like this:
```{r echo = FALSE, fig.width = 4, fig.height = 3}
worldcup %>%
mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany")) %>%
ggplot(aes(x = Time, y = Shots, color = top_4)) +
geom_point(alpha = 0.5, size = 1)
```
- Increase data density: Try changing the theme, to come up with a graph with a bit less non-data ink. From the `ggthemes` package, try some of the following themes: `theme_few()`, `theme_tufte()`, `theme_stata()`, `theme_fivethirtyeight()`, `theme_economist_white()`, and `theme_wsj()`. Pick a theme that helps increase the graph's data density. At this point, the plot should look something like this:
```{r echo = FALSE, fig.width = 4, fig.height = 3}
worldcup %>%
mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany")) %>%
ggplot(aes(x = Time, y = Shots, color = top_4)) +
geom_point(alpha = 0.5, size = 1) +
theme_few()
```
- Use meaningful labels: Use the `labs()` function to make a clearer title for the x-axis. (You may have already written this code in the last section of this exercise.) In addition to setting the x-axis title with the `labs()` function, you can also set the title for the color scale (use `color = ` within the `labs()` function). You may want to add a line break in the color title. To do that, put the linebreak character (`\n`) inside the character string with the title. At this point, the plot should look something like this:
```{r echo = FALSE, fig.width = 4, fig.height = 3}
worldcup %>%
mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany")) %>%
ggplot(aes(x = Time, y = Shots, color = top_4)) +
geom_point(alpha = 0.5, size = 1) +
labs(x = "Time played in World Cup (minutes)",
color = "Team's final\nranking") +
theme_few()
```
- Provide useful references: The standard time for a soccer game is 90 minutes. In the World Cup, all teams play at least three games, and then the top teams continue and play more games. Add a reference line at 270 minutes (i.e., the amount of standard time played for the three games that all teams play). At this point, the plot should look something like this:
```{r echo = FALSE, fig.width = 4, fig.height = 3}
worldcup %>%
mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany")) %>%
ggplot(aes(x = Time, y = Shots, color = top_4)) +
geom_vline(xintercept = 90 * 3, color = "gray", linetype = 2) +
geom_point(alpha = 0.5, size = 1) +
labs(x = "Time played in World Cup (minutes)",
color = "Team's final\nranking") +
theme_few()
```
#### Example R code
As a reminder, here's the code to do a simple scatterplot of Shots by Time for the `worldcup` data:
```{r fig.width = 3, fig.height = 3}
ggplot(data = worldcup) +
geom_point(mapping = aes(x = Time, y = Shots))
```
Next, make the points easier to see by making them smaller and somewhat transparent. You can do this with the `size` and `alpha` aesthetics of `geom_point()`. For `size`, the default for points is 1.5, so smaller values shrink the points and larger values enlarge them. For `alpha`, values closer to 0 are more transparent and values closer to 1 are more opaque. As a reminder, in this case you are changing all of the points in the same way, so you will be setting those aesthetics to constant values. That means that you should specify the values **outside** of an `aes` call. This code could make these changes:
```{r fig.width = 3, fig.height = 3}
ggplot(data = worldcup) +
geom_point(mapping = aes(x = Time, y = Shots),
size = 1, alpha = 0.5)
```
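The difference between setting an aesthetic to a constant and mapping it to a column can be sketched with a small made-up data frame (`toy_shots` is invented for illustration). In the first plot, `size` is set outside `aes()`, so every point is the same size. In the second, `size` is mapped inside `aes()`, so it varies with the data:

```{r fig.width = 4, fig.height = 3}
library(ggplot2)

# Hypothetical toy data, invented for this illustration
toy_shots <- data.frame(Time = c(90, 180, 270),
                        Shots = c(1, 4, 2))

# Setting: every point gets the same size and transparency
set_plot <- ggplot(toy_shots) +
  geom_point(aes(x = Time, y = Shots), size = 1, alpha = 0.5)

# Mapping: point size varies with the values in the Shots column
mapped_plot <- ggplot(toy_shots) +
  geom_point(aes(x = Time, y = Shots, size = Shots), alpha = 0.5)

set_plot
mapped_plot
```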
To create a new column called `top_4`, use `mutate()` with the `%in%` operator to create a logical column that says whether the team for that observation is one of the top four teams:
```{r}
worldcup <- worldcup %>%
mutate(top_4 = Team %in% c("Spain", "Germany",
"Uruguay", "Netherlands"))
head(worldcup)
summary(worldcup$top_4)
```
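If the `%in%` operator is new to you, this short sketch shows what it returns on its own. The vector `toy_teams` is made up for illustration. The result has one logical value for each element on the left-hand side:

```{r}
# Hypothetical vector of team names, invented for this illustration
toy_teams <- c("Spain", "Brazil", "Uruguay", "Japan")
top_teams <- c("Spain", "Germany", "Uruguay", "Netherlands")

# TRUE where the element of toy_teams appears anywhere in top_teams
toy_in_top <- toy_teams %in% top_teams
toy_in_top
```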
To color points by this variable, add `color = ` inside the `aes()` call:
```{r fig.width = 4, fig.height = 3}
ggplot(data = worldcup) +
geom_point(mapping = aes(x = Time, y = Shots, color = top_4),
size = 1, alpha = 0.5)
```
To increase the data density, try out different themes for the plot. First, I'll save everything we've done so far as the object `shot_plot`, then I'll try adding different themes:
```{r fig.width = 4, fig.height = 3}
shot_plot <- ggplot(data = worldcup) +
geom_point(mapping = aes(x = Time, y = Shots, color = top_4),
size = 1, alpha = 0.5)
shot_plot + theme_few()
shot_plot + theme_tufte()
shot_plot + theme_wsj()
shot_plot + theme_fivethirtyeight()
shot_plot + theme_stata()
shot_plot + theme_economist_white()
```
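If you would rather compare several themes side by side than one at a time, one possible approach is sketched below. It assumes the `gridExtra` package is installed and uses a small made-up data frame (`toy_theme_data`) along with helper names (`theme_list`, `themed_plots`) that are invented for this illustration:

```{r fig.width = 8, fig.height = 3}
library(ggplot2)
library(ggthemes)
library(gridExtra)

# Hypothetical toy data, invented for this illustration
toy_theme_data <- data.frame(Time = c(90, 180, 270, 360),
                             Shots = c(1, 4, 2, 6))
toy_base_plot <- ggplot(toy_theme_data, aes(x = Time, y = Shots)) +
  geom_point()

# Named list of theme functions to compare
theme_list <- list(few = theme_few, tufte = theme_tufte,
                   stata = theme_stata)

# Add each theme to the base plot, using the list name as the plot title
themed_plots <- lapply(names(theme_list), function(theme_name) {
  toy_base_plot + theme_list[[theme_name]]() + ggtitle(theme_name)
})

grid.arrange(grobs = themed_plots, ncol = 3)
```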
The data density is increased with the `theme_few()` theme, so I'll use that:
```{r fig.width = 4, fig.height = 3}
ggplot(data = worldcup) +
geom_point(mapping = aes(x = Time, y = Shots, color = top_4),
size = 1, alpha = 0.5) +
theme_few()
```
To change the titles for some of the scales (the x-axis and color scale), you can use the `labs()` function. Note that you can use `\n` to add a line break inside one of these titles (I've done that for the title for the color scale):
```{r fig.width = 4, fig.height = 3}
ggplot(data = worldcup) +
geom_point(mapping = aes(x = Time, y = Shots, color = top_4),
size = 1, alpha = 0.5) +
theme_few() +
labs(x = "Time played in World Cup (minutes)",
color = "Team's final\nranking")
```
As an extra note, if you want to create nicer labels for the legend for color, convert the `top_4` column into the factor class, with the labels you want to use in the figure legend:
```{r}
worldcup <- worldcup %>%
mutate(top_4 = factor(top_4, levels = c(TRUE, FALSE),
labels = c("Top 4", "Other")))
summary(worldcup$top_4)
```
```{r fig.width = 4, fig.height = 3}
ggplot(data = worldcup) +
geom_point(mapping = aes(x = Time, y = Shots, color = top_4),
size = 1, alpha = 0.5) +
theme_few() +
labs(x = "Time played in World Cup (minutes)",
color = "Team's final\nranking")
```
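The relabeling step can be tried on a small logical vector first (`toy_top_4` is made up for illustration). The `levels` argument gives the order of the categories, and `labels` gives the text used for each level, in that same order:

```{r}
# Hypothetical logical vector, invented for this illustration
toy_top_4 <- c(TRUE, FALSE, FALSE, TRUE)

# TRUE becomes "Top 4" and FALSE becomes "Other"; "Top 4" is the first level
toy_top_4_factor <- factor(toy_top_4, levels = c(TRUE, FALSE),
                           labels = c("Top 4", "Other"))
toy_top_4_factor
levels(toy_top_4_factor)
```

Because "Top 4" is the first level, it is listed first in the plot legend.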
To add a reference line at 270 minutes of time, use the `geom_vline()` function. You'll want to make it a light color (like light gray) and dashed or dotted (`linetype` of 2 or 3), so it won't be too prominent on the graph:
```{r fig.width = 4, fig.height = 3}
ggplot(data = worldcup) +
geom_vline(xintercept = 270, color = "lightgray", linetype = 2) +
geom_point(mapping = aes(x = Time, y = Shots, color = top_4),
size = 1, alpha = 0.5) +
theme_few() +
labs(x = "Time played in World Cup (minutes)",
color = "Team's final\nranking")
```
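Note that `geom_vline()` comes before `geom_point()` in the code above. `ggplot2` draws layers in the order they are added, so this puts the reference line underneath the points. A small sketch with made-up data (`toy_ref_data` is invented for illustration) shows both orderings:

```{r fig.width = 4, fig.height = 3}
library(ggplot2)

# Hypothetical toy data, invented for this illustration
toy_ref_data <- data.frame(Time = c(90, 270, 450), Shots = c(1, 3, 5))

# Reference line added first, so the points are drawn on top of it
line_under <- ggplot(toy_ref_data, aes(x = Time, y = Shots)) +
  geom_vline(xintercept = 90 * 3, color = "lightgray", linetype = 2) +
  geom_point(size = 3)

# Reference line added last, so it is drawn over the points
line_over <- ggplot(toy_ref_data, aes(x = Time, y = Shots)) +
  geom_point(size = 3) +
  geom_vline(xintercept = 90 * 3, color = "lightgray", linetype = 2)

line_under
line_over
```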
### Implementing plot guidelines #2
- Highlighting interesting data: Who had the most shots in the 2010 World Cup? Was he on a top-four team? Use `geom_text()` to label his point on the graph with his name (try out some different values of `hjust` and `vjust` in this function call to get the label in a place you like). At this point, the plot should look something like this:
```{r echo = FALSE, fig.width = 4, fig.height = 3}
top_player <- worldcup %>%
top_n(n = 1, wt = Shots)
worldcup %>%
mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany")) %>%
ggplot(aes(x = Time, y = Shots, color = top_4)) +
geom_vline(xintercept = 90 * 3, color = "gray", linetype = 2) +
geom_point(alpha = 0.5, size = 1) +
geom_text(data = top_player, aes(label = Player, color = NULL),
hjust = 1.2, vjust = 0.4) +
labs(x = "Time played in World Cup (minutes)",
color = "Team's final\nranking") +
theme_few()
```
- For labeling the player with the top number of shots, instead of only using the player's name, use the following format: "[Player's name], [Player's team]". (Hint: You may want to use `mutate` to add a new column, where you use `paste0` to paste together the player's name, `", "`, and the team name.) At this point, the plot should look something like this:
```{r echo = FALSE, fig.width = 4, fig.height = 3}
top_player <- worldcup %>%
top_n(n = 1, wt = Shots) %>%
mutate(label = paste0(Player, ", ", Team))
worldcup %>%
mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany")) %>%
ggplot(aes(x = Time, y = Shots, color = top_4)) +
geom_vline(xintercept = 90 * 3, color = "gray", linetype = 2) +
geom_point(alpha = 0.5, size = 1) +
geom_text(data = top_player,
aes(label = label, color = NULL),
hjust = 1.1, vjust = 0.4) +
labs(x = "Time played in World Cup (minutes)",
color = "Team's final\nranking") +
theme_few()
```
- Create small multiples. The relationship between time played and shots taken is probably different by the players' positions. Use faceting to create different graphs for each position. At this point, the plot should look something like this:
```{r echo = FALSE, fig.width = 8, fig.height = 2.5}
top_player <- worldcup %>%
top_n(n = 1, wt = Shots) %>%
mutate(label = paste0(Player, ", ", Team))
worldcup %>%
mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany")) %>%
ggplot(aes(x = Time, y = Shots, color = top_4)) +
geom_vline(xintercept = 90 * 3, color = "gray", linetype = 2) +
geom_point(alpha = 0.5, size = 1) +
geom_text(data = top_player,
aes(label = label, color = NULL),
hjust = 1.1, vjust = 0.4) +
labs(x = "Time played in World Cup (minutes)",
color = "Team's final\nranking") +
theme_few() +
facet_wrap(~ Position, ncol = 4)
```
- Make order meaningful: What order are the faceted graphs currently in? Offensive players have more chances to take shots than defensive players, so that might be a useful ordering for the facets. Re-order the `Position` factor column to go from nearest your own goal to nearest the opponent's goal, and then re-plot the graph from the previous step.
```{r echo = FALSE, fig.width = 8, fig.height = 2.5}
top_player <- worldcup %>%
top_n(n = 1, wt = Shots) %>%
mutate(label = paste0(Player, ", ", Team))
worldcup %>%
mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany"),
Position = factor(Position, levels = c("Goalkeeper", "Defender",
"Midfielder", "Forward"))) %>%
ggplot() +
geom_vline(xintercept = 90 * 3, color = "gray", linetype = 2) +
geom_point(aes(x = Time, y = Shots, color = top_4), alpha = 0.5, size = 1) +
geom_text(data = top_player,
aes(x = Time, y = Shots, label = label),
hjust = 1.1, vjust = 0.4) +
labs(x = "Time played in World Cup (minutes)",
color = "Team's final\nranking") +
theme_few() +
facet_wrap(~ Position, ncol = 4)
```
#### Example R code
To add a text label with just the player with the most shots, you'll want to create a new dataframe with just the top player. You can use the `top_n` function to do that (the `wt` option is specifying that we want the top player in terms of values in the `Shots` column):
```{r}
top_player <- worldcup %>%
top_n(n = 1, wt = Shots)
```
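In `dplyr` version 1.0.0 and later, `top_n()` is documented as superseded by `slice_max()`. As a sketch of the equivalent call, on a small made-up data frame (`toy_shot_counts` is invented for illustration):

```{r}
library(dplyr)

# Hypothetical toy data, invented for this illustration
toy_shot_counts <- data.frame(Player = c("Player A", "Player B", "Player C"),
                              Shots = c(3, 7, 5))

# Keep the row (or rows, if there are ties) with the largest value of Shots
toy_top_player <- toy_shot_counts %>%
  slice_max(Shots, n = 1)
toy_top_player
```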
Now you can use `geom_text()` to label this player's point on the graph with his name. You may need to mess around with some of the options in `geom_text()`, like `size`, `hjust`, and `vjust` (`hjust` and `vjust` say where, in relation to the point location, to put the label), to get something you're happy with.
```{r fig.width = 4, fig.height = 3}
worldcup %>%
mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany")) %>%
ggplot(aes(x = Time, y = Shots, color = top_4)) +
geom_vline(xintercept = 90 * 3, color = "gray", linetype = 2) +
geom_point(alpha = 0.5, size = 1) +
geom_text(data = top_player, aes(label = Player, color = NULL),
hjust = 1.2, vjust = 0.4) +
labs(x = "Time played in World Cup (minutes)",
color = "Team's final\nranking") +
theme_few()
```
If you want to put both the player's name and his team, you can add a `mutate()` function when you create the new dataframe with just the top player, and then use this for the label:
```{r fig.width = 4, fig.height = 3}
top_player <- worldcup %>%
top_n(n = 1, wt = Shots) %>%
mutate(label = paste0(Player, ", ", Team))
worldcup %>%
mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany")) %>%
ggplot(aes(x = Time, y = Shots, color = top_4)) +
geom_vline(xintercept = 90 * 3, color = "gray", linetype = 2) +
geom_point(alpha = 0.5, size = 1) +
geom_text(data = top_player,
aes(label = label, color = NULL),
hjust = 1.1, vjust = 0.4) +
labs(x = "Time played in World Cup (minutes)",
color = "Team's final\nranking") +
theme_few()
```
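The `paste0()` function works element by element, which is why it can be used inside `mutate()` to build a label for each row. A short sketch with made-up values (`toy_names` and `toy_team_names` are invented for illustration):

```{r}
# Hypothetical names and teams, invented for this illustration
toy_names <- c("Player A", "Player B")
toy_team_names <- c("Team X", "Team Y")

# The separator ", " is recycled and pasted between each pair of elements
toy_labels <- paste0(toy_names, ", ", toy_team_names)
toy_labels
```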
To create small multiples, use the `facet_wrap()` function (you'll probably want to set `ncol = 4` so the four panels sit side by side in one row):
```{r fig.width = 8, fig.height = 2.5}
top_player <- worldcup %>%
top_n(n = 1, wt = Shots) %>%
mutate(label = paste0(Player, ", ", Team))
worldcup %>%
mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany")) %>%
ggplot(aes(x = Time, y = Shots, color = top_4)) +
geom_vline(xintercept = 90 * 3, color = "gray", linetype = 2) +
geom_point(alpha = 0.5, size = 1) +
geom_text(data = top_player,
aes(label = label, color = NULL),
hjust = 1.1, vjust = 0.4) +
labs(x = "Time played in World Cup (minutes)",
color = "Team's final\nranking") +
theme_few() +
facet_wrap(~ Position, ncol = 4)
```