# Graphics {#Graphics}
Introduction {-#intro-Graphics}
------------
Graphics is a great strength of R. The `graphics` package is part of the
standard distribution and contains many useful functions for creating a
variety of graphic displays. The `ggplot2` package, part of the tidyverse, extends that base functionality and makes it easier to use. This chapter focuses on examples using `ggplot2` and occasionally suggests other packages. The See Also sections mention functions in other packages that do the same job in a different way; explore those alternatives if you
are dissatisfied with what `ggplot2` or base graphics offers.
Graphics is a vast subject, and we can only scratch the surface here.
Winston Chang's [*R Graphics Cookbook, 2nd Edition*](http://shop.oreilly.com/product/0636920063704.do) is part of the O'Reilly Cookbook series
and walks through many useful recipes with a focus on `ggplot2`.
If you want to delve deeper, we recommend *R Graphics* by Paul Murrell
(Chapman & Hall, 2006). *R Graphics* discusses the paradigms behind R
graphics, explains how to use the graphics functions, and contains
numerous examples—including the code to re-create them. Some of the
examples are pretty amazing.
### The Illustrations {-}
The graphs in this chapter are mostly plain and unadorned. We did that
intentionally. When you call the `ggplot` function, as in:
```{r message=FALSE}
library(tidyverse)
```
``` {r simpleplot, fig.cap='Simple plot'}
df <- data.frame(x = 1:5, y = 1:5)
ggplot(df, aes(x, y)) +
geom_point()
```
you get a plain, graphical representation of *x* and *y* as shown in Figure \@ref(fig:simpleplot).
You could adorn the
graph with colors, a title, labels, a legend, text, and so forth, but
then the call to `ggplot` becomes more and more crowded, obscuring the
basic intention.
``` {r complicatedplot, fig.cap='Slightly more complicated plot'}
ggplot(df, aes(x, y)) +
geom_point() +
labs(
title = "Simple Plot Example",
subtitle = "with a subtitle",
x = "x-values",
y = "y-values"
) +
theme(panel.background = element_rect(fill = "white", colour = "grey50"))
```
The resulting plot is shown in Figure \@ref(fig:complicatedplot). We want to keep the recipes clean, so we emphasize the basic plot and then show later (as in Recipe \@ref(recipe-id259), ["Adding a Title and Labels"](#recipe-id259)) how to add adornments.
### Notes on ggplot2 basics {-}
While the package is called `ggplot2`, the primary plotting function in the package is called `ggplot`. It is important to understand the basic pieces of a `ggplot2` graph. In the preceding examples, you can see that we pass data into `ggplot`, then define how the graph is created by stacking together small phrases that describe some aspect of the plot. This stacking together of phrases is part of the "grammar of graphics" ethos (that's where the `gg` comes from).
To learn more, you can read ["A Layered Grammar of Graphics"](http://vita.had.co.nz/papers/layered-grammar.pdf) written by `ggplot2` author Hadley Wickham.
The "grammar of graphics" concept originated with Leland Wilkinson, who articulated the idea of building graphics up from a set of primitives (i.e., verbs and nouns). With `ggplot`, the underlying data need not be fundamentally reshaped for each type of graphical representation. In general, the data stays the same and the user then changes syntax slightly to illustrate the data differently. This is significantly more consistent than base graphics, which often require reshaping the data in order to change the way it is visualized.
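As a minimal sketch of that idea, the same data and aesthetic mapping can be drawn three different ways by swapping only the geom. The data frame `sales` and its columns are invented for this illustration:

```r
library(ggplot2)

# Hypothetical data, invented for this illustration
sales <- data.frame(month = 1:6, units = c(12, 15, 11, 18, 21, 19))

# The data and the aesthetic mapping are defined once...
base <- ggplot(sales, aes(month, units))

# ...and only the geom changes between the three graphics
p_points <- base + geom_point()
p_line   <- base + geom_line()
p_bars   <- base + geom_col()
```

Printing any of the three objects (for example, `p_line`) draws that version; the data frame itself is never reshaped.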
As we talk about `ggplot` graphics, it's worth defining the components of a `ggplot` graph:
`geometric object functions`
: These are geometric objects that describe the type of graph being created. These start with `geom_` and examples include `geom_line`, `geom_boxplot`, and `geom_point`, along with dozens more.
`aesthetics`
: The aesthetics, or aesthetic mappings, communicate to `ggplot` which fields in the source data get mapped to which visual elements in the graphic. This is the `aes` line in a `ggplot` call.
`stats`
: Stats are statistical transformations that are done before displaying the data. Not all graphs will have stats, but a few common stats are `stat_ecdf` (the empirical cumulative distribution function) and `stat_identity`, which tells `ggplot` to pass the data without doing any stats at all.
`facet functions`
: Facets are subplots where each small plot represents a subgroup of the data. The faceting functions include `facet_wrap` and `facet_grid`.
`themes`
: Themes are the visual elements of the plot that are not tied to data. These might include titles, margins, legend positions, or font choices.
`layer`
: A layer is a combination of data, aesthetics, a geometric object, a stat, and other options to produce a visual layer in the `ggplot` graphic.
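To make these terms concrete, here is a rough sketch that uses each component once on the built-in `mtcars` dataset. The particular choices (a linear smoother, faceting by `cyl`, `theme_minimal`) are only for illustration:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = hp, y = mpg)) +  # data plus aesthetic mapping
  geom_point() +                             # geometric object: a layer of points
  stat_smooth(method = "lm",                 # stat: fits a linear model before drawing
              formula = y ~ x,
              se = FALSE) +
  facet_wrap(~ cyl) +                        # facet: one panel per cylinder count
  theme_minimal()                            # theme: styling not tied to the data
```

Each phrase joined by `+` contributes one piece; the point layer and the smoother layer are two separate layers drawn over the same data.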
### "Long" Versus "Wide" Data with ggplot {-}
One of the first sources of confusion for new `ggplot` users is that they are inclined to reshape their data to be "wide" before plotting it. "Wide" here means every variable they are plotting is its own column in the underlying data frame. This is an approach that many users develop while using Excel and then bring with them to R. `ggplot` works most easily with "long" data where additional variables are added as rows in the data frame rather than columns. The great side effect of adding more measurements as rows is that any properly constructed `ggplot` graphs will automatically update to reflect the new data without changing the `ggplot` code. If each additional variable were added as a column, then the plotting code would have to be changed to introduce additional variables. This idea of "long" versus "wide" data will become more obvious in the examples in the rest of this chapter.
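A small sketch of the difference, assuming the `tidyr` package (part of the tidyverse) is available. The data frame `wide` and its columns are made up for illustration:

```r
library(ggplot2)
library(tidyr)

# Hypothetical "wide" data: one column per measured series
wide <- data.frame(
  day   = 1:3,
  north = c(10, 12, 11),
  south = c(8, 9, 13)
)

# Reshape to "long": one row per (day, region) measurement
long <- pivot_longer(wide,
                     cols = c(north, south),
                     names_to = "region",
                     values_to = "temp")

# The plotting code never names a region column, so adding a region
# to the data adds a line to the plot without any code changes
p <- ggplot(long, aes(day, temp, color = region)) +
  geom_line()
```

If a third region were appended to `long` as extra rows, the same `ggplot` call would draw three lines.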
### Graphics in Other Packages {-}
R is highly programmable, and many people have extended its graphics
machinery with additional features. Quite often, packages include
specialized functions for plotting their results and objects. The `zoo`
package, for example, implements a time series object. If you create a
`zoo` object `z` and call `plot(z)`, then the `zoo` package does the
plotting; it creates a graphic that is customized for displaying a time
series. `zoo` uses base graphics so the resulting graph will not be a `ggplot` graphic.
There are even entire packages devoted to extending R with new graphics
paradigms. The `lattice` package is an alternative to base
graphics that predates `ggplot2`. It uses a powerful graphics paradigm that enables you to
create informative graphics more easily. It was implemented by Deepayan Sarkar, who also
wrote *Lattice: Multivariate Data Visualization with R* (Springer, 2008),
which explains the package and how to use it. The lattice package is
also described in [*R in a Nutshell*](http://oreilly.com/catalog/9780596801717) (O’Reilly).
There are two chapters in Hadley Wickham and Garrett Grolemund's excellent book *R for Data Science* that deal with graphics. Chapter 7, "Exploratory Data Analysis," focuses on exploring data with `ggplot2`, while Chapter 28, "Graphics for Communication," explores communicating to others with graphics. *R for Data Science* is available in a printed version from O'Reilly or [online](http://r4ds.had.co.nz/graphics-for-communication.html).
Creating a Scatter Plot {#recipe-id171}
-----------------------
### Problem {-#problem-id171}
You have paired observations: (*x*~1~, *y*~1~), (*x*~2~, *y*~2~), ...,
(*x~n~*, *y~n~*). You want to create a scatter plot of the pairs.
### Solution {-#solution-id171}
We can plot the data by calling `ggplot`, passing in the data frame, and invoking a geometric point function:
```{r, eval=FALSE}
ggplot(df, aes(x, y)) +
geom_point()
```
In this example, the data frame is called `df` and the *x* and *y* data are in fields named `x` and `y`, which we pass to the aesthetic in the call `aes(x, y)`.
### Discussion {-#discussion-id171}
A scatter plot is a common first attack on a new dataset. It’s a quick
way to see the relationship, if any, between *x* and *y*.
Plotting with `ggplot` requires telling `ggplot` what data frame to use, then what type of graph to create, and which aesthetic mapping (`aes`) to use. The `aes` in this case defines which field from `df` goes into which axis on the plot. Then the command `geom_point` communicates that you want a point graph, as opposed to a line or other type of graphic.
We can use the built-in `mtcars` dataset to illustrate plotting horsepower (`hp`) on the x-axis and fuel economy (`mpg`) on the y-axis:
```{r point-ex, fig.cap="Scatter plot example"}
ggplot(mtcars, aes(hp, mpg)) +
geom_point()
```
The resulting plot is shown in Figure \@ref(fig:point-ex).
### See Also {-#see_also-id171}
See Recipe \@ref(recipe-id259), ["Adding a Title and Labels"](#recipe-id259), for adding a title and labels; see Recipes \@ref(recipe-id261), ["Adding (or Removing) a Grid"](#recipe-id261), and \@ref(recipe-id260),
["Adding (or Removing) a Legend"](#recipe-id260), for adding a grid and a legend
(respectively). See Recipe \@ref(recipe-id184), ["Plotting All Variables Against All Other Variables"](#recipe-id184), for
plotting multiple variables.
Adding a Title and Labels {#recipe-id259}
-------------------------
### Problem {-#problem-id259}
You want to add a title to your plot or add labels for the axes.
### Solution {-#solution-id259}
With `ggplot` we add a `labs` element that controls the labels for the title and axes.
When calling `labs` in `ggplot`, specify:
`title`
: The desired title text
`x`
: x-axis label
`y`
: y-axis label
```{r eval=FALSE}
ggplot(df, aes(x, y)) +
geom_point() +
labs(title = "The Title",
x = "X-axis Label",
y = "Y-axis Label")
```
### Discussion {-#discussion-id259}
The graph created in Recipe \@ref(recipe-id171), ["Creating a Scatter Plot"](#recipe-id171), is quite plain. A title and better labels will make it more interesting and easier to interpret.
Note that in `ggplot` you build up the elements of the graph by connecting the parts with the plus sign, `+`. So we add further graphical elements by stringing together phrases. You can see this in the following code, which uses the built-in `mtcars` dataset and plots horsepower versus fuel economy in a scatter plot, shown in Figure \@ref(fig:car-plot):
```{r car-plot, fig.cap='Labeled axis and title'}
ggplot(mtcars, aes(hp, mpg)) +
geom_point() +
labs(title = "Cars: Horsepower vs. Fuel Economy",
x = "HP",
y = "Economy (miles per gallon)")
```
Adding (or Removing) a Grid {#recipe-id261}
-------------
### Problem {-#problem-id261}
You want to change the background grid of your graphic.
### Solution {-#solution-id261}
With `ggplot` background grids come as a default, as you have seen in other recipes. However, we can alter the background grid using the `theme` function or by applying a prepackaged theme to our graph.
We can use `theme` to alter the background panel of our graphic:
```{r whitebackground, fig.cap='White background'}
ggplot(df) +
geom_point(aes(x, y)) +
theme(panel.background = element_rect(fill = "white", colour = "grey50"))
```
### Discussion {-#discussion-id261}
`ggplot` fills in the background with a grey grid by default. So you may find yourself wanting to remove that grid completely or change it to something else. Let's create a `ggplot` graphic and then incrementally change the background style.
We can add or change aspects of our graphic by creating a `ggplot` object, then calling the object and using the `+` to add to it. The background shading in a `ggplot` graphic is actually three different graph elements:
`panel.grid.major`
: These are white by default and heavy.
`panel.grid.minor`
: These are white by default and light.
`panel.background`
: This is the background that is grey by default.
You can see these elements if you look carefully at the background of Figure \@ref(fig:car-plot).
If we set the background to `element_blank`, the major and minor gridlines are still there, but they are white on white, so we can't see them in Figure \@ref(fig:examplebackground):
```{r examplebackground, fig.cap = "Blank background"}
g1 <- ggplot(mtcars, aes(hp, mpg)) +
geom_point() +
labs(title = "Cars: Horsepower vs. Fuel Economy",
x = "HP",
y = "Economy (miles per gallon)") +
theme(panel.background = element_blank())
g1
```
Notice in the previous code we put the `ggplot` graph into a variable called `g1`. Then we printed the graphic by just calling `g1`. By having the graph inside of `g1`, we can then add further graphical components without rebuilding the graph.
But if we wanted to show the background grid with unusual patterns for illustration, it's as easy as setting its components to a color and setting a line type, which is shown in Figure \@ref(fig:majorgrid).
```{r majorgrid, fig.cap = "Major and minor gridlines" }
g2 <- g1 + theme(panel.grid.major =
element_line(color = "black", linetype = 3)) +
# linetype = 3 is dotted
theme(panel.grid.minor =
element_line(color = "darkgrey", linetype = 4))
# linetype = 4 is dot dash
g2
```
Figure \@ref(fig:majorgrid) lacks visual appeal, but you can clearly see that the dotted black lines make up the major grid and the dot-dashed dark grey lines make up the minor grid.
Or we could do something less garish and take the `ggplot` object `g1` from before and add grey gridlines to the white background, shown in Figure \@ref(fig:backgrids).
```{r backgrids, fig.cap='Grey major gridlines'}
g1 +
theme(panel.grid.major = element_line(colour = "grey"))
```
### See Also {-#see_also-id261}
See Recipe \@ref(recipe-theme), ["Applying a Theme to a ggplot Figure"](#recipe-theme), to see how to apply an entire canned theme to your figure.
Applying a Theme to a ggplot Figure {#recipe-theme}
------------------------------------------------------
### Problem {-#problem-theme}
You want your plot to use a preset collection of colors, styles, and formatting.
### Solution {-#solution-theme}
`ggplot` supports *themes*, which are collections of settings for your figures. To use one of the themes, just add the desired theme function to your `ggplot` with a `+`:
```{r, eval=FALSE}
ggplot(df, aes(x, y)) +
geom_point() +
theme_bw()
```
The `ggplot2` package contains the following themes:
```
theme_bw()
theme_dark()
theme_classic()
theme_gray()
theme_linedraw()
theme_light()
theme_minimal()
theme_test()
theme_void()
```
### Discussion {-#discussion-theme}
Let's start with a simple plot and then show how it looks with a few of the built-in themes. Figure \@ref(fig:startingfigure) shows a basic `ggplot` figure with no theme applied.
```{r, startingfigure, fig.cap = "Starting plot"}
p <- ggplot(mtcars, aes(x = disp, y = hp)) +
geom_point() +
labs(title = "mtcars: Displacement vs. Horsepower",
x = "Displacement (cubic inches)",
y = "Horsepower")
p
```
Let's create the same plot multiple times, but apply a different theme to each one:
`theme_bw`
:
(Figure \@ref(fig:themebw)):
```{r, themebw, fig.cap = '(ref:bw)'}
p + theme_bw()
```
`theme_classic`
:
(Figure \@ref(fig:themeclassic)):
```{r, themeclassic, fig.cap = "(ref:classic)"}
p + theme_classic()
```
`theme_minimal`
:
(Figure \@ref(fig:thememinimal)):
```{r, thememinimal, fig.cap = "(ref:minimal)"}
p + theme_minimal()
```
`theme_void`
:
(Figure \@ref(fig:themevoid)):
```{r, themevoid, fig.cap = "(ref:void)"}
p + theme_void()
```
In addition to the themes included in `ggplot2`, there are packages, like `ggthemes`, that include themes to help you make your figures look more like the figures found in popular tools and publications such as Stata or *The Economist*.
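As a hedged sketch of how such an add-on theme is applied, the following assumes the `ggthemes` package from CRAN is installed and uses its `theme_economist` and `theme_stata` functions. The `requireNamespace` guard is our own addition so the code does nothing if the package is missing:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = disp, y = hp)) +
  geom_point()

# ggthemes is a separate package; skip quietly if it is not installed
if (requireNamespace("ggthemes", quietly = TRUE)) {
  p_economist <- p + ggthemes::theme_economist()
  p_stata     <- p + ggthemes::theme_stata()
}
```

An add-on theme is added with `+` exactly like the built-in themes, so switching between them is a one-line change.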
### See Also {-#see_also-theme}
See Recipe \@ref(recipe-id261), ["Adding (or Removing) a Grid"](#recipe-id261), to see how to change a single theme element.
Creating a Scatter Plot of Multiple Groups {#recipe-id262}
------------------------------------------
### Problem {-#problem-id262}
You have data in a data frame with three variables per record: *x*, *y*, and a
factor *f* that indicates the group. You want to create a scatter
plot of *x* and *y* that distinguishes among the groups.
### Solution {-#solution-id262}
With `ggplot` we control the mapping of shapes to the factor `f` by passing `shape = f` to the `aes`.
```{r eval=FALSE}
ggplot(df, aes(x, y, shape = f)) +
geom_point()
```
### Discussion {-#discussion-id262}
Plotting multiple groups in one scatter plot creates an uninformative
mess unless we distinguish one group from another. We make this distinction in `ggplot` by setting the `shape` parameter of the `aes` function.
The built-in `iris` dataset contains paired measures of `Petal.Length` and
`Petal.Width`. Each measurement also has a `Species` property indicating
the species of the flower that was measured. If we plot all the data at
once, we just get the scatter plot shown in Figure \@ref(fig:irisnoshape):
```{r irisnoshape, fig.cap="iris: length vs. width"}
ggplot(data = iris,
aes(x = Petal.Length,
y = Petal.Width)) +
geom_point()
```
The graphic would be far more informative if we distinguished the points
by species. In addition to distinguishing species by shape, we could also differentiate by color. We can add `shape = Species` and `color = Species` to our `aes` call, to get each species with a different shape and color, shown in Figure \@ref(fig:irisshape).
```{r irisshape, fig.cap="iris: shape and color"}
ggplot(data = iris,
aes(
x = Petal.Length,
y = Petal.Width,
shape = Species,
color = Species
)) +
geom_point()
```
`ggplot` conveniently sets up a legend for you as well, which is handy.
### See Also {-#see_also-id262}
See Recipe \@ref(recipe-id260), ["Adding (or Removing) a Legend"](#recipe-id260), for more on how to add a legend.
Adding (or Removing) a Legend {#recipe-id260}
---------------
### Problem {-#problem-id260}
You want your plot to include a *legend*, the little box that decodes
the graphic for the viewer.
### Solution {-#solution-id260}
In most cases `ggplot` will add the legends automatically, as you can see in the previous recipe. If you do not have explicit grouping in the `aes`, then `ggplot` will not show a legend by default. If we want to force `ggplot` to show a legend, we can set the shape or line type of our graph to a constant. `ggplot` will then show a legend with one group. We then use `guides` to guide `ggplot` in how to label the legend.
This can be illustrated with our `iris` scatter plot:
```{r needslegend, fig.cap='Legend added'}
g <- ggplot(data = iris,
aes(x = Petal.Length,
y = Petal.Width,
shape="Observation")) +
geom_point() +
guides(shape=guide_legend(title="My Legend Title"))
g
```
Figure \@ref(fig:needslegend) illustrates the result of setting the shape to a string value and then relabeling the legend using `guides`.
More commonly, you may want to turn legends off, which you can do by setting `legend.position = "none"` in the `theme`. We can use the `iris` plot from the prior recipe and add the `theme` call as shown in Figure \@ref(fig:irisshapelegend):
```{r irisshapelegend, fig.cap="Legend removed"}
g <- ggplot(data = iris,
aes(
x = Petal.Length,
y = Petal.Width,
shape = Species,
color = Species
)) +
geom_point() +
theme(legend.position = "none")
g
```
### Discussion {-#discussion-id260}
Adding legends to `ggplot` when there is no grouping is an exercise in "tricking" `ggplot` into showing the legend by passing a string to a grouping parameter in `aes`. While this will not change the grouping (as there is only one group), it will result in a legend being shown with a name.
Then we can use `guides` to alter the legend title. It's worth noting that we are not changing anything about the data, just exploiting settings in order to coerce `ggplot` into showing a legend when it typically would not.
One of the huge benefits of `ggplot` is its very good defaults. It positions the legend and matches labels to their point types automatically, but you can override those choices if needed. To remove a legend entirely, we set `theme(legend.position = "none")`. In addition to `"none"` you can set `legend.position` to `"left"`, `"right"`, `"bottom"`, `"top"`, or a two-element numeric vector. Use a two-element numeric vector to pass `ggplot` specific coordinates for the legend; the values are between 0 and 1 for the *x* and *y* positions, respectively. (In `ggplot2` 3.5.0 and later, the numeric form is deprecated in favor of `legend.position = "inside"` combined with `legend.position.inside`.)
Figure \@ref(fig:irisshapelegend-moved) shows an example of a legend positioned at the bottom, created with this adjustment to the `legend.position`:
```{r irisshapelegend-moved, fig.cap='Legend on the bottom'}
g + theme(legend.position = "bottom")
```
Or we could use the two-element numeric vector to put the legend in a specific location, as in Figure \@ref(fig:irisshapelegend-moved2). The example puts the center of the legend at 80% to the right and 20% up from the bottom.
```{r irisshapelegend-moved2, fig.cap='Legend at a point'}
g + theme(legend.position = c(.8, .2))
```
In many aspects beyond legends, `ggplot` uses sane defaults but offers the flexibility to override them and tweak the details. You can find more details on `ggplot` options related to legends in the help for `theme` by typing `?theme` or looking in the `ggplot` [online reference material](http://ggplot2.tidyverse.org/reference/theme.html).
Plotting the Regression Line of a Scatter Plot {#recipe-id217}
----------------------------------------------
### Problem {-#problem-id217}
You are plotting pairs of data points, and you want to add a line that
illustrates their linear regression.
### Solution {-#solution-id217}
With `ggplot` there is no need to calculate the linear model first using the R `lm` function. We can instead use the `geom_smooth` function to calculate the linear regression inside of our `ggplot` call.
If our data is in a data frame `df` and the *x* and *y* data are in columns `x` and `y`, we plot the regression line like this:
```{r eval=FALSE}
ggplot(df, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm",
formula = y ~ x,
se = FALSE)
```
The `se = FALSE` parameter tells `ggplot` not to plot the confidence band around our regression line.
### Discussion {-#discussion-id217}
Suppose we are modeling the `strongx` dataset found in the `faraway` package. We can create a linear model using the built-in `lm` function in R. We can predict the variable `crossx` as a linear function of `energy`. First, let's look at a simple scatter plot of our data:
```{r strongx-scatter, fig.cap="strongx scatter plot"}
library(faraway)
data(strongx)
ggplot(strongx, aes(energy, crossx)) +
geom_point()
```
`ggplot` can calculate a linear model on the fly and then plot the regression line along with our data, as shown in Figure \@ref(fig:one-step):
```{r one-step, fig.cap='Simple linear model ggplot'}
g <- ggplot(strongx, aes(energy, crossx)) +
geom_point()
g + geom_smooth(method = "lm",
formula = y ~ x)
```
We can turn the confidence bands off by adding the `se = FALSE` option, as shown in Figure \@ref(fig:one-step-nose):
```{r one-step-nose, fig.cap='Simple linear model ggplot without se'}
g + geom_smooth(method = "lm",
formula = y ~ x,
se = FALSE)
```
Notice that in the `geom_smooth` call we use `x` and `y` rather than the variable names. `ggplot` has set the `x` and `y` inside the plot based on the aesthetic. Multiple smoothing methods are supported by `geom_smooth`. You can explore those, and other options, in the help by typing `?geom_smooth`.
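As one example of an alternative smoothing method, here is a sketch that swaps the straight line for a local regression (`method = "loess"`). It uses the built-in `mtcars` dataset rather than `strongx` so that it needs no extra packages:

```r
library(ggplot2)

p_loess <- ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  # method = "loess" fits a local regression instead of a straight line
  geom_smooth(method = "loess",
              formula = y ~ x,
              se = FALSE)
```

Only the `method` argument changes; the formula is still written in terms of `x` and `y`.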
If we had a line we wanted to plot that was stored in another R object, we could use `geom_abline` to plot the line on our graph. In the following example we pull the intercept term and the slope from the regression model `m` and add those to our graph in Figure \@ref(fig:slopeintercept):
```{r slopeintercept, fig.cap="Simple line from slope and intercept"}
m <- lm(crossx ~ energy, data = strongx)
ggplot(strongx, aes(energy, crossx)) +
geom_point() +
geom_abline(
intercept = m$coefficients[1],
slope = m$coefficients[2]
)
```
This produces a plot very similar to Figure \@ref(fig:one-step-nose). The `geom_abline` method can be handy if you are plotting a line from a source other than a simple linear model.
### See Also {-#see_also-id217}
See Chapter \@ref(LinearRegressionAndANOVA) for more about linear regression and the `lm` function.
Plotting All Variables Against All Other Variables {#recipe-id184}
--------------------------------------------------
### Problem {-#problem-id184}
Your dataset contains multiple numeric variables. You want to see
scatter plots for all pairs of variables.
### Solution {-#solution-id184}
`ggplot` does not have any built-in method to create pairs plots; however, the package `GGally` provides this functionality with the `ggpairs` function:
```{r eval=FALSE}
library(GGally)
ggpairs(df)
```
### Discussion {-#discussion-id184}
When you have a large number of variables, finding interrelationships
between them is difficult. One useful technique is looking at scatter
plots of all pairs of variables. This would be quite tedious if coded
pair-by-pair, but the `ggpairs` function from the package `GGally` provides an easy way to produce all those scatter
plots at once.
The `iris` dataset contains four numeric variables and one categorical
variable:
``` {r}
head(iris)
```
What is the relationship, if any, between the columns? Plotting
the columns with `ggpairs` produces multiple scatter plots, as seen in Figure \@ref(fig:ggpairsiris).
```{r ggpairsiris, fig.cap='ggpairs plot of iris data', warning=FALSE, message=FALSE}
library(GGally)
ggpairs(iris)
```
The `ggpairs` function is pretty, but not particularly fast. If you're just doing interactive work and want a quick peek at the data, the base R `plot` function provides faster output (see Figure \@ref(fig:basepairs)).
```{r basepairs, fig.cap='Base plot pairs plot'}
plot(iris)
```
While the `ggpairs` function is not as fast to plot as the base R `plot` function, it produces density graphs on the diagonal and reports correlations in the upper triangle of the graph. When factor or character columns are present, `ggpairs` produces histograms in the lower triangle of the graph and boxplots in the upper triangle. These additions help you understand relationships in your data.
Creating One Scatter Plot for Each Group {#recipe-id185}
-----------------------------------------------
### Problem {-#problem-id185}
Your dataset contains (at least) two numeric variables and a factor or character field defining a group. You
want to create several scatter plots for the numeric variables, with one
scatter plot for each level of the factor or character field.
### Solution {-#solution-id185}
We produce this kind of plot, called a _conditioning plot_, in `ggplot` by adding `facet_wrap` to our plot.
In this example we use the data frame `df`, which contains three columns: *x*, *y*, and *f*, with *f* being a factor (or a character string).
```{r, eval=FALSE}
ggplot(df, aes(x, y)) +
geom_point() +
facet_wrap( ~ f)
```
### Discussion {-#discussion-id185}
Conditioning plots (coplots) are another way to explore and illustrate the effect
of a factor or to compare different groups to each other.
The `Cars93` dataset contains 27 variables describing 93 car models as
of 1993. Two numeric variables are `MPG.city`, the miles per gallon in
the city, and `Horsepower`, the engine horsepower. One categorical
variable is `Origin`, which can be USA or non-USA according to where the
model was built.
Exploring the relationship between MPG and horsepower, we might ask: Is
there a different relationship for USA models and non-USA models?
Let's examine this as a facet plot:
```{r facet-cars, fig.cap='Cars data with facet'}
data(Cars93, package = "MASS")
ggplot(Cars93, aes(MPG.city, Horsepower)) +
geom_point() +
facet_wrap( ~ Origin)
```
The resulting plot in Figure \@ref(fig:facet-cars) reveals a few insights. If we really crave that 300-horsepower
monster, then we’ll have to buy a car built in the USA; but if we want
high MPG, we have more choices among non-USA models. These insights could be teased out of a statistical analysis, but the visual presentation reveals them much more quickly.
Note that `facet_wrap` by default gives every subplot the same x- and y-axis ranges. This helps ensure that visual inspection of the data is not misleading because of differing axis ranges.
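Shared axes are a default rather than a requirement. As a minimal sketch, assuming you have decided that each group should be judged on its own scale, the `scales` argument of `facet_wrap` can free one or both axes. The object name `p_free` is only illustrative:

```{r, eval=FALSE}
library(ggplot2)
data(Cars93, package = "MASS")

# scales = "free" lets each panel choose its own x and y ranges;
# "free_x" or "free_y" frees only one axis
p_free <- ggplot(Cars93, aes(MPG.city, Horsepower)) +
  geom_point() +
  facet_wrap(~ Origin, scales = "free")
p_free
```

Free scales make each panel easier to read on its own but harder to compare with its neighbors, so the fixed default is usually the safer choice when the goal is to compare groups.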
### See Also {-#see_also-id185}
The Base R graphics function `coplot` can accomplish very similar plots using only Base graphics.
Creating a Bar Chart {#recipe-id175}
--------------------
### Problem {-#problem-id175}
You want to create a bar chart.
### Solution {-#solution-id175}
A common situation is to have a column of data that represents a group and then another column that represents a measure about that group. This format is "long" data because the data runs vertically instead of having a column for each group.
Using the `geom_bar` function in `ggplot`, we can plot the heights as bars. If the data is already aggregated, we add `stat = "identity"` so that `ggplot` knows it needs to do no aggregation on the groups of values before plotting.
```{r, eval=FALSE}
ggplot(data = df, aes(x, y)) +
  geom_bar(stat = "identity")
```
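For data that is already aggregated, `geom_col` is a shorthand that draws bars at the heights given in the data, with no `stat = "identity"` needed. A minimal sketch, assuming a small made-up data frame (`sales` and its columns are hypothetical, not part of the recipe):

```{r, eval=FALSE}
library(ggplot2)

# hypothetical pre-aggregated data: one row per bar
sales <- data.frame(
  region = c("East", "North", "South"),
  total  = c(120, 95, 143)
)

# geom_col() uses the y values directly as the bar heights
p_col <- ggplot(sales, aes(region, total)) +
  geom_col()
p_col
```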
### Discussion {-#discussion-id175}
Let's use the cars made by Ford in the `Cars93` data in an example:
```{r fordcars, fig.cap="Ford cars bar chart"}
ford_cars <- Cars93 %>%
  filter(Manufacturer == "Ford")
ggplot(ford_cars, aes(Model, Horsepower)) +
  geom_bar(stat = "identity")
```
Figure \@ref(fig:fordcars) shows the resulting bar chart.
This example uses `stat = "identity"`, which assumes that the heights of your bars are conveniently stored as a value in one field, with only one record per bar. That is not always the case, however. Often you have a vector of numeric data and a
parallel factor or character field that groups the data, and you want to produce a bar
chart of the group means or the group totals.
Let's work up an example using the built-in `airquality` dataset, which contains daily temperature data for a single location for five months. The data frame has a numeric `Temp` column and `Month` and `Day` columns. If we want to plot the mean temp by month using `ggplot`, we don't need to precompute the mean; instead, we can have `ggplot` do that in the plot command logic. To tell `ggplot` to calculate the mean, we pass `stat = "summary", fun.y = "mean"` to the `geom_bar` command. (In `ggplot2` 3.3.0 and later, this argument is named `fun`, and `fun.y` is deprecated.) We can also turn the month numbers into month names using the built-in constant `month.abb`, which contains the three-letter abbreviations for the months.
```{r aq1, fig.cap='Bar chart: Temp by month'}
ggplot(airquality, aes(month.abb[Month], Temp)) +
  geom_bar(stat = "summary", fun.y = "mean") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)")
```
Figure \@ref(fig:aq1) shows the resulting plot. But you might notice the sort order on the months is alphabetical, which is not how we typically like to see months sorted.
We can fix the sorting issue using a few functions from `dplyr` combined with `fct_inorder` from the `forcats` tidyverse package. To get the months in the correct order, we can sort the data frame by `Month`, which is the month number. Then we can apply `fct_inorder`, which will arrange our factors in the order they appear in the data. You can see in Figure \@ref(fig:aq2) that the bars are now sorted properly.
```{r aq2, fig.cap='Bar chart properly sorted'}
library(forcats)
aq_data <- airquality %>%
  arrange(Month) %>%
  mutate(month_abb = fct_inorder(month.abb[Month]))
ggplot(aq_data, aes(month_abb, Temp)) +
  geom_bar(stat = "summary", fun.y = "mean") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)")
```
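If you would rather compute the summary yourself, for example to inspect the numbers before plotting them, the same chart can be built from pre-aggregated data. The following is a sketch of that alternative, not the approach used above. The names `aq_means` and `mean_temp` are our own, and the months are ordered with explicit factor levels instead of `fct_inorder`:

```{r, eval=FALSE}
library(dplyr)
library(ggplot2)

# one row per month, with the mean computed explicitly
aq_means <- airquality %>%
  group_by(Month) %>%
  summarize(mean_temp = mean(Temp)) %>%
  # month.abb is in calendar order, so using it as the levels sorts the bars
  mutate(month_abb = factor(month.abb[Month], levels = month.abb))

# the data is already aggregated, so the bars are drawn with stat = "identity"
ggplot(aq_means, aes(month_abb, mean_temp)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)")
```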
### See Also {-#see_also-id175}
See Recipe \@ref(recipe-id177), ["Adding Confidence Intervals to a Bar Chart"](#recipe-id177), for
adding confidence intervals and Recipe \@ref(recipe-id176), ["Coloring a Bar Chart"](#recipe-id176), for adding color.
See `?geom_bar` for help with bar charts in `ggplot`.
See `barplot` for Base R bar charts, or the `barchart` function in the `lattice` package.
Adding Confidence Intervals to a Bar Chart {#recipe-id177}
------------------------------------------
### Problem {-#problem-id177}
You want to augment a bar chart with confidence intervals.
### Solution {-#solution-id177}
Suppose you have a data frame `df` with columns `group`, which contains group names; `stat`, which is a column of statistics; and `lower` and `upper`, which represent the corresponding limits for the confidence intervals. We can display a bar chart of `stat` for each `group` and its confidence intervals using `geom_bar` combined with `geom_errorbar`.
```{r, include=FALSE}
# example data
x <- c(5, 10, 15, 20, 25)
df <- data.frame(
  group = letters[1:5],
  stat = x,
  lower = x * .85,
  upper = x * 1.20
)
```
```{r, confbars, fig.cap='Bar chart with confidence intervals'}
ggplot(df, aes(group, stat)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = .2)
```
Figure \@ref(fig:confbars) shows the resulting bar chart with confidence intervals.
### Discussion {-#discussion-id177}
Most bar charts display point estimates, which are shown by the heights
of the bars, but rarely do they include confidence intervals. Our inner
statisticians dislike this intensely. The point estimate is only
half of the story; the confidence interval gives the full story.
Fortunately, we can plot the error bars using `ggplot`. The hard part is calculating the intervals. In the previous example our data had a simple –15% and +20% interval. However, in
Recipe \@ref(recipe-id175), ["Creating a Bar Chart"](#recipe-id175), the bar heights were group means that `ggplot` calculated
for us. When we let `ggplot` do the calculations, we can use the built-in `mean_se` along with the `stat_summary` function to add error bars showing the standard errors of the means. By default, `mean_se` returns the mean plus and minus one standard error, which is narrower than a typical 95% confidence interval.
Let's use the `airquality` data we used previously. First we'll do the sorted factor procedure (from the prior recipe) to get the month names in the desired order:
```{r }
aq_data <- airquality %>%
  arrange(Month) %>%
  mutate(month_abb = fct_inorder(month.abb[Month]))
```
Now we can plot the bars along with the associated standard errors as in Figure \@ref(fig:airqual):
```{r airqual, fig.cap='Mean temp by month with error bars'}
ggplot(aq_data, aes(month_abb, Temp)) +
  geom_bar(stat = "summary",
           fun.y = "mean",
           fill = "cornflowerblue") +
  stat_summary(fun.data = mean_se, geom = "errorbar") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)")
```
Sometimes you'll want to sort the bars in your bar chart in descending order based on their height. This can be a little bit confusing when you're using summary stats in `ggplot`, but the secret is to use `mean` in the `reorder` statement to sort the factor by the mean of the temp, and to negate `Temp` so that the largest mean comes first. Note that the reference to `mean` in `reorder` is not quoted, while the reference to `mean` in `geom_bar` is quoted:
```{r airqual2, fig.cap='Mean temp by month in descending order'}
ggplot(aq_data, aes(reorder(month_abb, -Temp, mean), Temp)) +
  geom_bar(stat = "summary",
           fun.y = "mean",
           fill = "tomato") +
  stat_summary(fun.data = mean_se, geom = "errorbar") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)")
```
You may look at this example and the result in Figure \@ref(fig:airqual2) and wonder, "Why didn't they just use `reorder(month_abb, Month)` in the first example instead of that sorting business with `forcats::fct_inorder` to get the months in the right order?" Well, we could have. But sorting using `fct_inorder` is a design pattern that provides flexibility for more complicated things. Plus it's quite easy to read in a script. Using `reorder` inside the `aes` is denser and harder to read later. But either approach is reasonable.
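To draw confidence intervals proper rather than standard errors, the limits can be computed per group before plotting and then passed to `geom_errorbar`, as in the Solution. The following is a rough sketch, assuming a t-based 95% interval is what you want. The column names `n_obs`, `mean_temp`, `margin`, `lower`, and `upper` are our own choices:

```{r, eval=FALSE}
library(dplyr)
library(ggplot2)

aq_ci <- airquality %>%
  group_by(Month) %>%
  summarize(
    n_obs     = n(),
    mean_temp = mean(Temp),
    # half-width of a t-based 95% confidence interval for the mean
    margin    = qt(0.975, df = n_obs - 1) * sd(Temp) / sqrt(n_obs),
    lower     = mean_temp - margin,
    upper     = mean_temp + margin
  ) %>%
  mutate(month_abb = factor(month.abb[Month], levels = month.abb))

# the limits are precomputed, so both layers read them straight from the data
ggplot(aq_ci, aes(month_abb, mean_temp)) +
  geom_bar(stat = "identity", fill = "cornflowerblue") +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = .2) +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)")
```

For each month, these limits match the interval that `t.test` reports for that month's `Temp` values.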
### See Also {-#see_also-id177}
See Recipe \@ref(recipe-id123), ["Forming a Confidence Interval for a Mean"](#recipe-id123), for more about `t.test`.
Coloring a Bar Chart {#recipe-id176}
--------------------
### Problem {-#problem-id176}
You want to color or shade the bars of a bar chart.
### Solution {-#solution-id176}
With `ggplot` we add the `fill` argument to our `aes` and let `ggplot` pick the colors for us:
```{r, eval=FALSE}
ggplot(df, aes(x, y, fill = group)) +
  geom_bar(stat = "identity")
```
### Discussion {-#discussion-id176}
In `ggplot` we can use the `fill` parameter in `aes` to tell `ggplot` what field to base the colors on. If we pass a numeric field to `fill`, we will get a continuous gradient of colors; and if we pass a factor or character field to `fill`, we will get contrasting colors for each group. Here we pass the character name of each month to the `fill` parameter:
```{r colored-air, fig.cap='Colored monthly temp bar chart'}
aq_data <- airquality %>%
  arrange(Month) %>%
  mutate(month_abb = fct_inorder(month.abb[Month]))
ggplot(data = aq_data, aes(month_abb, Temp, fill = month_abb)) +
  geom_bar(stat = "summary", fun.y = "mean") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)") +
  scale_fill_brewer(palette = "Paired")
```
We define the colors in the resulting bar chart (Figure \@ref(fig:colored-air)) by calling `scale_fill_brewer(palette="Paired")`. The `"Paired"` color palette comes, along with many other color palettes, in the package `RColorBrewer`.
If we wanted to change the color of each bar based on the temperature, we couldn't just set `fill = Temp`, as might seem intuitive, because `ggplot` would not understand that we want the mean temperature after the grouping by month. The way around this is to access a special field inside of our graph called `..y..`, which is the calculated value on the y-axis. But we don't want the legend labeled `..y..`, so we add `fill = "Temp"` to our `labs` call in order to change the name of the legend. The result is shown in Figure \@ref(fig:barsshaded).
```{r barsshaded, fig.cap='Bar chart shaded by value'}
ggplot(airquality, aes(month.abb[Month], Temp, fill = ..y..)) +
  geom_bar(stat = "summary", fun.y = "mean") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)",
       fill = "Temp")
```
If we want to reverse the color scale, we can just add a negative sign, `-`, in front of the field we are filling by: `fill= -..y..`, for example.
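In newer releases of `ggplot2` (3.3.0 and later), the double-dot notation has a more explicit replacement, `after_stat()`, and the summary function is passed as `fun` rather than `fun.y`. Assuming such a version is installed, the shaded chart could be sketched like this (`p_shaded` is just an illustrative name):

```{r, eval=FALSE}
library(ggplot2)

# after_stat(y) refers to the y value computed by the summary stat,
# i.e., the mean Temp for each month
p_shaded <- ggplot(airquality,
                   aes(month.abb[Month], Temp, fill = after_stat(y))) +
  geom_bar(stat = "summary", fun = "mean") +
  labs(title = "Mean Temp by Month",
       x = "",
       y = "Temp (deg. F)",
       fill = "Temp")
p_shaded
```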
### See Also {-#see_also-id176}
See Recipe \@ref(recipe-id175), ["Creating a Bar Chart"](#recipe-id175), for creating a bar chart.
Plotting a Line from x and y Points {#recipe-id174}
-----------------------------------
### Problem {-#problem-id174}
You have paired observations in a data frame: (*x*~1~, *y*~1~), (*x*~2~, *y*~2~), ...,
(*x*~*n*~, *y*~*n*~). You want to plot a series of line segments that
connect the data points.
### Solution {-#solution-id174}
With `ggplot` we can use `geom_point` to plot the points:
```{r, eval=FALSE}
ggplot(df, aes(x, y)) +
  geom_point()
```
Since `ggplot` graphics are built up element by element, we can easily have both points and a line in the same graphic by using two geoms. Adding `geom_line` draws the line segments that connect the points:
```{r eval=FALSE}
ggplot(df, aes(x, y)) +
  geom_point() +
  geom_line()
```
### Discussion {-#discussion-id174}
To illustrate, let's look at some example US economic data that comes with `ggplot2`. This example data frame has a column called `date`, which we'll plot on the x-axis, and a field `unemploy`, which is the number of unemployed people in thousands.
```{r linechart, fig.cap="Line chart example"}
ggplot(economics, aes(date, unemploy)) +
  geom_point() +
  geom_line()
```
Figure \@ref(fig:linechart) shows the resulting chart, which contains both lines and points because we used both geoms.
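One detail worth knowing: `geom_line` connects the observations in order of their x values, whereas `geom_path` connects them in the order they appear in the data frame. With a time series such as `economics` the two usually coincide, but with unsorted data they do not. A small sketch with a made-up data frame (`pts` is hypothetical):

```{r, eval=FALSE}
library(ggplot2)

# hypothetical points whose rows are not sorted by x
pts <- data.frame(x = c(3, 1, 2), y = c(9, 1, 4))

# geom_line() draws segments from left to right along x
p_line <- ggplot(pts, aes(x, y)) +
  geom_point() +
  geom_line()

# geom_path() draws segments in row order: (3, 9) -> (1, 1) -> (2, 4)
p_path <- ggplot(pts, aes(x, y)) +
  geom_point() +
  geom_path()
```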
### See Also {-#see_also-id174}
See Recipe \@ref(recipe-id171), ["Creating a Scatter Plot"](#recipe-id171).
Changing the Type, Width, or Color of a Line {#recipe-id256}
--------------------------------------------
### Problem {-#problem-id256}
You are plotting a line. You want to change the type, width, or color of
the line.
### Solution {-#solution-id256}
`ggplot` uses the `linetype` parameter for controlling the appearance of lines:
- `linetype="solid"` or `linetype=1` (default)
- `linetype="dashed"` or `linetype=2`
- `linetype="dotted"` or `linetype=3`
- `linetype="dotdash"` or `linetype=4`
- `linetype="longdash"` or `linetype=5`
- `linetype="twodash"` or `linetype=6`
- `linetype="blank"` or `linetype=0` (inhibits drawing)
You can change the line characteristics by passing `linetype`, `col`, and/or `size` as parameters to `geom_line`.
For example, to make the line dashed, red, and heavy, we could pass all three to `geom_line`:
```{r eval=FALSE}
ggplot(df, aes(x, y)) +
  geom_line(linetype = 2,
            size = 2,
            col = "red")
```
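Note that `ggplot2` 3.4.0 and later use `linewidth`, rather than `size`, for the width of lines, and `size` for lines is deprecated there. Assuming one of those versions, the same dashed, red, heavy line could be sketched as follows, using a made-up data frame `line_df` and the line type's name instead of its number:

```{r, eval=FALSE}
library(ggplot2)

# hypothetical data to draw
line_df <- data.frame(x = 1:10, y = (1:10)^2)

# linewidth replaces size for lines in ggplot2 >= 3.4.0
p_dashed <- ggplot(line_df, aes(x, y)) +
  geom_line(linetype = "dashed",
            linewidth = 2,
            colour = "red")
p_dashed
```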
### Discussion {-#discussion-id256}
The example syntax shows how to draw one line and specify its style, width, or color. A common scenario involves drawing multiple lines, each with its own style, width, or color.
Let's set up some example data:
```{r}
x <- 1:10
y1 <- x^1.5
y2 <- x^2
y3 <- x^2.5
df <- data.frame(x, y1, y2, y3)
```
In `ggplot` this can be a conundrum for many users. The challenge is that `ggplot` works best with "long" data instead of "wide" data,
as was mentioned in the Introduction to this chapter.
Our example data frame has four columns of wide data:
``` {r}
head(df, 3)
```
We can make our wide data long by using the `gather` function from the core tidyverse package `tidyr`. In this example, we use `gather` to create a new column named `bucket` that holds our former column names and a new column named `y` that holds their values, while keeping our `x` variable as it is.
```{r}
df_long <- gather(df, bucket, y, -x)
head(df_long, 3)
tail(df_long, 3)
```
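In `tidyr` 1.0.0 and later, `gather` is superseded by `pivot_longer`, which does the same reshaping with named arguments. Here is a sketch of the equivalent call, assuming that version or newer is installed. The data frame is rebuilt under the hypothetical names `wide` and `long` so the example stands alone:

```{r, eval=FALSE}
library(tidyr)

# same shape as the example data: one x column and three y columns
wide <- data.frame(
  x  = 1:10,
  y1 = (1:10)^1.5,
  y2 = (1:10)^2,
  y3 = (1:10)^2.5
)

# every column except x is stacked: names go to `bucket`, values go to `y`
long <- pivot_longer(wide, cols = -x, names_to = "bucket", values_to = "y")
head(long, 3)
```

The rows come out in a different order than with `gather` (all three buckets for the first `x`, then the second, and so on), which makes no difference to the plot.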
Now we can pass `bucket` to the `col` parameter and get multiple lines, each a different color:
``` {r, multiplelines, fig.cap="Multiple line chart"}
ggplot(df_long, aes(x, y, col = bucket)) +
  geom_line()
```
Figure \@ref(fig:multiplelines) shows the resulting graph with each variable represented in a different color.
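To give each line its own style as well as its own color, we can map `linetype` to the same variable and, if we want to choose the values ourselves, add manual scales. The following sketch rebuilds the long data so that it stands alone. The name `lines_long` and the particular colors and line types are arbitrary choices of ours:

```{r, eval=FALSE}
library(ggplot2)

# long-format data: one row per (x, bucket) pair
lines_long <- data.frame(
  x      = rep(1:10, times = 3),
  bucket = rep(c("y1", "y2", "y3"), each = 10),
  y      = c((1:10)^1.5, (1:10)^2, (1:10)^2.5)
)

# map both color and line type to bucket, then pick the values by name
p_styles <- ggplot(lines_long, aes(x, y, col = bucket, linetype = bucket)) +
  geom_line() +
  scale_colour_manual(values = c(y1 = "black", y2 = "red", y3 = "blue")) +
  scale_linetype_manual(values = c(y1 = "solid", y2 = "dashed", y3 = "dotted"))
p_styles
```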
It's also straightforward to make the line weight vary with the data by mapping a numerical variable to `size` inside `aes`:
```{r thickness, fig.cap='Thickness as a function of x'}
ggplot(df, aes(x, y1, size = y2)) +
  geom_line() +
  scale_size(name = "Thickness based on y2")
```
The result of varying the thickness with `y2` is shown in Figure \@ref(fig:thickness).
### See Also {-#see_also-id256}
See Recipe \@ref(recipe-id174), ["Plotting a Line from x and y Points"](#recipe-id174), for plotting a
basic line.
Plotting Multiple Datasets {#recipe-id263}
--------------------------
### Problem {-#problem-id263}
You want to show multiple datasets in one plot.
### Solution {-#solution-id263}
We can add multiple data frames to a `ggplot` figure by creating an empty plot and then adding two different geoms to the plot: