# Exploring data #1
The video lectures for this chapter are embedded at relevant places in the text,
with links to download a pdf of the associated slides for each video.
You can also access [a full playlist for the videos for this chapter](https://www.youtube.com/playlist?list=PLuGPtwgRXxqJLQ2klnpaDuFiBkhbC3Ovk).
## Objectives
After this chapter, you should (know / understand / be able to):
- Be able to load and use datasets from R packages
- Be able to describe and use logical vectors
- Understand how logical vectors record whether each element of another R vector meets a logical statement, and that R stores `TRUE` / `FALSE` values as 1 / 0 at a deeper level
- Be able to use the `dplyr` function `mutate` to create a logical vector as a new column in a dataframe and the `dplyr` function `filter` with that new column to filter a dataframe to a subset of rows
- Be able to use the bang operator (!) to reverse a logical vector
- Know what the "tidyverse" is and name some of its packages
- Be able to use some simple statistical functions (e.g., `min`, `max`, `mean`, `median`, `cor`, `summary`), including how to handle missing values when using these
- Be able to use the `dplyr` function `summarize` to summarize data, with and without grouping using `group_by`, including with special functions `n`, `n_distinct`, `first`, and `last`
- Understand the three basic elements of `ggplot` plots: data, aesthetics, and geoms
- Be able to create a `ggplot` object, set its data using `data = ...` and its aesthetics using `mapping = aes(...)`, and add on layers (including `geoms`) with `+`
- Be able to create some basic plots (e.g., scatterplots, boxplots, histograms) using `ggplot2` functions
- Understand the difference between setting an aesthetic by mapping it to a column of the dataframe versus setting it to a constant value
- Understand the difference between "statistical" geoms (e.g., histograms, boxplots) and geoms that add one geom element per dataframe observation (row)
<iframe width="768" height="480" src="https://www.youtube.com/embed/ntsCRNizqw4?list=PLuGPtwgRXxqJLQ2klnpaDuFiBkhbC3Ovk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_1.pdf)
a pdf of the lecture slides for this video.
## Simple statistics functions
### Summary statistics
<iframe width="768" height="480" src="https://www.youtube.com/embed/Y5G9nYQr4c8?list=PLuGPtwgRXxqJLQ2klnpaDuFiBkhbC3Ovk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_2.pdf)
a pdf of the lecture slides for this video.
To explore your data, you'll need to be able to calculate some simple statistics for vectors, including calculating the mean and range of continuous variables and counting the number of values in each category of a factor or logical vector.
Here are some simple statistics functions you will likely use often:
Function | Description
--------- | -----------------
`range()` | Range (minimum and maximum) of vector
`min()`, `max()` | Minimum or maximum of vector
`mean()`, `median()` | Mean or median of vector
`sd()` | Standard deviation of vector
`table()` | Number of observations per level for a factor vector
`cor()` | Determine correlation(s) between two or more vectors
`summary()` | Summary statistics, depends on class
All of these functions take, as the main argument, the vector or vectors for which you want the statistic. If there are missing values in the vector, you'll typically need to add an argument to say what to do with the missing values. The parameter name for this varies by function, but for many of these functions it's `na.rm = TRUE` or `use="complete.obs"`.
```{r echo = FALSE}
library(tidyverse)
library(faraway)
data("worldcup")
```
```{r}
mean(nepali$wt, na.rm = TRUE)
range(nepali$ht, na.rm = TRUE)
sd(nepali$ht, na.rm = TRUE)
table(nepali$sex)
```
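As a small sketch of why the missing-value argument matters, the following uses a made-up vector (`toy_wts` is hypothetical, not part of the `nepali` data) with one missing value. Without `na.rm = TRUE`, `mean` returns `NA`; with it, the missing value is dropped before the mean is calculated:

```{r}
# Hypothetical toy vector with one missing value
toy_wts <- c(12.5, NA, 10.2, 11.3)

mean(toy_wts)                # NA, because one value is missing
mean(toy_wts, na.rm = TRUE)  # mean of the three non-missing values
```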
Most of these functions take a single vector as the input. The `cor` function, however, calculates the correlation between vectors and so takes two or more vectors. If you give it more than two vectors (for example, as the columns of a dataframe), it will give the correlation matrix for all the vectors.
```{r}
cor(nepali$wt, nepali$ht, use = "complete.obs")
cor((nepali %>% select(wt, ht, age)), use = "complete.obs")
```
R supports object-oriented programming. Your first taste of this shows up with the `summary` function. For the `summary` function, R does not run the same code every time. Instead, R first checks what type of object was input to `summary`, and then it runs a function (*method*) specific to that type of object. For example, if you input a continuous vector, like the `wt` column in `nepali`, to `summary`, the function will return the mean, median, range, and 25th and 75th percentile values:
```{r}
summary(nepali$wt)
```
However, if you submit a factor vector, like the `sex` column in `nepali`, the `summary` function will return a count of how many elements of the vector are in each factor level (as a note, you could do the same thing with the `table` function):
```{r}
summary(nepali$sex)
```
The `summary` function can also input other data structures, including dataframes, lists, and special object types, like regression model objects. In each case, it performs different actions specific to the object type. Later in this section, we'll cover regression models, and see what the `summary` function returns when it is used with regression model objects.
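To see this idea on a smaller scale, here is a sketch with two made-up vectors (`toy_hts` and `toy_sexes` are hypothetical names). Checking the class of each object with `class` shows what `summary` is responding to when it picks which method to run:

```{r}
# Hypothetical toy vectors: one numeric, one factor
toy_hts <- c(91.2, 84.0, NA, 102.5)
toy_sexes <- factor(c("Male", "Female", "Female", "Male"))

class(toy_hts)
class(toy_sexes)

summary(toy_hts)    # numeric method: quartiles, mean, and count of NAs
summary(toy_sexes)  # factor method: count per level
```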
### `summarize` function
You will often want to use these functions in conjunction with the `summarize` function in `dplyr`. For example, to create a new dataframe with the mean weight of children in the `nepali` dataset, you can use `mean` inside a `summarize` function:
```{r}
library(dplyr)
nepali %>%
  summarize(mean_wt = mean(wt, na.rm = TRUE))
```
There are also some special functions that are particularly useful with `summarize` and other `dplyr` functions. For example, the `n` function will calculate the number of observations and the `first` function will return the first value of a column:
```{r}
nepali %>%
  summarize(n_children = n(),
            first_id = first(id))
```
See the "summary function" section of the [RStudio Data Wrangling cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) for more examples of these special functions.
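Two other special functions, `n_distinct` and `last`, work the same way. The sketch below uses a small made-up dataframe (`toy_visits`, a hypothetical example with repeated ids) so that the number of rows and the number of distinct ids differ:

```{r}
library(dplyr)

# Hypothetical toy data: child "a" was measured twice
toy_visits <- data.frame(id = c("a", "a", "b", "c"),
                         wt = c(10.2, 10.9, 12.5, 11.3))

visit_summary <- toy_visits %>%
  summarize(n_rows = n(),              # number of rows
            n_ids = n_distinct(id),    # number of unique ids
            first_id = first(id),      # id in the first row
            last_id = last(id))        # id in the last row
visit_summary
```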
Often, you will be more interested in summaries within certain groupings of your data, rather than overall summaries. For example, you may be interested in mean height and weight by sex, rather than across all children, for the `nepali` data. It is very easy to calculate these grouped summaries using `dplyr`---you just need to group data using the `group_by` function (also a `dplyr` function) before you run the `summarize` function:
```{r}
nepali %>%
  group_by(sex) %>%
  summarize(mean_wt = mean(wt, na.rm = TRUE),
            n_children = n(),
            first_id = first(id))
```
```{block, type = "rmdnote"}
Don't forget that you need to save the output to a new object if you want to use it later. The above code, which creates a dataframe with summaries for Nepali children by sex, will only be printed out to your console if run as-is. If you'd like to save this output as an object to use later (for example, for a plot or table), you need to assign it to an R object.
```
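As a sketch of what that assignment looks like, the following saves a grouped summary to a new object and then uses that object in a later step. The dataframe `toy_children` and the object name `wt_by_sex` are made up for this illustration:

```{r}
library(dplyr)

# Hypothetical toy data standing in for the `nepali` dataframe
toy_children <- data.frame(sex = c("Male", "Female", "Female", "Male"),
                           wt = c(12.5, NA, 10.2, 11.3))

# Assign the summary to an object instead of only printing it
wt_by_sex <- toy_children %>%
  group_by(sex) %>%
  summarize(mean_wt = mean(wt, na.rm = TRUE),
            n_children = n())

# The saved object can now be reused, for example to pull out one column
wt_by_sex$mean_wt
```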
## Factor vectors
<iframe width="768" height="480" src="https://www.youtube.com/embed/o7rqBnvpYjU?list=PLuGPtwgRXxqJLQ2klnpaDuFiBkhbC3Ovk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_3.pdf)
a pdf of the lecture slides for this video.
## Data from a package
<iframe width="768" height="480" src="https://www.youtube.com/embed/o7rqBnvpYjU?list=PLuGPtwgRXxqJLQ2klnpaDuFiBkhbC3Ovk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_4.pdf)
a pdf of the lecture slides for this video.
So far we've covered two ways to get data into R:
1. From flat files (either on your computer or online)
2. From binary file formats like SAS and Excel.
Many R packages come with their own data, which is very easy to load and use. For example, the `faraway` package, which complements Julian Faraway's book *Linear Models with R*, has a dataset called `worldcup` that I'll use for some examples and that you'll use for part of this week's in-course exercise. To load this dataset, first load the package that contains it (`faraway`) and then call the `data()` function with the dataset name (`"worldcup"`) as its argument:
```{r}
library(faraway)
data("worldcup")
```
Unlike most data objects you'll work with, datasets that are part of an R package will often have their own help files. You can access this help file for a dataset using the `?` operator with the dataset's name:
```{r, eval = FALSE}
?worldcup
```
This help file will usually include information about the size of the dataset, as well as definitions for each of the columns.
To get a list of all of the datasets that are available in the packages you currently have loaded, run `data()` without any arguments inside the parentheses:
```{r, eval = FALSE}
data()
```
```{block, type = "rmdnote"}
If you run the `library` function without any arguments---`library()`---it works in a similar way. R will open a list of all the R packages that you have installed on your computer and can open with a `library` call.
```
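Returning to `data()`: you can also limit the listing to a single package with its `package` argument. The sketch below uses the `datasets` package, which ships with R, so that it runs without installing anything; you could swap in `"faraway"` if you have that package installed. The object name `pkg_data` is made up for this example:

```{r}
# List the datasets in one package (here, base R's `datasets` package)
pkg_data <- data(package = "datasets")

# The listing is stored in the `results` element; show the first few entries
head(pkg_data$results[, c("Item", "Title")])
```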
For this chapter, we'll be working with a modified version of the `nepali` dataset from the `faraway` package. This gives data from a study of the health of a group of Nepalese children. Each observation is a single measurement for a child; there can be multiple observations per child. We'll use a modified version of this dataframe that limits it to the columns with the child's id, sex, weight, height, and age, and limited to each child's first measurement. To create this modified dataset, run the following code:
```{r}
library(dplyr)
library(faraway)
data(nepali)
nepali <- nepali %>%
  # Limit to certain columns
  select(id, sex, wt, ht, age) %>%
  # Convert id and sex to factors
  mutate(id = factor(id),
         sex = factor(sex, levels = c(1, 2),
                      labels = c("Male", "Female"))) %>%
  # Limit to first obs. per child
  distinct(id, .keep_all = TRUE)
```
The first few rows of the data should now look like:
```{r}
nepali %>%
  slice(1:4)
```
## Dates in R
<iframe width="768" height="480" src="https://www.youtube.com/embed/1ksBUmcXP0g?list=PLuGPtwgRXxqJLQ2klnpaDuFiBkhbC3Ovk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_5.pdf)
a pdf of the lecture slides for this video.
As part of the data cleaning process, you may want to change the class of some
of the columns in the dataframe. For example, you may want to change a column
from a character to a date.
Here are some of the most common vector classes in R:
Class | Example
------------ | -----------------------------------
`character` | "Chemistry", "Physics", "Mathematics"
`numeric` | 10, 20, 30, 40
`factor` | Male [underlying number: 1], Female [2]
`Date` | "2010-01-01" [underlying number: 14,610]
`logical` | TRUE, FALSE
To find out the class of a vector (including a column in a dataframe -- remember
each column can be thought of as a vector), you can use `class()`:
```{r}
class(daily_show$date)
```
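As a quick sketch, you can check each of the classes in the table above by running `class` on small made-up vectors. The last line shows the number that R stores underneath a `Date` (the number of days since January 1, 1970):

```{r}
class(c("Chemistry", "Physics", "Mathematics"))
class(c(10, 20, 30, 40))
class(factor(c("Male", "Female")))
class(as.Date("2010-01-01"))
class(c(TRUE, FALSE))

# Underlying number for the date in the table above
as.numeric(as.Date("2010-01-01"))
```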
It is especially common to need to convert dates during the data cleaning
process, since date columns will usually be read into R as characters or
factors---you can do some interesting things with vectors that are in a Date
class that you cannot do with a vector in a character class.
If you'd like to use only base R, you can convert a vector to the `Date` class
with the `as.Date` function, which you will often see in older R code. However,
I recommend that in your own code you instead use the `lubridate` package,
which the rest of this section covers.
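As a brief sketch of the base R approach, `as.Date` takes a `format` argument that spells out how the date elements are arranged in the character string. The string and object name here are made up for illustration; `%m`, `%d`, and `%y` stand for month, day, and two-digit year:

```{r}
# Hypothetical example: month/day/two-digit-year, separated by slashes
toy_date <- as.Date("12/31/99", format = "%m/%d/%y")
toy_date
class(toy_date)
```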
To convert a vector to the `Date` class, you can use functions in the
`lubridate` package. This package has a series of functions named for the order
in which date elements appear in the incoming character string.
For example, in "12/31/99", the date elements are given in the order of month
(**m**), day (**d**), year (**y**), so this character string could be converted
to the date class with the function `mdy`. As another example, the `ymd`
function from lubridate can be used to parse a column into a Date class,
regardless of the original format of the date, as long as the date elements are
in the order: year, month, day. For example:
```{r message = FALSE}
library("lubridate")
ymd("2008-10-13")
ymd("'08 Oct 13")
```
To convert the `date` column in the `daily_show` data into a Date
class, then, you can run:
```{r}
library(package = "lubridate")
class(x = daily_show$date) # Check the class of the 'date' column before mutating it
daily_show <- mutate(.data = daily_show,
                     date = mdy(date))
head(x = daily_show, n = 3)
class(x = daily_show$date) # Check the class of the 'date' column after mutating it
```
Once you have an object in the `Date` class, you can do things like plot by
date, calculate the range of dates, and calculate the total number of days the
dataset covers:
```{r eval = FALSE}
range(daily_show$date)
diff(x = range(daily_show$date))
```
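The same calculations can be sketched on a few made-up dates, so you can check the answer by hand. The vector `toy_show_dates` is hypothetical; its earliest and latest dates are 21 days apart:

```{r}
library(lubridate)

# Hypothetical toy dates, given in month/day/year order
toy_show_dates <- mdy(c("1/11/99", "1/12/99", "2/1/99"))

range(toy_show_dates)        # earliest and latest date
diff(range(toy_show_dates))  # time difference between them, in days
```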
We could also have done this conversion when first reading in and cleaning the `daily_show` data, by including `mdy` in a `mutate` step of the following pipe chain:
```{r message = FALSE}
daily_show <- read_csv(file = "data/daily_show_guests.csv",
                       skip = 4) %>%
  rename(job = GoogleKnowlege_Occupation,
         date = Show,
         category = Group,
         guest_name = Raw_Guest_List) %>%
  select(-YEAR) %>%
  mutate(date = mdy(date)) %>%
  filter(category == "Science")
head(x = daily_show, n = 2)
```
The `lubridate` package also includes functions to pull out certain elements of a date, including:
- `wday`
- `mday`
- `yday`
- `month`
- `quarter`
- `year`
For example, we could use `wday` to create a new column with the weekday of each show:
```{r}
mutate(.data = daily_show,
       show_day = wday(x = date, label = TRUE)) %>%
  select(date, show_day, guest_name) %>%
  slice(1:5)
```
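To get a feel for the other functions in this list, here is a sketch that applies each of them to a single made-up date (`toy_date` is a hypothetical object name):

```{r}
library(lubridate)

toy_date <- ymd("2008-10-13")

wday(toy_date, label = TRUE)  # day of the week, as a labeled factor
mday(toy_date)                # day of the month
yday(toy_date)                # day of the year
month(toy_date)               # month, as a number
quarter(toy_date)             # quarter of the year
year(toy_date)                # year
```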
```{block, type = 'rmdwarning'}
By default, R functions tend to use either the time zone of **YOUR** computer's
operating system or UTC (GMT). You need to be careful when working with dates
and times to either specify the time zone or convince yourself the default
behavior works for your application.
```
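As a sketch of what specifying the time zone can look like, the `lubridate` function `ymd_hms` parses a date-time and uses UTC unless you give it a `tz` argument. The date-time and the choice of `"America/Denver"` below are made up for illustration:

```{r}
library(lubridate)

Sys.timezone()  # time zone of your computer's operating system

# Parsed as UTC by default
toy_time_utc <- ymd_hms("2008-10-13 08:00:00")
tz(toy_time_utc)

# Same clock time, but explicitly in a named time zone
toy_time_local <- ymd_hms("2008-10-13 08:00:00", tz = "America/Denver")
tz(toy_time_local)
```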
## Logical vectors
<iframe width="576" height="360" src="https://www.youtube.com/embed/2t8gDG8croo" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_6.pdf)
a pdf of the lecture slides for this video.
Last week, you learned a lot about logical statements and how to use them with the `filter` function from the `dplyr` package. You can also use logical vectors, created with these logical statements, for a lot of other things. For example, you can use them directly in the square bracket indexing (`[..., ...]`) to pull out just the rows of a dataframe that meet a certain condition. For using logical statements in either context, it is helpful to understand a bit more about logical vectors.
When you run a logical statement on a vector, you create a logical vector the same length as the original vector:
```{r}
length(nepali$sex)
length(nepali$sex == "Male")
```
The logical vector (`nepali$sex == "Male"` in this example) will have the value `TRUE` at any position where the original vector (`nepali$sex` in this example) met the logical condition you tested, and `FALSE` anywhere else:
```{r}
head(nepali$sex)
head(nepali$sex == "Male")
```
You can "flip" this logical vector (i.e., change every `TRUE` to `FALSE` and vice-versa) using the *bang operator*, `!`:
```{r}
is_male <- nepali$sex == "Male" # Save this logical vector as the object named `is_male`
head(is_male)
head(!is_male)
```
The bang operator turns out to be very useful. You will often find cases where it's difficult to write a logical vector to get what you want, but fairly easy to write the inverse (find everything you don't want). One example is filtering down to non-missing values---the `is.na` function will return `TRUE` for any value that is `NA`, so you can use `!is.na()` to identify any non-missing values.
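As a small sketch of this pattern, using a made-up vector with one missing value (`toy_wts` is a hypothetical name):

```{r}
toy_wts <- c(12.5, NA, 10.2, 11.3)

is.na(toy_wts)             # TRUE only where the value is missing
!is.na(toy_wts)            # flipped: TRUE only where the value is not missing
toy_wts[!is.na(toy_wts)]   # keep just the non-missing values
```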
You can do a few cool things with a logical vector. For example, you can use it inside a `filter` function to pull out just the rows of a dataframe where `is_male` is `TRUE`:
```{r}
nepali %>%
  filter(is_male) %>%
  head()
```
Or, with `!`, just the rows where `is_male` is `FALSE`:
```{r}
nepali %>%
  filter(!is_male) %>%
  head()
```
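The same subsetting can be sketched with square bracket indexing, mentioned at the start of this section. Here the logical vector goes in the row position (before the comma). The dataframe `toy_children` and the vector `toy_is_male` are made up for this illustration:

```{r}
# Hypothetical toy data
toy_children <- data.frame(id = 1:4,
                           sex = c("Male", "Female", "Female", "Male"))
toy_is_male <- toy_children$sex == "Male"

toy_children[toy_is_male, ]   # rows where `toy_is_male` is TRUE
toy_children[!toy_is_male, ]  # rows where `toy_is_male` is FALSE
```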
You can also use `sum()` and `table()` with a logical vector to find out how many of the values in the vector are `TRUE` and `FALSE`. You can use `sum` because R saves logical vectors at a basic level as `0` for `FALSE` and `1` for `TRUE`. Therefore, if you add up all the values in a logical vector, you're adding up the number of observations with the value `TRUE`.
In the example, you can use these functions to find out how many males and females are in the dataset:
```{r}
sum(is_male)
sum(!is_male)
table(is_male)
```
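You can see the underlying 0 / 1 values directly by converting a logical vector to numeric. This sketch uses a made-up vector (`toy_flags`); because of the 0 / 1 storage, `mean` gives the proportion of values that are `TRUE`:

```{r}
toy_flags <- c(TRUE, FALSE, TRUE, TRUE)

as.numeric(toy_flags)  # the 0 / 1 values behind the logical vector
sum(toy_flags)         # number of TRUE values
mean(toy_flags)        # proportion of TRUE values
```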
Note that you could also achieve the same thing with `dplyr` functions. For example, you could use `mutate` with a logical statement to create an `is_male` column in the `nepali` dataframe, then group by the new `is_male` column and count the number of observations in each group using `count`:
```{r}
library(dplyr)
nepali %>%
  mutate(is_male = sex == "Male") %>%
  group_by(is_male) %>%
  count()
```
See the section on the `summarize` function earlier in this chapter for more on using `summarize`, including with data that has been grouped with `group_by`.
<iframe width="576" height="360" src="https://www.youtube.com/embed/0_EpZQKWsow?list=PLuGPtwgRXxqJLQ2klnpaDuFiBkhbC3Ovk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_7.pdf)
a pdf of the lecture slides for this video.
## Plots to explore data
<iframe width="576" height="360" src="https://www.youtube.com/embed/2E0MlcsfBmg" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_8.pdf)
a pdf of the lecture slides for this video.
Exploratory data analysis is a key step in data analysis and plotting your data in different ways is an important part of this process. In this section, I will focus on the basics of `ggplot2` plotting, to get you started creating some plots to explore your data.
This section will focus on making **useful**, rather than **attractive** graphs, since at this stage we are focusing on exploring data for yourself rather than presenting results to others. Next week, I will explain more about how you can customize ggplot objects, to help you make plots to communicate with others.
All of the plots we'll make today will use the `ggplot2` package (another member of the tidyverse!). If you don't already have that installed, you'll need to install it. You then need to load the package in your current session of R:
```{r}
# install.packages("ggplot2") ## Uncomment and run if you don't have `ggplot2` installed
library(ggplot2)
```
The process of creating a plot using `ggplot2` follows conventions that are a bit different than most of the code you've seen so far in R (although it is somewhat similar to the idea of piping I introduced in the last chapter). The basic steps behind creating a plot with `ggplot2` are:
1. Create an object of the `ggplot` class, typically specifying the **data** and some or all of the **aesthetics**;
2. Add on **geoms** and other elements to create and customize the plot, using `+`.
You can add on one or many geoms and other elements to create plots that range from very simple to very customized. This week, we'll focus on simple geoms and added elements, and then explore more detailed customization next week.
```{block type = "rmdwarning"}
If R gets to the end of a line and nothing indicates that the call continues (e.g., `%>%` for piping or `+` for `ggplot2` plots), R interprets that as a signal to run the call without reading in further code. A common error when writing `ggplot2` code is to put the `+` that adds a geom or element at the beginning of a line rather than at the end of the previous line. In that case, R will try to execute the call too soon. To avoid errors, be sure to end lines with `+` rather than starting lines with it.
```
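As a sketch of the difference, using a made-up dataframe (`toy_plot_data`), the first call below ends the continuing line with `+` and so is read as one call. The problematic layout is shown only in comments, since it would not run cleanly:

```{r}
library(ggplot2)

# Hypothetical toy data
toy_plot_data <- data.frame(ht = c(84, 91, 102),
                            wt = c(10.2, 11.3, 12.5))

# Works: the first line ends with `+`, so R keeps reading
good_plot <- ggplot(toy_plot_data, aes(x = ht, y = wt)) +
  geom_point()

# Does not work as intended: R runs the first line on its own,
# then raises an error at the line that starts with `+`
# ggplot(toy_plot_data, aes(x = ht, y = wt))
#   + geom_point()
```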
### Initializing a ggplot object
The first step in creating a plot using `ggplot2` is to create a ggplot object. This object will not, by itself, create a plot with anything in it. Instead, it typically specifies the data frame you want to use and which aesthetics will be mapped to certain columns of that data frame (aesthetics are explained more in the next subsection).
Use the following conventions to initialize a ggplot object:
```{r, eval = FALSE}
## Generic code
object <- ggplot(dataframe, aes(x = column_1, y = column_2))
```
The data frame is the first parameter in a `ggplot` call and, if you like, you can name that parameter explicitly in the call (e.g., `data = dataframe`). Aesthetics are defined within an `aes` function call, which is typically placed inside the `ggplot` call.
```{block type = "rmdnote"}
While the `ggplot` call is the place where you will most often see an `aes` call, `aes` can also be used within the calls to add specific geoms. This can be particularly useful if you want to map aesthetics differently for different geoms in your plot. We'll see some examples of this use of `aes` more in later sections, when we talk about customizing plots. The `data` parameter can also be used in geom calls, to use a different data frame from the one defined when creating the original ggplot object, although this tends to be less common.
```
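As a sketch of these two options, both of the following calls build the same scatterplot from a made-up dataframe (`toy_plot_data`). The first names the `data` and `mapping` parameters in the `ggplot` call; the second puts the `aes` call inside the geom instead:

```{r}
library(ggplot2)

# Hypothetical toy data
toy_plot_data <- data.frame(ht = c(84, 91, 102),
                            wt = c(10.2, 11.3, 12.5))

# Aesthetics set when initializing the ggplot object
plot_a <- ggplot(data = toy_plot_data, mapping = aes(x = ht, y = wt)) +
  geom_point()

# Aesthetics set inside the geom call instead
plot_b <- ggplot(data = toy_plot_data) +
  geom_point(mapping = aes(x = ht, y = wt))
```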
### Plot aesthetics
**Aesthetics** are properties of the plot that can show certain elements of the data. For example, in Figure \@ref(fig:aesmapex), color shows (is mapped to) gender, x-position shows height, and y-position shows weight in a sample data set of measurements of children in Nepal.
```{r aesmapex, echo = FALSE, warning = FALSE, fig.width = 6, fig.height = 4, fig.align = "center", message = FALSE, fig.cap = "Example of how different properties of a plot can show different elements to the data. Here, color indicates gender, position along the x-axis shows height, and position along the y-axis shows weight. This example is a subset of data from the `nepali` dataset in the `faraway` package."}
library(dplyr)
data("nepali")
nepali %>%
  as_tibble() %>%   # `tbl_df()` is deprecated in current versions of `dplyr`
  distinct(id, .keep_all = TRUE) %>%
  mutate(sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female"))) %>%
  ggplot(aes(x = ht, y = wt, color = sex)) +
  geom_point() +
  xlab("Height (cm)") + ylab("Weight (kg)")
```
```{block type = "rmdnote"}
Any of these aesthetics could also be given a constant value, instead of being mapped to an element of the data. For example, all the points could be red, instead of showing gender.
```
Which aesthetics are required for a plot depends on which geoms (more on those in a second) you're adding to the plot. You can find out the aesthetics you can use for a geom in the "Aesthetics" section of the geom's help file (e.g., `?geom_point`). Required aesthetics are in bold in this section of the help file and optional ones are not. Common plot aesthetics you might want to specify include:
```{r echo = FALSE}
aes_vals <- data.frame(aes = c("`x`", "`y`", "`shape`",
                               "`color`", "`fill`", "`size`",
                               "`alpha`", "`linetype`"),
                       desc = c("Position on x-axis",
                                "Position on y-axis",
                                "Shape",
                                "Color of border of elements",
                                "Color of inside of elements",
                                "Size",
                                "Transparency (1: opaque; 0: transparent)",
                                "Type of line (e.g., solid, dashed)"))
knitr::kable(aes_vals, col.names = c("Code", "Description"))
```
### Adding geoms
Next, you'll want to add one or more `geoms` to create the plot. You can add these with `+` after the `ggplot` statement to initialize the ggplot object. Some of the most common geoms are:
```{r echo = FALSE}
plot_funcs <- data.frame(type = c("Histogram (1 numeric variable)",
                                  "Scatterplot (2 numeric variables)",
                                  "Boxplot (1 numeric variable, possibly 1 factor variable)",
                                  "Line graph (2 numeric variables)"),
                         ggplot2_func = c("`geom_histogram`",
                                          "`geom_point`",
                                          "`geom_boxplot`",
                                          "`geom_line`"))
knitr::kable(plot_funcs, col.names = c("Plot type",
                                       "ggplot2 function"))
```
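As a sketch of adding more than one geom, the following layers `geom_line` and `geom_point` on the same ggplot object, so each observation gets a point and the points are connected by a line. The dataframe `toy_growth` is made up for this example:

```{r}
library(ggplot2)

# Hypothetical toy data: height at four ages
toy_growth <- data.frame(age = c(12, 24, 36, 48),
                         ht = c(74, 85, 93, 100))

growth_plot <- ggplot(toy_growth, aes(x = age, y = ht)) +
  geom_line() +    # first layer: line connecting the observations
  geom_point()     # second layer: one point per observation
growth_plot
```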
### Constant aesthetics
<iframe width="576" height="360" src="https://www.youtube.com/embed/qiTPGzqYiOI" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_9.pdf)
a pdf of the lecture slides for this video.
Instead of mapping an aesthetic to an element of your data, you can use a constant value for it. For example, you may want to make all the points green, rather than having color map to gender:
```{r echo = FALSE, warning = FALSE, fig.align = "center", out.width = "0.6\\textwidth", message = FALSE, fig.width = 5, fig.height = 3}
nepali %>%
  as_tibble() %>%   # `tbl_df()` is deprecated in current versions of `dplyr`
  distinct(id, .keep_all = TRUE) %>%
  mutate(sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female"))) %>%
  ggplot(aes(x = ht, y = wt)) +
  geom_point(color = "darkgreen") +
  xlab("Height (cm)") + ylab("Weight (kg)")
```
In this case, you'll define that aesthetic when you add the geom, outside of an `aes` statement. In R, you can specify the shape of points with a number. Figure \@ref(fig:shapeexamples) shows the shapes that correspond to the numbers 1 to 25 in the `shape` aesthetic. This figure also provides an example of the difference between color (black for all these example points) and fill (red for these examples). You can see that some point shapes include a fill (21 for example), while some are either empty (1) or solid (19).
```{r shapeexamples, echo = FALSE, fig.width = 5, fig.height = 3, fig.align = "center", fig.cap = "Examples of the shapes corresponding to different numeric choices for the `shape` aesthetic. For all examples, `color` is set to black and `fill` to red."}
x <- rep(1:5, 5)
y <- rep(1:5, each = 5)
shape <- 1:25
to_plot <- tibble(x = x, y = y, shape = shape)  # `data_frame()` is deprecated
ggplot(to_plot, aes(x = x, y = y)) +
  geom_point(shape = shape, size = 4, color = "black", fill = "red") +
  geom_text(label = shape, nudge_x = -0.25) +
  xlim(c(0.5, 5.5)) +
  theme_void() +
  scale_y_reverse()
```
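To make the difference between mapped and constant aesthetics concrete, here is a sketch with a made-up dataframe (`toy_children`). In the first plot, color is mapped to a column inside `aes`; in the second, shape, size, color, and fill are all set to constant values in the geom call, outside of `aes`:

```{r}
library(ggplot2)

# Hypothetical toy data
toy_children <- data.frame(ht = c(84, 91, 102, 95),
                           wt = c(10.2, 11.3, 12.5, 11.9),
                           sex = c("Male", "Female", "Female", "Male"))

# Color mapped to the `sex` column (inside `aes`)
mapped_plot <- ggplot(toy_children, aes(x = ht, y = wt, color = sex)) +
  geom_point()

# Constant aesthetics (outside `aes`): shape 21 has both a border and a fill
constant_plot <- ggplot(toy_children, aes(x = ht, y = wt)) +
  geom_point(shape = 21, size = 3, color = "black", fill = "red")
```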
If you want to set color to be a constant value, you can do that in R using character strings for different colors. Figure \@ref(fig:colorexamples) gives an example of some of the different blues available in R. To find links to listings of different R colors, google "R colors" and search by "Images".
```{r colorexamples, echo = FALSE, fig.width = 5, fig.height = 3, fig.align = "center", fig.cap = "Example of available shades of blue in R."}
x <- rep(0, 6)
y <- 1:6
color <- c("blue", "blue4", "darkorchid", "deepskyblue2",
"steelblue1", "dodgerblue3")
to_plot <- tibble(x = x, y = y, color = color)  # `data_frame()` is deprecated
ggplot(to_plot, aes(x = x, y = y)) +
  geom_point(color = color, size = 2) +
  geom_text(label = color, hjust = 0, nudge_x = 0.05) +
  theme_void() +
  xlim(c(-1, 1.5)) +
  scale_y_reverse()
```
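As a minimal sketch of setting constant aesthetics, the following code uses a small made-up data frame (`toy_points` and `toy_plot` are hypothetical names, not part of the course data). It sets `shape`, `color`, and `fill` inside `geom_point` but outside of `aes`. Shape 21 is one of the shapes that takes both an outline color and a fill.

```{r}
library(ggplot2)

# Hypothetical example data, made up for illustration
toy_points <- data.frame(x = c(1, 2, 3), y = c(2, 4, 3))

# Constant aesthetics go in the geom call, outside of `aes`
toy_plot <- ggplot(toy_points, aes(x = x, y = y)) +
  geom_point(shape = 21, size = 4, color = "black", fill = "red")
toy_plot
```

Because these values are set outside of `aes`, every point gets the same shape, outline, and fill, and no legend is added.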
### Useful plot additions
There are also a number of elements that you can add onto a `ggplot` object using `+`. A few that are used very frequently are:
```{r echo = FALSE}
plot_adds <- data.frame(add = c("`ggtitle`",
"`xlab`, `ylab`",
"`xlim`, `ylim`"),
descrip = c("Plot title",
"x- and y-axis labels",
"Limits of x- and y-axis"))
knitr::kable(plot_adds, col.names = c("Element", "Description"))
```
### Example dataset
For the example plots, I'll use a dataset in the `faraway` package called `nepali`. This gives data from a study of the health of a group of Nepalese children.
```{r}
library(faraway)
data(nepali)
```
I'll be using functions from `dplyr` and `ggplot2`, so those need to be loaded:
```{r message = FALSE, warning = FALSE}
library(dplyr)
library(ggplot2)
```
Each observation is a single measurement for a child; there can be multiple observations per child. I used the following code to select only the columns for child id, sex, weight, height, and age. I also used `distinct` to limit the dataset to only include one measurement for each child, the child's first measurement in the dataset.
```{r message = FALSE}
nepali <- nepali %>%
select(id, sex, wt, ht, age) %>%
mutate(id = factor(id),
sex = factor(sex, levels = c(1, 2),
labels = c("Male", "Female"))) %>%
distinct(id, .keep_all = TRUE)
```
After this cleaning, the data looks like this:
```{r}
head(nepali)
```
### Histograms
<iframe width="576" height="360" src="https://www.youtube.com/embed/qz5SmXkOj_k?list=PLuGPtwgRXxqJLQ2klnpaDuFiBkhbC3Ovk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week3_part_10.pdf)
a pdf of the lecture slides for this video.
Histograms show the distribution of a single variable. Therefore, `geom_histogram()` requires only one main aesthetic, `x`, the (numeric) vector for which you want to create a histogram. For example, to create a histogram of children's heights for the Nepali dataset (Figure \@ref(fig:nepalihist1)), run:
```{r, nepalihist1, fig.width = 4, fig.height = 3, message = FALSE, warning = FALSE, fig.align = "center", fig.cap = "Basic example of plotting a histogram with `ggplot2`. This histogram shows the distribution of heights for the first recorded measurements of each child in the `nepali` dataset."}
ggplot(nepali, aes(x = ht)) +
geom_histogram()
```
```{block type = "rmdnote"}
If you run the code with no arguments for `binwidth` or `bins` in `geom_histogram`, you will get a message saying "`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.". This message is just saying that a default number of bins was used to create the histogram. You can use arguments to change the number of bins used, but often this default is fine. You may also get a message that observations with missing values were removed.
```
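If you'd like to know ahead of time how many observations will be dropped because of missing values, one option is to count them yourself before plotting. This is a sketch using a made-up vector (`toy_heights` is hypothetical); the same idea should apply to a column like `nepali$ht`.

```{r}
# Hypothetical heights (cm), including two missing values
toy_heights <- c(85.2, 91.0, NA, 102.5, 97.3, NA, 88.8)

# `is.na` returns TRUE for each missing value; `sum` counts the TRUEs
n_missing <- sum(is.na(toy_heights))
n_missing
```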
You can add some elements to the histogram now to customize it a bit. For example (Figure \@ref(fig:nepalihist2)), you can add a figure title (`ggtitle`) and clearer labels for the x-axis (`xlab`). You can also change the range of values shown by the x-axis (`xlim`).
```{r, nepalihist2, fig.width = 4, fig.height = 3, fig.align = "center", message = FALSE, warning = FALSE, fig.cap = "Example of adding ggplot elements to customize a histogram."}
ggplot(nepali, aes(x = ht)) +
geom_histogram(fill = "lightblue", color = "black") +
ggtitle("Height of children") +
xlab("Height (cm)") + xlim(c(0, 120))
```
The geom `geom_histogram` also has special arguments for setting the number or width of the bins used in the histogram. Figure \@ref(fig:nepalihist3) shows an example of how you can use the `bins` argument to change the number of bins that are used to make the histogram of height for the `nepali` dataset.
```{r, nepalihist3, fig.width = 4, fig.height = 3, fig.align = "center", warning = FALSE, message = FALSE, fig.cap = "Example of using the `bins` argument to change the number of bins used in a histogram."}
ggplot(nepali, aes(x = ht)) +
geom_histogram(fill = "lightblue", color = "black",
bins = 40)
```
Similarly, the `binwidth` argument can be used to set the width of bins. Figure \@ref(fig:nepalihist4) shows an example of using this argument to create a histogram of the Nepali children's heights with binwidths of 10 centimeters (note that this argument is set in the same units as the x variable).
```{r, nepalihist4, fig.width = 4, fig.height = 3, fig.align = "center", warning = FALSE, message = FALSE, fig.cap = "Example of using the `binwidth` argument to set the width of each bin used in a histogram."}
ggplot(nepali, aes(x = ht)) +
geom_histogram(fill = "lightblue", color = "black",
binwidth = 10)
```
### Scatterplots
A scatterplot shows how one variable changes as another changes. You can use the `geom_point` geom to create a scatterplot. For example, to create a scatterplot of weight versus height for the Nepali data (Figure \@ref(fig:nepaliscatter1)), you can run the following code:
```{r nepaliscatter1, fig.width = 5, fig.height = 4, warning = FALSE, fig.align = "center", fig.cap = "Example of creating a scatterplot. This scatterplot shows the relationship between children's heights and weights within the nepali dataset."}
ggplot(nepali, aes(x = ht, y = wt)) +
geom_point()
```
Again, you can use some of the options and additions to change the plot appearance. For example, to add a title, change the x- and y-axis labels, and change the color and size of the points on the scatterplot (Figure \@ref(fig:nepaliscatter2)), you can run:
```{r nepaliscatter2, fig.width = 5, fig.height = 4, fig.align = "center", message = FALSE, warning = FALSE, fig.cap = "Example of adding ggplot elements to customize a scatterplot."}
ggplot(nepali, aes(x = ht, y = wt)) +
geom_point(color = "blue", size = 0.5) +
ggtitle("Weight versus Height") +
xlab("Height (cm)") + ylab("Weight (kg)")
```
You can also try mapping another variable in the dataset to the `color` aesthetic. For example, to use color to show the sex of each child in the scatterplot (Figure \@ref(fig:nepaliscatter3)), you can run:
```{r nepaliscatter3, fig.width = 5, fig.height = 4, fig.align = "center", message = FALSE, warning = FALSE, fig.cap = "Example of mapping color to an element of the data in a scatterplot."}
ggplot(nepali, aes(x = ht, y = wt, color = sex)) +
geom_point(size = 0.5) +
ggtitle("Weight versus Height") +
xlab("Height (cm)") + ylab("Weight (kg)")
```
### Boxplots
Boxplots can be used to show the distribution of a continuous variable. To create a boxplot, you can use the `geom_boxplot` geom. To plot a boxplot for a single, continuous variable, you can map that variable to `y` in the `aes` call, and map `x` to the constant `1`. For example, to create a boxplot of the heights of children in the Nepali dataset (Figure \@ref(fig:nepaliboxplot1)), you can run:
```{r nepaliboxplot1, fig.height = 4, fig.width = 4, warning = FALSE, fig.align="center", fig.cap = "Example of creating a boxplot. The example shows the distribution of height data for children in the nepali dataset."}
ggplot(nepali, aes(x = 1, y = ht)) +
geom_boxplot() +
xlab("")+ ylab("Height (cm)")
```
You can also create separate boxplots, one for each level of a factor (Figure \@ref(fig:nepaliboxplot2)). In this case, you'll need to include two aesthetics (`x` and `y`) when you initialize the ggplot object. The `y` variable is the variable for which the distribution will be shown, and the `x` variable should be a discrete (categorical or TRUE/FALSE) variable, and will be used to group the variable. This `x` variable should also be specified as the grouping variable, using `group` within the aesthetic call.
```{r nepaliboxplot2, fig.height = 4, fig.width = 5, fig.align = "center", warning = FALSE, fig.cap = "Example of creating separate boxplots, divided by a categorical grouping variable in the data."}
ggplot(nepali, aes(x = sex, y = ht, group = sex)) +
geom_boxplot() +
xlab("Sex")+ ylab("Height (cm)")
```
## In-course Exercise Chapter 3
### Loading data from an R package
Pick one person to start sharing their screen.
The data we'll be using today is from a dataset called `worldcup` in the package
`faraway`. Load that data so you can use it on your computer (note: you will
need to install and load the `faraway` package to do this). Use the help file
for the data to find out more about the dataset. Use some basic functions, like
`head`, `tail`, `slice`, `colnames`, `str`, and `summary` to check out the data a bit
(if some of these you haven't seen before, remember you can always check their
helpfiles!). See if you can figure out:
- What variables are included in this dataset? (Check the column names.)
- What class is each column currently? In particular, which are numbers and
which are factors?
#### Example R code:
Load the `faraway` package using `library()` and then load the data using `data()`:
```{r}
## Uncomment the next line if you need to install the package
# install.packages("faraway")
library(faraway)
data("worldcup")
```
Check out the help file for the `worldcup` dataset to find out more about the
data. (Note: Only datasets that are parts of packages will have help files.)
```{r, eval = FALSE}
?worldcup
```
Check out the data a bit:
```{r}
str(worldcup)
head(worldcup)
tail(worldcup)
colnames(worldcup)
summary(worldcup)
```
### Exploring the data using simple statistics and `summarize`
Rotate to someone else to share their screen.
Then, try checking out the data using some basic commands for simple statistics,
like `mean()`, `range()`, `max()`, and `min()`, as well as the `summarize` and
`group_by` functions from the `dplyr` package. Try to answer the following
questions:
- What is the mean number of saves that players made?
- What is the mean number of saves just among the goalkeepers?
- Did players from any position other than goalkeeper make a save?
- How many players were there in each position?
- How many forwards were there on each team? Which team had the most shots in total among all its forwards?
- Which team(s) had the defender with the most tackles?
If you have extra time, continue using the ["Data Wrangling"
cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
to come up with some other ideas for how you can explore this data, and write up
and test code to do that.
#### Example R code:
To calculate the mean number of saves among all the players, use the `mean`
function, either by itself or within a `summarize` call:
```{r}
mean(worldcup$Saves)
worldcup %>%
summarize(mean_saves = mean(Saves))
```
There are a few ways to figure out the mean number of saves just among the
goalkeepers. One way is to filter the dataset to only goalies and then use
`summarize` to calculate the mean number of saves in this filtered subset of the
data:
```{r}
worldcup %>%
filter(Position == "Goalkeeper") %>%
summarize(mean_saves = mean(Saves))
```
The next question is if players from any position other than goalkeeper made a
save. One way to figure this out is to group the data by position and then
summarize the maximum number of saves. Based on this, it looks like there were
no saves from players in any position except goalie:
```{r}
worldcup %>%
group_by(Position) %>%
summarize(max_saves = max(Saves))
```
To figure out how many players there were in each position, you can group
the data by position and then use the `count` function from `dplyr` to count the
number of observations in each group:
```{r}
worldcup %>%
group_by(Position) %>%
count()
```
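As a side note, `count` can also take the grouping column directly, which gives the same result as grouping first and then counting the rows in each group. Here is a minimal sketch with made-up data (`toy_players` and its `position` column are hypothetical stand-ins for the `worldcup` data):

```{r}
library(dplyr)

# Hypothetical player data, made up for illustration
toy_players <- data.frame(
  position = c("Forward", "Defender", "Forward",
               "Goalkeeper", "Defender", "Forward")
)

# Group by position, then count the rows in each group
counts_grouped <- toy_players %>%
  group_by(position) %>%
  summarize(n = n())

# Shortcut: pass the grouping column straight to `count`
counts_direct <- toy_players %>%
  count(position)

counts_direct
```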
For the next set of questions, you can filter the data to only Forwards, then
group by team to use `summarize` to count up the number of Forwards on each
team. You can also use the same `summarize` call to figure out the total number
of shots by all Forwards on each team. To figure out which team had the most
shots in total among all its forwards, you can use the `arrange` function to
re-order the data from the team with the most total shots to the least. It turns
out that Uruguay had the most shots by forwards on its team, with a total of 46
shots.
```{r}
worldcup %>%
filter(Position == "Forward") %>%
group_by(Team) %>%
summarize(n_forwards = n(),
total_forward_shots = sum(Shots)) %>%
arrange(desc(total_forward_shots))
```
To figure out which team(s) had the defender with the most tackles, you can
filter to only defenders and then use the `top_n` function to identify the
players with the top number of tackles. It turns out these players were on the
England, Germany, and Chile teams.
```{r}
worldcup %>%
filter(Position == "Defender") %>%
top_n(n = 1, wt = Tackles)
```
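More than one row can come back from `top_n(n = 1, ...)` because it keeps every row that is tied for the top value. The sketch below shows this with made-up data (`toy_defenders` and its columns are hypothetical). It also shows `slice_max`, which supersedes `top_n` in `dplyr` version 1.0.0 and later and keeps ties by default.

```{r}
library(dplyr)

# Hypothetical defender data, made up for illustration;
# two players are tied for the most tackles
toy_defenders <- data.frame(
  name = c("Player A", "Player B", "Player C", "Player D"),
  team = c("Team 1", "Team 2", "Team 3", "Team 1"),
  tackles = c(30, 34, 34, 12)
)

# `top_n` keeps every row tied for the top value
top_tacklers <- toy_defenders %>%
  top_n(n = 1, wt = tackles)

# `slice_max` (dplyr 1.0.0 or later) also keeps ties by default
top_tacklers_slice <- toy_defenders %>%
  slice_max(tackles, n = 1)

top_tacklers
```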
### Exploring the data using logical statements
Rotate to someone else to share their screen.
Then, try checking out the data using logical statements and some of the `dplyr`
code we covered in the last chapter (`filter` and `arrange`, for example), to help you
answer the following questions:
- What is the range of time that players spent in the game?
- Which player or players played the most time in this World Cup?
- How many players are goalies in this dataset?
- Create a new R object named `brazil_players` that is limited to the players in
this dataset that are (1) on the Brazil team and (2) not goalies.
If you have additional time, look over the "Data Manipulation" cheatsheet
available in RStudio's Help section. Make a list of questions you would like to
figure out from this example data, and start to plan out how you might be able
to answer those questions using functions from `dplyr`. Write the related code
and see if it works.
#### Example R code:
To figure out the range of time, you could use `arrange` twice, once with `desc`
and once without, to figure out the maximum and minimum values:
```{r}
# Minimum time
arrange(worldcup, Time) %>%
select(Time) %>%
slice(1)
# Maximum time
arrange(worldcup, desc(Time)) %>%
select(Time) %>%
slice(1)
```
Later, we will learn about the `n()` function, which you can use within piped
code to represent the total number of rows in the dataframe. If you'd like to
get the full range of the `Time` column in one pipeline of code, you can use
`n()` as a reference within `slice`, to pull both the first and last rows of the
dataframe:
```{r}
arrange(worldcup, Time) %>%
select(Time) %>%
slice(c(1, n()))
```
Finally, you could also use `min()` and `max()` functions to get the minimum and
maximum values of the `Time` column in the `worldcup` dataframe (remember that
you can use the `dataframe$column_name` notation to pull a column from a
dataframe). Similarly, there is a function called `range()` you could use to
find out the range of time these players played in the World Cup.
```{r}
range(worldcup$Time)
```
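The `min()` and `max()` functions mentioned above are used in the same way as `range()`. As a quick sketch with a made-up vector (`toy_time` is hypothetical, standing in for `worldcup$Time`):

```{r}
# Hypothetical playing times (minutes), made up for illustration
toy_time <- c(90, 570, 12, 345)

min(toy_time)    # smallest value
max(toy_time)    # largest value
range(toy_time)  # both at once: minimum first, then maximum
```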
To figure out which player or players played for the most time, there are a few
approaches you can take. Here I'm showing two: (1) using `filter` from the
`dplyr` package to filter down to rows where the `Time` for that row
equals the maximum play time that you determined from an earlier task (570
minutes); and (2) using the `top_n` function from `dplyr` to pick out the rows
with the maximum value (`n = 1`) of the `Time` column (see the help file for
`top_n` if you are unfamiliar with this function; we have not covered it in
class yet).
```{r}
worldcup %>%
filter(Time == 570)
worldcup %>%
top_n(n = 1, wt = Time)
```
*Note*: You may have noticed that you lost the players' names when you did this
using the `dplyr` pipechain. That's because `dplyr` functions convert the data
to a dataframe format that does not include rownames. If you want to keep
players' names, you can use a function from the `tibble` package called
`rownames_to_column` to move those names from the rownames of the data into a
column in the dataframe. Use the `var` parameter of this function to specify
what you want the new column to be named. For example:
```{r}
library(tibble)
worldcup %>%
rownames_to_column(var = "Name") %>%
filter(Time == 570)
```
There are a few ways to figure out how many players are goalies in this dataset.
One way is to use `sum()` on a logical vector of whether the player's position
is "Goalkeeper":
```{r}
is_goalie <- worldcup$Position == "Goalkeeper"
sum(is_goalie)
```
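This works because R treats `TRUE` as 1 and `FALSE` as 0 when you do arithmetic on a logical vector, so the sum is the number of `TRUE` values. A small sketch with a made-up vector (`toy_positions` is hypothetical) shows each step:

```{r}
# Hypothetical positions, made up for illustration
toy_positions <- c("Goalkeeper", "Forward", "Goalkeeper", "Defender")

# The comparison returns one TRUE / FALSE value per element
toy_is_goalie <- toy_positions == "Goalkeeper"
toy_is_goalie

# Summing the logical vector counts the TRUE values
sum(toy_is_goalie)
```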
Another way is to use `filter` from `dplyr`, along with a logical statement, to
filter the data to only players with the position of "Goalkeeper", and then pipe
that filtered subset into the `nrow` function to count the number of rows in the
filtered dataframe:
```{r}
worldcup %>%
filter(Position == "Goalkeeper") %>%
nrow()
```
Next, create a new R object named `brazil_players` that is limited to the
players in this dataset that are (1) on the Brazil team and (2) not goalies. You
can use a logical statement to filter to rows that meet both these conditions by
joining two logical statements in the `filter` function with an `&`:
```{r}
brazil_players <- worldcup %>%
filter(Team == "Brazil" & Position != "Goalkeeper")
head(brazil_players)
```
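As an alternative to `&`, you can give `filter` several logical statements separated by commas; it keeps only the rows where all of them are true. The sketch below compares the two forms on made-up data (`toy_squad` and its columns are hypothetical):

```{r}
library(dplyr)

# Hypothetical squad data, made up for illustration
toy_squad <- data.frame(
  team = c("Brazil", "Brazil", "Chile", "Brazil"),
  position = c("Forward", "Goalkeeper", "Defender", "Midfielder")
)

# Two conditions joined with `&`
with_ampersand <- toy_squad %>%
  filter(team == "Brazil" & position != "Goalkeeper")

# The same two conditions, separated by a comma
with_comma <- toy_squad %>%
  filter(team == "Brazil", position != "Goalkeeper")

with_comma
```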
### Exploring the data using basic plots #1
Use some basic plots to check out this data. Try the following:
- Create a scatterplot of the `worldcup` data, where each point is a player, the x-axis shows the amount of time the player played in the World Cup, and the y-axis shows the number of passes the player had. Try writing the code both with and without "piping in" the data you want to plot into the `ggplot` function.
- Create the same scatterplot, but have each point in the scatterplot show that player's position using some aesthetic besides the x or y position (e.g., color, point shape). Add "rug plots" to the margins.
- Create a scatterplot of number of shots (x-axis) versus number of tackles (y-axis) for **just** players on one of the four teams that made the semi-finals (Spain, Netherlands, Germany, Uruguay). Use color to show player's position and shape to show player's team. (Hint: you will want to use some `dplyr` code to clean the data before plotting to do this.)
- Create a scatterplot of player time versus passes. Use color to show whether the player was on one of the top 4 teams or not. (Hint: Again, you'll want to use some `dplyr` code before plotting to do this.) For an extra challenge, also try adding each player's name on top of each point. (Hint: check out the `rownames_to_column` function from the `tibble` package to help with this.)
- Did you notice any interesting features of the data when you did any of the graphs in this section?
#### Example R code:
Create a scatterplot of `Time` versus `Passes`.
```{r, fig.align = "center", fig.width = 5, fig.height = 3}
# Without piping
ggplot(worldcup) +
geom_point(mapping = aes(x = Time, y = Passes))
# With piping
worldcup %>%
ggplot() +
geom_point(mapping = aes(x = Time, y = Passes))
```
Create the same scatterplot, but have each point in the scatterplot show that player's position.
```{r}
ggplot(worldcup,
mapping = aes(x = Time, y = Passes, color = Position)) +
geom_point() +
geom_rug()
```
Create a scatterplot of number of shots (x-axis) versus number of tackles (y-axis) for **just** players on one of the four teams that made the semi-finals (Spain, Netherlands, Germany, Uruguay). Use color to show player's position and shape to show player's team. For an extra challenge, also try adding each player's name on top of each point.
```{r}
worldcup %>%
rownames_to_column(var = "Name") %>%
filter(Team %in% c("Spain", "Netherlands", "Germany", "Uruguay")) %>%
ggplot() +
geom_point(aes(x = Shots, y = Tackles, color = Position, shape = Team)) +
geom_text(mapping = aes(x = Shots, y = Tackles,
color = Position, label = Name),
size = 2.5)
```
Create a scatterplot of player time versus passes. Use color to show whether the player was on one of the top 4 teams or not.
```{r}
worldcup %>%
mutate(top_4 = Team %in% c("Spain", "Netherlands", "Germany", "Uruguay")) %>%
ggplot() +
geom_point(aes(x = Time, y = Passes, color = top_4))
```
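The `%in%` operator used in the `mutate` call above returns one `TRUE` or `FALSE` value per element, depending on whether that element appears in the vector on the right-hand side. A small sketch with a made-up vector (`toy_teams` is hypothetical) makes this easier to see:

```{r}
# Hypothetical team names, made up for illustration
toy_teams <- c("Spain", "Brazil", "Germany", "Japan")

# TRUE where the team is one of the four semi-finalists
toy_top_4 <- toy_teams %in% c("Spain", "Netherlands", "Germany", "Uruguay")
toy_top_4
```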
### Exploring the data using basic plots #2
Go back to the code you used in the previous section to create a scatterplot of the `worldcup` data, where each point is a player, the x-axis shows the amount of time the player played in the World Cup, and the y-axis shows the number of passes the player had. Try the following modifications: