---
title: "Data manipulation in R"
date: March 2020
author: "Ariel Muldoon"
output:
  pdf_document:
    toc: true
    toc_depth: 4
urlcolor: blue
---
```{r setup, include = FALSE, message = FALSE}
options(width = 100)
library(knitr)
opts_chunk$set(comment = NA, tidy = FALSE, dev = "pdf")
```
# Introduction and background
```{r hex, echo = FALSE, out.width = "100px"}
knitr::include_graphics("hex_dplyr.png")
knitr::include_graphics("hex_tidyr.png")
```
Today we are going to be learning how to perform basic data manipulation tasks in R. While there are many options for tackling data manipulation problems in R (e.g., `apply` family, **data.table** package, functions `ave()` and `aggregate()`), we will be working with the **dplyr** and **tidyr** packages today. I find that these packages are approachable for people without a lot of programming background but are still quite fast when working with large datasets.
In this workshop, we will cover the following:
- Making summary datasets by group
- Filtering the dataset to include only rows that satisfy certain conditions
- Selecting only some columns/variables in a dataset
- Adding new variables/columns
- Sorting datasets based on variables
- Reshaping datasets
- Merging or *joining* two datasets
The workshop is broken up into three parts:
> In Part 1, we'll review functions from **dplyr** for basic data manipulation/munging/cleaning. We end with a chance for you to practice some of the functions we covered.
> In Part 2, you'll be introduced to the concept of *reshaping* datasets via **tidyr** functions. We'll do another practice exercise at the end of this section.
> In Part 3 we'll practice joining datasets using the `join` functions from **dplyr**.
## Where to find help
It is important to know where to go for help when you run into data manipulation problems. The first place to start is the help pages for the functions themselves; too often folks skip this step and end up in a time-consuming search that could have been avoided. Another place that I often go to find help is on the Stack Overflow website: http://stackoverflow.com/questions/tagged/r. I've given you the link to questions that are specifically R programming questions. You could also look for questions tagged with **dplyr** or **tidyr** or search all R-related questions using keywords or phrases.
The newer RStudio Community website, https://community.rstudio.com/, is another place to look for and ask for help that can be less intimidating than Stack Overflow.
Both of these packages are fairly young, and while they are stabilizing, some elements may still change. The functions we are using today, however, are already stable and unlikely to change much over time.
Both packages have introductory vignettes that are useful.
The **Introduction to dplyr** vignette is updated as **dplyr** is updated, and is a nice resource: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html.
Also see the **Tidy data** vignette for some examples using **tidyr**: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html.
The RStudio cheat sheets may also be helpful: https://www.rstudio.com/resources/cheatsheets/
## Getting started
### Check package version
The current version of **dplyr** is 0.8.3 and the current version of **tidyr** is 1.0.2.
You can use `packageVersion()` to check for the currently installed version of a package. Make sure you are using current versions of both packages.
```{r packagevers}
packageVersion("dplyr")
packageVersion("tidyr")
```
If one of these packages isn't up to date, you need to re-install it. You can install from code with, e.g., `install.packages("tidyr")`, or via the `Install` button in the RStudio Packages pane. Remember that you do not need to install a package every time you use it, so don't make installation code part of a script.
In between version releases, bugs are fixed and new issues addressed in the *development version* of a package. For these two packages, you can see the changes, check for known issues, and download the current development version via their Github repositories. For **dplyr** see https://github.com/tidyverse/dplyr and for **tidyr** see https://github.com/tidyverse/tidyr.
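If you wanted to try a development version, one common route (a sketch, assuming you have the **remotes** helper package; this is not needed for the workshop) looks like:

```{r, eval = FALSE}
# Install the development versions from GitHub (not needed for this workshop).
# remotes is a lightweight helper for installing packages from GitHub.
install.packages("remotes")
remotes::install_github("tidyverse/dplyr")
remotes::install_github("tidyverse/tidyr")
```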
### Load packages
If all packages are up-to-date we can load **dplyr** and **tidyr** and get started.
```{r, message = FALSE}
library(dplyr)
library(tidyr)
```
### The `mtcars` dataset
In the first part of the workshop we will be using the `mtcars` dataset to practice data manipulation. This dataset comes with R, and information about this dataset is available in the R help files for the dataset (`?mtcars`).
We will be using both categorical and continuous variables from this dataset, including:
`mpg` (Miles per US gallon),
`wt` (car weight in 1000 lbs),
`cyl` (number of cylinders),
`am` (type of transmission),
`disp` (engine displacement),
`qsec` (quarter mile time), and
`hp` (horsepower).
Let's take a quick look at the first six lines (with `head()`) and structure (with `str()`) of this dataset. You should recognize that `cyl` and `am` (as well as others like `vs`) are categorical variables. However, they are considered numeric variables in the dataset since the categories are expressed with numbers.
```{r}
head(mtcars)
str(mtcars)
```
# Part 1: Functions for basic data manipulation
## Calculating summary statistics by group
We're going to start out today by learning how to calculate summary statistics by group. I start here because this is a common task that I see folks struggle with in R. The task of calculating summaries by groups in R is referred to as a *split-apply-combine* task because we want to split the dataset into groups, apply a function to each split, and then combine the results back into a single dataset.
There are a variety of ways to perform such tasks in R. We will be using **dplyr** functions in this workshop but in the long run you may find you like the style of another method better.
### Using the `group_by()` function
With **dplyr**, the key to split-apply-combine tasks is *grouping*. We need to define which variable contains the groups that we want to summarize separately. We create a grouped dataset using the `group_by()` function.
Let's create a grouped dataset named `bycyl`, where we group `mtcars` by the `cyl` variable. The `cyl` variable is a categorical variable representing the number of cylinders a car has. This variable has 3 different levels, `4`, `6`, and `8`.
```{r group}
bycyl = group_by(mtcars, cyl)
```
We can see that the new object is a grouped dataset if we print the `head` of the dataset and see the `Groups` tag or see the class `grouped_df` in the object structure.
```{r}
head(bycyl)
str(bycyl)
```
### Using the `summarise()` function
Now that we have a grouped dataset, we can use it with the `summarise()` function to calculate summary statistics by group. Note that `summarize()` is an alternative spelling for the same function.
We'll start by calculating the mean engine displacement for each cylinder category. We will be working on the grouped dataset `bycyl` since we want summaries by groups.
Notice that the first argument of `summarise()` is the dataset we want summarized. This is true for most of the **dplyr** functions. We list the summary function and variable we want summarized as the second argument.
```{r summarise}
summarise( bycyl, mean(disp) )
```
Notice that we printed the summarized dataset but did not name the resulting object. This is what we will be doing for most of the workshop, as my goal is to show you what happens to the dataset after we manipulate it. You certainly can (and likely will want to) name your final datasets. We'll see some examples of naming the new objects once we are doing multiple data manipulation tasks at one time.
### Summarizing multiple variables in `summarise()`
We can summarize multiple variables or use different summary functions at once in `summarise()` by using commas to separate each new function/variable.
For example, we can calculate the mean of engine displacement and horsepower by cylinder category in the same function call.
```{r}
summarise( bycyl, mean(disp), mean(hp) )
```
### Naming the variables in `summarise()`
The default names for the new variables we've been calculating are sufficient for a quick summary but are not particularly convenient if we wanted to use the result for anything further in R. We can set variable names as we summarize.
Let's calculate the mean and standard deviation of engine displacement by cylinder category and name the new variables `mdisp` and `sdisp`, respectively.
```{r}
summarise( bycyl, mdisp = mean(disp), sdisp = sd(disp) )
```
### Grouping a dataset by multiple variables
Datasets can be grouped by multiple variables as well as by a single variable. This is common for studies with multiple factors of interest or with nested study designs (e.g., plots nested in transects nested in sites).
Let's group `mtcars` by both `cyl` and `am` (transmission type) and then calculate the mean engine displacement. In the output you can see we calculated mean engine displacement for every factor combination, for a total of six rows (3 `cyl` categories and 2 `am` categories).
```{r}
byam.cyl = group_by(mtcars, cyl, am)
summarise( byam.cyl, mdisp = mean(disp) )
```
### Ungrouping a dataset
Looking at our last result, we can see the dataset is still grouped by the `cyl` variable (i.e., `cyl` is listed in "Groups"). If we are finished with our data manipulation it is best practice to *ungroup* the dataset. Trying to work with a dataset that is grouped when we don't want it to be can lead to unusual behavior. It is "safest" to make sure the final version of a dataset is ungrouped.
Ungrouping is done via the `ungroup()` function. Notice we no longer have any `Groups` listed in the output once we do this, as the result is no longer grouped by any variables.
```{r ungroup}
ungroup( summarise( byam.cyl, mdisp = mean(disp) ) )
```
### Summarizing multiple variables at once
When we want to summarize many variables in a dataset using the same function, we can use one of the *scoped variants* of `summarise()`. The scoped variants are `summarise_all()`, `summarise_at()`, and `summarise_if()`.
**Note: These scoped functions will still be available but will be superseded by `across()` in dplyr 1.0.0, which will be released in 2020.**
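As a preview of that forthcoming syntax (a sketch based on the dplyr 1.0.0 release candidate; it requires dplyr 1.0.0 or later and may still change before release), a scoped summary could eventually be written with `across()`:

```{r, eval = FALSE}
# Preview of dplyr >= 1.0.0 syntax; will not run under dplyr 0.8.x
library(dplyr)
bycyl = group_by(mtcars, cyl)
# Take the mean of every numeric variable, by cylinder category
summarise( bycyl, across(where(is.numeric), mean) )
```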
#### `summarise_all()`
The `summarise_all()` function is useful when we want to summarize every non-grouping variable in the dataset with the same function. We give the function we want to use for the summaries as the second argument, `.funs`.
Let's see how `summarise_all()` works by calculating the mean of every variable in `mtcars` for each cylinder category.
```{r}
summarise_all(bycyl, .funs = mean)
```
Note that we need to be careful with `summarise_all()`. We could have problems if trying to summarize both continuous and categorical variables in a single dataset and could end up with an error. All the variables in `mtcars` are currently numeric. What would happen if we made one of the variables a factor and tried to take the mean of every variable?
```{r}
mtcars$vs = factor(mtcars$vs)
```
R still does the averaging, but returns `NA` and warning messages for the `vs` column.
```{r}
summarise_all(bycyl, .funs = mean)
```
#### `summarise_at()`
We won't always want to summarize every column in a dataset, for reasons including having a mix of variable types. One option to only summarize some of the variables is to use `summarise_at()`, where we can list a subset of the columns that we want summaries for by name in the `.vars` argument.
You can list the variables to summarize within `vars()`.
```{r}
summarise_at(bycyl, .vars = vars(disp, wt), .funs = mean)
```
We can also drop out the variables we don't want summarized rather than writing out the ones we do want. For example, while all the variables in `mtcars` are read as numeric, some are actually categorical. If we don't want to treat them as continuous, we can drop them from the summary. Let's drop `am` and `vs` from our summary. We can do this by using the minus sign with the variable names inside `vars()`.
We will talk more about selecting and dropping specific variables later today when we talk about the `select()` function.
```{r}
summarise_at(bycyl, .vars = vars(-am, -vs), .funs = mean)
```
#### `summarise_if()`
If we want to choose the columns we want to summarize using a logical *predicate* function, we can use `summarise_if()`. You can see on the help page that the predicate function is the second argument, `.predicate`, followed by the summary functions.
Here, we'll only summarize the numeric variables by using the predicate function `is.numeric()`. Using this, R checks if a column is numeric with `is.numeric()` and if the result is `TRUE` a summary of the column is made. If the result is `FALSE`, the variable is dropped from the output.
In this example, all variables except `vs` are numeric and will be summarized.
```{r}
summarise_if(bycyl, .predicate = is.numeric, .funs = mean)
```
### Summarizing many variables using multiple functions
If we want to summarize many variables with multiple functions, we pass all the functions we want to the `.funs` argument in a `list()`. The functions are listed with commas between them.
For example, maybe we want to calculate both the mean and the maximum for all numeric variables by group. The functions we use are `mean()` and `max()`.
While the only example we see today is using `summarise_if()`, this can be done in any of the `summarise_*` functions.
```{r}
summarise_if( bycyl,
.predicate = is.numeric,
.funs = list(mean, max) )
```
Notice that we get `fn1` and `fn2` appended to the variable name when using multiple functions. To control what name is appended you can assign names to each function within the `list()`.
```{r}
summarise_if( bycyl,
.predicate = is.numeric,
.funs = list(mn = mean, mx = max) )
```
### The `glimpse()` function for examining wide datasets
The **dplyr** package truncates how much of the dataset we see printed into the R Console. For very wide datasets like the one we just created, we can get a better idea of what the result looks like using `glimpse()`.
```{r}
glimpse( summarise_if( bycyl, .predicate = is.numeric,
.funs = list(mn = mean, mx = max) ) )
```
## Filtering datasets with `filter()`
Now we will cover functions for other common data manipulation tasks, starting with *filtering*. Filtering is about how many rows we want in the dataset, not about the number of columns. It involves making specific subsets of your data by removing unwanted rows. Rows to keep are chosen based on *logical conditions*.
For example, maybe we want to focus on a subset of the dataset that only involves cars with automatic transmissions. We can do this with the `filter()` function to *filter* the `mtcars` dataset to only those rows where `am` is `0`.
Like other **dplyr** functions, the dataset is the first argument in `filter()`. The subsequent arguments are the conditions that the filtered dataset should meet. Here, the condition is that cars must have automatic transmissions, or `am == 0` (note the *two* equals signs).
```{r filter}
filter(mtcars, am == 0)
```
The `filter()` function is typically used with logical operators such as `==` (testing for equality), `!=` (testing for inequality), `<` (less than), `>=` (greater than or equal to), and with functions such as `is.na()` (keep rows where a value is `NA`) and `!is.na()` (keep rows where a value is not `NA`).
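A few quick sketches of these operators in action (the particular conditions here are just illustrations):

```{r, eval = FALSE}
library(dplyr)
filter(mtcars, cyl != 8)     # keep cars that do not have 8 cylinders
filter(mtcars, qsec < 16)    # keep quarter mile times under 16 seconds
filter(mtcars, !is.na(mpg))  # keep rows where mpg is not missing (all rows here)
```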
If we wanted to filter out all cars that weigh more than 4000 lbs, we can keep only the rows where `wt <= 4` (recall that `wt` is recorded in units of 1000 lbs).
```{r, eval = FALSE}
filter(mtcars, wt <= 4)
```
```{r, echo = FALSE}
as.tbl( filter(mtcars, wt <= 4) )
```
Alternatively, we could achieve the same thing by choosing everything that is *not* greater than 4, `!wt > 4`. The exclamation point, `!`, is the *not* operator.
```{r, eval = FALSE}
filter(mtcars, !wt > 4)
```
```{r, echo = FALSE}
as.tbl( filter(mtcars, !wt > 4) )
```
### Filtering grouped datasets
We can filter grouped datasets, and the condition will be applied separately to each group. For example, maybe we want to keep only the rows where `wt` is greater than its cylinder category group mean.
Notice I switch to filtering the grouped dataset `bycyl` here.
```{r}
filter( bycyl, wt > mean(wt) )
```
### Filtering by multiple conditions
And, of course, we can filter datasets by multiple conditions at once. If we wanted to filter the dataset to only cars with automatic transmission (`am == 0`) *and* that have weights less than or equal to 4000 lbs (`wt <= 4`), we can include both conditions in `filter()` separated by a comma.
```{r}
filter(mtcars, am == 0, wt <= 4)
```
While we won't see it today, if you need a logical *OR* statement you will need the `|` symbol (on many keyboards, this shares a key with the backslash).
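As a quick sketch of what an *OR* condition could look like (this example is not part of the workshop exercises), keeping cars with either 4 or 6 cylinders:

```{r, eval = FALSE}
library(dplyr)
# Logical OR: keep rows that meet either condition
filter(mtcars, cyl == 4 | cyl == 6)
```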
### Scoped variants of `filter()`
The **dplyr** package has `filter_all()`, `filter_at()`, and `filter_if()` verbs available. These would be useful if we wanted to apply the same filter to many columns of data.
These are often used in combination with the functions `any_vars()` or `all_vars()`. The "Examples" section of the help page is a good place to start to see worked examples.
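As a minimal sketch of the idea (since every column of `mtcars` is numeric, the same condition can be applied across all of them):

```{r, eval = FALSE}
library(dplyr)
# Keep rows where every column value is greater than zero
filter_all( mtcars, all_vars(. > 0) )
# Keep rows where at least one numeric column value exceeds 200
filter_if( mtcars, is.numeric, any_vars(. > 200) )
```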
## Selecting variables with `select()`
Keeping only a subset of the columns of a dataset is referred to as *selecting variables*. This might be for organizational reasons, where an analysis is focused on only some of many variables and so we want to create a dataset that contains only the variables of interest. Selecting is about how many columns we want to keep, not how many rows we have.
The **dplyr** function `select()` makes selecting columns very easy to do. We can keep or drop variables by name (although you can also use the index number) with straightforward code.
Let's *select* only the `cyl` variable from `mtcars` (printing just the first rows to save space in this document).
```{r select, eval = FALSE}
select(mtcars, cyl)
```
```{r select2, echo = FALSE}
as.tbl( select(mtcars, cyl) )
```
If we want to keep all variables between (and including) `cyl` and `vs`, we indicate that with the colon, `:`.
```{r, eval = FALSE}
select(mtcars, cyl:vs)
```
```{r, echo = FALSE}
as.tbl( select(mtcars, cyl:vs) )
```
If we want to keep only a few columns, we can separate the desired column names with a comma. Here we select only `cyl` and `vs`.
```{r, eval = FALSE}
select(mtcars, cyl, vs)
```
```{r, echo = FALSE}
as.tbl( select(mtcars, cyl, vs) )
```
### Using the special helper functions in `select()`
The `select()` function has several special functions to make variable selection even easier. See the help page for `select_helpers` for a list of all of these (`?select_helpers`).
These special functions include `starts_with()`, `contains()`, and `ends_with()`, among others. Such functions can be very useful if you have coded your variables names so that related variables contain the same letters or numbers.
We are going to start with an example `starts_with()`, where we select all variables with names that *start with* a lowercase `d`. Remember that R is case sensitive, so an uppercase `D` is different than a lowercase `d`.
```{r, eval = FALSE}
select( mtcars, starts_with("d") )
```
```{r, echo = FALSE}
as.tbl( select( mtcars, starts_with("d") ) )
```
Or we could keep all variables that *contain* a lowercase `a` anywhere in the variable name.
```{r, eval = FALSE}
select( mtcars, contains("a") )
```
```{r, echo = FALSE}
as.tbl( select( mtcars, contains("a") ) )
```
We've been choosing which variables we want to keep, but we could also choose which variables we want to drop like we did with `summarise_at()` earlier. We drop variables using the minus sign (`-`).
Drop the `gear` variable.
```{r, eval = FALSE}
select(mtcars, -gear)
```
```{r, echo = FALSE}
as.tbl( select(mtcars, -gear) )
```
Drop both the `gear` and `carb` variables.
```{r, eval = FALSE}
select(mtcars, -gear, -carb)
```
```{r, echo = FALSE}
as.tbl( select(mtcars, -gear, -carb) )
```
Drop all variables between and including `am` and `carb`. Notice that parentheses are needed around the variables to use `-` like this.
```{r, eval = FALSE}
select( mtcars, -(am:carb) )
```
```{r, echo = FALSE}
as.tbl( select( mtcars, -(am:carb) ) )
```
Drop variables that end with the letter "t".
```{r, eval = FALSE}
select( mtcars, -ends_with("t") )
```
```{r, echo = FALSE}
as.tbl( select( mtcars, -ends_with("t") ) )
```
The `select_helpers` can be used in other functions, as well. We would commonly use them in the scoped `*_at()` functions like `summarise_at()` to help pick the variables to use within the function. The `select()` function also has scoped variants available, `select_all()`, `select_at()`, and `select_if()`.
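For instance, a quick sketch combining `summarise_at()` with one of the helpers, taking group means of every variable whose name starts with a lowercase `d` (`disp` and `drat`):

```{r, eval = FALSE}
library(dplyr)
bycyl = group_by(mtcars, cyl)
# Use a select helper inside vars() to pick the columns to summarize
summarise_at( bycyl, .vars = vars( starts_with("d") ), .funs = mean )
```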
## Creating new variables with `mutate()`
In **dplyr**, we can use `mutate()` to create new variables and add them to the dataset as new columns. The new variable is the same length as the current dataset; in other words, it has the same number of rows as the original dataset. We will be making some new variables and adding them to `mtcars` to illustrate how this works.
Let's start by making a new variable called `disp.hp`, which is the sum of engine displacement (`disp`) and horsepower (`hp`).
As with the other **dplyr** functions, the dataset is the first argument of `mutate()`.
```{r mutate, eval = FALSE}
mutate(mtcars, disp.hp = disp + hp)
```
```{r mutate2, echo = FALSE}
as.tbl( mutate(mtcars, disp.hp = disp + hp) )
```
We can make multiple new variables at once, separating each new variable by a comma like we did in `summarise()`. A handy feature of `mutate()` is that we can work directly with the new variables we've made within the same function call. For example, we can first calculate `disp.hp` and then calculate a second variable that is half of `disp.hp` (`disp.hp` divided by 2). We can create other variables, as well, so we'll create the ratio of `qsec` and `wt` while we're at it.
```{r, eval = FALSE}
mutate(mtcars,
disp.hp = disp + hp,
halfdh = disp.hp/2,
qw = qsec/wt)
```
```{r, echo = FALSE}
as.tbl( mutate(mtcars,
disp.hp = disp + hp,
halfdh = disp.hp/2,
qw = qsec/wt) )
```
### Using `mutate()` with grouped datasets
We can work with grouped datasets when using `mutate()`. This is useful when we want to add a column of a summary statistic for each group to the existing dataset rather than making a summary dataset.
Let's create and add a new variable that is the mean horsepower for each cylinder category. Each car within a cylinder category will have the same value of mean horsepower, as `mutate()` always returns a new dataset that is the same length as the original.
Since this is a grouped operation we'll work with the grouped dataset `bycyl` we made earlier.
```{r}
mutate( bycyl, mhp = mean(hp) )
```
As you can see, the code for `mutate()` resembles the code for `summarise()`. While we will not see examples today, there are `mutate_all()`/`mutate_at()`/`mutate_if()` functions available that work much like the scoped variants of the `summarise()` function we saw earlier today.
There is also a function called `transmute()`, which creates new variables that are the same length as the current dataset like `mutate()` but only returns the new variables like `summarise()`.
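A quick sketch of the difference between the two:

```{r, eval = FALSE}
library(dplyr)
# mutate() returns all the original columns plus the new one ...
mutate(mtcars, disp.hp = disp + hp)
# ... while transmute() returns only the new column
transmute(mtcars, disp.hp = disp + hp)
```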
## Sorting
There are some situations where you might want to sort your dataset by variables within the dataset. For example, if we want to pull out the first observation in each group from a time series we might sort the dataset first by time within group prior to filtering. We can sort datasets with **dplyr** using `arrange()`.
Here we'll start by sorting `mtcars` by `cyl`. By default we sort whatever variable we are sorting on from low to high (ascending order).
```{r arrange, eval = FALSE}
arrange(mtcars, cyl)
```
```{r arrange2, echo = FALSE}
as.tbl( arrange(mtcars, cyl) )
```
To sort datasets by variables in descending order (highest to lowest), we can use the minus sign (`-`) or the function `desc()` (which is from **dplyr**).
```{r arrange3, eval = FALSE}
arrange(mtcars, -cyl)
```
```{r arrange4, echo = FALSE}
as.tbl( arrange(mtcars, -cyl) )
```
```{r arrange5, eval = FALSE}
arrange( mtcars, desc(cyl) )
```
```{r arrange6, echo = FALSE}
as.tbl( arrange( mtcars, desc(cyl) ) )
```
To sort variables only within groups, we sort by the grouping variable first and then the other sorting variables. The `arrange()` function ignores `group_by()`; this is different than all the other **dplyr** verbs we've learned today.
Here's an example of within-group sorting, sorting each cylinder category from lowest to highest `wt`.
```{r arrange7}
arrange(mtcars, cyl, wt)
```
To sort by more variables, keep adding them in `arrange()`, separated by commas.
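As a sketch of the "first observation per group" idea mentioned at the start of this section, we can combine sorting with a grouped `filter()`, using `row_number()` from **dplyr** to keep, e.g., the lightest car in each cylinder category:

```{r, eval = FALSE}
library(dplyr)
# Sort by weight within cylinder category, then keep the first (lightest) row per group
lightest = filter( group_by( arrange(mtcars, cyl, wt), cyl ),
                   row_number() == 1 )
lightest
```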
## Combining data manipulation tasks
When working with our own datasets we'll often want to do multiple data manipulation tasks in a row. Now that we've learned how to do different kinds of data manipulation, let's string multiple manipulations together.
We are going to:
1. Filter the `mtcars` dataset to only those cars with automatic transmissions;
2. Create a new variable that is the ratio of engine displacement and horsepower;
3. Calculate the mean of this new variable separately for each cylinder category.
### Using temporary objects
First we'll do this one step at a time, creating a new named object for each step. As a reminder, we haven't been naming objects as we practiced the functions above but instead were only printing results to the R Console. Now we're actually naming each object. I use `=` for assignment but you can also use `<-`.
The extra pair of parentheses I'm using prints the object so we can see what happens at each step.
```{r combined}
# Filter by automatic transmission
( filtcars = filter(mtcars, am == 0) )
# Create new variable in the filtered dataset
( ratio.cars = mutate( filtcars, hd.ratio = hp/disp) )
# Group by number of cylinders
grp.ratio = group_by(ratio.cars, cyl)
# Calculate mean of the new ratio variable by cylinder category
( sum.ratio = summarise( grp.ratio, mratio = mean(hd.ratio) ) )
```
The downside of this approach to multiple manipulations is that we had to make four objects when we really just wanted the final `sum.ratio` object. We have to think of names for each object at each step and we end up with a bunch of temporary objects in our R Environment.
### Nesting functions to avoid temporary objects
An alternative to temporary objects is to *nest* all the functions together. This means we put one function call within the next function call. Nesting allows us to avoid making any temporary objects but the resulting code is a bit hard to read. The code from nested functions is read inside out, where the first thing we do is also the most nested.
First, a simple example of nesting functions from work we did earlier, where we want to group the dataset by `cyl` and `am` and then calculate the mean of `disp`. Here's the same task via nesting. We put the `group_by()` function call within `summarise()`.
```{r nest}
summarise( group_by(mtcars, cyl, am), mdisp = mean(disp) )
```
Now the more complicated example, where we combined the series of data manipulation tasks. Note how the `filter()` is four functions deep in the code below.
```{r}
( sum.ratio = summarise( group_by( mutate( filter(mtcars, am == 0),
hd.ratio = hp/disp),
cyl),
mratio = mean(hd.ratio) ) )
```
### The pipe operator
Now that we are combining multiple data manipulation functions from **dplyr**, it's time to talk about the pipe operator. The pipe operator (`%>%`) represents a different coding style. The pipe allows us to perform a series of data manipulation steps in a long *chain* while avoiding all those temporary objects or difficult-to-read nested code.
In essence, the pipe operator *pipes* a dataset into a function as the first argument. One reason I've been pointing out to you that the **dplyr** functions have the dataset as the first argument is that this is one of the things that makes piping so easy with these functions.
You can think of the pipe as being pronounced "then", which we'll talk more about as we see some examples. Using the pipe is a bit hard to picture when you are first introduced to it, but things should start to get clearer once we see some code.
Let's start with a simple example. Remember when we grouped `mtcars` by `cyl` earlier?
```{r pipe}
bycyl = group_by(mtcars, cyl)
```
We read even this simple code "inside out". We see that we are grouping with `group_by()` and then if we read inside the function we see the dataset we are going to group. Let's write this same code using the pipe.
```{r}
bycyl = mtcars %>% group_by(cyl)
```
The code with the pipe is read from left to right. We see we are working with the `mtcars` dataset and *then* that we are grouping that dataset by `cyl`. The result is the same, but the code itself looks quite different.
Handily, we can keep piping through multiple functions in one long chain. Let's group `mtcars` by `cyl` and then calculate the mean `disp` of each group.
When working with pipes in a chain, it is standard to use a line break after each pipe with an indent for each subsequent function.
Aside: Stylistically, including white space in your code improves code readability. Think of writing a sentence without white space; it would be hard to read! Newer R users sometimes need to be reminded that white space rationing is not in effect. :-D It might seem clunky at first, but including white space quickly becomes natural and your code becomes much easier to read and understand.
```{r}
mtcars %>%
group_by(cyl) %>%
summarise( mdisp = mean(disp) )
```
Again, the above code is read from left to right. We see we are going to work with `mtcars`, then we group it by `cyl`, and then we calculate the mean `disp` of the grouped dataset. When you read it like this you can see why we might pronounce `%>%` as *then*.
#### Combining data manipulation tasks using the pipe operator
Let's go back to our combined data manipulation task we did a few minutes ago on `mtcars` and use piping instead of temporary objects or nesting.
```{r}
mtcars %>%
filter(am == 0) %>% # filter out the manual transmission cars
mutate(hd.ratio = hp/disp) %>% # make new ratio variable
group_by(cyl) %>% # group by number of cylinders
summarise(mratio = mean(hd.ratio) ) # calculate mean hd.ratio per cylinder category
```
We didn't assign a name to the final object. Let's do that now.
```{r}
sum.ratio = mtcars %>%
filter(am == 0) %>% # filter out the manual transmission cars
mutate(hd.ratio = hp/disp) %>% # make new ratio variable
group_by(cyl) %>% # group by number of cylinders
summarise(mratio = mean(hd.ratio) ) # calculate mean hd.ratio per cylinder category
```
#### Using the pipe operator with non-**dplyr** functions
The pipe operator can be used with functions outside the **dplyr** package, as well. If the first argument of the function is the dataset, the code looks exactly like what we've been doing. For example, we can use the pipe with the `head()` function from base R and get the first 10 rows of `mtcars`. The first argument of the `head()` function is the dataset.
```{r}
mtcars %>%
head(n = 10)
```
If the first argument of a function is *not* the dataset, we need to use the dot, `.`, to represent the dataset name in the function we are piping into. We can see this if we use the pipe operator with the `t.test()` function, which doesn't have `data` as the first argument.
Here we test for a difference in mean horsepower among transmission types based on the `mtcars` dataset. The dataset is piped to the `data` argument with the `.`.
```{r}
mtcars %>%
t.test(hp ~ am, data = .)
```
We generally wouldn't use piping in such a simple case, though, since we could just use the `data` argument of `t.test()` directly. A more realistic example is filtering the dataset before running the test.
Let's filter `mtcars` to cars weighing 4000 lbs or less (`wt` is recorded in units of 1000 lbs, so the condition is `wt <= 4`) and then test whether mean horsepower differs between transmission types.
```{r}
mtcars %>%
filter(wt <= 4) %>%
t.test(hp ~ am, data = .)
```
## Counting the number of rows in a group
Before we move on, I want to talk about one more function. The **dplyr** package has a built-in function, `n()`, for counting the number of rows in a group. This is useful when making tables of summary statistics.
```{r n}
mtcars %>%
group_by(cyl) %>%
summarise( n = n(),
mdisp = mean(disp) )
```
Other useful functions related to `n()` are `count()` and `tally()`, which can tally up the number of rows per group in fewer steps. Take a look at their help pages to see how they work.
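For example — a minimal sketch, assuming all you want is the per-group row count — `count()` collapses the group-then-summarise pattern into a single call, and `tally()` does the counting step on a dataset you've already grouped:

```{r counttally}
# count() groups by cyl, counts the rows per group, and returns the result
mtcars %>%
     count(cyl)

# tally() counts the rows of an already-grouped dataset
mtcars %>%
     group_by(cyl) %>%
     tally()
```

Both return the counts in a column named `n` by default.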
We can also use `n()` directly inside other functions, such as `filter()`, for removing rows based on the group's total count. I'll keep only the rows from `cyl` groups that have fewer than 10 observations. It turns out that this is true only for the `6` group.
To show best practice here I'll `ungroup()` at the end of the pipe chain.
```{r nfilter}
mtcars %>%
group_by(cyl) %>%
filter(n() < 10) %>%
ungroup()
```
The `n()` function can also be used when assigning index numbers within groups. I use this most often when my rows within groups aren't uniquely identified but I need them to be. This is especially useful if the group sizes aren't known or might vary.
In this example we'll also `select()` just the first three columns so we can easily see the new `index` column that we create. This column indexes from one to group size (`n()`) in each cylinder group.
```{r nmutate}
mtcars %>%
group_by(cyl) %>%
select(1:3) %>%
mutate( index = 1:n() ) %>%
ungroup()
```
We might want to add this index in based on the order of some variable in the dataset, not on the order the dataset is when we read it in. This is a case for `arrange()`.
Let's add the index based on the order of `disp` within each `cyl` category. We `arrange()` prior to creating the `index` variable.
```{r nmutate2}
mtcars %>%
arrange(cyl, disp) %>%
group_by(cyl) %>%
select(1:3) %>%
mutate( index = 1:n() ) %>%
ungroup()
```
## Practice data manipulation
So far we've covered a lot of material on data manipulation functions. Before going on to the next topic, I want to take some time to allow you to practice using some of the functions we've seen so far. I've set up two example problems below. Each example will take a different set of functions to solve.
### The `babynames` dataset
We'll be practicing using the `babynames` dataset. This can be found in package **babynames**. The current version of this package is `1.0.0`. If you do not have this package or it is not up to date, please install it. You can do this with the RStudio Packages pane Install button, or run the code `install.packages("babynames")`.
```{r packagevers2}
packageVersion("babynames")
```
Once the package is installed, load the package.
```{r babynames}
library(babynames)
```
The help page for `babynames` gives us some basic information on the dataset.
```{r babyhelp, eval = FALSE}
?babynames
```
The `babynames` dataset contains data from the United States Social Security Administration on the number and proportion of babies given a name each year from 1880 through 2017. Rare names (given fewer than 5 times in a year) are excluded from the dataset. The annual proportion of babies given a name was calculated separately for male and female babies (`sex`).
The dataset has five variables, shown below.
```{r glimpsebaby}
glimpse(babynames)
head(babynames)
```
### Practice problem 1
The first practice problem involves filtering and sorting.
> **Which name was given to the largest number of babies in the year you were born?**
Once you find the answer
> **How many babies were given that name in 2017?**
You can check to see [how I approached this problem below](#answers-problem-1).
### Practice problem 2
The second practice problem involves filtering, grouping, and then summarizing the number of rows per group.
> **Calculate the total number of baby names for each level of the `sex` variable in the year you were born and in 2017.**
**Hint:** To use `filter()` with multiple values you'll need `%in%` instead of `==`. For example, if you wanted to filter to the years 1980 and 2015 you'd use `year %in% c(1980, 2015)` for the condition in `filter()`.
[Here's how I tackled this.](#answer-problem-2)
# Part 2: Reshaping datasets
We are going to switch gears now and talk about how to *reshape* datasets.
In this section, we will learn to take the information from the columns of a dataset and put that information on rows instead. This is an example of taking a *wide* dataset and making it *long*. We will also learn to take information from the rows of a dataset and put that information into columns instead. In other words, reshape a dataset from *long* to *wide*. None of this changes how much information we have, it just changes how the information is stored.
We will be learning to reshape using the **tidyr** package.
The current language of the **tidyr** package involves *pivoting*. To *pivot long* means to take a wide format dataset and transform it into a long dataset. To *pivot wide* means to take a long format dataset and make it wide. We'll see examples of these as we go along, which should help clear up any confusion with this new terminology.
We'll learn the basics of reshaping on what I call a *toy* dataset. A toy dataset is a set of fake data that we make to practice functions on. Small toy datasets are handy when you are learning a new function or trying to troubleshoot a data manipulation technique. We could use built-in datasets like `mtcars`, as well, but toy datasets are conveniently very small.
The dataset that we will create, `toy1`, will have six rows and five columns.
The first column contains the identifier of the individuals a treatment was applied to (`indiv`).
The second contains the levels of the treatment (`trt`). The individual identifiers are repeated across treatments, so individual `1` in treatment `a` is different from individual `1` in treatment `b`. This means the combination of treatment and individual is the unique identifier for each row.
The last three columns are some quantitative measurement taken at three different times (`time1`, `time2`, and `time3`).
The shape of this toy dataset is one I commonly see for data from studies that take measurements through time.
I'm not going to walk through this code, but below you can see how I create this dataset. If you are interested in more information on how to get started simulating data in R, see my post [here](https://aosmith.rbind.io/2018/08/29/getting-started-simulating-data/).
This dataset `toy1` is in a *wide* format. It has 6 rows, and the quantitative values are stored in the 3 "time" columns for a total of 18 values.
```{r toy}
( toy1 = data.frame(indiv = rep(1:3, times = 2),
trt = rep( c("a", "b"), each = 3),
time1 = rnorm(n = 6),
time2 = rnorm(n = 6),
time3 = rnorm(n = 6) ) )
```
If we were going to analyze this dataset in R we would most likely need it to be in a long format. We want to keep the two columns containing the identifying information (`trt`, `indiv`), have a single column containing the information about the time of measurement (`time1`, `time2`, or `time3`), and a single column containing the values of the quantitative measurement. To *lengthen* a dataset from wide to long we use the `pivot_longer()` function.
## Wide to long with `pivot_longer()`
The **tidyr** package was built to be used with pipes, and the dataset is the first argument for its functions. In `pivot_longer()`, the first thing we do after defining the dataset we want to reshape is to list the columns that contain values we want to be combined into a single column in `cols`. We can use the `select_helpers` we learned earlier for this.
Once we pick the columns we are combining, we name the new "grouping" column that will contain the names of the columns we are combining with `names_to`. I will name this new column `time`. Note that when naming a column we need to use a *string*, meaning the name has to be in quotes.
Finally we need to name the new column of values using `values_to`. I'll name this column `measurement`. This is also done using a string.
We have the same amount of information in the newly long dataset below as we did in the wide dataset. We have 18 values, now stored in a single column. We changed the shape of the dataset, not the underlying data.
```{r long1}
toy1 %>%
pivot_longer(cols = time1:time3,
names_to = "time",
values_to = "measurement")
```
We'd better name this newly long-format object so we can use it in further examples. We'll use this long dataset to practice putting it back into wide format. This time I use `starts_with()` to choose the columns.
```{r long3}
toy1long = toy1 %>%
pivot_longer(cols = starts_with("time"),
names_to = "time",
values_to = "measurement")
```
## Long to wide with `pivot_wider()`
Now we can use the function `pivot_wider()` to *widen* the long dataset `toy1long` back to its original format. You might want to do this if, for example, you were going to take a dataset from an analysis done in R to graph in a program like SigmaPlot (which apparently often works best on wide datasets).
In the `pivot_wider()` function, we'll use the pair of arguments `names_from` and `values_from` after defining the dataset we want to reshape.
The `names_from` argument is where we list the column(s) that contains the values we will use as the new column names. We are referring to an existing column, so this can be done with *bare* names (i.e., without quotes around the variable names).
We list the column that contains the value(s) we will fill the new columns with using `values_from`.
```{r wide1}
toy1long %>%
pivot_wider(names_from = time,
values_from = measurement)
```
### Using multiple columns in `names_from`
In some cases we'll want to make a wide dataset with new column names based on multiple variables in the long dataset. In that case we can pass multiple variable names to `names_from`.
By default, the new column names will have an underscore (`_`) in them separating the information from the two variables. The new column names are based on the order the variables are listed in `names_from`.
Now we have a 3 row dataset with quantitative values stored in 6 columns: we still have our original 18 pieces of information.
```{r extrawide2}
toy1long %>%
pivot_wider(names_from = c(trt, time),
values_from = measurement)
```
We can change the symbol used in the new column names with `names_sep`. Here I also change the new column names by changing the order in which the variables are listed in `names_from`.
```{r extrawide3}
toy1long %>%
pivot_wider(names_from = c(time, trt),
values_from = measurement,
names_sep = ".")
```
### Non-unique row identifiers in `pivot_wider()`
If the rows of the long dataset aren't uniquely identified when converting into a wide format you will get a warning message from `pivot_wider()`.
For example, if we were trying to widen `toy1long` but we only had the `trt` variable and not the `indiv` variable, our rows wouldn't be uniquely identified. It is only the combination of `trt`, `indiv`, and `time` that uniquely identifies a row.
Let's remove `indiv` from the dataset using `select()`.
```{r notunique}
toy1long %>%
select(-indiv)
```
There are now multiple observations of each time for each `trt` category; our rows are not uniquely identified. Let's see what happens when we use `pivot_wider()` on this dataset without changing the code.
In particular, take a look at the warning messages. These messages contain useful information about what is in the output and why. The output dataset looks quite different from what we've seen before because all 3 values for each `trt` and `time` combination were kept but placed into lists.
```{r notunique2, warning = TRUE}
toy1long %>%
select(-indiv) %>%
pivot_wider(names_from = time,
values_from = measurement)
```
If we really want to widen the dataset without `indiv`, we most likely want to summarize over the values for each `trt` and `time`. This can be done using the `values_fn` argument. This is what the message
> * Use `values_fn = list(measurement = summary_fun)` to summarise duplicates
was telling us.
I'll change the code to calculate the mean of the values in each `trt` and `time` using `values_fn`. When we summarize over multiple values we *do* change the total number of values in the dataset. We now have only 6 quantitative values in the output.
```{r notunique3}
toy1long %>%
select(-indiv) %>%
pivot_wider(names_from = time,
values_from = measurement,
values_fn = list(measurement = mean) )
```
## Practice reshaping
Before we move on to Part 3 of the workshop I want you to take time to practice reshaping with `pivot_wider()` and `pivot_longer()`.
We will once again be working with the `babynames` dataset.
### Practice problem 3
The third practice problem builds on our work from [practice problem 2](#practice-problem-2). We calculated the total number of baby names in the year we were born and in 2017 for each `sex`.
I didn't name the final object, but I need to in order to use it in this problem. I'll do that here, and print the result so I remember what it looked like.
```{r}
( numbaby_76_17 = babynames %>%
filter( year %in% c(1976, 2017) ) %>%
group_by(year, sex) %>%
summarise(n = n() ) %>%
ungroup() )
```
Using your summarized dataset from practice problem 2:
> **Reshape the dataset to a wide format. Make a dataset with a separate column for each `sex` containing the number of baby names in a given `year`.**
> **Now reshape the same dataset to a different wide format. Make a dataset with a separate column for each `year` containing the number of baby names for a given `sex`.**
Finally, practice putting the dataset back in the original format.
> **Take the dataset that has `sex` as separate columns and put this back in the original format.**
You can see my approach [here](#answers-problem-3).
# Part 3: Joining two datasets together
The last topic we are going to cover today is merging or *joining*. For a variety of reasons, we might have data for a single analysis stored in separate datasets. Joining is the process of combining two datasets based on matching values in the columns you are using as the *unique identifiers*. The unique identifier variables are the variables that tell R which rows in one dataset should be matched to which rows in the other.
There is a `merge()` function in base R, but we will be using some of the join functions from **dplyr** today, including `inner_join()`, `left_join()`, and `full_join()`.
Let's create two toy datasets to join together.
The first dataset (`tojoin1`) will contain counts of some species in three different treatment plots (`treat`) within different sites (`site`).
The second dataset (`tojoin2`) will contain an environmental variable, measured on the same plots and sites (`elev`). Both datasets are missing measurements from a treatment plot in site 3; the first dataset is missing treatment "c" and the second dataset is missing treatment "a".
The key to making a `data.frame()` like this is to make sure each variable is the same length as each other variable.
If we `set.seed()` to the same number we'll all get the same random numbers from `rpois()` and `rgamma()`.
```{r merge}
set.seed(16) # If I set the seed, we will all get the same random numbers
# This dataset is slightly unbalanced, as site 3
# doesn't have the "c" treatment count
( tojoin1 = data.frame(site = rep(1:3, each = 3, length.out = 8),
treat = rep(c("a", "b", "c"), length.out = 8),
count = rpois(8, 6) ) )
# This dataset is also slightly unbalanced,
# missing the elevation measurement from
# site 3 treatment "a"
( tojoin2 = data.frame(site = rep(1:3, length.out = 8),
treat = rep(c("b", "c", "a"), each = 3, length.out = 8),
elev = rgamma(8, 1000, 1) ) )
```
The unique identifier of each measurement in each dataset is a combination of `site` and `treat`; those are the variables that we will use to tell R which rows within the two datasets to combine into one.
## The inner join
Let's start our joining practice by joining these two datasets together using `inner_join()`.
See the help page, `?join`, for a description of each type of join available in **dplyr**. In the documentation, you will see that every join involves two datasets, called `x` and `y`, to be joined. The `x` dataset is the first dataset you give to the join function and the `y` dataset is the second.
An *inner join* matches on the unique identifiers and returns only rows that are shared in both datasets.
From the documentation, an `inner_join()` will
> return all rows from x where there are matching values in y, and all columns from x and y
By default, `inner_join()` joins on all columns shared by the two datasets. When we use this default, we will get a message telling us which variables were used for joining when we run the code.
We'll name our new combined dataset `joined`, and print the result.
```{r innerjoin1}
( joined = inner_join(tojoin1, tojoin2) )
```
To make our code more explicit and easily understandable, we can also use the `by` argument to define which variables we want to join on. This is what I usually do.
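For example, here is the same inner join with the joining variables spelled out via `by`. The result is identical to the default join above, but no message about the join columns is printed since we stated them ourselves:

```{r innerjoinby}
# Join explicitly on the combination of site and treat
inner_join(tojoin1, tojoin2, by = c("site", "treat"))
```

Being explicit here also protects you if the two datasets happen to share a column name you did *not* intend to join on.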