---
title: 'cm006: `dplyr` Exercise'
output: html_notebook
editor_options:
  chunk_output_type: inline
---
<!---The following chunk allows errors when knitting--->
```{r allow errors, echo = FALSE}
knitr::opts_chunk$set(error = TRUE)
```
**Optional, but recommended startup**:
1. Change the file output to both html and md _documents_ (not notebook).
2. `knit` the document.
3. Stage and commit the Rmd and the knitted documents.
# Intro to `dplyr` syntax
Load the `gapminder` and `tidyverse` packages. Hint: `suppressPackageStartupMessages()`!
- This loads `dplyr`, too.
```{r load packages, warning = FALSE, message = FALSE}
# load your packages here:
library(gapminder)
library(tidyverse)
```
## `select()` (8 min)
1. Make a data frame containing the columns `year`, `lifeExp`, `country` from the gapminder data, in that order.
```{r}
select(gapminder, year, lifeExp, country)
```
2. Select all variables, from `country` to `lifeExp`.
```{r}
# This will work:
select(gapminder, country, continent, year, lifeExp)
# Better way:
select(gapminder, country:lifeExp)
```
3. Select all variables, except `lifeExp`.
```{r}
# country:year would also drop pop and gdpPercap; a minus sign drops only lifeExp
select(gapminder, -lifeExp)
```
4. Put `continent` first. Hint: use the `everything()` function.
```{r}
select(gapminder, continent, everything())
```
5. Rename `continent` to `cont`.
```{r}
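# compare: select() keeps only the columns you name, so everything() adds the rest back
select(gapminder, cont = continent, everything())
# rename() keeps all columns in place; the syntax is new_name = old_name
rename(gapminder, cont = continent)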
```
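To see the difference between renaming with `select()` and with `rename()` on something small, here is a minimal sketch. The tibble `toy` and its columns are made up for illustration; they are not part of the gapminder data.

```{r}
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(a = 1:3, b = c("x", "y", "z"), c = c(TRUE, FALSE, TRUE))

# select() with new_name = old_name renames AND drops every column not named
selected <- select(toy, alpha = a)
names(selected)

# rename() changes the name but keeps all columns in their original positions
renamed <- rename(toy, alpha = a)
names(renamed)

# A minus sign drops one column without listing the others
dropped <- select(toy, -b)
names(dropped)
```

If you only want a new name, `rename()` is usually the safer choice, because it cannot drop columns by accident.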
## `arrange()` (8 min)
1. Order by year.
```{r}
arrange(gapminder, year)
```
2. Order by year, in descending order.
```{r}
arrange(gapminder, desc(year))
```
3. Order by year, then by life expectancy.
```{r}
arrange(gapminder, year, lifeExp)
```
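Sort keys can be mixed: `desc()` applies only to the variable it wraps, and later variables break ties in earlier ones. A small sketch, assuming a made-up tibble called `toy`:

```{r}
library(dplyr)

# Hypothetical toy data with ties in year
toy <- tibble(
  year    = c(2007, 1952, 2007, 1952),
  lifeExp = c(80, 40, 60, 70)
)

# Newest year first; within each year, lowest life expectancy first
sorted <- arrange(toy, desc(year), lifeExp)
sorted
```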
## Piping, `%>%` (8 min)
Note: think of `%>%` as the word "then"!
Demonstration:
Here I want to combine `select()` Task 1 with `arrange()` Task 3.
This is how I could do it by *nesting* the two function calls:
```{r nesting functions example, eval = F}
# Nesting function calls can be hard to read
arrange(select(gapminder, year, lifeExp, country), year, lifeExp)
```
Now using pipes:
```{r}
# the same operation written with 2 "pipes": take gapminder, then select, then arrange
gapminder %>%
  select(year, lifeExp, country) %>%
  arrange(year, lifeExp)
```
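The pipe passes the left-hand side as the first argument of the function on the right, so `x %>% f(y)` is another way of writing `f(x, y)`. The sketch below checks this on a made-up tibble (`toy` is an assumption for illustration, not course data):

```{r}
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  year    = c(2007, 1952, 2002),
  lifeExp = c(80, 40, 60),
  pop     = c(3, 1, 2)
)

# Nested version: read from the inside out
nested <- arrange(select(toy, year, lifeExp), year)

# Piped version: read from top to bottom
piped <- toy %>%
  select(year, lifeExp) %>%
  arrange(year)

# Both approaches give the same data frame
all.equal(nested, piped)
```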
# Resume lecture
Return to guide at section 6.7.
## `filter()` (10 min)
- Shortcut for the pipe operator in RStudio: Cmd+Shift+M (Ctrl+Shift+M on Windows and Linux).
- Discouraged: subsetting by position, as in `gapminder[1:6, 1:3]`. This is not reproducible, and you would have to add a comment explaining why you are doing it.
1. Only take data with population greater than 100 million.
```{r}
gapminder %>%
  filter(pop > 100000000)
```
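2. Your turn: of the rows kept in step 1, only take data from Asia.
```{r}
# Three equivalent ways to combine the two conditions:
# with the & operator
gapminder %>%
  filter(pop > 100000000 & continent == "Asia")
# with two filter() calls in a row
gapminder %>%
  filter(pop > 100000000) %>%
  filter(continent == "Asia")
# with a comma, which filter() treats as "and"
gapminder %>%
  filter(pop > 100000000, continent == "Asia")
```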
3. Repeat 2, but take data from the countries Brazil and China.
```{r}
gapminder %>%
  filter(pop > 100000000, country == "Brazil" | country == "China")
```
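When matching one of several values, `%in%` is a common alternative to chaining `|`. It also sidesteps a precedence trap: in R, `&` binds more tightly than `|`, so mixing them in one expression without parentheses can change the result. A sketch on made-up data (`toy` and its numbers are illustrative assumptions):

```{r}
library(dplyr)

# Hypothetical toy data; pop is in arbitrary units
toy <- tibble(
  country = c("Brazil", "China", "China", "Brazil", "India"),
  pop     = c(50, 80, 200, 150, 300)
)

# Comma-separated conditions are combined with "and"
with_or <- filter(toy, pop > 100, country == "Brazil" | country == "China")

# Same result, written with %in%
with_in <- filter(toy, pop > 100, country %in% c("Brazil", "China"))

# Pitfall: this is parsed as (pop > 100 & Brazil) | China,
# so small-population rows for China slip through
pitfall <- filter(toy, pop > 100 & country == "Brazil" | country == "China")

nrow(with_in)
nrow(pitfall)
```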
## `mutate()` (10 min)
Let's get:
- GDP by multiplying GDP per capita with population, and
- GDP in billions, named (`gdpBill`), rounded to two decimals.
```{r}
gapminder %>%
  mutate(gdp = gdpPercap * pop,
         gdpBill = round(gdp / 1E9, digits = 2))
```
Notice that `gdpBill` is computed from `gdp`, a column created earlier in the same `mutate()` call. No need for loops, either!
Try the same thing, but with `transmute` (drops all other variables).
```{r}
gapminder %>%
  transmute(gdp = gdpPercap * pop,
            gdpBill = round(gdp / 1E9, digits = 2))
```
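A quick way to see what "drops all other variables" means is to compare the column names each function returns. This sketch uses a made-up two-row tibble (`toy` is an assumption for illustration):

```{r}
library(dplyr)

# Hypothetical toy data
toy <- tibble(gdpPercap = c(1000, 2000), pop = c(1e6, 5e6))

# mutate() keeps the existing columns and appends the new one
kept <- mutate(toy, gdp = gdpPercap * pop)
names(kept)

# transmute() returns only the columns created in the call
only_new <- transmute(toy, gdp = gdpPercap * pop)
names(only_new)
```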
The `if_else` function is useful for changing certain elements in a data frame.
Example: Suppose Canada's 1952 life expectancy was mistakenly entered as 68.8 in the data frame, but is actually 70. Fix it using `if_else` and `mutate`.
```{r}
gapminder %>%
  mutate(lifeExp = if_else(country == "Canada" & year == 1952, 70, lifeExp))
```
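The same pattern is easier to check on a few rows. In the sketch below, the tibble `toy` is made up for illustration; the Canada 1952 value of 68.8 is taken from the example above, and the other values are placeholders. `if_else()` replaces the value only where the condition is `TRUE` and leaves every other row as it was.

```{r}
library(dplyr)

# Hypothetical toy data mimicking three gapminder rows
toy <- tibble(
  country = c("Canada", "Canada", "Chile"),
  year    = c(1952, 1957, 1952),
  lifeExp = c(68.8, 70.0, 54.7)
)

# Overwrite lifeExp only for Canada in 1952; keep the original value elsewhere
fixed <- toy %>%
  mutate(lifeExp = if_else(country == "Canada" & year == 1952, 70, lifeExp))

fixed$lifeExp
```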
Your turn: Make a new column called `cc` that pastes the country name followed by the continent, separated by a comma. (Hint: use the `paste` function with the `sep=", "` argument).
```{r}
gapminder %>%
  mutate(cc = paste(country, continent, sep = ", "))
```
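The functions we've used inside `mutate()` (arithmetic, `round()`, `if_else()`, `paste()`) are called __vectorized functions__: they operate on a whole column at once and return one result per element.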
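As a rough illustration of what vectorization buys you, the sketch below computes the same product once with a vectorized operator and once with an explicit loop. The vectors `gdp_percap` and `population` are made-up values, not gapminder data.

```{r}
# Hypothetical input vectors
gdp_percap <- c(1000, 2000, 3000)
population <- c(1e6, 2e6, 3e6)

# Vectorized: one expression handles every element
gdp_vectorized <- gdp_percap * population

# Loop: the same computation, element by element
gdp_loop <- numeric(length(population))
for (i in seq_along(population)) {
  gdp_loop[i] <- gdp_percap[i] * population[i]
}

# Both give the same result
identical(gdp_vectorized, gdp_loop)

# paste() is vectorized too: it pairs up elements of its arguments
labels <- paste(c("Canada", "Chile"), c("Americas", "Americas"), sep = ", ")
labels
```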
## git stuff (Optional)
Knit, commit, push!
# Bonus Exercises
If there's time remaining, we'll practice with these exercises. I'll give you 1 minute for each, then we'll go over the answer.
1. Take all countries in Europe that have a GDP per capita greater than 10000, and select all variables except `gdpPercap`. (Hint: use `-`).
2. Take the first three columns, and extract the names.
3. Of the `iris` data frame, take all columns that start with the word "Petal".
- Hint: take a look at the "Select helpers" documentation by running the following code: `?tidyselect::select_helpers`.
4. Convert the population to a number in billions.
5. Filter the rows of the iris dataset for Sepal.Length >= 4.6 and Petal.Width >= 0.5.
Exercises 3 and 5 are from [r-exercises](https://www.r-exercises.com/2017/10/19/dplyr-basic-functions-exercises/).