-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpart2-solution.Rmd
173 lines (120 loc) · 4.39 KB
/
part2-solution.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
---
title: "R Notebook"
output: html_notebook
---
# Part 2 Solutions
```{r}
### Load the libraries will we need
library(readr)
library(dplyr)
library(ggplot2)
```
Read the data and check
```{r}
gapminder <- read_csv("raw_data/gapminder.csv")
head(gapminder)
```
## In-class exercises
Create a subset of the data where the population less than a million in the year 2002
```{r}
filter(gapminder, pop < 1e6, year == 2002)
```
Create a subset of the data where the life expectancy is greater than 75 in the years prior to 1987
```{r}
filter(gapminder, lifeExp > 75, year < 1987)
```
Create a subset of the European data where the life expectancy is between 75 and 80 in the years 2002 or 2007.
```{r}
filter(gapminder, continent == "Europe", lifeExp > 75, lifeExp < 80 , year == 2002 | year == 2007)
```
Can also use the `between` function from `dplyr` and the `%in%` function
```{r}
filter(gapminder, continent == "Europe",
between(lifeExp, 75,80),
year %in% c(2002,2007))
```
Write a workflow to do the following:-
- Filter the data to include just observations from the year 2002
- Re-arrange the table so that the countries from each continent are ordered according to decreasing wealth. i.e. the wealthiest countries first
- Select all the columns apart from year
- Write the data frame out to a file in out_data/ folder
```{r}
# Less-efficient solution before pipes are introduced
# create out_data folder before we start (no warning given if it already exists)
dir.create("out_data", showWarnings = FALSE)
gapminder2 <- filter(gapminder, year == 2002)
gapminder3 <- arrange(gapminder2, continent, desc(gdpPercap))
gapminder4 <- select(gapminder3, -year)
write_csv(gapminder4, "out_data/gapminder_2002.csv")
```
Re-written using pipes
```{r}
filter(gapminder, year == 2002) %>%
arrange(continent, desc(gdpPercap)) %>%
select(-year) %>%
write_csv("out_data/gapminder_piped_2002.csv")
```
The violin plot is a popular alternative to the boxplot. Create a violin plot with geom_violin to visualise the differences in GDP between different continents.
```{r}
ggplot(gapminder, aes(x = continent, y = gdpPercap)) + geom_violin()
```
Create a subset of the gapminder data frame containing just the rows for your country of birth
```{r}
# don't forget that R is case-sensitive!
uk_data <- filter(gapminder, country == "United Kingdom")
```
Has there been an increase in life expectancy over time?
- visualise the trend using a scatter plot (geom_point), line graph (geom_line) or smoothed line (geom_smooth).
```{r}
ggplot(uk_data, aes(x = year, y = lifeExp)) + geom_point()
```
```{r}
ggplot(uk_data, aes(x = year, y = lifeExp)) + geom_line()
```
```{r}
ggplot(uk_data, aes(x = year, y = lifeExp)) + geom_smooth()
```
```{r}
## can combine all plots
ggplot(uk_data, aes(x = year, y = lifeExp)) + geom_point() + geom_smooth()
```
Note: this exercise could also make use of the piping technique
```{r}
filter(gapminder, country == "United Kingdom") %>%
ggplot(aes(x = year, y = lifeExp)) + geom_point() + geom_smooth()
```
What happens when you modify the geom_boxplot example to compare the gdp distributions for different years?
- Look at the message ggplot2 prints above the plot and try to modify the code to give a separate boxplot for each year
```{r}
# this is how we might expect the code to look like
ggplot(gapminder, aes(x = year, y = gdpPercap)) + geom_boxplot()
```
The previous output hints that you might want to group by year - otherwise it thinks that year is a numerical variable
```{r}
ggplot(gapminder, aes(x = year, y = gdpPercap, group=year)) + geom_boxplot()
```
You will often see this alternative of using the as.factor function to make year into a categorical variable.
```{r}
ggplot(gapminder, aes(x = as.factor(year), y = gdpPercap)) + geom_boxplot()
```
## Homework
### Task 1
Add an extra column; the first letter of each country name. Assigning a new variable on each line
```{r}
gapminder2 <- mutate(gapminder, FirstLetter = substr(country, 1,1))
gapminder3 <- filter(gapminder2, FirstLetter == "Z")
gapminder3
```
A more efficient solution
```{r}
gapminder %>%
mutate(FirstLetter = substr(country,1,1)) %>%
filter(FirstLetter == "Z")
```
### Task 2 - Heatmap of life expectancy
```{r}
## Get the European countries
filter(gapminder, continent == "Europe") %>%
## make heatmap. See the fill aesthetic to be life expectancy
ggplot(aes(x=year,y=country,fill=lifeExp)) + geom_tile()
```