---
title: "cm010 Exercises: Tibble Joins"
output:
  html_document:
    keep_md: true
    theme: paper
---
## Requirements
You will need Joey's `singer` R package for this exercise. To install it, you first need `devtools`. Running this code in your console should do the trick:
```
install.packages("devtools")
devtools::install_github("JoeyBernhardt/singer")
```
Load required packages:
```{r, echo = FALSE, warning = FALSE, message = FALSE}
library(tidyverse)
library(singer)
knitr::opts_chunk$set(fig.width=4, fig.height=3, warning = FALSE, fig.align = "center")
```
<!---The following chunk allows errors when knitting--->
```{r allow errors, echo = FALSE}
knitr::opts_chunk$set(error = TRUE)
```
## Exercise 1: `singer`
The package `singer` comes with two smallish data frames about songs. Let's take a look at them (after minor modifications by renaming and shuffling):
```{r}
(time <- as_tibble(songs) %>%
   rename(song = title))
```
```{r}
(album <- as_tibble(locations) %>%
   select(title, everything()) %>%
   rename(album = release,
          song  = title))
```
1. We really care about the songs in `time`. But for which of those songs do we know the corresponding album?
```{r}
time %>%
  semi_join(album, by = c("song", "artist_name")) # together, these are unique identifiers
```
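A minimal sketch of why a filtering join fits this kind of question, using hypothetical toy tibbles (`toy_songs` and `toy_albums` are made up for illustration and are not part of `singer`): `semi_join()` returns only rows and columns of the left table, whereas `inner_join()` adds columns and repeats a left row once per match.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data; not taken from the singer package.
toy_songs  <- tibble(song = c("a", "b", "c"), year = c(1990, 2000, 2010))
toy_albums <- tibble(song = c("a", "a", "b"), album = c("X", "Y", "Z"))

# Filtering join: rows of toy_songs that have a match, columns unchanged.
kept <- semi_join(toy_songs, toy_albums, by = "song")

# Mutating join: adds the album column and repeats song "a" once per match.
merged <- inner_join(toy_songs, toy_albums, by = "song")

nrow(kept)    # 2
nrow(merged)  # 3
```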
2. Go ahead and add the corresponding albums to the `time` tibble, being sure to preserve rows even if album info is not readily available.
```{r}
time %>%
  left_join(album, by = c("song", "artist_name")) # keeps every row of time; unmatched songs get NA
```
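Preserving rows that lack a match is what distinguishes `left_join()` from `inner_join()`. The sketch below assumes two hypothetical toy tibbles (`lj_songs` and `lj_albums`, invented for illustration) and compares the row counts of the two joins.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data; not taken from the singer package.
lj_songs  <- tibble(song = c("a", "b", "c"), year = c(1990, 2000, 2010))
lj_albums <- tibble(song = c("a", "b"), album = c("X", "Z"))

# Every row of lj_songs survives; song "c" gets album = NA.
lj_result <- left_join(lj_songs, lj_albums, by = "song")
lj_result

# inner_join() drops song "c" because it has no match.
ij_result <- inner_join(lj_songs, lj_albums, by = "song")

nrow(lj_result)  # 3
nrow(ij_result)  # 2
```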
3. For which songs do we have "year", but not album info?
```{r}
time %>%
  anti_join(album, by = c("song", "artist_name")) # songs in time with no match in album
```
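Questions of the form "in the left table but missing from the right" are usually answered with `anti_join()`. A small sketch, assuming hypothetical toy tibbles `aj_songs` and `aj_albums` that are not part of `singer`:

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data; not taken from the singer package.
aj_songs  <- tibble(song = c("a", "b", "c"), year = c(1990, 2000, 2010))
aj_albums <- tibble(song = c("a", "b"), album = c("X", "Z"))

# Keep only the rows of aj_songs that have no match in aj_albums.
no_album <- anti_join(aj_songs, aj_albums, by = "song")

no_album$song  # "c"
```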
4. Which artists are in `time`, but not in `album`?
```{r}
time %>%
  anti_join(album, by = "artist_name")
```
5. You've come across these two tibbles, and just wish all the info was available in one tibble. What would you do?
```{r}
time %>%
  full_join(album, by = c("song", "artist_name"))
```
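To see what "all the info in one tibble" means in practice, here is a sketch with hypothetical toy tibbles (`fj_songs` and `fj_albums` are invented for illustration): `full_join()` keeps unmatched rows from both sides and fills the missing columns with `NA`.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data; not taken from the singer package.
fj_songs  <- tibble(song = c("a", "b", "c"), year = c(1990, 2000, 2010))
fj_albums <- tibble(song = c("b", "c", "d"), album = c("X", "Y", "Z"))

# Song "a" has no album and song "d" has no year; both rows are kept.
fj_result <- full_join(fj_songs, fj_albums, by = "song")
fj_result

nrow(fj_result)  # 4
```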
## Exercise 2: LOTR
Load in the three Lord of the Rings tibbles that we saw last time:
```{r}
fell <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Fellowship_Of_The_Ring.csv")
ttow <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Two_Towers.csv")
retk <- read_csv("https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Return_Of_The_King.csv")
```
1. Combine these into a single tibble.
```{r}
bind_rows(fell, ttow, retk)
```
2. Which races are present in "The Fellowship of the Ring" (`fell`), but not in any of the other ones?
```{r}
fell %>%
  anti_join(ttow, by = "Race") %>%
  anti_join(retk, by = "Race")
```
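Chaining `anti_join()` calls removes, step by step, every race that also occurs in another table. The sketch below uses hypothetical stand-in tibbles (`film_a`, `film_b`, `film_c`, with a made-up `Words` column) rather than the real LOTR data, and cross-checks the result against base R set functions on plain vectors.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for three films.
film_a <- tibble(Race = c("Elf", "Hobbit", "Ent"), Words = c(10, 20, 30))
film_b <- tibble(Race = c("Elf", "Hobbit"), Words = c(5, 15))
film_c <- tibble(Race = c("Elf", "Man"), Words = c(7, 9))

# Each anti_join() drops the races that also occur in another film.
only_a <- film_a %>%
  anti_join(film_b, by = "Race") %>%
  anti_join(film_c, by = "Race")

# The same answer from plain vectors, using base R set functions.
only_a_vec <- base::setdiff(film_a$Race, base::union(film_b$Race, film_c$Race))

only_a$Race  # "Ent"
only_a_vec   # "Ent"
```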
## Exercise 3: Set Operations
Let's use three set functions: `intersect`, `union` and `setdiff`. We'll work with two toy tibbles named `y` and `z`, similar to those in the Data Wrangling Cheatsheet.
```{r}
(y <- tibble(x1 = LETTERS[1:3], x2 = 1:3))
```
```{r}
(z <- tibble(x1 = c("B", "C", "D"), x2 = 2:4))
```
1. Rows that appear in both `y` and `z`
```{r}
intersect(y, z)
inner_join(y, z)
```
2. You collected the data in `y` on Day 1, and `z` on Day 2. Make a data set to reflect that.
```{r}
bind_rows(
  mutate(y, day = "Day 1"),
  mutate(z, day = "Day 2")
)
```
3. The rows contained in `z` are bad! Remove those rows from `y`.
```{r}
setdiff(y, z)
anti_join(y, z)
```