-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path3_analyzing_data.Rmd
180 lines (119 loc) · 6.54 KB
/
3_analyzing_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
# R Bootcamp Tutorial
## Analyzing Data
In this tutorial, we'll dig into the dairy data to learn from it and produce some summaries. In this tutorial we'll cover:
1. Merges
2. Summary Statistics
3. Mutate
4. `group_by()` and `summarise()`
---
## Introduction
Now that the data is clean, we can analyze it. Let's repeat the steps from the cleaning tutorial, but without filtering for "Canada" as the location so we can get data from regions across the country. We will also remove columns including Canada as the location. Run the below code to generate the dataframe that we will analyze.
```{r, include = FALSE}
# Repeat data cleaning steps from lesson 2
library(cansim) # read in CODR/NDM tables
library(ggplot2)
library(tidyverse)
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
# Load cansim data
df <- get_cansim("32-10-0001-01")
for (col in colnames(df)){ # loop through all our columns
if (all(is.na(df[col]))) # check if NAs appear in the respective column in ALL rows
{print(col) # if so, tell us which column
df <- select(df, -(col)) # drop that column from our dataframe
}
}
# drop ID columns
df <- select(df, -(ends_with("ID")))
# Filter for total creamery butter stocks and remove the country totals
df <- filter(df, Commodity=="Creamery butter", GEO != "Canada", Stocks == "Total stocks")
```
## 1. Merges
Merges (also known as joins) allow you to join datasets (`x` and `y`) by mutual columns. Take for example, if you had a list of everyone's favorite flower and a separate list of everyone's addresses. If you wanted to generate a single list with which flower your should deliver where, **merging** the two lists by name would help.
There are 4 main types of joins:
1. Natural (also known as inner) join: Keeps only rows that match from both dataframes
2. Full (also known as outer) join: Keeps all rows from both dataframes
3. Left join: Includes all rows of data frame `x` and only those from `y` that match
4. Right join: Includes all rows from data frame `y` and only those from `x` that match.
![Join Types (R for Data Science)](../static/images/join_types.png)
Let's experiment with a **left join** on our dataset. If you take a look at `df`, you will notice that between certain years, the Atlantic Provinces are listed together rather than separately. To see how creamery production (`x`) has changed over time in the regions, let's assign all of the other province data a region. To do this, we will need a list of which provinces belong to which regions (`y`). For this, we can load in a province key (`prov_data.csv`). The file encoding argument avoids any special characters being loaded in when we import the file.
```{r}
# Load province data
prov_data_csv <- read.csv("../static/prov_data.csv", fileEncoding = 'UTF-8-BOM')
```
Now, we will do a left join with the `merge()` function. As was mentioned before, a left join will keep rows from `x` along with only the columns from `y` that match.
```{r}
# left join province info in with dplyr left_join
df_regions <- left_join(x = df,
y = prov_data_csv,
by = c("GEO" = "Province"))
```
#### Exercise 1 {style="color:blue;"}
Read the `dplyr` documentation and find out how to do a right join on the data instead. What happens?
```{r}
```
#### Exercise 2 {style="color:blue;"}
Congrats - you've mastered merges! However, there's one last puzzle to solve. There are some NA's in our `df_region` data. Can you find out where and why?
<!--It seems that the table was inconsistent in its capitalization of the provinces. We can sub those out with the proper versions with a `str_replace_all()` command.-->
Run the below code to fix this:
```{r}
# Make any NA columns in Regions carry over from GEO
for (i in 1:nrow(df_regions)){
if(is.na(df_regions$Region[i])){
df_regions$Region[i] <- df_regions$GEO[i]
}
}
# Fix capitalization errors in Regions
df_regions$Region <- str_replace_all(df_regions$Region, c("Atlantic provinces" = "Atlantic Provinces"))
```
## 2. Summary Statistics
Let's take a deep dive into the dairy data for Alberta over all the years.
```{r}
creamery_Alberta <- filter(df_regions, GEO == "Alberta")
```
You can do some summary stats on the dataframe to find the average (mean) and standard deviation of stocks in Alberta.
```{r}
# Get the mean of Alberta stocks
mean(creamery_Alberta$VALUE)
```
Note the error message that this command gives. You can ignore NA's by adding `na.rm = TRUE`
```{r}
# Get the mean of Alberta stocks
mean(creamery_Alberta$VALUE, na.rm = TRUE)
```
#### Exercise 3 {style="color:blue;"}
Use `sd()` to calculate the standard deviation in Alberta's creamery stocks.
## 3. Mutate
`mutate()` is a `dplyr` function that allows us to create new columns by modifying existing ones.
Let's make a column for year by grabbing the first 4 characters of the `REF_DATE` column in `df_regions`. As part the `dplyr` universe, `mutate()` can be used with pipe (`%>%`) syntax, as below:
You can think of the pipe as an "and then" statement
```{r}
# Make a year column
df_regions <- df_regions %>% mutate(year = substr(REF_DATE,1,4))
```
So, the above line reads as: take `df_regions` and then mutate `year` to be the first 4 characters of `REF_DATE`.
## 4. `group_by()` and `summarise()`
In our `df_regions` dataframe, we now wish to "collapse" the table and get metrics for entire regions. How would we calculate creamery stock per region per year? That's where `group_by()` comes along. `group_by()` can be chained along with `mutate()` with pipes to create some powerful summary tables. Check out the code below, which collapses our data by year and region.
```{r}
# Avg value by year
df_summary <- df_regions %>% group_by(year, Region) %>%
summarize(mean = mean(VALUE, na.rm = TRUE))
```
Let's save this dataframe to the static file to use in the next lesson
```{r}
write.csv(df_summary, file = "../static/df_summary.csv")
```
There are quite a few built in summary stats we may want to use:
- Center: `mean()`, `median()`
- Spread: `sd()`, `IQR()`, `mad()`
- Range: `min()`, `max()`,
- Position: `first()`, `last()`, `nth()`,
- Count: `n()`, `n_distinct()`
- Logical: `any()`, `all()`
We can stack multiple summary functions inside the same `summarize()` function:
```{r}
df_summary <- df_regions %>% group_by(year, Region) %>%
summarize(mean = mean(VALUE, na.rm = TRUE), sd = sd(VALUE, na.rm =TRUE))
```
#### Exercise 4 {style="color:blue;"}
Create a dataframe containing only data from Alberta that groups by month rather than year!
Hint: Use `filter()`, `summarise()`, and `mutate()`.