---
title: "ch7_eda"
author: "jason grahn"
date: "8/16/2018"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
#7 Exploratory Data Analysis
##7.1 Introduction
From the text:
"This chapter will show you how to use visualisation and transformation to explore your data in a systematic way... EDA is an iterative cycle. You:
1. Generate questions about your data.
2. Search for answers by visualising, transforming, and modelling your data.
3. Use what you learn to refine your questions and/or generate new questions.
During the initial phases of EDA you should feel free to investigate every idea that occurs to you... EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualisation, transformation, and modelling."
###7.1.1 Prerequisites
This chapter primarily uses dplyr and ggplot2, so load the tidyverse!
```{r}
library(tidyverse)
```
##7.2 Questioning
"Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey"
The goal of EDA is to understand the data! EDA is a creative process. Two main types of questions will always be useful:
1. What type of variation occurs within my variables?
2. What type of covariation occurs between my variables?
define some terms:
A *variable* is a quantity, quality, or property that you can measure.
A *value* is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
An *observation* is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a data point.
*Tabular data* is a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.
##7.3 Variation
"Variation is the tendency of the values of a variable to change from measurement to measurement... The best way to understand that pattern is to visualise the distribution of the variable’s values."
###7.3.1 Visualising distributions
visualisation depends on whether the data is categorical or continuous.
*categorical* data can only take one of a small set of values. these are typically factors or character vectors.
*categorical* distributions are shown with bar charts, where bar height equals the count of observations at each value (want those counts manually? use dplyr::count() )
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut)) +
theme_light()
```
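The same counts can be computed directly with dplyr::count(), as mentioned above:
```{r}
# count the number of diamonds in each cut category
diamonds %>%
  count(cut)
```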
*continuous* data can take any of an infinite set of ordered values. these are typically numeric. histograms are a great start for evaluating the distribution of continuous data, where bar height is the count of values that fall into each _bucket_. Manually computable by combining dplyr::count() and ggplot2::cut_width():
```{r}
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.5) +
theme_light()
```
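And the manual version of those histogram counts, combining dplyr::count() with ggplot2::cut_width():
```{r}
# bin carat into 0.5-wide intervals and count how many diamonds fall in each
diamonds %>%
  count(cut_width(carat, 0.5))
```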
"always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns."
###7.3.2 Typical values
http://r4ds.had.co.nz/exploratory-data-analysis.html
with bar charts and histograms, tall bars show the common values and short bars show the rare ones. Sometimes the short bars are exactly what we want to look at. Here are a few questions to help explore and identify problems:
* Which values are the most common? Why?
* Which values are rare? Why? Does that match your expectations?
* Can you see any unusual patterns? What might explain them?
```{r}
#As an example, the histogram below suggests several interesting questions:
# Why are there more diamonds at whole carats and common fractions of carats?
# Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
# Why are there no diamonds bigger than 3 carats?
smaller <- diamonds %>%
filter(carat < 3)
# use the smaller dataset to build a plot, universally map the x axis to carat
ggplot(data = smaller, mapping = aes(x = carat)) +
# make a histogram with a binwidth of .01
geom_histogram(binwidth = 0.01)
```
If we see clusters (and we do), subgroups may be present in the data, and we may need to find another way to subset it. This takes familiarity with the data and should prompt us to look at relationships between variables:
* How are the observations within each cluster similar to each other?
* How are the observations in separate clusters different from each other?
* How can you explain or describe the clusters?
* Why might the appearance of clusters be misleading?
###7.3.3 Unusual values
Unusual values don't necessarily mean outliers, but they OFTEN are outliers.
```{r}
#take the distribution of the y variable from the diamonds dataset. The only evidence of outliers is the unusually wide limits on the x-axis.
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
#scale_y_continuous(trans="log10")
# change the y-axis to make shorter columns more visible
coord_cartesian(ylim = c(0, 50))
```
```{r}
#This allows us to see that there are three unusual values: 0, ~30, and ~60. We pluck them out with dplyr:
unusual <- diamonds %>%
filter(y < 3 | y > 20) %>%
select(price, x, y, z) %>%
arrange(y)
unusual
```
If we have outliers, repeat the analysis with them removed to see whether the results change.
"if they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up."
###7.3.4 Exercises
1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth. (A sketch for this one follows the list.)
2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
4. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
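For exercise 1, a minimal sketch that overlays the three dimensions as frequency polygons (the filtering cutoffs are my own assumption, just to keep the axis readable):
```{r}
# compare the distributions of x, y, and z on one plot
diamonds %>%
  select(x, y, z) %>%
  gather(dimension, value) %>%        # long format: one row per measurement
  filter(value > 0, value < 20) %>%   # drop the zero / implausibly large values
  ggplot(aes(x = value, colour = dimension)) +
  geom_freqpoly(binwidth = 0.1)
```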
##7.4 Missing values
Unusual values give you two options:
1. Drop the rows with strange values - not recommended. Just because one value in a row is invalid doesn't mean the others are, and with low-quality data dropping every such row might leave us with no data at all!
2. The better alternative: replace the unusual values with _missing_ values using `mutate()` + `ifelse()`.
```{r}
# make a new dataset from the diamonds data set
diamonds2 <- diamonds %>%
# modify 'y' to change anything less than 3 or greater than 20 with NA
mutate(y = ifelse(y < 3 | y > 20, NA, y))
```
`case_when()` can also be used; it's especially useful inside mutate() when the new variable depends on a complex combination of existing variables.
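A minimal sketch of the same replacement written with case_when() (same cutoffs as the ifelse() version above; just an illustration):
```{r}
# flag the implausible y values as missing, this time with case_when()
diamonds2 <- diamonds %>%
  mutate(y = case_when(
    y < 3 | y > 20 ~ NA_real_,  # implausible widths become NA
    TRUE           ~ y          # everything else stays as-is
  ))
```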
*Sometimes missing values are important!*
"For example, in nycflights13::flights, missing values in the dep_time variable indicate that the flight was cancelled... to compare the scheduled departure times for cancelled and non-cancelled times, make a new variable with is.na()."
```{r}
#using the flights dataset
nycflights13::flights %>%
#make new variables to find the difference between flights cancelled and those not.
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
#and show me a pretty picture of the trend
ggplot(mapping = aes(sched_dep_time)) +
geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
```
###7.4.1 Exercises
What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
Missing values are removed from a histogram (with a warning), because a histogram needs a numeric value to place in a bin; in a bar chart, NA is treated as just another category and gets its own bar. That's the difference.
What does na.rm = TRUE do in mean() and sum()?
`na.rm = TRUE` strips the NA values before the computation; with the default `na.rm = FALSE`, the NAs are left in place and mean() or sum() returns NA.
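A quick illustration with a toy vector (my own example, not from the book):
```{r}
x <- c(1, 2, NA, 4)
mean(x)                # NA: one missing value poisons the whole result
mean(x, na.rm = TRUE)  # 2.33: the NA is stripped before averaging
sum(x, na.rm = TRUE)   # 7
```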
##7.5 Covariation
*Covariation* describes behaviour between variables: the tendency for the values of two or more variables to vary together in a related way. The easiest way to spot covariation is to visualise the relationship.
"Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the *boxplot.* A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:
* A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). A line in the middle of the box marks the median. Together the box and median line show the spread of the distribution and whether it is symmetric about the median.
* Points that display observations falling more than 1.5 times the IQR from either edge of the box. These outlying points are unusual, so they are plotted individually. (Outliers!)
* A line (or whisker) that extends from each end of the box to the farthest non-outlier point in the distribution."
Ordered factors: `cut` is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the `reorder()` function.
```{r}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
#To make the trend easier to see, we can reorder class based on the median value of hwy:
ggplot(data = mpg) +
#reorder the X axis based on the median of each grouping in X
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
#coord_flip() will flip your X and Y axis which is good when you have long variable names in X.
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
coord_flip()
```
###7.5.1.1 Exercises
1. Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.
```{r}
#using the flights dataset
nycflights13::flights %>%
#make new variables to find the difference between flights cancelled and those not.
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
group_by(month, day, cancelled) %>%
ggplot(aes(cancelled, sched_dep_time)) +
geom_boxplot()
```
2. What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
```{r}
diamonds %>%
ggplot(aes(cut, carat)) +
geom_boxplot()
diamonds %>%
ggplot(aes(carat, colour = cut)) +
geom_density(position = "dodge")
diamonds %>%
group_by(cut) %>%
summarise(cor(carat, price))
```
3. Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?
```{r}
require(ggstance)
diamonds %>%
ggplot(aes(cut, carat)) +
geom_boxplot() +
coord_flip()
diamonds %>%
ggplot(aes(carat, cut)) +
geom_boxploth()
# The plots look the same; the difference is in the geom: geom_boxplot*h* maps the categorical variable to y and is horizontal by default, so no coord_flip() is needed.
```
4. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?
```{r}
library(lvplot)
diamonds %>%
ggplot() +
geom_lv(aes(x = cut, y = price),
na.rm = TRUE,
outlier.colour = "red")
```
5. Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
6. If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.
The ggbeeswarm package (see its README on GitHub) provides geom_quasirandom(), which offsets points with jitter-like quasirandom spacing, and geom_beeswarm(), which offsets points just enough that they don't overlap, producing a beeswarm layout.
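A quick sketch of the two main geoms (assuming ggbeeswarm is installed):
```{r}
library(ggbeeswarm)
# geom_quasirandom(): jitter-like horizontal offsets from a quasirandom sequence
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_quasirandom()
# geom_beeswarm(): offsets points just enough that they don't overlap
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_beeswarm()
```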
###7.5.2 Two categorical variables
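To visualise covariation between two categorical variables, count the observations for each combination of values; a minimal sketch following the book's diamonds example:
```{r}
# geom_count() sizes each point by how many observations share that combination
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))
# or compute the counts with dplyr and draw them as a tile map
diamonds %>%
  count(color, cut) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = n))
```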
###7.5.3 Two continuous variables
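For two continuous variables the scatterplot is the usual starting point; with many points, transparency or 2d binning helps (a sketch following the book's diamonds example, reusing the `smaller` dataset from above):
```{r}
# transparency handles the overplotting of ~54k points
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
# bin in two dimensions instead of drawing every point
ggplot(data = smaller) +
  geom_bin2d(mapping = aes(x = carat, y = price))
```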
##7.6 Patterns and models
Look for patterns (duh). If you see a pattern, ask:
* Could this pattern be due to coincidence (i.e. random chance)?
* How can you describe the relationship implied by the pattern?
* How strong is the relationship implied by the pattern?
* What other variables might affect the relationship?
* Does the relationship change if you look at individual subgroups of the data?
```{r}
#looking at old faithful data shows a relationship between eruption wait time and eruption time
ggplot(data = faithful) +
geom_point(mapping = aes(x = eruptions, y = waiting))
```
Patterns reveal covariation! If variation creates uncertainty, covariation reduces it: when two variables covary, the values of one help predict the values of the other. _If_ the covariation is due to a causal relationship (a special case), you can use the value of one variable to control the other. Models are a tool for extracting that kind of pattern; here we use a linear model.
This code fits a model that predicts `price` from `carat`, then computes the residuals (the difference between the predicted and actual values). The residuals give us a view of the price of a diamond once the effect of carat has been removed.
```{r model price and carat}
library(modelr) #use the modelR library
#build a linear model that predicts the log of Price based on the log of carat using the diamonds data
mod <- lm(log(price) ~ log(carat), data = diamonds)
#make a new df using the diamonds data
diamonds2 <- diamonds %>%
#add the residuals from the model
add_residuals(mod) %>%
#then exponentiate the residuals (the model was fit on log(price)) to put resid back on the original scale
mutate(resid = exp(resid))
#then show me a scatter plot of carats against the residuals
ggplot(data = diamonds2) +
geom_point(mapping = aes(x = carat, y = resid))
```
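Following the book, once the strong carat-price relationship is removed, re-examining cut against the residuals shows that, relative to their size, better-quality diamonds are more expensive:
```{r}
#boxplot of the carat-adjusted price (resid) by cut
ggplot(data = diamonds2) +
  geom_boxplot(mapping = aes(x = cut, y = resid))
```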