-
Notifications
You must be signed in to change notification settings - Fork 81
/
Copy patheda_checklist.Rmd
358 lines (216 loc) · 21.4 KB
/
eda_checklist.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
# Exploratory Data Analysis Checklist
```{r,echo=FALSE}
knitr::opts_chunk$set(comment = NA, prompt = TRUE, collapse = TRUE,
error = TRUE, warning = TRUE, message = TRUE)
options(width = 50)
```
In this chapter we will run through an informal "checklist" of things to do when embarking on an exploratory data analysis. As a running example I will use a dataset on hourly ozone levels in the United States for the year 2014. The elements of the checklist are
1. Formulate your question
2. Read in your data
3. Check the packaging
4. Run `str()`
5. Look at the top and the bottom of your data
6. Check your "n"s
7. Validate with at least one external data source
8. Try the easy solution first
9. Challenge your solution
10. Follow up
## Formulate your question
Formulating a question can be a useful way to guide the exploratory data analysis process and to limit the exponential number of paths that can be taken with any sizeable dataset. In particular, a *sharp* question or hypothesis can serve as a dimension reduction tool that can eliminate variables that are not immediately relevant to the question.
For example, in this chapter we will be looking at an air pollution dataset from the U.S. Environmental Protection Agency (EPA). A general question one could as is
> Are air pollution levels higher on the east coast than on the west coast?
But a more specific question might be
> Are hourly ozone levels on average higher in New York City than they are in Los Angeles?
Note that both questions may be of interest, and neither is right or wrong. But the first question requires looking at all pollutants across the entire east and west coasts, while the second question only requires looking at single pollutant in two cities.
It's usually a good idea to spend a few minutes to figure out what is the question you're *really* interested in, and narrow it down to be as specific as possible (without becoming uninteresting).
For this chapter, we will focus on the following question:
> Which counties in the United States have the highest levels of ambient ozone pollution?
As a side note, one of the most important questions you can answer with an exploratory data analysis is "Do I have the right data to answer this question?" Often this question is difficult ot answer at first, but can become more clear as we sort through and look at the data.
## Read in your data
The next task in any exploratory data analysis is to read in some data. Sometimes the data will come in a very messy format and you'll need to do some cleaning. Other times, someone else will have cleaned up that data for you so you'll be spared the pain of having to do the cleaning.
We won't go through the pain of cleaning up a dataset here, not because it's not important, but rather because there's often not much generalizable knowledge to obtain from going through it. Every dataset has its unique quirks and so for now it's probably best to not get bogged down in the details.
Here we have a relatively clean dataset from the U.S. EPA on hourly ozone measurements in the entire U.S. for the year 2014. The data are available from the EPA's [Air Quality System web page](https://aqs.epa.gov/aqsweb/airdata/download_files.html). I've simply downloaded the zip file from the web site, unzipped the archive, and put the resulting file in a directory called "data". If you want to run this code you'll have to use the same directory structure.
The dataset is a comma-separated value (CSV) file, where each row of the file contains one hourly measurement of ozone at some location in the country.
**NOTE**: Running the code below may take a few minutes. There are 7,147,884 rows in the CSV file. If it takes too long, you can read in a subset by specifying a value for the `n_max` argument to `read_csv()` that is greater than 0.
```{r,eval=FALSE}
library(readr)
ozone <- read_csv("data/hourly_44201_2014.csv",
col_types = "ccccinnccccccncnncccccc")
```
```{r,read RDS file,echo=FALSE}
ozone <- readRDS("data/hourly_ozone.rds")
```
The `readr` package by Hadley Wickham is a nice package for reading in flat files *very* fast, or at least much faster than R's built-in functions. It makes some tradeoffs to obtain that speed, so these functions are not always appropriate, but they serve our purposes here.
The character string provided to the `col_types` argument specifies the class of each column in the dataset. Each letter represents the class of a column: "c" for character, "n" for numeric", and "i" for integer. No, I didn't magically know the classes of each column---I just looked quickly at the file to see what the column classes were. If there are too many columns, you can not specify `col_types` and `read_csv()` will try to figure it out for you.
Just as a convenience for later, we can rewrite the names of the columns to remove any spaces.
```{r}
names(ozone) <- make.names(names(ozone))
```
## Check the packaging
Have you ever gotten a present *before* the time when you were allowed to open it? Sure, we all have. The problem is that the present is wrapped, but you desperately want to know what's inside. What's a person to do in those circumstances? Well, you can shake the box a bit, maybe knock it with your knuckle to see if it makes a hollow sound, or even weigh it to see how heavy it is. This is how you should think about your dataset before you start analyzing it for real.
Assuming you don't get any warnings or errors when reading in the dataset, you should now have an object in your workspace named `ozone`. It's usually a good idea to poke at that object a little bit before we break open the wrapping paper.
For example, you can check the number of rows and columns.
```{r}
nrow(ozone)
ncol(ozone)
```
Remember when I said there were 7,147,884 rows in the file? How does that match up with what we've read in? This dataset also has relatively few columns, so you might be able to check the original text file to see if the number of columns printed out (`r ncol(ozone)`) here matches the number of columns you see in the original file.
## Run `str()`
Another thing you can do is run `str()` on the dataset. This is usually a safe operation in the sense that even with a very large dataset, running `str()` shouldn't take too long.
```{r}
str(ozone)
```
The output for `str()` duplicates some information that we already have, like the number of rows and columns. More importantly, you can examine the *classes* of each of the columns to make sure they are correctly specified (i.e. numbers are `numeric` and strings are `character`, etc.). Because I pre-specified all of the column classes in `read_csv()`, they all should match up with what I specified.
Often, with just these simple maneuvers, you can identify potential problems with the data before plunging in head first into a complicated data analysis.
## Look at the top and the bottom of your data
I find it useful to look at the "beginning" and "end" of a dataset right after I check the packaging. This lets me know if the data were read in properly, things are properly formatted, and that everthing is there. If your data are time series data, then make sure the dates at the beginning and end of the dataset match what you expect the beginning and ending time period to be.
You can peek at the top and bottom of the data with the `head()` and `tail()` functions.
Here's the top.
```{r}
head(ozone[, c(6:7, 10)])
```
For brevity I've only taken a few columns. And here's the bottom.
```{r}
tail(ozone[, c(6:7, 10)])
```
I find `tail()` to be particularly useful because often there will be some problem reading the end of a dataset and if you don't check that you'd never know. Sometimes there's weird formatting at the end or some extra comment lines that someone decided to stick at the end.
Make sure to check all the columns and verify that all of the data in each column looks the way it's supposed to look. This isn't a foolproof approach, because we're only looking at a few rows, but it's a decent start.
## Check your "n"s
In general, counting things is usually a good way to figure out if anything is wrong or not. In the simplest case, if you're expecting there to be 1,000 observations and it turns out there's only 20, you know something must have gone wrong somewhere. But there are other areas that you can check depending on your application. To do this properly, you need to identify some *landmarks* that can be used to check against your data. For example, if you are collecting data on people, such as in a survey or clinical trial, then you should know how many people there are in your study. That's something you should check in your dataset, to make sure that you have data on all the people you thought you would have data on.
In this example, we will use the fact that the dataset purportedly contains *hourly* data for the *entire country*. These will be our two landmarks for comparison.
Here, we have hourly ozone data that comes from monitors across the country. The monitors should be monitoring continuously during the day, so all hours should be represented. We can take a look at the `Time.Local` variable to see what time measurements are recorded as being taken.
```{r}
table(ozone$Time.Local)
```
One thing we notice here is that while almost all measurements in the dataset are recorded as being taken on the hour, some are taken at slightly different times. Such a small number of readings are taken at these off times that we might not want to care. But it does seem a bit odd, so it might be worth a quick check.
We can take a look at which observations were measured at time "00:01".
```{r,warning=FALSE,message=FALSE}
library(dplyr)
filter(ozone, Time.Local == "13:14") %>%
select(State.Name, County.Name, Date.Local,
Time.Local, Sample.Measurement)
```
We can see that it's a monitor in Franklin County, New York and that the measurements were taken on September 30, 2014. What if we just pulled all of the measurements taken at this monitor on this date?
```{r}
filter(ozone, State.Code == "36"
& County.Code == "033"
& Date.Local == "2014-09-30") %>%
select(Date.Local, Time.Local,
Sample.Measurement) %>%
as.data.frame
```
Now we can see that this monitor just records its values at odd times, rather than on the hour. It seems, from looking at the previous output, that this is the only monitor in the country that does this, so it's probably not something we should worry about.
Since EPA monitors pollution across the country, there should be a good representation of states. Perhaps we should see exactly how many states are represented in this dataset.
```{r}
select(ozone, State.Name) %>% unique %>% nrow
```
So it seems the representation is a bit too good---there are 52 states in the dataset, but only 50 states in the U.S.!
We can take a look at the unique elements of the `State.Name` variable to see what's going on.
```{r}
unique(ozone$State.Name)
```
Now we can see that Washington, D.C. (District of Columbia) and Puerto Rico are the "extra" states included in the dataset. Since they are clearly part of the U.S. (but not official states of the union) that all seems okay.
This last bit of analysis made use of something we will discuss in the next section: external data. We knew that there are only 50 states in the U.S., so seeing 52 state names was an immediate trigger that something might be off. In this case, all was well, but validating your data with an external data source can be very useful.
## Validate with at least one external data source
Making sure your data matches something outside of the dataset is very important. It allows you to ensure that the measurements are roughly in line with what they should be and it serves as a check on what *other* things might be wrong in your dataset. External validation can often be as simple as checking your data against a single number, as we will do here.
In the U.S. we have national ambient air quality standards, and for ozone, the [current standard](http://www.epa.gov/ttn/naaqs/standards/ozone/s_o3_history.html) set in 2008 is that the "annual fourth-highest daily maximum 8-hr concentration, averaged over 3 years" should not exceed 0.075 parts per million (ppm). The exact details of how to calculate this are not important for this analysis, but roughly speaking, the 8-hour average concentration should not be too much higher than 0.075 ppm (it can be higher because of the way the standard is worded).
Let's take a look at the hourly measurements of ozone.
```{r}
summary(ozone$Sample.Measurement)
```
From the summary we can see that the maximum hourly concentration is quite high (`r max(ozone$Sample.Measurement)` ppm) but that in general, the bulk of the distribution is far below 0.075.
We can get a bit more detail on the distribution by looking at deciles of the data.
```{r}
quantile(ozone$Sample.Measurement, seq(0, 1, 0.1))
```
Knowing that the national standard for ozone is something like 0.075, we can see from the data that
* The data are at least of the right order of magnitude (i.e. the units are correct)
* The range of the distribution is roughly what we'd expect, given the regulation around ambient pollution levels
* Some hourly levels (less than 10%) are above 0.075 but this may be reasonable given the wording of the standard and the averaging involved.
## Try the easy solution first
Recall that our original question was
> Which counties in the United States have the highest levels of ambient ozone pollution?
What's the simplest answer we could provide to this question? For the moment, don't worry about whether the answer is correct, but the point is how could you provide *prima facie* evidence for your hypothesis or question. You may refute that evidence later with deeper analysis, but this is the first pass.
Because we want to know which counties have the *highest* levels, it seems we need a list of counties that are ordered from highest to lowest with respect to their levels of ozone. What do we mean by "levels of ozone"? For now, let's just blindly take the average across the entire year for each county and then rank counties according to this metric.
To identify each county we will use a combination of the `State.Name` and the `County.Name` variables.
```{r}
ranking <- group_by(ozone, State.Name, County.Name) %>%
summarize(ozone = mean(Sample.Measurement)) %>%
as.data.frame %>%
arrange(desc(ozone))
```
Now we can look at the top 10 counties in this ranking.
```{r}
head(ranking, 10)
```
It seems interesting that all of these counties are in the western U.S., with 4 of them in California alone.
For comparison we can look at the 10 lowest counties too.
```{r}
tail(ranking, 10)
```
Let's take a look at one of the higest level counties, Mariposa County, California. First let's see how many observations there are for this county in the dataset.
```{r}
filter(ozone, State.Name == "California" & County.Name == "Mariposa") %>% nrow
```
Always be checking. Does that number of observations sound right? Well, there's 24 hours in a day and 365 days per, which gives us `r 24 * 365`, which is close to that number of observations. Sometimes the counties use alternate methods of measurement during the year so there may be "extra" measurements.
We can take a look at how ozone varies through the year in this county by looking at monthly averages. First we'll need to convert the date variable into a `Date` class.
```{r}
ozone <- mutate(ozone, Date.Local = as.Date(Date.Local))
```
Then we will split the data by month to look at the average hourly levels.
```{r}
filter(ozone, State.Name == "California" & County.Name == "Mariposa") %>%
mutate(month = factor(months(Date.Local), levels = month.name)) %>%
group_by(month) %>%
summarize(ozone = mean(Sample.Measurement))
```
A few things stand out here. First, ozone appears to be higher in the summer months and lower in the winter months. Second, there are two months missing (November and December) from the data. It's not immediately clear why that is, but it's probably worth investigating a bit later on.
Now let's take a look at one of the lowest level counties, Caddo County, Oklahoma.
```{r}
filter(ozone, State.Name == "Oklahoma" & County.Name == "Caddo") %>% nrow
```
Here we see that there are perhaps fewer observations than we would expect for a monitor that was measuring 24 hours a day all year. We can check the data to see if anything funny is going on.
```{r}
filter(ozone, State.Name == "Oklahoma" & County.Name == "Caddo") %>%
mutate(month = factor(months(Date.Local), levels = month.name)) %>%
group_by(month) %>%
summarize(ozone = mean(Sample.Measurement))
```
Here we can see that the levels of ozone are much lower in this county and that also three months are missing (October, November, and December). Given the seasonal nature of ozone, it's possible that the levels of ozone are so low in those months that it's not even worth measuring. In fact some of the monthly averages are below the typical method detection limit of the measurement technology, meaning that those values are highly uncertain and likely not distinguishable from zero.
## Challenge your solution
The easy solution is nice because it is, well, easy, but you should never allow those results to hold the day. You should always be thinking of ways to challenge the results, especially if those results comport with your prior expectation.
Now, the easy answer seemed to work okay in that it gave us a listing of counties that had the highest average levels of ozone for 2014. However, the analysis raised some issues. For example, some counties do not have measurements every month. Is this a problem? Would it affect our ranking of counties if we had those measurements?
Also, how stable are the rankings from year to year? We only have one year's worth of data for the moment, but we could perhaps get a sense of the stability of the rankings by shuffling the data around a bit to see if anything changes. We can imagine that from year to year, the ozone data are somewhat different randomly, but generally follow similar patterns across the country. So the shuffling process could approximate the data changing from one year to the next. It's not an ideal solution, but it could give us a sense of how stable the rankings are.
First we set our random number generator and resample the indices of the rows of the data frame with replacement. The statistical jargon for this approach is a bootstrap sample. We use the resampled indices to create a new dataset, `ozone2`, that shares many of the same qualities as the original but is randomly perturbed.
```{r}
set.seed(10234)
N <- nrow(ozone)
idx <- sample(N, N, replace = TRUE)
ozone2 <- ozone[idx, ]
```
Now we can reconstruct our rankings of the counties based on this resampled data.
```{r}
ranking2 <- group_by(ozone2, State.Name, County.Name) %>%
summarize(ozone = mean(Sample.Measurement)) %>%
as.data.frame %>%
arrange(desc(ozone))
```
We can then compare the top 10 counties from our original ranking and the top 10 counties from our ranking based on the resampled data.
```{r}
cbind(head(ranking, 10),
head(ranking2, 10))
```
We can see that the rankings based on the resampled data (columns 4--6 on the right) are very close to the original, with the first 7 being identical. Numbers 8 and 9 get flipped in the resampled rankings but that's about it. This might suggest that the original rankings are somewhat stable.
We can also look at the bottom of the list to see if there were any major changes.
```{r}
cbind(tail(ranking, 10),
tail(ranking2, 10))
```
Here we can see that the bottom 7 counties are identical in both rankings, but after that things shuffle a bit. We're less concerned with the counties at the bottom of the list, but this suggests there is also reasonable stability.
## Follow up questions
In this chapter I've presented some simple steps to take when starting off on an exploratory analysis. The example analysis conducted in this chapter was far from perfect, but it got us thinking about the data and the question of interest. It also gave us a number of things to follow up on in case we continue to be interested in this question.
At this point it's useful to consider a few followup questions.
1. **Do you have the right data?** Sometimes at the conclusion of an exploratory data analysis, the conclusion is that the dataset is not really appropriate for this question. In this case, the dataset seemed perfectly fine for answering the question of which counties had the highest levels of ozone.
2. **Do you need other data?** One sub-question we tried to address was whether the county rankings were stable across years. We addressed this by resampling the data once to see if the rankings changed, but the better way to do this would be to simply get the data for previous years and re-do the rankings.
3. **Do you have the right question?** In this case, it's not clear that the question we tried to answer has immediate relevance, and the data didn't really indicate anything to increase the question's relevance. For example, it might have been more interesting to assess which counties were in violation of the national ambient air quality standard, because determining this could have regulatory implications. However, this is a much more complicated calculation to do, requiring data from at least 3 previous years.
The goal of exploratory data analysis is to get you thinking about your data and reasoning about your question. At this point, we can refine our question or collect new data, all in an iterative process to get at the truth.