# (PART) Part IV: Advanced {-}
# Entering and cleaning data #3
[Download](https://github.com/geanders/RProgrammingForResearch/raw/master/slides/CourseNotes_Week10new.pdf) a pdf of the lecture slides covering this topic.
```{r results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
library(knitr)
library(dplyr)
library(readr)
```
## Pulling online data
### APIs
*APIs* are "Application Program Interfaces".
An API provides the rules for software applications to interact. In the case of open data APIs, they provide the rules you need to know to write R code to request and pull data from the organization's web server into your R session.
Often, an API can help you avoid downloading all available data, and instead only download the subset you need.
The basic strategy for using APIs from R is:
- Figure out the API rules for HTTP requests
- Write R code to create a request in the proper format
- Send the request using GET or POST HTTP methods
- Once you get back data from the request, parse it into an easier-to-use format if necessary
Start by reading any documentation available for the API. This will often give information on what data is available and how to put together requests.
```{r echo = FALSE, out.width = "1.1\\textwidth", fig.align = "center"}
knitr::include_graphics("figures/NASA_API_documentation.png")
```
Source: https://api.nasa.gov/api.html#EONET
Many organizations will require you to get an API key and use this key in each of your API requests. This key allows the organization to control API access, including enforcing rate limits per user. API rate limits restrict how often you can request data (e.g., an hourly limit of 1,000 requests per user for NASA APIs).
You should keep this key private. In particular, make sure you do not include it in code that is posted to GitHub.
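One common way to keep a key out of your scripts is to store it in your `.Renviron` file and read it at run time with `Sys.getenv`. The sketch below assumes a hypothetical key name (`MY_API_KEY`) and a hypothetical API endpoint that expects an `api_key` query parameter; the details will differ for any real API.
```{r eval = FALSE}
# In your .Renviron file (NOT in your script), add a line like:
# MY_API_KEY=abc123

library(httr)

# Read the key at run time, so it never appears in code you share
my_key <- Sys.getenv("MY_API_KEY")

# Hypothetical request to an API that expects an `api_key` query parameter
response <- GET(url = "https://api.example.org/data",
                query = list(api_key = my_key,
                             start_date = "2016-06-01"))
```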
The `riem` package, developed by Maelle Salmon and published through rOpenSci, is an excellent and straightforward example of how you can use R to pull open data through a web API.
This package allows you to pull weather data from airports around the world directly from the [Iowa Environmental Mesonet](https://mesonet.agron.iastate.edu).
To get a certain set of weather data from the Iowa Environmental Mesonet, you can send an HTTP request specifying a base URL, "https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py/", as well as some parameters describing the subset of the dataset you want (e.g., date ranges, weather variables, output format). \bigskip
Once you know the rules for the names and possible values of these parameters (more on that below), you can submit an HTTP GET request using the `GET` function from the `httr` package.
```{r echo = FALSE, fig.align = "center"}
knitr::include_graphics("figures/mesonet_example.png")
```
https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py?station=DEN&data=sknt&year1=2016&month1=6&day1=1&year2=2016&month2=6&day2=30&tz=America%2FDenver&format=comma&latlon=no&direct=no&report_type=1&report_type=2
When you are making an HTTP request using the `GET` or `POST` functions from the `httr` package, you can include the key-value pairs for any query parameters as a list object in the `query` argument of the function.
```{r}
library(httr)
meso_url <- paste0("https://mesonet.agron.iastate.edu/",
"cgi-bin/request/asos.py/")
denver <- GET(url = meso_url,
query = list(station = "DEN", data = "sped",
year1 = "2016", month1 = "6",
day1 = "1", year2 = "2016",
month2 = "6", day2 = "30",
tz = "America/Denver",
format = "comma"))
```
The `GET` call will return a special type of list object with elements that include the url you queried and the content of the page at that url:
```{r}
str(denver, max.level = 1, list.len = 6)
```
The `httr` package includes functions to pull out elements of this list object, including:
- `headers`: Pull out the header information
- `content`: Pull out the content returned from the page
- `status_code`: Pull out the status code from the `GET` request (e.g., 200: okay; 404: not found)
*Note:* For some fun examples of 404 pages, see https://www.creativebloq.com/web-design/best-404-pages-812505
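For example, you could apply these functions to the `denver` object created above to check that the request succeeded before parsing its content. This is a minimal sketch; the exact header fields you see will depend on the server.
```{r eval = FALSE}
library(httr)

status_code(denver)   # 200 means the request succeeded
headers(denver)       # list of header fields returned by the server

# It is often safer to check the status before trying to parse the content
if (status_code(denver) == 200) {
  denver_text <- content(denver, as = "text")
}
```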
You can use `content` from `httr` to retrieve the contents of the HTTP request we made. For this particular web data, the requested data is a comma-separated file, so you can convert it to a dataframe with `read_csv`:
```{r}
denver %>% content() %>%
read_csv(skip = 5, na = "M") %>%
slice(1:3)
```
The `riem` package wraps up this whole process, so you can call a single function to get the data you want from the API:
```{r}
library(riem)
denver_2 <- riem_measures(station = "DEN",
date_start = "2016-06-01",
date_end = "2016-06-30")
denver_2 %>% slice(1:3)
```
## Example R API wrapper packages
The `tigris` package is a very useful example of an API wrapper. It retrieves geographic
boundary data from the U.S. Census for a number of different geographies:
- Location boundaries
+ States
+ Counties
+ Blocks
    + Tracts
+ School districts
+ Congressional districts
- Roads
+ Primary roads
+ Primary and secondary roads
- Water
+ Area-water
+ Linear-water
+ Coastline
- Other
+ Landmarks
+ Military
## `tigris` package
The following plot is an example of the kinds of maps you can create using the `tigris`
package. This map comes from: *Kyle Walker. 2016. "tigris: An R Package to Access and Work
with Geographic Data from the US
Census Bureau". The R Journal.* This is a great article to read to find out more about
`tigris`.
```{r echo = FALSE, fig.align = "center"}
knitr::include_graphics("figures/tigris_example.png")
```
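As a quick illustration of how `tigris` works, the sketch below pulls generalized county boundaries for Colorado and plots their outlines. This assumes a recent version of `tigris`, which returns `sf` objects by default; older versions returned `sp` objects, in which case the plotting step would differ.
```{r eval = FALSE}
library(tigris)
library(sf)

# Download generalized ("cartographic boundary") county borders for Colorado
co_counties <- counties(state = "CO", cb = TRUE)

# With an sf object, you can plot just the county outlines
plot(st_geometry(co_counties))
```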
A number of other R packages also help you access and use data from the U.S. Census:
- `acs`: Download, manipulate, and present American Community Survey and Decennial data from the US Census (see "Working with the American Community Survey in R: A Guide to Using the
acs Package", a book available free online through the CSU library)
- `USABoundaries`: Historical and contemporary boundaries of the United States of America
- `idbr`: R interface to the US Census Bureau International Data Base API (e.g., populations of other countries)
The organization rOpenSci (https://ropensci.org) has the following mission:
> "At rOpenSci we are creating packages that allow access to data repositories through the R statistical programming environment that is already a familiar part of the workflow of many scientists. Our tools not only facilitate drawing data into an environment where it can readily be manipulated, but also one in which those analyses and methods can be easily shared, replicated, and extended by other researchers."
rOpenSci collects a number of packages for tapping into open data for research. These are
listed at https://ropensci.org/packages. Many of these packages are wrappers for APIs with
data useful for scientific research.
Some examples (all descriptions from rOpenSci):
- `AntWeb`: Access data from the world's largest ant database
- `chromer`: Interact with the chromosome counts database (CCDB)
- `gender`: Encodes gender based on names and dates of birth
- `musemeta`: R Client for Scraping Museum Metadata, including The Metropolitan Museum of Art, the Canadian Science & Technology Museum Corporation, the National Gallery of Art, the Getty Museum, and more to come.
- `rusda`: Interface to some USDA databases
- `webchem`: Retrieve chemical information from many sources. Currently includes: Chemical Identifier Resolver, ChemSpider, PubChem, and Chemical Translation Service.
As an example, one rOpenSci package, `rnoaa`, allows you to:
> "Access climate data from NOAA, including temperature and precipitation, as well as sea ice cover data, and extreme weather events"
It includes access to:
- Buoy data from the National Buoy Data Center
- Historical Observing Metadata Repository (HOMR)--- climate station metadata
- National Climatic Data Center weather station data
- Sea ice data
- International Best Track Archive for Climate Stewardship (IBTrACS)--- tropical cyclone tracking data
- Severe Weather Data Inventory (SWDI)
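For example, a minimal sketch of pulling daily station data with `rnoaa` might look like the code below. The monitor ID is just an example GHCND station identifier; you would need to look up the IDs for the stations you actually want.
```{r eval = FALSE}
library(rnoaa)

# Daily precipitation and temperature for one example GHCND monitor
station_weather <- meteo_pull_monitors(monitors = "USW00023062",
                                       date_min = "2016-06-01",
                                       date_max = "2016-06-30",
                                       var = c("PRCP", "TMAX", "TMIN"))
```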
## `countyweather`
The `countyweather` package, developed by a student here at CSU, wraps the `rnoaa` package to
let you pull and aggregate weather data at the county level in the U.S. For example, you can pull
all weather data from Miami during Hurricane Andrew:
```{r echo = FALSE}
knitr::include_graphics("figures/countyweather2.png")
```
When you pull the data for a county, the package also maps the contributing weather stations:
```{r echo = FALSE}
knitr::include_graphics("figures/countyweather1.png")
```
The USGS also has a very nice collection of R packages that wrap USGS open data APIs, which
can be accessed through: https://owi.usgs.gov/R/
> "USGS-R is a community of support for users of the R scientific programming language. USGS-R resources include R training materials, R tools for the retrieval and analysis of USGS data, and support for a growing group of USGS-R developers. "
USGS R packages include:
- `dataRetrieval`: Obtain water quality sample data, streamflow data, and metadata directly from either the USGS or EPA
- `EGRET`: Analysis of long-term changes in water quality and streamflow, including the water-quality method Weighted Regressions on Time, Discharge, and Season (WRTDS)
- `laketemps`: Lake temperature data package for Global Lake Temperature Collaboration Project
- `lakeattributes`: Common useful lake attribute data
- `soilmoisturetools`: Tools for soil moisture data retrieval and visualization
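As an illustration, here is a hedged sketch of pulling daily streamflow with `dataRetrieval`. The site number is an example USGS gage ID, and `"00060"` is the USGS parameter code for discharge; check the package documentation for the site numbers and parameter codes you need.
```{r eval = FALSE}
library(dataRetrieval)

# Daily streamflow (parameter code "00060" = discharge) for one example gage
flow_data <- readNWISdv(siteNumbers = "06759500",
                        parameterCd = "00060",
                        startDate = "2016-06-01",
                        endDate = "2016-06-30")
```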
Here are some examples of other R packages that facilitate use of an API for open data:
- `twitteR`: Twitter
- `Quandl`: Quandl (financial data)
- `RGoogleAnalytics`: Google Analytics
- `WDI`, `wbstats`: World Bank
- `GuardianR`, `rdian`: The Guardian Media Group
- `blsAPI`: Bureau of Labor Statistics
- `rtimes`: New York Times
Find out more about writing API packages with this vignette for the httr package: https://cran.r-project.org/web/packages/httr/vignettes/api-packages.html. \bigskip
This document includes advice on error handling within R code that accesses data through an open API.
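For example, when you wrap an API call inside your own function, you can check the HTTP status of the response and fail with an informative error rather than trying to parse an error page. Below is a minimal sketch using `httr` and the Iowa Environmental Mesonet URL from earlier in this chapter.
```{r eval = FALSE}
library(httr)

get_mesonet <- function(query_params) {
  meso_url <- "https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py/"
  resp <- GET(url = meso_url, query = query_params)

  # stop_for_status() turns HTTP errors (e.g., 404, 500) into R errors
  stop_for_status(resp)

  content(resp, as = "text")
}
```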
## Cleaning very messy data
One version of Atlantic basin hurricane tracks is available here: https://www.nhc.noaa.gov/data/hurdat/hurdat2-1851-2017-050118.txt. The data is not in a classic delimited format:
```{r echo = FALSE, fig.align = "center", out.width = "\\textwidth"}
knitr::include_graphics("figures/hurrtrackformat.png")
```
This data is formatted in the following way:
- Data for many storms are included in one file.
- Data for a storm starts with a shorter line, with values for the storm ID, name, and number of observations for the storm. These values are comma separated.
- Observations for each storm are longer lines. There are multiple observations for each storm, where each observation gives values like the location and maximum winds for the storm at that time.
Strategy for reading in very messy data:
1. Read in all lines individually.
2. Use regular expressions to split each line into the elements you'd like to use to fill columns.
3. Write functions and / or `map` calls to process lines and use the contents to fill a data frame.
4. Once you have the data in a data frame, do any remaining cleaning to create a data frame that is easy to use to answer research questions.
Because the data is not nicely formatted, you can't use `read_csv` or similar functions to read it in.
However, the `read_lines` function from `readr` allows you to read a text file in one line at a time. You can then write code and functions to parse the file one line at a time, to turn it into a dataframe you can use.
Note: Base R has `readLines`, which is very similar.
The `read_lines` function from `readr` will read in lines from a text file directly, without trying to separate them into columns. You can use the `n_max` argument to specify the number of lines to read in. \bigskip
For example, to read in three lines from the hurricane tracking data, you can run:
```{r warning = FALSE, message = FALSE}
tracks_url <- paste0("http://www.nhc.noaa.gov/data/hurdat/",
"hurdat2-1851-2017-050118.txt")
hurr_tracks <- read_lines(tracks_url, n_max = 3)
hurr_tracks
```
The data has been read in as a vector, rather than a dataframe:
```{r warning = FALSE, message = FALSE}
class(hurr_tracks)
length(hurr_tracks)
hurr_tracks[1]
```
You can use regular expressions to break each line up. For example, you can use `str_split` from the `stringr` package to break the first line of the hurricane track data into its three separate components:
```{r}
library(stringr)
str_split(hurr_tracks[1], pattern = ",")
```
You can use this to create a list where each element of the list has the split-up version of a line of the original data. First, read in all of the data:
```{r warning = FALSE, message = FALSE}
tracks_url <- paste0("http://www.nhc.noaa.gov/data/hurdat/",
"hurdat2-1851-2017-050118.txt")
hurr_tracks <- read_lines(tracks_url)
length(hurr_tracks)
```
Next, use `map` with `str_split` to split each line of the data at the commas:
```{r}
library(purrr)
hurr_tracks <- purrr::map(hurr_tracks, str_split,
pattern = ",", simplify = TRUE)
hurr_tracks[[1]]
hurr_tracks[[2]][1:2]
```
Next, you want to split this list into two lists, one with the shorter "meta-data" lines and one with the longer "observation" lines. You can use `map_int` to create a vector with the length of each line. You will later use this to identify which lines are short or long.
```{r}
hurr_lengths <- map_int(hurr_tracks, length)
hurr_lengths[1:17]
unique(hurr_lengths)
```
You can use bracket indexing to split the `hurr_tracks` into two lists: one with the shorter lines that start each observation (`hurr_meta`) and one with the storm observations (`hurr_obs`). Use bracket indexing with the `hurr_lengths` vector you just created to make that split.
```{r}
hurr_meta <- hurr_tracks[hurr_lengths == 4]
hurr_obs <- hurr_tracks[hurr_lengths == 21]
```
```{r}
hurr_meta[1:3]
```
```{r}
hurr_obs[1:2]
```
Now, you can use `bind_rows` from `dplyr` to change the list of metadata into a dataframe. (You first need to use `as_tibble` with `map` to convert all elements of the list from matrices to dataframes.)
```{r message = FALSE}
library(dplyr); library(tibble)
hurr_meta <- hurr_meta %>%
purrr::map(as_tibble) %>%
bind_rows()
hurr_meta %>% slice(1:3)
```
You can clean up the data a bit more.
- First, the fourth column doesn't have any non-missing values, so you can get rid of it:
```{r}
unique(hurr_meta$V4)
```
- Second, the second and third columns include a lot of leading whitespace:
```{r}
hurr_meta$V2[1:2]
```
- Last, we want to name the columns.
```{r}
hurr_meta <- hurr_meta %>%
select(-V4) %>%
rename(storm_id = V1, storm_name = V2, n_obs = V3) %>%
mutate(storm_name = str_trim(storm_name),
n_obs = as.numeric(n_obs))
hurr_meta %>% slice(1:3)
```
Now you can apply the same approach to the hurricane observations. First, we'll want to add storm identifiers to that data. The "meta" data includes storm IDs and the number of observations per storm. We can take advantage of that to make a `storm_id` vector that will line up with the storm observations.
```{r}
storm_id <- rep(hurr_meta$storm_id, times = hurr_meta$n_obs)
head(storm_id, 3)
length(storm_id)
length(hurr_obs)
```
```{r warning = FALSE, message = FALSE}
hurr_obs <- hurr_obs %>%
purrr::map(as_tibble) %>%
bind_rows() %>%
mutate(storm_id = storm_id)
hurr_obs %>% select(V1:V2, V5:V6, storm_id) %>% slice(1:3)
```
## In-course exercise
### Working with an API wrapper package
The `rplos` package provides a wrapper to the Public Library of Science (PLOS) API. PLOS
has a collection of academic journals spanning a variety of topics.
- Check out this page of documentation for this API: http://api.plos.org/solr/search-fields/
Look through the potential search terms.
- Use the `searchplos` function to search articles in the PLoS collection for the term
"West Nile". Pull the publication date, title, abstract, article type, subject,
and journal of each
matching article
and save the result to an R object called `wn_papers`. You may find it helpful to look at the
examples in the helpfile for `searchplos` or the tutorial available at:
https://ropensci.org/tutorials/rplos_tutorial/
- The object returned by `searchplos` will be a list with two top levels, `meta` and `data`.
Confirm that this is true for the `wn_papers` object you created. Look at the `meta`
part of the list (you can use `$` indexing to pull this out). How many articles were found
with "West Nile" in them? Does the query seem case-sensitive (i.e., do you get the same
number of papers when you query "west nile" rather than "West Nile")?
- Re-run your query (save the results to `wn_papers_titles`) looking only for papers with
"West Nile" in the title of the paper. How many papers are returned by this query?
- By default the limit of the number of papers returned by a query will be 10. You
can change this (to a certain degree) by using the `limit` option.
For the call you ran to create `wn_papers_titles`, set the limit to
the number of articles that match this query, as identified in the `meta` element of the
first run of the call. Check the number of rows in the `data` element that was returned to
make sure it has the same number of rows as the number of articles that match the query.
- Create a plot of the number of articles published per year. Use color to show which
articles are Research Articles versus other types of articles.
- Determine which journals have published these articles and the number of articles published
in each journal. You may notice that sometimes "PLoS" is used and sometimes "PLOS". See if you
can fix this in R to get the count of articles per journal without this capitalization
difference causing problems.
- Explore the list of packages on rOpenSci, those through the USGS, and those related to
the U.S. Census. See if you can identify any other packages that might provide access to
data relevant to West Nile virus.
#### Example R code
```{r}
library(rplos)
wn_papers <- searchplos(q = "West Nile",
fl = c("publication_date", "title", "journal", "subject",
"abstract", "article_type"))
```
Confirm the structure of the returned object:
```{r}
str(wn_papers)
```
Look at the metadata:
```{r}
wn_papers$meta
```
Query for just articles with "West Nile" in the title:
```{r}
wn_papers_titles <- searchplos(q = "title:West Nile",
fl = c("publication_date", "title", "journal", "subject",
"abstract", "article_type"))
```
Look at the metadata:
```{r}
wn_papers_titles$meta
```
Re-run the query using an appropriate limit to get all the matching articles:
```{r}
wn_papers_titles <- searchplos(q = "title:West Nile",
fl = c("publication_date", "title", "journal", "subject",
"abstract", "article_type"),
limit = 190)
```
Check the number of rows:
```{r}
nrow(wn_papers_titles$data)
```
Create a plot of the number of articles in the PLoS journals with "West Nile" in the title
by year:
```{r message = FALSE, warning = FALSE}
library(lubridate)
library(dplyr)
library(ggplot2)
wn_papers_titles$data %>%
mutate(publication_date = ymd_hms(publication_date),
pub_year = year(publication_date),
research_article = article_type == "Research Article") %>%
group_by(pub_year, research_article) %>%
count() %>%
ggplot(aes(x = pub_year, y = n, fill = research_article)) +
geom_col() +
labs(x = "Year of publication",
y = "Number of articles published",
fill = "Research article") +
ggtitle("West Nile articles published in PLoS journals",
subtitle = "Based on articles with 'West Nile' in the title")
```
Determine which journals have published these articles and the number of articles published
in each journal:
```{r}
library(forcats)
library(stringr)
wn_papers_titles$data %>%
mutate(journal = str_replace(journal, "PLOS", "PLoS")) %>%
group_by(journal) %>%
count() %>%
ungroup() %>%
filter(!is.na(journal)) %>%
arrange(desc(n))
```
### Cleaning very messy data
With your groups, create an R script that does all the steps described so far to pull the
messy hurricane tracks data from the web and clean it.
Then try the following further cleaning steps (a sketch of the first few steps is given after this list):
- Select only the columns with date, time, storm status, location (latitude and longitude), maximum sustained winds, and minimum pressure, and rename them
- Create a column with the date-time of each observation, in a date-time class
- Clean up the latitude and longitude so that you have separate columns for the numeric values and for the direction indicator (e.g., N, S, E, W)
- Clean up the wind column, so it gives wind speed as a number and `NA` in cases where wind speed is missing
- If you have time, try to figure out what the `status` abbreviations stand for. Create a new factor column named `status_long` with the status spelled out.
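Below is a hedged sketch of how the first few of these steps might look, assuming the `hurr_obs` data frame created earlier in this chapter and assuming the columns follow the standard HURDAT2 layout (V1 = date, V2 = time, V4 = status, V5 and V6 = latitude and longitude, V7 = maximum sustained wind, V8 = minimum pressure). Check the column contents yourself before relying on this mapping.
```{r eval = FALSE}
library(dplyr); library(tidyr); library(stringr); library(lubridate)

hurr_obs_clean <- hurr_obs %>%
  select(V1, V2, V4, V5, V6, V7, V8, storm_id) %>%
  rename(date = V1, time = V2, status = V4, latitude = V5,
         longitude = V6, wind = V7, pressure = V8) %>%
  # Combine the date and time columns into a single date-time
  unite(date_time, date, time, sep = " ") %>%
  mutate(date_time = ymd_hm(date_time),
         # Split latitude / longitude into numeric value and direction
         lat_dir = str_extract(latitude, "[NS]"),
         latitude = as.numeric(str_extract(latitude, "[0-9.]+")),
         lon_dir = str_extract(longitude, "[EW]"),
         longitude = as.numeric(str_extract(longitude, "[0-9.]+")),
         # In this file, missing wind speeds are coded as -99
         wind = as.numeric(wind),
         wind = ifelse(wind == -99, NA, wind))
```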