```{r}
i <- 1
chapter_number <- 11
source("_common.R")
```
```{r}
library(tidyverse)
library(tidycensus)
library(tigris)
options(tigris_use_cache = TRUE)
library(googlesheets4)
library(janitor)
gs4_deauth()
```
# Accessing Data {#accessing-data-chapter}
So far, we’ve imported data into projects from CSV files. Many online datasets allow you to export data as a CSV, but before you do so, you should look for packages to automate your data access. If you can eliminate the manual steps involved in fetching data, your analysis and reporting will be more accurate. You’ll also be able to efficiently update your report when the data changes.
R offers many ways to automate the process of accessing online data. In this chapter, I’ll discuss two such approaches: using the `googlesheets4` package to fetch data directly from Google Sheets and using the `tidycensus` package to access data from the United States Census Bureau. You’ll learn how to connect your R Markdown project to Google so you can automatically download data when a Google Sheet updates. Then you’ll explore working with two large census datasets, the Decennial Census and the American Community Survey, and practice visualizing them.
## Importing Data from Google Sheets with `googlesheets4` {-}
By using the `googlesheets4` package to access data directly from Google Sheets, you can avoid having to manually download data, copy it into your project, and adjust your code so it imports that new data each time you want to update a report. This package lets you write code that automatically fetches new data directly from Google Sheets. Whenever you need to update your report, you can simply run your code to refresh the data. In addition, if you work with Google Forms, you can pipe your data into Google Sheets, completely automating the workflow from data collection to data import.
Using this package can help you manage complex datasets that update frequently. For example, in her role at the Primary Care Research Institute at the University at Buffalo, Meghan Harris used it for a research project about people affected by opioid use disorder. The data came from a variety of surveys, all of which fed into a jumble of Google Sheets. Using `googlesheets4`, she was able to bring all of that data into one place and put it to use with R. Data that had once gone largely unused because accessing it was so complicated came to inform research on opioid use disorder.
This section demonstrates how the `googlesheets4` package works using a fake dataset about video game preferences that Harris created to replace her opioid survey data (which, for obvious reasons, is confidential).
### Connecting to Google {-}
Begin by installing the `googlesheets4` package by running `install.packages("googlesheets4")`. Next, you’ll need to connect to your Google account. To do this, run the `gs4_auth()` function in the console. If you have more than one Google account, select the account that has access to the Google Sheet you want to work with. Once you do so, you should see a screen that looks like Figure \@ref(fig:tidyverse-access-r).
```{r results='asis'}
print_nostarch_file_name(file_type_to_print = "png")
```
```{r tidyverse-access-r, out.width="100%", fig.cap="The screen asking for authorization to access your Google Sheets data"}
knitr::include_graphics(here::here("assets/tidyverse-access-r.png"))
```
```{r results='asis'}
save_image_for_nostarch(here::here("assets/tidyverse-access-r.png"))
```
Check the box next to **See, edit, create, and delete all your Google Sheets spreadsheets**. This will ensure that R is able to access data from your Google Sheets account. Hit **Continue**, and you’ll be given the message *Authentication complete. Please close this page and return to R*. The `googlesheets4` package will now save your credentials so that you can use them in the future without having to reauthenticate.
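If you knit reports unattended or switch between machines, you may not want the interactive account chooser to appear at all. As a minimal sketch (the email address here is a placeholder), you can tell `gs4_auth()` which account’s cached token to reuse:
```{r eval = FALSE, echo = TRUE}
# Reuse the cached token for a specific account so no
# interactive prompt appears (placeholder email address)
gs4_auth(email = "you@example.com")
```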
### Reading Data from a Sheet {-}
Now that we’ve connected R to our Google account, we can import data. We’ll import the fake data that Meghan Harris created about video game preferences (you can access it at https://data.rwithoutstatistics.com/google-sheet). Figure \@ref(fig:video-game-survey-data) shows what it looks like in Google Sheets.
```{r results='asis'}
print_nostarch_file_name(file_type_to_print = "png")
```
```{r video-game-survey-data, out.width="100%", fig.cap="The video game data in Google Sheets"}
knitr::include_graphics(here::here("assets/video-game-survey-data.png"))
```
```{r results='asis'}
save_image_for_nostarch(here::here("assets/video-game-survey-data.png"))
```
The `googlesheets4` package has a function called `read_sheet()` that allows you to pull in data directly from a Google Sheet. Import the data by passing the spreadsheet’s URL to the function:
```{r echo = TRUE}
library(googlesheets4)
survey_data_raw <- read_sheet("https://docs.google.com/spreadsheets/d/1AR0_RcFBg8wdiY4Cj-k8vRypp_txh27MyZuiRdqScog/edit?usp=sharing")
```
Take a look at the `survey_data_raw` object to confirm that the data was imported. Using the `glimpse()` function from the `dplyr` package can make it easier to read:
```{r eval = FALSE, echo = TRUE}
library(tidyverse)
survey_data_raw %>%
  glimpse()
```
The `glimpse()` function, which creates one output row per variable, shows that we’ve indeed imported the data directly from Google Sheets:
```{r}
survey_data_raw %>%
  glimpse()
```
Once we have the data in R, we can use the same workflow we always do when creating reports with R Markdown.
### Using the Data in R Markdown {-}
The following code is taken from an R Markdown report that Meghan Harris made to summarize the video games data. You can see the YAML, the `setup` code chunk, a code chunk that loads packages, and the code to import data from Google Sheets:
````{verbatim echo = TRUE, eval = FALSE, lang = "markdown"}
---
title: "Video Game Survey"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = FALSE,
  warning = FALSE,
  message = FALSE
)
```
```{r}
library(tidyverse)
library(janitor)
library(googlesheets4)
library(gt)
```
```{r}
# Import data from Google Sheets
survey_data_raw <- read_sheet("https://docs.google.com/spreadsheets/d/1AR0_RcFBg8wdiY4Cj-k8vRypp_txh27MyZuiRdqScog/edit?usp=sharing")
```
````
The R Markdown document resembles those discussed in previous chapters except for the way we’re importing the data. Because we’re bringing it in directly from Google Sheets, there’s no risk of, say, accidentally reading in the wrong CSV. Automating this step reduces the risk of error.
The next code chunk cleans the `survey_data_raw` object, saving the result as `survey_data_clean`:
````{verbatim echo = TRUE, eval = FALSE, lang = "markdown"}
```{r}
# Clean data
survey_data_clean <- survey_data_raw %>%
  clean_names() %>%
  mutate(participant_id = as.character(row_number())) %>%
  rename(
    age = how_old_are_you,
    like_games = do_you_like_to_play_video_games,
    game_types = what_kind_of_games_do_you_like,
    favorite_game = whats_your_favorite_game
  ) %>%
  relocate(participant_id, .before = age) %>%
  mutate(age = factor(age, levels = c("Under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "Over 65")))
```
````
Here, we use the `clean_names()` function from the `janitor` package to make the variable names easier to work with. We then add a `participant_id` variable using the `row_number()` function, which assigns a consecutively increasing number to each row, and convert it to a character with the `as.character()` function. Next, we shorten several variable names with `rename()` and use `relocate()` to put `participant_id` before the `age` variable. Finally, we use `mutate()` to turn the `age` variable into a data structure known as a *factor*; doing so will make sure the age groups show up in the right order in our charts. We can now take a look at our `survey_data_clean` data frame using the `glimpse()` function again:
```{r}
survey_data_clean <- survey_data_raw %>%
  clean_names() %>%
  mutate(participant_id = as.character(row_number())) %>%
  rename(
    age = how_old_are_you,
    like_games = do_you_like_to_play_video_games,
    game_types = what_kind_of_games_do_you_like,
    favorite_game = whats_your_favorite_game
  ) %>%
  relocate(participant_id, .before = age) %>%
  mutate(age = factor(age, levels = c("Under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "Over 65")))

survey_data_clean %>%
  glimpse()
```
The rest of the report uses this data to highlight a variety of statistics:
````{verbatim echo = TRUE, eval = FALSE, lang = "markdown"}
# Respondent Demographics
```{r}
# Calculate number of respondents
number_of_respondents <- nrow(survey_data_clean)
```
We received responses from `r number_of_respondents` respondents. Their ages are below.
```{r}
survey_data_clean %>%
  select(participant_id, age) %>%
  gt() %>%
  cols_label(
    participant_id = "Participant ID",
    age = "Age"
  ) %>%
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  ) %>%
  cols_align(
    align = "left",
    columns = everything()
  ) %>%
  cols_width(
    participant_id ~ px(200),
    age ~ px(700)
  )
```
# Video Games
We asked if respondents liked video games. Their responses are below.
```{r}
survey_data_clean %>%
  count(like_games) %>%
  ggplot(aes(
    x = like_games,
    y = n,
    fill = like_games
  )) +
  geom_col() +
  scale_fill_manual(values = c(
    "No" = "#6cabdd",
    "Yes" = "#ff7400"
  )) +
  labs(
    title = "How Many People Like Video Games?",
    x = NULL,
    y = "Number of Participants"
  ) +
  theme_minimal(base_size = 16) +
  theme(
    legend.position = "none",
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    axis.title.y = element_blank(),
    plot.title = element_text(
      face = "bold",
      hjust = 0.5
    )
  )
```
````
These sections calculate the number of survey respondents, then put this in the text using inline R code; create a table that shows the respondents broken down by age group; and generate a graph that shows how many respondents like video games. Figure \@ref(fig:video-game-report) shows the resulting report.
```{r results='asis'}
print_nostarch_file_name(file_type_to_print = "png")
```
```{r video-game-report, out.width="100%", fig.cap="The rendered video game report"}
knitr::include_graphics(here::here("assets/video-game-report.png"))
```
```{r results='asis'}
save_image_for_nostarch(here::here("assets/video-game-report.png"))
```
We can re-run the code at any point to bring in updated data. Our survey had five responses today, but if we run it again tomorrow and it has additional responses, they will be included in the import. If you used Google Forms to run your survey and saved the results to a Google Sheet, you could produce this up-to-date report simply by clicking the Knit button in RStudio.
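You could also skip opening RStudio altogether and render the report from a script, which is handy if you want to schedule regular updates. Here is a minimal sketch, assuming the report is saved as *video-game-survey.Rmd* (a hypothetical filename):
```{r eval = FALSE, echo = TRUE}
library(rmarkdown)

# Re-render the report; the read_sheet() call inside the document
# fetches the latest survey responses each time this runs
render("video-game-survey.Rmd")
```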
### Importing Only Certain Columns {-}
In the previous sections, we read in the entire Google Sheet. It is, however, possible to import only part of a Sheet. For example, the survey data we imported includes a `timestamp` column. This variable is added automatically whenever someone submits a Google Form that pipes data into a Google Sheet, but we don’t use it in our analysis, so we can get rid of it.
To do this, use the `range` argument of the `read_sheet()` function when importing the data. This argument lets us specify a range of data to bring in, using the same syntax you may have used to select cell ranges in Google Sheets. In this example, we use `range = "Sheet1!B:E"` to import columns B through E (but not column A, which contains the timestamp) before adding `glimpse()`:
```{r eval = FALSE, echo = TRUE}
read_sheet(
  "https://docs.google.com/spreadsheets/d/1AR0_RcFBg8wdiY4Cj-k8vRypp_txh27MyZuiRdqScog/edit?usp=sharing",
  range = "Sheet1!B:E"
) %>%
  glimpse()
```
Running this code produces output without the `timestamp` variable:
```{r}
read_sheet(
  "https://docs.google.com/spreadsheets/d/1AR0_RcFBg8wdiY4Cj-k8vRypp_txh27MyZuiRdqScog/edit?usp=sharing",
  range = "Sheet1!B:E"
) %>%
  glimpse()
```
There are a number of other functions in the `googlesheets4` package that you can use. For example, if you ever need to write your output back to a Google Sheet, the `write_sheet()` function is there to help. To explore other functions in the package, check out its documentation website at https://googlesheets4.tidyverse.org/index.html.
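For instance, here is a sketch of writing our cleaned survey data back to a Sheet; the URL is a placeholder for a Sheet your account can edit, and the tab name is an arbitrary choice:
```{r eval = FALSE, echo = TRUE}
# Write the cleaned data to a tab called "clean-data"
# (the URL below is a placeholder)
write_sheet(
  survey_data_clean,
  ss = "https://docs.google.com/spreadsheets/d/your-sheet-id",
  sheet = "clean-data"
)
```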
## Accessing Census Data with `tidycensus` {-}
If you’ve ever worked with data from the United States Census Bureau, you know what a hassle it can be. Usually, the process involves visiting the Census Bureau website, searching the website for the data you need, downloading it, and then analyzing it in your tool of choice. This pointing and clicking gets very tedious over time.
Kyle Walker, a geographer at Texas Christian University, and Matt Herman (creator of the Westchester COVID-19 website discussed in Chapter \@ref(websites-chapter)) developed a package to automate the process of bringing Census Bureau data into R: the `tidycensus` package. With `tidycensus`, a user can write just a few lines of code to get data about, say, the median income in all counties in the United States.
This section shows you how the `tidycensus` package works using examples from two datasets it provides access to: the Decennial Census, administered every 10 years, and the annual American Community Survey. You’ll also learn how to use data from these two sources to perform additional analysis and to make maps by accessing geospatial and demographic data simultaneously.
### Connecting to the Census Bureau with an API Key {-}
Begin by installing `tidycensus` using `install.packages("tidycensus")`. To use `tidycensus`, you must get an application programming interface (API) key from the Census Bureau. *API keys* are like passwords that allow online services to determine whether you are authorized to access data.
You can obtain this key, which is free, by going to https://api.census.gov/data/key_signup.html and entering your details. Once you receive the key by email, you need to put it in a place where `tidycensus` can find it. The `census_api_key()` function does this for you, so after loading the `tidycensus` package, run the function as follows, replacing `123456789` with your actual API key:
```{r eval = FALSE, echo = TRUE}
library(tidycensus)
census_api_key("123456789", install = TRUE)
```
The `install = TRUE` argument will save your API key in your *.Renviron* file, which is designed for storing confidential information like API keys. The package will look for your API key there in the future so that you don’t have to reenter it every time you use the package.
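The key is stored under the name `CENSUS_API_KEY`, so after restarting R you can confirm that it was saved:
```{r eval = FALSE, echo = TRUE}
# Returns your key if it was saved to .Renviron successfully
Sys.getenv("CENSUS_API_KEY")
```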
Now you can use `tidycensus` to access data from a number of Census Bureau datasets. The most commonly used of these are the Decennial Census and the American Community Survey. You can find a discussion of the other datasets you can access in Chapter 2 of Kyle Walker’s book *Analyzing US Census Data: Methods, Maps, and Models in R*.
### Working with Decennial Census Data {-}
The Census Bureau puts out many datasets, several of which you can access using dedicated `tidycensus` functions. Let’s access data from the 2020 Decennial Census about the Asian population in each state using the `get_decennial()` function with three arguments:
```{r eval = FALSE, echo = TRUE}
get_decennial(
  geography = "state",
  variables = "P1_006N",
  year = 2020
)
```
The `geography` argument tells `get_decennial()` to access data at the state level. In addition to the 50 states, it will return data for the District of Columbia and Puerto Rico. There are many other geographies, including county, census tract, and more. The `variables` argument specifies the variable or variables we want to access. Here, `P1_006N` is the variable name for the total Asian population. We’ll discuss how to identify other variables you may want to use in the next section. Lastly, `year` specifies the year from which we want to access data. We’re using data from the 2020 Census.
Running this code returns the following:
```{r}
get_decennial(
  geography = "state",
  variables = "P1_006N",
  year = 2020
)
```
The resulting data frame has four variables. `GEOID` is the geographic identifier given by the Census Bureau for the state. Each state has a geographic identifier, as do all counties, census tracts, and other geographies. `NAME` is the name of each state, and `variable` is the name of the variable we passed to the `get_decennial()` function. Lastly, `value` is the numeric value for the state and variable in each row. In this case, it represents the total Asian population in each state.
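The same pattern works for smaller geographies. For example, here is a sketch of requesting the Asian population for every county in a single state by adding the `state` argument (Washington is an arbitrary choice):
```{r eval = FALSE, echo = TRUE}
# Asian population for each county in Washington State
get_decennial(
  geography = "county",
  state = "WA",
  variables = "P1_006N",
  year = 2020
)
```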
#### Identifying Census Variable Values {-}
To pass a specific variable to `get_decennial()`, you have to first look it up. Let’s say we want to calculate the Asian population as a percentage of all people in each state. To do that, we’d first need to retrieve the variable for the state’s total population.
The `tidycensus` package has a function called `load_variables()` that shows all of the variables from a Decennial Census. Run it with the `year` argument set to 2020 and the `dataset` argument set to `"pl"`, which pulls from the so-called redistricting summary data files that Public Law 94-171 requires the Census Bureau to produce every 10 years:
```{r eval = FALSE, echo = TRUE}
load_variables(
  year = 2020,
  dataset = "pl"
)
```
Running this code returns the name, label (a description), and concept (a category) of all variables available to us:
```{r}
load_variables(
  year = 2020,
  dataset = "pl"
)
```
By looking at this list, you can see that the variable `P1_001N` gives us the total population.
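Because `load_variables()` returns a regular data frame, you don’t have to scroll through the whole list: you can search it with standard `tidyverse` tools. A sketch:
```{r eval = FALSE, echo = TRUE}
# Search variable labels for ones mentioning "Asian"
load_variables(
  year = 2020,
  dataset = "pl"
) %>%
  filter(str_detect(label, "Asian"))
```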
#### Using Multiple Census Variables {-}
Now that we know which variables we need, we can use the `get_decennial()` function again with two variables at once:
```{r eval = FALSE, echo = TRUE}
get_decennial(
  geography = "state",
  variables = c("P1_001N", "P1_006N"),
  year = 2020
) %>%
  arrange(NAME)
```
We add `arrange(NAME)` after `get_decennial()` so that the results are sorted by state name, allowing us to see that we have both variables for each state:
```{r}
get_decennial(
  geography = "state",
  variables = c("P1_001N", "P1_006N"),
  year = 2020
) %>%
  arrange(NAME)
```
If you’re working with multiple census variables, however, you might have trouble remembering what names like `P1_001N` and `P1_006N` mean. Fortunately, we can adjust the code in the call to `get_decennial()` to give these variables more meaningful names using the following syntax:
```{r eval = FALSE, echo = TRUE}
get_decennial(
  geography = "state",
  variables = c(
    total_population = "P1_001N",
    asian_population = "P1_006N"
  ),
  year = 2020
) %>%
  arrange(NAME)
```
Within the `variables` argument, we give the name we want each variable to have, followed by the equal sign and the original variable name. We can rename multiple variables by putting them within the `c()` function, as we’ve done here. It should now be much easier to see which variables we’re working with:
```{r}
get_decennial(
  geography = "state",
  variables = c(
    total_population = "P1_001N",
    asian_population = "P1_006N"
  ),
  year = 2020
) %>%
  arrange(NAME)
```
Instead of `P1_001N` and `P1_006N`, we have `total_population` and `asian_population`. Much better!
### Analyzing Census Data {-}
Now we have the data we need to calculate the Asian population in each state as a percentage of the total. To do this, we add a few things to the code from the previous section:
```{r echo = TRUE, eval = FALSE}
get_decennial(
  geography = "state",
  variables = c(
    total_population = "P1_001N",
    asian_population = "P1_006N"
  ),
  year = 2020
) %>%
  arrange(NAME) %>%
  group_by(NAME) %>%
  mutate(pct = value / value[variable == "total_population"]) %>%
  ungroup() %>%
  filter(variable == "asian_population")
```
We use `group_by(NAME)` to create one group for each state because we want to calculate the Asian population percentage within each state (not for the entire United States). We then use `mutate()` to calculate each percentage, dividing the `value` in each row by the state’s `total_population` value. Next, we use `ungroup()` to remove the state-level grouping and, finally, `filter()` to show only the Asian population percentage.
When we run this code, we see both the Asian population and the Asian population as a percentage of the total population in each state:
```{r}
get_decennial(
  geography = "state",
  variables = c(
    total_population = "P1_001N",
    asian_population = "P1_006N"
  ),
  year = 2020
) %>%
  arrange(NAME) %>%
  group_by(NAME) %>%
  mutate(pct = value / value[variable == "total_population"]) %>%
  ungroup() %>%
  filter(variable == "asian_population")
```
This is one way to calculate the Asian population as a percentage of the total population in each state, but it’s not the only way.
#### Using a Summary Variable {-}
Kyle Walker knew that calculating summaries like the one we’ve just done would be a common use case for `tidycensus`. To calculate, say, the Asian population as a percentage of the whole, you need a numerator (the Asian population) and a denominator (the total population). So, to simplify things, he gives us the `summary_var` argument, which we can use within `get_decennial()` to import the total population as a separate variable. Instead of putting `P1_001N` (total population) in the `variables` argument, we can pass it to the `summary_var` argument as follows:
```{r eval = FALSE, echo = TRUE}
get_decennial(
  geography = "state",
  variables = c(asian_population = "P1_006N"),
  summary_var = "P1_001N",
  year = 2020
) %>%
  arrange(NAME)
```
This returns a nearly identical data frame to what we got earlier, except that the total population is now a separate variable, rather than additional rows for each state:
```{r}
get_decennial(
  geography = "state",
  variables = c(asian_population = "P1_006N"),
  summary_var = "P1_001N",
  year = 2020
) %>%
  arrange(NAME)
```
With our data in this new format, we can calculate the Asian population as a percentage of the whole by dividing the `value` variable by the `summary_value` variable. Then we drop the `summary_value` variable because we no longer need it after doing our calculation:
```{r eval = FALSE, echo = TRUE}
get_decennial(
  geography = "state",
  variables = c(asian_population = "P1_006N"),
  summary_var = "P1_001N",
  year = 2020
) %>%
  arrange(NAME) %>%
  mutate(pct = value / summary_value) %>%
  select(-summary_value)
```
The resulting output is identical to the output of the previous section:
```{r}
get_decennial(
  geography = "state",
  variables = c(asian_population = "P1_006N"),
  summary_var = "P1_001N",
  year = 2020
) %>%
  arrange(NAME) %>%
  mutate(pct = value / summary_value) %>%
  select(-summary_value)
```
How you choose to calculate summary statistics is up to you; `tidycensus` makes it easy to do it either way.
#### Visualizing American Community Survey Data {-}
Once you’ve accessed data using the `tidycensus` package, you can do whatever you want with it. Let’s practice analyzing and visualizing survey data using the American Community Survey. This survey, which is conducted every year, differs from the Decennial Census in two major ways: it is given to a sample of people rather than the entire population, and it includes a wider range of questions.
Despite these differences, we can access data from the American Community Survey in a manner nearly identical to how we access Decennial Census data. Instead of `get_decennial()`, we use the function `get_acs()`, but the arguments we pass to these functions are the same. In the following example, I’ve identified a variable I’m interested in (`B01002_001`, which shows the median age) and use it to get this data for each state:
```{r eval = FALSE, echo = TRUE}
get_acs(
  geography = "state",
  variables = "B01002_001",
  year = 2020
)
```
Here is what the output looks like:
```{r}
get_acs(
  geography = "state",
  variables = "B01002_001",
  year = 2020
)
```
You should notice two differences in the output from `get_acs()` compared to that from `get_decennial()`. First, instead of the `value` column, `get_acs()` produces a column called `estimate`. Second, it produces an additional column called `moe`, for the margin of error. We see these changes because the American Community Survey is given to only a sample of the population. As a result, its values are estimates extrapolated from that sample to the population as a whole, and with such estimates comes a margin of error.
In the state-level data, the margins of error are relatively low, but if you get data from smaller geographies, they tend to be higher. In cases where your margins of error are high relative to your estimates, you should interpret results with caution, as there is greater uncertainty about how well the data represents the population as a whole.
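One quick way to gauge this is to express each margin of error as a share of its estimate; a sketch:
```{r eval = FALSE, echo = TRUE}
# Express each margin of error as a fraction of its estimate and
# show the most uncertain states first
get_acs(
  geography = "state",
  variables = "B01002_001",
  year = 2020
) %>%
  mutate(moe_pct = moe / estimate) %>%
  arrange(desc(moe_pct))
```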
#### Making Charts {-}
Here is how we could take the data on median age and pipe it into ggplot to create a bar chart:
```{r eval = FALSE, echo = TRUE}
get_acs(
  geography = "state",
  variables = "B01002_001",
  year = 2020
) %>%
  ggplot(aes(
    x = estimate,
    y = NAME
  )) +
  geom_col()
```
We import data with the `get_acs()` function and then pipe this directly into ggplot. We put states (which use the variable `NAME`) on the y axis and median age (`estimate`) on the x axis. A simple `geom_col()` creates the bar chart, shown in Figure \@ref(fig:median-age-chart).
```{r results='asis'}
print_nostarch_file_name()
```
```{r median-age-chart, fig.height = 6, fig.cap = "A bar chart showing the median age in each state"}
get_acs(
  geography = "state",
  variables = "B01002_001",
  year = 2020
) %>%
  ggplot(aes(
    x = estimate,
    y = NAME
  )) +
  geom_col()
```
```{r results='asis'}
save_figure_for_nostarch(figure_height = 6)
```
This chart is nothing special, but the fact that it takes just six lines of code to create most definitely is.
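If you did want to polish it, one small improvement would be to sort the states by median age using `fct_reorder()` from the `forcats` package (loaded with the `tidyverse`); a sketch:
```{r eval = FALSE, echo = TRUE}
get_acs(
  geography = "state",
  variables = "B01002_001",
  year = 2020
) %>%
  ggplot(aes(
    x = estimate,
    y = fct_reorder(NAME, estimate)
  )) +
  geom_col() +
  labs(y = NULL)
```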
#### Making Population Maps with the `geometry` Argument {-}
Kyle Walker, the creator of `tidycensus`, also created the `tigris` package for working with geospatial data. As a result, these packages are tightly integrated. Within the `get_acs()` function, you can set the `geometry` argument to `TRUE` to receive both demographic data from the Census Bureau and geospatial data from `tigris`:
```{r eval = FALSE, echo = TRUE}
get_acs(
  geography = "state",
  variables = "B01002_001",
  year = 2020,
  geometry = TRUE
)
```
If we take a look at the resulting data, we can see that it has the metadata and `geometry` column of the simple features objects that we saw in Chapter \@ref(maps-chapter).
```{r}
get_acs(
  geography = "state",
  variables = "B01002_001",
  year = 2020,
  geometry = TRUE
)
```
We can see that the geometry type is MULTIPOLYGON, which you learned about in Chapter \@ref(maps-chapter). We can pipe this data into ggplot to make a map with the following code:
```{r eval = FALSE, echo = TRUE}
get_acs(
  geography = "state",
  variables = "B01002_001",
  year = 2020,
  geometry = TRUE
) %>%
  ggplot(aes(fill = estimate)) +
  geom_sf() +
  scale_fill_viridis_c()
```
Here, we import the data with the `get_acs()` function before piping it into the `ggplot()` function. We map the `estimate` variable to the `fill` aesthetic property; that is, the fill color of each state will vary depending on the median age of its residents. Then we use `geom_sf()` to draw the map. The `scale_fill_viridis_c()` function gives us a colorblind-friendly palette.
The resulting map, seen in Figure \@ref(fig:median-age-map-bad), is less than ideal because the Aleutian Islands in Alaska cross the 180-degree line of longitude, also known as the *international date line*. As a result, most of Alaska appears on one side of the map while a small part appears on the other side. What’s more, both Hawaii and Puerto Rico are hard to see.
```{r results='asis'}
print_nostarch_file_name()
```
```{r median-age-map-bad, fig.cap = "A hard-to-read map showing median age by state"}
get_acs(
  geography = "state",
  variables = "B01002_001",
  year = 2020,
  geometry = TRUE
) %>%
  ggplot(aes(fill = estimate)) +
  geom_sf() +
  scale_fill_viridis_c()
```
```{r results='asis'}
save_figure_for_nostarch(figure_height = 5)
```
To fix these problems, load the `tigris` package, then use the `shift_geometry()` function to move Alaska, Hawaii, and Puerto Rico into places where they’ll be more easily visible:
```{r eval = FALSE, echo = TRUE}
library(tigris)
get_acs(
  geography = "state",
  variables = "B01002_001",
  year = 2020,
  geometry = TRUE
) %>%
  shift_geometry(preserve_area = FALSE) %>%
  ggplot(aes(fill = estimate)) +
  geom_sf() +
  scale_fill_viridis_c()
```
We set the `preserve_area` argument to `FALSE` to shrink the giant state of Alaska and make Hawaii and Puerto Rico larger. Although the state sizes in the map won’t be precise, the map will be easier to read, as you can see in Figure \@ref(fig:median-age-map-good).
```{r results='asis'}
print_nostarch_file_name()
```
```{r median-age-map-good, fig.cap = "An easier-to-read map showing median age by state"}
get_acs(
  geography = "state",
  variables = "B01002_001",
  year = 2020,
  geometry = TRUE
) %>%
  shift_geometry(preserve_area = FALSE) %>%
  ggplot(aes(fill = estimate)) +
  geom_sf() +
  scale_fill_viridis_c()
```
```{r results='asis'}
save_figure_for_nostarch(figure_height = 5)
```
We’ve made a map that shows the median age by state. As an exercise, try making the same map for all of the roughly 3,000 counties by changing the `geography` argument to `"county"`. Other geographies include region, tract (for census tracts), place (for census-designated places, more commonly known as towns and cities), congressional district, and more. Chapter 2 of Kyle Walker’s book *Analyzing US Census Data: Methods, Maps, and Models in R* discusses the various geographies available. There are also many more arguments in both the `get_decennial()` and `get_acs()` functions; we’ve shown only a few of the most common. If you want to learn more, Walker’s book is a great resource.
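As a starting point for that exercise, here is a sketch of the county-level map. It follows the same pattern as the state map, and `shift_geometry()` works on counties too; dropping the borders with `color = NA` keeps the many small counties readable:
```{r eval = FALSE, echo = TRUE}
# Median age for every county, mapped
get_acs(
  geography = "county",
  variables = "B01002_001",
  year = 2020,
  geometry = TRUE
) %>%
  shift_geometry(preserve_area = FALSE) %>%
  ggplot(aes(fill = estimate)) +
  geom_sf(color = NA) +
  scale_fill_viridis_c()
```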
## Packages like `googlesheets4` and `tidycensus` Make It Easy to Access Data {-}
This chapter explored two packages that use APIs to access data directly from its source. The `googlesheets4` package lets you import data from a Google Sheet. It’s particularly useful when you’re working with survey data, as it makes it easy to update your reports when new results come in. If you don’t work with Google Sheets, you can use similar packages to fetch data from Excel 365 (`Microsoft365R`), Qualtrics (`qualtRics`), SurveyMonkey (`surveymonkey`), and other sources.
If you work with US census data, the `tidycensus` package is a huge timesaver. Rather than having to manually download data from the Census Bureau website, you can use it to write R code that accesses it automatically, making it ready for analysis and reporting. Because of its integration with `tigris`, you can also easily map this demographic data.
If you’re looking for census data from other countries, Walker’s *Analyzing US Census Data* book gives examples of packages that can help. There are R packages that bring in census data from Canada (`cancensus`), Kenya (`rKenyaCensus`), Mexico (`mxmaps` and `inegiR`), Europe (`eurostat`), and other places. Before hitting the download button in your data collection tool of choice, it’s worth looking for a package that can import that data directly into R.