---
title: "A (Very Brief) Introduction to R and .Rmd"
author: |
| Philip Waggoner
| University of Chicago
output: pdf_document
urlcolor: blue
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
```
## Introductory Remarks
In this document, you will get a very high-level, quick introduction to the R language and environment. Though we will mostly focus on "Base R", we will touch on other approaches to data management and analysis in R, notably the Tidyverse. More on that later. So today, our crash course will cover:
1. R and R Studio
2. Functional Programming
3. Tidy Data Wrangling
4. Key Functions for Modeling
5. Simulations (via `do()` and `for`)
6. R Markdown
7. Feature Engineering for Missing Data
## Basics of R and RStudio
Let's look at the basics of getting into R via RStudio, mostly for the benefit of those who have never used R or RStudio (this is, again, per the course goals, to ensure everyone is on the same page moving forward). This will be a two-minute demo. Reach out if questions persist.
## Functional Programming
For the basis of our discussion, we will walk through writing user-defined functions (hereafter, *functions*). We will go through the following relatively quickly. But for those less familiar, I have included a lot more details throughout the document, much of which is the same in my book, which was recommended reading for the day (see the course files on Canvas for the free PDF of the book).
*Conditional logic* is the foundation of functions. The expressions `if` and `if else` are key building blocks. The syntax starts with `if`, and then a value to be tested is supplied in parentheses, followed by braces, which include the statement to be expressed.
A simple case: evaluating whether a supplied value is positive or not.
```{r}
x <- -1
if(x > 0){
print("Positive number")
}
```
A simple case with `if else`.
```{r}
x <- -5
if(x > 0){
print("Positive")
} else {
print("Negative")
}
```
Redefine x yourself and give it a try...
```{r}
x <- 5 # e.g...
if(x > 0){
print("Positive")
} else {
print("Negative")
}
```
Though simple, `if` and `if else` are core to writing functions and programming in R. But note, we haven't written a function yet, in case that was unclear. So let's do that now.
Functions operate on the same principle of preferring `sum(2,2,2,2)` in base R to the more laborious `(2 + 2 + 2 + 2)`. E.g., rather than typing: `(3^2) # run`, `(4^2) # run`, `(5^2) # run`, and so on, a function would streamline this process significantly, thereby *engaging* in algorithmic thinking. The syntax starts with defining a new object, and then specifying the function with an argument supplied in parentheses. Then, within braces, there is a statement to be evaluated, and the result is returned as output.
Here is a simple function for squaring some value. The problem is we need squared values, and the solution requires us to calculate their square and then return the squared value. We store this in a function to allow for replication, updating, and so on; i.e., we are explicitly adopting *algorithmic thinking* in a programming framework.
```{r}
sq <- function(x) {
sqn <- x^2
return(sqn)
}
```
With the function defined by the user (hence the name), we can call the function to see if it worked properly.
```{r}
sq(2)
```
We can update the function to take both x *and* y values this time. Instead of just squaring our supplied value, we are now allowing for raising any value to any power. As such, we change the name of the original function from `sq` to `exp` for "exponent." (Note that this masks base R's built-in `exp()` function for the rest of the session; choose a different name if you want to avoid that.)
```{r}
exp <- function(x, y) {
expn <- x^y
return(expn)
}
```
Try a simple case: raise 2 to the power of 4.
```{r}
exp(2, 4)
```
We can make it more descriptive using both `print()` and `paste()`, the latter of which allows us to paste words together to be returned along with our output, which is especially useful when writing R packages.
```{r}
exp <- function(x, y) {
expn <- x^y
print(paste(x,"raised to the power of", y, "is", expn))
return(expn) # optional
}
```
We can also assign default values, which are values you don't have to specify but *could* change if you want; in other words, a default makes that argument *optional*. Note, we are continuing to redefine our `exp` function from earlier. If you wanted to leave the original intact, you would simply need to change the function object's name to the left of the assignment operator, `<-`.
```{r}
exp <- function(x, y = 2) {
expn <- x^y
print(paste(x,"raised to the power of", y, "is", expn))
}
```
From here, we can make a few calls to the function to see everything come together.
Try with one value...
```{r}
exp(3)
```
Now with two values...
```{r}
exp(3,1)
```
Now, let's do something a little more useful: write a function to calculate temperature in Celsius, given a supplied Fahrenheit value. Note the argument name inside the function definition doesn't matter, e.g., `x` or `f` in this case; just make sure to use the name consistently in the statement to be evaluated within the function's body.
```{r}
celsius <- function(f) {
c <- ((f - 32) * 5) / 9
return(c)
}
```
With the function defined, we can either supply individual Fahrenheit values, or a vector of Fahrenheit values; the function can handle both. Let's store a vector of Fahrenheit values in the object `fahrenheit` and test out the function (**note**: if we supply a vector of values, we should get a vector of values returned as output).
```{r}
fahrenheit <- c(60, 65, 70, 75, 80, 85, 90, 95, 100)
celsius(fahrenheit)
```
Try it with a single value to see for yourself.
```{r}
celsius(4)
```
From here, we can include all kinds of innovations to complicate a function, e.g., layering statements to do different things, defensive coding (catching bugs and helping locate those bugs or violations of the statements), and so on. Here is a quick overview of some of these extensions.
Combine several logical statements into a single function.
```{r}
# define
pnz <- function(x) {
if (x > 0) {
n <- "Positive"
}
else if (x < 0) {
n <- "Negative"
}
else {
n <- "Zero"
}
return(n)
}
# call
pnz(4)
pnz(-3)
pnz(0)
```
Next, we can also debug, which is a key piece of writing *defensive* code. For example, we can tell a function to `stop` if something in the code is wrong or missing, or we could print `warning` messages if things aren't where they should be, but we don't want to stop the function entirely and throw an error message.
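For instance, here is a minimal sketch of both ideas (the function name `safe_sq` is purely for illustration): it stops on non-numeric input and warns, without halting, when values are missing.
```{r}
safe_sq <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be numeric")        # halt with an error message
  }
  if (any(is.na(x))) {
    warning("x contains NA values")  # keep going, but alert the user
  }
  x^2
}
safe_sq(3)        # returns 9
safe_sq(c(2, NA)) # returns values, plus a warning
# safe_sq("two")  # would stop with an error
```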
So, let's put all of these pieces together that we have learned so far and replicate a function to calculate the Herfindahl-Hirschman Index, which is a measure of concentration (often used as a proxy for competition). This is from an R package I wrote, [`hhi`](https://CRAN.R-project.org/package=hhi), and serves as a useful case study applying all of these ideas of algorithmic thinking in the applied context of writing effective user defined functions for a computational social science task (Waggoner 2018).
```{r}
#' Calculate Herfindahl-Hirschman Index Scores
#'
#' @usage hhi(x, "s")
#' @param x Name of the data frame
#' @param s Name of the vector (variable) from the data frame, x, corresponding with market share values
#' @return hhi A measure of market concentration
#' @note The vector of "share" values == total share of individual firms (e.g., df$s <- c(35, 40, 5, 4))
#' @note 0 is perfect competitiveness and 10,000 is a perfect monopoly
hhi <- function(x, s){
if(!is.data.frame(x)) {
stop('"x" must be data frame\n',
'You have provided an object of class: ', class(x)[1])
}
shares <- try(sum(x[ ,s]))
if(shares < 100 | shares > 100) warning('shares, "s", do not sum to 100')
d <- x[ ,s]
if(!is.numeric(d)) {
stop('"s" must be numeric vector\n',
'You have provided an object of class: ', class(d)[1])
}
if (any(d < 0)) {
stop('vector "s" must contain only positive values')
}
hhi <- sum(d^2)
return(hhi)
}
```
With the function and its parameters defined, we can create some fake "firm" data, as well as the share of the market each firm retains, and then calculate the competitiveness (or concentration) of this hypothetical market.
```{r}
a <- c(1,2,3,4) # firm id
b <- c(20,30,40,10) # market share of each firm (totaling 100%)
x <- data.frame(a,b) # create data frame
hhi(x, "b")
```
On your own: I encourage you to alter the stored values and test out some of the function's warning and error messages. Then, feel free to go back and change or even add new error or warning messages (e.g., allow for a tibble *and* a data frame, perhaps).
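As a hedged starting point, here are a couple of example calls (reusing the fake data above) that should trip the messages we built in:
```{r}
# shares that do not sum to 100 should trigger the warning
b_bad <- c(20, 30, 40, 20)
x_bad <- data.frame(a, b_bad)
hhi(x_bad, "b_bad")
# passing something other than a data frame should trigger the stop() error
# hhi(b, "b")
```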
## Data Wrangling
Now, we are going to cover some core concepts for working with data in R.
Data "munging" or "wrangling" is the process of getting your data into the form you need for analysis (i.e., data management). Even though most of the datasets we use in this course will be cleaned given the large amount of ground we will cover in a relatively short period of time, these wrangling and functional programming techniques should still help in your own work.
Let's now work with some social science data, the [2016 American National Election Study](https://electionstudies.org/data-center/anes-2016-pilot-study/) (ANES or NES) pilot data. It has many features, including a battery of feeling thermometers (i.e., preferences on a scale from 0-100), which we will cover more later. Start by using the `here` package to load the data.
```{R eval=FALSE, echo=TRUE, message=FALSE, warning=FALSE}
library(here)
library(tidyverse)
NESdta <- read_csv(here("data", "anes_2016.csv"))
```
Most of the functions we will cover are from the `dplyr` package, which is one of the main components of the tidyverse, in line with the tidy philosophy of *consistent syntax*. We will cover the following functions:
* `select()`: choose specific variables
* `filter()`: filter data by values
* `group_by()`: group data by categorical values
* `summarise()`: create summaries of variables and data
* `mutate()`: create new variables
* `replace()`: change particular values within a variable
As with almost any dataset, the NES data has many more features than we could ever really plan on using in a single analysis. So, we might want to limit our set to include those of greatest interest. Using `select()`, let's create a new object called `NESdta_short`, which includes a subset of key variables/features.
```{R eval=FALSE, echo=TRUE, message=FALSE, warning=FALSE}
# Select particular variables
NESdta_short <- NESdta %>%
select(fttrump, pid3, birthyr, gender, ftobama, state, pid7)
head(NESdta_short,
n = 5)
```
This creates a new tibble (a tidyverse data frame) that only includes seven variables: `fttrump` - a "feeling thermometer" where people rate their feelings of then primary candidate Donald Trump from 0 to 100; `pid3` - a three point rating of political identity, where 1 means Democrat, 2 means Independent, and 3 means Republican; `birthyr` - the respondent's year of birth, which we will use to establish their age; the respondent's `gender`, which is 1 if male and 2 if female; the feeling thermometer for then-President Barack Obama (`ftobama`), again from 0 to 100; `state` - Census code for US states; and `pid7` - party ID on a seven point scale (e.g., "lean Republican").
`select()` can be combined with helper functions like `starts_with()`, `ends_with()`, and `contains()` to select more than one variable at a time. We can also use `:` to select a range of consecutive variables; for example, since all of the feeling thermometers start with the prefix `ft`, we could simply write the following.
```{r}
# Select using starts_with()
NESdta %>%
select(starts_with("ft"))
# Or...
# Selecting using a range (if we knew the natural order)
NESdta %>%
select(ftobama:ftsci)
# Or multiple features at a time
NESdta %>%
select(pid3, birthyr, gender, starts_with("ft"))
# Or remove a feature
NESdta_short <- NESdta_short %>%
select(-pid7)
NESdta_short
```
Next, `filter()` works similar to `select()`, but instead of selecting columns by their _names_, `filter()` allows you to select rows by their _values_. If, for example, we wanted to find only the respondents who gave Donald Trump the highest possible rating, we could do this and a bunch of other layered expressions using `filter()`.
```{r}
# Select only those respondents who give Trump a 100
NESdta_short %>%
filter(fttrump == 100)
# Or select only those respondents who give Trump a 100 and Obama a 1
NESdta_short %>%
filter(fttrump == 100 & ftobama == 1)
# Or select only those respondents who give Trump a 100 or give Obama a 100
NESdta_short %>%
filter(fttrump == 100 | ftobama == 100)
# Or combine and select only those respondents who give both Trump AND Obama 100s or 1s.
NESdta_short %>%
filter((fttrump == 100 & ftobama == 100) | (fttrump == 1 & ftobama == 1))
# Or select only those respondents who give Trump an approval rating greater than 50
NESdta_short %>%
filter(fttrump > 50)
# Or select those with higher than average approval of Trump
NESdta_short %>%
filter(fttrump > mean(fttrump, na.rm = TRUE))
```
You may have noticed the symbol in a few of the previous commands or in other work/classes, `%>%`. For those who might be unfamiliar, this is called a **pipe**, which is read as "then," and it was originally developed by Stefan Milton Bache for the R package `magrittr`. The pipe allows you to declare at the beginning the data on which you want to work and stack/pass a number of operations onto that data without having to declare the data you want to use each time you issue a command. This has many other benefits, but chiefly it allows your code to be more consistent and *modular*. This will come in handy for tons of stuff like merging code, fitting a model, visualizing results and so on, all in a single chunk of code.
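To make the contrast concrete, here is a small sketch of the same operation (an arbitrary cutoff of 50, as above) written first with nested calls and then with the pipe:
```{r}
# without the pipe: read inside-out
summarise(filter(NESdta_short, fttrump > 50),
          avg_fttrump = mean(fttrump, na.rm = TRUE))
# with the pipe: read top-to-bottom, "then"
NESdta_short %>%
  filter(fttrump > 50) %>%
  summarise(avg_fttrump = mean(fttrump, na.rm = TRUE))
```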
For example, to show how the pipe allows you to stack commands, let's look at the next two key munging functions: `group_by()` and `summarise()`.
Suppose we think Republicans will be most positively disposed to Donald Trump as a candidate, followed by Independents, and then by Democrats. We can use the `group_by()` function to tell R what groups we want to make, and follow this with the `summarise()` command to create the needed summaries.
```{r}
NESdta_short %>%
group_by(pid3) %>%
summarise(average_fttrump = mean(fttrump, na.rm = TRUE),
median_fttrump = median(fttrump, na.rm = TRUE),
variance_fttrump = var(fttrump, na.rm = TRUE),
stddev_fttrump = sd(fttrump, na.rm = TRUE),
max_fttrump = max(fttrump, na.rm = TRUE),
min_fttrump = min(fttrump, na.rm = TRUE),
n_fttrump = n()) %>%
ungroup()
```
Note that since we are done with these groups, we include `ungroup()` at the end to ensure that the grouping does not persist past these commands.
The `group_by()` command is also useful in a number of other conditions. Suppose we are interested in finding the respondents of each gender who give Trump a higher-than-average score relative to other people of the same gender. In this case, we can group respondents by gender and then `filter()` to respondents who rate higher than the average in *each* group.
```{r}
# Filter data to only include those who give Trump higher than average approval by gender.
NESdta_short %>%
group_by(gender) %>%
filter(fttrump > mean(fttrump, na.rm = T))
# Or, summarizing mean approval of Trump by party and gender
NESdta_short %>%
group_by(pid3, gender) %>%
summarise(average_fttrump = mean(fttrump, na.rm = TRUE),
n_fttrump = n()) %>%
ungroup()
```
Another task that you will often find yourself doing is creating new features. The tidy approach for this is `mutate()`. Start with a simple transformation. `birthyr` does not directly represent the concept we really want, *age*. To get `age`, we need to create a new variable that calculates the age of a respondent by subtracting their birth year from the year of the survey.
```{r}
# Create a new variable giving the respondent's age
NESdta_short <- NESdta_short %>%
mutate(age = 2016 - birthyr)
# Or create a new variable giving the square of respondent's age
NESdta_short <- NESdta_short %>%
mutate(age2 = age^2)
# Now, get rid of it
NESdta_short <- NESdta_short %>%
mutate(age2 = NULL)
```
We just summarized support for then-candidate Donald Trump by party affiliation. But what if we want these summaries to be a part of the `NESdta_short` dataset? `mutate()` helps out with this too, along with redefining our data object, `NESdta_short`.
```{r}
# Creating a new variable using group_by() and mutate()
NESdta_short <- NESdta_short %>%
group_by(pid3) %>%
mutate(average_fttrump = mean(fttrump, na.rm = TRUE)) %>%
ungroup()
# Or calculate deviation from the mean
NESdta_short <- NESdta_short %>%
mutate(deviation_fttrump = fttrump - average_fttrump)
```
When you inspect the output, note that while the feeling thermometers for Donald Trump and Barack Obama are only supposed to go from 0 to 100, the summary statistics said the maximum values were `998`. Many datasets try not to leave blank spaces or mix strings and numeric values. So, instead, they represent missing values by highly improbable numeric values, e.g., `998` or `-999`. We need to change these to `NA` instead, which we will handle last in today's class when we get to missing data and feature engineering.
To do so, we will combine `mutate()` with `replace()`. `replace()` takes three values as input: the first is the variable on which we are making the replacement, the second is a logical test, and the third is the replacement value. The logical test can be read as, "Where the variable is..." For example, the second part of the first replacement asks it to make the replacement where the variable `fttrump` is greater than 100, and the third part supplies `NA` as the replacement. As you can see, within the `mutate()` function, we have asked for our original variable to be equal to the specified replacement (i.e., we have _redefined_ the original variable to drop these nonsensical values).
```{r}
# Using replace() to recode values
NESdta_short <- NESdta_short %>%
mutate(fttrump = replace(fttrump, fttrump > 100, NA),
ftobama = replace(ftobama, ftobama == 998, NA))
glimpse(NESdta_short)
skimr::skim(NESdta_short) # tidy summary
```
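As a quick check that the recode worked (a minimal sketch; nothing about it is required), the maximum values should now be back on the 0-100 scale:
```{r}
NESdta_short %>%
  summarise(max_fttrump = max(fttrump, na.rm = TRUE),
            max_ftobama = max(ftobama, na.rm = TRUE),
            n_missing_fttrump = sum(is.na(fttrump)))
```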
Another variable we will likely want to change is the `state` variable. Right now, it is using Census codes that represent the states, but we will probably want strings with the state names as well. We can look up the numbers associated with each state in the ANES (which I have done for you... *you're welcome*) and create a new variable called `state_name` that contains the character name of the state. To do so, we will use `case_when()`, which allows us to change a large number of values within a variable.
```{r}
# Create a new variable called state_name with the string names of states
NESdta_short <- NESdta_short %>%
mutate(state_name = case_when(state == 1~"Alabama",
state == 2~"Alaska",
state == 4~"Arizona",
state == 5~"Arkansas",
state == 6~"California",
state == 8~"Colorado",
state == 9~"Connecticut",
state == 10~"Delaware",
state == 11~"District of Columbia",
state == 12~"Florida",
state == 13~"Georgia",
state == 15~"Hawaii",
state == 16~"Idaho",
state == 17~"Illinois",
state == 18~"Indiana",
state == 19~"Iowa",
state == 20~"Kansas",
state == 21~"Kentucky",
state == 22~"Louisiana",
state == 23~"Maine",
state == 24~"Maryland",
state == 25~"Massachusetts",
state == 26~"Michigan",
state == 27~"Minnesota",
state == 28~"Mississippi",
state == 29~"Missouri",
state == 30~"Montana",
state == 31~"Nebraska",
state == 32~"Nevada",
state == 33~"New Hampshire",
state == 34~"New Jersey",
state == 35~"New Mexico",
state == 36~"New York",
state == 37~"North Carolina",
state == 38~"North Dakota",
state == 39~"Ohio",
state == 40~"Oklahoma",
state == 41~"Oregon",
state == 42~"Pennsylvania",
state == 44~"Rhode Island",
state == 45~"South Carolina",
state == 46~"South Dakota",
state == 47~"Tennessee",
state == 48~"Texas",
state == 49~"Utah",
state == 50~"Vermont",
state == 51~"Virginia",
state == 53~"Washington",
state == 54~"West Virginia",
state == 55~"Wisconsin",
state == 56~"Wyoming"))
```
A final note: you might have noticed the double equal sign, `==`, which is used for comparison. It returns a value of `TRUE` if the item on the left-hand side is equal to the item on the right-hand side, and `FALSE` otherwise. A type of logical command you will find yourself using a lot is `ifelse(condition, outcome if true, outcome if false)`. For example, re: the `gender` variable, suppose we want to recode gender (currently 2 = female and 1 = male) to be more descriptive and also on the more common `{0,1}` scale. Using `mutate()` and `ifelse()` (from base R), we create a new `female` feature.
```{r}
NESdta_short <- NESdta_short %>%
mutate(female = ifelse(gender == 2, 1, 0))
sample_n(NESdta_short,
size = 5)
```
## Key Functions for Modeling
Here are some key functions and concepts that especially come in handy for modeling tasks. Start with visualization: to bring this into our framework, and to keep everyone on the same page, `ggplot2` is going to be your best friend, allowing for excellent, flexible visualizations. For example, we can take a look at a histogram of feelings toward Obama (`ftobama`).
```{R eval=FALSE, echo=TRUE, warning=FALSE}
NESdta_short %>%
ggplot() +
geom_histogram(aes(x = ftobama),
binwidth = 2,
color = "white",
fill = "steelblue") +
theme_minimal() +
labs(x = "Feelings toward Obama (0-100)",
y = "Count of Respondents",
title = "Feelings toward Obama in 2016")
```
Next is the ability to map or iteratively pass functions to values stored in vectors. We will use the `purrr` package for this if we are interested in bypassing the use of loops.
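As a tiny warm-up with made-up values, `map()` applies a function to each element of a vector (or list) and returns a list:
```{r}
library(purrr) # also attached via the tidyverse
# apply sqrt() to each element of a small vector; the result is a list
map(c(1, 4, 9), sqrt)
```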
```{r}
# first create a narrower subset
NESdta_small <- NESdta_short %>%
select(female, fttrump, birthyr)
# inspect a small random sample
sample_n(NESdta_small, 5)
```
With our smaller subset created, let's now `split()` the data into groups, to more easily demonstrate `map()`.
```{r}
fems <- split(NESdta_small,
NESdta_small$female)
fems
# now map the base R function `head()` to the split object to inspect the first 2 observations in each group
map(fems, head, n = 2)
# Or, explore how many rows in each group
map(fems, nrow)
```
Importantly, the basic `map()` function will always return a list, which can hold elements of any type. If you are working with a specific type of data, or want the map to return a specific type, there is a series of more specific map functions, `map_*()`, that can be used. For example, the raw count of rows is an integer, so we could use `map_int()` to get the same result while ensuring it is returned as an integer vector rather than a list.
```{r}
map_int(fems, nrow)
# double check to make sure the returned value is an integer, not a list
is.integer(map_int(fems, nrow))
is.integer(map(fems, nrow))
```
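In the same spirit, and as a sketch using the `fems` split from above, `map_dbl()` returns a double vector, which is handy for group summaries like means:
```{r}
# mean Trump thermometer score within each split group
map_dbl(fems, ~ mean(.x$fttrump, na.rm = TRUE))
```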
Now for a few extensions of `map()`, which will come in handy in the coming days.
Rather than using `split()` from base R, the tidyverse version of this is to "nest" via `nest()` or `unnest()`. We might be interested in using `nest()` by female to return a data frame into a so-called "list-column" (thanks to Jenny Bryan; see the package documentation for `purrr` for more), which can host multiple data types in a single vector. Basically, this means we can nest an entire data frame, e.g., within a row or column.
First, create a new object, `new_nes`, with tibbles for each level of `female`. In substantive terms, the syntax means that we want one row for each level of `female` (two rows total), with a data frame/tibble of the other two features in the `NESdta_small` data set nested within each row.
```{r}
(new_nes <- nest(NESdta_small, -female))
```
We can complicate this by adding a new feature (or many features if you want): number of rows (`n_row`) for each level. To reverse, we can simply `unnest()`.
```{r}
new_nes %>%
mutate(n_row = map_int(data, nrow))
# Now, reverse via unnest()
unnest(new_nes, data)
```
Finally, let's take a look at some key functions for splitting data into training and testing sets. There are many ways to do this. We will cover a few: `tidymodels` (via `rsample`), `caret`, and base R.
First, `tidymodels`.
```{r}
library(tidymodels)
library(tidyverse)
set.seed(1234)
split_tidy <- initial_split(NESdta_short,
prop = 0.8) # default is 75%
# take a look...
split_tidy
# create
train_tidy <- training(split_tidy)
test_tidy <- testing(split_tidy)
# double check
nrow(train_tidy)/nrow(NESdta_short)
# viz
tidy <- train_tidy %>%
ggplot(aes(x = fttrump)) +
geom_line(stat = "density") +
geom_line(data = test_tidy,
stat = "density",
col = "red") +
labs(title = "Via Tidymodels") +
theme_minimal()
```
Now, let's try with `caret`.
```{r}
library(caret)
set.seed(1234)
# need to drop NAs for caret (also, can't read tibbles when indexing)
NESdta_short <- NESdta_short %>%
na.omit() %>%
as.data.frame()
split_caret <- createDataPartition(NESdta_short$fttrump,
p = 0.8,
list = FALSE)
# then use basic index notation for creating the splits
train_caret <- NESdta_short[split_caret, ]
test_caret <- NESdta_short[-split_caret, ]
# viz again
caret <- train_caret %>%
ggplot(aes(x = fttrump)) +
geom_line(stat = "density") +
geom_line(data = test_caret,
stat = "density",
col = "red") +
labs(title = "Via Caret") +
theme_minimal()
```
Finally, let's see base R.
```{r}
set.seed(1234)
split_baseR <- sample(nrow(NESdta_short),
size = 0.8*nrow(NESdta_short))
train_baseR <- NESdta_short[split_baseR, ]
test_baseR <- NESdta_short[-split_baseR, ]
# viz again
base_r <- train_baseR %>%
ggplot(aes(x = fttrump)) +
geom_line(stat = "density") +
geom_line(data = test_baseR,
stat = "density",
col = "red") +
labs(title = "Via Base R") +
theme_minimal()
```
Now, we can view side by side. Should look pretty similar, though not exact given that we are randomly defining these sets.
```{r}
library(patchwork)
(tidy + caret + base_r)
```
## Simulations
A key aspect of working with data is the use of simulations, which are, in a sense, more explicit versions of the mapping we covered previously.
Simulations are incredibly powerful tools allowing for creation of virtually any context you wish, e.g.,
- testing theories under alternate constraints
- distributional assumptions
- real vs. predictions/forecasts, etc.
Let's start with the `do()` approach and a hypothetical baseball question and then move to the more widely used and flexible `for` loop.
Many factors have been said to affect a baseball player's performance at the plate (i.e., the number of hits a player gets). Throughout a single season consisting of 6 months, most players will range from doing well to doing poorly at the plate, with few players ever performing at a high level for the *duration* of such a long season. The reality of the extreme difficulty of hitting baseballs that travel around 95 miles per hour across only 60 feet and 6 inches (the distance from the pitcher's mound to home plate) makes a lot of people interested in forecasting future hits, or how players will perform over the course of the season.
> Thus, the problem becomes: *can we predict monthly hits for the average player in the context of an entire season?*
Despite the many models out there that assert they have figured out how to accurately forecast this quantity of interest, it turns out that player hits probably look more like random draws from a Gaussian distribution, with a mean number of hits per season for the average player around 165, a monthly average around 28 hits, and a wide standard deviation of about 12, given the extreme difficulty of hitting baseballs as well as the many other factors that go into player performance. In short, average player hits per month are *incredibly hard to forecast*, as they are virtually random and also vary widely.
To make the point that you basically can't predict player hits with any degree of reliability, we will use a simulation to produce several hypothetical seasons' worth of hits for the average player, given these known truths (again, *this is hypothetical in the specifics, but probably mirrors the reality of forecasting baseball hits*). Specifically, we will use the information we have -- mean player hits, hits standard deviation, and a Gaussian data generating process -- to make predictions about the future.
The punch line: **the predicted hits vary so widely, it's virtually impossible to predict player hits with any degree of reliability or certainty**. But let's see this in action.
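Before the full machinery, here is a quick intuition check, a minimal sketch using the stated monthly mean (about 28 hits) and standard deviation (about 12): one hypothetical season is just six monthly draws from a normal distribution.
```{r}
set.seed(123)
# six months of hits for one hypothetical "average" season
round(rnorm(6, mean = 28, sd = 12))
```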
Quick caveat: `do()` is on its way out, as it basically maps functions to other values, so it is not too commonly used anymore. But it's useful for showing a tidy-friendly approach to simulation, which will hopefully help `for` loops make more sense.
```{R eval=TRUE, echo=TRUE, warning = FALSE, message = FALSE}
library(tidyverse)
# constants
hits_true <- 28 # ~165/season
hits_se <- 12 # "true" standard deviation
n_sims <- 5 # number of times to simulate
# function for calculating a season's worth (6 months) of monthly hit samples based on true values above
baseball_fun <- function(true_mean, true_sd, months = 6, mu = 0, Iteration = 1) { # takes 5 inputs
season <- rep(true_mean, months) + rnorm(months, # simulate draws from a normal dist
mean = mu * (1:months), # allows means to vary around the true population mean + some randomness
sd = true_sd) # sets variance of each simulated season
return(data.frame(hits_number = season, # df to store our simulated values
month = as.factor(1:months),
Iteration = Iteration))
}
# now create a df for parameters for sims (variance around mean = sd; and sim number)
#
baseball <- data.frame(sd = c(0, rep(hits_se, n_sims)), # allows us to apply the SD to each iteration of the season --> 0 for "true" because we know that value already and don't want to sample it
Iteration = c("True (Average) Season", paste("Simulated Season", 1:n_sims)))
# simulate
set.seed(123)
baseball_simulation <- baseball %>%
group_by(Iteration) %>%
do(baseball_fun(true_mean = hits_true,
true_sd = .$sd, # this and iteration come from placeholder dataframe (hence the pipe)
Iteration = .$Iteration))
# viz
baseball_simulation %>%
ggplot(aes(x = month,
y = hits_number,
color = Iteration, # "season"
fill = Iteration)) +
geom_hline(yintercept = hits_true,
linetype = 2) +
geom_bar(stat = "identity") +
facet_wrap(~ Iteration) +
labs(y = "Count of Hits",
x = "Month") +
theme_minimal()
```
Sure enough, simulating only 5 seasons based on the ground truth data we started with, there seem to be *no clear patterns*, suggesting that player hits are pretty random and tough to predict.
Now let's give simulations a try with `for` loops.
Suppose we drew a random sample of 50 respondents' self-reported political ideologies on a 7 point scale, where 1 was extremely liberal and 7 was extremely conservative. The mean of that sample was `3.32`, suggesting the average person in this sample sees herself as generally moderate, or in the middle of the distribution of political ideology.
Here is the code setting this up:
```{r}
sample_ideology <- c(3, 1, 2, 4, 4, 6, 1, 3, 2, 6,
1, 7, 3, 1, 4, 3, 4, 4, 1, 6,
7, 5, 7, 1, 1, 3, 2, 4, 1, 7,
1, 2, 1, 4, 6, 3, 2, 3, 1, 4,
1, 6, 3, 4, 5, 4, 1, 7, 2, 2)
mean(sample_ideology)
```
Now, suppose we wanted to simulate the reported sample ideology to see whether this random sample was reflective of the broader American population. However, we aren't exactly sure how many times to do this such that it will accurately reflect the population of interest.
To get traction on this question, the central limit theorem and law of large numbers can help us out. To see how the *shape* of the distribution changes over many samples (central limit theorem) and how the *location* of these distributions changes (law of large numbers), we can use a series of `for` loops, and **plot the different distributions to see when and where (and whether) the samples converge on the underlying population**.
`for` loops allow for iterating some calculation or function over many different observations. This is a simulation.
The syntax of `for` loops is similar to functions, where they begin with "for" and then start with some value in a sequence in parentheses. Then, within the braces, there is a statement to be evaluated.
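Here is a minimal sketch of that syntax (just squaring a few numbers) before we scale things up:
```{r}
for (i in 1:3) {
  print(i^2) # the statement inside the braces runs once per value of i
}
```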
For each chunk below, we start by creating an empty vector in which to store our simulated values (e.g., `sm1`). We then loop sampling with replacement based on the initially-drawn sample (`sample_ideology`), and then take the mean of that iteration and store the mean in the empty vector, `sm*`. We will then plot each simulation and compare side by side via another tidyverse-friendly package, `patchwork`.
```{r}
library(tidyverse)
# N = 5
sm1 <- rep(NA, 5)
for(i in 1:5){
samp <- sample(sample_ideology, 30, replace = TRUE)
sm1[i] <- mean(samp)
}
# N = 20
sm2 <- rep(NA, 20)
for(i in 1:20){
samp <- sample(sample_ideology, 30, replace = TRUE)
sm2[i] <- mean(samp)
}
# N = 50
sm3 <- rep(NA, 50)
for(i in 1:50){
samp <- sample(sample_ideology, 30, replace = TRUE)
sm3[i] <- mean(samp)
}
# N = 100
sm4 <- rep(NA, 100)
for(i in 1:100){
samp <- sample(sample_ideology, 30, replace = TRUE)
sm4[i] <- mean(samp)
}
# N = 500
sm5 <- rep(NA, 500)
for(i in 1:500){
samp <- sample(sample_ideology, 30, replace = TRUE)
sm5[i] <- mean(samp)
}
# N = 1500
sm6 <- rep(NA, 1500)
for(i in 1:1500){
samp <- sample(sample_ideology, 30, replace = TRUE)
sm6[i] <- mean(samp)
}
# N = 3500
sm7 <- rep(NA, 3500)
for(i in 1:3500){
samp <- sample(sample_ideology, 30, replace = TRUE)
sm7[i] <- mean(samp)
}
# N = 7000
sm8 <- rep(NA, 7000)
for(i in 1:7000){
samp <- sample(sample_ideology, 30, replace = TRUE)
sm8[i] <- mean(samp)
}
# Now plot each simulation
library(patchwork)
p1 <- quickplot(sm1, geom = "histogram", main = "N=5", bins = 30) +
theme_minimal() +
geom_vline(xintercept = 3.32, linetype="dashed", color = "red")
p2 <- quickplot(sm2, geom = "histogram", main = "N=20", bins = 30) +
theme_minimal() +
geom_vline(xintercept = 3.32, linetype="dashed", color = "red")
p3 <- quickplot(sm3, geom = "histogram", main = "N=50", bins = 30) +
theme_minimal() +
geom_vline(xintercept = 3.32, linetype="dashed", color = "red")
p4 <- quickplot(sm4, geom = "histogram", main = "N=100", bins = 30) +
theme_minimal() +
geom_vline(xintercept = 3.32, linetype="dashed", color = "red")
p5 <- quickplot(sm5, geom = "histogram", main = "N=500", bins = 30) +
theme_minimal() +
geom_vline(xintercept = 3.32, linetype="dashed", color = "red")
p6 <- quickplot(sm6, geom = "histogram", main = "N=1500", bins = 30) +
theme_minimal() +
geom_vline(xintercept = 3.32, linetype="dashed", color = "red")
p7 <- quickplot(sm7, geom = "histogram", main = "N=3500", bins = 30) +
theme_minimal() +
geom_vline(xintercept = 3.32, linetype="dashed", color = "red")
p8 <- quickplot(sm8, geom = "histogram", main = "N=7000", bins = 30) +
theme_minimal() +
geom_vline(xintercept = 3.32, linetype="dashed", color = "red")
# piece ggplot objects together with the patchwork package
p1 +
p2 +
p3 +
p4 +
p5 +
p6 +
p7 +
p8
```
## R markdown (`.Rmd`)
Now that we know how to do some basic tasks in R, let's talk about R markdown a bit. Basically, R markdown is a document format and generator that allows for well-formatted documents with code and output directly inserted. This document was created with R markdown. Perhaps the biggest value of R markdown is replication: the document will not compile if the inserted code has errors or is missing something, even something as small as a single comma in an expression. So while often frustrating, R markdown (like other notebook and literate-programming tools, e.g., Jupyter Notebooks) ensures that whatever is written in the document can be successfully run and re-run and re-run...
And you can (and should) use R markdown from *within* R Studio. To start a new `.Rmd` file, first select the icon for creating a new document in the upper left portion of the screen. Then, in the drop-down menu, select `R Markdown...`. It will populate the editor pane with a template including some basic sample code. And you're off!
Once you have finished your document and would like to format, simply select `Knit` at the top menu bar in R Studio. You can either select the type of output document (PDF, HTML, etc.) or prespecify in the metadata at the outset of the document, e.g., `output: html_document` or `output: pdf_document` or `output: word_document`, and so on.
**You will be required to submit assignments in this class via some replicable format**. The easiest way is an R markdown document with code and evaluations directly inserted. We will get to this more throughout the Quarter.
### Exploring R Markdown
Though there is much more to R markdown syntax, we will cover just a few basics here.
First, formatting text. Though less intuitive than Microsoft Word, at least formatting in R markdown is more straightforward than formatting text in LaTeX. To italicize, either use `*` or `_` before and after the *section to be emphasized*. For bold, use `**` before and **after**. For footnotes, use `^` with the following text in brackets, like^[this]. And to strike through, use `~~` before and after the text to ~~strike through~~.
Second, ordered and unordered lists, and block quotes. Lists are simple. For unordered, use either `*` or `-` and then a space, and then the text (if no space, it will think you mean to italicize). For ordered lists, simply write the number followed by a period and a space and then start writing. For nested lists, indent once (or twice) on the next line and use `+` followed by a space.
An unordered list:
* item 1
+ sub
+ sub
* item 2
* item 3
+ sub
And an ordered list:
1. item 1
+ sub
+ sub sub
+ sub
2. item 2
3. item 3
+ sub
You can also include block quotes within text using `>`. For example, here is a line of text that is in my document. But now we should break to emphasize a great quote:
> Here is the great quote. Notice also that when you start a block quote in R markdown, the text appears green in the RStudio editor.
Third, code chunks. R markdown allows you to directly insert (and thus evaluate) code via "code chunks", both in R as well as a variety of other programming languages into the document. To start a chunk, type ``` and then the name of the language in braces, e.g., R. First, of course, we have R,
```{R echo = FALSE, eval = FALSE}
print("Hello, World!")
```
You can also insert bash (shell/UNIX) code,
```{bash echo = FALSE, eval = FALSE}
echo "Hello, world!"
```
Or Python,
```{python echo = FALSE, eval = FALSE}
print('Hello, world!') # same as R
```
Or C++,
```{Rcpp echo = FALSE, eval = FALSE}
#include <iostream>
using namespace std;
int main()
{
cout << "Hello, World!";
return 0;
}
```
And so on...
Finally, equations, links, and tables. In line equations are denoted by `$` like in LaTeX, as TeX is also supported by R markdown. And also like TeX, use `\` to call mathematical symbols or Greek letters.^[Here is a great resource for mathematical operators, Greek letters, etc. in R markdown: <https://www.calvin.edu/~rpruim/courses/s341/S17/from-class/MathinRmd.html>] For example, `$\sum_{i=1}^{n} \lambda (x_i-\bar{x})$` gives you $\sum_{i=1}^{n} \lambda (x_i-\bar{x})$. Use double `$$` to separate equations from text and center them, e.g., $$\sum_{i=1}^{n} \lambda (x_i-\bar{x})$$
To link to external sites, put the linked text in brackets, `[]`, followed immediately by the url in parentheses `()`. So, e.g., `[R Studio](https://www.rstudio.com/)` produces [R Studio](https://www.rstudio.com/).
And for tables, simply use `-----` for rows and `|` for columns. For example,
here | is | a | table | with | seven | columns
-----|----|----|-----|----|----|----
and | it | also | has | three | rows | ...
here | is | the | third | row | ...| ...
## Feature Engineering for Missing Data
The concept of *feature engineering*, though fancy sounding, really just means to create or transform features for some reason or future task. The **reason** could be overcoming limitations in data, and the **future task** could be for a modeling requirement. Learning how to approach contexts necessitating feature engineering at some level requires intimately knowing and thinking carefully about the structure of your data. These skills will serve you well in *any* computational modeling context.
More specifically for our purposes, given the many contexts requiring unique feature engineering solutions, we will only scratch the surface here by covering feature engineering in the context of missing data. This is meant to get you started down the right path when you encounter one of the most common and most frustrating issues in working with social science data. Even still, I strongly encourage you to think about thorough data exploration and feature engineering from the start of *any* future project.
With that, and to conclude today's content, we will cover the following couple of topics. Some of this discussion was adapted from the excellent Kuhn and Johnson 2019 [book](http://www.feat.engineering/) on feature engineering and selection, which you were to skim for today's class. I **strongly** encourage you to dive deeper into that book and the supporting code. Both will serve you as excellent resources for a long time.
* Diagnosing & Working with Missing Data
* Essential Feature Engineering for Missing Data
If left undetected or unaddressed, missing data can plague any project. And in the social sciences, this is an extremely common occurrence as we are usually dealing with humans, surveys, and other data structures produced by flawed people or institutions. Compared to other realms of data production like computational statistics/simulated data that often result in complete datasets, the social sciences must anticipate that at least some of the data they collect or use will be missing or flawed in some important way.
Yet, though frustrating, missing data is certainly *not* an insurmountable obstacle.
We will cover the basics of how to diagnose and address missing data, as a common step prior to the "engineering" part of feature engineering. Although missing data and feature engineering are often addressed together as we are doing today, these two don't necessarily have to go hand-in-hand. For example, we may detect missing data, but leave it unaddressed for some project-specific reason. Or alternatively, we may be interested in engineering new features from data, but entirely separate from the question of missingness. Still, researchers most commonly correct for missing data using feature engineering techniques. So we will progress accordingly.
We will first discuss common types of missing data, and thus how best to think about missingness. This step will help with developing steps needed to address the issue. Then, with the structure of missingness in mind, we will proceed to exploring (mostly visually) patterns of missingness. Intimately tied with these two steps, we will conclude with discussion and application of some common solutions to address missing data.
**Structure**
In thinking about the structure of missingness in data, we are really interested in uncovering the *cause* behind the missingness. This revelation will allow for more targeted feature engineering solutions. Some of the more common causes of missingness in data include:
1. Recording errors
2. Unique instances
3. Missing at random; *See more discussion of the nuance of types of missing data at random in the Little and Rubin (2019) paper*
The first cause of missingness has to do with errors in measurements of the original data collection effort. This can include structural error on the part of the data collectors, or it could be due to something as simple and clerical as data that got lost during the translation of field records, e.g., digitization. Regardless of how this happened, a fix for this type of missingness is entirely dependent on the scope. Solutions could range from going back out in the field to correct the errors, or something like encoding categories (more in a bit). Either way, it's important to first *inspect* the data to understand whether this type of missingness is likely or not.
The second cause of missingness is case-based, where unique observations are missing recorded values, but at a smaller scale than the previous cause. The difference in scope will translate to a difference in solutions, as we will see throughout this course. This is what Little and Rubin (2019) call NMAR ("not missing at random"; descriptive and creative to be sure...). These are often the most difficult cases to address, because there is some structural, underlying phenomenon that has led to the data being missing in some detectable pattern. Simple imputation methods may or may not suffice. Even if the values could be imputed, there is no guarantee that they reflect any substance about the individual observation, because that value was, at some level, *deterministically* missing. In such cases, documenting the pattern and how you handled it is honest reporting of your process, and also acts as a signal that you are ethically and carefully thinking about the underlying data generating process (as opposed to flippantly manipulating data, assuming all is well, when perhaps it is not).
The final major cause of missing data is data that are missing at random. This is often (and generally) considered the easiest type of missingness to correct for, mostly because there is no underlying driver behind the missingness. In the missing *completely* at random version, the probability of being missing is the same for all observations in the set. This is compared to the partially missing at random version, in which the pattern of missingness depends on the sampled (and thus observed) data, but not on the unobserved data from the underlying population.
In short, always think very carefully about your data, and *never ignore missing cases*. If you do ultimately decide to leave missing data as is in the full set, be prepared to report on your process of reaching this decision.
**Exploring missingness**
Supposing we have carefully thought about what could be driving these patterns at a more abstract level, we would now be interested in exploring and inspecting our data to see exactly how these patterns take shape.
As a first step, I strongly recommend getting right to the point and inspecting where missing values are concentrated. To do so, I really like the `naniar` package, authored and contributed to by some excellent programmers and statisticians (e.g., Hadley Wickham, Di Cook, Nick Tierney, and so on). The reason I use this package as a starting place for much of my work is captured succinctly in the package description:
> `naniar` is a package to make it easier to summarise and handle missing values in R. It strives to do this in a way that is as consistent with tidyverse principles as possible.
Let's walk through some of the basics of exploration via `naniar`. To do so, we will use the `congress` data, which includes legislator-level features from the 109th and 110th Congresses (i.e., 2005-2009).
```{r}
library(naniar)
library(tidyverse)
library(here)
congress <- read_csv(here("data", "congress_109_110.csv"))
# cumulative missingness over cases/observations
congress %>%
gg_miss_case_cumsum()
# missing data by feature
congress %>%