---
title: "A (Very Brief) Introduction to R and .Rmd"
author: |
| Philip Waggoner
| University of Chicago
output: pdf_document
urlcolor: blue
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
```
## Introductory Remarks
In this document, you will get a very high-level, quick introduction to the R language and environment. Though we will mostly focus on "Base R", we will touch on other approaches to data management and analysis in R, notably the Tidyverse. More on that later. So today, our crash course will cover:
1. R and R Studio
2. Functional Programming
3. Tidy Data Wrangling
4. Key Functions for Modeling
5. Simulations (via `do()` and `for`)
6. R Markdown
7. Feature Engineering for Missing Data
## Basics of R and RStudio
Let's look at the basics of getting into R via RStudio, mostly for the benefit of those who have never used R or RStudio (this is, again, per the course goals, to ensure everyone is on the same page moving forward). This will be a two-minute demo. Reach out if questions persist.
## Functional Programming
For the basis of our discussion, we will walk through writing user-defined functions (hereafter, *functions*). We will go through the following relatively quickly. But for those less familiar, I have included a lot more details throughout the document, much of which is the same in my book, which was recommended reading for the day (see the course files on Canvas for the free PDF of the book).
*Conditional logic* is the foundation of functions. The expressions `if` and `if else` are key building blocks. The syntax starts with `if`, and then a value to be tested is supplied in parentheses, followed by braces, which include the statement to be expressed.
A simple case: evaluating whether a supplied value is positive or not.
```{r}
x <- -1
if(x > 0){
print("Positive number")
}
```
A simple case with `if else`.
```{r}
x <- -5
if(x > 0){
print("Positive")
} else {
print("Negative")
}
```
Redefine x yourself and give it a try...
```{r}
x <- 5 # e.g...
if(x > 0){
print("Positive")
} else {
print("Negative")
}
```
Though simple, `if` and `if else` are core to writing functions and programming in R. But note, we haven't written a function yet, in case that was unclear. So let's do that now.
Functions operate on the same principle of preferring `sum(2,2,2,2)` in base R to the more laborious `(2 + 2 + 2 + 2)`. E.g., rather than typing: `(3^2) # run`, `(4^2) # run`, `(5^2) # run`, and so on, a function would streamline this process significantly, thereby *engaging* in algorithmic thinking. The syntax starts with defining a new object, and then specifying the function with an argument supplied in parentheses. Then, within braces, there is a statement to be evaluated, and the result is returned as output.
Here is a simple function for squaring some value. The problem is we need squared values, and the solution requires us to calculate their square and then return the squared value. We store this in a function to allow for replication, updating, and so on; i.e., we are explicitly adopting *algorithmic thinking* in a programming framework.
```{r}
sq <- function(x) {
sqn <- x^2
return(sqn)
}
```
With the function defined by the user (hence the name), we can call the function to see if it worked properly.
```{r}
sq(2)
```
We can update the function to take both x *and* y values this time. Instead of just squaring our supplied value, we are now allowing for raising any value to any power. As such, we change the name of the original function from `sq` to `exp` for "exponent." (Note that this masks base R's built-in `exp()` function for the rest of the session; choose a different name if you want to avoid that.)
```{r}
exp <- function(x, y) {
expn <- x^y
return(expn)
}
```
Try a simple case: raise 2 to the power of 4.
```{r}
exp(2, 4)
```
We can make it more descriptive using both `print()` and `paste()`, the latter of which allows us to paste words together to be returned along with our output, which is especially useful when writing R packages.
```{r}
exp <- function(x, y) {
expn <- x^y
print(paste(x,"raised to the power of", y, "is", expn))
return(expn) # optional
}
```
We can also assign default values, which are values you don't have to specify but *could* change if you want; in other words, a default makes that argument *optional*. Note, we are continuing to redefine our `exp` function from earlier. If you wanted to leave the original intact, you would simply need to change the function object's name to the left of the assignment operator, `<-`.
```{r}
exp <- function(x, y = 2) {
expn <- x^y
print(paste(x,"raised to the power of", y, "is", expn))
}
```
From here, we can make a few calls to the function to see everything come together.
Try with one value...
```{r}
exp(3)
```
Now with two values...
```{r}
exp(3,1)
```
Now, let's do something a little more useful: write a function to calculate temperature in Celsius, given a supplied Fahrenheit value. Note the argument name inside the function definition doesn't matter, e.g., `x` or `f` in this case; just make sure to use the name consistently in the statement to be evaluated within the function's body.
```{r}
celsius <- function(f) {
c <- ((f - 32) * 5) / 9
return(c)
}
```
With the function defined, we can either supply individual Fahrenheit values, or a vector of Fahrenheit values; the function can handle both. Let's store a vector of Fahrenheit values in the object `fahrenheit` and test out the function (**note**: if we supply a vector of values, we should get a vector of values returned as output).
```{r}
fahrenheit <- c(60, 65, 70, 75, 80, 85, 90, 95, 100)
celsius(fahrenheit)
```
Try it with a single value to see for yourself.
```{r}
celsius(4)
```
From here, we can include all kinds of innovations to complicate a function, e.g., layering statements to do different things, defensive coding (catching bugs and helping locate those bugs or violations of the statements), and so on. Here is a quick overview of some of these extensions.
Combine several logical statements into a single function.
```{r}
# define
pnz <- function(x) {
if (x > 0) {
n <- "Positive"
}
else if (x < 0) {
n <- "Negative"
}
else {
n <- "Zero"
}
return(n)
}
# call
pnz(4)
pnz(-3)
pnz(0)
```
Next, we can also debug, which is a key piece of writing *defensive* code. For example, we can tell a function to `stop` if something in the code is wrong or missing, or we could print `warning` messages if things aren't where they should be, but we don't want to stop the function entirely and throw an error message.
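For instance, here is a minimal sketch of both ideas (the function name `safe_sq` is purely for illustration): it stops on non-numeric input and warns, without halting, when values are missing.
```{r}
safe_sq <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be numeric")        # halt with an error message
  }
  if (any(is.na(x))) {
    warning("x contains NA values")  # keep going, but alert the user
  }
  x^2
}
safe_sq(3)        # returns 9
safe_sq(c(2, NA)) # returns values, plus a warning
# safe_sq("two")  # would stop with an error
```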
So, let's put all of these pieces together that we have learned so far and replicate a function to calculate the Herfindahl-Hirschman Index, which is a measure of concentration (often used as a proxy for competition). This is from an R package I wrote, [`hhi`](https://CRAN.R-project.org/package=hhi), and serves as a useful case study applying all of these ideas of algorithmic thinking in the applied context of writing effective user defined functions for a computational social science task (Waggoner 2018).
```{r}
#' Calculate Herfindahl-Hirschman Index Scores
#'
#' @usage hhi(x, "s")
#' @param x Name of the data frame
#' @param s Name of the vector (variable) from the data frame, x, corresponding with market share values
#' @return hhi A measure of market concentration
#' @note The vector of "share" values == total share of individual firms (e.g., df$s <- c(35, 40, 5, 4))
#' @note 0 is perfect competitiveness and 10,000 is a perfect monopoly
hhi <- function(x, s){
if(!is.data.frame(x)) {
stop('"x" must be data frame\n',
'You have provided an object of class: ', class(x)[1])
}
shares <- try(sum(x[ ,s]))
if(shares < 100 | shares > 100) warning('shares, "s", do not sum to 100')
d <- x[ ,s]
if(!is.numeric(d)) {
stop('"s" must be numeric vector\n',
'You have provided an object of class: ', class(d)[1])
}
if (any(d < 0)) {
stop('vector "s" must contain only positive values')
}
hhi <- sum(d^2)
return(hhi)
}
```
With the function and its parameters defined, we can create some fake "firm" data, as well as the share of the market each firm retains, and then calculate the competitiveness (or concentration) of this hypothetical market.
```{r}
a <- c(1,2,3,4) # firm id
b <- c(20,30,40,10) # market share of each firm (totaling 100%)
x <- data.frame(a,b) # create data frame
hhi(x, "b")
```
On your own: I encourage you to alter the stored values and test out some of the function's warning and error messages. Then, feel free to go back and change or even add new error or warning messages (e.g., allow for a tibble *and* a data frame, perhaps).
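As a hedged starting point, here are a couple of example calls (reusing the fake data above) that should trip the messages we built in:
```{r}
# shares that do not sum to 100 should trigger the warning
b_bad <- c(20, 30, 40, 20)
x_bad <- data.frame(a, b_bad)
hhi(x_bad, "b_bad")
# passing something other than a data frame should trigger the stop() error
# hhi(b, "b")
```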
## Data Wrangling
Now, we are going to cover some core concepts for working with data in R.
Data "munging" or "wrangling" is the process of getting your data into the form you need for analysis (i.e., data management). Even though most of the datasets we use in this course will be cleaned given the large amount of ground we will cover in a relatively short period of time, these wrangling and functional programming techniques should still help in your own work.
Let's now work with some social science data, the [2016 American National Election Study](https://electionstudies.org/data-center/anes-2016-pilot-study/) (ANES or NES) pilot data. It has many features, including a battery of feeling thermometers (i.e., preferences on a scale from 0-100), which we will cover more later. Start by using the `here` package to load the data.
```{R eval=FALSE, echo=TRUE, message=FALSE, warning=FALSE}
library(here)
library(tidyverse)
NESdta <- read_csv(here("data", "anes_2016.csv"))
```
Most of the functions we will cover are from the `dplyr` package, which is one of the main components of the tidyverse, in line with the tidy philosophy of *consistent syntax*. We will cover the following functions:
* `select()`: choose specific variables
* `filter()`: filter data by values
* `group_by()`: group data by categorical values
* `summarise()`: create summaries of variables and data
* `mutate()`: create new variables
* `replace()`: change particular values within a variable
As with almost any dataset, the NES data has many more features than we could ever really plan on using in a single analysis. So, we might want to limit our set to include those of greatest interest. Using `select()`, let's create a new object called `NESdta_short`, which includes a subset of key variables/features.
```{R eval=FALSE, echo=TRUE, message=FALSE, warning=FALSE}
# Select particular variables
NESdta_short <- NESdta %>%
select(fttrump, pid3, birthyr, gender, ftobama, state, pid7)
head(NESdta_short,
n = 5)
```
This creates a new tibble (a tidyverse data frame) that only includes seven variables: `fttrump` - a "feeling thermometer" where people rate their feelings of then primary candidate Donald Trump from 0 to 100; `pid3` - a three point rating of political identity, where 1 means Democrat, 2 means Independent, and 3 means Republican; `birthyr` - the respondent's year of birth, which we will use to establish their age; the respondent's `gender`, which is 1 if male and 2 if female; the feeling thermometer for then-President Barack Obama (`ftobama`), again from 0 to 100; `state` - Census code for US states; and `pid7` - party ID on a seven point scale (e.g., "lean Republican").
`select()` can be combined with helper functions like `starts_with()`, `ends_with()`, and `contains()` to select more than one variable at a time. We can also use `:` to select a range of consecutive variables; for example, since all of the feeling thermometers start with the prefix `ft`, we could simply write the following.
```{r}
# Select using starts_with()
NESdta %>%
select(starts_with("ft"))
# Or...
# Selecting using a range (if we knew the natural order)
NESdta %>%
select(ftobama:ftsci)
# Or multiple features at a time
NESdta %>%
select(pid3, birthyr, gender, starts_with("ft"))
# Or remove a feature
NESdta_short <- NESdta_short %>%
select(-pid7)
NESdta_short
```
Next, `filter()` works similar to `select()`, but instead of selecting columns by their _names_, `filter()` allows you to select rows by their _values_. If, for example, we wanted to find only the respondents who gave Donald Trump the highest possible rating, we could do this and a bunch of other layered expressions using `filter()`.
```{r}
# Select only those respondents who give Trump a 100
NESdta_short %>%
filter(fttrump == 100)
# Or select only those respondents who give Trump a 100 and Obama a 1
NESdta_short %>%
filter(fttrump == 100 & ftobama == 1)
# Or select only those respondents who give Trump a 100 or give Obama a 100
NESdta_short %>%
filter(fttrump == 100 | ftobama == 100)
# Or combine and select only those respondents who give both Trump AND Obama 100s or 1s.
NESdta_short %>%
filter((fttrump == 100 & ftobama == 100) | (fttrump == 1 & ftobama == 1))
# Or select only those respondents who give Trump an approval rating greater than 50
NESdta_short %>%
filter(fttrump > 50)
# Or select those with higher than average approval of Trump
NESdta_short %>%
filter(fttrump > mean(fttrump, na.rm = TRUE))
```
You may have noticed the symbol in a few of the previous commands or in other work/classes, `%>%`. For those who might be unfamiliar, this is called a **pipe**, which is read as "then," and it was originally developed by Stefan Milton Bache for the R package `magrittr`. The pipe allows you to declare at the beginning the data on which you want to work and stack/pass a number of operations onto that data without having to declare the data you want to use each time you issue a command. This has many other benefits, but chiefly it allows your code to be more consistent and *modular*. This will come in handy for tons of stuff like merging code, fitting a model, visualizing results and so on, all in a single chunk of code.
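To make the contrast concrete, here is a small sketch of the same operation (an arbitrary cutoff of 50, as above) written first with nested calls and then with the pipe:
```{r}
# without the pipe: read inside-out
summarise(filter(NESdta_short, fttrump > 50),
          avg_fttrump = mean(fttrump, na.rm = TRUE))
# with the pipe: read top-to-bottom, "then"
NESdta_short %>%
  filter(fttrump > 50) %>%
  summarise(avg_fttrump = mean(fttrump, na.rm = TRUE))
```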
For example, to show how the pipe allows you to stack commands, let's look at the next two key munging functions: `group_by()` and `summarise()`.
Suppose we think Republicans will be most positively disposed to Donald Trump as a candidate, followed by Independents, and then by Democrats. We can use the `group_by()` function to tell R what groups we want to make, and follow this with the `summarise()` command to create the needed summaries.
```{r}
NESdta_short %>%
group_by(pid3) %>%
summarise(average_fttrump = mean(fttrump, na.rm = TRUE),
median_fttrump = median(fttrump, na.rm = TRUE),
variance_fttrump = var(fttrump, na.rm = TRUE),
stddev_fttrump = sd(fttrump, na.rm = TRUE),
max_fttrump = max(fttrump, na.rm = TRUE),
min_fttrump = min(fttrump, na.rm = TRUE),
n_fttrump = n()) %>%
ungroup()
```
Note that since we are done with these groups, we include `ungroup()` at the end to ensure that the grouping does not persist past these commands.
The `group_by()` command is also useful in a number of other conditions. Suppose we are interested in finding the respondents of each gender who give Trump a higher-than-average score relative to other people of the same gender. In this case, we can group respondents by gender and then `filter()` to respondents who rate higher than the average in *each* group.
```{r}
# Filter data to only include those who give Trump higher than average approval by gender.
NESdta_short %>%
group_by(gender) %>%
filter(fttrump > mean(fttrump, na.rm = T))
# Or, summarizing mean approval of Trump by party and gender
NESdta_short %>%
group_by(pid3, gender) %>%
summarise(average_fttrump = mean(fttrump, na.rm = TRUE),
n_fttrump = n()) %>%
ungroup()
```
Another task that you will often find yourself doing is creating new features. The tidy approach for this is `mutate()`. Start with a simple transformation. `birthyr` does not directly represent the concept we really want, *age*. To get `age`, we need to create a new variable that calculates the age of a respondent by subtracting their birth year from the year of the survey.
```{r}
# Create a new variable giving the respondent's age
NESdta_short <- NESdta_short %>%
mutate(age = 2016 - birthyr)
# Or create a new variable giving the square of respondent's age
NESdta_short <- NESdta_short %>%
mutate(age2 = age^2)
# Now, get rid of it
NESdta_short <- NESdta_short %>%
mutate(age2 = NULL)
```
We just summarized support for then-candidate Donald Trump by party affiliation. But what if we want these summaries to be a part of the `NESdta_short` dataset? `mutate()` helps out with this too, along with redefining our data object, `NESdta_short`.
```{r}
# Creating a new variable using group_by() and mutate()
NESdta_short <- NESdta_short %>%
group_by(pid3) %>%
mutate(average_fttrump = mean(fttrump, na.rm = TRUE)) %>%
ungroup()
# Or calculate deviation from the mean
NESdta_short <- NESdta_short %>%
mutate(deviation_fttrump = fttrump - average_fttrump)
```
When you inspect the output, note that while the feeling thermometers for Donald Trump and Barack Obama are only supposed to go from 0 to 100, the summary statistics said the maximum values were `998`. Many datasets try not to leave blank spaces or mix strings and numeric values. So, instead, they represent missing values by highly improbable numeric values, e.g., `998` or `-999`. We need to change these to `NA` instead, which we will handle last in today's class when we get to missing data and feature engineering.
To do so, we will combine `mutate()` with `replace()`. `replace()` takes three values as input: the first is the variable on which we are making the replacement, the second is a logical test, and the third is the replacement value. The logical test can be read as, "Where the variable is..." For example, the second part of the first replacement asks it to make the replacement where the variable `fttrump` is greater than 100, and the third part supplies `NA` as the replacement. As you can see, within the `mutate()` function, we have asked for our original variable to be equal to the specified replacement (i.e., we have _redefined_ the original variable to drop these nonsensical values).
```{r}
# Using replace() to recode values
NESdta_short <- NESdta_short %>%
mutate(fttrump = replace(fttrump, fttrump > 100, NA),
ftobama = replace(ftobama, ftobama == 998, NA))
glimpse(NESdta_short)
skimr::skim(NESdta_short) # tidy summary
```
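As a quick check that the recode worked (a minimal sketch; nothing about it is required), the maximum values should now be back on the 0-100 scale:
```{r}
NESdta_short %>%
  summarise(max_fttrump = max(fttrump, na.rm = TRUE),
            max_ftobama = max(ftobama, na.rm = TRUE),
            n_missing_fttrump = sum(is.na(fttrump)))
```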
Another variable we will likely want to change is the `state` variable. Right now, it is using Census codes that represent the states, but we will probably want strings with the state names as well. We can look up the numbers associated with each state in the ANES (which I have done for you... *you're welcome*) and create a new variable called `state_name` that contains the character name of the state. To do so, we will use `case_when()`, which allows us to change a large number of values within a variable.
```{r}
# Create a new variable called state_name with the string names of states
NESdta_short <- NESdta_short %>%
mutate(state_name = case_when(state == 1~"Alabama",
state == 2~"Alaska",
state == 4~"Arizona",
state == 5~"Arkansas",
state == 6~"California",
state == 8~"Colorado",
state == 9~"Connecticut",
state == 10~"Delaware",
state == 11~"District of Columbia",
state == 12~"Florida",
state == 13~"Georgia",
state == 15~"Hawaii",
state == 16~"Idaho",
state == 17~"Illinois",
state == 18~"Indiana",
state == 19~"Iowa",
state == 20~"Kansas",
state == 21~"Kentucky",
state == 22~"Louisiana",
state == 23~"Maine",
state == 24~"Maryland",
state == 25~"Massachusetts",
state == 26~"Michigan",
state == 27~"Minnesota",
state == 28~"Mississippi",
state == 29~"Missouri",
state == 30~"Montana",
state == 31~"Nebraska",
state == 32~"Nevada",
state == 33~"New Hampshire",
state == 34~"New Jersey",
state == 35~"New Mexico",
state == 36~"New York",
state == 37~"North Carolina",
state == 38~"North Dakota",
state == 39~"Ohio",
state == 40~"Oklahoma",
state == 41~"Oregon",
state == 42~"Pennsylvania",
state == 44~"Rhode Island",
state == 45~"South Carolina",
state == 46~"South Dakota",
state == 47~"Tennessee",
state == 48~"Texas",
state == 49~"Utah",
state == 50~"Vermont",
state == 51~"Virginia",
state == 53~"Washington",
state == 54~"West Virginia",
state == 55~"Wisconsin",
state == 56~"Wyoming"))
```
A final note: you might have noticed the double equal sign, `==`, which is used for comparison. It returns a value of `TRUE` if the item on the left-hand side is equal to the item on the right-hand side, and `FALSE` otherwise. A type of logical command you will find yourself using a lot is `ifelse(condition, outcome if true, outcome if false)`. For example, re: the `gender` variable, suppose we want to recode gender (currently 2 = female and 1 = male) to be more descriptive and also on the more common `{0,1}` scale. Using `mutate()` and `ifelse()` (from base R), we create a new `female` feature.
```{r}
NESdta_short <- NESdta_short %>%
mutate(female = ifelse(gender == 2, 1, 0))
sample_n(NESdta_short,
size = 5)
```
## Key Functions for Modeling
Here are some key functions and concepts that especially come in handy for modeling tasks. Start with visualization: to bring this into our framework, and to keep everyone on the same page, `ggplot2` is going to be your best friend, allowing for excellent, flexible visualizations. For example, we can take a look at a histogram of feelings toward Obama (`ftobama`).
```{R eval=FALSE, echo=TRUE, warning=FALSE}
NESdta_short %>%
ggplot() +
geom_histogram(aes(x = ftobama),
binwidth = 2,
color = "white",
fill = "steelblue") +
theme_minimal() +
labs(x = "Feelings toward Obama (0-100)",
y = "Count of Respondents",
title = "Feelings toward Obama in 2016")
```
Next is the ability to map or iteratively pass functions to values stored in vectors. We will use the `purrr` package for this if we are interested in bypassing the use of loops.
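As a tiny warm-up with made-up values, `map()` applies a function to each element of a vector (or list) and returns a list:
```{r}
library(purrr) # also attached via the tidyverse
# apply sqrt() to each element of a small vector; the result is a list
map(c(1, 4, 9), sqrt)
```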
```{r}
# first create a narrower subset
NESdta_small <- NESdta_short %>%
select(female, fttrump, birthyr)
# inspect a small random sample
sample_n(NESdta_small, 5)
```
With our smaller subset created, let's now `split()` the data into groups, to more easily demonstrate `map()`.
```{r}
fems <- split(NESdta_small,
NESdta_small$female)
fems
# now map the base R function `head()` to the split object to inspect the first 2 observations in each group
map(fems, head, n = 2)
# Or, explore how many rows in each group
map(fems, nrow)
```
Importantly, the basic `map()` function will always return a list, which can hold elements of any type. If you are working with a specific type of data, or want the map to return a specific type, there is a series of more specific map functions, `map_*()`, that can be used. For example, the raw count of rows is an integer, so we could use `map_int()` to get the same result while ensuring it is returned as an integer vector rather than a list.
```{r}
map_int(fems, nrow)
# double check to make sure the returned value is an integer, not a list
is.integer(map_int(fems, nrow))
is.integer(map(fems, nrow))
```
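In the same spirit, and as a sketch using the `fems` split from above, `map_dbl()` returns a double vector, which is handy for group summaries like means:
```{r}
# mean Trump thermometer score within each split group
map_dbl(fems, ~ mean(.x$fttrump, na.rm = TRUE))
```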
Now for a few extensions of `map()`, which will come in handy in the coming days.
Rather than using `split()` from base R, the tidyverse version of this is to "nest" via `nest()` or `unnest()`. We might be interested in using `nest()` by female to return a data frame into a so-called "list-column" (thanks to Jenny Bryan; see the package documentation for `purrr` for more), which can host multiple data types in a single vector. Basically, this means we can nest an entire data frame, e.g., within a row or column.
First, create a new object, `new_nes`, with tibbles for each level of `female`. In substantive terms, the syntax means that we want one row for each level of `female` (two rows total), with a data frame/tibble of the other two features in the `NESdta_small` data set nested within each row.
```{r}
(new_nes <- nest(NESdta_small, -female))
```
We can complicate this by adding a new feature (or many features if you want): number of rows (`n_row`) for each level. To reverse, we can simply `unnest()`.
```{r}
new_nes %>%
mutate(n_row = map_int(data, nrow))
# Now, reverse via unnest()
unnest(new_nes, data)
```
Finally, let's take a look at some key functions for splitting data into training and testing sets. There are many ways to do this. We will cover a few: `tidymodels` (via `rsample`), `caret`, and base R.
First, `tidymodels`.
```{r}
library(tidymodels)
library(tidyverse)
set.seed(1234)
split_tidy <- initial_split(NESdta_short,
prop = 0.8) # default is 75%
# take a look...
split_tidy
# create
train_tidy <- training(split_tidy)
test_tidy <- testing(split_tidy)
# double check
nrow(train_tidy)/nrow(NESdta_short)
# viz
tidy <- train_tidy %>%
ggplot(aes(x = fttrump)) +
geom_line(stat = "density") +
geom_line(data = test_tidy,
stat = "density",
col = "red") +
labs(title = "Via Tidymodels") +
theme_minimal()
```
Now, let's try with `caret`.
```{r}
library(caret)
set.seed(1234)
# need to drop NAs for caret (also, can't read tibbles when indexing)
NESdta_short <- NESdta_short %>%
na.omit() %>%
as.data.frame()
split_caret <- createDataPartition(NESdta_short$fttrump,
p = 0.8,
list = FALSE)
# then use basic index notation for creating the splits
train_caret <- NESdta_short[split_caret, ]
test_caret <- NESdta_short[-split_caret, ]
# viz again
caret <- train_caret %>%
ggplot(aes(x = fttrump)) +
geom_line(stat = "density") +
geom_line(data = test_caret,
stat = "density",
col = "red") +
labs(title = "Via Caret") +
theme_minimal()
```
Finally, let's see base R.
```{r}
set.seed(1234)
split_baseR <- sample(nrow(NESdta_short),
size = 0.8*nrow(NESdta_short))
train_baseR <- NESdta_short[split_baseR, ]
test_baseR <- NESdta_short[-split_baseR, ]
# viz again
base_r <- train_baseR %>%
ggplot(aes(x = fttrump)) +
geom_line(stat = "density") +
geom_line(data = test_baseR,
stat = "density",
col = "red") +
labs(title = "Via Base R") +
theme_minimal()
```
Now, we can view side by side. Should look pretty similar, though not exact given that we are randomly defining these sets.
```{r}
library(patchwork)
(tidy + caret + base_r)
```
## Simulations
A key aspect of working with data is the use of simulations, which are, in a sense, more explicit versions of the mapping we covered previously.
Simulations are incredibly powerful tools allowing for creation of virtually any context you wish, e.g.,
- testing theories under alternate constraints
- distributional assumptions
- real vs. predictions/forecasts, etc.
Let's start with the `do()` approach and a hypothetical baseball question and then move to the more widely used and flexible `for` loop.
Many factors have been said to affect a baseball player's performance at the plate (i.e., the number of hits a player gets). Throughout a single season consisting of 6 months, most players will range from doing well to doing poorly at the plate, with few players ever performing at a high level for the *duration* of such a long season. The reality of the extreme difficulty of hitting baseballs that travel around 95 miles per hour across only 60 feet and 6 inches (the distance from the pitcher's mound to home plate) makes a lot of people interested in forecasting future hits, or how players will perform over the course of the season.
> Thus, the problem becomes: *can we predict monthly hits for the average player in the context of an entire season?*
Despite the many models out there that assert they have figured out how to accurately forecast this quantity of interest, it turns out that player hits probably look more like random draws from a Gaussian distribution, with a mean number of hits per season for the average player around 165, a monthly average around 28 hits, and a wide standard deviation of about 12, given the extreme difficulty of hitting baseballs as well as the many other factors that go into player performance. In short, average player hits per month are *incredibly hard to forecast*, as they are virtually random and also vary widely.
To make the point that you basically can't predict player hits with any degree of reliability, we will use a simulation to produce several hypothetical seasons' worth of hits for the average player, given these known truths (again, *this is hypothetical in the specifics, but probably mirrors the reality of forecasting baseball hits*). Specifically, we will use the information we have -- mean player hits, hits standard deviation, and a Gaussian data generating process -- to make predictions about the future.
The punch line: **the predicted hits vary so widely, it's virtually impossible to predict player hits with any degree of reliability or certainty**. But let's see this in action.
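Before the full machinery, here is a quick intuition check, a minimal sketch using the stated monthly mean (about 28 hits) and standard deviation (about 12): one hypothetical season is just six monthly draws from a normal distribution.
```{r}
set.seed(123)
# six months of hits for one hypothetical "average" season
round(rnorm(6, mean = 28, sd = 12))
```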
Quick caveat: `do()` is on its way out, as it basically maps functions to other values, so it is not too commonly used anymore. But it's useful for showing a tidy-friendly approach to simulation, which will hopefully help `for` loops make more sense.
```{R eval=TRUE, echo=TRUE, warning = FALSE, message = FALSE}
library(tidyverse)
# constants
hits_true <- 28 # ~165/season
hits_se <- 12 # "true" standard deviation
n_sims <- 5 # number of times to simulate
# function for calculating a season's worth (6 months) of monthly hit samples based on true values above
baseball_fun <- function(true_mean, true_sd, months = 6, mu = 0, Iteration = 1) { # takes 5 inputs
season <- rep(true_mean, months) + rnorm(months, # simulate draws from a normal dist
mean = mu * (1:months), # allows means to vary around the true population mean + some randomness
sd = true_sd) # sets variance of each simulated season
return(data.frame(hits_number = season, # df to store our simulated values
month = as.factor(1:months),
Iteration = Iteration))
}
# now create a df for parameters for sims (variance around mean = sd; and sim number)
#
baseball <- data.frame(sd = c(0, rep(hits_se, n_sims)), # allows us to apply the SD to each iteration of the season --> 0 for "true" because we know that value already and don't want to sample it
Iteration = c("True (Average) Season", paste("Simulated Season", 1:n_sims)))
# simulate
set.seed(123)
baseball_simulation <- baseball %>%
group_by(Iteration) %>%
do(baseball_fun(true_mean = hits_true,
true_sd = .$sd, # this and iteration come from placeholder dataframe (hence the pipe)
Iteration = .$Iteration))
# viz
baseball_simulation %>%
ggplot(aes(x = month,
y = hits_number,
color = Iteration, # "season"
fill = Iteration)) +
geom_hline(yintercept = hits_true,
linetype = 2) +
geom_bar(stat = "identity") +
facet_wrap(~ Iteration) +
labs(y = "Count of Hits",
x = "Month") +
theme_minimal()
```
Sure enough, simulating only 5 seasons based on the ground truth data we started with, there seem to be *no clear patterns*, suggesting that player hits are pretty random and tough to predict.
Now let's give simulations a try with `for` loops.
Suppose we drew a random sample of 50 respondents' self-reported political ideologies on a 7 point scale, where 1 was extremely liberal and 7 was extremely conservative. The mean of that sample was `3.32`, suggesting the average person in this sample sees herself as generally moderate, or in the middle of the distribution of political ideology.
Here is the code setting this up:
```{r}
sample_ideology <- c(3, 1, 2, 4, 4, 6, 1, 3, 2, 6,
1, 7, 3, 1, 4, 3, 4, 4, 1, 6,
7, 5, 7, 1, 1, 3, 2, 4, 1, 7,
1, 2, 1, 4, 6, 3, 2, 3, 1, 4,
1, 6, 3, 4, 5, 4, 1, 7, 2, 2)
mean(sample_ideology)
```
Now, suppose we wanted to simulate the reported sample ideology to see whether this random sample was reflective of the broader American population. However, we aren't exactly sure how many times to do this such that it will accurately reflect the population of interest.
To get traction on this question, the central limit theorem and law of large numbers can help us out. To see how the *shape* of the distribution changes over many samples (central limit theorem) and how the *location* of these distributions changes (law of large numbers), we can use a series of `for` loops, and **plot the different distributions to see when and where (and whether) the samples converge on the underlying population**.
`for` loops allow for iterating some calculation or function over many different observations. This is a simulation.
The syntax of `for` loops is similar to functions, where they begin with "for" and then start with some value in a sequence in parentheses. Then, within the braces, there is a statement to be evaluated.
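Here is a minimal sketch of that syntax (just squaring a few numbers) before we scale things up:
```{r}
for (i in 1:3) {
  print(i^2) # the statement inside the braces runs once per value of i
}
```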
For each chunk below, we start by creating an empty vector in which to store our simulated values (e.g., `sm1`). We then loop sampling with replacement based on the initially-drawn sample (`sample_ideology`), and then take the mean of that iteration and store the mean in the empty vector, `sm*`. We will then plot each simulation and compare side by side via another tidyverse-friendly package, `patchwork`.
```{r}
library(tidyverse)
# N = 5
sm1 <- rep(NA, 5)
for(i in 1:5){
samp <- sample(sample_ideology, 30, replace = TRUE)
sm1[i] <- mean(samp)
}
# N = 20
sm2 <- rep(NA, 20)
for(i in 1:20){
samp <- sample(sample_ideology, 30, replace = TRUE)
sm2[i] <- mean(samp)
}
# N = 50
sm3 <- rep(NA, 50)
for(i in 1:50){
samp <- sample(sample_ideology, 30, replace = TRUE)
sm3[i] <- mean(samp)
}
# N = 100
sm4 <- rep(NA, 100)
for(i in 1:100){
samp <- sample(sample_ideology, 30, replace = TRUE)
sm4[i] <- mean(samp)
}
# N = 500
sm5 <- rep(NA, 500)
for(i in 1:500){
samp <- sample(sample_ideology, 30, replace = TRUE)
sm5[i] <- mean(samp)
}
# N = 1500
sm6 <- rep(NA, 1500)
for(i in 1:1500){
samp <- sample(sample_ideology, 30, replace = TRUE)
sm6[i] <- mean(samp)
}
# N = 3500
sm7 <- rep(NA, 3500)
for(i in 1:3500){
samp <- sample(sample_ideology, 30, replace = TRUE)
sm7[i] <- mean(samp)
}
# N = 7000
sm8 <- rep(NA, 7000)
for(i in 1:7000){
samp <- sample(sample_ideology, 30, replace = TRUE)
sm8[i] <- mean(samp)
}
# Now plot each simulation
library(patchwork)
p1 <- quickplot(sm1, geom = "histogram", main = "N=5", bins = 30) +
theme_minimal() +
geom_vline(xintercept = 3.32, linetype="dashed", color = "red")
p2 <- quickplot(sm2, geom = "histogram", main = "N=20", bins = 30) +
theme_minimal() +
geom_vline(xintercept = 3.32, linetype="dashed", color = "red")
p3 <- quickplot(sm3, geom = "histogram", main = "N=50", bins = 30) +
theme_minimal() +
geom_vline(xintercept = 3.32, linetype="dashed", color = "red")
p4 <- quickplot(sm4, geom = "histogram", main = "N=100", bins = 30) +
theme_minimal() +
geom_vline(xintercept = 3.32, linetype="dashed", color = "red")
p5 <- quickplot(sm5, geom = "histogram", main = "N=500", bins = 30) +
theme_minimal() +
geom_vline(xintercept = 3.32, linetype="dashed", color = "red")
p6 <- quickplot(sm6, geom = "histogram", main = "N=1500", bins = 30) +
theme_minimal() +
geom_vline(xintercept = 3.32, linetype="dashed", color = "red")
p7 <- quickplot(sm7, geom = "histogram", main = "N=3500", bins = 30) +
theme_minimal() +
geom_vline(xintercept = 3.32, linetype="dashed", color = "red")
p8 <- quickplot(sm8, geom = "histogram", main = "N=7000", bins = 30) +
theme_minimal() +
geom_vline(xintercept = 3.32, linetype="dashed", color = "red")
# piece ggplot objects together with the patchwork package
p1 +
p2 +
p3 +
p4 +
p5 +
p6 +
p7 +
p8
```
## R markdown (`.Rmd`)
Now that we know how to do some basic tasks in R, let's talk about R markdown a bit. Basically, R markdown is a document format and generator that allows for well-formatted documents with code and output directly inserted. This document was created with R markdown. Perhaps the biggest value of R markdown is replication: the document will not compile if the inserted code has errors or is missing something, even something as small as a single comma in an expression. So while often frustrating, R markdown (like other notebook and literate-programming tools, e.g., Jupyter Notebooks) ensures that whatever is written in the document can be successfully run and re-run and re-run...
And you can (and should) use R markdown from *within* R Studio. To start a new `.Rmd` file, first select the icon for creating a new document in the upper left portion of the screen. Then, in the drop-down menu, select `R Markdown...`. It will populate the editor pane with a template including some basic sample code. And you're off!
Once you have finished your document and would like to format, simply select `Knit` at the top menu bar in R Studio. You can either select the type of output document (PDF, HTML, etc.) or prespecify in the metadata at the outset of the document, e.g., `output: html_document` or `output: pdf_document` or `output: word_document`, and so on.
**You will be required to submit assignments in this class via some replicable format**. The easiest way is an R markdown document with code and evaluations directly inserted. We will get to this more throughout the Quarter.
### Exploring R Markdown
Though there is much more to R markdown syntax, we will cover just a few basics here.
First, formatting text. Though less intuitive than Microsoft Word, at least formatting in R markdown is more straightforward than formatting text in LaTeX. To italicize, either use `*` or `_` before and after the *section to be emphasized*. For bold, use `**` before and **after**. For footnotes, use `^` with the following text in brackets, like^[this]. And to strike through, use `~~` before and after the text to ~~strike through~~.
Second, ordered and unordered lists, and block quotes. Lists are simple. For unordered, use either `*` or `-` and then a space, and then the text (if no space, it will think you mean to italicize). For ordered lists, simply write the number followed by a period and a space and then start writing. For nested lists, indent once (or twice) on the next line and use `+` followed by a space.
An unordered list:
* item 1
+ sub
+ sub
* item 2
* item 3
+ sub
And an ordered list:
1. item 1
+ sub
+ sub sub
+ sub
2. item 2
3. item 3
+ sub
You can also include block quotes within text using `>`. For example, here is a line of text that is in my document. But now we should break to emphasize a great quote:
> Here is the great quote. Notice also that when you start a block quote in R markdown, the text appears green in the RStudio editor.
Third, code chunks. R markdown allows you to directly insert (and thus evaluate) code via "code chunks", both in R as well as a variety of other programming languages into the document. To start a chunk, type ``` and then the name of the language in braces, e.g., R. First, of course, we have R,
```{R echo = FALSE, eval = FALSE}
print("Hello, World!")
```
You can also insert bash (shell/UNIX) code,
```{bash echo = FALSE, eval = FALSE}
echo "Hello, world!"
```
Or Python,
```{python echo = FALSE, eval = FALSE}
print('Hello, world!') # same as R
```
Or C++,
```{Rcpp echo = FALSE, eval = FALSE}
#include <iostream>
using namespace std;
int main()
{
cout << "Hello, World!";
return 0;
}
```
And so on...
Finally, equations, links, and tables. In line equations are denoted by `$` like in LaTeX, as TeX is also supported by R markdown. And also like TeX, use `\` to call mathematical symbols or Greek letters.^[Here is a great resource for mathematical operators, Greek letters, etc. in R markdown: <https://www.calvin.edu/~rpruim/courses/s341/S17/from-class/MathinRmd.html>] For example, `$\sum_{i=1}^{n} \lambda (x_i-\bar{x})$` gives you $\sum_{i=1}^{n} \lambda (x_i-\bar{x})$. Use double `$$` to separate equations from text and center them, e.g., $$\sum_{i=1}^{n} \lambda (x_i-\bar{x})$$
To link to external sites, put the linked text in brackets, `[]`, followed immediately by the url in parentheses `()`. So, e.g., `[R Studio](https://www.rstudio.com/)` produces [R Studio](https://www.rstudio.com/).
And for tables, simply use `-----` for rows and `|` for columns. For example,
here | is | a | table | with | seven | columns
-----|----|----|-----|----|----|----
and | it | also | has | three | rows | ...
here | is | the | third | row | ...| ...
## Feature Engineering for Missing Data
The concept of *feature engineering*, though fancy sounding, really just means to create or transform features for some reason or future task. The **reason** could be overcoming limitations in data, and the **future task** could be for a modeling requirement. Learning how to approach contexts necessitating feature engineering at some level requires intimately knowing and thinking carefully about the structure of your data. These skills will serve you well in *any* computational modeling context.
More specifically for our purposes, given the many contexts requiring unique feature engineering solutions, we will only scratch the surface here by covering feature engineering in the context of missing data. This is meant to get you started down the right path when you encounter one of the most common and most frustrating issues in working with social science data. Even still, I strongly encourage you to think about thorough data exploration and feature engineering from the start of *any* future project.
With that, and to conclude today's content, we will cover the following couple of topics. Some of this discussion was adapted from the excellent Kuhn and Johnson 2019 [book](http://www.feat.engineering/) on feature engineering and selection, which you were to skim for today's class. I **strongly** encourage you to dive deeper into that book and the supporting code. Both will serve you as excellent resources for a long time.
* Diagnosing & Working with Missing Data
* Essential Feature Engineering for Missing Data
If left undetected or unaddressed, missing data can plague any project. And in the social sciences, this is an extremely common occurrence as we are usually dealing with humans, surveys, and other data structures produced by flawed people or institutions. Compared to other realms of data production like computational statistics/simulated data that often result in complete datasets, the social sciences must anticipate that at least some of the data they collect or use will be missing or flawed in some important way.
Yet, though frustrating, missing data is certainly *not* an insurmountable obstacle.
We will cover the basics of how to diagnose and address missing data, as a common step prior to the "engineering" part of feature engineering. Although missing data and feature engineering are often addressed together as we are doing today, these two don't necessarily have to go hand-in-hand. For example, we may detect missing data, but leave it unaddressed for some project-specific reason. Or alternatively, we may be interested in engineering new features from data, but entirely separate from the question of missingness. Still, researchers most commonly correct for missing data using feature engineering techniques. So we will progress accordingly.
We will first discuss common types of missing data, and thus how best to think about missingness. This step will help with developing steps needed to address the issue. Then, with the structure of missingness in mind, we will proceed to exploring (mostly visually) patterns of missingness. Intimately tied with these two steps, we will conclude with discussion and application of some common solutions to address missing data.
**Structure**
In thinking about the structure of missingness in data, we are really interested in uncovering the *cause* behind the missingness. This revelation will allow for more targeted feature engineering solutions. Some of the more common causes of missingness in data include:
1. Recording errors
2. Unique instances
3. Missing at random; *See more discussion of the nuance of types of missing data at random in the Little and Rubin (2019) paper*
The first cause of missingness has to do with errors in measurements of the original data collection effort. This can include structural error on the part of the data collectors, or it could be due to something as simple and clerical as data that got lost during the translation of field records, e.g., digitization. Regardless of how this happened, a fix for this type of missingness is entirely dependent on the scope. Solutions could range from going back out in the field to correct the errors, or something like encoding categories (more in a bit). Either way, it's important to first *inspect* the data to understand whether this type of missingness is likely or not.
The second cause of missingness is case-based, where unique observations are missing recorded values, but at a smaller scale than the previous cause. The difference in scope will translate to a difference in solutions, as we will see throughout this course. This is what Little and Rubin (2019) call NMAR ("not missing at random"; descriptive and creative to be sure...). These are often the most difficult cases to address, because there is some structural, underlying phenomenon that has led to the data being missing in some detectable pattern. Simple imputation methods may or may not suffice. Even if the values could be imputed, there is no guarantee that they reflect any substance about the individual observation, because that value was, at some level, *deterministically* missing. In such cases, documenting the pattern and how you handled it is honest reporting of your process, and also acts as a signal that you are ethically and carefully thinking about the underlying data generating process (as opposed to flippantly manipulating data, assuming all is well, when perhaps it is not).
The final major cause of missing data is data that are missing at random. This is often (and generally) considered the easiest type of missingness to correct for, mostly because there is no underlying driver behind the missingness. In the missing *completely* at random version, the probability of being missing is the same for all observations in the set. This is compared to the partially missing at random version, in which the pattern of missingness depends on the sampled (and thus observed) data, but not on the unobserved data from the underlying population.
In short, always think very carefully about your data, and *never ignore missing cases*. If you do ultimately decide to leave missing data as is in the full set, be prepared to report on your process of reaching this decision.
**Exploring missingness**
Supposing we have carefully thought about what could be driving these patterns at a more abstract level, we would now be interested in exploring and inspecting our data to see exactly how these patterns take shape.
As a first step, I strongly recommend getting right to the point and inspecting where missing values are concentrated. To do so, I really like the `naniar` package, authored and contributed to by some excellent programmers and statisticians (e.g., Hadley Wickham, Di Cook, Nick Tierney, and so on). The reason I use this package as a starting place for much of my work is captured succinctly in the package description:
> `naniar` is a package to make it easier to summarise and handle missing values in R. It strives to do this in a way that is as consistent with tidyverse principles as possible.
Let's walk through some of the basics of exploration via `naniar`. To do so, we will use the `congress` data, which includes legislator-level features from the 109th and 110th Congresses (i.e., 2005-2009).
```{r}
library(naniar)
library(tidyverse)
library(here)
congress <- read_csv(here("data", "congress_109_110.csv"))
# cumulative missingness over cases/observations
congress %>%
gg_miss_case_cumsum()
# missing data by feature
congress %>%