2022-07-11_day_1_slides.html

<!DOCTYPE html>
<html lang="" xml:lang="">
  <head>
    <title>Intro to Data Wrangling, Exploration, and Analysis with R</title>
    <meta charset="utf-8" />
    <meta name="author" content="Nina Brooks  Assistant Professor, UConn School of Public Policy" />
    <meta name="date" content="2022-07-11" />
    <script src="2022-07-11_day_1_slides_files/header-attrs/header-attrs.js"></script>
    <link href="2022-07-11_day_1_slides_files/remark-css/default.css" rel="stylesheet" />
    <link href="2022-07-11_day_1_slides_files/remark-css/default-fonts.css" rel="stylesheet" />
  </head>
  <body>
    <textarea id="source">
class: center, middle, inverse, title-slide

.title[
# Intro to Data Wrangling, Exploration, and Analysis with R
]
.subtitle[
## Summer 2022 Workshop: Day 1
]
.author[
### Nina Brooks<br> Assistant Professor, UConn School of Public Policy
]
.date[
### July 11, 2022
]

---


# Welcome

My name is [Nina Brooks](https://www.ninarbrooks.com). I am an Assistant Professor in the UConn School of Public Policy and an R enthusiast!

&lt;br&gt;
--
This short course is designed to introduce you to the R programming language and give you a sense of its possibilities for your own projects.

&lt;br&gt;
--
By the end of this 2-day course, you will see how to do the following in R:
--

- Load a variety of different data types
--

- Prepare data for analysis by creating new variables, reshaping data, &amp; identifying duplicates and missing values
--

- Create summary tables
--

- Run regressions
--

- Create data visualizations
--

- And most importantly, where to find help

.footnote[
*I relied heavily on the materials from [Stat545](https://stat545.com/), [Tidyverse Skills](https://jhudatascience.org/tidyversecourse/wrangle-data.html), and [IPUMS PMA Data Analysis Hub](https://tech.popdata.org/pma-data-hub/) for creating this workshop

]
???
Emphasize that this is a bit of a survey course - I'll cover a lot of topics in brief, but won't go into depth in any of them. Also, this is not a statistics class - I will discuss certain topics you may have learned in other courses (hypothesis testing, regression, etc.), but I will not spend any time in this workshop discussing the theory behind them - only demonstrating how to do them in R.

---
name: agenda
# Agenda

.pull-left[## Day 1
1. Intro to R

2. Reading in different types of Data

3. Data manipulation with the [tidyverse](https://www.tidyverse.org/)


]
--
.pull-right[## Day 2
1. Descriptive statistics &amp; nice looking tables

2. Linear Regression &amp; exporting nice looking tables

3. Data Visualization

]

???
Everyone should have access to today's slides, as well as all of the data and code used in the 2-day workshop.

I will also toggle between the slides and doing a live demo in R.

---
name: rintro
layout: true
# R and RStudio

---
--
## Did you do the pre-workshop setup?

- Download and install [R](https://cloud.r-project.org/) or update if you had a previous installation
- Download and install [RStudio](https://www.rstudio.com/) or update if you had a previous installation
- Install the packages sent in the "2022-R-Summer-Workshop-Setup.pdf"

--

## No?
- We won't have time to address R or RStudio installation issues during this workshop
- You can follow along with the slides and my shared screens of my R environment
- Reach out to me afterwards about individual issues, although I can't troubleshoot everyone's individual setup
---

## What is R?
--

R is a system for statistical computation and graphics. It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files.

--

R is **free** and open source, and a large community of users provides support and develops R packages that extend the functions of base R. You can use R to perform statistical analyses of all types, make data visualizations, apply machine learning algorithms, scrape the web, make slides (like these!), prepare documents such as academic manuscripts or books, build [websites](https://www.ninarbrooks.com), and create interactive web applications, among many other things!

--

## What is RStudio?

RStudio is an integrated development environment or IDE for using R. You need to install R first to also use RStudio. I highly recommend using RStudio instead of just R because it provides a powerful and user-friendly interface for interacting with R.

???
this is a pretty complex definition of R - the thing to take away is that it is a programming language with the ability to do statistical analysis and produce graphics (eg visualizations). 

In fact, I will not even show anything in the base R environment.
---
## Add-on packages

There are many user-written packages (that are well maintained and publicly available) to support the functioning of base R. It is easy to install a package directly in the R console:


```r
install.packages("tidyverse", dependencies = TRUE)
```
--

A few comments:
- By including `dependencies = TRUE`, we are being explicit and extra careful to also install any additional packages that the target package (tidyverse in the example above) depends on.
- The name of the package must be enclosed in quotes
- `install.packages` not `install.package` (it's plural)
- Package names are case sensitive: `"tidyverse"` is not the same as `"Tidyverse"` or `"tidyVerse"`
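
---

## Add-on packages

Installing a package happens once per computer, while loading it with `library()` happens in every new R session. One possible pattern, sketched here as an illustration rather than a required step, is to check whether the package is already installed before installing it:

```r
# install tidyverse only if it is not already installed
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse", dependencies = TRUE)
}

# load the package at the start of every R session
library(tidyverse)
```

`requireNamespace()` is a base R function that returns `FALSE` when the package cannot be found, so the install line is skipped on later runs.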

---
layout: false
# Further resources
Here are some links if you are interested in reading a bit further

- [How to Use RStudio](https://support.rstudio.com/hc/en-us)

- [Getting Help with R from RStudio](https://support.rstudio.com/hc/en-us/articles/200552336-Getting-Help-with-R)

- [R FAQ](https://cloud.r-project.org/)

- [R Installation and Administration](https://cloud.r-project.org/doc/manuals/r-release/R-admin.html)

---
# Objects &amp; Data Structures

.pull-left[
R has 6 basic data types:
- character: `"a"`, `"R has 6 basic data types."`
- numeric (real or decimal): `2`, `17.5`
- integer: `2L` (the L tells R to store this as an integer)
- logical: `TRUE`, `FALSE`
- complex: `1+4i`
- raw: stores raw bytes (rarely needed for data analysis)
]
--

.pull-right[
R also has different data structures:
- atomic vector
- list
- matrix
- data frame
- factors
]

???
The data frame is going to be the primary type of data structure most of you will be working with. Data frames refer to tabular/rectangular/spreadsheet-style data. As we'll see a little later today, when you work with the tidyverse, data frames are also stored as "tibbles" - but these operate just like data frames for all intents and purposes.

R has other data structures as well, for example, you can import different types of spatial datasets into R and also text data.
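
---
# Objects &amp; Data Structures

A quick way to check how R stores a value is `typeof()`, and `class()` reports the kind of object. The sketch below is illustrative only; `example_df` is a made-up object, not part of the workshop data.

```r
typeof("a")     # "character"
typeof(17.5)    # "double" (how R stores numeric values)
typeof(2L)      # "integer"
typeof(TRUE)    # "logical"
typeof(1+4i)    # "complex"

# a data frame is a list of equal-length column vectors
example_df &lt;- data.frame(id = 1:3, group = factor(c("a", "b", "a")))
class(example_df)        # "data.frame"
typeof(example_df)       # "list"
class(example_df$group)  # "factor"
```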

---
# R basics

All R statements where you create objects – “assignments” – have this form:

```r
objectName &lt;- value
```
You should **always** use the `&lt;-` operator when assigning objects in R.

--

For example: 

```r
x  &lt;- 3 * 4
*x 
```

```
## [1] 12
```

```r
example &lt;- "This is a string"
*example
```

```
## [1] "This is a string"
```

The highlighted lines of code tell R you want to print out whatever is stored in the object "x" or the object "example".
---

# R Scripts

To make your code reproducible, you should always work from a well-annotated R script. These are saved with the file extension `".R"`, for example, `"2022_summer_workshop.R"`. 

--

It is also common to write code in R Markdown documents, which are saved with the file extension `".Rmd"`. For example, `"2022_summer_workshop.Rmd"`. 

--

The big difference between an R script and an Rmd document is that Rmd documents are typically intended to weave together narrative text and code, whereas R scripts contain only code. If you wanted to write a report based on your analysis, you could include the analysis, writing, and output (e.g. tables, plots, regressions) in a single R Markdown document that is compiled into a clean report.

--

To make a comment in your R code (in either an R script or Rmd document), use the `#` before the text you want to comment. To comment out an entire line of code, put the `#` before the code:

```r
example &lt;- "This is a string" # this line of code demonstrates how to assign a string object

# example2 &lt;- "Hello World." # this entire line is commented out and will not be evaluated
```

---

# Misc important commands
To identify or change your working directory:

```r
# identify current working directory
getwd()

# change working directory
setwd("/Users/nib21006/my_research_project") # path must be contained in quotes!
```

To remove objects in your environment:

```r
rm(example) # removes the object called "example"

rm(list = ls()) # removes everything in the environment
```

Get help from within R on any command (even without an internet connection!):

```r
?getwd
help(getwd)
```

---

# Workflow
I recommend the following workflow:
- Write code in an R script or Rmd document
    - Annotate it well, so your future self understands what you did and why
    - Pro move: use Git or GitHub for version control (you can easily integrate this with RStudio)
    
- Keep an organized working directory with separate folders for raw data, clean data, output (like figures or tables), and scripts
    - The precise structure will depend a bit on your needs
    
- Save your R script/Rmd document! But no need to save your workspace (R will ask you this when you quit). And definitely **never** save an edited or manipulated version of your raw data
    - The ability to reproduce your clean data is exactly why you have your script(s)!
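
---
# Workflow

As a minimal sketch of the folder layout described above, the code below creates a set of project folders inside the current working directory. The folder names are hypothetical; adjust them to your own project.

```r
# hypothetical folder names; adjust to your own project
project_folders &lt;- c("data/raw", "data/clean", "output", "scripts")

for (folder in project_folders) {
  # recursive = TRUE also creates the parent "data" folder when needed
  dir.create(folder, recursive = TRUE, showWarnings = FALSE)
}

# check that each folder now exists
dir.exists(project_folders)
```

Because `showWarnings = FALSE` is set, re-running the script does not complain about folders that already exist.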
---
class: center, inverse, middle

# LET'S SEE SOME R!
???
In R Studio walk through: 

- panels
	1. source (where your R script is) 
	2. console (this is where the output goes) 
	3. environment (what's loaded in R)
	4. "everything else" (files, plots, help &amp; viewer)
- can customize the layout (show this) &amp; change colors
- R script - header, comments
- demonstrate running code up through reading in data

---
name: read_data
layout: true
# Importing Data into R
---

Data are stored in all sorts of different file formats and structures. We’ll go through several common formats and how to get each of them into R so you can start working with them!

--

## Excel files
Microsoft Excel files, which typically have the file extension .xls or .xlsx, store information in a workbook. Excel files can only be viewed in specific pieces of software (like Microsoft Excel), and thus are generally less flexible than many other formats of storing data. 

Additionally, Excel has certain defaults that make working with Excel data difficult outside of Excel. In particular, Excel has a habit of aggressively changing data types: if you type 1/2 to mean 0.5, Excel assumes it is a date and converts it to January 2nd.

.footnote[
*Special thanks to [Tidyverse Skills](https://jhudatascience.org/tidyversecourse/wrangle-data.html) for the content in this section.
]
---

Reading Excel spreadsheets into R is made possible thanks to the `readxl` package. You’ll need to install the package and load it before use.


```r
# install.packages("readxl")
library(readxl)
```

--

You will use the function `read_excel()` to read an Excel file into your R Environment. The only required argument of this function is the path to the Excel file on your computer. In the following example, `read_excel()` would look for the file “filename.xlsx” in your **current working directory.** If the file were located somewhere else on your computer, you would have to provide the path to that file.


```r
# read Excel file into R
excel_df &lt;- read_excel("filename.xlsx")
```
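---

Excel workbooks often contain more than one sheet. As a small sketch, assuming the `readxl` package is installed, the example workbook that ships with `readxl` can be used to list the sheets and read one of them by name. The object names `example_path` and `mtcars_df` are made up for this illustration.

```r
library(readxl)

# path to an example workbook that is bundled with the readxl package
example_path &lt;- readxl_example("datasets.xlsx")

# list the sheets contained in the workbook
excel_sheets(example_path)

# read one specific sheet by name (the first sheet is read by default)
mtcars_df &lt;- read_excel(example_path, sheet = "mtcars")
head(mtcars_df)
```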
---

## Google Sheets files
Many of you probably work in Google Sheets instead of Excel. While you could download data stored in Google Sheets as an Excel file and then import it into R using the `readxl` package, there's a package that allows you to read in a Google Sheets file directly from the Internet, where it lives: `googlesheets4`!

--

Note that if the data hosted on Google Sheets changes, every time the file is read into R, the most updated version of the file will be utilized. This can be very helpful if you’re collecting data over time; however, it could lead to unexpected changes in results if you’re not aware that the data in the Google Sheet is changing.


```r
# install.packages("googlesheets4")
# load package
library(googlesheets4)
```

--

The `googlesheets4` package allows R users to take advantage of the Google Sheets Application Programming Interface (API). Very generally, APIs allow different applications to communicate with one another. In this case, Google has released an API that allows other software to communicate with Google Drive and retrieve data and information directly from Google Sheets.
---

Every time you start a new session, you need to authenticate the use of the `googlesheets4` package with your Google account. 


```r
gs4_auth() # run this to open the Google API authenticator
```

You can use `googlesheets4` to search through the various Google Sheets in your account using `gs4_find()`. Then, you can pass the id that `gs4_find()` lists for your Google Sheet of interest to the `read_sheet()` function. 

```r
gs4_find()

# read Google Sheet into R with id
read_sheet("2cdw-678dSPLfdID__LIt8eEFZPasdebgIGwHk") # note this is a fake id
```

---

You can also navigate to your own sheets or to other people’s sheets using a URL.

```r
# read Google Sheet into R with URL
googlesheet_df &lt;- read_sheet("https://docs.google.com/spreadsheets/d/1FN7VVKzJJyifZFY5POdz_LalGTBYaC4SLB-X9vyDnbY/")

googlesheet_df
```

```
## # A tibble: 5 × 7
##   name   hrs_working hrs_sleeping hrs_fun hrs_eating hrs_socializing hrs_other
##   &lt;chr&gt;        &lt;dbl&gt;        &lt;dbl&gt;   &lt;dbl&gt;      &lt;dbl&gt;           &lt;dbl&gt;     &lt;dbl&gt;
## 1 Damon            9            7       1          1               2         4
## 2 Lilly            7            8       2          1               1         5
## 3 Will             8            8       0          2               2         4
## 4 Aisha            8            6       2          1               4         3
## 5 Hassan           6            9       3          2               2         2
```

---

## CSV files
Like Excel spreadsheets and Google Sheets, comma-separated values (CSV) files allow us to store tabular data. CSV files have a .csv extension and are plain-text files, so there are no workbooks or metadata that could make them difficult to open. 

--

One of the advantages of CSV files is their simplicity. CSVs are flexible files and are thus the preferred storage method for tabular data for many researchers.

--

You can read .csv files directly into R using a "base" R command: `read.csv()`. However, we'll use the `tidyverse` version in this workshop, which is `read_csv()` from the `readr` package, which is loaded when you load the `tidyverse`.


```r
# install.packages("tidyverse")
# load package
library(tidyverse)
```

---

Reading in a .csv file with the `readr::read_csv()` command is simple. Without any additional steps, it automatically knows that the first row of the file contains the variable names and figures out the "type" of most variables (although not the `day` variable - we'll talk about dates later!)


```r
# read CSV into R
csv_df &lt;- read_csv("./data/weather_data.csv")

# look at the object
# you can use head() or just print the df; if it's a tibble,
# it will only print out the first 10 rows anyway!
head(csv_df) 
```

```
## # A tibble: 6 × 8
##   day      season   month    ws    wd dewpoint  temp        rain
##   &lt;chr&gt;    &lt;chr&gt;    &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;    &lt;dbl&gt; &lt;dbl&gt;       &lt;dbl&gt;
## 1 11/28/18 Kilns On Nov    1.20  354.     287.  294. 0.0000014  
## 2 11/29/18 Kilns On Nov    1.01  344.     287.  294. 0.00000104 
## 3 11/30/18 Kilns On Nov    1.15  314.     287.  294. 0.00000144 
## 4 12/1/18  Kilns On Dec    1.29  341.     287.  294. 0.000000108
## 5 12/2/18  Kilns On Dec    1.33  326.     287.  294. 0          
## 6 12/3/18  Kilns On Dec    1.75  348.     287.  294. 0
```
---

If you had a few extra rows of information that you didn't want to import as part of your data frame, you can easily skip those:

```r
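# read CSV into R under a new name, so the original csv_df is not overwritten
# skip = 3 skips the first 3 lines of the file (here: the header and 2 data rows),
# so the next line is treated as the column names
csv_skip_df &lt;- read_csv("./data/weather_data.csv", skip = 3)
 
# look at the object
csv_skip_df # see how more rows are printed out when we didn't use the head() command?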
```

```
## # A tibble: 250 × 8
##    `11/30/18` `Kilns On` Nov   `1.15185662` `314.1624506` `286.879566`
##    &lt;chr&gt;      &lt;chr&gt;      &lt;chr&gt;        &lt;dbl&gt;         &lt;dbl&gt;        &lt;dbl&gt;
##  1 12/1/18    Kilns On   Dec           1.29          341.         287.
##  2 12/2/18    Kilns On   Dec           1.33          326.         287.
##  3 12/3/18    Kilns On   Dec           1.75          348.         287.
##  4 12/4/18    Kilns On   Dec           1.95          319.         288.
##  5 12/5/18    Kilns On   Dec           1.98          318.         288.
##  6 12/6/18    Kilns On   Dec           1.84          307.         287.
##  7 12/7/18    Kilns On   Dec           1.76          322.         287.
##  8 12/8/18    Kilns On   Dec           2.25          325.         286.
##  9 12/9/18    Kilns On   Dec           1.79          332.         287.
## 10 12/10/18   Kilns On   Dec           1.59          349.         286.
## # … with 240 more rows, and 2 more variables: `293.8420586` &lt;dbl&gt;,
## #   `1.44E-06` &lt;dbl&gt;
```
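---

`skip` is most useful when a file really does start with extra lines above the header. A minimal sketch with made-up file contents (two note lines, then the header, then the data), assuming readr 2.0 or later so that `I()` can be used to pass literal text instead of a file path:

```r
library(readr)

# made-up file contents for illustration only
csv_text &lt;- paste(
  "Station notes: made-up example",
  "Units: Kelvin",
  "day,temp",
  "11/28/18,294",
  "11/29/18,295",
  sep = "\n"
)

# skip the 2 note lines so that "day,temp" is read as the header
skip_df &lt;- read_csv(I(csv_text), skip = 2)
names(skip_df)
```

With `skip = 2`, the column names come from the third line of the text, and the two remaining lines are read as data.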

---

## Text (.txt) files
Sometimes, delimited files (tab-separated, comma-separated, and so on) are saved with the .txt file extension. TXT files can store tabular data, but they can also store simple text. For tabular .txt files, you’ll want to use the more generic `read_delim()` function from readr and tell it which delimiter the file uses.


```r
# read a TXT file into R
txt_df &lt;- read_delim("./data/HAP0006.txt", delim = ",") # specify the delimiter

# look at the object
head(txt_df)
```

```
## # A tibble: 6 × 12
##   household   device     Time specCO figaroCO2 plantower_2_5_v… plantower_10_va…
##   &lt;chr&gt;       &lt;chr&gt;     &lt;dbl&gt;  &lt;dbl&gt;     &lt;dbl&gt;            &lt;dbl&gt;            &lt;dbl&gt;
## 1 Icddrb test HAP0006  1.54e9   1671      2711               36               39
## 2 Icddrb test HAP0006  1.54e9   1671      2722               37               46
## 3 Icddrb test HAP0006  1.54e9   1671      2688               37               39
## 4 Icddrb test HAP0006  1.54e9   1671      2765               37               43
## 5 Icddrb test HAP0006  1.54e9   1671      2689               48               59
## 6 Icddrb test HAP0006  1.54e9   1671      2646               41               49
## # … with 5 more variables: bme_temp_C &lt;dbl&gt;, bme_humidity &lt;dbl&gt;,
## #   fuel_v_cell &lt;dbl&gt;, fuel_percent &lt;dbl&gt;, loggedAt &lt;dttm&gt;
```
???
view the hap.txt file (within R) just to show what a .txt file looks like
---

## Stata .dta files
You can import Stata .dta files directly into R using the `haven` package. This package also allows you to import SAS or SPSS files. The command within the package you'll use is `read_dta()` (notice the pattern?):

```r
# install.packages("haven")
library(haven)

dta_df &lt;- read_dta("./data/lead_mortality.dta")

# look at the object
head(dta_df)
```

```
## # A tibble: 6 × 15
##    year city      state   age hardness    ph infrate typhoid_rate np_tub_rate
##   &lt;dbl&gt; &lt;chr&gt;     &lt;chr&gt; &lt;dbl&gt;    &lt;dbl&gt; &lt;dbl&gt;   &lt;dbl&gt;        &lt;dbl&gt;       &lt;dbl&gt;
## 1  1900 Alameda   CA     29.0       97  7.60   0.110       0.0244     0.0305 
## 2  1900 Albany    NY     30.3       43  7.30   0.299       0.0414     0.0138 
## 3  1900 Allegheny PA     27.1      111  7.30   0.447       0.0940     0.0277 
## 4  1900 Allentown PA     27.8      176  7.70   0.384       0.0282     0.00565
## 5  1900 Altoona   PA     27.0      111  7.30   0.468       0.0437     0.00771
## 6  1900 Amsterdam NY     28.6       43  7.30   0.306       0.0144     0.0191 
## # … with 6 more variables: mom_rate &lt;dbl&gt;, population &lt;dbl&gt;,
## #   precipitation &lt;dbl&gt;, temperature &lt;dbl&gt;, lead &lt;dbl&gt;, foreign_share &lt;dbl&gt;
```

---

## API Data
We already saw how to use an API to access data stored in Google Sheets. APIs can be used to access many other sources of data as well - and many people have written special packages that allow users of common sources of data to authenticate their credentials and then easily import data into R. I won't go through how to use them all - but will list a few common ones:

- [tidycensus](https://walker-data.com/tidycensus/) is an R package that allows users to interface with a select number of the US Census Bureau’s data APIs and return tidyverse-ready data frames.
- [rdhs](https://cran.r-project.org/web/packages/rdhs/vignettes/client.html) allows users to access, search, download, and load USAID's Demographic and Health Survey data.
- [qualtRics](https://cran.r-project.org/web/packages/qualtRics/vignettes/qualtRics.html) implements the retrieval of survey data using the Qualtrics API and aims to reduce the pre-processing steps needed in analyzing such surveys.


---
layout:true
# Exporting Data
---

R has two native formats for writing R objects. Both are very efficient in terms of space as well.

- .rds files: can store a single R object, such as a data frame
- .Rdata or .Rda files: can store multiple objects (think of it like a list)

--

Rds files:

```r
# saves the object "dta_df" as an Rds file
saveRDS(dta_df, file = "./output/example.rds")

# load an Rds file
rds_df &lt;- readRDS("./output/example.rds")
```

--
Rdata files:

```r
# saves all the dataframes as a single Rdata file
save(excel_df, googlesheet_df, csv_df, txt_df, dta_df, 
     file = "./output/all_dfs.Rdata")

# loads the Rdata file
load("./output/all_dfs.Rdata")
```

???
Note that for rds we use "saveRDS" and "readRDS" and the file extension is .rds, but for Rdata, we just use save and load and the file extension is .Rdata - these differences matter! 
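---

The practical difference between the two formats can be sketched with the built-in `mtcars` data and temporary files. The object names below are made up for this illustration.

```r
rds_path &lt;- tempfile(fileext = ".rds")
rdata_path &lt;- tempfile(fileext = ".Rdata")

# .rds: one object; you choose the name when reading it back in
saveRDS(mtcars, file = rds_path)
cars_copy &lt;- readRDS(rds_path)

# .Rdata: objects are saved and restored under their original names
small_df &lt;- head(mtcars, 3)
save(small_df, file = rdata_path)
rm(small_df)
load(rdata_path) # recreates the object "small_df"

identical(cars_copy, mtcars) # TRUE
exists("small_df")           # TRUE
```

Note that `load()` will silently overwrite any existing object that has the same name as one stored in the file.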
---

You can also export data in many formats using the `write` versions of these commands. This can be useful if you're collaborating with people who use different programs.

You need to specify the name of the object to export and the path of the file it will be written to, including the correct file extension.

```r
# Write to csv
write_csv(dta_df, file = "./output/example.csv")

# Write to Stata (haven's write_dta() names this argument `path`, not `file`)
write_dta(csv_df, path = "./output/example.dta")
```

---
layout:false
class: middle, center, inverse
# Questions?

---
layout:true
# Data Wrangling
---

Before we get into "data wrangling" in the Tidyverse, we have to understand what [tidy data](https://www.jstatsoft.org/article/view/v059i10) are.

--

## Principles of Tidy Data
.footnote[*The tidyverse has a very strong opinion about what "tidy" data is. I agree with most of the principles, but think there are plenty of cases in which people need to work with "untidy" data by the tidyverse definitions.]

1. Each variable should be in one column.

--

2. Each observation of that variable should be in a different row.

--

3. There should be one table for each “type” of data.

--

4. If you have multiple tables, each table should include a column (with the same column label!) that allows them to be joined or merged.

???
An example of different types of data could be demographic information about patients, which would be stored in one data frame, and measurements collected from those patients at the doctor (such as height, weight, and BP) in another tidy data frame. Clearly, depending on your research question, you may need to merge these into a single data frame to run a regression of BP on patient demographics.
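---

To illustrate the fourth principle, here is a minimal sketch with two made-up tables that share a `patient_id` column. It assumes the `dplyr` package is installed (it is loaded as part of the tidyverse); all values are invented for illustration.

```r
library(dplyr)

# made-up data for illustration only
demographics &lt;- tibble(patient_id = c(1, 2, 3), age = c(34, 51, 47))
measurements &lt;- tibble(patient_id = c(1, 1, 2, 3),
                       visit = c(1, 2, 1, 1),
                       systolic_bp = c(118, 121, 135, 127))

# "patient_id" has the same name in both tables, so it can act as the key
patients_df &lt;- left_join(measurements, demographics, by = "patient_id")
patients_df
```

Each row of `measurements` is kept, and the matching `age` from `demographics` is added as a new column.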
---
layout:false
# Examples of "untidy" data
![untidy](./figs/untidy.png)
???
What makes these "untidy"?
---

# Example of tidy data

![tidy](./figs/tidy-1.png)
---
# Tidying Process aka Data Wrangling

![untidy_to_tidy](./figs/untidy_to_tidy.png)

---
# The pipe operator &lt;img src="./figs/pipe.png" alt="pipe" width="60"/&gt;
Pipes, or `%&gt;%`, are a `tidyverse` tool (officially contained in the `magrittr` package) for chaining together multiple operations on the same data frame. They can greatly simplify your code and make your operations more intuitive. I'll introduce them now, and soon you'll see how useful they can be! Note that if you're not using pipes, you'll have to pass the data frame directly as an argument to each command you use.

--


```r
# head(dta_df) # this is how you specify the data argument 
dta_df %&gt;%
    head()
```

```
## # A tibble: 6 × 15
##    year city      state   age hardness    ph infrate typhoid_rate np_tub_rate
##   &lt;dbl&gt; &lt;chr&gt;     &lt;chr&gt; &lt;dbl&gt;    &lt;dbl&gt; &lt;dbl&gt;   &lt;dbl&gt;        &lt;dbl&gt;       &lt;dbl&gt;
## 1  1900 Alameda   CA     29.0       97  7.60   0.110       0.0244     0.0305 
## 2  1900 Albany    NY     30.3       43  7.30   0.299       0.0414     0.0138 
## 3  1900 Allegheny PA     27.1      111  7.30   0.447       0.0940     0.0277 
## 4  1900 Allentown PA     27.8      176  7.70   0.384       0.0282     0.00565
## 5  1900 Altoona   PA     27.0      111  7.30   0.468       0.0437     0.00771
## 6  1900 Amsterdam NY     28.6       43  7.30   0.306       0.0144     0.0191 
## # … with 6 more variables: mom_rate &lt;dbl&gt;, population &lt;dbl&gt;,
## #   precipitation &lt;dbl&gt;, temperature &lt;dbl&gt;, lead &lt;dbl&gt;, foreign_share &lt;dbl&gt;
```

???
When you use the pipe, you start with the name of the df and then chain commands onto that dataframe from the pipe.
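---
# The pipe operator

To see the chaining more concretely, here is a small sketch that compares nested function calls with the same steps written using pipes. It uses a plain numeric vector rather than the workshop data and loads `magrittr` directly, although loading the tidyverse also makes the pipe available.

```r
library(magrittr)

# nested calls are read from the inside out
sum(sqrt(c(1, 4, 9)))

# the same steps with pipes are read from top to bottom
c(1, 4, 9) %&gt;%
    sqrt() %&gt;%
    sum()
```

Both versions return 6: the square roots are 1, 2, and 3, which are then added together.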
---
layout:true
# Basic data frame operations
---

To see how these work, let's use the World Bank [World Development Indicators Data](https://databank.worldbank.org/source/world-development-indicators), which we read into R earlier using `read_excel()`.


```r
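wdi_path &lt;- "./data/Data_Extract_From_World_Development_Indicators.xlsx"
# first pass: read the whole sheet so we can find the last country row
wdi &lt;- read_excel(wdi_path)
# second pass: keep only the rows up to the last "Zimbabwe" row
wdi &lt;- read_excel(wdi_path, n_max = max(which(wdi$`Country Name` == "Zimbabwe")))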
head(wdi)
```

```
## # A tibble: 6 × 16
##   `Country Name` `Country Code` `Series Name`      `Series Code` `2010 [YR2010]`
##   &lt;chr&gt;          &lt;chr&gt;          &lt;chr&gt;              &lt;chr&gt;         &lt;chr&gt;          
## 1 Afghanistan    AFG            Access to electri… EG.ELC.ACCS.… 42.70000076293…
## 2 Afghanistan    AFG            Educational attai… SE.SEC.CUAT.… ..             
## 3 Afghanistan    AFG            Educational attai… SE.SEC.CUAT.… ..             
## 4 Afghanistan    AFG            Poverty gap at $1… SI.POV.GAPS   ..             
## 5 Afghanistan    AFG            Incidence of mala… SH.MLR.INCD.… 12.98047343667…
## 6 Afghanistan    AFG            Individuals using… IT.NET.USER.… 4              
## # … with 11 more variables: `2011 [YR2011]` &lt;chr&gt;, `2012 [YR2012]` &lt;chr&gt;,
## #   `2013 [YR2013]` &lt;chr&gt;, `2014 [YR2014]` &lt;chr&gt;, `2015 [YR2015]` &lt;chr&gt;,
## #   `2016 [YR2016]` &lt;chr&gt;, `2017 [YR2017]` &lt;chr&gt;, `2018 [YR2018]` &lt;chr&gt;,
## #   `2019 [YR2019]` &lt;chr&gt;, `2020 [YR2020]` &lt;chr&gt;, `2021 [YR2021]` &lt;chr&gt;
```
???
Talk through what the WDI data contains. All countries, different series, 2010-2021. Explain how it's wide &amp; long - with different years stored as variables and different variables stored in rows -- very untidy data!
---

The opposite of `head` is `tail`:

```r
tail(wdi)
```

```
## # A tibble: 6 × 16
##   `Country Name` `Country Code` `Series Name`      `Series Code` `2010 [YR2010]`
##   &lt;chr&gt;          &lt;chr&gt;          &lt;chr&gt;              &lt;chr&gt;         &lt;chr&gt;          
## 1 Zimbabwe       ZWE            Individuals using… IT.NET.USER.… 6.4            
## 2 Zimbabwe       ZWE            Labor force parti… SL.TLF.CACT.… ..             
## 3 Zimbabwe       ZWE            Labor force parti… SL.TLF.CACT.… ..             
## 4 Zimbabwe       ZWE            Prevalence of und… SN.ITK.DEFC.… ..             
## 5 Zimbabwe       ZWE            Military expendit… MS.MIL.XPND.… 0.816251453246…
## 6 Zimbabwe       ZWE            GDP per capita (c… NY.GDP.PCAP.… 1110.447012348…
## # … with 11 more variables: `2011 [YR2011]` &lt;chr&gt;, `2012 [YR2012]` &lt;chr&gt;,
## #   `2013 [YR2013]` &lt;chr&gt;, `2014 [YR2014]` &lt;chr&gt;, `2015 [YR2015]` &lt;chr&gt;,
## #   `2016 [YR2016]` &lt;chr&gt;, `2017 [YR2017]` &lt;chr&gt;, `2018 [YR2018]` &lt;chr&gt;,
## #   `2019 [YR2019]` &lt;chr&gt;, `2020 [YR2020]` &lt;chr&gt;, `2021 [YR2021]` &lt;chr&gt;
```
---

`names()`, `ls()`, and `colnames()` will all give you the variable names of a data frame:

--


```r
names(wdi)
```

```
##  [1] "Country Name"  "Country Code"  "Series Name"   "Series Code"  
##  [5] "2010 [YR2010]" "2011 [YR2011]" "2012 [YR2012]" "2013 [YR2013]"
##  [9] "2014 [YR2014]" "2015 [YR2015]" "2016 [YR2016]" "2017 [YR2017]"
## [13] "2018 [YR2018]" "2019 [YR2019]" "2020 [YR2020]" "2021 [YR2021]"
```

```r
ls(wdi) # note that ls prints them in alphabetical order, while the other 2 print them in the order they appear in the df
```

```
##  [1] "2010 [YR2010]" "2011 [YR2011]" "2012 [YR2012]" "2013 [YR2013]"
##  [5] "2014 [YR2014]" "2015 [YR2015]" "2016 [YR2016]" "2017 [YR2017]"
##  [9] "2018 [YR2018]" "2019 [YR2019]" "2020 [YR2020]" "2021 [YR2021]"
## [13] "Country Code"  "Country Name"  "Series Code"   "Series Name"
```

```r
colnames(wdi)
```

```
##  [1] "Country Name"  "Country Code"  "Series Name"   "Series Code"  
##  [5] "2010 [YR2010]" "2011 [YR2011]" "2012 [YR2012]" "2013 [YR2013]"
##  [9] "2014 [YR2014]" "2015 [YR2015]" "2016 [YR2016]" "2017 [YR2017]"
## [13] "2018 [YR2018]" "2019 [YR2019]" "2020 [YR2020]" "2021 [YR2021]"
```
---

More ways to query basic info on a data frame:

```r
ncol(wdi) # number of columns
```

```
## [1] 16
```

```r
nrow(wdi) # number of rows
```

```
## [1] 2387
```

```r
length(wdi) # length when applied to a df also returns # of columns
```

```
## [1] 16
```

```r
dim(wdi) # dimensions of the df: rows by columns
```

```
## [1] 2387   16
```
---

You can get an overview of what's in the data frame using `summary()`:

```r
summary(wdi)
```

```
##  Country Name       Country Code       Series Name        Series Code       
##  Length:2387        Length:2387        Length:2387        Length:2387       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##  2010 [YR2010]      2011 [YR2011]      2012 [YR2012]      2013 [YR2013]     
##  Length:2387        Length:2387        Length:2387        Length:2387       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##  2014 [YR2014]      2015 [YR2015]      2016 [YR2016]      2017 [YR2017]     
##  Length:2387        Length:2387        Length:2387        Length:2387       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##  2018 [YR2018]      2019 [YR2019]      2020 [YR2020]      2021 [YR2021]     
##  Length:2387        Length:2387        Length:2387        Length:2387       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
```
???
Looks like we have some work to do on this data frame! Everything is currently stored as a character.
---

Another way to see what's in a data frame (and other objects) is with the `str()` command:

```r
str(wdi)
```

```
## tibble [2,387 × 16] (S3: tbl_df/tbl/data.frame)
##  $ Country Name : chr [1:2387] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Country Code : chr [1:2387] "AFG" "AFG" "AFG" "AFG" ...
##  $ Series Name  : chr [1:2387] "Access to electricity (% of population)" "Educational attainment, at least completed lower secondary, population 25+, female (%) (cumulative)" "Educational attainment, at least completed lower secondary, population 25+, male (%) (cumulative)" "Poverty gap at $1.90 a day (2011 PPP) (%)" ...
##  $ Series Code  : chr [1:2387] "EG.ELC.ACCS.ZS" "SE.SEC.CUAT.LO.FE.ZS" "SE.SEC.CUAT.LO.MA.ZS" "SI.POV.GAPS" ...
##  $ 2010 [YR2010]: chr [1:2387] "42.700000762939503" ".." ".." ".." ...
##  $ 2011 [YR2011]: chr [1:2387] "43.222019195556598" ".." ".." ".." ...
##  $ 2012 [YR2012]: chr [1:2387] "69.099998474121094" ".." ".." ".." ...
##  $ 2013 [YR2013]: chr [1:2387] "68.2906494140625" ".." ".." ".." ...
##  $ 2014 [YR2014]: chr [1:2387] "89.5" ".." ".." ".." ...
##  $ 2015 [YR2015]: chr [1:2387] "71.5" ".." ".." ".." ...
##  $ 2016 [YR2016]: chr [1:2387] "97.699996948242202" ".." ".." ".." ...
##  $ 2017 [YR2017]: chr [1:2387] "97.699996948242202" ".." ".." ".." ...
##  $ 2018 [YR2018]: chr [1:2387] "96.616134643554702" ".." ".." ".." ...
##  $ 2019 [YR2019]: chr [1:2387] "97.699996948242202" ".." ".." ".." ...
##  $ 2020 [YR2020]: chr [1:2387] "97.699996948242202" ".." ".." ".." ...
##  $ 2021 [YR2021]: chr [1:2387] ".." "6.39573001861572" "14.865710258483899" ".." ...
```
---

To specify a single variable from a data frame, use the dollar sign `$` operator:

```r
head(wdi$`Country Name`) # note that variable names shouldn't have spaces in them; this is a "bad" variable name
```

```
## [1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
## [6] "Afghanistan"
```

```r
head(wdi$`Series Name`) # that's why they're enclosed in `back ticks`
```

```
## [1] "Access to electricity (% of population)"                                                            
## [2] "Educational attainment, at least completed lower secondary, population 25+, female (%) (cumulative)"
## [3] "Educational attainment, at least completed lower secondary, population 25+, male (%) (cumulative)"  
## [4] "Poverty gap at $1.90 a day (2011 PPP) (%)"                                                          
## [5] "Incidence of malaria (per 1,000 population at risk)"                                                
## [6] "Individuals using the Internet (% of population)"
```

---

## Recap

- You'll mostly be working with data frames and their cousin, the tibble

- Use the `tidyverse`!!! This will provide a special type of data frame called a “tibble” that has nice default printing behavior, among other benefits.

- When in doubt, `str()` something or print something.

- Always understand the basic extent of your data frames: number of rows and columns.

- Understand what variables are in your data.

- Refer to variables by name, e.g., `wdi$country_name`, not by column number. Your code will be more robust and readable.

---
layout:false
# Data Cleaning Packages

[`dplyr`](https://dplyr.tidyverse.org/) is the workhorse package for data cleaning in the `tidyverse`. We'll also make a lot of use of [`tidyr`](https://tidyr.tidyverse.org/) and [`janitor`](https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html) (`janitor` is not part of the `tidyverse`).

If you've never installed these before, you'll need to install them before loading them, for example with `install.packages("janitor")`. `dplyr` and `tidyr` are core `tidyverse` packages and will be installed automatically when you run `install.packages("tidyverse")`.

Then you need to load them:

```r
library(janitor)
library(tidyverse) # loads all of the core tidyverse packages
```
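
If you re-run a script often, you may not want to reinstall a package every time. A minimal sketch of one way to install only when needed, using base R's `requireNamespace()` (the choice of `janitor` here is just for illustration):

```r
# Install janitor only if it is not already available on this machine
if (!requireNamespace("janitor", quietly = TRUE)) {
    install.packages("janitor")
}

library(janitor)
```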
---
# Cleaning up variable names

`janitor` provides useful tools for cleaning messy data, such as:

- `clean_names()` - cleans the variable names of a data frame
- `tabyl()` - provides cross-tabs of variables (although I like the `tabulator` package for this)
- `get_dupes()` - identify duplicate observations

In R, a variable name must start with a letter (or a period that is not followed by a number) and can contain only letters, numbers, underscores ('_'), and periods ('.'). Other special characters and spaces are not allowed. Tibbles will accept names that break these rules, but you then have to wrap them in back ticks every time you use them, which complicates things. So let's fix this WDI data.


```r
# these are equivalent -- but the first uses the pipe (%&gt;%) syntax
wdi_clean &lt;- wdi %&gt;%
    clean_names()

wdi_clean &lt;- clean_names(wdi)
```
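
To see what `clean_names()` does without the full WDI file, here is a small sketch on a made-up one-row tibble (`toy` and its values are hypothetical; the two column names are copied from the WDI data):

```r
library(janitor)
library(tibble)

# a toy tibble with two of the "bad" WDI column names
toy &lt;- tibble(`Country Name` = "India", `2010 [YR2010]` = "76.3")

cleaned &lt;- clean_names(toy)
names(cleaned)  # "country_name" "x2010_yr2010"
```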
???
You have to assign your new data frame to an object to save this version with clean names. If you use the same name, it will overwrite that object with the new version.
---
# Cleaning up variable names

Let's see what the variable names look like now:

```r
head(wdi_clean)
```

```
## # A tibble: 6 × 16
##   country_name country_code series_name    series_code x2010_yr2010 x2011_yr2011
##   &lt;chr&gt;        &lt;chr&gt;        &lt;chr&gt;          &lt;chr&gt;       &lt;chr&gt;        &lt;chr&gt;       
## 1 Afghanistan  AFG          Access to ele… EG.ELC.ACC… 42.70000076… 43.22201919…
## 2 Afghanistan  AFG          Educational a… SE.SEC.CUA… ..           ..          
## 3 Afghanistan  AFG          Educational a… SE.SEC.CUA… ..           ..          
## 4 Afghanistan  AFG          Poverty gap a… SI.POV.GAPS ..           ..          
## 5 Afghanistan  AFG          Incidence of … SH.MLR.INC… 12.98047343… 15.60724125…
## 6 Afghanistan  AFG          Individuals u… IT.NET.USE… 4            5           
## # … with 10 more variables: x2012_yr2012 &lt;chr&gt;, x2013_yr2013 &lt;chr&gt;,
## #   x2014_yr2014 &lt;chr&gt;, x2015_yr2015 &lt;chr&gt;, x2016_yr2016 &lt;chr&gt;,
## #   x2017_yr2017 &lt;chr&gt;, x2018_yr2018 &lt;chr&gt;, x2019_yr2019 &lt;chr&gt;,
## #   x2020_yr2020 &lt;chr&gt;, x2021_yr2021 &lt;chr&gt;
```
???
The default makes variable names lower case and puts an underscore where spaces used to be. You can change these defaults by specifying other available patterns - look up the help file to see what's there. Also notice how the variables that started with a number now have an x in front of them? That's because R variable names can't start with a number.
---
#  dplyr &lt;img src="./figs/dplyr.png" alt="dplyr" width="60"/&gt;

Below is a list of the main functions/commands available within `dplyr`. We won't cover all of them in this workshop. I highly recommend reviewing the documentation for [`dplyr`](https://dplyr.tidyverse.org/), which has extensive explanations and examples. The nice thing about `dplyr` functions is they're all very literal -- they do what they say:

- `mutate()` - create a new column or modify an existing column
- `glimpse()` - get an overview of what’s included in dataset (similar to `str()`)
- `filter()` - filter rows
- `select()` - select, rename, and reorder columns
- `rename()` - rename columns
- `arrange()` - reorder rows
- `group_by()` - group variables
- `summarize()` - summarize information within a dataset
- `left_join()` - combine data across data frames (other types of joins are also available)
- `tally()` - get overall sum of values of specified column(s) or the number of rows of tibble
- `count()` - get counts of unique values of specified column(s) (shortcut of `group_by()` and `tally()`)
- `add_count()` - add values of count() as a new column
- `add_tally()` - add value(s) of tally() as a new column
---

layout: true
# Mutate
---

A lot of the data cleaning work you'll do is within the `mutate` function. For example, let's convert the country name variable into a factor and then convert these numeric variables that are being stored as strings into actual numbers. 

Factors are used for storing categorical variables and have a lot of useful properties for plotting, as we'll see later. Factors consist of two components: the actual values of the data and the possible levels within the factor. In general, the levels are friendly human-readable character strings, like “male/female” and “control/treated”. But never ever ever forget that, under the hood, R is really storing integer codes 1, 2, 3, etc.


```r
wdi &lt;- wdi %&gt;%  # notice how i'm chaining the original wdi df to the clean_names
    clean_names() %&gt;% # which is chained to the mutate 
    mutate(
        country_name = factor(country_name)
    )

str(wdi$country_name)
```

```
##  Factor w/ 217 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
```
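
A small sketch of the "levels plus integer codes" idea, using a made-up vector (`sex` is a hypothetical example, not part of the WDI data):

```r
sex &lt;- factor(c("male", "female", "female"))

levels(sex)      # "female" "male"  (levels are sorted alphabetically by default)
as.integer(sex)  # 2 1 1            (the integer codes stored under the hood)
```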

???
Now that country name is a factor, the levels are the names of the countries, but the data underneath those levels are numeric. 
---


```r
wdi &lt;- wdi %&gt;%
    mutate(
        x2010_yr2010 = as.numeric(x2010_yr2010)
    )

str(wdi$x2010_yr2010) # numeric
```

```
##  num [1:2387] 42.7 NA NA NA 13 ...
```

```r
str(wdi$x2011_yr2011) # still a character string
```

```
##  chr [1:2387] "43.222019195556598" ".." ".." ".." "15.607241259574" "5" ...
```

We changed the `x2010_yr2010` variable to be numeric, but what about all the other years? It would be inefficient if we had to do these all one-by-one. Luckily, we don't! We can use the `tidyverse` [selection tools](https://dplyr.tidyverse.org/reference/select.html).
---

In this case, mutating many variables at once is easy because they all follow the same pattern -- they all start with "x". We'll use `mutate()` with [`across()`](https://dplyr.tidyverse.org/reference/across.html), which applies the same operation to multiple columns at once.


```r
wdi &lt;- wdi %&gt;%
    mutate(across(starts_with("x"), as.numeric)) 

# now all of them are numeric
str(wdi$x2010_yr2010)
```

```
##  num [1:2387] 42.7 NA NA NA 13 ...
```

```r
str(wdi$x2011_yr2011)
```

```
##  num [1:2387] 43.2 NA NA NA 15.6 ...
```

```r
str(wdi$x2021_yr2021)
```

```
##  num [1:2387] NA 6.4 14.9 NA NA ...
```
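
Where do the `NA`s come from? The World Bank export uses ".." for missing values, and `as.numeric()` cannot parse that. A minimal sketch on a made-up vector (`raw` and `parsed` are hypothetical names):

```r
raw &lt;- c("42.7", "..", "13")

# ".." cannot be parsed as a number, so it becomes NA;
# suppressWarnings() hides the "NAs introduced by coercion" warning
parsed &lt;- suppressWarnings(as.numeric(raw))

parsed         # 42.7   NA 13.0
is.na(parsed)  # FALSE  TRUE FALSE
```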
???
Notice the NAs -- this is what R shows when data are missing.
---
layout: true
# Filter
---

When working with a large dataset, you’re often interested in only working with a portion of the data at any one time. For example, this dataset contains data on 217 countries, but what if you wanted to work with data only for India or only for countries in South Asia? To do this, you would want to filter your dataset to only include data that match these criteria.


```r
india &lt;- wdi %&gt;%
    filter(
        country_name == "India"
    )

head(india)
```

```
## # A tibble: 6 × 16
##   country_name country_code series_name    series_code x2010_yr2010 x2011_yr2011
##   &lt;fct&gt;        &lt;chr&gt;        &lt;chr&gt;          &lt;chr&gt;              &lt;dbl&gt;        &lt;dbl&gt;
## 1 India        IND          Access to ele… EG.ELC.ACC…         76.3         67.6
## 2 India        IND          Educational a… SE.SEC.CUA…         NA           27.7
## 3 India        IND          Educational a… SE.SEC.CUA…         NA           47.1
## 4 India        IND          Poverty gap a… SI.POV.GAPS         NA            4.6
## 5 India        IND          Incidence of … SH.MLR.INC…         17.5         14.8
## 6 India        IND          Individuals u… IT.NET.USE…          7.5         10.1
## # … with 10 more variables: x2012_yr2012 &lt;dbl&gt;, x2013_yr2013 &lt;dbl&gt;,
## #   x2014_yr2014 &lt;dbl&gt;, x2015_yr2015 &lt;dbl&gt;, x2016_yr2016 &lt;dbl&gt;,
## #   x2017_yr2017 &lt;dbl&gt;, x2018_yr2018 &lt;dbl&gt;, x2019_yr2019 &lt;dbl&gt;,
## #   x2020_yr2020 &lt;dbl&gt;, x2021_yr2021 &lt;dbl&gt;
```
---

You can filter based on multiple criteria using the or ("|") operator.

```r
wdi %&gt;% 
# notice this time I didn't assign it to a new object? this won't be saved in the environment, only printed out
    filter(
        country_name == "India" | country_name == "Bangladesh"
    )
```

```
## # A tibble: 22 × 16
##    country_name country_code series_name   series_code x2010_yr2010 x2011_yr2011
##    &lt;fct&gt;        &lt;chr&gt;        &lt;chr&gt;         &lt;chr&gt;              &lt;dbl&gt;        &lt;dbl&gt;
##  1 Bangladesh   BGD          Access to el… EG.ELC.ACC…        55.3         59.6 
##  2 Bangladesh   BGD          Educational … SE.SEC.CUA…        NA           29.6 
##  3 Bangladesh   BGD          Educational … SE.SEC.CUA…        NA           41.1 
##  4 Bangladesh   BGD          Poverty gap … SI.POV.GAPS         3.5         NA   
##  5 Bangladesh   BGD          Incidence of… SH.MLR.INC…         4.37         3.95
##  6 Bangladesh   BGD          Individuals … IT.NET.USE…         3.7          4.5 
##  7 Bangladesh   BGD          Labor force … SL.TLF.CAC…        35.5         NA   
##  8 Bangladesh   BGD          Labor force … SL.TLF.CAC…        81.6         NA   
##  9 Bangladesh   BGD          Prevalence o… SN.ITK.DEF…        15.2         15.7 
## 10 Bangladesh   BGD          Military exp… MS.MIL.XPN…         1.32         1.36
## # … with 12 more rows, and 10 more variables: x2012_yr2012 &lt;dbl&gt;,
## #   x2013_yr2013 &lt;dbl&gt;, x2014_yr2014 &lt;dbl&gt;, x2015_yr2015 &lt;dbl&gt;,
## #   x2016_yr2016 &lt;dbl&gt;, x2017_yr2017 &lt;dbl&gt;, x2018_yr2018 &lt;dbl&gt;,
## #   x2019_yr2019 &lt;dbl&gt;, x2020_yr2020 &lt;dbl&gt;, x2021_yr2021 &lt;dbl&gt;
```
---

We can also mutate to create a new variable and then filter based on that. Here we'll use the `%in%` operator that comes with base R. This code creates a new logical variable that is `TRUE` if the country name is any of those South Asian countries and `FALSE` otherwise.


```r
sa &lt;- wdi %&gt;% 
    # this south asia variable follows the WB classification for South Asian countries
    mutate(
        south_asia = country_name %in% c("Afghanistan", "Bangladesh","Bhutan",
                                        "India", "Maldives", "Nepal", 
                                        "Pakistan", "Sri Lanka")
    ) %&gt;%
    filter(
        south_asia # filter() keeps the rows where this is TRUE
    )
```
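
If you don't need to keep the indicator variable, `%in%` can also go straight inside `filter()`. A minimal sketch on a made-up tibble (`toy` and `kept` are hypothetical names):

```r
library(dplyr)

toy &lt;- tibble(
    country_name = c("India", "Nepal", "France"),
    value = c(1, 2, 3)
)

# keep only the rows whose country_name appears in the vector
kept &lt;- toy %&gt;%
    filter(country_name %in% c("India", "Nepal"))

kept$country_name  # "India" "Nepal"
```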
---

Let's take a look at this new data frame. We can use `dplyr::count()` to give us a quick tally of how many rows there are for each country:

```r
sa %&gt;%
    count(country_name)
```

```
## # A tibble: 8 × 2
##   country_name     n
##   &lt;fct&gt;        &lt;int&gt;
## 1 Afghanistan     11
## 2 Bangladesh      11
## 3 Bhutan          11
## 4 India           11
## 5 Maldives        11
## 6 Nepal           11
## 7 Pakistan        11
## 8 Sri Lanka       11
```
---
layout: true
# Select
---

`filter()` subsets data frames by row (i.e., it keeps the observations that meet a certain condition); `select()` is what you use when you want to... select certain columns. You can also use `select()` to reorder columns. For example, maybe we don't want to keep a country name AND a country code variable in this data set (or a series name AND a series code).

Here I'm selecting `country_name`, `series_name`, and all the variables that start with "x" (using that same "selection tool" we saw before).

```r
wdi %&gt;%
    select(country_name, series_name, starts_with("x"))
```

```
## # A tibble: 2,387 × 14
##    country_name series_name  x2010_yr2010 x2011_yr2011 x2012_yr2012 x2013_yr2013
##    &lt;fct&gt;        &lt;chr&gt;               &lt;dbl&gt;        &lt;dbl&gt;        &lt;dbl&gt;        &lt;dbl&gt;
##  1 Afghanistan  Access to e…        42.7         43.2         69.1         68.3 
##  2 Afghanistan  Educational…        NA           NA           NA           NA   
##  3 Afghanistan  Educational…        NA           NA           NA           NA   
##  4 Afghanistan  Poverty gap…        NA           NA           NA           NA   
##  5 Afghanistan  Incidence o…        13.0         15.6          9.19        11.2 
##  6 Afghanistan  Individuals…         4            5            5.45         5.9 
##  7 Afghanistan  Labor force…        NA           NA           16.0         NA   
##  8 Afghanistan  Labor force…        NA           NA           77.1         NA   
##  9 Afghanistan  Prevalence …        23.7         24.7         28.2         26.3 
## 10 Afghanistan  Military ex…         1.95         1.82         1.18         1.08
## # … with 2,377 more rows, and 8 more variables: x2014_yr2014 &lt;dbl&gt;,
## #   x2015_yr2015 &lt;dbl&gt;, x2016_yr2016 &lt;dbl&gt;, x2017_yr2017 &lt;dbl&gt;,
## #   x2018_yr2018 &lt;dbl&gt;, x2019_yr2019 &lt;dbl&gt;, x2020_yr2020 &lt;dbl&gt;,
## #   x2021_yr2021 &lt;dbl&gt;
```
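
Reordering works the same way: columns come out in the order you list them. A short sketch on a made-up tibble, using the `everything()` selection helper to mean "all the remaining columns" (`toy` and `reordered` are hypothetical names):

```r
library(dplyr)

toy &lt;- tibble(id = 1, x = 2, y = 3)

# put y first, then every remaining column in its existing order
reordered &lt;- toy %&gt;%
    select(y, everything())

names(reordered)  # "y"  "id" "x"
```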
---

You can also do the "negative" of select and say you don't want certain variables:

```r
wdi %&gt;%
    select(-c(country_code, series_code))
```

```
## # A tibble: 2,387 × 14
##    country_name series_name  x2010_yr2010 x2011_yr2011 x2012_yr2012 x2013_yr2013
##    &lt;fct&gt;        &lt;chr&gt;               &lt;dbl&gt;        &lt;dbl&gt;        &lt;dbl&gt;        &lt;dbl&gt;
##  1 Afghanistan  Access to e…        42.7         43.2         69.1         68.3 
##  2 Afghanistan  Educational…        NA           NA           NA           NA   
##  3 Afghanistan  Educational…        NA           NA           NA           NA   
##  4 Afghanistan  Poverty gap…        NA           NA           NA           NA   
##  5 Afghanistan  Incidence o…        13.0         15.6          9.19        11.2 
##  6 Afghanistan  Individuals…         4            5            5.45         5.9 
##  7 Afghanistan  Labor force…        NA           NA           16.0         NA   
##  8 Afghanistan  Labor force…        NA           NA           77.1         NA   
##  9 Afghanistan  Prevalence …        23.7         24.7         28.2         26.3 
## 10 Afghanistan  Military ex…         1.95         1.82         1.18         1.08
## # … with 2,377 more rows, and 8 more variables: x2014_yr2014 &lt;dbl&gt;,
## #   x2015_yr2015 &lt;dbl&gt;, x2016_yr2016 &lt;dbl&gt;, x2017_yr2017 &lt;dbl&gt;,
## #   x2018_yr2018 &lt;dbl&gt;, x2019_yr2019 &lt;dbl&gt;, x2020_yr2020 &lt;dbl&gt;,
## #   x2021_yr2021 &lt;dbl&gt;
```
---

And combine everything: `mutate()`,  `filter()` &amp; `select()`, chaining all the operations together with `%&gt;%`s:

```r
wdi %&gt;% 
    mutate(
        south_asia = country_name %in% c("Afghanistan", "Bangladesh","Bhutan",
                                        "India", "Maldives", "Nepal", 
                                        "Pakistan", "Sri Lanka")
    ) %&gt;%
    filter(
        south_asia # filter() keeps the rows where this is TRUE
    ) %&gt;%
    select(country_name, series_name, starts_with("x"))
```

```
## # A tibble: 88 × 14
##    country_name series_name  x2010_yr2010 x2011_yr2011 x2012_yr2012 x2013_yr2013
##    &lt;fct&gt;        &lt;chr&gt;               &lt;dbl&gt;        &lt;dbl&gt;        &lt;dbl&gt;        &lt;dbl&gt;
##  1 Afghanistan  Access to e…        42.7         43.2         69.1         68.3 
##  2 Afghanistan  Educational…        NA           NA           NA           NA   
##  3 Afghanistan  Educational…        NA           NA           NA           NA   
##  4 Afghanistan  Poverty gap…        NA           NA           NA           NA   
##  5 Afghanistan  Incidence o…        13.0         15.6          9.19        11.2 
##  6 Afghanistan  Individuals…         4            5            5.45         5.9 
##  7 Afghanistan  Labor force…        NA           NA           16.0         NA   
##  8 Afghanistan  Labor force…        NA           NA           77.1         NA   
##  9 Afghanistan  Prevalence …        23.7         24.7         28.2         26.3 
## 10 Afghanistan  Military ex…         1.95         1.82         1.18         1.08
## # … with 78 more rows, and 8 more variables: x2014_yr2014 &lt;dbl&gt;,
## #   x2015_yr2015 &lt;dbl&gt;, x2016_yr2016 &lt;dbl&gt;, x2017_yr2017 &lt;dbl&gt;,
## #   x2018_yr2018 &lt;dbl&gt;, x2019_yr2019 &lt;dbl&gt;, x2020_yr2020 &lt;dbl&gt;,
## #   x2021_yr2021 &lt;dbl&gt;
```
---
layout:false
# Rename

To rename variables in your dataset, you can use the aptly named `rename()` command. The syntax is `new_name = old_name`, with the new name on the left, which may differ from other programs you've used. You can rename multiple variables within the same command -- they just need to be separated by a comma.


```r
wdi %&gt;%
    select(-c(country_code, series_code)) %&gt;%
    rename(
        country = country_name,
        series = series_name
    )
```

```
## # A tibble: 2,387 × 14
##    country     series        x2010_yr2010 x2011_yr2011 x2012_yr2012 x2013_yr2013
##    &lt;fct&gt;       &lt;chr&gt;                &lt;dbl&gt;        &lt;dbl&gt;        &lt;dbl&gt;        &lt;dbl&gt;
##  1 Afghanistan Access to el…        42.7         43.2         69.1         68.3 
##  2 Afghanistan Educational …        NA           NA           NA           NA   
##  3 Afghanistan Educational …        NA           NA           NA           NA   
##  4 Afghanistan Poverty gap …        NA           NA           NA           NA   
##  5 Afghanistan Incidence of…        13.0         15.6          9.19        11.2 
##  6 Afghanistan Individuals …         4            5            5.45         5.9 
##  7 Afghanistan Labor force …        NA           NA           16.0         NA   
##  8 Afghanistan Labor force …        NA           NA           77.1         NA   
##  9 Afghanistan Prevalence o…        23.7         24.7         28.2         26.3 
## 10 Afghanistan Military exp…         1.95         1.82         1.18         1.08
## # … with 2,377 more rows, and 8 more variables: x2014_yr2014 &lt;dbl&gt;,
## #   x2015_yr2015 &lt;dbl&gt;, x2016_yr2016 &lt;dbl&gt;, x2017_yr2017 &lt;dbl&gt;,
## #   x2018_yr2018 &lt;dbl&gt;, x2019_yr2019 &lt;dbl&gt;, x2020_yr2020 &lt;dbl&gt;,
## #   x2021_yr2021 &lt;dbl&gt;
```
---
layout:false
class: middle, center, inverse
# Questions?

---
layout:true
# Reshaping Data  &lt;img src="./figs/tidyr.png" alt="tidyr" width="60"/&gt;
---

Tidy data generally exist in two forms: wide data and long data. Both types of data are used and needed in data analysis, and fortunately, there are tools that can take you from wide-to-long format and from long-to-wide format (and back and forth as needed). This is called **reshaping** data and makes it easy to work with any tidy dataset.

--

### Wide Data
Wide data has a column for each variable and a row for each observation. Data are often entered and stored in this manner. 

--

### Long Data
Long data, on the other hand, may have multiple rows for a given unit, such as repeated observations over time or different measurements for the same person (e.g., height, weight, blood pressure).
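
A tiny sketch of the same made-up data in both shapes (the people and measurements are invented for illustration):

```r
library(dplyr)
library(tidyr)

# wide: one row per person, one column per measurement
wide &lt;- tibble(
    person = c("A", "B"),
    height = c(170, 160),
    weight = c(65, 55)
)

# long: one row per person-measurement pair
long &lt;- wide %&gt;%
    pivot_longer(-person, names_to = "measure", values_to = "value")

dim(wide)  # 2 3
dim(long)  # 4 3
```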
---

The `tidyverse` contains two commands for reshaping data: `pivot_wider()`, which reshapes data from long to wide, and `pivot_longer()`, which reshapes data from wide to long.

Before we introduce any more data cleaning tools, we need to reshape this WDI dataset. You may have noticed we're storing years as columns and variables as rows -- which is not ideal for most analysis, but that's how the data are exported from the World Bank Databank.

We'll actually need both `pivot_longer()` and `pivot_wider()` to get this data into the tidy format we want.

.footnote[
*I highly recommend the IPUMS PMA Data Analysis Hub post on [pivoting](https://tech.popdata.org/pma-data-hub/posts/2021-04-15-migration-discovery/).]
---


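First, we'll `pivot_longer()` to get the years stored in rows. (From here on, `wdi` is the version with the `select()` and `rename()` steps from the Rename slide saved back to it, so its columns are `country`, `series`, and the years.) We're saying to keep `country` and `series` as they are (don't pivot them), and then put each column name into a new column called `year` and the corresponding value into a new column called `value`.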

```r
wdi %&gt;%
    pivot_longer(
        -c(country, series), 
        names_to = "year",
        values_to = "value") %&gt;%
    head(n = 3)
```

```
## # A tibble: 3 × 4
##   country     series                                  year         value
##   &lt;fct&gt;       &lt;chr&gt;                                   &lt;chr&gt;        &lt;dbl&gt;
## 1 Afghanistan Access to electricity (% of population) x2010_yr2010  42.7
## 2 Afghanistan Access to electricity (% of population) x2011_yr2011  43.2
## 3 Afghanistan Access to electricity (% of population) x2012_yr2012  69.1
```
---

But we can do better. We only want the numeric part stored as the year and we don't need it twice. We can use the option `names_pattern` to specify a [regular expression](https://en.wikipedia.org/wiki/Regular_expression) that looks for numbers and we can use the `names_transform` option to convert it to an integer.


```r
wdi_long &lt;- wdi %&gt;% # this time we'll save it as a new object
    pivot_longer(
        -c(country, series), 
        names_to = "year",
        names_pattern = "(\\d+)",
        names_transform = list(year = as.integer),
        values_to = "value") 

wdi_long %&gt;%
    head(n = 3)
```

```
## # A tibble: 3 × 4
##   country     series                                   year value
##   &lt;fct&gt;       &lt;chr&gt;                                   &lt;int&gt; &lt;dbl&gt;
## 1 Afghanistan Access to electricity (% of population)  2010  42.7
## 2 Afghanistan Access to electricity (% of population)  2011  43.2
## 3 Afghanistan Access to electricity (% of population)  2012  69.1
```
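
To get a feel for what the `"(\\d+)"` pattern picks out, here is a rough stand-alone sketch using `stringr::str_extract()`, which returns the first run of digits in each string. This is only an illustration of the pattern, not what `pivot_longer()` calls internally:

```r
library(stringr)

old_names &lt;- c("x2010_yr2010", "x2021_yr2021")

# take the first run of digits from each name, then convert to integer
years &lt;- as.integer(str_extract(old_names, "\\d+"))

years  # 2010 2021
```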
---

Depending on your end goal, you may want to keep the data in this very long format. But for many purposes, you would want each of the `series` stored as a separate variable. So, let's `pivot_wider()` to do that. We have to specify which variable in the long data set the new variable names come from (`names_from`) and which variable contains the values (`values_from`).


```r
wdi_long %&gt;%
    pivot_wider(
        names_from = "series",
        values_from = "value"
    ) %&gt;%
    clean_names() %&gt;%
    head(n = 3)
```

```
## # A tibble: 3 × 13
##   country      year access_to_electricity_per… educational_att… educational_att…
##   &lt;fct&gt;       &lt;int&gt;                      &lt;dbl&gt;            &lt;dbl&gt;            &lt;dbl&gt;
## 1 Afghanistan  2010                       42.7               NA               NA
## 2 Afghanistan  2011                       43.2               NA               NA
## 3 Afghanistan  2012                       69.1               NA               NA
## # … with 8 more variables: poverty_gap_at_1_90_a_day_2011_ppp_percent &lt;dbl&gt;,
## #   incidence_of_malaria_per_1_000_population_at_risk &lt;dbl&gt;,
## #   individuals_using_the_internet_percent_of_population &lt;dbl&gt;,
## #   labor_force_participation_rate_female_percent_of_female_population_ages_15_national_estimate &lt;dbl&gt;,
## #   labor_force_participation_rate_male_percent_of_male_population_ages_15_national_estimate &lt;dbl&gt;,
## #   prevalence_of_undernourishment_percent_of_population &lt;dbl&gt;,
## #   military_expenditure_percent_of_gdp &lt;dbl&gt;, …
```
---

But now we have a new problem: even after `clean_names()`, these variable names are far too long to work with comfortably (and without it, they would contain spaces and symbols). There are a lot of ways you could solve this. I'll use it as a chance to introduce another important `tidyverse` command: `case_when()` and some string operations.

First, I'll create an abbreviated version of the series, and then `pivot_wider()` using this version. To do that, let's look at what series (or variables) are in this data: 

```r
library(tabulator)

wdi_long %&gt;%
    tab(series)
```

```
## # A tibble: 11 × 4
##    series                                                       N  prop cum_prop
##    &lt;chr&gt;                                                    &lt;int&gt; &lt;dbl&gt;    &lt;dbl&gt;
##  1 Access to electricity (% of population)                   2604  0.09     0.09
##  2 Educational attainment, at least completed lower second…  2604  0.09     0.18
##  3 Educational attainment, at least completed lower second…  2604  0.09     0.27
##  4 GDP per capita (constant 2015 US$)                        2604  0.09     0.36
##  5 Incidence of malaria (per 1,000 population at risk)       2604  0.09     0.45
##  6 Individuals using the Internet (% of population)          2604  0.09     0.55
##  7 Labor force participation rate, female (% of female pop…  2604  0.09     0.64
##  8 Labor force participation rate, male (% of male populat…  2604  0.09     0.73
##  9 Military expenditure (% of GDP)                           2604  0.09     0.82
## 10 Poverty gap at $1.90 a day (2011 PPP) (%)                 2604  0.09     0.91
## 11 Prevalence of undernourishment (% of population)          2604  0.09     1
```
---

Based on those long names, I'll use `case_when()` within a `mutate()` to match certain words and then create a new short version.

```r
wdi_long &lt;- wdi_long %&gt;%
    mutate(
        series_short = case_when( # use case_when() for conditional operations
            str_detect(series, "electricity") ~ "electricity",
*           str_detect(series, "(?=.*Education)(?=.*female)") ~ "edu_female",
*           str_detect(series, "(?=.*Education)(?=.*male)") ~ "edu_male",
            str_detect(series, "GDP per capita") ~ "gdppc",
            str_detect(series, "malaria") ~ "malaria",
            str_detect(series, "Internet") ~ "internet",
*           str_detect(series,  "(?=.*Labor)(?=.*female)") ~ "lfp_female",
*           str_detect(series, "(?=.*Labor)(?=.*male)") ~ "lfp_male",
            str_detect(series, "Military") ~ "military",
            str_detect(series, "Poverty") ~ "poverty_gap",
            str_detect(series, "Prevalence") ~ "underourished"
        )
    )
```
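
Note that the order of the highlighted lines matters: `case_when()` uses the first condition that matches, and "female" contains "male". A small sketch on two shortened, made-up series labels (`s` is a hypothetical name):

```r
library(stringr)

s &lt;- c("Labor force participation rate, female",
       "Labor force participation rate, male")

# the "male" pattern also matches the female label...
str_detect(s, "(?=.*Labor)(?=.*male)")    # TRUE TRUE

# ...so the more specific "female" pattern has to be checked first
str_detect(s, "(?=.*Labor)(?=.*female)")  # TRUE FALSE
```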
???
These highlighted lines are a little more complicated - we don't have time to go into all the details of regular expressions, but to briefly explain: the main command, str_detect, looks at the original variable series and checks whether the word I put there can be detected. For example, it looks for the word "electricity" and wherever it finds a match, the case_when tells R to put "electricity" in the new variable called series_short. The educational attainment and labor force participation cases are more complicated because the data has variables for males and females separately. So we need to use some extra regular expressions to detect if "Education" &amp; "female" are found together. The female line has to come before the male line, because "female" also contains "male" and case_when uses the first condition that matches.
---


These new values of the variable `series_short` will work great as variable names when we pivot.

```
## # A tibble: 11 × 5
##    series                                      series_short     N  prop cum_prop
##    &lt;chr&gt;                                       &lt;chr&gt;        &lt;int&gt; &lt;dbl&gt;    &lt;dbl&gt;
##  1 Access to electricity (% of population)     electricity   2604  0.09     0.09
##  2 Educational attainment, at least completed… edu_female    2604  0.09     0.18
##  3 Educational attainment, at least completed… edu_male      2604  0.09     0.27
##  4 GDP per capita (constant 2015 US$)          gdppc         2604  0.09     0.36
##  5 Incidence of malaria (per 1,000 population… malaria       2604  0.09     0.45
##  6 Individuals using the Internet (% of popul… internet      2604  0.09     0.55
##  7 Labor force participation rate, female (% … lfp_female    2604  0.09     0.64
##  8 Labor force participation rate, male (% of… lfp_male      2604  0.09     0.73
##  9 Military expenditure (% of GDP)             military      2604  0.09     0.82
## 10 Poverty gap at $1.90 a day (2011 PPP) (%)   poverty_gap   2604  0.09     0.91
## 11 Prevalence of undernourishment (% of popul… underourish…  2604  0.09     1
```
---

Now that we have this new variable for the series, we can `pivot_wider()` again (and drop the original `series` variable). Now you can see that each row represents a country in a given year and all the variables (electricity access, poverty gap, etc.) are stored as columns with the corresponding value.

```r
wdi_long %&gt;%
    select(-series) %&gt;%
    pivot_wider(
        names_from = "series_short",
        values_from = "value"
    ) %&gt;%
    head(n = 3)
```

```
## # A tibble: 3 × 13
##   country      year electricity edu_female edu_male poverty_gap malaria internet
##   &lt;fct&gt;       &lt;int&gt;       &lt;dbl&gt;      &lt;dbl&gt;    &lt;dbl&gt;       &lt;dbl&gt;   &lt;dbl&gt;    &lt;dbl&gt;
## 1 Afghanistan  2010        42.7         NA       NA          NA   13.0      4   
## 2 Afghanistan  2011        43.2         NA       NA          NA   15.6      5   
## 3 Afghanistan  2012        69.1         NA       NA          NA    9.19     5.45
## # … with 5 more variables: lfp_female &lt;dbl&gt;, lfp_male &lt;dbl&gt;,
## #   underourished &lt;dbl&gt;, military &lt;dbl&gt;, gdppc &lt;dbl&gt;
```


---
layout: true
# Grouping &amp; Summarizing Data
---

You will often want to calculate statistics by groups of variables. For example, in this WDI data we might want to calculate the average poverty gap in every country (and for each year). For operations like this, `group_by()` is a powerful tool!

The `group_by()` function groups a dataset by one or more variables. On its own, it does not appear to change the dataset very much. The difference between the two outputs below is subtle:


An ungrouped data frame:

```r
head(wdi_df, 3)
```

```
## # A tibble: 3 × 13
##   country      year electricity edu_female edu_male poverty_gap malaria internet
##   &lt;fct&gt;       &lt;int&gt;       &lt;dbl&gt;      &lt;dbl&gt;    &lt;dbl&gt;       &lt;dbl&gt;   &lt;dbl&gt;    &lt;dbl&gt;
## 1 Afghanistan  2010        42.7         NA       NA          NA   13.0      4   
## 2 Afghanistan  2011        43.2         NA       NA          NA   15.6      5   
## 3 Afghanistan  2012        69.1         NA       NA          NA    9.19     5.45
## # … with 5 more variables: lfp_female &lt;dbl&gt;, lfp_male &lt;dbl&gt;,
## #   underourished &lt;dbl&gt;, military &lt;dbl&gt;, gdppc &lt;dbl&gt;
```
---

A grouped data frame:

```r
wdi_df %&gt;%
    group_by(country) %&gt;%
    head(3)
```

```
## # A tibble: 3 × 13
## # Groups:   country [1]
##   country      year electricity edu_female edu_male poverty_gap malaria internet
##   &lt;fct&gt;       &lt;int&gt;       &lt;dbl&gt;      &lt;dbl&gt;    &lt;dbl&gt;       &lt;dbl&gt;   &lt;dbl&gt;    &lt;dbl&gt;
## 1 Afghanistan  2010        42.7         NA       NA          NA   13.0      4   
## 2 Afghanistan  2011        43.2         NA       NA          NA   15.6      5   
## 3 Afghanistan  2012        69.1         NA       NA          NA    9.19     5.45
## # … with 5 more variables: lfp_female &lt;dbl&gt;, lfp_male &lt;dbl&gt;,
## #   underourished &lt;dbl&gt;, military &lt;dbl&gt;, gdppc &lt;dbl&gt;
```

--

To get summary statistics, you'll need to combine `group_by()` with a call to `summarise()`.
---

Let’s start with simple counting. How many observations do we have per country? (Hint: you can accomplish the same using `dplyr::tally()`). You can see we have 12 observations per country -- that's because we have a separate row for every year and there are 12 years of data (2010 - 2021).


```r
wdi_df %&gt;%
  group_by(country) %&gt;%
  summarise(n = n())
```

```
## # A tibble: 217 × 2
##    country                 n
##    &lt;fct&gt;               &lt;int&gt;
##  1 Afghanistan            12
##  2 Albania                12
##  3 Algeria                12
##  4 American Samoa         12
##  5 Andorra                12
##  6 Angola                 12
##  7 Antigua and Barbuda    12
##  8 Argentina              12
##  9 Armenia                12
## 10 Aruba                  12
## # … with 207 more rows
```
???
R will accept both the British and American spellings.
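---

The hint about `tally()` can be sketched on a small made-up data frame (`toy` and its values are hypothetical). `count()` is shorthand for `group_by()` followed by `tally()`, so all three approaches below should give the same counts.

```r
library(dplyr)
library(tibble)

# Hypothetical data: 3 rows for country A, 2 for country B
toy &lt;- tibble(
    country = c("A", "A", "A", "B", "B"),
    year    = c(2010, 2011, 2012, 2010, 2011)
)

# Three equivalent ways to count rows per country
n_summarise &lt;- toy %&gt;% group_by(country) %&gt;% summarise(n = n())
n_tally     &lt;- toy %&gt;% group_by(country) %&gt;% tally()
n_count     &lt;- toy %&gt;% count(country)

n_count
```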
---

What if we wanted to calculate the mean GDP per capita for each country over the 12 years of data? We can simply add more operations within the call to `summarise()`.


```r
wdi_df %&gt;%
  group_by(country) %&gt;%
  summarise(
      n = n(),
      gdp_avg = mean(gdppc))
```

```
## # A tibble: 217 × 3
##    country                 n gdp_avg
##    &lt;fct&gt;               &lt;int&gt;   &lt;dbl&gt;
##  1 Afghanistan            12      NA
##  2 Albania                12      NA
##  3 Algeria                12      NA
##  4 American Samoa         12      NA
##  5 Andorra                12      NA
##  6 Angola                 12      NA
##  7 Antigua and Barbuda    12      NA
##  8 Argentina              12      NA
##  9 Armenia                12      NA
## 10 Aruba                  12      NA
## # … with 207 more rows
```
???
Why is everything missing? We'll see how to identify missing data shortly.
---

When calculating many different statistics, you need to be aware of missing data and decide how to handle them. R can't calculate a mean when some of the values are missing unless you tell R to ignore them.


```r
wdi_df %&gt;%
  group_by(country) %&gt;%
  summarise(
      n = n(),
      gdp_avg = mean(gdppc, na.rm = TRUE)) 
```

```
## # A tibble: 217 × 3
##    country                 n gdp_avg
##    &lt;fct&gt;               &lt;int&gt;   &lt;dbl&gt;
##  1 Afghanistan            12    548.
##  2 Albania                12   4026.
##  3 Algeria                12   4067.
##  4 American Samoa         12  11745.
##  5 Andorra                12  35802.
##  6 Angola                 12   3885.
##  7 Antigua and Barbuda    12  14495.
##  8 Argentina              12  13381.
##  9 Armenia                12   3616.
## 10 Aruba                  12  27411.
## # … with 207 more rows
```
---

The functions you’ll apply within `summarize()` include classical statistical summaries, like `mean()`, `median()`, `var()`, `sd()`, `IQR()`, `min()`, and `max()`. `dplyr` also includes shortcuts for working across multiple variables. For example, you can use `summarise_at()` to apply the same summary function(s) to multiple variables. Let’s compute the mean and median of GDP per capita and % electricity access by country.


```r
wdi_df %&gt;%
  group_by(country) %&gt;%
  summarise_at(
      vars(gdppc, electricity),
      list(~mean(., na.rm = TRUE), ~median(., na.rm = TRUE))) 
```

```
## # A tibble: 217 × 5
##    country             gdppc_mean electricity_mean gdppc_median electricity_med…
##    &lt;fct&gt;                    &lt;dbl&gt;            &lt;dbl&gt;        &lt;dbl&gt;            &lt;dbl&gt;
##  1 Afghanistan               548.             79.2         553.             89.5
##  2 Albania                  4026.            100.         3953.            100  
##  3 Algeria                  4067.             99.3        4112.             99.2
##  4 American Samoa          11745.            NaN         11812.             NA  
##  5 Andorra                 35802.            100         34957.            100  
##  6 Angola                   3885.             40.1        3980.             41.8
##  7 Antigua and Barbuda     14495.             99.4       13987.            100  
##  8 Argentina               13381.             99.6       13568.             99.8
##  9 Armenia                  3616.             99.8        3601.             99.8
## 10 Aruba                   27411.             99.4       27051.            100  
## # … with 207 more rows
```
???
Here I specified 2 variables by name, but you can also use the handy selectors in the tidyverse, such as "starts_with" and "ends_with", among others, to efficiently select variables.
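---

Since `dplyr` 1.0.0, the scoped verbs such as `summarise_at()` are superseded by `across()`, which works inside a regular `summarise()`. Here is a minimal sketch of the same idea on made-up data (`toy` and its values are hypothetical, and the `.names` pattern is just one naming choice):

```r
library(dplyr)
library(tibble)

# Hypothetical data with some missing values
toy &lt;- tibble(
    country     = c("A", "A", "B", "B"),
    gdppc       = c(100, 300, 50, NA),
    electricity = c(90, 100, NA, 40)
)

# Apply mean and median to both variables within each country
toy_summary &lt;- toy %&gt;%
    group_by(country) %&gt;%
    summarise(
        across(
            c(gdppc, electricity),
            list(mean   = ~mean(.x, na.rm = TRUE),
                 median = ~median(.x, na.rm = TRUE)),
            .names = "{.col}_{.fn}"
        )
    )

toy_summary
```

Selectors such as `starts_with()` can be used in place of `c(gdppc, electricity)`.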
---

But note how the data frame that was returned has fewer rows: 217, or one for each country. Using `summarise()` changes the structure of your data. If you wanted to calculate the mean GDP per capita for each country and add it as a variable to the original data frame with yearly observations for each country, you would use `mutate()` instead.

```r
wdi_df %&gt;%
  group_by(country) %&gt;%
  mutate(
      n = n(),
      gdp_avg = mean(gdppc, na.rm = TRUE)) %&gt;%
    select(country, year, gdppc, n, gdp_avg)
```

```
## # A tibble: 2,604 × 5
## # Groups:   country [217]
##    country      year gdppc     n gdp_avg
##    &lt;fct&gt;       &lt;int&gt; &lt;dbl&gt; &lt;int&gt;   &lt;dbl&gt;
##  1 Afghanistan  2010  526.    12    548.
##  2 Afghanistan  2011  512.    12    548.
##  3 Afghanistan  2012  558.    12    548.
##  4 Afghanistan  2013  569.    12    548.
##  5 Afghanistan  2014  565.    12    548.
##  6 Afghanistan  2015  556.    12    548.
##  7 Afghanistan  2016  553.    12    548.
##  8 Afghanistan  2017  553.    12    548.
##  9 Afghanistan  2018  547.    12    548.
## 10 Afghanistan  2019  555.    12    548.
## # … with 2,594 more rows
```
???
Now notice how the same value of mean GDP we calculated shows up for every observation of Afghanistan. This is the key distinction between summarise and mutate.

---
layout: true
# Identifying duplicates &amp; missing data
---

Identifying duplicates and missing values are two of the most important tasks you'll do when cleaning data. Let's start with duplicates. There are a few different commands you can use to identify duplicates or identify distinct observations.

To see how this works, I created some data with duplicates:


```r
df
```

```
## # A tibble: 4 × 3
##   x         y     z
##   &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1 a         2     5
## 2 a         2     5
## 3 a         0     5
## 4 b         3     9
```
---

`dplyr` contains a command called `distinct()`, which returns rows that contain unique data across all the variables. In this case that leaves us with only one row where `x = a, y = 2, z = 5`. 

The other row where `x = a` is kept because the other variables have different values -- so they're not perfect duplicates.


```r
df %&gt;%
    distinct()
```

```
## # A tibble: 3 × 3
##   x         y     z
##   &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1 a         2     5
## 2 a         0     5
## 3 b         3     9
```
---

But before blindly removing duplicates, you might want to see where they are. Or maybe you want to find all duplicates of a single variable. You can use `janitor::get_dupes()` for this:

```r
df %&gt;%
    get_dupes() 
```

```
## # A tibble: 2 × 4
##   x         y     z dupe_count
##   &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;      &lt;int&gt;
## 1 a         2     5          2
## 2 a         2     5          2
```

```r
df %&gt;%
    get_dupes(x)
```

```
## # A tibble: 3 × 4
##   x     dupe_count     y     z
##   &lt;chr&gt;      &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1 a              3     2     5
## 2 a              3     2     5
## 3 a              3     0     5
```
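---

The slides do not show how `df` was built, so here is one way to recreate it from the printed output (the construction itself is an assumption). To actually keep one row per value of a single variable, rather than just locating the duplicates, `distinct()` accepts column names along with `.keep_all = TRUE`:

```r
library(dplyr)
library(tibble)

# Recreate the example data with duplicates
df &lt;- tribble(
    ~x,  ~y, ~z,
    "a",  2,  5,
    "a",  2,  5,
    "a",  0,  5,
    "b",  3,  9
)

# Keep the first row for each value of x, retaining all other columns
df_first &lt;- df %&gt;%
    distinct(x, .keep_all = TRUE)

df_first
```

Which row survives depends on the row order, so sort the data first if that matters.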

---

Or you could check for duplicates across values of multiple variables:


```r
df %&gt;%
    group_by(x, y) %&gt;% # you can group by as many variables as you want
    add_tally() # gives you how many observations per value of x &amp; y
```

```
## # A tibble: 4 × 4
## # Groups:   x, y [3]
##   x         y     z     n
##   &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;
## 1 a         2     5     2
## 2 a         2     5     2
## 3 a         0     5     1
## 4 b         3     9     1
```

---

Now, let's see how to check for missing values. First, it’s important to note the difference between “NA” and “NaN”. You can use the help function to take a closer look at both values: `?NA` and `?NaN`. Briefly, “NA” or “Not Available” is used for missing values, while “NaN” or “Not a Number” is the result of an undefined numeric calculation, such as 0/0.


To see how this works, I created some data with missing values:


```r
df
```

```
## # A tibble: 5 × 3
##   x         y     z
##   &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1 a         2    NA
## 2 a         2     5
## 3 a       NaN     5
## 4 b         3     9
## 5 c        NA     8
```
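---

A minimal sketch of the distinction, using only base R. Note that `is.na()` is `TRUE` for both `NA` and `NaN`, while `is.nan()` is `TRUE` only for `NaN`:

```r
# An undefined numeric calculation yields NaN
undefined &lt;- 0 / 0

vals &lt;- c(1, NA, undefined)

na_flags  &lt;- is.na(vals)   # flags both NA and NaN
nan_flags &lt;- is.nan(vals)  # flags only NaN

na_flags
nan_flags
```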
---

Now let's see how to use `dplyr` functions to identify and filter out missing data along with the base R command `is.na()`. The below code gives us the count of missing (NA or NaN) observations by each variable using the `summarise_all()` function.


```r
df %&gt;%
  summarise_all(list(~sum(is.na(.))))
```

```
## # A tibble: 1 × 3
##       x     y     z
##   &lt;int&gt; &lt;int&gt; &lt;int&gt;
## 1     0     2     1
```
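
---

In current versions of `dplyr`, `summarise_all()` is superseded as well. A sketch of the equivalent call with `across(everything(), ...)`, on a recreated copy of the example data (the construction of `df` is an assumption based on the printed output):

```r
library(dplyr)
library(tibble)

# Recreate the example data with missing values
df &lt;- tibble(
    x = c("a", "a", "a", "b", "c"),
    y = c(2, 2, NaN, 3, NA),
    z = c(NA, 5, 5, 9, 8)
)

# Count missing values (NA or NaN) in every column
na_counts &lt;- df %&gt;%
    summarise(across(everything(), ~sum(is.na(.x))))

na_counts
```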

---

Or if you wanted to just see which rows of data had missing values, you could use:

```r
df %&gt;% 
  filter(if_any(everything(), is.na))
```

```
## # A tibble: 3 × 3
##   x         y     z
##   &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1 a         2    NA
## 2 a       NaN     5
## 3 c        NA     8
```
---

Finally, if you wanted to keep only rows with complete data, you can use `tidyr::drop_na()`.
.pull-left[

```r
# original data
df
```

```
## # A tibble: 5 × 3
##   x         y     z
##   &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1 a         2    NA
## 2 a         2     5
## 3 a       NaN     5
## 4 b         3     9
## 5 c        NA     8
```
]

.pull-right[

```r
# after dropping all NA
df %&gt;% drop_na()
```

```
## # A tibble: 2 × 3
##   x         y     z
##   &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1 a         2     5
## 2 b         3     9
```
]

---

And to drop rows only when one particular variable is missing, you can use `filter()`:

```r
df %&gt;% 
    filter(!is.na(z)) # use ! to negate
```

```
## # A tibble: 4 × 3
##   x         y     z
##   &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1 a         2     5
## 2 a       NaN     5
## 3 b         3     9
## 4 c        NA     8
```
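---

`drop_na()` also accepts column names, which gives the same result as filtering on `!is.na(z)`. A sketch on a recreated copy of the example data (the construction of `df` is an assumption based on the printed output):

```r
library(dplyr)
library(tidyr)
library(tibble)

# Recreate the example data with missing values
df &lt;- tibble(
    x = c("a", "a", "a", "b", "c"),
    y = c(2, 2, NaN, 3, NA),
    z = c(NA, 5, 5, 9, 8)
)

# Drop rows only where z is missing; missing values in y are kept
by_filter  &lt;- df %&gt;% filter(!is.na(z))
by_drop_na &lt;- df %&gt;% drop_na(z)

by_drop_na
```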
---
layout: true
# Combining Data
---

Data often arrives in many pieces and it's common to combine data from many different sources. To get all of the data into a single data frame, there are two common tasks:

### Binds
A bind essentially "smashes" data frames together. When done in a row-wise way, it's called a row bind (you may have seen this called "appending" elsewhere). You can also do this in a column-wise way, which unsurprisingly is called a column bind.

With binds you have to be really careful. For example, when row binding, do the same variables exist in each? Are they of the same type? Different approaches for row binding have different combinations of flexibility vs rigidity around these matters. When column binding, the onus is entirely on the analyst to make sure that the rows are aligned. I would avoid column binding whenever possible.

.footnote[
*This section leans heavily on the tutorials in [Stat545](https://stat545.com/multiple-tibbles.html)
]
---

### Joins
This is often referred to as a *merge*. When joining data, you designate a variable as a "key" and combine data frames by matching observations via the key. This is a much safer way of combining data. `dplyr` contains a multitude of join operations: `left_join()`, `right_join()`, `anti_join()`, `inner_join()`, and `full_join()`.


We'll briefly run through examples of binds and joins.
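
---

To see why column binding is riskier than joining, consider this sketch with two made-up data frames (`scores` and `ages` are hypothetical) whose rows are in different orders. `bind_cols()` pairs rows purely by position, so it silently attaches the wrong ages, while a join on `name` does not:

```r
library(dplyr)
library(tibble)

scores &lt;- tibble(name = c("Ann", "Bob", "Cy"), score = c(90, 80, 70))
ages   &lt;- tibble(name = c("Cy", "Ann", "Bob"), age   = c(30, 40, 50))

# Column bind: rows are matched by position, not by name
bound &lt;- bind_cols(scores, select(ages, age))

# Join: rows are matched on the key
joined &lt;- left_join(scores, ages, by = "name")

bound
joined
```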

---

Here is some data on word counts from the Lord of the Rings trilogy:


```r
fship
```

```
## # A tibble: 3 × 4
##   Film                       Race   Female  Male
##   &lt;chr&gt;                      &lt;chr&gt;   &lt;dbl&gt; &lt;dbl&gt;
## 1 The Fellowship Of The Ring Elf      1229   971
## 2 The Fellowship Of The Ring Hobbit     14  3644
## 3 The Fellowship Of The Ring Man         0  1995
```

```r
rking
```

```
## # A tibble: 3 × 4
##   Film                   Race   Female  Male
##   &lt;chr&gt;                  &lt;chr&gt;   &lt;dbl&gt; &lt;dbl&gt;
## 1 The Return Of The King Elf       183   510
## 2 The Return Of The King Hobbit      2  2673
## 3 The Return Of The King Man       268  2459
```

---

And here are the word counts for the remaining film:


```r
ttow
```

```
## # A tibble: 3 × 4
##   Film           Race   Female  Male
##   &lt;chr&gt;          &lt;chr&gt;   &lt;dbl&gt; &lt;dbl&gt;
## 1 The Two Towers Elf       331   513
## 2 The Two Towers Hobbit      0  2463
## 3 The Two Towers Man       401  3589
```

---

Because these data all have the same variable names, they are well-suited to binding. For a row bind (stacking them on top of each other), we'll use `bind_rows()`.  Column binding works in a similar way, except you use the aptly named `bind_cols()` command from `dplyr` (or `cbind()` in base R). 
.pull-left[

```r
lotr &lt;- bind_rows(
    fship, ttow, rking
)
```
]

.pull-right[

```r
lotr
```

```
## # A tibble: 9 × 4
##   Film                       Race   Female  Male
##   &lt;chr&gt;                      &lt;chr&gt;   &lt;dbl&gt; &lt;dbl&gt;
## 1 The Fellowship Of The Ring Elf      1229   971
## 2 The Fellowship Of The Ring Hobbit     14  3644
## 3 The Fellowship Of The Ring Man         0  1995
## 4 The Two Towers             Elf       331   513
## 5 The Two Towers             Hobbit      0  2463
## 6 The Two Towers             Man       401  3589
## 7 The Return Of The King     Elf       183   510
## 8 The Return Of The King     Hobbit      2  2673
## 9 The Return Of The King     Man       268  2459
```
]
---

But what if one of the data frames is somehow missing a variable? Base R also has a command for row binding, `rbind()`, but it doesn't handle missing columns well. Let's drop a variable from one of the data frames and compare `rbind()` with `bind_rows()`.
.pull-left[

```r
ttow_no_Female &lt;- ttow %&gt;% 
    mutate(Female = NULL)

ttow_no_Female
```

```
## # A tibble: 3 × 3
##   Film           Race    Male
##   &lt;chr&gt;          &lt;chr&gt;  &lt;dbl&gt;
## 1 The Two Towers Elf      513
## 2 The Two Towers Hobbit  2463
## 3 The Two Towers Man     3589
```
]

.pull-right[

```r
rbind(fship, ttow_no_Female, rking)
```

```
## Error in rbind(deparse.level, ...): numbers of columns of arguments do not match
```

```r
bind_rows(fship, ttow_no_Female, rking)
```

```
## # A tibble: 9 × 4
##   Film                       Race   Female  Male
##   &lt;chr&gt;                      &lt;chr&gt;   &lt;dbl&gt; &lt;dbl&gt;
## 1 The Fellowship Of The Ring Elf      1229   971
## 2 The Fellowship Of The Ring Hobbit     14  3644
## 3 The Fellowship Of The Ring Man         0  1995
## 4 The Two Towers             Elf        NA   513
## 5 The Two Towers             Hobbit     NA  2463
## 6 The Two Towers             Man        NA  3589
## 7 The Return Of The King     Elf       183   510
## 8 The Return Of The King     Hobbit      2  2673
## 9 The Return Of The King     Man       268  2459
```
]

---

Joins are a much more reliable way of combining data. To see how they work, we'll return to the WDI data, with different variables for different countries.


```
## # A tibble: 2 × 13
##   country      year electricity edu_female edu_male poverty_gap malaria internet
##   &lt;fct&gt;       &lt;int&gt;       &lt;dbl&gt;      &lt;dbl&gt;    &lt;dbl&gt;       &lt;dbl&gt;   &lt;dbl&gt;    &lt;dbl&gt;
## 1 Afghanistan  2010        42.7         NA       NA          NA    13.0        4
## 2 Afghanistan  2011        43.2         NA       NA          NA    15.6        5
## # … with 5 more variables: lfp_female &lt;dbl&gt;, lfp_male &lt;dbl&gt;,
## #   underourished &lt;dbl&gt;, military &lt;dbl&gt;, gdppc &lt;dbl&gt;
```

--

What if we had another dataset with other information on countries that we wanted to merge?


```
## # A tibble: 3 × 3
##   country      year population
##   &lt;chr&gt;       &lt;int&gt;      &lt;dbl&gt;
## 1 Afghanistan  1997   19357126
## 2 Afghanistan  1998   19737770
## 3 Afghanistan  1999   20170847
```

---

A `left_join(x, y)` requires you to specify two data frames (`x` and `y`) and will keep all the rows that are in `x`, adding the columns from any matching rows in `y`. But if certain rows of `y` have no match in `x`, they will not be returned.

--

For example, this new data frame dates back to 1997, whereas the WDI data begins in 2010, so all of those earlier years will not be included if the WDI df is specified as the `x` df in a `left_join()`.

--

Because it's important to keep track of what does and does not merge, I highly recommend the [`tidylog`](https://cran.r-project.org/web/packages/tidylog/readme/README.html) package, which gives you feedback on all `tidyverse` functions, but is especially useful for joins.

--

Let's look at the syntax first, then the output. In this example `wdi_df` is the `x` df, `country_df` is the `y`, and "country" is the key. This variable **must** be in both datasets and if you have multiple variables to match on, you can specify multiple keys.


```r
join_df &lt;- left_join(wdi_df, country_df, by = "country")
```

???
The choice of x and y is yours and depends on your task. The opposite of a left join is a right join - so you can either change the order of x and y or change the join.
---

Now let's see what `tidylog` shows:


```r
library(tidylog)
join_df &lt;- left_join(wdi_df, country_df, by = "country")
```

```
## left_join: added 3 columns (year.x, year.y, population)
```

```
##            &gt; rows only in x        0
```

```
##            &gt; rows only in y  (   342)
```

```
##            &gt; matched rows     49,476    (includes duplicates)
```

```
##            &gt;                 ========
```

```
##            &gt; rows total       49,476
```

???
Explain all the tidylog output
---

Now, let's look at the output:

```
## # A tibble: 49,476 × 15
##    country   year.x electricity edu_female edu_male poverty_gap malaria internet
##    &lt;chr&gt;      &lt;int&gt;       &lt;dbl&gt;      &lt;dbl&gt;    &lt;dbl&gt;       &lt;dbl&gt;   &lt;dbl&gt;    &lt;dbl&gt;
##  1 Afghanis…   2010        42.7         NA       NA          NA    13.0        4
##  2 Afghanis…   2010        42.7         NA       NA          NA    13.0        4
##  3 Afghanis…   2010        42.7         NA       NA          NA    13.0        4
##  4 Afghanis…   2010        42.7         NA       NA          NA    13.0        4
##  5 Afghanis…   2010        42.7         NA       NA          NA    13.0        4
##  6 Afghanis…   2010        42.7         NA       NA          NA    13.0        4
##  7 Afghanis…   2010        42.7         NA       NA          NA    13.0        4
##  8 Afghanis…   2010        42.7         NA       NA          NA    13.0        4
##  9 Afghanis…   2010        42.7         NA       NA          NA    13.0        4
## 10 Afghanis…   2010        42.7         NA       NA          NA    13.0        4
## # … with 49,466 more rows, and 7 more variables: lfp_female &lt;dbl&gt;,
## #   lfp_male &lt;dbl&gt;, underourished &lt;dbl&gt;, military &lt;dbl&gt;, gdppc &lt;dbl&gt;,
## #   year.y &lt;int&gt;, population &lt;dbl&gt;
```

???
Explain the year.x and year.y and that's why we have all these extra rows.
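---

The row explosion is easier to see on a small made-up example (`left` and `right` are hypothetical). When the key does not uniquely identify rows in `y`, each row of `x` is repeated once for every match:

```r
library(dplyr)
library(tibble)

left  &lt;- tibble(country = c("A", "A"), year = c(2010, 2011), gdppc = c(1, 2))
right &lt;- tibble(country = c("A", "A"), year = c(2010, 2011), pop   = c(10, 11))

# Key = country only: every A row in left matches both A rows in right
by_country &lt;- left_join(left, right, by = "country")

# Key = country + year: each row matches at most once
by_both &lt;- left_join(left, right, by = c("country", "year"))

nrow(by_country)
nrow(by_both)
```

Recent versions of `dplyr` also emit a warning when a join is many-to-many like the first one.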
---

Joining on `country` alone repeats each WDI row once for every matching year in `country_df`. Matching on both `country` and `year` avoids this:


```r
library(tidylog)
join_df &lt;- left_join(wdi_df, country_df, by = c("country", "year"))
```

```
## left_join: added one column (population)
```

```
##            &gt; rows only in x   1,302
```

```
##            &gt; rows only in y  (3,163)
```

```
##            &gt; matched rows     1,302
```

```
##            &gt;                 =======
```

```
##            &gt; rows total       2,604
```
---

Now, let's look at the output:

```
## # A tibble: 2,604 × 14
##    country   year population electricity edu_female edu_male poverty_gap malaria
##    &lt;chr&gt;    &lt;int&gt;      &lt;dbl&gt;       &lt;dbl&gt;      &lt;dbl&gt;    &lt;dbl&gt;       &lt;dbl&gt;   &lt;dbl&gt;
##  1 Afghani…  2010   29185511        42.7         NA       NA          NA   13.0 
##  2 Afghani…  2011   30117411        43.2         NA       NA          NA   15.6 
##  3 Afghani…  2012   31161378        69.1         NA       NA          NA    9.19
##  4 Afghani…  2013   32269592        68.3         NA       NA          NA   11.2 
##  5 Afghani…  2014   33370804        89.5         NA       NA          NA   12.4 
##  6 Afghani…  2015   34413603        71.5         NA       NA          NA   13.9 
##  7 Afghani…  2016         NA        97.7         NA       NA          NA   26.9 
##  8 Afghani…  2017         NA        97.7         NA       NA          NA   28.6 
##  9 Afghani…  2018         NA        96.6         NA       NA          NA   22.1 
## 10 Afghani…  2019         NA        97.7         NA       NA          NA   14.1 
## # … with 2,594 more rows, and 6 more variables: internet &lt;dbl&gt;,
## #   lfp_female &lt;dbl&gt;, lfp_male &lt;dbl&gt;, underourished &lt;dbl&gt;, military &lt;dbl&gt;,
## #   gdppc &lt;dbl&gt;
```

???
Explain how those NAs are there because the population data only contained information up to 2015.
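---

The other join functions differ only in which unmatched rows they keep. A sketch on two made-up data frames (`x`, `y`, and their values are hypothetical):

```r
library(dplyr)
library(tibble)

x &lt;- tibble(country = c("A", "B", "C"), gdppc = c(1, 2, 3))
y &lt;- tibble(country = c("B", "C", "D"), pop   = c(20, 30, 40))

left  &lt;- left_join(x, y, by = "country")   # all rows of x: A, B, C
right &lt;- right_join(x, y, by = "country")  # all rows of y: B, C, D
inner &lt;- inner_join(x, y, by = "country")  # only matches: B, C
full  &lt;- full_join(x, y, by = "country")   # everything: A, B, C, D
anti  &lt;- anti_join(x, y, by = "country")   # rows of x with no match: A

# Number of rows returned by each join
c(left = nrow(left), right = nrow(right), inner = nrow(inner),
  full = nrow(full), anti = nrow(anti))
```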

---
layout:false
class: middle, center, inverse
# Questions?
---

# Wrap-up

Today we covered:

- intro to R
- reading data into R
- R basics
- data wrangling with the `tidyverse`

--

There is also a companion R script that has all the commands I used today: "2022_07_11_day_1_code.R"

--

I'll see you back here tomorrow to get into the "fun" stuff: data visualizations, creating publication ready tables, and the basics of regression analysis.


    </textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
<script>var slideshow = remark.create({
"ratio": "16:9",
"countIncrementalSlides": false,
"highlightStyle": "github",
"highlightLines": true
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
  window.dispatchEvent(new Event('resize'));
});
(function(d) {
  var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
  if (!r) return;
  s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
  d.head.appendChild(s);
})(document);

(function(d) {
  var el = d.getElementsByClassName("remark-slides-area");
  if (!el) return;
  var slide, slides = slideshow.getSlides(), els = el[0].children;
  for (var i = 1; i < slides.length; i++) {
    slide = slides[i];
    if (slide.properties.continued === "true" || slide.properties.count === "false") {
      els[i - 1].className += ' has-continuation';
    }
  }
  var s = d.createElement("style");
  s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
  d.head.appendChild(s);
})(document);
// delete the temporary CSS (for displaying all slides initially) when the user
// starts to view slides
(function() {
  var deleted = false;
  slideshow.on('beforeShowSlide', function(slide) {
    if (deleted) return;
    var sheets = document.styleSheets, node;
    for (var i = 0; i < sheets.length; i++) {
      node = sheets[i].ownerNode;
      if (node.dataset["target"] !== "print-only") continue;
      node.parentNode.removeChild(node);
    }
    deleted = true;
  });
})();
// add `data-at-shortcutkeys` attribute to <body> to resolve conflicts with JAWS
// screen reader (see PR #262)
(function(d) {
  let res = {};
  d.querySelectorAll('.remark-help-content table tr').forEach(tr => {
    const t = tr.querySelector('td:nth-child(2)').innerText;
    tr.querySelectorAll('td:first-child .key').forEach(key => {
      const k = key.innerText;
      if (/^[a-z]$/.test(k)) res[k] = t;  // must be a single letter (key)
    });
  });
  d.body.setAttribute('data-at-shortcutkeys', JSON.stringify(res));
})(document);
(function() {
  "use strict"
  // Replace <script> tags in slides area to make them executable
  var scripts = document.querySelectorAll(
    '.remark-slides-area .remark-slide-container script'
  );
  if (!scripts.length) return;
  for (var i = 0; i < scripts.length; i++) {
    var s = document.createElement('script');
    var code = document.createTextNode(scripts[i].textContent);
    s.appendChild(code);
    var scriptAttrs = scripts[i].attributes;
    for (var j = 0; j < scriptAttrs.length; j++) {
      s.setAttribute(scriptAttrs[j].name, scriptAttrs[j].value);
    }
    scripts[i].parentElement.replaceChild(s, scripts[i]);
  }
})();
(function() {
  var links = document.getElementsByTagName('a');
  for (var i = 0; i < links.length; i++) {
    if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
      links[i].target = '_blank';
    }
  }
})();
// adds .remark-code-has-line-highlighted class to <pre> parent elements
// of code chunks containing highlighted lines with class .remark-code-line-highlighted
(function(d) {
  const hlines = d.querySelectorAll('.remark-code-line-highlighted');
  const preParents = [];
  const findPreParent = function(line, p = 0) {
    if (p > 1) return null; // traverse up no further than grandparent
    const el = line.parentElement;
    return el.tagName === "PRE" ? el : findPreParent(el, ++p);
  };

  for (let line of hlines) {
    let pre = findPreParent(line);
    if (pre && !preParents.includes(pre)) preParents.push(pre);
  }
  preParents.forEach(p => p.classList.add("remark-code-has-line-highlighted"));
})(document);</script>

<script>
slideshow._releaseMath = function(el) {
  var i, text, code, codes = el.getElementsByTagName('code');
  for (i = 0; i < codes.length;) {
    code = codes[i];
    if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
      text = code.textContent;
      if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
          /^\$\$(.|\s)+\$\$$/.test(text) ||
          /^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
        code.outerHTML = code.innerHTML;  // remove <code></code>
        continue;
      }
    }
    i++;
  }
};
slideshow._releaseMath(document);
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
  var script = document.createElement('script');
  script.type = 'text/javascript';
  script.src  = 'https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML';
  if (location.protocol !== 'file:' && /^https?:/.test(script.src))
    script.src  = script.src.replace(/^https?:/, '');
  document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
  </body>
</html>