---
layout: topic
title: Aggregating and analyzing data with dplyr (notes)
author: Data Carpentry contributors
---

```{r, echo=FALSE}
knitr::opts_chunk$set(results='hide', fig.path='img/r-lesson-')
surveys <- read.csv("data/portal_data_joined.csv")
```

## Key idea

Bracket-based selecting can be cumbersome.
The add-on package dplyr simplifies it with a small set of verbs, inspired by SQL:

`select`, `filter`, `mutate`, `group_by`, `summarize`

(also `tally` and `arrange`)


## Install and load the package

```{r eval=FALSE}
install.packages("dplyr")
install.packages("ggplot2")
```

```{r, message=FALSE}
library(dplyr)
```


## Select and filter

Select to grab columns.

```r
selectedcol <- select(surveys, species_id, plot_type, weight)
head(selectedcol)
```


Filter to grab rows.

```{r filter}
surveys2002 <- filter(surveys, year==2002)
head(surveys2002)
```
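
Conditions can be combined in a single `filter()` call. A minimal sketch on a made-up data frame (`toy` and its values are invented for illustration, not taken from the survey data): comma-separated conditions must all hold, and `|` means "or".

```r
library(dplyr)

# Hypothetical toy data, standing in for `surveys`
toy <- data.frame(
  year   = c(2001, 2002, 2002, 2003),
  sex    = c("F", "M", "F", "M"),
  weight = c(40, 7, 52, 4)
)

# Comma-separated conditions are combined with AND
both <- filter(toy, year == 2002, sex == "F")

# Use | for OR
either <- filter(toy, year == 2003 | weight > 50)
```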

## Pipe

Output of one function becomes the input to the next.

```{r}
surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)
```

In RStudio, <kbd>`Ctrl`</kbd> + <kbd>`Shift`</kbd> + <kbd>`M`</kbd> inserts `%>%` (<kbd>`Cmd`</kbd> instead of <kbd>`Ctrl`</kbd> on macOS).

Could assign this to something:

```{r}
surveys_sml <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)
```
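
To see what the pipe saves, here is the same filter-then-select written both ways. This is a sketch on invented data (`toy` is hypothetical); the nested form has to be read from the inside out, while the piped form reads top to bottom.

```r
library(dplyr)

# Hypothetical toy data, standing in for `surveys`
toy <- data.frame(
  species_id = c("NL", "DM", "PF", "PE"),
  sex        = c("M", "F", "F", "M"),
  weight     = c(4, 40, 3, NA)
)

# Nested calls: read from the inside out
nested <- select(filter(toy, weight < 5), species_id, sex, weight)

# Piped: read top to bottom
piped <- toy %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

# Both forms should give the same data frame
identical(nested, piped)
```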


### Challenge

Using pipes, subset the data to include individuals collected before 1995,
and retain the columns `year`, `sex`, and `weight`.


## Mutate

`mutate()` to derive a new column.

```{r}
surveys %>%
  mutate(weight_kg = weight / 1000)
```

To just look at the top:

```{r}
surveys %>%
  mutate(weight_kg = weight / 1000) %>%
  head()
```

Filter out `NA`s:

```{r}
surveys %>%
  filter(!is.na(weight)) %>%
  mutate(weight_kg = weight / 1000) %>%
  head()
```
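
`mutate()` can also create several columns in one call, and a later column can use one created just before it. A small sketch with invented values (`toy` and the name `weight_kg_2` are illustrative only):

```r
library(dplyr)

# Hypothetical toy data with one missing weight
toy <- data.frame(weight = c(40, NA, 150))

toy_kg <- toy %>%
  filter(!is.na(weight)) %>%            # drop missing weights first
  mutate(weight_kg   = weight / 1000,   # grams to kilograms
         weight_kg_2 = weight_kg * 2)   # uses the column defined just above
```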

### Challenge

Create a new data frame from the survey data that meets the following
criteria: it contains only the `species_id` column and a new column,
`hindfoot_sqrt`, holding the square root of the `hindfoot_length` values.
The `hindfoot_sqrt` column must contain no `NA` values, and all of its
values must be less than 3.

Hint: think about how the commands should be ordered.

## Split-apply-combine data analyses (`group_by()` and `summarize()`)

Many analyses fit a split-apply-combine pattern: split the data into
groups, apply some analysis to each group, and then combine the
results.

With dplyr, we use `group_by()` for the splitting and `tally()` or
`summarize()` for the rest.

Count individuals by sex:

```{r}
surveys %>%
  group_by(sex) %>%
  tally()
```
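
dplyr also provides `count()`, which does the grouping and tallying in one step. A sketch on made-up data (`toy` is hypothetical) comparing the two forms:

```r
library(dplyr)

# Hypothetical toy data; the empty string stands for an unrecorded sex
toy <- data.frame(sex = c("F", "M", "F", "", "M", "F"))

# Two steps: group, then tally
long_form <- toy %>%
  group_by(sex) %>%
  tally()

# One step: count() groups and tallies together
short_form <- toy %>%
  count(sex)
```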

Average weight by sex:

```{r}
surveys %>%
  group_by(sex) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))
```
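
To make the split-apply-combine pattern concrete, the sketch below does the same per-sex mean by hand in base R and then with dplyr. The data frame `toy` is invented for illustration.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  sex    = c("F", "F", "M", "M", "M"),
  weight = c(40, 44, 10, NA, 20)
)

# Split: one vector of weights per sex
by_sex <- split(toy$weight, toy$sex)

# Apply and combine: mean of each piece, collected into a named vector
manual <- sapply(by_sex, mean, na.rm = TRUE)

# The dplyr version of the same computation
with_dplyr <- toy %>%
  group_by(sex) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))
```

The dplyr result is a data frame with one row per group, which is easier to filter, sort, or plot than a named vector.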

Can group by multiple columns:

```{r}
surveys %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))
```
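
One detail worth knowing, assuming dplyr 1.0.0 or later: when grouping by several columns, `summarize()` drops only the last grouping level, so the result here is still grouped by `sex`. The sketch below uses made-up data (`toy` is hypothetical) to show this, and how `.groups = "drop"` returns an ungrouped table.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  sex        = c("F", "F", "M", "M"),
  species_id = c("DM", "PF", "DM", "DM"),
  weight     = c(40, 8, 44, 46)
)

# Default: the last grouping variable (species_id) is peeled off
still_grouped <- toy %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))
group_vars(still_grouped)

# .groups = "drop" removes all grouping from the result
ungrouped <- toy %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE), .groups = "drop")
group_vars(ungrouped)
```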

Some groups have no recorded weights; filter out the resulting missing means:

```{r}
surveys %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE)) %>%
  filter(!is.na(mean_weight))
```
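
Why would `mean_weight` be missing when `na.rm = TRUE` is set? If every weight in a group is `NA`, removing them leaves nothing to average, and the mean of zero values is `NaN`, which `is.na()` also flags. A base R sketch with made-up values:

```r
# All weights missing, as in a group where no animal was weighed
all_missing <- c(NA_real_, NA_real_)

# After removing NAs there is nothing left to average
m <- mean(all_missing, na.rm = TRUE)

is.nan(m)  # TRUE
is.na(m)   # TRUE, so filter(!is.na(mean_weight)) drops such groups
```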

Another thing we might do here is sort rows by `mean_weight`, using
`arrange()`.


```{r}
surveys %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE)) %>%
  filter(!is.na(mean_weight)) %>%
  arrange(mean_weight)
```

If you want them sorted from highest to lowest, use `desc()`.

```{r}
surveys %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE)) %>%
  filter(!is.na(mean_weight)) %>%
  arrange(desc(mean_weight))
```
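
`arrange()` accepts several sort keys, and `desc()` can be applied to any of them. A sketch on an invented summary table (`toy` is hypothetical): sort by `sex` first, then heaviest first within each sex.

```r
library(dplyr)

# Hypothetical summary table
toy <- data.frame(
  sex         = c("M", "F", "M", "F"),
  species_id  = c("DM", "DM", "PF", "PF"),
  mean_weight = c(44, 41, 8, 7)
)

# Sort by sex (A to Z), then by descending mean_weight within each sex
sorted <- toy %>%
  arrange(sex, desc(mean_weight))
```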

Also note that you can include multiple summaries.

```{r}
surveys %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE),
            min_weight = min(weight, na.rm = TRUE)) %>%
  filter(!is.na(mean_weight)) %>%
  arrange(desc(mean_weight))
```
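
Summaries need not all be statistics of `weight`. As a sketch on invented data (`toy` and the column name `n_records` are illustrative), dplyr's `n()` counts the rows in the current group and can sit alongside the mean and minimum:

```r
library(dplyr)

# Hypothetical toy data with one missing weight
toy <- data.frame(
  species_id = c("DM", "DM", "DM", "PF", "PF"),
  weight     = c(40, 44, NA, 8, 7)
)

stats <- toy %>%
  group_by(species_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE),
            min_weight  = min(weight, na.rm = TRUE),
            n_records   = n())   # counts rows, including those with NA weight
```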

### Challenge

How many times was each `plot_type` surveyed?


### Challenge

Use `group_by()` and `summarize()` to find the mean, min, and max hindfoot
length for each species.

### Challenge

What was the heaviest animal measured in each year? Return the columns `year`,
`genus`, `species`, and `weight`.


## Data cleaning preparations

In preparation for plotting, let's do a bit of data cleaning:
remove rows with missing `species_id`, `weight`, `hindfoot_length`, or
`sex`.

```{r clean_data_1}
surveys_complete <- surveys %>%
    filter(species_id != "", !is.na(weight)) %>%
    filter(!is.na(hindfoot_length), sex != "")
```
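
`filter()` keeps only rows where the condition is `TRUE`, so rows where it evaluates to `NA` are dropped as well. A sketch with made-up data (`toy` is hypothetical), assuming empty strings mark missing `sex` values as in the chunk above:

```r
library(dplyr)

# Hypothetical toy data: one empty sex, one NA sex, one NA weight
toy <- data.frame(
  sex    = c("F", "", "M", NA),
  weight = c(40, 12, NA, 30),
  stringsAsFactors = FALSE
)

# Only the first row passes both conditions;
# the NA sex row is dropped because NA != "" is NA, not TRUE
complete <- toy %>%
  filter(sex != "", !is.na(weight))
```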

Many species have very few records. Let's remove the species
with fewer than 10 records.

```{r}
# count records per species
species_counts <- surveys_complete %>%
    group_by(species_id) %>%
    tally()

head(species_counts)

# get names of the species with counts >= 10
frequent_species <- species_counts %>%
    filter(n >= 10) %>%
    select(species_id)

# filter out the less-frequent species
surveys_complete <- surveys_complete %>%
    filter(species_id %in% frequent_species$species_id)
```
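
The same result can be reached in one pipeline with a grouped `filter()`, where `n()` is the size of each group. A sketch on invented data (`toy` is hypothetical), with a threshold of 2 instead of 10 to keep the example small:

```r
library(dplyr)

# Hypothetical toy data: DM has 3 records, OT has 2, PF has 1
toy <- data.frame(
  species_id = c("DM", "DM", "DM", "PF", "OT", "OT"),
  weight     = c(40, 44, 42, 8, 24, 22)
)

# One pipeline: keep rows from species with at least 2 records
frequent <- toy %>%
  group_by(species_id) %>%
  filter(n() >= 2) %>%
  ungroup()

# Two-step version, mirroring the approach above
counts <- toy %>% group_by(species_id) %>% tally()
keep <- counts %>% filter(n >= 2)
two_step <- toy %>% filter(species_id %in% keep$species_id)
```

The grouped version avoids the intermediate tables; the two-step version leaves the counts available for inspection.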

We might save this to a file:

```{r save_reduced_data_to_file, eval=FALSE}
write.csv(surveys_complete, "CleanData/portal_data_reduced.csv")
```
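
By default `write.csv()` also writes the row names as an unnamed first column, which comes back as an extra column (`X`) when the file is read again. A sketch using a temporary file (the data and path are made up for illustration) shows how `row.names = FALSE` avoids that:

```r
# Hypothetical toy data
toy <- data.frame(species_id = c("DM", "PF"), weight = c(40, 8))

# Temporary file, so nothing is left behind in the project
path <- tempfile(fileext = ".csv")

# row.names = FALSE keeps the row numbers out of the file
write.csv(toy, path, row.names = FALSE)

# Read it back: only the original columns are present
back <- read.csv(path)
names(back)
```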


<br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>