Commit

Merge pull request #535 from jessesadler/rowsums
Replace use of rowsums in tidyr episode
juanfung authored Nov 6, 2024
2 parents face4e2 + 12957e7 commit 216d1df
Showing 4 changed files with 265 additions and 224 deletions.
165 changes: 101 additions & 64 deletions episodes/04-tidyr.Rmd
@@ -204,7 +204,7 @@ separate_longer_delim(items_owned, delim = ";") %>%

After this transformation, you may notice that the `items_owned` column contains
`NA` values. This is because some of the respondents did not own any of the items
that was in the interviewer's list. We can use the `replace_na()` function to
in the interviewer's list. We can use the `replace_na()` function to
change these `NA` values to something more meaningful. The `replace_na()` function
expects you to give it a `list()` of the columns in which you would like to replace
the `NA` values, paired with the value that should replace the `NA`s. This
@@ -218,14 +218,39 @@ Next, we create a new variable named `items_owned_logical`, which has one value
(`TRUE`) for every row. This makes sense, since each item in every row was owned
by that household. We are constructing this variable so that when we spread the
`items_owned` across multiple columns, we can fill the values of those columns
with logical values describing whether the household did (`TRUE`) or didn't
with logical values describing whether the household did (`TRUE`) or did not
(`FALSE`) own that particular item.

```{r, eval=FALSE}
mutate(items_owned_logical = TRUE) %>%
```
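
To see what the separate, replace, and mutate steps do on a small scale, here is
a minimal sketch on a made-up tibble. The `toy` data, its values, and the
`toy_long` name are assumptions for illustration only and are not part of the
`interviews` data. It also assumes a version of **`tidyr`** recent enough to
provide `separate_longer_delim()`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not from the lesson): household 3 listed no items
toy <- tibble(
  key_ID = c(1, 2, 3),
  items_owned = c("bicycle;television", "radio", NA)
)

toy_long <- toy %>%
  # one row per item; the NA household keeps a single NA row
  separate_longer_delim(items_owned, delim = ";") %>%
  # give the NA a meaningful label
  replace_na(list(items_owned = "no_listed_items")) %>%
  # every remaining row describes something that was reported, so mark it TRUE
  mutate(items_owned_logical = TRUE)

toy_long
```

Household 1 now occupies two rows, one per item, while household 3 keeps a
single row labelled `no_listed_items`.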

![](fig/separate_longer.png){alt="Two tables shown side-by-side. The first row of the left table is highlighted in blue, and the first four rows of the right table are also highlighted in blue to show how each of the values of 'items owned' are given their own row with the separate longer delim function. The 'items owned logical' column is highlighted in yellow on the right table to show how the mutate function adds a new column."}
![](fig/separate_longer.png){alt="Two tables shown side-by-side. The first row
of the left table is highlighted in blue, and the first four rows of the right
table are also highlighted in blue to show how each of the values of 'items
owned' are given their own row with the separate longer delim function. The
'items owned logical' column is highlighted in yellow on the right table to show
how the mutate function adds a new column."}

At this point, we can also count the number of items owned by each household,
which is equivalent to the number of rows per `key_ID`. We can do this with a
`group_by()` and `mutate()` pipeline that works similarly to the `group_by()` and
`summarize()` pipeline discussed in the previous episode. Instead of creating a
summary table, it adds another column called `number_items`. We use the
`n()` function to count the number of rows within each group. However, there is
one difficulty we need to take into account, namely those households that did
not list any items. These households now have `"no_listed_items"` under
`items_owned`. We do not want to count this as an item but instead show zero
items. We can accomplish this using **`dplyr`'s** `if_else()` function, which
evaluates a condition and returns one value if it is true and another if it is
false. Here, if the `items_owned` column is `"no_listed_items"`, a 0 is returned;
otherwise, the number of rows per group is returned using `n()`.

```{r, eval=FALSE}
group_by(key_ID) %>%
mutate(number_items = if_else(items_owned == "no_listed_items", 0, n())) %>%
```
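
The grouped `mutate()` can be tried in isolation on a small, made-up long-format
tibble. This is only a sketch: the `toy_long` data and the `toy_counts` name are
hypothetical, the trailing `ungroup()` is our addition, and mixing the double `0`
with the integer returned by `n()` inside `if_else()` assumes a recent
**`dplyr`** (1.1.0 or later).

```r
library(dplyr)

# Hypothetical long-format data: one row per household and item
toy_long <- tibble(
  key_ID = c(1, 1, 2, 3),
  items_owned = c("bicycle", "television", "radio", "no_listed_items")
)

toy_counts <- toy_long %>%
  group_by(key_ID) %>%
  # n() is the number of rows in the current key_ID group;
  # households with the placeholder value get 0 instead
  mutate(number_items = if_else(items_owned == "no_listed_items", 0, n())) %>%
  ungroup()  # added here so later steps are not grouped

toy_counts
```

Because `mutate()` keeps every row, the count is repeated on each row of a
household rather than collapsed into one row as `summarize()` would do.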

Lastly, we use `pivot_wider()` to switch from long format to wide format. This
creates a new column for each of the unique values in the `items_owned` column,
@@ -240,30 +265,38 @@ pivot_wider(names_from = items_owned,
```

![](fig/pivot_wider.png){alt="Two tables shown side-by-side. The 'items owned' column is highlighted in blue on the left table, and the column names are highlighted in blue on the right table to show how the values of the 'items owned' become the column names in the output of the pivot wider function. The 'items owned logical' column is highlighted in yellow on the left table, and the values of the bicycle, television, and solar panel columns are highlighted in yellow on the right table to show how the values of the 'items owned logical' column became the values of all three of the aforementioned columns."}
![](fig/pivot_wider.png){alt="Two tables shown side-by-side. The 'items owned'
column is highlighted in blue on the left table, and the column names are
highlighted in blue on the right table to show how the values of the 'items
owned' become the column names in the output of the pivot wider function. The
'items owned logical' column is highlighted in yellow on the left table, and the
values of the bicycle, television, and solar panel columns are highlighted in
yellow on the right table to show how the values of the 'items owned logical'
column became the values of all three of the aforementioned columns."}

Combining the above steps, the chunk looks like this:
Combining the above steps, the chunk looks like this. Note that two new columns
are created within the same `mutate()` call.

```{r}
interviews_items_owned <- interviews %>%
separate_longer_delim(items_owned, delim = ";") %>%
replace_na(list(items_owned = "no_listed_items")) %>%
mutate(items_owned_logical = TRUE) %>%
group_by(key_ID) %>%
mutate(items_owned_logical = TRUE,
number_items = if_else(items_owned == "no_listed_items", 0, n())) %>%
pivot_wider(names_from = items_owned,
values_from = items_owned_logical,
values_fill = list(items_owned_logical = FALSE))
```

View the `interviews_items_owned` data frame. It should have
`r nrow(interviews)` rows (the same number of rows you had originally), but
extra columns for each item. How many columns were added?
Notice that there is no longer a
column titled `items_owned`. This is because there is a default
View the `interviews_items_owned` data frame. It should have `r
nrow(interviews)` rows (the same number of rows you had originally), but extra
columns for each item. How many columns were added? Notice that there is no
longer a column titled `items_owned`. This is because there is a default
parameter in `pivot_wider()` that drops the original column. The values that
were in that column have now become columns named `television`, `solar_panel`,
`table`, etc. You can use `dim(interviews)` and
`dim(interviews_wide)` to see how the number of columns has changed between
the two datasets.
`table`, etc. You can use `dim(interviews)` and `dim(interviews_items_owned)` to see
how the number of columns has changed between the two datasets.
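
The effect on the dimensions can be checked on a small example. The sketch below
runs the same sequence of steps on a made-up tibble; `toy`, `toy_wide`, and the
values in them are assumptions for illustration, and the `ungroup()` before
`pivot_wider()` is our addition rather than part of the lesson's pipeline.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: three households, one of which listed no items
toy <- tibble(
  key_ID = c(1, 2, 3),
  village = c("God", "God", "Chirodzo"),
  items_owned = c("bicycle;television", "television", NA)
)

toy_wide <- toy %>%
  separate_longer_delim(items_owned, delim = ";") %>%
  replace_na(list(items_owned = "no_listed_items")) %>%
  group_by(key_ID) %>%
  mutate(items_owned_logical = TRUE,
         number_items = if_else(items_owned == "no_listed_items", 0, n())) %>%
  ungroup() %>%
  # each distinct value of items_owned becomes its own logical column
  pivot_wider(names_from = items_owned,
              values_from = items_owned_logical,
              values_fill = list(items_owned_logical = FALSE))

dim(toy)       # rows and columns before reshaping
dim(toy_wide)  # same number of rows, more columns
names(toy_wide)
```

The number of rows is unchanged, `items_owned` is gone, and one logical column
per distinct item (plus `number_items`) has been added.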

This format of the data allows us to do interesting things, like make a table
showing the number of respondents in each village who owned a particular item:
@@ -276,21 +309,49 @@ interviews_items_owned %>%
```

Or below we calculate the average number of items from the list owned by
respondents in each village. This code uses the `rowSums()` function to count
the number of `TRUE` values in the `bicycle` to `car` columns for each row,
hence its name. Note that we replaced `NA` values with the value `no_listed_items`,
so we must exclude this value in the aggregation. We then group the data by
villages and calculate the mean number of items, so each average is grouped
by village.
respondents in each village using the `number_items` column we created to
count the items listed by each household.

```{r, purl=FALSE}
interviews_items_owned %>%
select(-no_listed_items) %>%
mutate(number_items = rowSums(select(., bicycle:car))) %>%
group_by(village) %>%
summarize(mean_items = mean(number_items))
```
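
As a quick check of what this summary computes, the following sketch applies the
same `group_by()` and `summarize()` steps to a made-up wide table. The `toy_wide`
values and the `toy_means` name are hypothetical.

```r
library(dplyr)

# Hypothetical wide data: one row per household with its item count
toy_wide <- tibble(
  key_ID = 1:4,
  village = c("God", "God", "Chirodzo", "Chirodzo"),
  number_items = c(2, 4, 0, 3)
)

toy_means <- toy_wide %>%
  group_by(village) %>%
  # households with zero items still count towards the village mean
  summarize(mean_items = mean(number_items))

toy_means
```

Since `number_items` already holds one count per household, no further
filtering of `no_listed_items` is needed before averaging.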

::::::::::::::::::::::::::::::::::::::: challenge

## Exercise

We created `interviews_items_owned` by reshaping the data: first longer and then
wider. Replicate this process with the `months_lack_food` column in the
`interviews` data frame. Create a new data frame with columns for each of the
months filled with logical vectors (`TRUE` or `FALSE`) and a summary column
called `number_months_lack_food` that calculates the number of months each
household reported a lack of food.

Note that if the household did not lack food in the previous 12 months, the
value input was "none".

::::::::::::::: solution

## Solution

```{r}
months_lack_food <- interviews %>%
separate_longer_delim(months_lack_food, delim = ";") %>%
group_by(key_ID) %>%
mutate(months_lack_food_logical = TRUE,
number_months_lack_food = if_else(months_lack_food == "none", 0, n())) %>%
pivot_wider(names_from = months_lack_food,
values_from = months_lack_food_logical,
values_fill = list(months_lack_food_logical = FALSE))
```

:::::::::::::::::::::::::


::::::::::::::::::::::::::::::::::::::::::::::::::

## Pivoting longer

The opposing situation could occur if we had been provided with data in the form
@@ -329,10 +390,10 @@ We created some summary tables on `interviews_items_owned` using `count` and
`summarise`. We can create the same tables on `interviews_long`, but this will
require a different process.

1. Make a table showing showing the number of respondents in each village who
owned a particular item, and include all items. The difference between this
format and the wide format is that you can now `count` all the items using the
`items_owned` variable.
Make a table showing the number of respondents in each village who owned
a particular item, and include all items. The difference between this format and
the wide format is that you can now `count` all the items using the
`items_owned` variable.

::::::::::::::: solution

@@ -347,68 +408,44 @@ interviews_long %>%

:::::::::::::::::::::::::

2. Calculate the average number of items from the list owned by
respondents in each village. If you remove rows where `items_owned_logical` is
`FALSE` you will have a data frame where the number of rows per household is
equal to the number of items owned. You can use that to calculate the mean
number of items per village.

Remember, you need to make sure we don't count `no_listed_items`, since this is
not an actual item, but rather the absence thereof.

::::::::::::::: solution

## Solution

```{r}
interviews_long %>%
filter(items_owned_logical,
items_owned != "no_listed_items") %>%
# to keep information per household, we count key_ID
count(key_ID, village) %>% # we want to also keep the village variable
group_by(village) %>%
summarise(mean_items = mean(n))
```

:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::


## Applying what we learned to clean our data

Now we have simultaneously learned about `pivot_longer()` and `pivot_wider()`,
and fixed a problem in the way our data is structured. In the spreadsheets lesson,
we learned that it's best practice to
have only a single piece of information in each cell of your spreadsheet. In
this dataset, we have another column that stores multiple values in a single
cell. Some of the cells in the `months_lack_food` column contain multiple months
which, as before, are separated by semi-colons (`;`).
and fixed a problem in the way our data is structured. In this dataset, we have
another column that stores multiple values in a single cell. Some of the cells
in the `months_lack_food` column contain multiple months which, as before, are
separated by semi-colons (`;`).

To create a data frame where each of the columns contains only one value per cell,
we can repeat the steps we applied to `items_owned` and apply them to
`months_lack_food`. Since we will be using this data frame for the next episode,
we will call it `interviews_plotting`.

```{r, purl=FALSE}
## Plotting data ##
interviews_plotting <- interviews %>%
## pivot wider by items_owned
separate_longer_delim(items_owned, delim = ";") %>%
## if there were no items listed, changing NA to no_listed_items
replace_na(list(items_owned = "no_listed_items")) %>%
mutate(items_owned_logical = TRUE) %>%
## Use of grouped mutate to find number of rows
group_by(key_ID) %>%
mutate(items_owned_logical = TRUE,
number_items = if_else(items_owned == "no_listed_items", 0, n())) %>%
pivot_wider(names_from = items_owned,
values_from = items_owned_logical,
values_fill = list(items_owned_logical = FALSE)) %>%
values_fill = list(items_owned_logical = FALSE)) %>%
## pivot wider by months_lack_food
separate_longer_delim(months_lack_food, delim = ";") %>%
mutate(months_lack_food_logical = TRUE) %>%
mutate(months_lack_food_logical = TRUE,
number_months_lack_food = if_else(months_lack_food == "none", 0, n())) %>%
pivot_wider(names_from = months_lack_food,
values_from = months_lack_food_logical,
values_fill = list(months_lack_food_logical = FALSE)) %>%
## add some summary columns
mutate(number_months_lack_food = rowSums(select(., Jan:May))) %>%
mutate(number_items = rowSums(select(., bicycle:car)))
values_fill = list(months_lack_food_logical = FALSE))
```


30 changes: 15 additions & 15 deletions episodes/05-ggplot2.Rmd
@@ -77,21 +77,21 @@ interviews_plotting <- read_csv("https://raw.githubusercontent.com/datacarpentry
interviews_plotting <- interviews %>%
## pivot wider by items_owned
separate_longer_delim(items_owned, delim = ";") %>%
## if there were no items listed, changing NA to no_listed_items
replace_na(list(items_owned = "no_listed_items")) %>%
mutate(items_owned_logical = TRUE) %>%
## Use of grouped mutate to find number of rows
group_by(key_ID) %>%
mutate(items_owned_logical = TRUE,
number_items = if_else(items_owned == "no_listed_items", 0, n())) %>%
pivot_wider(names_from = items_owned,
values_from = items_owned_logical,
values_fill = list(items_owned_logical = FALSE)) %>%
values_fill = list(items_owned_logical = FALSE)) %>%
## pivot wider by months_lack_food
separate_rows(months_lack_food, sep = ";") %>%
mutate(months_lack_food_logical = TRUE) %>%
separate_longer_delim(months_lack_food, delim = ";") %>%
mutate(months_lack_food_logical = TRUE,
number_months_lack_food = if_else(months_lack_food == "none", 0, n())) %>%
pivot_wider(names_from = months_lack_food,
values_from = months_lack_food_logical,
values_fill = list(months_lack_food_logical = FALSE)) %>%
## add some summary columns
mutate(number_months_lack_food = rowSums(select(., Jan:May))) %>%
mutate(number_items = rowSums(select(., bicycle:car)))
values_fill = list(months_lack_food_logical = FALSE))
```

:::
@@ -281,7 +281,7 @@ opposed to lighter gray):
```{r adding-transparency, fig.alt="Scatter plot of number of items owned versus number of household members, with transparency added to points.", purl=FALSE}
interviews_plotting %>%
ggplot(aes(x = no_membrs, y = number_items)) +
geom_point(alpha = 0.3)
geom_point(alpha = 0.5)
```

That only helped a little bit with the overplotting problem, so let's try option
@@ -313,7 +313,7 @@ between 0.1 and 0.4. Experiment with the values to see how your plot changes.
```{r adding-width-height, fig.alt="Scatter plot of number of items owned versus number of household members, with jitter and transparency.", purl=FALSE}
interviews_plotting %>%
ggplot(aes(x = no_membrs, y = number_items)) +
geom_jitter(alpha = 0.3,
geom_jitter(alpha = 0.5,
width = 0.2,
height = 0.2)
```
@@ -324,7 +324,7 @@ a `color` argument inside the `geom_jitter()` function:
```{r adding-colors, fig.alt="Scatter plot of number of items owned versus number of household members, showing points as blue.", purl=FALSE}
interviews_plotting %>%
ggplot(aes(x = no_membrs, y = number_items)) +
geom_jitter(alpha = 0.3,
geom_jitter(alpha = 0.5,
color = "blue",
width = 0.2,
height = 0.2)
@@ -346,7 +346,7 @@ of the observation:
```{r color-by-species, purl=FALSE}
interviews_plotting %>%
ggplot(aes(x = no_membrs, y = number_items)) +
geom_jitter(aes(color = village), alpha = 0.3, width = 0.2, height = 0.2)
geom_jitter(aes(color = village), alpha = 0.5, width = 0.2, height = 0.2)
```

There appears to be a positive trend between number of household
@@ -389,7 +389,7 @@ What other kinds of plots might you use to show this type of data?
interviews_plotting %>%
ggplot(aes(x = village, y = rooms)) +
geom_jitter(aes(color = respondent_wall_type),
alpha = 0.3,
alpha = 0.5,
width = 0.2,
height = 0.2)
```
@@ -420,7 +420,7 @@ measurements and of their distribution:
interviews_plotting %>%
ggplot(aes(x = respondent_wall_type, y = rooms)) +
geom_boxplot(alpha = 0) +
geom_jitter(alpha = 0.3,
geom_jitter(alpha = 0.5,
color = "tomato",
width = 0.2,
height = 0.2)
30 changes: 17 additions & 13 deletions episodes/data/download_data.R
@@ -35,19 +35,23 @@ if (! file.exists("data/interviews_plotting.csv")) {
mutate(memb_assoc = na_if(memb_assoc, "NULL"),
affect_conflicts = na_if(affect_conflicts, "NULL"),
items_owned = na_if(items_owned, "NULL")) %>%
separate_rows(items_owned, sep = ";") %>%
replace_na(list(items_owned = "no_listed_items")) %>%
mutate(items_owned_logical = TRUE) %>%
pivot_wider(names_from = items_owned,
values_from = items_owned_logical,
values_fill = list(items_owned_logical = FALSE)) %>%
separate_rows(months_lack_food, sep = ";") %>%
mutate(months_lack_food_logical = TRUE) %>%
pivot_wider(names_from = months_lack_food,
values_from = months_lack_food_logical,
values_fill = list(months_lack_food_logical = FALSE)) %>%
mutate(number_months_lack_food = rowSums(select(., Jan:May))) %>%
mutate(number_items = rowSums(select(., bicycle:car)))
## pivot wider by items_owned
separate_longer_delim(items_owned, delim = ";") %>%
replace_na(list(items_owned = "no_listed_items")) %>%
## Use of grouped mutate to find number of rows
group_by(key_ID) %>%
mutate(items_owned_logical = TRUE,
number_items = if_else(items_owned == "no_listed_items", 0, n())) %>%
pivot_wider(names_from = items_owned,
values_from = items_owned_logical,
values_fill = list(items_owned_logical = FALSE)) %>%
## pivot wider by months_lack_food
separate_longer_delim(months_lack_food, delim = ";") %>%
mutate(months_lack_food_logical = TRUE,
number_months_lack_food = if_else(months_lack_food == "none", 0, n())) %>%
pivot_wider(names_from = months_lack_food,
values_from = months_lack_food_logical,
values_fill = list(months_lack_food_logical = FALSE))

write.csv(interviews_plotting, "data/interviews_plotting.csv", row.names = FALSE)
}

0 comments on commit 216d1df