diff --git a/branch.md b/branch.md index e982fd81..d4f2e130 100644 --- a/branch.md +++ b/branch.md @@ -1,6 +1,6 @@ --- title: 'Branching' -teaching: 10 +teaching: 30 exercises: 2 --- @@ -81,8 +81,8 @@ tar_plan( ✔ skipped target penguins_data_raw ✔ skipped target penguins_data ▶ dispatched target combined_model -● completed target combined_model [0.024 seconds, 11.201 kilobytes] -▶ ended pipeline [0.273 seconds] +● completed target combined_model [0.038 seconds, 11.201 kilobytes] +▶ ended pipeline [0.292 seconds] ``` Let's have a look at the model. We will use the `glance()` function from the `broom` package. Unlike base R `summary()`, this function returns output as a tibble (the tidyverse equivalent of a dataframe), which as we will see later is quite useful for downstream analyses. @@ -106,8 +106,7 @@ This seems to indicate that the model is highly significant. But wait a moment... is this really an appropriate model? Recall that there are three species of penguins in the dataset. It is possible that the relationship between bill depth and length **varies by species**. -We should probably test some alternative models. -These could include models that add a parameter for species, or add an interaction effect between species and bill length. +Let's try making one model *per* species (three models total) to see how that does (this is technically not the correct statistical approach, but our focus here is to learn `targets`, not statistics). Now our workflow is getting more complicated. 
This is what a workflow for such an analysis might look like **without branching** (make sure to add `library(broom)` to `packages.R`): @@ -130,18 +129,23 @@ tar_plan( bill_depth_mm ~ bill_length_mm, data = penguins_data ), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, - data = penguins_data + adelie_model = lm( + bill_depth_mm ~ bill_length_mm, + data = filter(penguins_data, species == "Adelie") ), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, - data = penguins_data + chinstrap_model = lm( + bill_depth_mm ~ bill_length_mm, + data = filter(penguins_data, species == "Chinstrap") + ), + gentoo_model = lm( + bill_depth_mm ~ bill_length_mm, + data = filter(penguins_data, species == "Gentoo") ), # Get model summaries combined_summary = glance(combined_model), - species_summary = glance(species_model), - interaction_summary = glance(interaction_model) + adelie_summary = glance(adelie_model), + chinstrap_summary = glance(chinstrap_model), + gentoo_summary = glance(gentoo_model) ) ``` @@ -151,42 +155,63 @@ tar_plan( ✔ skipped target penguins_data_raw ✔ skipped target penguins_data ✔ skipped target combined_model -▶ dispatched target interaction_model -● completed target interaction_model [0.003 seconds, 19.283 kilobytes] -▶ dispatched target species_model -● completed target species_model [0.001 seconds, 15.439 kilobytes] +▶ dispatched target adelie_model +● completed target adelie_model [0.008 seconds, 6.475 kilobytes] +▶ dispatched target gentoo_model +● completed target gentoo_model [0.001 seconds, 5.88 kilobytes] +▶ dispatched target chinstrap_model +● completed target chinstrap_model [0.001 seconds, 4.535 kilobytes] ▶ dispatched target combined_summary -● completed target combined_summary [0.006 seconds, 348 bytes] -▶ dispatched target interaction_summary -● completed target interaction_summary [0.003 seconds, 348 bytes] -▶ dispatched target species_summary -● completed target species_summary [0.003 seconds, 347 bytes] -▶ 
ended pipeline [0.28 seconds] +● completed target combined_summary [0.007 seconds, 348 bytes] +▶ dispatched target adelie_summary +● completed target adelie_summary [0.003 seconds, 348 bytes] +▶ dispatched target gentoo_summary +● completed target gentoo_summary [0.003 seconds, 348 bytes] +▶ dispatched target chinstrap_summary +● completed target chinstrap_summary [0.003 seconds, 348 bytes] +▶ ended pipeline [0.307 seconds] ``` Let's look at the summary of one of the models: ``` r -tar_read(species_summary) +tar_read(adelie_summary) ``` ``` output # A tibble: 1 × 12 - r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs - -1 0.769 0.767 0.953 375. 3.65e-107 3 -467. 944. 963. 307. 338 342 + r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs + +1 0.153 0.148 1.12 27.0 0.000000667 1 -231. 468. 477. 188. 149 151 ``` So this way of writing the pipeline works, but is repetitive: we have to call `glance()` each time we want to obtain summary statistics for each model. -Furthermore, each summary target (`combined_summary`, etc.) is explicitly named and typed out manually. +Furthermore, each summary target (`adelie_summary`, etc.) is explicitly named and typed out manually. It would be fairly easy to make a typo and end up with the wrong model being summarized. +Before moving on, let's define another **custom function**: `model_glance()`. +You will need to write custom functions frequently when using `targets`, so it's good to get used to it! + +As the name `model_glance()` suggests (it is good to write functions with names that indicate their purpose), this will build a model then immediately run `glance()` on it. +The reason for doing so is that we get a **dataframe as a result**, which is very helpful for branching, as we will see in the next section. 
+Save this in `R/functions.R`: + + +``` r +model_glance_orig <- function(penguins_data) { + model <- lm( + bill_depth_mm ~ bill_length_mm, + data = penguins_data) + broom::glance(model) +} +``` + ## Example with branching ### First attempt -Let's see how to write the same plan using **dynamic branching**: +Let's see how to write the same plan using **dynamic branching** (after running it, we will go through the new version in detail to understand each step): ``` r @@ -202,22 +227,22 @@ tar_plan( ), # Clean data penguins_data = clean_penguin_data(penguins_data_raw), - # Build models - models = list( - combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + # Group data + tar_group_by( + penguins_data_grouped, + penguins_data, + species ), - # Get model summaries + # Build combined model with all species together + combined_summary = model_glance(penguins_data), + # Build one model per species tar_target( - model_summaries, - glance(models[[1]]), - pattern = map(models) + species_summary, + model_glance(penguins_data_grouped), + pattern = map(penguins_data_grouped) ) ) +NA ``` What is going on here? @@ -229,81 +254,80 @@ First, let's look at the messages provided by `tar_make()`. 
✔ skipped target penguins_data_raw_file ✔ skipped target penguins_data_raw ✔ skipped target penguins_data -▶ dispatched target models -● completed target models [0.005 seconds, 43.009 kilobytes] -▶ dispatched branch model_summaries_812e3af782bee03f -● completed branch model_summaries_812e3af782bee03f [0.006 seconds, 348 bytes] -▶ dispatched branch model_summaries_2b8108839427c135 -● completed branch model_summaries_2b8108839427c135 [0.003 seconds, 347 bytes] -▶ dispatched branch model_summaries_533cd9a636c3e05b -● completed branch model_summaries_533cd9a636c3e05b [0.003 seconds, 348 bytes] -● completed pattern model_summaries -▶ ended pipeline [0.302 seconds] +▶ dispatched target combined_summary +● completed target combined_summary [0.008 seconds, 348 bytes] +▶ dispatched target penguins_data_grouped +● completed target penguins_data_grouped [0.007 seconds, 1.527 kilobytes] +▶ dispatched branch species_summary_7fe6634f7c7f6a77 +● completed branch species_summary_7fe6634f7c7f6a77 [0.006 seconds, 348 bytes] +▶ dispatched branch species_summary_c580675a85977909 +● completed branch species_summary_c580675a85977909 [0.004 seconds, 348 bytes] +▶ dispatched branch species_summary_af3bb92d1b0f36d3 +● completed branch species_summary_af3bb92d1b0f36d3 [0.004 seconds, 348 bytes] +● completed pattern species_summary +▶ ended pipeline [0.315 seconds] ``` -There is a series of smaller targets (branches) that are each named like model_summaries_812e3af782bee03f, then one overall `model_summaries` target. +There is a series of smaller targets (branches) that are each named like species_summary_7fe6634f7c7f6a77, then one overall `species_summary` target. That is the result of specifying targets using branching: each of the smaller targets are the "branches" that comprise the overall target. Since `targets` has no way of knowing ahead of time how many branches there will be or what they represent, it names each one using this series of numbers and letters (the "hash"). 
`targets` builds each branch one at a time, then combines them into the overall target. -Next, let's look in more detail about how the workflow is set up, starting with how we defined the models: +Next, let's look in more detail about how the workflow is set up, starting with how we set up the data: ``` r - # Build models - models = list( - combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + # Group data + tar_group_by( + penguins_data_grouped, + penguins_data, + species ), ``` -Unlike the non-branching version, we defined the models **in a list** (instead of one target per model). -This is because dynamic branching is similar to the `base::apply()` or [`purrrr::map()`](https://purrr.tidyverse.org/reference/map.html) method of looping: it applies a function to each element of a list. -So we need to prepare the input for looping as a list. +Unlike the non-branching version, we added a step that **groups the data**. +This is because dynamic branching is similar to the [`tidyverse` approach](https://dplyr.tidyverse.org/articles/grouping.html) of applying the same function to a grouped dataframe. +So we use the `tar_group_by()` function to specify the groups in our input data: one group per species. -Next, take a look at the command to build the target `model_summaries`. +Next, take a look at the command to build the target `species_summary`. ``` r - # Get model summaries + # Build one model per species tar_target( - model_summaries, - glance(models[[1]]), - pattern = map(models) + species_summary, + model_glance(penguins_data_grouped), + pattern = map(penguins_data_grouped) ) ``` -As before, the first argument is the name of the target to build, and the second is the command to build it. 
+As before, the first argument to `tar_target()` is the name of the target to build, and the second is the command to build it. -Here, we apply the `glance()` function to each element of `models` (the `[[1]]` is necessary because when the function gets applied, each element is actually a nested list, and we need to remove one layer of nesting). +Here, we apply our custom `model_glance()` function to each group (in other words, each species) in `penguins_data_grouped`. Finally, there is an argument we haven't seen before, `pattern`, which indicates that this target should be built using dynamic branching. -`map` means to apply the command to each element of the input list (`models`) sequentially. +`map` means to apply the function to each group of the input data (`penguins_data_grouped`) sequentially. Now that we understand how the branching workflow is constructed, let's inspect the output: ``` r -tar_read(model_summaries) +tar_read(species_summary) ``` ``` output # A tibble: 3 × 12 - r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs - -1 0.0552 0.0525 1.92 19.9 1.12e- 5 1 -708. 1422. 1433. 1256. 340 342 -2 0.769 0.767 0.953 375. 3.65e-107 3 -467. 944. 963. 307. 338 342 -3 0.770 0.766 0.955 225. 8.52e-105 5 -466. 947. 974. 306. 336 342 + r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs + +1 0.153 0.148 1.12 27.0 6.67e- 7 1 -231. 468. 477. 188. 149 151 +2 0.427 0.418 0.866 49.2 1.53e- 9 1 -85.7 177. 184. 49.5 66 68 +3 0.414 0.409 0.754 85.5 1.02e-15 1 -139. 284. 292. 68.8 121 123 ``` The model summary statistics are all included in a single dataframe. -But there's one problem: **we can't tell which row came from which model!** It would be unwise to assume that they are in the same order as the list of models. +But there's one problem: **we can't tell which row came from which species!** It would be unwise to assume that they are in the same order as the input data. 
This is due to the way dynamic branching works: by default, there is no information about the provenance of each target preserved in the output. @@ -312,114 +336,72 @@ How can we fix this? ### Second attempt The key to obtaining useful output from branching pipelines is to include the necessary information in the output of each individual branch. -Here, we want to know the kind of model that corresponds to each row of the model summaries. -To do that, we need to write a **custom function**. -You will need to write custom functions frequently when using `targets`, so it's good to get used to it! +Here, we want to know the species that corresponds to each row of the model summaries. -Here is the function. Save this in `R/functions.R`: +We can achieve this by modifying our `model_glance` function. Be sure to save it after modifying it to include a column for species: ``` r -glance_with_mod_name <- function(model_in_list) { - model_name <- names(model_in_list) - model <- model_in_list[[1]] +model_glance <- function(penguins_data) { + # Make model + model <- lm( + bill_depth_mm ~ bill_length_mm, + data = penguins_data) + # Get species name + species_name <- unique(penguins_data$species) + # If this is the combined dataset with multiple + # species, changed name to 'combined' + if (length(species_name) > 1) { + species_name <- "combined" + } + # Get model summary and add species name glance(model) |> - mutate(model_name = model_name) + mutate(species = species_name, .before = 1) } ``` -Our new pipeline looks almost the same as before, but this time we use the custom function instead of `glance()`. - +Our new pipeline looks exactly the same as before; we have made a modification, but to a **function**, not the pipeline. 
-``` r -source("R/functions.R") -source("R/packages.R") - -tar_plan( - # Load raw data - tar_file_read( - penguins_data_raw, - path_to_file("penguins_raw.csv"), - read_csv(!!.x, show_col_types = FALSE) - ), - # Clean data - penguins_data = clean_penguin_data(penguins_data_raw), - # Build models - models = list( - combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) - ), - # Get model summaries - tar_target( - model_summaries, - glance_with_mod_name(models), - pattern = map(models) - ) -) -``` +Since `targets` tracks the contents of each custom function, it realizes that it needs to recompute `species_summary` and runs this target again with the newly modified function. ``` output ✔ skipped target penguins_data_raw_file ✔ skipped target penguins_data_raw ✔ skipped target penguins_data -✔ skipped target models -▶ dispatched branch model_summaries_812e3af782bee03f -● completed branch model_summaries_812e3af782bee03f [0.012 seconds, 374 bytes] -▶ dispatched branch model_summaries_2b8108839427c135 -● completed branch model_summaries_2b8108839427c135 [0.007 seconds, 371 bytes] -▶ dispatched branch model_summaries_533cd9a636c3e05b -● completed branch model_summaries_533cd9a636c3e05b [0.004 seconds, 377 bytes] -● completed pattern model_summaries -▶ ended pipeline [0.281 seconds] +▶ dispatched target combined_summary +● completed target combined_summary [0.021 seconds, 371 bytes] +✔ skipped target penguins_data_grouped +▶ dispatched branch species_summary_7fe6634f7c7f6a77 +● completed branch species_summary_7fe6634f7c7f6a77 [0.011 seconds, 368 bytes] +▶ dispatched branch species_summary_c580675a85977909 +● completed branch species_summary_c580675a85977909 [0.007 seconds, 372 bytes] +▶ dispatched branch species_summary_af3bb92d1b0f36d3 +● completed branch 
species_summary_af3bb92d1b0f36d3 [0.006 seconds, 369 bytes] +● completed pattern species_summary +▶ ended pipeline [0.323 seconds] ``` -And this time, when we load the `model_summaries`, we can tell which model corresponds to which row (you may need to scroll to the right to see it). +And this time, when we load `species_summary`, we can tell which species corresponds to which row (the `.before = 1` in `mutate()` ensures that the `species` column shows up before the other columns). ``` r -tar_read(model_summaries) +tar_read(species_summary) ``` ``` output # A tibble: 3 × 13 - r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs model_name - -1 0.0552 0.0525 1.92 19.9 1.12e- 5 1 -708. 1422. 1433. 1256. 340 342 combined_model -2 0.769 0.767 0.953 375. 3.65e-107 3 -467. 944. 963. 307. 338 342 species_model -3 0.770 0.766 0.955 225. 8.52e-105 5 -466. 947. 974. 306. 336 342 interaction_model + species r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs + +1 Adelie 0.153 0.148 1.12 27.0 6.67e- 7 1 -231. 468. 477. 188. 149 151 +2 Chinstrap 0.427 0.418 0.866 49.2 1.53e- 9 1 -85.7 177. 184. 49.5 66 68 +3 Gentoo 0.414 0.409 0.754 85.5 1.02e-15 1 -139. 284. 292. 68.8 121 123 ``` Next we will add one more target, a prediction of bill depth based on each model. These will be needed for plotting the models in the report. -Such a prediction can be obtained with the `augment()` function of the `broom` package. +Such a prediction can be obtained with the `augment()` function of the `broom` package, and we create a custom function that outputs predicted points as a dataframe much like we did for the model summaries. 
-``` r -tar_load(models) -augment(models[[1]]) -``` - -``` output -# A tibble: 342 × 8 - bill_depth_mm bill_length_mm .fitted .resid .hat .sigma .cooksd .std.resid - - 1 18.7 39.1 17.6 1.14 0.00521 1.92 0.000924 0.594 - 2 17.4 39.5 17.5 -0.127 0.00485 1.93 0.0000107 -0.0663 - 3 18 40.3 17.5 0.541 0.00421 1.92 0.000168 0.282 - 4 19.3 36.7 17.8 1.53 0.00806 1.92 0.00261 0.802 - 5 20.6 39.3 17.5 3.06 0.00503 1.92 0.00641 1.59 - 6 17.8 38.9 17.6 0.222 0.00541 1.93 0.0000364 0.116 - 7 19.6 39.2 17.6 2.05 0.00512 1.92 0.00293 1.07 - 8 18.1 34.1 18.0 0.114 0.0124 1.93 0.0000223 0.0595 - 9 20.2 42 17.3 2.89 0.00329 1.92 0.00373 1.50 -10 17.1 37.8 17.7 -0.572 0.00661 1.92 0.000296 -0.298 -# ℹ 332 more rows -``` - ::::::::::::::::::::::::::::::::::::: {.challenge} ## Challenge: Add model predictions to the workflow @@ -428,15 +410,25 @@ Can you add the model predictions using `augment()`? You will need to define a c :::::::::::::::::::::::::::::::::: {.solution} -Define the new function as `augment_with_mod_name()`. It is the same as `glance_with_mod_name()`, but use `augment()` instead of `glance()`: +Define the new function as `model_augment()`. 
It is the same as `model_glance()`, but uses `augment()` instead of `glance()`: ``` r -augment_with_mod_name <- function(model_in_list) { - model_name <- names(model_in_list) - model <- model_in_list[[1]] +model_augment <- function(penguins_data) { + # Make model + model <- lm( + bill_depth_mm ~ bill_length_mm, + data = penguins_data) + # Get species name + species_name <- unique(penguins_data$species) + # If this is the combined dataset with multiple + # species, change name to 'combined' + if (length(species_name) > 1) { + species_name <- "combined" + } + # Get model predictions and add species name augment(model) |> - mutate(model_name = model_name) + mutate(species = species_name, .before = 1) } ``` @@ -456,26 +448,27 @@ tar_plan( ), # Clean data penguins_data = clean_penguin_data(penguins_data_raw), - # Build models - models = list( - combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + # Group data + tar_group_by( + penguins_data_grouped, + penguins_data, + species ), - # Get model summaries + # Get summary of combined model with all species together + combined_summary = model_glance(penguins_data), + # Get summary of one model per species tar_target( - model_summaries, - glance_with_mod_name(models), - pattern = map(models) + species_summary, + model_glance(penguins_data_grouped), + pattern = map(penguins_data_grouped) ), - # Get model predictions + # Get predictions of combined model with all species together + combined_predictions = model_augment(penguins_data), + # Get predictions of one model per species tar_target( - model_predictions, - augment_with_mod_name(models), - pattern = map(models) + species_predictions, + model_augment(penguins_data_grouped), + pattern = map(penguins_data_grouped) ) ) ``` @@ -484,13 +477,86 @@ tar_plan( 
::::::::::::::::::::::::::::::::::::: +### Further simplify the workflow + +You may have noticed that we can further simplify the workflow: there is no need to have separate `penguins_data` and `penguins_data_grouped` dataframes. +In general it is best to keep the number of named objects as small as possible to make it easier to reason about your code. +Let's combine the cleaning and grouping step into a single command: + + +``` r +source("R/functions.R") +source("R/packages.R") + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean and group data + tar_group_by( + penguins_data, + clean_penguin_data(penguins_data_raw), + species + ), + # Get summary of combined model with all species together + combined_summary = model_glance(penguins_data), + # Get summary of one model per species + tar_target( + species_summary, + model_glance(penguins_data), + pattern = map(penguins_data) + ), + # Get predictions of combined model with all species together + combined_predictions = model_augment(penguins_data), + # Get predictions of one model per species + tar_target( + species_predictions, + model_augment(penguins_data), + pattern = map(penguins_data) + ) +) +NA +``` + +And run it once more: + + +``` output +✔ skipped target penguins_data_raw_file +✔ skipped target penguins_data_raw +▶ dispatched target penguins_data +● completed target penguins_data [0.023 seconds, 1.527 kilobytes] +▶ dispatched target combined_summary +● completed target combined_summary [0.014 seconds, 371 bytes] +▶ dispatched branch species_summary_1598bb4431372f32 +● completed branch species_summary_1598bb4431372f32 [0.011 seconds, 368 bytes] +▶ dispatched branch species_summary_6b9109ba2e9d27fd +● completed branch species_summary_6b9109ba2e9d27fd [0.006 seconds, 372 bytes] +▶ dispatched branch species_summary_625f9fbc7f62298a +● completed branch species_summary_625f9fbc7f62298a [0.007 seconds, 369 
bytes] +● completed pattern species_summary +▶ dispatched target combined_predictions +● completed target combined_predictions [0.007 seconds, 25.908 kilobytes] +▶ dispatched branch species_predictions_1598bb4431372f32 +● completed branch species_predictions_1598bb4431372f32 [0.01 seconds, 11.581 kilobytes] +▶ dispatched branch species_predictions_6b9109ba2e9d27fd +● completed branch species_predictions_6b9109ba2e9d27fd [0.005 seconds, 6.248 kilobytes] +▶ dispatched branch species_predictions_625f9fbc7f62298a +● completed branch species_predictions_625f9fbc7f62298a [0.004 seconds, 9.626 kilobytes] +● completed pattern species_predictions +▶ ended pipeline [0.392 seconds] +``` + ::::::::::::::::::::::::::::::::::::: {.callout} ## Best practices for branching -Dynamic branching is designed to work well with **dataframes** (tibbles). +Dynamic branching is designed to work well with **dataframes** (it can also use [lists](https://books.ropensci.org/targets/dynamic.html#list-iteration), but that is more advanced, so we recommend using dataframes when possible). -So if possible, write your custom functions to accept dataframes as input and return them as output, and always include any necessary metadata as a column or columns. +It is recommended to write your custom functions to accept dataframes as input and return them as output, and always include any necessary metadata as a column or columns. 
::::::::::::::::::::::::::::::::::::: diff --git a/files/plans/plan_10.R b/files/plans/plan_10.R index be92fd01..59bcfed9 100644 --- a/files/plans/plan_10.R +++ b/files/plans/plan_10.R @@ -16,27 +16,26 @@ tar_plan( path_to_file("penguins_raw.csv"), read_csv(!!.x, show_col_types = FALSE) ), - # Clean data - penguins_data = clean_penguin_data(penguins_data_raw), - # Build models - models = list( - combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + # Clean and group data + tar_group_by( + penguins_data, + clean_penguin_data(penguins_data_raw), + species ), - # Get model summaries + # Get summary of combined model with all species together + combined_summary = model_glance(penguins_data), + # Get summary of one model per species tar_target( - model_summaries, - glance_with_mod_name_slow(models), - pattern = map(models) + species_summary, + model_glance_slow(penguins_data), + pattern = map(penguins_data) ), - # Get model predictions + # Get predictions of combined model with all species together + combined_predictions = model_augment_slow(penguins_data), + # Get predictions of one model per species tar_target( - model_predictions, - augment_with_mod_name_slow(models), - pattern = map(models) + species_predictions, + model_augment_slow(penguins_data), + pattern = map(penguins_data) ) ) diff --git a/files/plans/plan_11.R b/files/plans/plan_11.R index 5c9af52f..6b23b0b3 100644 --- a/files/plans/plan_11.R +++ b/files/plans/plan_11.R @@ -9,34 +9,32 @@ tar_plan( path_to_file("penguins_raw.csv"), read_csv(!!.x, show_col_types = FALSE) ), - # Clean data - penguins_data = clean_penguin_data(penguins_data_raw), - # Build models - models = list( - combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ 
bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + # Clean and group data + tar_group_by( + penguins_data, + clean_penguin_data(penguins_data_raw), + species ), - # Get model summaries + # Get summary of combined model with all species together + combined_summary = model_glance(penguins_data), + # Get summary of one model per species tar_target( - model_summaries, - glance_with_mod_name(models), - pattern = map(models) + species_summary, + model_glance(penguins_data), + pattern = map(penguins_data) ), - # Get model predictions + # Get predictions of combined model with all species together + combined_predictions = model_augment(penguins_data), + # Get predictions of one model per species tar_target( - model_predictions, - augment_with_mod_name(models), - pattern = map(models) + species_predictions, + model_augment(penguins_data), + pattern = map(penguins_data) ), # Generate report tar_quarto( penguin_report, path = "penguin_report.qmd", - quiet = FALSE, - packages = c("targets", "tidyverse") + quiet = FALSE ) ) diff --git a/files/plans/plan_5.R b/files/plans/plan_5.R index 882876cc..cecaae2b 100644 --- a/files/plans/plan_5.R +++ b/files/plans/plan_5.R @@ -16,16 +16,21 @@ tar_plan( bill_depth_mm ~ bill_length_mm, data = penguins_data ), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, - data = penguins_data + adelie_model = lm( + bill_depth_mm ~ bill_length_mm, + data = filter(penguins_data, species == "Adelie") ), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, - data = penguins_data + chinstrap_model = lm( + bill_depth_mm ~ bill_length_mm, + data = filter(penguins_data, species == "Chinstrap") + ), + gentoo_model = lm( + bill_depth_mm ~ bill_length_mm, + data = filter(penguins_data, species == "Gentoo") ), # Get model summaries combined_summary = glance(combined_model), - species_summary = glance(species_model), - interaction_summary = 
glance(interaction_model) + adelie_summary = glance(adelie_model), + chinstrap_summary = glance(chinstrap_model), + gentoo_summary = glance(gentoo_model) ) diff --git a/files/plans/plan_6.R b/files/plans/plan_6.R index fad7536b..33f30d95 100644 --- a/files/plans/plan_6.R +++ b/files/plans/plan_6.R @@ -11,19 +11,18 @@ tar_plan( ), # Clean data penguins_data = clean_penguin_data(penguins_data_raw), - # Build models - models = list( - combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + # Group data + tar_group_by( + penguins_data_grouped, + penguins_data, + species ), - # Get model summaries + # Build combined model with all species together + combined_summary = model_glance(penguins_data), + # Build one model per species tar_target( - model_summaries, - glance(models[[1]]), - pattern = map(models) + species_summary, + model_glance(penguins_data_grouped), + pattern = map(penguins_data_grouped) ) ) diff --git a/files/plans/plan_6b.R b/files/plans/plan_6b.R new file mode 100644 index 00000000..28ac909c --- /dev/null +++ b/files/plans/plan_6b.R @@ -0,0 +1,28 @@ +options(tidyverse.quiet = TRUE) +source("R/packages.R") +source("R/functions.R") + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean data + penguins_data = clean_penguin_data(penguins_data_raw), + # Group data + tar_group_by( + penguins_data_grouped, + penguins_data, + species + ), + # Build combined model with all species together + combined_summary = model_glance_orig(penguins_data), + # Build one model per species + tar_target( + species_summary, + model_glance_orig(penguins_data_grouped), + pattern = map(penguins_data_grouped) + ) +) diff --git a/files/plans/plan_6c.R b/files/plans/plan_6c.R new 
file mode 100644 index 00000000..8b72fa69 --- /dev/null +++ b/files/plans/plan_6c.R @@ -0,0 +1,34 @@ +options(tidyverse.quiet = TRUE) +source("R/functions.R") +source("R/packages.R") + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean and group data + tar_group_by( + penguins_data, + clean_penguin_data(penguins_data_raw), + species + ), + # Get summary of combined model with all species together + combined_summary = model_glance(penguins_data), + # Get summary of one model per species + tar_target( + species_summary, + model_glance(penguins_data), + pattern = map(penguins_data) + ), + # Get predictions of combined model with all species together + combined_predictions = model_augment(penguins_data), + # Get predictions of one model per species + tar_target( + species_predictions, + model_augment(penguins_data), + pattern = map(penguins_data) + ) +) diff --git a/files/plans/plan_7.R b/files/plans/plan_7.R index 346cca74..da5f7bc5 100644 --- a/files/plans/plan_7.R +++ b/files/plans/plan_7.R @@ -11,19 +11,26 @@ tar_plan( ), # Clean data penguins_data = clean_penguin_data(penguins_data_raw), - # Build models - models = list( - combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + # Group data + tar_group_by( + penguins_data_grouped, + penguins_data, + species ), - # Get model summaries + # Get summary of combined model with all species together + combined_summary = model_glance(penguins_data), + # Get summary of one model per species tar_target( - model_summaries, - glance_with_mod_name(models), - pattern = map(models) + species_summary, + model_glance(penguins_data_grouped), + pattern = map(penguins_data_grouped) + ), + # Get predictions of combined model 
with all species together + combined_predictions = model_glance(penguins_data_grouped), + # Get predictions of one model per species + tar_target( + species_predictions, + model_augment(penguins_data_grouped), + pattern = map(penguins_data_grouped) ) ) diff --git a/files/plans/plan_8.R b/files/plans/plan_8.R index 8a6779ef..9d76b4a4 100644 --- a/files/plans/plan_8.R +++ b/files/plans/plan_8.R @@ -9,27 +9,26 @@ tar_plan( path_to_file("penguins_raw.csv"), read_csv(!!.x, show_col_types = FALSE) ), - # Clean data - penguins_data = clean_penguin_data(penguins_data_raw), - # Build models - models = list( - combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + # Clean and group data + tar_group_by( + penguins_data, + clean_penguin_data(penguins_data_raw), + species ), - # Get model summaries + # Get summary of combined model with all species together + combined_summary = model_glance(penguins_data), + # Get summary of one model per species tar_target( - model_summaries, - glance_with_mod_name(models), - pattern = map(models) + species_summary, + model_glance(penguins_data), + pattern = map(penguins_data) ), - # Get model predictions + # Get predictions of combined model with all species together + combined_predictions = model_augment(penguins_data), + # Get predictions of one model per species tar_target( - model_predictions, - augment_with_mod_name(models), - pattern = map(models) + species_predictions, + model_augment(penguins_data), + pattern = map(penguins_data) ) ) diff --git a/files/plans/plan_9.R b/files/plans/plan_9.R index 164359b1..99958265 100644 --- a/files/plans/plan_9.R +++ b/files/plans/plan_9.R @@ -16,27 +16,26 @@ tar_plan( path_to_file("penguins_raw.csv"), read_csv(!!.x, show_col_types = FALSE) ), - # Clean data - penguins_data = 
clean_penguin_data(penguins_data_raw), - # Build models - models = list( - combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + # Clean and group data + tar_group_by( + penguins_data, + clean_penguin_data(penguins_data_raw), + species ), - # Get model summaries + # Get summary of combined model with all species together + combined_summary = model_glance(penguins_data), + # Get summary of one model per species tar_target( - model_summaries, - glance_with_mod_name(models), - pattern = map(models) + species_summary, + model_glance(penguins_data), + pattern = map(penguins_data) ), - # Get model predictions + # Get predictions of combined model with all species together + combined_predictions = model_glance(penguins_data), + # Get predictions of one model per species tar_target( - model_predictions, - augment_with_mod_name(models), - pattern = map(models) + species_predictions, + model_augment(penguins_data), + pattern = map(penguins_data) ) ) diff --git a/files/tar_functions/model_augment.R b/files/tar_functions/model_augment.R new file mode 100644 index 00000000..68875d00 --- /dev/null +++ b/files/tar_functions/model_augment.R @@ -0,0 +1,16 @@ +model_augment <- function(penguins_data) { + # Make model + model <- lm( + bill_depth_mm ~ bill_length_mm, + data = penguins_data) + # Get species name + species_name <- unique(penguins_data$species) + # If this is the combined dataset with multiple + # species, changed name to 'combined' + if (length(species_name) > 1) { + species_name <- "combined" + } + # Get model summary and add species name + augment(model) |> + mutate(species = species_name, .before = 1) +} diff --git a/files/tar_functions/model_augment_slow.R b/files/tar_functions/model_augment_slow.R new file mode 100644 index 00000000..8dd99fe6 --- /dev/null +++ 
b/files/tar_functions/model_augment_slow.R @@ -0,0 +1,17 @@ +model_augment_slow <- function(penguins_data) { + Sys.sleep(4) + # Make model + model <- lm( + bill_depth_mm ~ bill_length_mm, + data = penguins_data) + # Get species name + species_name <- unique(penguins_data$species) + # If this is the combined dataset with multiple + # species, changed name to 'combined' + if (length(species_name) > 1) { + species_name <- "combined" + } + # Get model summary and add species name + augment(model) |> + mutate(species = species_name, .before = 1) +} diff --git a/files/tar_functions/model_glance.R b/files/tar_functions/model_glance.R new file mode 100644 index 00000000..c324161f --- /dev/null +++ b/files/tar_functions/model_glance.R @@ -0,0 +1,16 @@ +model_glance <- function(penguins_data) { + # Make model + model <- lm( + bill_depth_mm ~ bill_length_mm, + data = penguins_data) + # Get species name + species_name <- unique(penguins_data$species) + # If this is the combined dataset with multiple + # species, changed name to 'combined' + if (length(species_name) > 1) { + species_name <- "combined" + } + # Get model summary and add species name + glance(model) |> + mutate(species = species_name, .before = 1) +} diff --git a/files/tar_functions/model_glance_orig.R b/files/tar_functions/model_glance_orig.R new file mode 100644 index 00000000..a0c3fdd4 --- /dev/null +++ b/files/tar_functions/model_glance_orig.R @@ -0,0 +1,6 @@ +model_glance_orig <- function(penguins_data) { + model <- lm( + bill_depth_mm ~ bill_length_mm, + data = penguins_data) + broom::glance(model) +} diff --git a/files/tar_functions/model_glance_slow.R b/files/tar_functions/model_glance_slow.R new file mode 100644 index 00000000..ba37fe66 --- /dev/null +++ b/files/tar_functions/model_glance_slow.R @@ -0,0 +1,17 @@ +model_glance_slow <- function(penguins_data) { + Sys.sleep(4) + # Make model + model <- lm( + bill_depth_mm ~ bill_length_mm, + data = penguins_data) + # Get species name + species_name <- 
unique(penguins_data$species) + # If this is the combined dataset with multiple + # species, changed name to 'combined' + if (length(species_name) > 1) { + species_name <- "combined" + } + # Get model summary and add species name + glance(model) |> + mutate(species = species_name, .before = 1) +} diff --git a/md5sum.txt b/md5sum.txt index e93c694e..7791ed4d 100644 --- a/md5sum.txt +++ b/md5sum.txt @@ -11,9 +11,9 @@ "episodes/organization.Rmd" "74df25779b74013eeb6a8ca7b8934efe" "site/built/organization.md" "2024-12-13" "episodes/packages.Rmd" "2c0eb6138ea6685a0ee279c89b381bc4" "site/built/packages.md" "2024-12-13" "episodes/files.Rmd" "b7f4ef83379a58d5c30d8e011e3b2c0d" "site/built/files.md" "2024-12-13" -"episodes/branch.Rmd" "6f1187d6df3310eb042aaae3a44328dc" "site/built/branch.md" "2024-12-13" -"episodes/parallel.Rmd" "3ec032e9a527138e70e2efb4e5a10410" "site/built/parallel.md" "2024-12-13" -"episodes/quarto.Rmd" "76b257de72894ab24e1d1852b6149bf9" "site/built/quarto.md" "2024-12-13" +"episodes/branch.Rmd" "653595088adc36d9d8f62b44eb999b79" "site/built/branch.md" "2024-12-24" +"episodes/parallel.Rmd" "ac69d3ea56790fa3ed99b29a2c809ade" "site/built/parallel.md" "2024-12-24" +"episodes/quarto.Rmd" "c0cc60ecc04827fc09a3a2e9c9b36bd3" "site/built/quarto.md" "2024-12-24" "instructors/instructor-notes.md" "df3784ee5c0436a9e171071f7965d3fc" "site/built/instructor-notes.md" "2024-12-13" "learners/reference.md" "3f06251c1f932e767ae8f22db25eb5a2" "site/built/reference.md" "2024-12-13" "learners/setup.md" "2c9965f182c4d73141cbf0bef2990f16" "site/built/setup.md" "2024-12-13" diff --git a/parallel.md b/parallel.md index dc47fa28..b81682b5 100644 --- a/parallel.md +++ b/parallel.md @@ -1,6 +1,6 @@ --- title: 'Parallel Processing' -teaching: 10 +teaching: 15 exercises: 2 --- @@ -74,54 +74,74 @@ tar_plan( path_to_file("penguins_raw.csv"), read_csv(!!.x, show_col_types = FALSE) ), - # Clean data - penguins_data = clean_penguin_data(penguins_data_raw), - # Build models - models = list( 
- combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + # Clean and group data + tar_group_by( + penguins_data, + clean_penguin_data(penguins_data_raw), + species ), - # Get model summaries + # Get summary of combined model with all species together + combined_summary = model_glance(penguins_data), + # Get summary of one model per species tar_target( - model_summaries, - glance_with_mod_name(models), - pattern = map(models) + species_summary, + model_glance(penguins_data), + pattern = map(penguins_data) ), - # Get model predictions + # Get predictions of combined model with all species together + combined_predictions = model_glance(penguins_data), + # Get predictions of one model per species tar_target( - model_predictions, - augment_with_mod_name(models), - pattern = map(models) + species_predictions, + model_augment(penguins_data), + pattern = map(penguins_data) ) ) +NA ``` There is still one more thing we need to modify only for the purposes of this demo: if we ran the analysis in parallel now, you wouldn't notice any difference in compute time because the functions are so fast. -So let's make "slow" versions of `glance_with_mod_name()` and `augment_with_mod_name()` using the `Sys.sleep()` function, which just tells the computer to wait some number of seconds. +So let's make "slow" versions of `model_glance()` and `model_augment()` using the `Sys.sleep()` function, which just tells the computer to wait some number of seconds. This will simulate a long-running computation and enable us to see the difference between running sequentially and in parallel. 
Add these functions to `functions.R` (you can copy-paste the original ones, then modify them): ``` r -glance_with_mod_name_slow <- function(model_in_list) { +model_glance_slow <- function(penguins_data) { Sys.sleep(4) - model_name <- names(model_in_list) - model <- model_in_list[[1]] - broom::glance(model) |> - mutate(model_name = model_name) + # Make model + model <- lm( + bill_depth_mm ~ bill_length_mm, + data = penguins_data) + # Get species name + species_name <- unique(penguins_data$species) + # If this is the combined dataset with multiple + # species, changed name to 'combined' + if (length(species_name) > 1) { + species_name <- "combined" + } + # Get model summary and add species name + glance(model) |> + mutate(species = species_name, .before = 1) } -augment_with_mod_name_slow <- function(model_in_list) { +model_augment_slow <- function(penguins_data) { Sys.sleep(4) - model_name <- names(model_in_list) - model <- model_in_list[[1]] - broom::augment(model) |> - mutate(model_name = model_name) + # Make model + model <- lm( + bill_depth_mm ~ bill_length_mm, + data = penguins_data) + # Get species name + species_name <- unique(penguins_data$species) + # If this is the combined dataset with multiple + # species, changed name to 'combined' + if (length(species_name) > 1) { + species_name <- "combined" + } + # Get model summary and add species name + augment(model) |> + mutate(species = species_name, .before = 1) } ``` @@ -145,55 +165,81 @@ tar_plan( path_to_file("penguins_raw.csv"), read_csv(!!.x, show_col_types = FALSE) ), - # Clean data - penguins_data = clean_penguin_data(penguins_data_raw), - # Build models - models = list( - combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + # Clean and group data + tar_group_by( + penguins_data, + 
clean_penguin_data(penguins_data_raw), + species ), - # Get model summaries + # Get summary of combined model with all species together + combined_summary = model_glance(penguins_data), + # Get summary of one model per species tar_target( - model_summaries, - glance_with_mod_name_slow(models), - pattern = map(models) + species_summary, + model_glance_slow(penguins_data), + pattern = map(penguins_data) ), - # Get model predictions + # Get predictions of combined model with all species together + combined_predictions = model_glance_slow(penguins_data), + # Get predictions of one model per species tar_target( - model_predictions, - augment_with_mod_name_slow(models), - pattern = map(models) + species_predictions, + model_augment_slow(penguins_data), + pattern = map(penguins_data) ) ) +NA ``` Finally, run the pipeline with `tar_make()` as normal. ``` output -✔ skip target penguins_data_raw_file -✔ skip target penguins_data_raw -✔ skip target penguins_data -✔ skip target models -• start branch model_predictions_5ad4cec5 -• start branch model_predictions_c73912d5 -• start branch model_predictions_91696941 -• start branch model_summaries_5ad4cec5 -• start branch model_summaries_c73912d5 -• start branch model_summaries_91696941 -• built branch model_predictions_5ad4cec5 [4.884 seconds] -• built branch model_predictions_c73912d5 [4.896 seconds] -• built branch model_predictions_91696941 [4.006 seconds] -• built pattern model_predictions -• built branch model_summaries_5ad4cec5 [4.011 seconds] -• built branch model_summaries_c73912d5 [4.011 seconds] -• built branch model_summaries_91696941 [4.011 seconds] -• built pattern model_summaries -• end pipeline [15.153 seconds] +✔ skipped target penguins_data_raw_file +✔ skipped target penguins_data_raw +✔ skipped target penguins_data +✔ skipped target combined_summary +▶ dispatched branch species_summary_1598bb4431372f32 +▶ dispatched branch species_summary_6b9109ba2e9d27fd +● completed branch species_summary_1598bb4431372f32 
[4.695 seconds, 368 bytes] +▶ dispatched branch species_summary_625f9fbc7f62298a +● completed branch species_summary_6b9109ba2e9d27fd [4.69 seconds, 372 bytes] +▶ dispatched target combined_predictions +● completed branch species_summary_625f9fbc7f62298a [4.011 seconds, 369 bytes] +● completed pattern species_summary +▶ dispatched branch species_predictions_1598bb4431372f32 +● completed target combined_predictions [4.012 seconds, 371 bytes] +▶ dispatched branch species_predictions_6b9109ba2e9d27fd +● completed branch species_predictions_1598bb4431372f32 [4.013 seconds, 11.585 kilobytes] +▶ dispatched branch species_predictions_625f9fbc7f62298a +● completed branch species_predictions_6b9109ba2e9d27fd [4.012 seconds, 6.252 kilobytes] +● completed branch species_predictions_625f9fbc7f62298a [4.01 seconds, 9.629 kilobytes] +● completed pattern species_predictions +▶ ended pipeline [18.809 seconds] +``` + +``` output +✔ skipped target penguins_data_raw_file +✔ skipped target penguins_data_raw +✔ skipped target penguins_data +✔ skipped target combined_summary +▶ dispatched branch species_summary_1598bb4431372f32 +▶ dispatched branch species_summary_6b9109ba2e9d27fd +● completed branch species_summary_1598bb4431372f32 [4.815 seconds, 367 bytes] +▶ dispatched branch species_summary_625f9fbc7f62298a +● completed branch species_summary_6b9109ba2e9d27fd [4.813 seconds, 370 bytes] +▶ dispatched target combined_predictions +● completed branch species_summary_625f9fbc7f62298a [4.01 seconds, 367 bytes] +● completed pattern species_summary +▶ dispatched branch species_predictions_1598bb4431372f32 +● completed target combined_predictions [4.012 seconds, 370 bytes] +▶ dispatched branch species_predictions_6b9109ba2e9d27fd +● completed branch species_predictions_1598bb4431372f32 [4.014 seconds, 11.585 kilobytes] +▶ dispatched branch species_predictions_625f9fbc7f62298a +● completed branch species_predictions_6b9109ba2e9d27fd [4.01 seconds, 6.25 kilobytes] +● completed branch 
species_predictions_625f9fbc7f62298a [4.007 seconds, 9.628 kilobytes] +● completed pattern species_predictions +▶ ended pipeline [19.363 seconds] ``` Notice that although the time required to build each individual target is about 4 seconds, the total time to run the entire workflow is less than the sum of the individual target times! That is proof that processes are running in parallel **and saving you time**. diff --git a/quarto.md b/quarto.md index 1782fa2b..6036d294 100644 --- a/quarto.md +++ b/quarto.md @@ -89,37 +89,37 @@ tar_plan( path_to_file("penguins_raw.csv"), read_csv(!!.x, show_col_types = FALSE) ), - # Clean data - penguins_data = clean_penguin_data(penguins_data_raw), - # Build models - models = list( - combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + # Clean and group data + tar_group_by( + penguins_data, + clean_penguin_data(penguins_data_raw), + species ), - # Get model summaries + # Get summary of combined model with all species together + combined_summary = model_glance(penguins_data), + # Get summary of one model per species tar_target( - model_summaries, - glance_with_mod_name(models), - pattern = map(models) + species_summary, + model_glance(penguins_data), + pattern = map(penguins_data) ), - # Get model predictions + # Get predictions of combined model with all species together + combined_predictions = model_augment(penguins_data), + # Get predictions of one model per species tar_target( - model_predictions, - augment_with_mod_name(models), - pattern = map(models) + species_predictions, + model_augment(penguins_data), + pattern = map(penguins_data) ), # Generate report tar_quarto( penguin_report, path = "penguin_report.qmd", - quiet = FALSE, - packages = c("targets", "tidyverse") + quiet = FALSE ) ) +NA +NA ``` @@ -137,8 +137,7 @@ How 
does this work? The answer lies **inside** the `penguin_report.qmd` file. Let's look at the start of the file: - -```` markdown +````{.markdown} --- title: "Simpson's Paradox in Palmer Penguins" format: @@ -151,13 +150,18 @@ execute: ```{r} #| label: load #| message: false -targets::tar_load(penguin_models_augmented) -targets::tar_load(penguin_models_summary) +targets::tar_load( + c(combined_summary, + species_summary, + combined_predictions, + species_predictions + ) +) library(tidyverse) ``` -This is an example analysis of penguins on the Palmer Archipelago in Antarctica. +The goal of this analysis is to determine how bill length and depth are related in three species of penguins from Antarctica. ```` @@ -165,9 +169,9 @@ The lines in between `---` and `---` at the very beginning are called the "YAML The R code to be executed is specified by the lines between `` ```{r} `` and `` ``` ``. This is called a "code chunk", since it is a portion of code interspersed within prose text. -Take a closer look at the R code chunk. Notice the two calls to `targets::tar_load()`. Do you remember what that function does? It loads the targets built during the workflow. +Take a closer look at the R code chunk. Notice the use of `targets::tar_load()`. Do you remember what that function does? It loads the targets built during the workflow. 
-Now things should make a bit more sense: `targets` knows that the report depends on the targets built during the workflow, `penguin_models_augmented` and `penguin_models_summary`, **because they are loaded in the report with `tar_load()`.** +Now things should make a bit more sense: `targets` knows that the report depends on the targets built during the workflow like `combined_summary` and `species_summary` **because they are loaded in the report with `tar_load()`.** ## Generating dynamic content @@ -177,13 +181,13 @@ The call to `tar_load()` at the start of `penguin_report.qmd` is really the key ## Challenge: Spot the dynamic contents -Read through `penguin_report.qmd` and try to find instances where the targets built during the workflow (`penguin_models_augmented` and `penguin_models_summary`) are used to dynamically produce text and plots. +Read through `penguin_report.qmd` and try to find instances where the targets built during the workflow (`combined_summary`, etc.) are used to dynamically produce text and plots. :::::::::::::::::::::::::::::::::: {.solution} -- In the code chunk labeled `results-stats`, statistics from the models like *P*-value and adjusted *R* squared are extracted, then inserted into the text with in-line code like `` `r mod_stats$combined$r.squared` ``. +- In the code chunk labeled `results-stats`, statistics from the models like *R* squared are extracted, then inserted into the text with in-line code like `` `r combined_r2` ``. -- There are two figures, one for the combined model and one for the separate model (code chunks labeled `fig-combined-plot` and `fig-separate-plot`, respectively). These are built using the points predicted from the model in `penguin_models_augmented`. +- There are two figures, one for the combined model and one for the separate models (code chunks labeled `fig-combined-plot` and `fig-separate-plot`, respectively). 
These are built using the points predicted from the model in `combined_predictions` and `species_predictions`. ::::::::::::::::::::::::::::::::::