-
-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Figure out a way to avoid models[[1]]
#43
Comments
This is tricky, because non-standard evaluation is used to define models for Furthermore, the design of branching in {targets} nudges us to use dataframes (or tibbles) as targets. So when designing custom functions that will be used in branching, it helps to think of how the function will work on one row of input. We can write a custom function that looks clean in the final plan and produces clean output (a tidy dataframe), but the contents of the function are rather complicated since it has to work with a one-row dataframe as input. This will be tedious to explain to novices (and it still requires indexing with Finally, the approach of including models as a list-column in a dataframe is a rather advanced topic. Anyways here is a sketch that builds models in a tibble, then branches over the rows of the tibble: source("R/packages.R")
source("R/functions.R")
summarize_model <- function(model_tibble) {
model_name <- model_tibble$model_name
model <- model_tibble$model[[1]]
glance(model) |>
mutate(model_name = model_name) |>
relocate(model_name, .before = 1)
}
tar_plan(
# Load raw data
tar_file_read(
penguins_data_raw,
path_to_file("penguins_raw.csv"),
read_csv(!!.x, show_col_types = FALSE)
),
# Clean data
penguins_data = clean_penguin_data(penguins_data_raw),
# Build models
models = tibble(
model_name = c("combined_model", "species_model", "interaction_model"),
model = list(
lm(bill_depth_mm ~ bill_length_mm, data = penguins_data),
lm(bill_depth_mm ~ bill_length_mm + species, data = penguins_data),
lm(bill_depth_mm ~ bill_length_mm * species, data = penguins_data)
)
),
# Get model summaries
tar_target(
model_summaries,
summarize_model(models),
pattern = map(models)
)
) I now realize a more natural introduction to branching would be to branch over different sets of input instead of different models. @multimeric keen to hear your thoughts! |
Update: just taught this workshop again, and this part is very difficult to teach since the details are so complicated. We should definitely use a simpler example for branching. Maybe not even use the models at all. |
NEW IDEA: instead of branching over the list of models, split up the original data set by species using |
I think that would be better. Anything that avoids using a list is good: even if we have something that relates to branching over a single vector would be better because it avoids changing the data type. |
Right... of course, the output of
That feels a little awkward because in a "production" situation you would only build the model once. But for teaching {targets} it's probably OK? It sure is easier to reason about with dataframe in and dataframe out. |
I've forgotten some of the context here, but branching over a list is not fundamentally a problem. Lists are familiar to users, teaching that we should use If we wanted to keep using a list of models, we could either put the model into a nested list, even better we could use attributes: tar_target(
models,
list(
lm(bill_depth_mm ~ bill_length_mm, data = penguins_data) |> structure(name="base"),
lm(bill_depth_mm ~ bill_length_mm + species, data = penguins_data) |> structure(name="species"),
lm(bill_depth_mm ~ bill_length_mm * species, data = penguins_data) |> structure(name="interaction")
),
iteration="list"
),
tar_target(
model_summaries,
summarize_model(models),
pattern = map(models)
) And then pull them out with: summarize_model <- function(model) {
model_name <- attr(model, "name")
glance(model) |>
mutate(model_name = model_name) |>
relocate(model_name, .before = 1)
} That said, I'm also happy to go with your grouped data frame idea, that also seems like an elegant solution. |
Closed by #51 |
maybe we could use a dataframe for storing modes instead?
The text was updated successfully, but these errors were encountered: