More details of formula usage in mgcv engine docs when using workflow #770

Closed
qiushiyan opened this issue Jul 15, 2022 · 5 comments · Fixed by #1015

Comments

@qiushiyan
Contributor

qiushiyan commented Jul 15, 2022

We need to include more details about using the GAM formula in the engine docs for gen_additive_mod(engine = "mgcv"). The engine docs only show model-fitting examples that pass the GAM formula to fit() directly. When using a workflow with a recipe, the GAM formula needs to be declared in add_model() alongside the model spec:

# no inline function in recipe
rec <- recipe(formula = mpg ~ ., data = mtcars)
spec <- gen_additive_mod() %>% 
    set_engine("mgcv")

wf <- workflow() %>% 
    add_recipe(rec) %>%  
    add_model(spec, formula = mpg ~ wt + gear + cyl + s(disp, k = 10))  # use gam formula here
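
For completeness, a minimal sketch of fitting and predicting with such a workflow. It assumes the recipes, parsnip, workflows, and mgcv packages are installed, and it adds set_mode("regression"), which is not in the snippet above; the object names wf_fit and preds are only illustrative.

```r
library(recipes)
library(parsnip)
library(workflows)

# Recipe without any inline functions such as s()
rec <- recipe(mpg ~ ., data = mtcars)

# Assumption: an explicit regression mode is set before fitting
spec <- gen_additive_mod() %>%
  set_mode("regression") %>%
  set_engine("mgcv")

# The GAM formula with the smooth term is passed to add_model()
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec, formula = mpg ~ wt + gear + cyl + s(disp, k = 10))

wf_fit <- fit(wf, data = mtcars)
preds <- predict(wf_fit, new_data = mtcars)
```

Here the recipe decides which columns are available, while the formula given to add_model() decides how mgcv uses them.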
@simonpcouch
Contributor

A relevant Community post with reprex: https://community.rstudio.com/t/error-in-fit-xy-with-gam-model/143065

@Steviey

Steviey commented Aug 12, 2022

+1

@siavash-babaei

siavash-babaei commented Mar 7, 2023

Assume we have a response variable, outcome, one numerical predictor, pred_num, and one categorical variable, prec_fac.

Assume GAM formula is:

gam_formula <- "outcome ~ ." |> as.formula()

Then, you preprocess it through recipes with:

data_recipe <- recipes::recipe(
  formula = gam_formula,
  data    = data_train
) |>
  recipes::step_dummy(prec_fac)
  # Other steps would be piped on here ...

# Train the recipe
data_recipe_prep <- data_recipe |>
  recipes::prep(training = data_train)

# Apply to training data
data_train_prep <- data_recipe_prep |>
  recipes::bake(new_data = NULL)

# Apply to test data
data_test_prep <- data_recipe_prep |>
  recipes::bake(new_data = data_test)

For things to work elsewhere, say in tune::tune_grid(), you need to pass the following as the formula argument of workflows::add_model():

formula_alt = gam_formula |> terms.formula(data = data_train_prep)

So, whenever we have categorical variables in the model formula, you would need to manually preprocess data and use the terms from that.
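
To illustrate what terms.formula() does with the dot, here is a small base-R sketch. The data frame is a made-up stand-in for the baked training data, and the dummy column name prec_fac_b is an assumption about what step_dummy() would produce.

```r
# Hypothetical stand-in for the baked training data
data_train_prep <- data.frame(
  outcome    = c(1.2, 3.4, 2.1, 4.8),
  pred_num   = c(0.1, 0.5, 0.3, 0.9),
  prec_fac_b = c(0, 1, 0, 1)
)

gam_formula <- as.formula("outcome ~ .")

# Supplying data lets terms.formula() expand the dot into explicit column names
formula_alt <- terms.formula(gam_formula, data = data_train_prep)

# Predictor names after expansion
term_labels <- attr(formula_alt, "term.labels")
print(term_labels)
```

The expanded terms refer to the dummy columns created by the recipe, not to the original factor, which is why the formula has to be built from the preprocessed data.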

This change of formulae in particular is very confusing and could potentially cause serious inconsistencies. Where do you use gam_formula vs. formula_alt, and how would it affect a complex workflow? I hope this gets addressed soon.

@simonpcouch
Contributor

This may be a workflows or hardhat change rather than a parsnip one, but it might be worth looking out for indicative input in add_formula() or add_recipe() and warning if the formula looks like it might need to be passed as a model formula but add_model(formula) is missing. This is a bit tough since the add_*() functions should be callable in either order, so maybe the check waits until fit.workflow() is triggered.
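
As a rough illustration of what "indicative input" could mean, the base-R sketch below flags formulas that call an mgcv smooth constructor. The helper has_smooth_terms() is hypothetical and is not part of workflows, hardhat, or parsnip; it only shows one possible heuristic.

```r
# Smooth constructors from mgcv that suggest a model formula is needed
smooth_funs <- c("s", "te", "ti", "t2")

# Hypothetical helper: TRUE if the formula calls any smooth constructor
has_smooth_terms <- function(f) {
  # Names used as functions = all names minus those used as variables
  called <- setdiff(all.names(f), all.vars(f))
  any(smooth_funs %in% called)
}

has_smooth_terms(mpg ~ wt + s(disp, k = 10))
has_smooth_terms(mpg ~ wt + disp)
```

A heuristic like this would miss a column that shares its name with a smooth constructor, so it could only ever back a warning, not an error.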


This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Nov 21, 2023