-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove intercept coming from workflows based on engine encodings #1033
Conversation
@@ -45,6 +45,10 @@ | |||
rlang::abort("`composition` should be either 'data.frame' or 'matrix'.") | |||
} | |||
|
|||
if (remove_intercept) { | |||
data <- data[, colnames(data) != "(Intercept)", drop = FALSE] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use a a regex or more general method for determining what should be taken out?
I think that it is unlikely to happen but, depending on how you make the data, you could get multiple columns with nearly the same names:
mat_1 <- model.matrix( ~ ., data = mtcars[, 2:4])
colnames(mat_1)
#> [1] "(Intercept)" "cyl" "disp" "hp"
mat_2 <- model.matrix( ~ ., data = as.data.frame(mat_1))
colnames(mat_2)
#> [1] "(Intercept)" "`(Intercept)`" "cyl" "disp"
#> [5] "hp"
colnames(mat_2) != "(Intercept)"
#> [1] FALSE TRUE TRUE TRUE TRUE
Created on 2023-12-07 with reprex v2.0.2
Maybe something like:
col_names <- c("(Intercept)", "`(Intercept)`", "cyl", "disp", "hp")
grepl("^[[:punct:]]+Intercept[[:punct:]]+", col_names)
#> [1] TRUE TRUE FALSE FALSE FALSE
Created on 2023-12-07 with reprex v2.0.2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did run into this.
I think the only time that we would trigger this (as opposed to somebody including a predictor named (Intercept)
or the like) is when we send the intercept coming from workflows into model.matrix()
. We only do that in .convert_form_to_xy_fit()
.
We could let that happen and just clean up "more thoroughly"via such a regrex afterwards. That is not a good idea because then that intercept-as-predictor is stored in the terms
which are re-used during prediction. That's what caused tidymodels/censored#272. Therefore, we need to remove the intercept coming from workflows before we make the terms.
As for generally beefing up and using the regrex instead of the direct name match as we have it now: I'm inclined to leave it as is because I think it might be helpful if such a case errors rather than "gets fixed". I am relatively confident that we now properly clean up after workflows. If not, I'd like to see it fail and investigate. If there is an additional intercept to clean up via the regrex that's not coming from workflows, maybe that should error too?
This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue. |
closes #1032
parsnip captures in its encodings (example) what to do about the intercept for a specific engine. In a "pure parsnip use case" this only matters when parsnip has to convert data between its own formula interface and the engine's matrix interface, covered in
.convert_form_to_xy()
. (The reason why we might have bothinclude_intercept
andremove_intercept
asTRUE
is because the indicators for factor variables are affected by the absence/presence of the intercept.)When a parsnip model is used inside of a workflow together with a formula as the preprocessor, the encodings matter again (they do not matter for other kinds of preprocessors):
In workflow's pre stage:
finalize_blueprint()
encodesinclude_intercept
in the blueprint (if not supplied by the user, which is the default)fit.action_formula()
uses the (preprocessing) formula and that blueprint to make the mold (containing the outcomes and the predictors) - this may now contain an intercept based on the encodings.In workflow's fit stage:
fit()
if the workflow contains a model formulafix_xy()
otherwiseIn parsnip:
xy_xy()
xy_form()
form_form()
form_xy()
That was all for fitting a model - they matter again when predicting with a model.
For "pure parsnip" and the formula-to-matrix conversion case,
.convert_form_to_xy_new()
deals with the intercept based on the encodings.In workflows, predictors are
forge()
ed based on the blueprint generated during the fit process, containing the information on whether to include the intercept.We've reckoned with that intercept coming in from workflows previously:
xy_form()
: Add argument for one hot encoding to parsnip #332, among many other things, adds a step inside of.convert_xy_to_form_fit()
to remove the intercept from the predictor matrix (as always, based on the encodings) before putting it together with the outcome into a data frame and sending it off with ay ~ .
formula. Thatdot
will mean the engine may add its own intercept as appropriate.xy_xy()
, remove double intercepts for glmnet models #352 adds that encoding-based removal step.And then we touched it once more, in the follow-up PR #353 where we added that step to
prepare_data()
which is used in parsnip's predict call stack, to remove the intercept added by workflow's predict method.Leaves us with the
form_xy()
andform_form()
cases. We've now found that one-too-many intercept forform_xy()
in this comment and forform_form()
in tidymodels/workflows#210.So, after a lot of sleuthing and diagramming, this PR adds that encoding-based removal step to both those cases!
As for prediction: we don't need to touch
prepare_data()
because forxy_form()
case for fitting):.convert_xy_to_form_new()
subsets onx_var
from$preproc
which was generated after the intercept was removed..convert_form_to_xy_new()
makes the model frame with theterms
from$preproce
which were (with this PR) now also generated after the intercept from workflows was removed.Tests are in tidymodels/extratests#151