Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Convert some of the Wrap-up M4 content into exercise #731

Merged
merged 14 commits into from
Oct 27, 2023

Conversation

ArturoAmorQ
Copy link
Collaborator

@ArturoAmorQ ArturoAmorQ commented Oct 17, 2023

Fixes #707.

Follows #711, which made the logistic_regression_non_linear notebook redundant. This PR creates an exercise to show the use of feature engineering using a more realistic dataset. In particular, demonstrates feature interaction when using one-hot encoding.

Note: I had to build the exercises from the solutions to correctly render the index.

Comment on lines +83 to +90
# preprocessor.fit(data)
# feature_names = (
# preprocessor.named_transformers_["onehotencoder"].get_feature_names_out(
# categorical_columns
# )
# ).tolist()
# feature_names += numerical_columns
# feature_names
Copy link
Collaborator Author

@ArturoAmorQ ArturoAmorQ Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For info: I had to comment these lines by hand as it was rising a flake8 error F821 undefined name when building the exercise from the solution.

We may need to think of a better way to avoid this situation in the future.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add the F821 failure to the ignore list in the flake8 configuration. Since we run all the code of the notebooks, including the solutions when building the jupyterbook we should be safe. The only code we do not check automatically are the solutions to the wrap up quiz but they are in the private repo.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue here is that preprocessor is defined in the solution but not in the exercise. So I think it will raise an error when building the jupyterbook.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. We can keep it that way then.

Maybe add a comment to state to reuse the preprocessor variable defined in the solution of the previous question.

@ogrisel
Copy link
Collaborator

ogrisel commented Oct 18, 2023

Similarly to the private review made on the wrap up quiz, I am worried that fitting such a pipeline might require too much RAM and CPU on jupyterhub and as a result might crash or be painfully slow to execute. We should either trim the number of categorical features using min_frequency or max_categories in OneHotEncoder or alternatively use a Nystroem approximation with n_components between 100 and 1000.

Furthermore, I would also replace standard scaling by SplineTransformer to get a more expressive model.

@ArturoAmorQ ArturoAmorQ mentioned this pull request Oct 20, 2023
@ArturoAmorQ
Copy link
Collaborator Author

Furthermore, I would also replace standard scaling by SplineTransformer to get a more expressive model.

I decided not to do so because feature engineering actually degraded the performance of the spline model.

@ogrisel
Copy link
Collaborator

ogrisel commented Oct 26, 2023

Hum, it seems that I broken the jupyter preview...

@ArturoAmorQ
Copy link
Collaborator Author

I will have to merge this PR in its current state to possibly debug the synchro of notebooks and their impact on FUN. We can always iterate on it on future PRs.

@ArturoAmorQ ArturoAmorQ merged commit 008cff4 into INRIA:main Oct 27, 2023
2 checks passed
@ArturoAmorQ ArturoAmorQ deleted the move_Wrap_up_M4_content branch October 27, 2023 09:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Non linear feature engineering for logistic regression
2 participants