
Add fit_transform methods? #546

Open
CameronBieganek opened this issue May 22, 2020 · 4 comments
Labels
design discussion Discussing design issues enhancement New feature or request

Comments

@CameronBieganek

Sometimes it's handy to have a combined fit_transform method. For example, right now I'm doing my one-hot encoding once at the beginning of my analysis (not in a pipeline). This is safe to do, since it's a static transformer that doesn't learn from the data. So for this use case, it would be cool if I could do something like this:

Xencoded = fit_transform(OneHotEncoder(), X)

Or perhaps it would make sense to just overload transform?

Xencoded = transform(OneHotEncoder(), X)
@OkonSamuel OkonSamuel added the enhancement New feature or request label May 22, 2020
@ablaom
Member

ablaom commented May 26, 2020

Thanks for that. Yes, scikit-learn has this, so you are presumably not the first to wonder about it.

Users can sometimes get confused when there are a lot of methods and many ways to do the same thing. In balancing convenience against simplicity, I feel some reluctance to do this.
However, there is also a performance consideration, which you did not raise: it is sometimes cheaper to fit and transform in one go, so we may want to allow model-specific implementations of such methods as well.
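The fallback-plus-override idea can be sketched in Python (names here are illustrative, not MLJ's actual API): a generic fit_transform that defaults to fit followed by transform, which a particular model may override when a fused single pass is cheaper.

```python
# Hypothetical sketch of the pattern under discussion, not MLJ's API.

class Transformer:
    def fit(self, X):
        raise NotImplementedError

    def transform(self, X):
        raise NotImplementedError

    def fit_transform(self, X):
        # Generic fallback: fit, then transform.
        return self.fit(X).transform(X)


class Scaler(Transformer):
    """Centres columns on their means."""

    def fit(self, X):
        n = len(X)
        # Learn one mean per column.
        self.means_ = [sum(col) / n for col in zip(*X)]
        return self

    def transform(self, X):
        return [[x - m for x, m in zip(row, self.means_)] for row in X]

    def fit_transform(self, X):
        # Model-specific override: a real implementation could reuse
        # intermediate quantities from fitting to save a pass over X.
        self.fit(X)
        return self.transform(X)
```

Models without an override would simply inherit the fit-then-transform fallback.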

I wonder what others think of the suggestion?

@ablaom ablaom added the design discussion Discussing design issues label May 26, 2020
@CameronBieganek
Author

I kind of like the approach of overloading transform for this. (If that's possible given the current method table.) I think it has a couple of advantages:

  • It doesn't introduce a new function to the API.
  • For some transformers the split between fit! and transform seems artificial. E.g., what does it mean to fit a OneHotEncoder? It seems like OneHotEncoder is really just a transformer with no fitting required.
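The suggested overload might look something like the following Python sketch (purely hypothetical; it does not reflect MLJ's existing method table): calling transform directly on a static transformer, with no prior fitting step.

```python
import math

class LogTransform:
    """A static transformer: it learns nothing from the data."""
    is_static = True

    def apply(self, X):
        return [math.log(x) for x in X]

def transform(model, X):
    # For static transformers, no fitted state is needed, so the
    # one-argument-plus-data call can apply the model directly.
    if getattr(model, "is_static", False):
        return model.apply(X)
    raise TypeError("non-static transformers must be fitted before transform")

print(transform(LogTransform(), [1.0]))  # [0.0]
```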

@ablaom
Member

ablaom commented Jun 2, 2020

If we are going to do this, then I agree this might be the best option.

It seems the method table would allow this.

Still, I'd like to ponder this a bit more before committing.

@davidbp
Contributor

davidbp commented Dec 21, 2020

I do not see the motivation for removing fit for OneHotEncoder.

As @CameronBieganek mentions, sklearn has the option to fit_transform, but it also has the option to first fit and then transform.

    For some transformers the split between fit! and transform seems artificial. E.g., what does it mean to fit a OneHotEncoder? It seems like OneHotEncoder is really just a transformer with no fitting required.

I would argue otherwise. When you are doing cross-validation, fitting a preprocessor can be very handy. For example, one of the values of a categorical feature might appear only rarely. It might happen that the folds on which the OneHotEncoder is fitted do not contain that particular value, so the encoder simply skips creating a new column that might do more harm than good.

I do not have any counter-argument to adding a fit_transform method, but I would vote against removing the ability to fit preprocessors that adapt to the input data when fitted.
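The point about rare categorical levels can be made concrete with a minimal Python sketch (illustrative only, not MLJ's OneHotEncoder): fitting records the set of levels seen, so a level absent from the training fold produces no indicator column at transform time.

```python
# Toy one-hot encoder: the "fitting" step is learning the levels.

class OneHotEncoder:
    def fit(self, column):
        # Record the distinct levels present in the fitting data.
        self.levels_ = sorted(set(column))
        return self

    def transform(self, column):
        # One indicator column per *fitted* level; values unseen
        # during fitting map to all zeros instead of new columns.
        return [[int(v == level) for level in self.levels_] for v in column]

enc = OneHotEncoder().fit(["a", "b", "a"])  # rare level "c" absent from this fold
print(enc.levels_)            # ['a', 'b']
print(enc.transform(["c"]))   # [[0, 0]] -- no column is created for "c"
```

If the encoder were purely static, every fold could sprout a different set of columns, which is exactly the hazard described above.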

4 participants