added option to use sklearn's OneHotEncoder to handle unknown categories #174

Closed
wants to merge 12 commits into from

@Vaseekaran-V Vaseekaran-V commented Sep 3, 2024

This library is amazing. I noticed a small issue when using Multiple Correspondence Analysis: because the function uses pd.get_dummies internally to one-hot encode the data, I got an error when my test set contained categories, in certain categorical features, that were not present in the training set.

Therefore, I have initialized a OneHotEncoder object from sklearn.preprocessing to process the data, if the user wants to opt out of using the get_dummies function.

These are the three attributes that I have specified:

  • get_dummies: if True, uses the original get_dummies method (default: False)
  • one_hot_encoder: the OneHotEncoder object
  • is_one_hot_fitted: boolean indicating whether one_hot_encoder has been fitted
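
As a side note on the fitted flag, the fitted state can also be read from the encoder itself rather than tracked in a separate boolean. The sketch below assumes scikit-learn's `check_is_fitted` utility; `encoder_is_fitted` is a hypothetical helper name and is not part of this PR.

```python
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils.validation import check_is_fitted


def encoder_is_fitted(encoder):
    """Return True if the estimator has been fitted (hypothetical helper)."""
    try:
        check_is_fitted(encoder)
    except NotFittedError:
        return False
    return True


encoder = OneHotEncoder(handle_unknown="ignore")
fitted_before = encoder_is_fitted(encoder)  # encoder has not seen data yet
encoder.fit([["a"], ["b"]])
fitted_after = encoder_is_fitted(encoder)  # fit() has set categories_
print(fitted_before, fitted_after)
```

Deriving the state this way avoids the flag and the encoder drifting out of sync, at the cost of a try/except on each call.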

I have updated the _prepare function as well:

def _prepare(self, X):
    if self.one_hot:
        if self.get_dummies:
            X = pd.get_dummies(X, columns=X.columns)
            return X
        else:
            if not self.is_one_hot_fitted:
                X_enc = self.one_hot_encoder.fit_transform(X)
                X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                self.is_one_hot_fitted = True
                return X_enc
            else:
                X_enc = self.one_hot_encoder.transform(X)
                X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                return X_enc
    return X

Let me know if there is anything else I can do, or whether the workings are correct.

Thanks again for this great library <3

@MaxHalford
Owner

Thanks for starting this PR! This is a tricky topic. Have you tried running the unit tests? I think this will fail due to supplementary columns... I have booked some time on my calendar to look into this. I'll let you know.

@MaxHalford
Owner

And thanks for the appreciation :)

@Vaseekaran-V
Author

> Thanks for starting this PR! This is a tricky topic. Have you tried running the unit tests? I think this will fail due to supplementary columns... I have booked some time on my calendar to look into this. I'll let you know.

Hi, thank you. I didn't try the unit tests, and as you said, the unit tests are failing. Please let me know if there is anything that I can do, and also, may I know the reason for having supplementary columns?

@Vaseekaran-V
Author

I modified the mca file to handle unknown features. The unit tests were failing because features seen during fit were not seen during transform, so I updated the _prepare function in mca.py:

def _prepare(self, X):
    if self.one_hot:
        if self.get_dummies:
            X = pd.get_dummies(X, columns=X.columns)
            return X
        else:
            if not self.is_one_hot_fitted:
                # Encoder not fitted yet: fit it and mark it as fitted.
                X_enc = self.one_hot_encoder.fit_transform(X)
                X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                self.is_one_hot_fitted = True
                return X_enc
            else:
                # Compare the columns passed in with the columns the encoder was fitted on.
                oh_cols = set(self.one_hot_encoder.feature_names_in_.tolist())
                X_cols = set(X.columns.tolist())

                if oh_cols == X_cols:
                    # Same columns as at fit time: transform with the fitted encoder.
                    X_enc = self.one_hot_encoder.transform(X)
                    X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                    return X_enc
                else:
                    # Different columns from fit time: fit the encoder again (needed for the unit tests).
                    print(X_cols)  # debug output
                    print(oh_cols)  # debug output
                    X_enc = self.one_hot_encoder.fit_transform(X)
                    X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                    return X_enc
    return X

I checked with the unit tests and didn't have issues on my side. Please let me know if this works.
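
For comparison, one commonly used way to keep the fit-time column layout while staying with pd.get_dummies is to reindex the encoded frame against the columns recorded during fit. This is a sketch under stated assumptions: it is not taken from this PR or from the library, and `fit_columns` and `encode_like_fit` are hypothetical helper names.

```python
import pandas as pd


def fit_columns(X):
    """Record the dummy columns produced at fit time (hypothetical helper)."""
    return pd.get_dummies(X, columns=X.columns).columns


def encode_like_fit(X, fitted_columns):
    """Encode X, then align it to the fit-time layout (hypothetical helper)."""
    dummies = pd.get_dummies(X, columns=X.columns)
    # Categories unseen at fit time are dropped; categories missing from X
    # are added back as all-False columns.
    aligned = dummies.reindex(columns=fitted_columns, fill_value=False)
    return aligned.astype(int)


# Hypothetical toy data: "green" and "M" are unseen at fit time.
train = pd.DataFrame({"color": ["red", "blue"], "size": ["S", "L"]})
test = pd.DataFrame({"color": ["green", "blue"], "size": ["S", "M"]})

fitted_columns = fit_columns(train)
test_encoded = encode_like_fit(test, fitted_columns)
print(test_encoded)
```

Unlike re-fitting the encoder when the columns differ, this keeps the output shape fixed to what was seen during fit.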

@MaxHalford
Owner

OK, thanks for looking into it. I will take a good look! I also want to make sure the change you're bringing resolves this issue.

@Vaseekaran-V
Author

> OK, thanks for looking into it. I will take a good look! I also want to make sure the change you're bringing resolves this issue.

Sure, thank you. I saw the error in the clean code test and made a change.

@Vaseekaran-V
Author

Hi @MaxHalford, is there any update to this?

@MaxHalford
Owner

Hey @Vaseekaran-V! I finally carved out some time to look into this. It turns out I found a simpler solution in #181.

@MaxHalford MaxHalford closed this Nov 17, 2024