added option to use sklearn's OneHotEncoder to handle unknown categories #174

Vaseekaran-V · 2024-09-03T15:25:50Z

This library is amazing. I noticed a small issue when using Multiple Correspondence Analysis: because it uses pd.get_dummies internally to one-hot encode the data, I got an error when my test set contained categories, in certain categorical features, that were not present in the training set.

To address this, I added a OneHotEncoder object from sklearn.preprocessing that processes the data when the user opts out of the get_dummies function.

These are the three attributes that I have added:

get_dummies: if True, use the original get_dummies method (default: False)
one_hot_encoder: the OneHotEncoder object
is_one_hot_fitted: boolean indicating whether one_hot_encoder has been fitted

I have updated the _prepare function as well:

def _prepare(self, X):
    if self.one_hot:
        if self.get_dummies:
            # original behaviour: columns depend on the categories present in X
            X = pd.get_dummies(X, columns=X.columns)
            return X
        else:
            if not self.is_one_hot_fitted:
                # first call: fit the encoder and remember that it is fitted
                X_enc = self.one_hot_encoder.fit_transform(X)
                X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                self.is_one_hot_fitted = True
                return X_enc
            else:
                # later calls: reuse the categories learned during fitting
                X_enc = self.one_hot_encoder.transform(X)
                X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                return X_enc
    return X

Let me know if there is anything else I can do, or whether the implementation looks correct.

Thanks again for this great library <3

MaxHalford · 2024-09-07T18:31:18Z

Thanks for starting this PR! This is a tricky topic. Have you tried running the unit tests? I think this will fail due to supplementary columns... I have booked some time on my calendar to look into this. I'll let you know.

MaxHalford · 2024-09-07T18:31:27Z

And thanks for the appreciation :)

Vaseekaran-V · 2024-09-08T07:33:42Z

> Thanks for starting this PR! This is a tricky topic. Have you tried running the unit tests? I think this will fail due to supplementary columns... I have booked some time on my calendar to look into this. I'll let you know.

Hi, thank you. I hadn't run the unit tests, and as you said, they are failing. Please let me know if there is anything that I can do. Also, may I know the reason for having supplementary columns?

Vaseekaran-V · 2024-09-08T07:37:58Z

I modified mca.py to handle unknown features. The unit test error occurs because the features seen during fit are not the same as those seen when transforming, so I updated the _prepare function:

def _prepare(self, X):
    if self.one_hot:
        if self.get_dummies:
            X = pd.get_dummies(X, columns=X.columns)
            return X
        else:
            if not self.is_one_hot_fitted:
                # encoder not fitted yet: fit it and set is_one_hot_fitted to True
                X_enc = self.one_hot_encoder.fit_transform(X)
                X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                self.is_one_hot_fitted = True
                return X_enc
            else:
                # check whether the incoming columns match the columns the encoder was fitted on
                oh_cols = set(self.one_hot_encoder.feature_names_in_.tolist())
                X_cols = set(X.columns.tolist())

                if oh_cols == X_cols:
                    # same columns as during fitting: safe to transform
                    X_enc = self.one_hot_encoder.transform(X)
                    X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                    return X_enc
                else:
                    # different columns from fitting: fit the encoder again (needed for the unit tests)
                    print(X_cols)
                    print(oh_cols)
                    X_enc = self.one_hot_encoder.fit_transform(X)
                    X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                    return X_enc
    return X

I ran the unit tests and didn't see any issues on my side. Please let me know if this works.

MaxHalford · 2024-09-08T15:42:59Z

OK, thanks for looking into it. I will take a good look! I also want to make sure the change you're bringing resolves this issue.

Vaseekaran-V · 2024-09-08T16:00:10Z

> OK, thanks for looking into it. I will take a good look! I also want to make sure the change you're bringing resolves this issue.

Sure, thank you. I saw the error in the clean code test and made a change.

Vaseekaran-V · 2024-09-22T14:29:43Z

Hi @MaxHalford, is there any update to this?

MaxHalford · 2024-11-17T22:10:21Z

Hey @Vaseekaran-V! I finally carved out some time to look into this. It turns out I found a simpler solution in #181.

Vaseekaran-V added 5 commits September 3, 2024 09:27

added onehot encoder to handle unknown categorical values (which pd get dummies fail) (cc46520)
modified code to support one_hot attribute and original get_dummies method | added description (09d92d2)
fixed issue to get column names after using OneHotEncoder (37e0f59)
small issue in _prepare (didn't return the one-hot encoded values if the one_hot_encoder is fitted) (77e0603)
updated the mca notebook in docs/content (9166071)
fixed an issue to handle unknown columns during one hot encoding for MCA analysis (f87c843)

Vaseekaran-V added 5 commits September 8, 2024 13:15

fixed merging conflicts (af04b67)
Merge branch 'MaxHalford-master' | fixing issues during merge (1255661)
fixing merge issue in mca notebook in docs (e63d74b)
removed code lines kept for debugging (3bcff9c)
2 errors caused by print code for logging (bfc9179)
fixed a clean code issue (ef68f6b)

MaxHalford closed this Nov 17, 2024
