Add Ordinal encoder component #1389

Closed
dsherry opened this issue Nov 2, 2020 · 3 comments
Assignees
Labels
good first issue Issues which would be a good starting point for new hires. needs design Issues requiring design documentation. new feature Features which don't yet exist. spike To generate additional issues and kick off a sprint.

Comments

@dsherry
Contributor

dsherry commented Nov 2, 2020

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html

Will need to run perf tests. Ideally, we can come up with some key examples where ordinal encoding outperforms one-hot encoding.
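As a starting point for those perf tests, here is a minimal sketch of how the two scikit-learn encoders differ on a single ordered feature. The `sizes` data and the `order` list are made up for illustration; nothing here reflects an existing EvalML component.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy feature with a natural order: small < medium < large.
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])
order = [["small", "medium", "large"]]

# Ordinal encoding keeps one column and maps each value to its rank.
ordinal = OrdinalEncoder(categories=order).fit_transform(sizes)

# One-hot encoding produces one column per category and drops the ordering.
one_hot = OneHotEncoder(categories=order).fit_transform(sizes).toarray()

print(ordinal.ravel())  # ranks following the order given above
print(one_hot.shape)    # one column per category
```

The ordinal version stays at one column regardless of cardinality, which is the property worth measuring against one-hot on tree-based and linear estimators.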

@dsherry dsherry added new feature Features which don't yet exist. needs design Issues requiring design documentation. labels Nov 2, 2020
@dsherry
Contributor Author

dsherry commented Nov 2, 2020

Added "needs design" because we should write out how this would be used. Users need to be able to select which categorical features have an ordering and which don't, and that may require Woodwork support first.

@dsherry dsherry added good first issue Issues which would be a good starting point for new hires. spike To generate additional issues and kick off a sprint. labels Jun 18, 2021
@asniyaz asniyaz self-assigned this Feb 16, 2022
@exalate-issue-sync exalate-issue-sync bot assigned DavidQi and unassigned asniyaz Mar 16, 2022
@thehomebrewnerd
Contributor

@dsherry This issue came up recently in some experiments I have been doing. In reviewing the results with @rpeck and @rwedge, we noticed that several ordinal columns were being encoded as regular categorical columns by the EvalML OneHotEncoder, so we would get a feature such as MONTH(Created)_9 for the ninth month of the year. @rpeck suggested we should not be encoding Ordinal columns in this manner.

Any Woodwork columns that are ordered should be specified with the Ordinal logical type. Setting a column as Ordinal in Woodwork requires the order of the values to be defined, and the pandas dtype is set to CategoricalDtype with ordered=True.

As a concrete example of this, the Featuretools Month primitive outputs an Ordinal column in the feature matrix with the following dtype:

CategoricalDtype(categories=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], ordered=True)

Between the Woodwork logical type and the pandas dtype ordering, it seems like there should be enough information present to determine which columns should have Ordinal encoding applied.
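A rough sketch of that detection using only the pandas dtype, assuming the ordered flag is set as described above. The `ordered_categorical_columns` helper, the `color` column, and the sample values are hypothetical, and this does not consult the Woodwork logical type at all.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same dtype the Featuretools Month primitive is described as producing.
month_dtype = CategoricalDtype(categories=list(range(1, 13)), ordered=True)

# Hypothetical feature matrix: one ordered column, one unordered column.
df = pd.DataFrame(
    {
        "MONTH(Created)": pd.Series([9, 1, 12], dtype=month_dtype),
        "color": pd.Series(["red", "blue", "red"], dtype="category"),
    }
)


def ordered_categorical_columns(frame):
    """Return names of columns whose dtype is an ordered CategoricalDtype."""
    return [
        name
        for name in frame.columns
        if isinstance(frame[name].dtype, CategoricalDtype)
        and frame[name].dtype.ordered
    ]


ordinal_cols = ordered_categorical_columns(df)

# .cat.codes gives each value's position in the declared category order.
encoded = df[ordinal_cols].apply(lambda s: s.cat.codes)
print(ordinal_cols)
print(encoded)
```

Columns not returned by the helper (here, `color`) would continue through the existing one-hot path.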

@gsheni FYI

@gsheni
Copy link
Contributor

gsheni commented May 17, 2022

@chukarsten @asniyaz Can we prioritize this and add it to the next EvalML sprint? It is affecting our current work


6 participants