Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic implementation of OrdinalEncoder. #5646

Merged
merged 24 commits into from
Nov 21, 2023

Conversation

trivialfis
Copy link
Member

@trivialfis trivialfis commented Nov 6, 2023

  • Implement OrdinalEncoder.
  • Implement dask version.
  • Fix dask transformers with DataFrame input by using dask_cudf to construct return df.

Some other scikit-learn features are not available yet, for instance, encoded_missing_value, min_frequency, and max_categories.

The implementation is mostly based on the existing one hot encoder and label encoder.

I'm a bit confused by the output_type parameter and not sure how strictly it's enforced. I looked around, it seems some estimators can ignore this parameter in their returns. Would be great if there's a guideline on how to handle this parameter, along with #5645 .

Close #4456 .

- Implement `OrdinalEncoder`.
- Implement dask version.
- Fix dask transformers with DataFrame input by using `dask_cudf` to construct return df.
@trivialfis trivialfis added feature request New feature or request non-breaking Non-breaking change labels Nov 6, 2023
@trivialfis trivialfis requested a review from a team as a code owner November 6, 2023 21:10
@github-actions github-actions bot added the Cython / Python Cython or Python issue label Nov 6, 2023
@trivialfis
Copy link
Member Author

I think the failure with dask is not related to this PR.

@trivialfis trivialfis added the 3 - Ready for Review Ready for review by team label Nov 14, 2023
@csadorf csadorf self-assigned this Nov 14, 2023
@csadorf
Copy link
Contributor

csadorf commented Nov 14, 2023

I'm a bit confused by the output_type parameter and not sure how strictly it's enforced. I looked around, it seems some estimators can ignore this parameter in their returns. Would be great if there's a guideline on how to handle this parameter, along with #5645 .

I don't think the value should be ignored, but I am not sure how consistently it is being used within the dask implementation. To the best of my understanding, it is implemented correctly here.

Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really only found some superficial issues and would like to see one more test to explicitly test array inputs. Other than that, I ran some local tests to check whether the output_types are actually respected and it seems to work as expected. Nice job!

python/cuml/dask/preprocessing/encoders.py Outdated Show resolved Hide resolved
python/cuml/dask/preprocessing/encoders.py Outdated Show resolved Hide resolved
python/cuml/dask/preprocessing/encoders.py Outdated Show resolved Hide resolved
python/cuml/preprocessing/ordinalencoder_mg.py Outdated Show resolved Hide resolved
python/cuml/tests/test_ordinal_encoder.py Outdated Show resolved Hide resolved
python/cuml/dask/preprocessing/encoders.py Outdated Show resolved Hide resolved
@csadorf csadorf added 4 - Waiting on Author Waiting for author to respond to review and removed 3 - Ready for Review Ready for review by team labels Nov 14, 2023
@trivialfis
Copy link
Member Author

I think an updated cudf is causing the device ordinal error:

  - cudf                    23.12.00a140  cuda12_py310_231107_g2463b3ad53_140  rapidsai-nightly                    
  + cudf                    23.12.00a697  cuda12_py310_231115_g8a0a08f34f_697  rapidsai-nightly/linux-64        8MB
  - dask-cuda                23.12.00a26  py310_231107_ge5b240c_26             rapidsai-nightly                    
  + dask-cuda                23.12.00a28  py310_231115_gd026d6e_28             rapidsai-nightly/linux-64      189kB
  - dask-cudf               23.12.00a140  cuda12_py310_231107_g2463b3ad53_140  rapidsai-nightly                    
  + dask-cudf               23.12.00a697  cuda12_py310_231115_g8a0a08f34f_697  rapidsai-nightly/linux-64      137kB

@trivialfis
Copy link
Member Author

After the first review, I picked some fixes from docformatter. However, due to the use of generated documents, it's not easy to simply apply all changes or enforce them in CI. But that could be something interesting in the long term.

@csadorf csadorf added 4 - Waiting on Reviewer Waiting for reviewer to review or respond and removed 4 - Waiting on Author Waiting for author to respond to review labels Nov 16, 2023
@trivialfis
Copy link
Member Author

This is ready for another review. :-)

Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing my comments!

@csadorf
Copy link
Contributor

csadorf commented Nov 21, 2023

/merge

@rapids-bot rapids-bot bot merged commit 21fbf04 into rapidsai:branch-23.12 Nov 21, 2023
49 checks passed
@trivialfis trivialfis deleted the ordinal-encoder branch November 21, 2023 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Waiting on Reviewer Waiting for reviewer to review or respond Cython / Python Cython or Python issue feature request New feature or request non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[QST] OrdinalEncoder
2 participants