Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use drop_duplicates instead of unique for cudf's pandas compatibility mode #5639

Merged

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Nov 2, 2023

In pandas, Series.unique returns a numpy array (for non-extension types) while Series.drop_duplicates returns a Series. The two results should otherwise contain the same set of values. In cudf, historically both methods returned a Series, and at these stages in cuml's pipeline it knows that it is working with cudf objects. However, if cudf has pandas compatibility mode enabled, then unique will return an array to match pandas behavior. In this scenario, the method chaining no longer works because cupy is calling methods on the result of unique assuming that it will be a Series. To fix this, cuml needs to call drop_duplicates instead.

@vyasr vyasr added bug Something isn't working 3 - Ready for Review Ready for review by team Cython / Python Cython or Python issue non-breaking Non-breaking change labels Nov 2, 2023
@vyasr vyasr self-assigned this Nov 2, 2023
@vyasr vyasr requested a review from a team as a code owner November 2, 2023 22:41
Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle all good, however, I tried to check whether y is guaranteed to be a data frame at this point and was wondering whether there is a reason that we don't just deduplicate directly.

python/cuml/preprocessing/LabelEncoder.py Show resolved Hide resolved
@vyasr
Copy link
Contributor Author

vyasr commented Nov 3, 2023

/merge

@rapids-bot rapids-bot bot merged commit 0296043 into rapidsai:branch-23.12 Nov 3, 2023
50 checks passed
rapids-bot bot pushed a commit that referenced this pull request Nov 14, 2023
I accidentally committed but forgot to push some changes requested by @csadorf in #5639.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Simon Adorf (https://github.com/csadorf)

Approvers:
  - Simon Adorf (https://github.com/csadorf)

URL: #5648
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working Cython / Python Cython or Python issue non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants