[BUG] Index.difference does not uniquify output for duplicate indexes #14489

Closed
wence- opened this issue Nov 23, 2023 · 2 comments · Fixed by #14789
Labels
bug Something isn't working good first issue Good for newcomers Python Affects Python cuDF API.

Comments

@wence-

wence- commented Nov 23, 2023

Describe the bug

cudf computes Index.difference with a leftanti join. This preserves duplicates in the left index. In contrast, pandas always produces an index with uniquified values.
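The gap between the two behaviours can be modelled in plain Python. This is only a sketch: `left_anti_join` and `pandas_style_difference` are hypothetical helper names (not cuDF or pandas functions), and lists stand in for indexes.

```python
def left_anti_join(left, right):
    """Keep every left value with no match on the right; duplicates survive."""
    right_keys = set(right)
    return [x for x in left if x not in right_keys]


def pandas_style_difference(left, right):
    """Model of the pandas result: unique left values not in right, sorted."""
    return sorted(set(left) - set(right))


left, right = [1, 1, 2, 2], [2]
print(left_anti_join(left, right))           # [1, 1]
print(pandas_style_difference(left, right))  # [1]
```

The anti-join filters rows without touching multiplicity, which is why the duplicate `1` carries through.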

Steps/Code to reproduce bug

import cudf
import pandas as pd

left = pd.Index([1, 1, 2, 2])
right = pd.Index([2])

# pandas drops the duplicate 1 from the result
print(left.difference(right))
cleft = cudf.from_pandas(left)
cright = cudf.from_pandas(right)
# cudf keeps the duplicate 1
print(cleft.difference(cright))

Expected behavior

Match pandas, either by calling drop_duplicates or using libcudf.search.contains (see also #14487).
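The two suggested routes can be sketched in plain Python to show that they should agree. Everything here is an assumption made for illustration: the helper names are hypothetical, lists stand in for indexes, and the membership mask only approximates what a contains-style search would return. It is not the code that was eventually merged.

```python
def difference_via_dedup(left, right):
    """Route 1: anti-join first, then drop duplicates from the result."""
    right_keys = set(right)
    kept = [x for x in left if x not in right_keys]  # anti-join step
    return sorted(dict.fromkeys(kept))               # dedupe, then sort


def difference_via_contains(left, right):
    """Route 2: dedupe the left side, then filter with a membership mask."""
    unique_left = list(dict.fromkeys(left))
    right_keys = set(right)
    mask = [x in right_keys for x in unique_left]    # contains-style mask
    return sorted(x for x, hit in zip(unique_left, mask) if not hit)


left, right = [1, 1, 2, 2], [2]
print(difference_via_dedup(left, right))     # [1]
print(difference_via_contains(left, right))  # [1]
```

Route 2 deduplicates before filtering, so the membership test runs over fewer values when the left index has many repeats.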

Notes

Doesn't apply to MultiIndex right now, because that goes through pandas (although that's a perf bug that should be fixed).

@wence- wence- added bug Something isn't working Needs Triage Need team to review and classify pandas Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Nov 23, 2023
@wence- wence- added the good first issue Good for newcomers label Nov 24, 2023
@amiralimi

Hi @wence-. Can I fix this issue?

@wence-

wence- commented Jan 8, 2024

> Hi @wence-. Can I fix this issue?

Please go ahead! When you open a PR please tag me so I don't miss it, thanks!

rapids-bot bot pushed a commit that referenced this issue Jan 25, 2024
This PR fixes an error in `Index.difference` where the function keeps duplicate elements while pandas removes the duplicates. The tests had no inputs with duplicates, so I added new tests too (I added the test from the original issue). 

- closes #14489

Authors:
  - AmirAli Mirian (https://github.com/amiralimi)
  - Ashwin Srinath (https://github.com/shwina)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #14789