Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix index difference to follow the pandas format #14789

Merged
merged 5 commits into from
Jan 25, 2024

Conversation

amiralimi
Copy link
Contributor

@amiralimi amiralimi commented Jan 19, 2024

Description

This PR fixes an error in Index.difference where the function keeps duplicate elements while pandas removes the duplicates. The tests had no inputs with duplicates, so I added new tests too (I added the test from the original issue).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@amiralimi amiralimi requested a review from a team as a code owner January 19, 2024 00:14
@amiralimi amiralimi requested review from vyasr and shwina January 19, 2024 00:14
Copy link

copy-pr-bot bot commented Jan 19, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the Python Affects Python cuDF API. label Jan 19, 2024
@vyasr
Copy link
Contributor

vyasr commented Jan 19, 2024

/ok to test

@vyasr vyasr added bug Something isn't working non-breaking Non-breaking change labels Jan 19, 2024
@vyasr
Copy link
Contributor

vyasr commented Jan 19, 2024

@wence- could you have a look at this when you get a chance? Thanks!

else:
other = other.copy(deep=False)
difference = cudf.core.index._index_from_data(
cudf.DataFrame._from_data({"None": self._column})
cudf.DataFrame._from_data({"None": self._column.unique()})
.merge(
cudf.DataFrame._from_data({"None": other._column}),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also call unique() on the right hand side to make the merge smaller?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this with a few test cases, in some it had a better performance, and in some, it was worse. But pandas does this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'd just add a note here to alert future readers about potential performance issues:

# NOTE: may need to investigate calling `unique()` on the LHS before the merge for better performance 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am somewhat surprised that calling unique on the right column sometimes improves performance. unique calls stable_distinct which builds a hash table to uniquify things.

The leftanti join builds a hash table for the right column and then probes that hash table with the left column to return those rows in the left column that are not in the hash table.

So calling unique() on the right column would just seem to be an extra hash-table build for no gain.

Can you show the test cases you ran to check performance @amiralimi ?

@shwina
Copy link
Contributor

shwina commented Jan 23, 2024

@amiralimi - this is looking good. Could you run the style check, resolve any style issues found, and push a new commit with the updated style? For this, using pre-commit is helpful:

pre-commit install
pre-commit run --all

@amiralimi
Copy link
Contributor Author

@shwina Thanks for helping out. I just ran pre-commit. This is my first time contributing to an open-source project, is there anything I need to do?

@shwina
Copy link
Contributor

shwina commented Jan 23, 2024

/ok to test

@shwina
Copy link
Contributor

shwina commented Jan 23, 2024

/ok to test

@shwina
Copy link
Contributor

shwina commented Jan 24, 2024

/ok to test

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks good to me! It would be nice to see the small performance tests you ran to see if uniquifying the right column was helpful.

else:
other = other.copy(deep=False)
difference = cudf.core.index._index_from_data(
cudf.DataFrame._from_data({"None": self._column})
cudf.DataFrame._from_data({"None": self._column.unique()})
.merge(
cudf.DataFrame._from_data({"None": other._column}),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am somewhat surprised that calling unique on the right column sometimes improves performance. unique calls stable_distinct which builds a hash table to uniquify things.

The leftanti join builds a hash table for the right column and then probes that hash table with the left column to return those rows in the left column that are not in the hash table.

So calling unique() on the right column would just seem to be an extra hash-table build for no gain.

Can you show the test cases you ran to check performance @amiralimi ?

@amiralimi
Copy link
Contributor Author

Hi @wence- .
You are correct. I just reran the tests and calling unique on RHS always has worse performance.
This is the test I wrote:

import cudf
import cupy
import time

l1 = cudf.Index(cupy.random.randint(0, 100, 10000000))
r1 = cudf.Index(cupy.random.randint(0, 100, 1000000))

r2 = cudf.Index(cupy.random.randint(0, 1000000, 1000000))
l2 = cudf.Index(cupy.random.randint(0, 1000, 10000000))

l3 = cudf.Index(cupy.random.randint(0, 1000000, 10000000))
r3 = cudf.Index(cupy.random.randint(0, 1000000, 10000000))

l4 = cudf.Index(cupy.random.randint(0, 1000, 10000000))
r4 = cudf.Index(cupy.random.randint(0, 1000000, 10000000))

start = time.time()
l1.difference(r1)
end = time.time()
print(f"test 1: {end - start}")

start = time.time()
l2.difference(r2)
end = time.time()
print(f"test 2: {end - start}")

start = time.time()
l3.difference(r3)
end = time.time()
print(f"test 3: {end - start}")

start = time.time()
l4.difference(r4)
end = time.time()
print(f"test 4: {end - start}")

and this is the output:

// no RHS unique
test 1: 0.04081010818481445
test 2: 0.044335126876831055
test 3: 0.04413151741027832
test 4: 0.01869058609008789

// RHS unique
test 1: 0.04834389686584473
test 2: 0.057317495346069336
test 3: 0.06256484985351562
test 4: 0.03918814659118652

@shwina
Copy link
Contributor

shwina commented Jan 25, 2024

/merge

@rapids-bot rapids-bot bot merged commit 0cd58fb into rapidsai:branch-24.02 Jan 25, 2024
68 checks passed
@shwina
Copy link
Contributor

shwina commented Jan 25, 2024

Thanks @amiralimi - we're honored to have you contribute to cuDF as your first open-source contribution! Hope we'll see more!

@amiralimi
Copy link
Contributor Author

Thanks @shwina . I really enjoyed working on CuDF and will try to contribute more to it.

@wence-
Copy link
Contributor

wence- commented Jan 26, 2024

Thanks!

@amiralimi amiralimi deleted the fix-index-difference branch January 30, 2024 18:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Index.difference does not uniquify output for duplicate indexes
4 participants