Fix index difference to follow the pandas format #14789
/ok to test
@wence- could you have a look at this when you get a chance? Thanks!
  else:
      other = other.copy(deep=False)
      difference = cudf.core.index._index_from_data(
-         cudf.DataFrame._from_data({"None": self._column})
+         cudf.DataFrame._from_data({"None": self._column.unique()})
          .merge(
              cudf.DataFrame._from_data({"None": other._column}),
Should we also call `unique()` on the right hand side to make the merge smaller?
I tried this with a few test cases, in some it had a better performance, and in some, it was worse. But pandas does this.
OK, I'd just add a note here to alert future readers about potential performance issues:
# NOTE: may need to investigate calling `unique()` on the LHS before the merge for better performance
I am somewhat surprised that calling `unique` on the right column sometimes improves performance. `unique` calls `stable_distinct`, which builds a hash table to uniquify things.
The `leftanti` join builds a hash table for the right column and then probes that hash table with the left column to return those rows in the left column that are not in the hash table.
So calling `unique()` on the right column would just seem to be an extra hash-table build for no gain.
Can you show the test cases you ran to check performance @amiralimi ?
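To make the reasoning above concrete, here is a minimal pure-Python model of the two steps being discussed. It is only a sketch: `stable_unique` and `left_anti_join` are hypothetical helper names standing in for the hash-based deduplication and the left anti join, and Python sets stand in for the hash tables. It is not the cuDF or libcudf implementation.

```python
from typing import Hashable, Iterable, List


def stable_unique(values: Iterable[Hashable]) -> List[Hashable]:
    """Drop duplicates, keeping the first occurrence of each value."""
    seen = set()  # plays the role of the hash table built for deduplication
    out = []
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out


def left_anti_join(left: Iterable[Hashable], right: Iterable[Hashable]) -> List[Hashable]:
    """Return the rows of `left` whose value does not appear in `right`."""
    table = set(right)  # build phase: duplicates in `right` collapse here
    return [v for v in left if v not in table]  # probe phase


left = [3, 1, 1, 2, 3, 5]
right = [3, 3, 4]

# Without deduplicating the left side, duplicates survive the anti join.
without_fix = left_anti_join(left, right)

# Deduplicating the left side first removes them.
with_fix = left_anti_join(stable_unique(left), right)

# Deduplicating the right side as well changes nothing in the result;
# in this model it only adds a second hash-table build.
rhs_deduped = left_anti_join(stable_unique(left), stable_unique(right))

print(without_fix, with_fix, rhs_deduped)
```

In this model the right-hand deduplication is redundant for correctness, which matches the argument above. Whether it helps or hurts run time on a GPU is a separate, empirical question that this sketch does not answer.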
@amiralimi - this is looking good. Could you run the style check, resolve any style issues found, and push a new commit with the updated style? For this, using
|
@shwina Thanks for helping out. I just ran pre-commit. This is my first time contributing to an open-source project. Is there anything I need to do?
/ok to test
/ok to test
/ok to test
Thanks, this looks good to me! It would be nice to see the small performance tests you ran to see if uniquifying the right column was helpful.
Hi @wence-.
import cudf
import cupy
import time
l1 = cudf.Index(cupy.random.randint(0, 100, 10000000))
r1 = cudf.Index(cupy.random.randint(0, 100, 1000000))
r2 = cudf.Index(cupy.random.randint(0, 1000000, 1000000))
l2 = cudf.Index(cupy.random.randint(0, 1000, 10000000))
l3 = cudf.Index(cupy.random.randint(0, 1000000, 10000000))
r3 = cudf.Index(cupy.random.randint(0, 1000000, 10000000))
l4 = cudf.Index(cupy.random.randint(0, 1000, 10000000))
r4 = cudf.Index(cupy.random.randint(0, 1000000, 10000000))
start = time.time()
l1.difference(r1)
end = time.time()
print(f"test 1: {end - start}")
start = time.time()
l2.difference(r2)
end = time.time()
print(f"test 2: {end - start}")
start = time.time()
l3.difference(r3)
end = time.time()
print(f"test 3: {end - start}")
start = time.time()
l4.difference(r4)
end = time.time()
print(f"test 4: {end - start}")
And this is the output:
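A single `time.time()` measurement per case may include one-time costs and run-to-run noise. As a hedged sketch of a more repeatable harness, the hypothetical helper `time_call` below does a warm-up run and reports the median of several runs. The optional `sync` callback is an assumption: for GPU work one would pass a device synchronization function there, and nothing in this thread says which one to use.

```python
import statistics
import time
from typing import Callable, Optional


def time_call(
    fn: Callable[[], object],
    *,
    repeats: int = 5,
    warmup: int = 1,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the median wall-clock seconds of `fn()` over `repeats` runs."""
    # Warm-up runs absorb one-time costs such as lazy initialization.
    for _ in range(warmup):
        fn()
        if sync is not None:
            sync()

    samples = []
    for _ in range(repeats):
        if sync is not None:
            sync()  # make sure earlier asynchronous work has finished
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the timed work itself to finish
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# CPU-only stand-in workload so the sketch runs anywhere.
elapsed = time_call(lambda: sorted(range(100_000)), repeats=3)
print(f"median: {elapsed:.6f} s")
```

With the indexes from the benchmark above, a call would look like `time_call(lambda: l1.difference(r1), sync=...)`, with a suitable synchronization function supplied for `sync`.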
/merge
Thanks @amiralimi - we're honored to have you contribute to cuDF as your first open-source contribution! Hope we'll see more!
Thanks @shwina. I really enjoyed working on cuDF and will try to contribute more to it.
Thanks!
Description
This PR fixes an error in `Index.difference` where the function keeps duplicate elements while pandas removes the duplicates. The tests had no inputs with duplicates, so I added new tests too (I added the test from the original issue).
`Index.difference` does not uniquify output for duplicate indexes #14489
Checklist