Fix index difference to follow the pandas format #14789

Merged
merged 5 commits into from
Jan 25, 2024
Changes from 2 commits
4 changes: 2 additions & 2 deletions python/cudf/cudf/core/_base_index.py
@@ -1040,11 +1040,11 @@ def difference(self, other, sort=None):
         res_name = _get_result_name(self.name, other.name)
 
         if is_mixed_with_object_dtype(self, other):
-            difference = self.copy()
+            difference = self.copy().unique()
         else:
             other = other.copy(deep=False)
             difference = cudf.core.index._index_from_data(
-                cudf.DataFrame._from_data({"None": self._column})
+                cudf.DataFrame._from_data({"None": self._column.unique()})
                 .merge(
                     cudf.DataFrame._from_data({"None": other._column}),
Contributor

Should we also call unique() on the right-hand side to make the merge smaller?

Contributor Author

I tried this with a few test cases: in some it performed better, and in others it performed worse. But pandas does this.

Contributor

OK, I'd just add a note here to alert future readers about potential performance issues:

# NOTE: may need to investigate calling `unique()` on the RHS before the merge for better performance

Contributor

I am somewhat surprised that calling `unique()` on the right column sometimes improves performance. `unique()` calls `stable_distinct`, which builds a hash table to uniquify things.

The leftanti join builds a hash table for the right column and then probes that hash table with the left column to return those rows in the left column that are not in the hash table.

So calling unique() on the right column would just seem to be an extra hash-table build for no gain.
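
The build-and-probe argument above can be illustrated with a minimal pure-Python sketch. It models the hash tables with Python sets and is not cuDF's or libcudf's actual implementation; the names `dedupe_stable`, `left_anti_join`, and the `builds` counter are hypothetical, introduced only for illustration.

```python
def dedupe_stable(values, stats):
    """Order-preserving dedup; builds one hash table (a set)."""
    stats["builds"] += 1
    seen = set()
    out = []
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out


def left_anti_join(left, right, stats):
    """Return rows of `left` with no match in `right`.

    Builds one hash table from `right`, then probes it with `left`.
    """
    stats["builds"] += 1
    table = set(right)
    return [v for v in left if v not in table]


left = [1, 1, 2, 2, 3]
right = [2, 2, 2, 5]

# Variant A: dedupe only the left side, as in this PR's diff.
stats_a = {"builds": 0}
result_a = left_anti_join(dedupe_stable(left, stats_a), right, stats_a)

# Variant B: additionally dedupe the right side before the join.
stats_b = {"builds": 0}
result_b = left_anti_join(
    dedupe_stable(left, stats_b), dedupe_stable(right, stats_b), stats_b
)

print(result_a, stats_a["builds"])  # [1, 3] 2
print(result_b, stats_b["builds"])  # [1, 3] 3
```

In this model both variants return the same rows, and variant B performs one more hash-table build. Whether that carries over to real GPU timings depends on factors the sketch ignores, such as column size and duplicate ratio.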

Can you show the test cases you ran to check performance, @amiralimi?

                     how="leftanti",
2 changes: 2 additions & 0 deletions python/cudf/cudf/tests/test_index.py
@@ -803,6 +803,7 @@ def test_index_to_series(data):
         pd.Series(["1", "2", "a", "3", None], dtype="category"),
         range(0, 10),
         [],
+        [1, 1, 2, 2],
     ],
 )
 @pytest.mark.parametrize(
@@ -819,6 +820,7 @@ def test_index_to_series(data):
         range(2, 4),
         pd.Series(["1", "a", "3", None], dtype="category"),
         [],
+        [2],
     ],
 )
 @pytest.mark.parametrize("sort", [None, False])
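
To make the intent of the two new parameters concrete, here is a small pure-Python reference model of a deduplicating difference: each value from the left side appears at most once in the result, minus anything present on the right side. It is a sketch under stated assumptions, not the test body or cuDF code; `reference_difference` is a hypothetical helper, and it assumes that `sort=None` attempts to sort the result while `sort=False` keeps first-seen order.

```python
def reference_difference(left, right, sort=None):
    """Model of a deduplicating index difference (hypothetical helper)."""
    exclude = set(right)
    # dict.fromkeys keeps the first occurrence of each value, in order.
    result = [v for v in dict.fromkeys(left) if v not in exclude]
    if sort is None:
        try:
            result = sorted(result)
        except TypeError:
            pass  # incomparable values: leave first-seen order
    return result


# Inputs matching the values added to the parametrization in this diff.
print(reference_difference([1, 1, 2, 2], [2]))          # [1]
print(reference_difference([1, 1, 2, 2], []))           # [1, 2]
print(reference_difference([2, 1, 1], [], sort=False))  # [2, 1]
```

A left anti join on its own would keep both copies of `1` for the inputs `[1, 1, 2, 2]` and `[2]`, which is why the duplicated left-hand input is a useful test case for the `unique()` calls added above.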