[PERF/ENH] Index.intersection does more hashing work than necessary #14487

Open
Tracked by #14479
wence- opened this issue Nov 23, 2023 · 0 comments
Labels
Performance Performance related issue Python Affects Python cuDF API.

Comments

@wence-
Contributor

wence- commented Nov 23, 2023

Index intersection performs an inner merge of the unique values of the left and right indices (the unique step is there so that indices with repeated values don't blow up the memory footprint of the merge). This fully hashes both indices for the unique step, and then the merge hashes them again. Finally, if requested, the result is sorted.
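
To make the redundant work concrete, here is a rough pure-Python model of the pipeline described above. It is only a sketch of the semantics, not cuDF code: the function name `intersection_via_unique_merge` is made up for illustration, and plain `dict`/`set` operations stand in for the GPU hash-based unique and merge.

```python
def intersection_via_unique_merge(left, right, sort=False):
    """Hypothetical model: unique both sides, inner merge, optional sort."""
    # Hash passes 1 and 2: deduplicate each side independently.
    left_unique = list(dict.fromkeys(left))
    right_unique = list(dict.fromkeys(right))
    # Hash pass 3: the inner merge hashes the (already unique) keys again.
    right_keys = set(right_unique)
    result = [key for key in left_unique if key in right_keys]
    # The sort only happens when the caller asks for it.
    return sorted(result) if sort else result


print(intersection_via_unique_merge([3, 1, 3, 2, 5], [2, 3, 3, 7]))
# [3, 2]
print(intersection_via_unique_merge([3, 1, 3, 2, 5], [2, 3, 3, 7], sort=True))
# [2, 3]
```

Note that this model happens to keep the left-hand order because Python dicts and list comprehensions are ordered; a real hash join need not make that promise.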

This could be replaced, I think with positive performance effect, by either:

  • leftsemi join + drop_duplicates
  • libcudf.search.contains + apply_boolean_mask + drop_duplicates
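
The second option can be modelled the same way. Again this is a hedged pure-Python sketch rather than the actual libcudf calls: `intersection_via_contains` is a hypothetical name, a `set` stands in for the hash table that a contains-style lookup would build, and the three commented steps mirror contains, apply_boolean_mask and drop_duplicates. A leftsemi join followed by drop_duplicates should give the same set of keys, since a semi join also keeps exactly the left rows that have a match on the right.

```python
def intersection_via_contains(left, right, sort=False):
    """Hypothetical model: contains + boolean mask + drop_duplicates."""
    # Only the right side is hashed to build the lookup structure.
    haystack = set(right)
    # contains: one boolean per left element.
    mask = [key in haystack for key in left]
    # apply_boolean_mask: keep the left rows that were found on the right.
    kept = [key for key, found in zip(left, mask) if found]
    # drop_duplicates: keep the first occurrence of each surviving key.
    result = list(dict.fromkeys(kept))
    return sorted(result) if sort else result


print(intersection_via_contains([3, 1, 3, 2, 5], [2, 3, 3, 7]))
# [3, 2]
```

In this model the result follows the order of first occurrence in the left index when `sort=False`. Whether a GPU implementation would (or should) guarantee that is exactly the open question about ordering.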

One would have to think through the consequences of either of these wrt any ordering guarantees we might want when sort=False (possibly gated behind pandas-compat mode).

This applies mutatis mutandis to MultiIndex.intersection too.

Projects
Status: Todo
Development

No branches or pull requests

1 participant