[PERF/ENH] `Index.intersection` does more hashing work than necessary #14487

wence- · 2023-11-23T17:08:58Z

`Index.intersection` performs an inner merge of the unique values of the left and right indices (the unique step keeps indices with repeated values from blowing up the memory footprint). This fully hashes both indices to deduplicate them, and then the merge hashes again. Finally, if requested, the result is sorted.

I think this could be replaced, with a positive effect on performance, by either:

- leftsemi join + `drop_duplicates`
- `libcudf.search.contains` + `apply_boolean_mask` + `drop_duplicates`
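
To make the difference in hashing work concrete, here is a minimal pure-Python model of the two strategies. It is only a sketch: the function names are hypothetical, plain lists and sets stand in for GPU columns and hash tables, and it is not cuDF or libcudf code. The first function mirrors the unique + inner merge approach described above; the second mirrors the proposed semi-join (or contains + boolean mask) followed by a single deduplication.

```python
def intersection_via_unique_merge(left, right):
    """Model of the current strategy: dedupe both sides, then inner-join."""
    left_unique = list(dict.fromkeys(left))    # hash pass over left
    right_unique = list(dict.fromkeys(right))  # hash pass over right
    right_lookup = set(right_unique)           # hash again for the join
    return [v for v in left_unique if v in right_lookup]


def intersection_via_semi_join(left, right):
    """Model of the proposed strategy: left-semi filter, then drop duplicates."""
    right_lookup = set(right)                      # single hash build on right
    kept = [v for v in left if v in right_lookup]  # contains + boolean mask
    return list(dict.fromkeys(kept))               # drop_duplicates on the survivors


# Hypothetical example data with repeated values on both sides.
left = [3, 1, 3, 2, 5, 1]
right = [1, 1, 3, 7]

result_merge = intersection_via_unique_merge(left, right)
result_semi = intersection_via_semi_join(left, right)
```

In this model both functions return the same values in left-hand order, because Python dicts preserve insertion order. A GPU hash join or deduplication need not preserve order, which is where the ordering question for `sort=False` comes from.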

One would have to think through the consequences of either of these wrt any ordering guarantees we might want when sort=False (possibly gated behind pandas-compat mode).

This applies mutatis mutandis to MultiIndex.intersection too.


wence- added the Performance (Performance related issue) and Python (Affects Python cuDF API.) labels Nov 23, 2023

This was referenced Nov 23, 2023

[ENH] Audit cudf APIs for use of inappropriate algorithms #14479

Open

[BUG] Index.difference does not uniquify output for duplicate indexes #14489

Closed

vyasr added this to cuDF Python Nov 5, 2024

github-project-automation bot moved this to Todo in cuDF Python Nov 5, 2024
