PERF: StringEngine for string dtype indexing ops #56997

Merged 8 commits on Jan 23, 2024

@lukemanley commented Jan 21, 2024

Seems to be about 2x faster than ObjectEngine for indexing ops that use the hashmap:

import pandas as pd

N = 100_000
dtype = "string[pyarrow_numpy]"

strings = [f"i-{i}" for i in range(N)]

idx1 = pd.Index(strings[10:], dtype=dtype)
idx2 = pd.Index(strings[:-10], dtype=dtype)

%timeit idx1.get_indexer_for(idx2)

# 52.6 ms ± 834 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  -> main
# 25 ms ± 790 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)    -> PR
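
For readers unfamiliar with what "indexing ops that use the hashmap" means, here is a minimal pure-Python sketch of the general idea: build a value-to-position hash table once, then resolve each target with a single lookup, returning -1 for values that are absent. This is only an illustration under stated assumptions; `get_indexer_sketch` is a hypothetical name, it assumes unique index values, and it is not the pandas engine code touched by this PR.

```python
def get_indexer_sketch(index_values, target_values):
    """Return each target's position in index_values, or -1 if absent.

    Hypothetical helper for illustration only; assumes index_values are unique.
    """
    # Build the hash table once: value -> position.
    positions = {value: i for i, value in enumerate(index_values)}
    # Each target is then resolved with one hash lookup.
    return [positions.get(value, -1) for value in target_values]


# Same shape as the benchmark above, at a much smaller size.
N = 30
strings = [f"i-{i}" for i in range(N)]

result = get_indexer_sketch(strings[10:], strings[:-10])
print(result)
```

The first ten targets are missing from the index and map to -1; the rest map to their positions in the index. The cost is dominated by hashing and comparing keys, which is consistent with a string-specialized hash table helping these operations.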

@lukemanley added the Performance, Strings, and Index labels Jan 21, 2024
@lukemanley added this to the 3.0 milestone Jan 21, 2024
@lukemanley requested a review from WillAyd as a code owner Jan 21, 2024 20:55

@WillAyd WillAyd left a comment

looks good to me

pandas/_libs/index.pyi (outdated, resolved)
@mroeschke merged commit 3c96b8f into pandas-dev:main Jan 23, 2024
49 of 50 checks passed
@mroeschke

Thanks @lukemanley

pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
* add StringEngine for string dtype indexes

* whatsnew

* ensure str

* mypy

* subclass IndexEngine

* update to match class implementation