PERF: Categorical indexing performance regression #30744
Comments
Actually, I think I was wrong that the object dtype support is the source of the additional overhead in this case. Rather, it is the conversion of the list to a numpy array, only to find that it is an integer array and not a boolean array, after which the converted array is thrown away (and the conversion has to be done again later, when actually indexing with the integer list).
So IIUC, the best thing to do is convert list inputs into array inputs as early as possible? And then re-use that (hopefully well-typed) input later on?
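To make the difference concrete, here is a minimal sketch of the two patterns using plain numpy rather than the actual Categorical code. The function names `getitem_convert_twice` and `getitem_convert_once` are hypothetical and only illustrate the idea; they are not pandas internals.

```python
import numpy as np


def getitem_convert_twice(values: np.ndarray, key):
    """Hypothetical sketch of the slow pattern: probe the list, then discard the probe."""
    if isinstance(key, list):
        # First conversion, done only to inspect the dtype.
        probe = np.asarray(key)
        if probe.dtype == np.bool_:
            return values[probe]
        # Not a boolean mask: the converted array is thrown away here.
    # numpy converts the list a second time while indexing.
    return values[key]


def getitem_convert_once(values: np.ndarray, key):
    """Hypothetical sketch of the proposed pattern: convert early and re-use the array."""
    if isinstance(key, list):
        key = np.asarray(key)
        if key.size == 0:
            # An empty list converts to float64, which is not a valid indexer.
            key = key.astype(np.intp)
    # Boolean and integer keys are both handled by the same typed array.
    return values[key]
```

Both functions return the same result for integer lists and boolean masks; the second simply avoids converting the list twice.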
Convert to an array earlier on. Closes pandas-dev#30744
Yes, that's correct. But seeing that basically every internal and external ExtensionArray would want to do this, I am wondering whether we would rather expose something like
Such a common function might also help with #30738
Recent regression in the categoricals.CategoricalSlicing.time_getitem_list benchmark: https://pandas.pydata.org/speed/pandas/#categoricals.CategoricalSlicing.time_getitem_list?commits=6efc2379-b9de33e3
Reproducible example for this benchmark:
Now, this slowdown is due to the changes in #30308. Categorical __getitem__ now checks if the key is a boolean indexer: https://github.com/pandas-dev/pandas/pull/30308/files#diff-f3b2ea15ba728b55cab4a1acd97d996d
So this slowdown is of course expected, and it only affects Categorical itself (e.g. pd.Series indexing already handles this boolean checking). In that light, we can certainly ignore this regression.
But this led me to think: maybe the ExtensionArrays are a good place to start not supporting object dtype as a boolean indexer? (And so not add support for it now, which also avoids this performance regression.)
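As a rough illustration of why object dtype support costs more, the sketch below contrasts a strict check with a lenient one. Both helpers are hypothetical, written against plain numpy arrays, and are not the check pandas actually uses.

```python
import numpy as np


def is_bool_indexer_strict(key: np.ndarray) -> bool:
    """Hypothetical check: accept only a real boolean dtype (a single dtype comparison)."""
    return key.dtype == np.bool_


def is_bool_indexer_lenient(key: np.ndarray) -> bool:
    """Hypothetical check: also accept object arrays holding only bools."""
    if key.dtype == np.bool_:
        return True
    if key.dtype == object:
        # Object arrays carry no element type, so every element must be inspected.
        return len(key) > 0 and all(isinstance(x, (bool, np.bool_)) for x in key)
    return False
```

The strict version never looks at the elements, while the lenient version has to scan an object array element by element before it can answer.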