loc very slow on sorted, non-unique index with list of labels as argument #9466
Comments
The reason is that these take a slightly different path. A list of labels is not fast-pathed, IIRC; it's just a matter of working on it. Feel free to submit a pull request. Sortedness ensures that these are indexable by faster methods.
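To illustrate the remark about sortedness, here is a minimal sketch of how a sorted index can be exploited by hand with a binary search. It is not a description of what `.loc` does internally; the toy frame and the helper name `rows_for_label` are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a sorted, non-unique integer index.
df = pd.DataFrame({"value": np.arange(6)}, index=[1, 1, 2, 3, 3, 3])


def rows_for_label(frame, label):
    """Return all rows whose index equals `label`, assuming a sorted index."""
    idx = frame.index
    if not idx.is_monotonic_increasing:
        raise ValueError("this sketch requires a sorted index")
    # Binary search for the first and one-past-last positions of `label`.
    lo = idx.searchsorted(label, side="left")
    hi = idx.searchsorted(label, side="right")
    # A positional slice always yields a DataFrame, even for a single row.
    return frame.iloc[lo:hi]


result = rows_for_label(df, 3)
```

Because the lookup is two binary searches plus a positional slice, its cost grows with the logarithm of the index length rather than with the length itself.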
Not that
Thanks for the clarification. Indeed, I wouldn't have expected even such a stupid multi-level index...
to be substantially faster than a non-unique one!
(although still orders of magnitude slower than df.loc[df.index == 55555])
By the way: I know df.loc[indexer] will return a DataFrame if you have duplicates. But I would find it more elegant/useful if the distinction were made at the DataFrame level (i.e. if not self.index.is_unique, then a DataFrame is returned even for non-duplicated labels). I may well be overlooking tons of feasibility/backward-compatibility issues, however.
(sorry, should have been
above, not that it changes anything)
A multi-index is like an index of indexes, so if each is unique it uses the optimized lookups. FYI, the difference between 1ms and 100us is just a few function calls (e.g. the MI has to do more inference on what exactly you are looking for).
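For readers who want to try the multi-index variant, the sketch below shows one hypothetical way to turn a non-unique index into a unique MultiIndex by adding an occurrence counter as a second level. The data and the level names `label` and `occurrence` are assumptions for illustration; the index actually used in this thread is not shown.

```python
import pandas as pd

# Hypothetical toy data: label 1 appears twice, so the index is non-unique.
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[1, 1, 2, 3])

# Number the repeats of each label: 0, 1, 0, 0.
occurrence = df.groupby(level=0).cumcount().to_numpy()

# Pair each label with its occurrence number to get a unique two-level index.
mi_df = df.copy()
mi_df.index = pd.MultiIndex.from_arrays(
    [df.index, occurrence], names=["label", "occurrence"]
)

# Selecting on the first level returns every row for that label.
rows = mi_df.loc[1]
```

Here `mi_df.index.is_unique` is true although `df.index.is_unique` is not, which is the situation the comment above describes.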
On Wed, 11 Feb 2015 at 10:02 -0800, jreback wrote:
Yes... but the funny thing to me is that having a multi-index made of
Just for the record: this applies both to the case (reported above) of a non-unique index, and to the case of a unique index but a non-unique list being searched.
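The two cases can be set up side by side with a small timing harness such as the sketch below. The frame size, the label 555, and the repeat count are arbitrary assumptions, and the numbers will depend heavily on the pandas version, so no expected timings are given.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000  # hypothetical size, chosen only for illustration

# Unique, sorted index: 0, 1, 2, ...
unique_df = pd.DataFrame({"value": np.arange(n)}, index=np.arange(n))
# Non-unique, sorted index: 0, 0, 1, 1, 2, 2, ...
nonunique_df = pd.DataFrame({"value": np.arange(n)}, index=np.arange(n) // 2)

cases = {
    "unique index, unique list": lambda: unique_df.loc[[555]],
    "unique index, non-unique list": lambda: unique_df.loc[[555, 555]],
    "non-unique index, unique list": lambda: nonunique_df.loc[[555]],
}

# Average wall-clock time per call over a small number of repeats.
repeats = 20
timings = {
    name: timeit.timeit(fn, number=repeats) / repeats
    for name, fn in cases.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.0f} us per call")
```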
... makes sense to me.
Non-unique index, slower (the second call probably has to scan the whole index): still makes sense to me. Sorting should improve things...
... here I'm lost: why this huge difference? The difference is even larger (3 orders of magnitude) in a real database I am working on. Clearly,
(As a side note: the reason why I'm doing calls such as df.loc[[a_label]] is that df.loc[a_label] sometimes returns a Series and sometimes a DataFrame. I currently solve this by using df.loc[df.index == a_label], which is however ~3x slower than df.loc[a_label] - but much faster than the above df.loc[[a_label]].)
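The return-type difference described in the side note can be reproduced with a tiny frame. This is a minimal sketch on made-up data; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: label 1 is duplicated, label 2 is not.
df = pd.DataFrame({"value": [10, 20, 30]}, index=[1, 1, 2])

dup_scalar = df.loc[1]               # duplicated label -> DataFrame
single_scalar = df.loc[2]            # non-duplicated label -> Series
single_list = df.loc[[2]]            # list of labels -> DataFrame
single_mask = df.loc[df.index == 2]  # boolean mask -> DataFrame

for name, obj in [
    ("df.loc[1]", dup_scalar),
    ("df.loc[2]", single_scalar),
    ("df.loc[[2]]", single_list),
    ("df.loc[df.index == 2]", single_mask),
]:
    print(f"{name}: {type(obj).__name__}")
```

Both the list form and the boolean-mask form give a DataFrame regardless of how many rows match, which is why either can be used to get a consistent return type.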