Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

loc very slow on sorted, non-unique index with list of labels ar argument #9466

Closed
toobaz opened this issue Feb 11, 2015 · 6 comments · Fixed by #22826
Closed

loc very slow on sorted, non-unique index with list of labels ar argument #9466

toobaz opened this issue Feb 11, 2015 · 6 comments · Fixed by #22826
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Milestone

Comments

@toobaz
Copy link
Member

toobaz commented Feb 11, 2015

In [1]: import pandas, numpy

In [2]: df = pandas.DataFrame(numpy.random.random((100000, 4)))

In [3]: %timeit df.loc[55555]
10000 loops, best of 3: 118 µs per loop

In [4]: %timeit df.loc[[55555]]
1000 loops, best of 3: 324 µs per loop

... makes sense to me.

In [5]: df.index = list(range(99999)) + [55555]

In [6]: %timeit df.loc[55555]
100 loops, best of 3: 4.04 ms per loop

In [7]: %timeit df.loc[[55555]]
100 loops, best of 3: 16.8 ms per loop

Non-unique index, slower (the second call probably has to scan all the index): still makes sense to me. Sorting should improve things...

In [8]: df.sort(inplace=True)

In [9]: %timeit df.loc[55555]
1000 loops, best of 3: 239 µs per loop

In [10]: %timeit df.loc[[55555]]
100 loops, best of 3: 17.2 ms per loop

... here I'm lost: why this huge difference? The difference is even larger (3 orders of magnitude) in a real database I am working on. Clearly,

In [12]: df.loc[[55555]] == df.loc[55555]
Out[12]: 
          0     1     2     3
55555  True  True  True  True
55555  True  True  True  True

(As a sidenote: the reason why I'm doing calls such as df.loc[[a_label]] is that df.loc[a_label] will return sometimes a Series, sometimes a DataFrame. I currently solve this by using df.loc[df.index == a_label], which is however ~3x slower than df.loc[a_label] - but much faster than the above df.loc[[a_label]].)

@jreback
Copy link
Contributor

jreback commented Feb 11, 2015

these take a slightly different path is the reason. A list of labels is not fast-pathed IIRC. Just a matter working on it. Feel free to submit a pull-request.

cc @shoyer
cc @immerrr

sortedness assures that these are indexable by faster methods

In [25]: df2 = df.copy()

In [26]: df2 = df.sort_index()

In [27]: df2.index.is_monotonic_increasing
Out[27]: True

In [28]: df.index.is_monotonic_increasing
Out[28]: False

Not that df.loc[[....]] will always return a DataFrame, while df.loc[indexer] will return a DataFrame ONLY if you have duplicates. Generally having duplicates makes things more difficult. You are better off using a multi-level index to avoid this condition.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Feb 11, 2015
@toobaz
Copy link
Member Author

toobaz commented Feb 11, 2015

Thanks for the clarification. Indeed, I wouldn't have expected even such a stupid multi-level index...

In  [13]: df3 = df.reset_index().reset_index().set_index(["level_0", "index"])

to be substantially faster than a non-unique one!

In  [14]: %timeit df3.loc[[55555]]
Out [14]: 100 loops, best of 3: 3.23 ms per loop

(although still orders of magnitude slower than df.loc[df.index == 55555])

By the way: I know df.loc[indexer] will return a DataFrame if you have duplicates. But I would find it more elegant/useful if the distinction was made at the DataFrame level (i.e. if not self.index.is_unique, then a DataFrame is returned even for non-duplicated labels). I may certainly be overlooking tons of feasibility/backward compatibility issues however.

@toobaz
Copy link
Member Author

toobaz commented Feb 11, 2015

(sorry, should have been

In  [13]: df3 = df.reset_index().reset_index().set_index(["level_0", "index"]).swaplevel(0,1)

above, not that it changes anything)

@jreback
Copy link
Contributor

jreback commented Feb 11, 2015

a multi-index is like an index of indexes, so if each is unique it uses the optimized lookups. FYI, the difference between 1ms and 100us is just a few function calls (e.g. the MI has to do more inference on what exactly you are looking)

In [20]: %timeit df3.loc[[(55555,99999)]]
1000 loops, best of 3: 417 us per loop

In [21]: %timeit df3.loc[[(55555,99999),(99998,99998)]]
1000 loops, best of 3: 432 us per loop

@toobaz
Copy link
Member Author

toobaz commented Feb 12, 2015

Il giorno mer, 11/02/2015 alle 10.02 -0800, jreback ha scritto:

a multi-index is like an index of indexes, so if each is unique it
uses the optimized lookups.

Yes... but the funny thing to me is that having a multi-index made of
non unique indices still yields a ~5x speedup! (possibly a futile
remark, though)

@toobaz
Copy link
Member Author

toobaz commented Feb 28, 2015

Just for the records: this applies both to the case (reported above) of a non-unique index, and to the case of unique index but non-unique list being searched.

@toobaz toobaz changed the title loc very slow on unsorted, non-unique index with list of labels ar argument loc very slow on sorted, non-unique index with list of labels ar argument Sep 23, 2017
@rtlee9 rtlee9 mentioned this issue Sep 25, 2018
4 tasks
@jreback jreback added this to the 0.24.0 milestone Sep 25, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants