loc very slow on sorted, non-unique index with list of labels ar argument #9466

toobaz · 2015-02-11T14:00:43Z

In [1]: import pandas, numpy

In [2]: df = pandas.DataFrame(numpy.random.random((100000, 4)))

In [3]: %timeit df.loc[55555]
10000 loops, best of 3: 118 µs per loop

In [4]: %timeit df.loc[[55555]]
1000 loops, best of 3: 324 µs per loop

... makes sense to me.

In [5]: df.index = list(range(99999)) + [55555]

In [6]: %timeit df.loc[55555]
100 loops, best of 3: 4.04 ms per loop

In [7]: %timeit df.loc[[55555]]
100 loops, best of 3: 16.8 ms per loop

Non-unique index, slower (the second call probably has to scan all the index): still makes sense to me. Sorting should improve things...

In [8]: df.sort(inplace=True)

In [9]: %timeit df.loc[55555]
1000 loops, best of 3: 239 µs per loop

In [10]: %timeit df.loc[[55555]]
100 loops, best of 3: 17.2 ms per loop

... here I'm lost: why this huge difference? The difference is even larger (3 orders of magnitude) in a real database I am working on. Clearly,

In [12]: df.loc[[55555]] == df.loc[55555]
Out[12]: 
          0     1     2     3
55555  True  True  True  True
55555  True  True  True  True

(As a sidenote: the reason why I'm doing calls such as df.loc[[a_label]] is that df.loc[a_label] will return sometimes a Series, sometimes a DataFrame. I currently solve this by using df.loc[df.index == a_label], which is however ~3x slower than df.loc[a_label] - but much faster than the above df.loc[[a_label]].)

The text was updated successfully, but these errors were encountered:

jreback · 2015-02-11T14:15:18Z

these take a slightly different path is the reason. A list of labels is not fast-pathed IIRC. Just a matter working on it. Feel free to submit a pull-request.

cc @shoyer
cc @immerrr

sortedness assures that these are indexable by faster methods

In [25]: df2 = df.copy()

In [26]: df2 = df.sort_index()

In [27]: df2.index.is_monotonic_increasing
Out[27]: True

In [28]: df.index.is_monotonic_increasing
Out[28]: False

Not that df.loc[[....]] will always return a DataFrame, while df.loc[indexer] will return a DataFrame ONLY if you have duplicates. Generally having duplicates makes things more difficult. You are better off using a multi-level index to avoid this condition.

toobaz · 2015-02-11T15:29:21Z

Thanks for the clarification. Indeed, I wouldn't have expected even such a stupid multi-level index...

In  [13]: df3 = df.reset_index().reset_index().set_index(["level_0", "index"])

to be substantially faster than a non-unique one!

In  [14]: %timeit df3.loc[[55555]]
Out [14]: 100 loops, best of 3: 3.23 ms per loop

(although still orders of magnitude slower than df.loc[df.index == 55555])

By the way: I know df.loc[indexer] will return a DataFrame if you have duplicates. But I would find it more elegant/useful if the distinction was made at the DataFrame level (i.e. if not self.index.is_unique, then a DataFrame is returned even for non-duplicated labels). I may certainly be overlooking tons of feasibility/backward compatibility issues however.

toobaz · 2015-02-11T15:45:47Z

(sorry, should have been

In  [13]: df3 = df.reset_index().reset_index().set_index(["level_0", "index"]).swaplevel(0,1)

above, not that it changes anything)

jreback · 2015-02-11T18:02:26Z

a multi-index is like an index of indexes, so if each is unique it uses the optimized lookups. FYI, the difference between 1ms and 100us is just a few function calls (e.g. the MI has to do more inference on what exactly you are looking)

In [20]: %timeit df3.loc[[(55555,99999)]]
1000 loops, best of 3: 417 us per loop

In [21]: %timeit df3.loc[[(55555,99999),(99998,99998)]]
1000 loops, best of 3: 432 us per loop

toobaz · 2015-02-12T10:24:29Z

Il giorno mer, 11/02/2015 alle 10.02 -0800, jreback ha scritto:

a multi-index is like an index of indexes, so if each is unique it
uses the optimized lookups.

Yes... but the funny thing to me is that having a multi-index made of
non unique indices still yields a ~5x speedup! (possibly a futile
remark, though)

toobaz · 2015-02-28T15:52:40Z

Just for the records: this applies both to the case (reported above) of a non-unique index, and to the case of unique index but non-unique list being searched.

jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Feb 11, 2015

toobaz mentioned this issue Feb 19, 2015

API: Please make ".loc" return type depend on index, not on specific labels #9519

Open

toobaz changed the title ~~loc very slow on unsorted, non-unique index with list of labels ar argument~~ loc very slow on sorted, non-unique index with list of labels ar argument Sep 23, 2017

toobaz mentioned this issue Jan 31, 2018

Int64Index.get_loc() is very slow on unsorted, non-unique index #19478

Closed

inodb mentioned this issue Feb 1, 2018

Add exon information to transcript IDs in Ensembl Controller genome-nexus/genome-nexus#133

Closed

rtlee9 mentioned this issue Sep 25, 2018

Loc enhancements #22826

Merged

4 tasks

jreback added this to the 0.24.0 milestone Sep 25, 2018

jreback closed this as completed in #22826 Oct 1, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

loc very slow on sorted, non-unique index with list of labels ar argument #9466

loc very slow on sorted, non-unique index with list of labels ar argument #9466

toobaz commented Feb 11, 2015

jreback commented Feb 11, 2015

toobaz commented Feb 11, 2015

toobaz commented Feb 11, 2015

jreback commented Feb 11, 2015

toobaz commented Feb 12, 2015

toobaz commented Feb 28, 2015

loc very slow on sorted, non-unique index with list of labels ar argument #9466

loc very slow on sorted, non-unique index with list of labels ar argument #9466

Comments

toobaz commented Feb 11, 2015

jreback commented Feb 11, 2015

toobaz commented Feb 11, 2015

toobaz commented Feb 11, 2015

jreback commented Feb 11, 2015

toobaz commented Feb 12, 2015

toobaz commented Feb 28, 2015