PERF: .ix performance on series #5567

Closed
l736x opened this issue Nov 21, 2013 · 13 comments · Fixed by #6364
Labels: Indexing, Performance

@l736x (Contributor) commented Nov 21, 2013

Series xs, when presented with a MultiIndex, should use the DataFrame logic (whereby it can use the levels to avoid having to scan the entire index). The xs logic needs to move to core/generic.py so both Series and DataFrame can use it.


I stumbled on a weird issue.
The first time I access a series location with .ix I see a huge overhead.
This is not the case if I convert the series into a DataFrame and then access it.
I use a multi-index below because it makes the effect more visible, but the same problem is present for regular indexes. (The sessions below assume randn is available as numpy.random.randn, e.g. in an IPython --pylab session.)

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.12.0'

In [3]: mi = pd.MultiIndex.from_tuples([(x,y) for x in range(1000) for y in range(1000)])

In [4]: s = pd.Series(randn(1000000), index=mi)

In [5]: %time s.ix[999]
CPU times: user 652 ms, sys: 4 ms, total: 656 ms
Wall time: 656 ms
Out[5]:
0    -0.271328
...
999   -0.832013
Length: 1000, dtype: float64

In [6]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 249 us
Out[6]:
0    -0.271328
...
999   -0.832013
Length: 1000, dtype: float64

The behavior seems related to the index: recreating the series with the same index does not reproduce the overhead, but sorting the index does:

In [7]: s = pd.Series(randn(1000000), index=mi)

In [8]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 247 us
Out[8]:
0    -1.207499
...
999   -0.370578
Length: 1000, dtype: float64

In [9]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 202 us
Out[9]:
0    -1.207499
...
999   -0.370578
Length: 1000, dtype: float64

In [10]: s = s.sort_index()

In [11]: %time s.ix[999]
CPU times: user 856 ms, sys: 32 ms, total: 888 ms
Wall time: 890 ms
Out[11]:
0    -1.207499
...
999   -0.370578
Length: 1000, dtype: float64

In [12]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 213 us
Out[12]:
0    -1.207499
...
999   -0.370578
Length: 1000, dtype: float64

And now the weird part. I convert the series into a DataFrame:

In [16]: s = s.sort_index()

In [17]: %time pd.DataFrame(s).ix[999]
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.78 ms
Out[17]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 1 columns):
0    1000  non-null values
dtypes: float64(1)

In [18]: %time pd.DataFrame(s).ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 397 us
Out[18]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 1 columns):
0    1000  non-null values
dtypes: float64(1)

There is still a little overhead but nothing compared to the previous case.

It might seem an innocent problem, but for large time series the lag grows to the order of a second and can eat up a lot of performance.

Lastly, I'm sorry but I don't have easy access to the current dev version, so the problem may already be solved there (although I'd be curious to know where it comes from).

Edit: could it be linked to #4198?

@jreback (Contributor) commented Nov 21, 2013

This is correct and expected behavior. The index is hashed lazily, i.e. when a lookup first needs it.

If you find that you are doing fancy lookups a lot (say, in a loop), then you need to refactor your code to do them in a vectorized way.
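
For reference, a minimal sketch of the kind of vectorized refactor being suggested here, not taken from the thread. It is written against the modern .loc accessor (.ix has since been removed); the sizes and key range are only illustrative.

import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_tuples([(x, y) for x in range(1000) for y in range(1000)])
s = pd.Series(np.random.randn(len(mi)), index=mi)

# Slow pattern: one label lookup per iteration, each call going through the
# label-indexing machinery separately (s.ix[k] on the 0.12-era API).
pieces = {k: s.loc[k] for k in range(100)}

# Vectorized alternative: a single selection over all first-level keys,
# returning every row whose first level is in 0..99.
subset = s.loc[list(range(100))]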

@l736x (Contributor, Author) commented Nov 21, 2013

I understand that the index is hashed lazily, but I don't get why s.ix[0] triggers the hashing while DataFrame(s).ix[0] doesn't.
Is the first fancy indexing and the second not?
We are talking about a factor of more than 1000 between two operations that do exactly the same thing, and the faster one is the one that is intuitively more complicated.

@jreback (Contributor) commented Nov 21, 2013

In master

In [1]: mi = pd.MultiIndex.from_tuples([(x,y) for x in range(1000) for y in range(1000)])

In [2]: s = pd.Series(randn(1000000), index=mi)

In [3]: %timeit s.ix[999]
1 loops, best of 3: 238 µs per loop

In [4]: %timeit s.ix[999]
1000 loops, best of 3: 206 µs per loop

In [5]: s = pd.Series(randn(1000000), index=mi)

In [6]: %timeit s.ix[999]
1000 loops, best of 3: 201 µs per loop

In [7]: %timeit s.ix[999]
1000 loops, best of 3: 198 µs per loop

In [8]: s = s.sort_index()

In [9]: %timeit s.ix[999]
1 loops, best of 3: 290 µs per loop

In [10]: %timeit s.ix[999]
1000 loops, best of 3: 209 µs per loop

DataFrame

In [11]: s = s.sort_index()

In [18]: %timeit pd.DataFrame(s)
1 loops, best of 3: 1.5 s per loop

In [14]: df = pd.DataFrame(s)

In [15]: %timeit df.ix[999]
1000 loops, best of 3: 309 µs per loop

In [16]: %timeit df.ix[999]
1000 loops, best of 3: 321 µs per loop

@l736x (Contributor, Author) commented Nov 21, 2013

You must use %time, not %timeit, because the problem shows up only on the first access and %timeit reports the best of three runs in any case.

Why are you not timing pd.DataFrame(s).ix[999]?

@jreback (Contributor) commented Nov 21, 2013

The cost of the indexing is incurred in the DataFrame construction, not in the first indexing access.

@l736x (Contributor, Author) commented Nov 21, 2013

Then how do you explain that, right after an s.sort_index(), pd.DataFrame(s).ix[999] takes 4 ms while s.ix[999] takes 900 ms?

@jreback (Contributor) commented Nov 21, 2013

The answer is that Series doesn't have a sophisticated handler for xs that takes a multi-index into account, while DataFrame does (it can figure out the indexer from the levels first, saving a lot of time compared to searching the entire index).

So I would call this an unimplemented optimization for Series xs with a multi-index.
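
For illustration, this is roughly what "figure out the indexer from the levels first" means, sketched with the public MultiIndex.get_loc_level method; it is only a sketch of the idea, not the actual code path inside xs.

import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_tuples([(x, y) for x in range(1000) for y in range(1000)])
s = pd.Series(np.random.randn(len(mi)), index=mi)

# Resolve the key against level 0 only: get_loc_level returns a positional
# indexer (a slice when the index is lexsorted, a boolean mask otherwise)
# plus the index that remains after dropping that level, so the key never
# has to be compared against every tuple in the full index.
indexer, remaining = mi.get_loc_level(999)
result = pd.Series(s.values[indexer], index=remaining)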

@l736x (Contributor, Author) commented Nov 22, 2013

OK, I see. Thanks for digging into it.
Is it going to be solved by the subclassing of DataFrame in 0.13, or not?
If not, could we keep this open as a desirable performance enhancement?

@jreback (Contributor) commented Nov 22, 2013

marked for 0.14

@jreback (Contributor) commented Feb 14, 2014

@l736x can you run these perf figures again on master or 0.13.1?

I think everything is fixed....

@l736x (Contributor, Author) commented Feb 15, 2014

Sorry, but 0.13.1 gives the same results as before (I didn't try master).

@jreback (Contributor) commented Feb 16, 2014

@l736x ok, easy fix, see #6364.

Series multi-index lookup was going through a very odd path (before 0.13 there were two paths, because xs was not a shared method). It is now much simpler. This is actually tricky to time because of the caching.
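
As a footnote on the timing point: one way to get repeatable "cold" numbers is to rebuild the index inside the timed function, since the In [7]-[11] runs above suggest the lazily built hash table is cached on the index object. This is only a sketch, not from the thread, and uses .loc (substitute .ix on pre-1.0 pandas).

import time
import numpy as np
import pandas as pd

tuples = [(x, y) for x in range(1000) for y in range(1000)]
values = np.random.randn(len(tuples))

def cold_lookup_seconds():
    # Rebuild the MultiIndex each run so no cached hash table is reused,
    # then time only the lookup itself, not the construction.
    idx = pd.MultiIndex.from_tuples(tuples)
    s = pd.Series(values, index=idx)
    start = time.perf_counter()
    s.loc[999]
    return time.perf_counter() - start

print([round(cold_lookup_seconds(), 4) for _ in range(3)])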

@l736x (Contributor, Author) commented Feb 16, 2014

I tested it and it works great, thanks a lot!
