PERF: .ix performance on series #5567

Closed
l736x opened this issue Nov 21, 2013 · 13 comments · Fixed by #6364
Labels: Indexing, Performance

@l736x (Contributor) commented Nov 21, 2013

Series xs, when presented with a MultiIndex, should use the DataFrame logic (whereby it can use the levels to avoid having to scan the entire index). The xs logic needs to move to core/generic.py so both Series and DataFrame can use it.


I stumbled on a weird issue.
The first time I access a series location with .ix I see a huge overhead.
This is not the case if I convert the series into a DataFrame and then access it.
I use a multi-index below because it makes the effect more visible, but the same problem is present for regular indexes. (The sessions below assume randn is available as numpy.random.randn, e.g. in an IPython --pylab session.)

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.12.0'

In [3]: mi = pd.MultiIndex.from_tuples([(x,y) for x in range(1000) for y in range(1000)])

In [4]: s = pd.Series(randn(1000000), index=mi)

In [5]: %time s.ix[999]
CPU times: user 652 ms, sys: 4 ms, total: 656 ms
Wall time: 656 ms
Out[5]:
0    -0.271328
...
999   -0.832013
Length: 1000, dtype: float64

In [6]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 249 us
Out[6]:
0    -0.271328
...
999   -0.832013
Length: 1000, dtype: float64

The behavior seems related to the index: recreating the series with the same index does not reproduce the overhead, but sorting the index does:

In [7]: s = pd.Series(randn(1000000), index=mi)

In [8]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 247 us
Out[8]:
0    -1.207499
...
999   -0.370578
Length: 1000, dtype: float64

In [9]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 202 us
Out[9]:
0    -1.207499
...
999   -0.370578
Length: 1000, dtype: float64

In [10]: s = s.sort_index()

In [11]: %time s.ix[999]
CPU times: user 856 ms, sys: 32 ms, total: 888 ms
Wall time: 890 ms
Out[11]:
0    -1.207499
...
999   -0.370578
Length: 1000, dtype: float64

In [12]: %time s.ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 213 us
Out[12]:
0    -1.207499
...
999   -0.370578
Length: 1000, dtype: float64

And now the weird part. I convert the series into a DataFrame:

In [16]: s = s.sort_index()

In [17]: %time pd.DataFrame(s).ix[999]
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.78 ms
Out[17]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 1 columns):
0    1000  non-null values
dtypes: float64(1)

In [18]: %time pd.DataFrame(s).ix[999]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 397 us
Out[18]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 1 columns):
0    1000  non-null values
dtypes: float64(1)

There is still a little overhead but nothing compared to the previous case.

It might seem an innocent problem, but for large time series the lag grows to the order of a second and can eat up a lot of performance.

Lastly, I'm sorry but I don't have easy access to the current dev version, so the problem may already be solved there (although I'd be curious to know where it comes from).

Edit: could it be linked to #4198?

@jreback (Contributor) commented Nov 21, 2013

This is correct and expected behavior. The index is hashed lazily, i.e. when a lookup first needs it.

If you find that you are doing fancy lookups a lot (say, in a loop), then you need to refactor your code to do them in a vectorized way.
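
For reference, a minimal sketch of the kind of vectorized refactor being suggested here, not taken from the thread. It is written against the modern .loc accessor (.ix has since been removed); the sizes and key range are only illustrative.

import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_tuples([(x, y) for x in range(1000) for y in range(1000)])
s = pd.Series(np.random.randn(len(mi)), index=mi)

# Slow pattern: one label lookup per iteration, each call going through the
# label-indexing machinery separately (s.ix[k] on the 0.12-era API).
pieces = {k: s.loc[k] for k in range(100)}

# Vectorized alternative: a single selection over all first-level keys,
# returning every row whose first level is in 0..99.
subset = s.loc[list(range(100))]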

@l736x (Contributor, Author) commented Nov 21, 2013

I understand that the index is hashed lazily, but I don't get why s.ix[0] triggers the hashing while DataFrame(s).ix[0] doesn't.
Is the first fancy indexing and the second not?
We are talking about a factor of more than 1000 between two operations that do exactly the same thing, and the faster one is the one that is intuitively more complicated.

@jreback (Contributor) commented Nov 21, 2013

In master

In [1]: mi = pd.MultiIndex.from_tuples([(x,y) for x in range(1000) for y in range(1000)])

In [2]: s = pd.Series(randn(1000000), index=mi)

In [3]: %timeit s.ix[999]
1 loops, best of 3: 238 µs per loop

In [4]: %timeit s.ix[999]
1000 loops, best of 3: 206 µs per loop

In [5]: s = pd.Series(randn(1000000), index=mi)

In [6]: %timeit s.ix[999]
1000 loops, best of 3: 201 µs per loop

In [7]: %timeit s.ix[999]
1000 loops, best of 3: 198 µs per loop

In [8]: s = s.sort_index()

In [9]: %timeit s.ix[999]
1 loops, best of 3: 290 µs per loop

In [10]: %timeit s.ix[999]
1000 loops, best of 3: 209 µs per loop

DataFrame

In [11]: s = s.sort_index()

In [18]: %timeit pd.DataFrame(s)
1 loops, best of 3: 1.5 s per loop

In [14]: df = pd.DataFrame(s)

In [15]: %timeit df.ix[999]
1000 loops, best of 3: 309 µs per loop

In [16]: %timeit df.ix[999]
1000 loops, best of 3: 321 µs per loop

@l736x (Contributor, Author) commented Nov 21, 2013

You must use %time, not %timeit, because the problem shows up only on the first access and %timeit reports the best of three runs in any case.

Why are you not timing pd.DataFrame(s).ix[999]?

@jreback (Contributor) commented Nov 21, 2013

The cost of the indexing is incurred in the DataFrame construction, not in the first indexing access.

@l736x (Contributor, Author) commented Nov 21, 2013

Then how do you explain that, right after an s.sort_index(), pd.DataFrame(s).ix[999] takes 4 ms while s.ix[999] takes 900 ms?

@jreback (Contributor) commented Nov 21, 2013

The answer is that Series doesn't have a sophisticated handler for xs that takes a multi-index into account, while DataFrame does (it can figure out the indexer from the levels first, saving a lot of time compared to searching the entire index).

So I would call this an unimplemented optimization for Series xs with a multi-index.
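
For illustration, this is roughly what "figure out the indexer from the levels first" means, sketched with the public MultiIndex.get_loc_level method; it is only a sketch of the idea, not the actual code path inside xs.

import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_tuples([(x, y) for x in range(1000) for y in range(1000)])
s = pd.Series(np.random.randn(len(mi)), index=mi)

# Resolve the key against level 0 only: get_loc_level returns a positional
# indexer (a slice when the index is lexsorted, a boolean mask otherwise)
# plus the index that remains after dropping that level, so the key never
# has to be compared against every tuple in the full index.
indexer, remaining = mi.get_loc_level(999)
result = pd.Series(s.values[indexer], index=remaining)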

@l736x (Contributor, Author) commented Nov 22, 2013

OK, I see. Thanks for digging into it.
Is it going to be solved by the subclassing of DataFrame in 0.13, or not?
If not, could we keep this open as a desirable performance enhancement?

@jreback (Contributor) commented Nov 22, 2013

marked for 0.14

@jreback (Contributor) commented Feb 14, 2014

@l736x can you run these perf figures again on master or 0.13.1?

I think everything is fixed....

@l736x (Contributor, Author) commented Feb 15, 2014

Sorry, but 0.13.1 gives the same results as before (I didn't try master).

@jreback (Contributor) commented Feb 16, 2014

@l736x ok, easy fix, see #6364.

Series multi-index lookup was going through a very odd path (before 0.13 there were two paths, because xs was not a shared method). It is now much simpler. This is actually tricky to time because of the caching.
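
As a footnote on the timing point: one way to get repeatable "cold" numbers is to rebuild the index inside the timed function, since the In [7]-[11] runs above suggest the lazily built hash table is cached on the index object. This is only a sketch, not from the thread, and uses .loc (substitute .ix on pre-1.0 pandas).

import time
import numpy as np
import pandas as pd

tuples = [(x, y) for x in range(1000) for y in range(1000)]
values = np.random.randn(len(tuples))

def cold_lookup_seconds():
    # Rebuild the MultiIndex each run so no cached hash table is reused,
    # then time only the lookup itself, not the construction.
    idx = pd.MultiIndex.from_tuples(tuples)
    s = pd.Series(values, index=idx)
    start = time.perf_counter()
    s.loc[999]
    return time.perf_counter() - start

print([round(cold_lookup_seconds(), 4) for _ in range(3)])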

@l736x (Contributor, Author) commented Feb 16, 2014

I tested it and it works great, thanks a lot!
