PERF: .ix performance on series #5567
Comments
This is correct and expected behavior. The index is hashed when a lookup happens (i.e. when it's needed). If you find that you are doing fancy lookups a lot (say in a loop), then you need to refactor your code to do it in a vectorized way. |
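As a rough illustration of the "vectorized" advice above, here is a minimal sketch. The data and the label list are made up for the example, and it uses .loc rather than .ix, since .ix was removed in pandas 1.0.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a Series with a sorted two-level MultiIndex.
index = pd.MultiIndex.from_product([range(1000), ["a", "b"]], names=["key", "sub"])
s = pd.Series(np.arange(len(index), dtype=float), index=index)

# Hypothetical labels we want to look up.
wanted = [(3, "a"), (500, "b"), (999, "a")]

# Loop version: one label lookup per iteration.
looped = [s.loc[label] for label in wanted]

# Vectorized version: a single lookup that takes all labels at once.
vectorized = s.loc[wanted]
```

Both versions return the same values; the second hands the whole list to the indexer in one call instead of paying the per-call overhead for every label.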
I understand that the index is hashed lazily but I don't get why s.ix[0] triggers the hash and DataFrame(s).ix[0] doesn't. |
In master
DataFrame
|
You must use %time, not %timeit, because the problem shows up only on the first access and %timeit gives you the best of three in any case. Why are you not doing pd.DataFrame(s).ix[999]? |
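Outside IPython, the first-access measurement can be approximated with a single-shot timer. This is only a sketch: the timed helper and the data are assumptions made for the example, .loc stands in for the removed .ix, and the timings on a current pandas will not necessarily show the gap reported in this thread.

```python
import time

import numpy as np
import pandas as pd


def timed(func):
    """Return (result, elapsed seconds) for exactly one call of func."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# Hypothetical data: 1000 x 100 MultiIndex, similar in spirit to the report.
index = pd.MultiIndex.from_product([range(1000), range(100)])
s = pd.Series(np.arange(len(index)), index=index)

# Time the very first lookup separately from the second one, the way
# %time would, instead of taking the best of several runs like %timeit.
first, t_first = timed(lambda: s.loc[999])
second, t_second = timed(lambda: s.loc[999])
print(f"first access: {t_first:.6f}s, second access: {t_second:.6f}s")
```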
The cost of the indexing is incurred in the DataFrame construction, not in the first indexing access. |
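To make the "who pays the hashing cost" point concrete, here is a toy sketch in plain Python. LazyIndex is a hypothetical class written for this illustration; it is not how pandas implements its index engines, it only shows the general pattern of deferring a one-time build to the first lookup.

```python
class LazyIndex:
    """Toy label index that builds its hash table on the first lookup."""

    def __init__(self, labels):
        self.labels = list(labels)
        self._table = None  # hash table not built yet

    def get_loc(self, label):
        if self._table is None:
            # One-time O(n) cost, paid by whichever lookup happens first.
            self._table = {lab: pos for pos, lab in enumerate(self.labels)}
        return self._table[label]


idx = LazyIndex(range(100_000))
built_before = idx._table is not None  # nothing built at construction time
pos = idx.get_loc(999)                 # first lookup triggers the build
built_after = idx._table is not None   # later lookups reuse the table
```

If some other operation (such as constructing a second object from the same index) triggers the build earlier, the first explicit lookup appears cheap, which matches the explanation given above.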
Then how do you explain that, right after s.sort_index(), pd.DataFrame(s).ix[999] takes 4 ms and s.ix[999] takes 900 ms? |
answer is that Series doesn't have a sophisticated handler for So I would call this an unimplemented optimization on Series xs with a multi-index |
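For readers unfamiliar with xs on a multi-index, the sketch below shows the two call paths being compared, on made-up data. It demonstrates only that Series.xs and DataFrame.xs return the same selection; it says nothing about their relative speed in any particular pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical data with named levels so the cross-section is explicit.
index = pd.MultiIndex.from_product([range(1000), ["a", "b"]], names=["key", "sub"])
s = pd.Series(np.arange(len(index)), index=index, name="value")

# Cross-section taken directly on the Series.
from_series = s.xs(999, level="key")

# Same cross-section taken through a one-column DataFrame.
from_frame = s.to_frame().xs(999, level="key")["value"]
```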
Ok, I see. Thanks for having dug into it. |
marked for 0.14 |
@l736x can you run these perf figures again on master or 0.13.1? I think everything is fixed. |
Sorry, but 0.13.1 gives the same as before (I didn't try master). |
I tested it and it works great, thanks a lot! |
Series xs when presented with a multi-index should use the data frame logic (whereby it can use the levels to avoid having to scan the entire set). Need to move the logic of xs to core/generic.py so both Series/Frame can use it.
I stumbled on a weird issue.
The first time I access a series location with ix, I have a huge overhead.
This is not the case if I transform the series into a DataFrame and then access it.
I use a MultiIndex below because it makes the effect more visible, but the same problem is present for regular indexes.
The behavior seems related to the index because recreating the series does not reproduce it, but sorting the index does:
And now the weird thing. I convert the series into a df:
There is still a little overhead but nothing compared to the previous case.
It might seem an innocent problem, but for large time series the lag becomes on the order of a second and can eat up a lot of performance.
Lastly, I'm sorry but I don't have easy access to the current dev version, so it might be that the problem is already solved (although I'd be curious to know where it comes from).
Edit: can it be linked to #4198?