PERF: unnecessary (expensive) concat #4740
Comments
Could you share the script you are using to profile? @jbrockmendel
I'm not sure that's allowed. If it helps, @YarShev and @anmyachev are looking at the same script.
@jbrockmendel Ah ok. Is it possible to at least get a minimum reproducible example?
Well, no. I know concat is getting called bc I put a
AFAICT the tooling doesn't provide a nice way to find out what call got me here.
@jbrockmendel I think In case you don't know, I didn't know that concatenating the indices would itself be so expensive. In that case, maybe what you suggested in your first post would help. The full index may live in the
@jbrockmendel I tried this approach, but it made our script slower (because of how Ray works with the object store, as @mvashishtha mentioned). In addition, for some operations, for example Is there an easy way to speed up concatenating the MultiIndex itself on the pandas side?
There's a patch that speeds up this particular case, but it may slow down other cases (so I haven't decided yet whether to upstream it to pandas):
This should help us for
I have a Ray worker that is calling `PandasDataframeAxisPartition.deploy_axis_func` and in that doing `pandas.concat` on 16 DataFrames with MultiIndex indexes, an expensive concat.

AFAIK there isn't a nice way to see what called `deploy_axis_func`, so this is a bit speculative.
I think the partitions being collected are exactly the partitions of an existing DataFrame, which I think means that frame's index is already materialized somewhere, so reconstructing it inside concat is unnecessary. i.e. in deploy_axis_func we could do something like
If I'm right here, we could see significant savings. e.g. in the script I'm profiling, ATM 5% is spent in `_get_concat_axis`, and I think a lot more indirectly inside worker processes.

Moreover, if this works, we could do the pinning/unpinning before pickling/unpickling and save on pickle costs.
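For readers without access to the profiled script, the idea in the issue body can be illustrated in plain pandas. This is only a sketch under stated assumptions, not the snippet from the issue and not Modin's implementation: `concat_with_known_index` is a hypothetical helper, and it assumes the partitions are row partitions given in the same order as the already-materialized index. It relies on `pandas.concat(..., ignore_index=True)` producing a fresh `RangeIndex` rather than combining the per-partition indexes.

```python
import numpy as np
import pandas as pd


def concat_with_known_index(frames, full_index):
    """Concatenate row partitions and attach an index that already exists.

    Hypothetical helper for illustration; not part of pandas or Modin.
    Assumes `frames` are in the same row order as `full_index`.
    """
    total_rows = sum(len(frame) for frame in frames)
    if total_rows != len(full_index):
        raise ValueError("partitions do not cover the supplied index")
    # ignore_index=True gives the result a new RangeIndex, so the
    # MultiIndex pieces of the partitions are not concatenated.
    result = pd.concat(frames, ignore_index=True)
    # Reuse the index that was materialized elsewhere.
    result.index = full_index
    return result


# Small example: a frame with a MultiIndex, split into four row partitions.
full_index = pd.MultiIndex.from_product(
    [range(4), ["a", "b", "c"]], names=["outer", "inner"]
)
df = pd.DataFrame(
    {"x": np.arange(12), "y": np.arange(12) * 2.0}, index=full_index
)
partitions = [df.iloc[start : start + 3] for start in range(0, 12, 3)]

fast = concat_with_known_index(partitions, full_index)
slow = pd.concat(partitions)  # combines the MultiIndex pieces
```

Both paths should give the same frame. Whether the first is actually faster depends on the size of the index and on how expensive it is to get the full index to the worker, which is the object-store concern raised in the comments.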