REF: move actual lookup and dispatch to array_op from frame into internals #39772
pandas/core/internals/managers.py
        array_op = ops.get_array_op(op)
        return self.apply(array_op, right=other)

    def operate_array(self, other: ArrayLike, op, axis: int) -> BlockManager:
Note I currently add 3 specialized methods (operate_scalar, operate_array and operate_manager) for the three possible cases that are left to handle here.
But it could of course also be a single operate method that combines the three with some if/elif/else checks (but those checks would somewhat duplicate the checks done in _dispatch_frame_op, where those get called).
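The trade-off described above can be illustrated with a small toy sketch. Everything here is hypothetical: `ToyManager` is a plain list-of-columns stand-in, not the BlockManager or ArrayManager, and only the method names are borrowed from this comment. It shows the three specialized methods next to a single `operate` entry point whose type checks repeat what the caller already knows.

```python
import operator
from typing import Callable, List, Sequence, Union

Number = Union[int, float]


class ToyManager:
    """Hypothetical stand-in for a manager: a list of equal-length columns."""

    def __init__(self, columns: List[List[Number]]) -> None:
        self.columns = columns

    def operate_scalar(self, other: Number, op: Callable) -> "ToyManager":
        # The same scalar is applied to every element of every column.
        return ToyManager([[op(v, other) for v in col] for col in self.columns])

    def operate_array(
        self, other: Sequence[Number], op: Callable, axis: int
    ) -> "ToyManager":
        if axis == 0:
            # Assumption of this toy: `other` holds one value per row.
            return ToyManager(
                [[op(v, o) for v, o in zip(col, other)] for col in self.columns]
            )
        # axis == 1: `other` holds one value per column.
        return ToyManager(
            [[op(v, o) for v in col] for col, o in zip(self.columns, other)]
        )

    def operate_manager(self, other: "ToyManager", op: Callable) -> "ToyManager":
        # Element-wise op between two managers with the same shape.
        return ToyManager(
            [
                [op(a, b) for a, b in zip(left, right)]
                for left, right in zip(self.columns, other.columns)
            ]
        )

    def operate(self, other, op: Callable, axis: int = 0) -> "ToyManager":
        # Single entry point: these checks duplicate what the caller
        # already determined before deciding to call into the manager.
        if isinstance(other, ToyManager):
            return self.operate_manager(other, op)
        if isinstance(other, (list, tuple)):
            return self.operate_array(other, op, axis)
        return self.operate_scalar(other, op)


mgr = ToyManager([[1, 2, 3], [10, 20, 30]])
result = mgr.operate_array([100, 200], operator.add, axis=1)
```

Both designs give the same results; the difference is only whether the dispatch on the type of `other` happens once in the caller or a second time inside the manager.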
+1 on this.
No strong opinion on this.
pandas/core/frame.py

        elif isinstance(right, Series):
            assert right.index.equals(self.index)  # Handle other cases later
            right = right._values
            if axis == 1:
assert right.index.equals(self.get_axis(axis))
Is the numexpr thing the main way you intend to use this for perf? If so, why not do the check once in get_array_op, so that it benefits non-ArrayManager too? Otherwise all this is doing is calling get_array_op in 6 places instead of 1.
The numexpr check depends on the dtype, which is something
If it's dtype-dependent, how do you avoid re-doing it for each column? IIRC there's a length check involved that could be done just once.
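As a rough illustration of the question above, the sketch below hoists a length check out of the per-column loop and caches a dtype-dependent check per distinct dtype. It is a toy under stated assumptions: `MIN_ELEMENTS`, `ACCELERATED_DTYPES`, `dtype_supported` and `plan_fast_path` are hypothetical names and values, not pandas' or numexpr's actual rules.

```python
from typing import Dict, List, Tuple

# Hypothetical values for illustration only; not the real thresholds or rules.
MIN_ELEMENTS = 1_000
ACCELERATED_DTYPES = {"int64", "float64", "bool"}

Column = Tuple[str, List[float]]  # (dtype name, values)


def dtype_supported(dtype: str) -> bool:
    """Stand-in for a dtype-dependent eligibility check."""
    return dtype in ACCELERATED_DTYPES


def plan_fast_path(columns: List[Column]) -> List[bool]:
    """Decide, per column, whether an accelerated path could be used."""
    if not columns:
        return []
    n_rows = len(columns[0][1])
    # Length check: all columns share the same length, so do it once.
    if n_rows < MIN_ELEMENTS:
        return [False] * len(columns)
    # Dtype check: still dtype-dependent, but evaluated once per distinct dtype.
    cache: Dict[str, bool] = {}
    plan: List[bool] = []
    for dtype, _values in columns:
        if dtype not in cache:
            cache[dtype] = dtype_supported(dtype)
        plan.append(cache[dtype])
    return plan


big = [
    ("float64", [0.0] * 2000),
    ("object", [0.0] * 2000),
    ("float64", [0.0] * 2000),
]
plan = plan_fast_path(big)
```

Whether the real check can be split this cleanly is exactly what is being discussed here; the sketch only shows the shape of "check once, then reuse".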
The dtype check for numexpr might indeed not be easy to do more efficiently in advance. Now, another case I ran into that can use this. Currently, we broadcast Series to DataFrame before calling the array_op in
@jbrockmendel are you OK with moving forward with this? (Personally, I find this a clean-up in general. DataFrame doesn't need to know how the BlockManager executes the operation (blockwise, column-by-column, ..))
Looking at this again fresh, I don't see the point/upside. IIRC there are follow-ups where the actual benefits kick in? More recent comments suggest the numexpr follow-up may not be so clear, and I don't see how the np.seterr optimization is affected by this. For the with-Series cases, in the status quo we use only DataFrame methods, which we generally prefer over doing things in internals. What's the upside to moving all of that code into internals? The more things we put on AM/BM, the more ways there are for AM/BM to behave differently. This replaces calling get_array_op in one place with calling it in 6. So I have a hard time seeing this as a cleanup. The main perf benefit I can imagine here is by not going through ArrayManager.apply for the scalar case. I think we'd get most of that benefit by adding an escape-hatch to ArrayManager.apply that checks for
I can leave the
Further, some arguments that this is a clean-up in general (partly regardless of AM / BM, and regardless of possible performance changes):
See my second bullet point above. Of course AM/BM shouldn't behave differently, but the "op(df, series)" case is one where AM/BM should be implemented differently. If a part of the implementation depends on the manager, that part should ideally live in the internals?
(and to be clear: although I think this would be a good change, this doesn't block other performance related PRs at the moment, I can also get around with some
marginally, yes.
This is a fair point. Could implement a
AFAICT the hypothetical
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
@jbrockmendel you're OK with the current version?
It isn't obvious to me how you've addressed any of my comments. Can you explain?
I will probably have misunderstood you then. From your email I understood you were generally on board now with this PR (based on the argument that the BlockManager-specific alignment should live in the BlockManager). But I assume you only meant that you're OK with having the alignment itself live in the BlockManager (as you suggested above to have a
Regarding this alignment, after this PR, I would put that inside
AFAIK the actionable comment for this PR (except for closing it) is:
I can still do that, if that makes it acceptable (that was basically my question in my previous comment, but unclearly stated)
Correct.
Is there then any actionable comment that I can address in this PR?
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
This is quite old; happy to reopen if actively worked on.
Background: I am looking at ways to optimize ops for ArrayManager, and currently the manager doesn't have much control (apart from operate_blockwise): it is the DataFrame in _dispatch_frame_op that decides which array_op function to use and actually calls it on the columns or dispatches it to the manager (with apply or operate_blockwise). For ArrayManager, more control would be useful (one example: deciding whether to use numexpr or not now takes a lot of time in the benchmarks for ArrayManager, because it gets done column by column, while we could do it more efficiently inside the ArrayManager).
So the idea would then be:
right argument, alignment, etc (which is already done right now in DataFrame._arith_method -> ops.align_method_FRAME). This happens on the DataFrame level because it involves frame-level logic (alignment) and can be shared for BlockManager/ArrayManager
Note that making the Manager responsible for calling the execution of the op doesn't mean the actual code needs to live in the internals. It can still continue to use the array_ops code in /core/ops (and the ArrayManager will also continue to do that, but with some adaptations, e.g. with some other arguments like passing through a pre-determined use_numexpr).
So for the BlockManager, the diff here is only small (it's mainly moving the case of frame/series, which works column by column, into BlockManager). I think it's actually an OK change for BlockManager in itself as well, but the main goal is to allow more customization for ArrayManager in follow-up PRs.
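The proposed split of responsibilities can be sketched with toy classes. This is not pandas code: `ToyFrame`, `ColumnwiseManager` and `RowMajorManager` are hypothetical stand-ins, and the alignment is reduced to a dictionary lookup. The point is only that the frame-level step is shared, while each manager chooses how to execute the op on its own storage layout.

```python
import operator
from typing import Callable, Dict, List


class ColumnwiseManager:
    """Hypothetical manager that stores and operates on one column at a time."""

    def __init__(self, columns: List[List[float]]) -> None:
        self.columns = columns

    def operate_array(self, other: List[float], op: Callable) -> List[List[float]]:
        # One pass per column, pairing each row value with `other`.
        return [[op(v, o) for v, o in zip(col, other)] for col in self.columns]


class RowMajorManager:
    """Hypothetical manager that stores the same data as rows of one block."""

    def __init__(self, columns: List[List[float]]) -> None:
        self.rows = [list(row) for row in zip(*columns)]

    def operate_array(self, other: List[float], op: Callable) -> List[List[float]]:
        # One pass per row, then transpose back to a list of columns.
        rows = [[op(v, o) for v in row] for row, o in zip(self.rows, other)]
        return [list(col) for col in zip(*rows)]


class ToyFrame:
    """Hypothetical frame: row labels plus a manager holding the data."""

    def __init__(self, index: List[str], mgr) -> None:
        self.index = index
        self._mgr = mgr

    def arith(self, right: Dict[str, float], op: Callable) -> List[List[float]]:
        # Frame level: alignment is shared logic, independent of the manager.
        aligned = [right[label] for label in self.index]
        # Manager level: how the op is executed is the manager's choice.
        return self._mgr.operate_array(aligned, op)


columns = [[1.0, 2.0], [3.0, 4.0]]
right = {"b": 10.0, "a": 100.0}
by_column = ToyFrame(["a", "b"], ColumnwiseManager(columns)).arith(right, operator.add)
by_row = ToyFrame(["a", "b"], RowMajorManager(columns)).arith(right, operator.add)
```

Both frames return the same result even though the two managers iterate over the data differently, which is the property the proposal relies on.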
Thoughts on the general idea?
cc @jbrockmendel @jreback