Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: move actual lookup and dispatch to array_op from frame into internals #39772

Conversation

jorisvandenbossche
Copy link
Member

Background: I am looking at ways to optimize ops for ArrayManager, and currently the manager hasn't much control (apart from the operate_blockwise): it is the DataFrame in _dispatch_frame_op that decides which array_op function to use and actually calls it on the columns or dispatches it to the manager (with apply or operate_blockwise). For ArrayManager, more control would be useful (one example: deciding whether to use numexpr or not now takes a lot of time in the benchmarks for ArrayManager, because it gets done column by column, while we could do it more efficiently inside the ArrayManager).

So the idea would then be:

  • DataFrame is responsible for all preprocessing: checking the right argument, alignment, etc (which is already done right now in DataFrame._arith_method -> ops.align_method_FRAME). This happens on the DataFrame level because it involves frame-level logic (alignment) and can be shared for BlockManager/ArrayManager
  • Manager is responsible for executing the actual operation on clean arguments: it gets either a scalar, array (with correct length to either match to axis 0 or 1) or another Manager (of the same size).

Note that making the Manager responsible to call the execution of the op, doesn't mean the actual code needs to live in the internals. It can still continue to use to array_ops code in /core/ops (and also the ArrayManager will continue to do that, but with some adaptations, eg with some other arguments like passing through a pre-determined use_numexpr).

So for the BlockManger, the diff is only small here (it's mainly moving the case of frame/series which works column by column into BlockManager). I think it's actually an OK change for BlockManager on itself as well, but so the main goal is to allow more customization for ArrayManager in follow-up PRs.

Thoughts on the general idea?

cc @jbrockmendel @jreback

@jorisvandenbossche jorisvandenbossche added Refactor Internal refactoring of code Internals Related to non-user accessible pandas implementation labels Feb 12, 2021
array_op = ops.get_array_op(op)
return self.apply(array_op, right=other)

def operate_array(self, other: ArrayLike, op, axis: int) -> BlockManager:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note I currently add 3 specialized methods (operate_scalar, operate_array and operate_manager) for the three possible cases that are left to handle here.
But it could of course also be a single operate method that combines the three with some if/elif/else checks (but those checks would duplicate a bit the checks done in _dispatch_frame_op where those get called)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this.

pandas/core/internals/managers.py Outdated Show resolved Hide resolved
pandas/core/frame.py Outdated Show resolved Hide resolved
@jbrockmendel
Copy link
Member

no strong opinion on this


elif isinstance(right, Series):
assert right.index.equals(self.index) # Handle other cases later
right = right._values
if axis == 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert right.index.equals(self.get_axis(axis))

@jbrockmendel
Copy link
Member

is the numexpr thing the main way you intend to use this for perf? If so, why not do the check once in get_array_op and it will benefit non-ArrayManager too. Otherwise all this is doing is calling get_array_op in 6 places instead of 1

@jorisvandenbossche
Copy link
Member Author

The numexpr check depends on the dtype, which is something get_array_op doesn't know about.
Another potential use case is what I am exploring in #39820, another use case if np.errstate handling

@jbrockmendel
Copy link
Member

The numexpr check depends on the dtype

if its dtype-dependent, how do you avoid re-doing it for each column? IIRC theres a length check involved that could be done just once.

@jorisvandenbossche
Copy link
Member Author

The dtype check for numexpr might indeed not be easy to do more efficiently in advance.

Now, another case I ran into that can use this. Currently, we broadcast Series to DataFrame before calling the array_op in _dispatch_frame_op (using _maybe_align_series_as_frame in align_method_FRAME). That's not something we want to do for the ArrayManager, and so this logic of broadcasting could live on the BlockManager.
(of course, I could also add a check for this and skip it for ArrayManager, if we are OK with such special casing in the ops code)

@jorisvandenbossche
Copy link
Member Author

@jbrockmendel are you OK with moving forward with this?

(Personally, I find this a clean-up in general. DataFrame doesn't need to know how the BlockManager executes the operation (blockwise, column-by-column, ..))

@jbrockmendel
Copy link
Member

are you OK with moving forward with this?

Looking at this again fresh, I dont see the point/upside. IIRC there are follow-ups where the actual benefits kick in? More recent comments suggest the numexpr follow-up may not be so clear, and i dont see how the np.seterr optimization is affected by this.

For the with-Series cases, in the status quo we use only DataFrame methods, which we generally prefer over doing things in internals. Whats the upside to moving all of that code into internals? The more things we put on AM/BM, the more ways there are for AM/BM to behave differently.

This replaces calling get_array_op in one places with calling it in 6.

So I have a hard time seeing this as a cleanup.

The main perf benefit I can imagine here is by not going through ArrayManager.apply for the scalar case. I think we'd get _most_of that benefit by adding an escape-hatch to ArrayManager.apply that checks for if not aligned_kwargs outside of the loop and goes down a fastpath in that case.

@jorisvandenbossche
Copy link
Member Author

I can leave the get_array_op inside _dispatch_frame_op for now, and pass this to the Manager method (to avoid calling this 6 times in the internals). I think that should address part of your concerns?

Further, some arguments that this is a clean-up in general (partly regardless of AM / BM, and regardless of possible performance changes):

  • We already call Manager methods from inside _dispatch_frame_op: _mgr.apply(..) for the "op(df, scalar)" case and _mgr.operate_blockwise(..) for the "op(df, df)" case. It's only for the "op(df, series)" case where we iterate over the column arrays (_iter_column_arrays) and call array_op directly inside _dispatch_frame_op.
    So in that regard, this PR makes that more consistent (move the "op(df, series)" case also into the Manager) and rationalizes the names a bit (operate_scalar/operate_manager instead of apply/operate_blockwise, where operate_scalar can be an alias of apply for BM).
  • Currently, it's the DataFrame (more specifically ops.align_method_FRAME) that decides to broadcast a Series to a DataFrame before performing the array ops. This is logic specific to the BlockManager, that therefore could live in the BlockManager (as mentioned above, I can of course easily add cases of if isinstance(mgr, BlockManager) in the dataframe code, but in general that can point to a wrong separation of concerns, IMO).

For the with-Series cases, in the status quo we use only DataFrame methods, which we generally prefer over doing things in internals. Whats the upside to moving all of that code into internals? The more things we put on AM/BM, the more ways there are for AM/BM to behave differently.

See my second bullet point above. Of course AM/BM shouldn't behave differently, but the "op(df, series)" case is one where AM/BM should be implemented differently. If a part of the implementation depends on the manager, that part should ideally live in the internals?
(and as long as we continue to call array_op in both AM/BM, and the only logic in AM/BM is how to call that function on (column/block) arrays, that should ensure they behave consistently)

@jorisvandenbossche
Copy link
Member Author

(and to be clear: although I think this would be a good change, this doesn't block other performance related PRs at the moment, I can also get around with some if isinstance(mgr, Block/ArrayManager) checks)

@jbrockmendel
Copy link
Member

I can leave the get_array_op inside _dispatch_frame_op for now, and pass this to the Manager method (to avoid calling this 6 times in the internals). I think that should address part of your concerns?

marginally, yes.

[calling ops.align_method_FRAME] is logic specific to the BlockManager, that therefore could live in the BlockManager

This is a fair point. Could implement a maybe_align_for_op that does this for BlockManager and is a no-op for ArrayManager (i guess in #40482)

If a part of the implementation depends on the manager, that part should ideally live in the internals?

AFAICT the hypothetical maybe_align_for_op would be exactly that part?

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@jorisvandenbossche
Copy link
Member Author

@jbrockmendel you're OK with the current version?

@jbrockmendel
Copy link
Member

@jbrockmendel you're OK with the current version?

It isn't obvious to me how you've addressed any of my comments. Can you explain?

@jorisvandenbossche
Copy link
Member Author

I will probably have misunderstood you then. From your email I understood you were generally on board now with this PR (based on the argument that the BlockManager-specific alignment should live in the BlockManager). But I assume you only meant that you're OK with having the alignment itself live in the BlockManager (as you suggested above to have a BlockManager.maybe_align_for_op), and not this general PR?

Regarding this alignment, after this PR, I would put that inside BlockManager.operate_array (and then ArrayManager.operate_array would be free to not do this alignment). And then ops.align_method_FRAME doesn't have to call any BlockManager method, as the ops-related interaction with the internals is channeled through the operate_.. methods.

AFAIK the actionable comment for this PR (except for closing it) is:

I can leave the get_array_op inside _dispatch_frame_op for now, and pass this to the Manager method (to avoid calling this 6 times in the internals). I think that should address part of your concerns?

marginally, yes.

I can still do that, if that makes it acceptable (that was basically my question in my previous comment, but unclearly stated)

@jbrockmendel
Copy link
Member

But I assume you only meant that you're OK with having the alignment itself live in the BlockManager (as you suggested above to have a BlockManager.maybe_align_for_op), and not this general PR?

Correct.

@jorisvandenbossche
Copy link
Member Author

Is there then any actionable comment that I can address in this PR?

@jbrockmendel
Copy link
Member

Is there then any actionable comment that I can address in this PR?

  1. Moving get_array_op back to just being called once would be marginally better

  2. did the maybe_align_for_op thing get implemented somewhere along the line?

  3. IIUC operate_arrays is effectively the same for both AM and BM, so we'd be better off keeping that code out of internals entirely.

  4. I don't see how AM.operate_scalar is meaningfully different from AM.apply. Is there some non-trivial overhead it is avoiding?

  5. passing consolidate=False to create_block_manager_from_arrays may marginally improve perf. (though ideally id rather do that at a higher level in something like DataFrame.from_arrays

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label May 26, 2021
@jreback
Copy link
Contributor

jreback commented Nov 28, 2021

this is quite old, happen to reopen if actively worked on.

@jreback jreback closed this Nov 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Internals Related to non-user accessible pandas implementation Refactor Internal refactoring of code Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants