REF: move actual lookup and dispatch to array_op from frame into internals #39772

jorisvandenbossche · 2021-02-12T11:10:27Z

Background: I am looking at ways to optimize ops for ArrayManager, and currently the manager hasn't much control (apart from the operate_blockwise): it is the DataFrame in _dispatch_frame_op that decides which array_op function to use and actually calls it on the columns or dispatches it to the manager (with apply or operate_blockwise). For ArrayManager, more control would be useful (one example: deciding whether to use numexpr or not now takes a lot of time in the benchmarks for ArrayManager, because it gets done column by column, while we could do it more efficiently inside the ArrayManager).

So the idea would then be:

DataFrame is responsible for all preprocessing: checking the right argument, alignment, etc (which is already done right now in DataFrame._arith_method -> ops.align_method_FRAME). This happens on the DataFrame level because it involves frame-level logic (alignment) and can be shared for BlockManager/ArrayManager
Manager is responsible for executing the actual operation on clean arguments: it gets either a scalar, array (with correct length to either match to axis 0 or 1) or another Manager (of the same size).

Note that making the Manager responsible to call the execution of the op, doesn't mean the actual code needs to live in the internals. It can still continue to use to array_ops code in /core/ops (and also the ArrayManager will continue to do that, but with some adaptations, eg with some other arguments like passing through a pre-determined use_numexpr).

So for the BlockManger, the diff is only small here (it's mainly moving the case of frame/series which works column by column into BlockManager). I think it's actually an OK change for BlockManager on itself as well, but so the main goal is to allow more customization for ArrayManager in follow-up PRs.

Thoughts on the general idea?

cc @jbrockmendel @jreback

…rnals

jorisvandenbossche · 2021-02-12T11:12:21Z

pandas/core/internals/managers.py

+        array_op = ops.get_array_op(op)
+        return self.apply(array_op, right=other)
+
+    def operate_array(self, other: ArrayLike, op, axis: int) -> BlockManager:


Note I currently add 3 specialized methods (operate_scalar, operate_array and operate_manager) for the three possible cases that are left to handle here.
But it could of course also be a single operate method that combines the three with some if/elif/else checks (but those checks would duplicate a bit the checks done in _dispatch_frame_op where those get called)

pandas/core/internals/managers.py

jreback

+1 on this.

pandas/core/internals/managers.py

pandas/core/frame.py

jbrockmendel · 2021-02-12T18:04:24Z

no strong opinion on this

jreback · 2021-02-15T17:16:01Z

pandas/core/frame.py


        elif isinstance(right, Series):
-            assert right.index.equals(self.index)  # Handle other cases later
-            right = right._values
+            if axis == 1:


assert right.index.equals(self.get_axis(axis))

jbrockmendel · 2021-02-15T18:30:28Z

is the numexpr thing the main way you intend to use this for perf? If so, why not do the check once in get_array_op and it will benefit non-ArrayManager too. Otherwise all this is doing is calling get_array_op in 6 places instead of 1

jorisvandenbossche · 2021-02-15T18:49:31Z

The numexpr check depends on the dtype, which is something get_array_op doesn't know about.
Another potential use case is what I am exploring in #39820, another use case if np.errstate handling

jbrockmendel · 2021-02-15T18:56:39Z

The numexpr check depends on the dtype

if its dtype-dependent, how do you avoid re-doing it for each column? IIRC theres a length check involved that could be done just once.

jorisvandenbossche · 2021-03-15T14:36:18Z

The dtype check for numexpr might indeed not be easy to do more efficiently in advance.

Now, another case I ran into that can use this. Currently, we broadcast Series to DataFrame before calling the array_op in _dispatch_frame_op (using _maybe_align_series_as_frame in align_method_FRAME). That's not something we want to do for the ArrayManager, and so this logic of broadcasting could live on the BlockManager.
(of course, I could also add a check for this and skip it for ArrayManager, if we are OK with such special casing in the ops code)

jorisvandenbossche · 2021-03-16T21:25:31Z

@jbrockmendel are you OK with moving forward with this?

(Personally, I find this a clean-up in general. DataFrame doesn't need to know how the BlockManager executes the operation (blockwise, column-by-column, ..))

jbrockmendel · 2021-03-17T02:55:13Z

are you OK with moving forward with this?

Looking at this again fresh, I dont see the point/upside. IIRC there are follow-ups where the actual benefits kick in? More recent comments suggest the numexpr follow-up may not be so clear, and i dont see how the np.seterr optimization is affected by this.

For the with-Series cases, in the status quo we use only DataFrame methods, which we generally prefer over doing things in internals. Whats the upside to moving all of that code into internals? The more things we put on AM/BM, the more ways there are for AM/BM to behave differently.

This replaces calling get_array_op in one places with calling it in 6.

So I have a hard time seeing this as a cleanup.

The main perf benefit I can imagine here is by not going through ArrayManager.apply for the scalar case. I think we'd get _most_of that benefit by adding an escape-hatch to ArrayManager.apply that checks for if not aligned_kwargs outside of the loop and goes down a fastpath in that case.

jorisvandenbossche · 2021-03-17T11:29:26Z

I can leave the get_array_op inside _dispatch_frame_op for now, and pass this to the Manager method (to avoid calling this 6 times in the internals). I think that should address part of your concerns?

Further, some arguments that this is a clean-up in general (partly regardless of AM / BM, and regardless of possible performance changes):

We already call Manager methods from inside _dispatch_frame_op: _mgr.apply(..) for the "op(df, scalar)" case and _mgr.operate_blockwise(..) for the "op(df, df)" case. It's only for the "op(df, series)" case where we iterate over the column arrays (_iter_column_arrays) and call array_op directly inside _dispatch_frame_op.
So in that regard, this PR makes that more consistent (move the "op(df, series)" case also into the Manager) and rationalizes the names a bit (operate_scalar/operate_manager instead of apply/operate_blockwise, where operate_scalar can be an alias of apply for BM).
Currently, it's the DataFrame (more specifically ops.align_method_FRAME) that decides to broadcast a Series to a DataFrame before performing the array ops. This is logic specific to the BlockManager, that therefore could live in the BlockManager (as mentioned above, I can of course easily add cases of if isinstance(mgr, BlockManager) in the dataframe code, but in general that can point to a wrong separation of concerns, IMO).

For the with-Series cases, in the status quo we use only DataFrame methods, which we generally prefer over doing things in internals. Whats the upside to moving all of that code into internals? The more things we put on AM/BM, the more ways there are for AM/BM to behave differently.

See my second bullet point above. Of course AM/BM shouldn't behave differently, but the "op(df, series)" case is one where AM/BM should be implemented differently. If a part of the implementation depends on the manager, that part should ideally live in the internals?
(and as long as we continue to call array_op in both AM/BM, and the only logic in AM/BM is how to call that function on (column/block) arrays, that should ensure they behave consistently)

jorisvandenbossche · 2021-03-17T12:02:03Z

(and to be clear: although I think this would be a good change, this doesn't block other performance related PRs at the moment, I can also get around with some if isinstance(mgr, Block/ArrayManager) checks)

jbrockmendel · 2021-03-17T22:56:12Z

I can leave the get_array_op inside _dispatch_frame_op for now, and pass this to the Manager method (to avoid calling this 6 times in the internals). I think that should address part of your concerns?

marginally, yes.

[calling ops.align_method_FRAME] is logic specific to the BlockManager, that therefore could live in the BlockManager

This is a fair point. Could implement a maybe_align_for_op that does this for BlockManager and is a no-op for ArrayManager (i guess in #40482)

If a part of the implementation depends on the manager, that part should ideally live in the internals?

AFAICT the hypothetical maybe_align_for_op would be exactly that part?

github-actions · 2021-04-17T00:05:45Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

jorisvandenbossche · 2021-04-21T13:14:59Z

@jbrockmendel you're OK with the current version?

jbrockmendel · 2021-04-21T15:34:11Z

@jbrockmendel you're OK with the current version?

It isn't obvious to me how you've addressed any of my comments. Can you explain?

jorisvandenbossche · 2021-04-21T21:46:12Z

I will probably have misunderstood you then. From your email I understood you were generally on board now with this PR (based on the argument that the BlockManager-specific alignment should live in the BlockManager). But I assume you only meant that you're OK with having the alignment itself live in the BlockManager (as you suggested above to have a BlockManager.maybe_align_for_op), and not this general PR?

Regarding this alignment, after this PR, I would put that inside BlockManager.operate_array (and then ArrayManager.operate_array would be free to not do this alignment). And then ops.align_method_FRAME doesn't have to call any BlockManager method, as the ops-related interaction with the internals is channeled through the operate_.. methods.

AFAIK the actionable comment for this PR (except for closing it) is:

I can leave the get_array_op inside _dispatch_frame_op for now, and pass this to the Manager method (to avoid calling this 6 times in the internals). I think that should address part of your concerns?

marginally, yes.

I can still do that, if that makes it acceptable (that was basically my question in my previous comment, but unclearly stated)

jbrockmendel · 2021-04-22T01:46:37Z

But I assume you only meant that you're OK with having the alignment itself live in the BlockManager (as you suggested above to have a BlockManager.maybe_align_for_op), and not this general PR?

Correct.

jorisvandenbossche · 2021-04-22T07:18:03Z

Is there then any actionable comment that I can address in this PR?

jbrockmendel · 2021-04-22T20:57:59Z

Is there then any actionable comment that I can address in this PR?

Moving get_array_op back to just being called once would be marginally better
did the maybe_align_for_op thing get implemented somewhere along the line?
IIUC operate_arrays is effectively the same for both AM and BM, so we'd be better off keeping that code out of internals entirely.
I don't see how AM.operate_scalar is meaningfully different from AM.apply. Is there some non-trivial overhead it is avoiding?
passing consolidate=False to create_block_manager_from_arrays may marginally improve perf. (though ideally id rather do that at a higher level in something like DataFrame.from_arrays

github-actions · 2021-05-26T00:08:17Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

jreback · 2021-11-28T21:04:10Z

this is quite old, happen to reopen if actively worked on.

REF: move actual lookup and dispatch to array_op from frame into inte…

c454025

…rnals

jorisvandenbossche added Refactor Internal refactoring of code Internals Related to non-user accessible pandas implementation labels Feb 12, 2021

jorisvandenbossche commented Feb 12, 2021

View reviewed changes

jreback reviewed Feb 12, 2021

View reviewed changes

pandas/core/internals/managers.py Outdated Show resolved Hide resolved

jreback requested changes Feb 12, 2021

View reviewed changes

pandas/core/internals/managers.py Outdated Show resolved Hide resolved

jbrockmendel reviewed Feb 12, 2021

View reviewed changes

pandas/core/frame.py Outdated Show resolved Hide resolved

jorisvandenbossche added 3 commits February 15, 2021 13:03

Merge remote-tracking branch 'upstream/master' into refactop-manager-ops

81b1016

update comments and docstrings

c6ed357

add ArrayManager versions

1f8622e

jorisvandenbossche mentioned this pull request Feb 15, 2021

REF: call ensure_wrapped_if_datetimelike before the array arithmetic_op #39820

Merged

jreback requested changes Feb 15, 2021

View reviewed changes

jorisvandenbossche added 2 commits February 15, 2021 18:57

single index assert for axis=0/1

a487f0f

update docstring

9fe7f44

jorisvandenbossche mentioned this pull request Feb 15, 2021

PERF: use arr.size instead of np.prod(arr.shape) in _can_use_numexpr #39825

Merged

jbrockmendel mentioned this pull request Feb 16, 2021

[ArrayManager] REF: Implement concat with reindexing #39612

Merged

jorisvandenbossche mentioned this pull request Mar 9, 2021

Refactor - ArrayManager overview issue #39146

Closed

11 tasks

Merge remote-tracking branch 'upstream/master' into refactop-manager-ops

3395743

Merge remote-tracking branch 'upstream/master' into refactop-manager-ops

a1495f4

jorisvandenbossche mentioned this pull request Mar 17, 2021

REF: avoid broadcasting of Series to DataFrame in ops for ArrayManager #40482

Closed

github-actions bot added the Stale label Apr 17, 2021

jorisvandenbossche removed the Stale label Apr 20, 2021

Merge remote-tracking branch 'upstream/master' into refactop-manager-ops

b5d7ab2

jorisvandenbossche mentioned this pull request Apr 23, 2021

PERF/REF: Check use of numexpr earlier in the DataFrame operation #41122

Closed

github-actions bot added the Stale label May 26, 2021

jreback closed this Nov 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

REF: move actual lookup and dispatch to array_op from frame into internals #39772

REF: move actual lookup and dispatch to array_op from frame into internals #39772

jorisvandenbossche commented Feb 12, 2021

jorisvandenbossche Feb 12, 2021

jreback left a comment

jbrockmendel commented Feb 12, 2021

jreback Feb 15, 2021

jbrockmendel commented Feb 15, 2021

jorisvandenbossche commented Feb 15, 2021

jbrockmendel commented Feb 15, 2021

jorisvandenbossche commented Mar 15, 2021

jorisvandenbossche commented Mar 16, 2021

jbrockmendel commented Mar 17, 2021

jorisvandenbossche commented Mar 17, 2021

jorisvandenbossche commented Mar 17, 2021

jbrockmendel commented Mar 17, 2021

github-actions bot commented Apr 17, 2021

jorisvandenbossche commented Apr 21, 2021

jbrockmendel commented Apr 21, 2021

jorisvandenbossche commented Apr 21, 2021

jbrockmendel commented Apr 22, 2021

jorisvandenbossche commented Apr 22, 2021

jbrockmendel commented Apr 22, 2021

github-actions bot commented May 26, 2021

jreback commented Nov 28, 2021

REF: move actual lookup and dispatch to array_op from frame into internals #39772

REF: move actual lookup and dispatch to array_op from frame into internals #39772

Conversation

jorisvandenbossche commented Feb 12, 2021

jorisvandenbossche Feb 12, 2021

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

jbrockmendel commented Feb 12, 2021

jreback Feb 15, 2021

Choose a reason for hiding this comment

jbrockmendel commented Feb 15, 2021

jorisvandenbossche commented Feb 15, 2021

jbrockmendel commented Feb 15, 2021

jorisvandenbossche commented Mar 15, 2021

jorisvandenbossche commented Mar 16, 2021

jbrockmendel commented Mar 17, 2021

jorisvandenbossche commented Mar 17, 2021

jorisvandenbossche commented Mar 17, 2021

jbrockmendel commented Mar 17, 2021

github-actions bot commented Apr 17, 2021

jorisvandenbossche commented Apr 21, 2021

jbrockmendel commented Apr 21, 2021

jorisvandenbossche commented Apr 21, 2021

jbrockmendel commented Apr 22, 2021

jorisvandenbossche commented Apr 22, 2021

jbrockmendel commented Apr 22, 2021

github-actions bot commented May 26, 2021

jreback commented Nov 28, 2021