Refactor - ArrayManager overview issue #39146
Comments
One design question that was still left open is what to do with Series, for which we currently have a
A while ago I thought the second option could be an attractive simplification (in the end, a Series "just" consists of an array and an index, so why would it need a manager?). But that was probably a bit naive ;) The Manager still does quite a few things. Moreover, a SingleArrayManager keeps the changes more limited (we can still see later whether getting rid of Single(Block/Array)Manager is an option we want to explore, independent of the BlockManager vs ArrayManager debate), and for implementing certain features consistently between Series and DataFrame, having both backed by an underlying manager is probably useful. Now, for the actual
I am currently testing out the approach of a separate SingleArrayManager class to see what is needed to implement it fully.
Some more benchmark results: groupby
Reductions (stat_ops)
Element-wise ops (arithmetic)
Over the last weeks I have been updating the status of this project (and fixing some regressions), and rerunning the benchmarks. This (long) post gives an overview of the current ASV benchmarks with the ArrayManager.
Technical notes: I always ran the benchmark on a commit where I changed the default to ArrayManager (HEAD) vs the previous commit with the normal default of BlockManager (HEAD~1) ->
I am going to split the results here by topic / file (each time with a small discussion, repeating some stuff from above), but the results of the full run are also included at the bottom.
ToNumpy
All results combined
Thanks for putting this together @jorisvandenbossche. The discussion seemed quite fair to me, and the only real points of disagreement are entirely subjective (e.g. "X is acceptable IMO"). A few other places where I've found ArrayManager performs exceptionally well (in ways that I don't think that BlockManager can meaningfully optimize):
Assorted Notes:
@jorisvandenbossche @jbrockmendel This might be a weird place to ask about this, but given the various roadmap/design docs that mentioned the BlockManager refactor as part of the goals for pandas 2.0, and the wide release of pandas 2.0, is there any one place with a summary of where the project is at with respect to the original goals of moving to a 1d block manager? I am wondering what folks can expect if using Arrow-backed dtypes in pandas 2.0 as it relates to block manager behavior; the changelog for pandas 2.0 is very large and detailed but I couldn't quite find anything that spelled this out at the high level. E.g., I am curious to know the set of operations that are copy-free (no consolidation) when using pandas 2.0 with Arrow-backed dtypes.
Unless you actively opt in to ArrayManager usage (which hasn't seen much discussion in the last year and change), you won't be affected.
Automatic consolidation has been removed, so you won't be getting any copies on that front. For more copy-free-ness, I suggest trying out
The ArrayManager is now deprecated. Closing.
Related to the discussion in #10556, and following up on the mailing list discussion "A case for a simplified (non-consolidating) BlockManager with 1D blocks" (archive).
Initial proof of concept for a non-consolidating "ArrayManager" (storing the columns of a DataFrame as a list of 1D arrays instead of blocks) is merged in #36010.
This issue is meant to track the required follow-up work items to get this to a more feature-complete implementation.
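The storage difference described above (columns kept as a list of 1D arrays rather than grouped into blocks) can be sketched with two toy containers. This is only an illustration in plain Python: `ToyBlockStore` and `ToyArrayStore` are hypothetical names, they use Python lists instead of NumPy arrays, and they do not mirror the actual pandas internals or their API.

```python
from collections import defaultdict

# Hypothetical toy classes for illustration only; these are not the pandas
# internals and share no API with them.


class ToyBlockStore:
    """Consolidated layout: columns of the same dtype are grouped into one block."""

    def __init__(self, columns):
        # columns: dict mapping column name -> list of values
        self.blocks = defaultdict(list)  # dtype name -> list of column value lists
        self.locations = {}  # column name -> (dtype name, position inside block)
        for name, values in columns.items():
            key = type(values[0]).__name__
            self.locations[name] = (key, len(self.blocks[key]))
            self.blocks[key].append(list(values))

    def get_column(self, name):
        # Two-step lookup: find the block, then the row inside the block.
        key, pos = self.locations[name]
        return self.blocks[key][pos]


class ToyArrayStore:
    """Non-consolidating layout: one 1D array per column, kept in a flat list."""

    def __init__(self, columns):
        self.names = list(columns)
        self.arrays = [list(values) for values in columns.values()]

    def get_column(self, name):
        # Direct lookup: the position of the name is the position of the array.
        return self.arrays[self.names.index(name)]


data = {"a": [1, 2, 3], "b": [0.5, 1.5, 2.5], "c": [4, 5, 6]}
block_store = ToyBlockStore(data)
array_store = ToyArrayStore(data)

# Three columns, but "a" and "c" share the int block in the consolidated layout.
print(len(block_store.blocks), len(array_store.arrays))
```

In the toy block layout, adding or replacing a column means updating both the block and the location bookkeeping, while the array layout only touches one list entry; that bookkeeping is the kind of complexity the non-consolidating design aims to avoid.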
Functionality: get all tests passing
- `quantile` / `describe` related (ArrayManager.quantile is not yet implemented) -> ENH: ArrayManager.quantile #40189
- `equals` related (ArrayManager.equals is not yet implemented) -> [ArrayManager] Implement .equals method #39721
- `groupby` related tests (there are still a few parts of groupby that directly use the blocks) -> [ArrayManager] GroupBy cython aggregations (no fallback) #39885, [ArrayManager] Remaining GroupBy tests (fix count, pass on libreduction for now) #40050
- `concat` related (`internals/concat.py` only deals with the simple case when no reindexing is needed for ArrayManager at the moment; the full functionality (similar to what `concatenate_block_managers` / the `JoinUnits` now cover) still needs to be implemented) -> [ArrayManager] REF: Implement concat with reindexing #39612
- `setitem`, `iset`, `insert` are not yet fully implemented for all corner cases (+ get indexing tests passing)
- `replace`, `where`, `interpolate`, `shift`, `diff`, `downcast`, `putmask`, ... (those could all be refactored one at a time).
- Such tests can be skipped with e.g. `@td.skip_array_manager_invalid_test`
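As a rough illustration of how a skip decorator of this kind can work, here is a minimal sketch using only the standard library's `unittest`. The flag `USING_ARRAY_MANAGER` and the decorator name `skip_if_array_manager` are hypothetical stand-ins; this is not the pandas implementation of `td.skip_array_manager_invalid_test`, only the general pattern of skipping a test conditionally on the active data manager.

```python
import unittest

# Hypothetical flag standing in for "the ArrayManager is the active data manager".
USING_ARRAY_MANAGER = True

# Hypothetical stand-in for a decorator such as td.skip_array_manager_invalid_test:
# it skips tests that only make sense for block-based internals.
skip_if_array_manager = unittest.skipIf(
    USING_ARRAY_MANAGER,
    "Test relies on BlockManager-specific behaviour",
)


class TestInternals(unittest.TestCase):
    @skip_if_array_manager
    def test_block_specific(self):
        # Would inspect block internals; never runs while the flag is set.
        self.fail("should have been skipped")

    def test_generic(self):
        # Manager-agnostic test; always runs.
        self.assertEqual(sum([1, 2, 3]), 6)


def run_suite():
    """Run the test case programmatically and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestInternals)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_suite()
print(len(result.skipped), len(result.failures))
```

Because the condition is evaluated once when the decorator is created, the same test module can run unchanged under either manager, with only the manager-specific tests dropping out.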
Design questions:
Performance