
[ArrayManager] REF: Implement concat with reindexing #39612

Merged
34 commits merged into pandas-dev:master from am-concat on Apr 12, 2021

Conversation

@jorisvandenbossche (Member) commented Feb 5, 2021

xref #39146 (the concat work item)

This PR implements concat for DataFrames using the ArrayManager, and all tests in pandas/tests/reshape/ already pass.

Summary of the changes:

  • In internals/concat.py, I added a concatenate_block_managers equivalent for ArrayManagers. This function simply reindexes the managers and then either concatenates the arrays (axis=0) or combines them into a single manager (axis=1).
  • The main problem with that simple approach is that, when reindexing the ArrayManager with the given indexer, we do not yet know the resulting dtype when a new all-NA column needs to be added (i.e. the common dtype we would find when actually concatting). And we can't make this, e.g., an all-NaN object array, as that would influence the find_common_type logic.
  • The way I solve this at the moment is by adding a dummy "proxy" object to signal such an all-NA array (the NullArrayProxy class), which only tracks its length. When actually concatting, we can replace it with an actual all-NA array once the resulting dtype has been determined (see the sketch after this list). This proxy object only lives inside the concatting code (it only gets created when we will call concat_compat, and inside this function the proxies always get replaced), so the user will never see it.
  • An alternative would be to first determine the resulting dtype in internals/concat.py, so we can pass it to the reindex functionality, and do the actual concatenation after reindexing. But IMO that would be more complex: the concat / find_common_type logic gets involved twice (before and after reindexing), and getting the dtypes for which we want to know the common type is also quite tricky before actually reindexing the manager.
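To make the proxy idea concrete, here is a minimal sketch (my own simplification, not the code from the PR: the real NullArrayProxy and _array_from_proxy in dtypes/concat.py also handle extension and datetime-like dtypes, while this version only covers float dtypes):

    import numpy as np

    class NullArrayProxy:
        """Placeholder for an all-NA column whose dtype is not yet known."""

        ndim = 1

        def __init__(self, n: int) -> None:
            self.n = n  # only the length is tracked

    def to_array(proxy: NullArrayProxy, dtype: np.dtype) -> np.ndarray:
        # once the concat logic has determined the common dtype,
        # materialize the proxy as an actual all-NA array of that dtype
        arr = np.empty(proxy.n, dtype=dtype)
        arr.fill(np.nan)
        return arr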

Basically, this replaces the 600 lines of code of internals/concat.py with 25 lines of concatenate_array_managers, plus the NullArrayProxy and _array_from_proxy code in dtypes/concat.py (but I am quite likely still missing some corner cases!).
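For a feel of how small that path is, here is a toy, runnable version of the row-wise case (an illustration under simplifying assumptions, not the actual pandas code: reindexing and proxy handling are elided, and each "manager" is just a list of 1D column arrays):

    import numpy as np

    def concat_managers_rowwise(mgrs):
        # one concatenate call per column, instead of one call per 2D block
        n_columns = len(mgrs[0])
        return [np.concatenate([mgr[j] for mgr in mgrs]) for j in range(n_columns)]

    left = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]  # 2 columns, 2 rows
    right = [np.array([5.0]), np.array([6.0])]           # 2 columns, 1 row
    print(concat_managers_rowwise([left, right]))
    # [array([1., 2., 5.]), array([3., 4., 6.])]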

Note: I did not explicitly test copy=True/False behaviour for concat along the columns with this new code.

cc @jreback @jbrockmendel

@jorisvandenbossche added the Refactor (internal refactoring of code), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Internals (related to non-user-accessible pandas implementation) labels on Feb 5, 2021
Comment on lines 207 to 214:

    def concat_arrays(to_concat):
        """
        Alternative for concat_compat but specialized for use in the ArrayManager.

        Differences: only deals with 1D arrays (no axis keyword) and does not skip
        empty arrays to determine the dtype.
        In addition ensures that all NullArrayProxies get replaced with actual
        arrays.
        """

jorisvandenbossche (Member Author):

This concat_arrays is basically a copied and adapted version of concat_compat. I initially adapted concat_compat to cover both cases.

But that resulted in quite a few special cases inside concat_compat (checking for the proxy objects, which is not needed when concat_compat is used outside the ArrayManager, plus not skipping empties, for which I added a keyword). So in the end I decided to split the changes into a separate function, so that both concat_compat and concat_arrays are more readable and easier to follow, even though it gives some duplication.
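As an illustration of what the separate function does with the proxies, a minimal sketch (my own simplification based on the description above, reusing the NullArrayProxy/to_array names from the sketch in the PR description; it only covers plain numpy dtypes, while the real concat_arrays also runs the full find_common_type logic):

    import numpy as np

    def concat_arrays_sketch(to_concat):
        # determine the target dtype from the real (non-proxy) arrays only,
        # so the placeholders cannot influence the common-dtype resolution
        to_concat_no_proxy = [x for x in to_concat if not isinstance(x, NullArrayProxy)]
        target_dtype = np.result_type(*(x.dtype for x in to_concat_no_proxy))
        # replace every proxy with an actual all-NA array of the target dtype
        to_concat = [
            to_array(x, target_dtype) if isinstance(x, NullArrayProxy) else x
            for x in to_concat
        ]
        return np.concatenate(to_concat)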

"""
if is_extension_array_dtype(dtype):
return dtype.construct_array_type()._from_sequence(
[dtype.na_value] * arr.n, dtype=dtype
jorisvandenbossche (Member Author):

We should maybe consider adding a method to the ExtensionArray interface to create an empty/all-NA array of the given dtype, to avoid this rather inefficient "construct from list of NA scalars".

Contributor:

+1 on this

Member:

IIRC this can break for dtypes that can't hold NA. The more robust version is:

    cls = dtype.construct_array_type()
    empty = cls._from_sequence([], dtype=dtype)
    indexer = -np.ones(arr.n, dtype=np.intp)
    # taking at -1 with allow_fill=True fills those positions with dtype.na_value
    return empty.take(indexer, allow_fill=True)

(In internals concat we use this in one place, and when I tried to change it to what you have here something broke; I don't remember what off the top of my head.)

jorisvandenbossche (Member Author):

Do we have an example dtype that cannot hold NAs?

Member:

something like Sparse[int]

jorisvandenbossche (Member Author):

Indeed:

In [5]: dtype = pd.SparseDtype("int")

In [6]: dtype
Out[6]: Sparse[int64, 0]

In [7]: dtype.construct_array_type()._from_sequence([dtype.na_value] * 5, dtype=dtype)
...
ValueError: Cannot convert non-finite values (NA or inf) to integer

Will use your suggested code for now.

jorisvandenbossche (Member Author):

I opened #39776 for the idea of adding a method to the EA interface for creating an empty/all-NA array from a given dtype.

@jorisvandenbossche changed the title from "[ArrayManager] Implement concat with reindexing" to "[ArrayManager] REF: Implement concat with reindexing" on Feb 5, 2021
"""
if is_extension_array_dtype(dtype):
return dtype.construct_array_type()._from_sequence(
[dtype.na_value] * arr.n, dtype=dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this

Comment on pandas/core/dtypes/concat.py (@@ -132,6 +204,75 @@):

        return np.concatenate(to_concat, axis=axis)


    def concat_arrays(to_concat):

Contributor:

Add types if you can.

@jorisvandenbossche (Member Author)

I pushed some updates.

Given that the review comments are relatively minor points (but useful! to be clear), does that mean that you are OK with the general approach (of using the temporary proxy class)?

This should be ready for another review round.

@jorisvandenbossche (Member Author)

I ran the concat/join/merge benchmarks comparing ArrayManager with BlockManager (this runs the benchmarks on this branch with the default of using BlockManager ("before") vs this branch with one additional commit changing the default setting to ArrayManager ("after"), so green is slower and red is faster for the ArrayManager):

$ asv continuous -f 1.01 -b join_merge HEAD~1 HEAD

       before           after         ratio
     [555d7ac1]       [d0c94a51]
     <am-concat~1>       <am-concat>
+        40.4±3ms       93.3±0.3ms     2.31  join_merge.ConcatDataFrames.time_c_ordered(0, True)
+        40.7±3ms       93.2±0.4ms     2.29  join_merge.ConcatDataFrames.time_c_ordered(0, False)
+        44.2±5ms       93.4±0.2ms     2.11  join_merge.ConcatDataFrames.time_f_ordered(0, True)
+        44.6±4ms       93.0±0.2ms     2.09  join_merge.ConcatDataFrames.time_f_ordered(0, False)
+       304±0.8ms        309±0.7ms     1.02  join_merge.MergeCategoricals.time_merge_cat
-         191±2ms        187±0.9ms     0.98  join_merge.MergeAsof.time_by_int('forward', 5)
-       288±0.4ms        282±0.7ms     0.98  join_merge.MergeAsof.time_by_object('nearest', None)
-       289±0.2ms        282±0.1ms     0.98  join_merge.MergeAsof.time_by_object('nearest', 5)
-     18.5±0.06ms      18.0±0.04ms     0.97  join_merge.Merge.time_merge_dataframe_integer_2key(True)
-       206±0.4ms        199±0.4ms     0.97  join_merge.MergeAsof.time_by_object('forward', None)
-       206±0.5ms       199±0.04ms     0.96  join_merge.MergeAsof.time_by_object('forward', 5)
-     6.98±0.09ms      6.60±0.02ms     0.95  join_merge.Merge.time_merge_dataframe_integer_2key(False)
-        90.8±1ms       85.4±0.1ms     0.94  join_merge.MergeAsof.time_by_int('backward', 5)
-         312±3μs          293±1μs     0.94  join_merge.Concat.time_concat_empty_left(0)
-         312±3μs          292±2μs     0.94  join_merge.Concat.time_concat_empty_right(0)
-       102±0.2ms       95.0±0.3ms     0.93  join_merge.MergeAsof.time_by_object('backward', None)
-       102±0.1ms       95.2±0.3ms     0.93  join_merge.MergeAsof.time_by_object('backward', 5)
-      20.2±0.7ms      18.4±0.07ms     0.91  join_merge.Merge.time_merge_2intkey(False)
-         452±3μs        397±0.9μs     0.88  join_merge.Concat.time_concat_mixed_ndims(0)
-      28.2±0.4ms       24.6±0.1ms     0.87  join_merge.Join.time_join_dataframe_index_multi(True)
-         864±3ms          745±5ms     0.86  join_merge.I8Merge.time_i8merge('right')
-         814±3ms         695±10ms     0.85  join_merge.I8Merge.time_i8merge('left')
-      17.9±0.4ms       15.1±0.4ms     0.84  join_merge.Join.time_join_dataframe_index_shuffle_key_bigger_sort(True)
-         193±1μs        161±0.7μs     0.83  join_merge.Concat.time_concat_empty_right(1)
-         194±2μs        161±0.8μs     0.83  join_merge.Concat.time_concat_empty_left(1)
-         796±7μs          662±1μs     0.83  join_merge.Concat.time_concat_mixed_ndims(1)
-        807±10ms          670±5ms     0.83  join_merge.I8Merge.time_i8merge('outer')
-     3.21±0.01ms         2.65±0ms     0.83  join_merge.Merge.time_merge_dataframe_integer_key(True)
-       106±0.3ms       87.5±0.5ms     0.83  join_merge.MergeOrdered.time_merge_ordered
-        810±10ms          667±4ms     0.82  join_merge.I8Merge.time_i8merge('inner')
-      1.54±0.08s          1.26±0s     0.82  join_merge.JoinIndex.time_left_outer_join_index
-      15.1±0.2ms       12.2±0.3ms     0.80  join_merge.Join.time_join_dataframe_index_single_key_small(True)
-        2.74±0ms         2.17±0ms     0.79  join_merge.Merge.time_merge_dataframe_integer_key(False)
-      23.1±0.2ms      17.8±0.08ms     0.77  join_merge.Join.time_join_dataframe_index_multi(False)
-     23.5±0.04ms      18.0±0.03ms     0.77  join_merge.MergeAsof.time_on_uint64('nearest', 5)
-     22.8±0.05ms      17.3±0.07ms     0.76  join_merge.MergeAsof.time_on_uint64('nearest', None)
-     23.2±0.04ms      17.2±0.06ms     0.74  join_merge.MergeAsof.time_on_int('nearest', 5)
-         419±6μs          310±1μs     0.74  join_merge.Append.time_append_homogenous
-     23.0±0.09ms      16.9±0.03ms     0.73  join_merge.MergeAsof.time_on_int32('nearest', 5)
-     22.5±0.04ms      16.5±0.06ms     0.73  join_merge.MergeAsof.time_on_int('nearest', None)
-     22.5±0.07ms      16.4±0.03ms     0.73  join_merge.MergeAsof.time_on_int32('nearest', None)
-     18.8±0.03ms      13.4±0.05ms     0.71  join_merge.MergeAsof.time_on_uint64('forward', 5)
-      15.2±0.1ms      10.8±0.04ms     0.71  join_merge.Join.time_join_dataframe_index_shuffle_key_bigger_sort(False)
-      15.6±0.3ms       11.0±0.1ms     0.71  join_merge.Join.time_join_dataframe_index_single_key_bigger(False)
-     18.5±0.05ms      13.1±0.08ms     0.71  join_merge.MergeAsof.time_on_uint64('forward', None)
-     6.61±0.01ms      4.63±0.02ms     0.70  join_merge.Join.time_join_dataframes_cross(True)
-     17.4±0.04ms      12.0±0.04ms     0.69  join_merge.MergeAsof.time_on_uint64('backward', 5)
-     6.29±0.02ms      4.28±0.01ms     0.68  join_merge.Join.time_join_dataframes_cross(False)
-     17.0±0.07ms      11.5±0.05ms     0.68  join_merge.MergeAsof.time_on_uint64('backward', None)
-     18.7±0.06ms      12.6±0.03ms     0.68  join_merge.MergeAsof.time_on_int('forward', 5)
-      14.5±0.2ms      9.79±0.01ms     0.68  join_merge.Join.time_join_dataframe_index_single_key_small(False)
-      18.4±0.1ms      12.3±0.04ms     0.67  join_merge.MergeAsof.time_on_int32('forward', 5)
-     18.2±0.05ms      12.2±0.04ms     0.67  join_merge.MergeAsof.time_on_int('forward', None)
-     18.1±0.03ms      12.0±0.02ms     0.66  join_merge.MergeAsof.time_on_int32('forward', None)
-     17.6±0.05ms      11.6±0.04ms     0.66  join_merge.MergeAsof.time_on_int32('backward', 5)
-     17.4±0.04ms      11.3±0.03ms     0.65  join_merge.MergeAsof.time_on_int32('backward', None)
-     17.1±0.02ms      11.0±0.06ms     0.64  join_merge.MergeAsof.time_on_int('backward', 5)
-     16.8±0.07ms      10.8±0.04ms     0.64  join_merge.MergeAsof.time_on_int('backward', None)
-       782±0.5ms          491±2ms     0.63  join_merge.MergeCategoricals.time_merge_object
-         977±2μs          596±2μs     0.61  join_merge.Append.time_append_mixed
-     20.9±0.07ms      12.5±0.03ms     0.60  join_merge.Concat.time_concat_small_frames(0)
-     14.5±0.06ms      8.33±0.03ms     0.57  join_merge.Concat.time_concat_small_frames(1)
-         931±3ms          330±3ms     0.35  join_merge.Merge.time_merge_dataframes_cross(True)
-         931±7ms          330±2ms     0.35  join_merge.Merge.time_merge_dataframes_cross(False)
-        58.4±1ms          318±1μs     0.01  join_merge.ConcatDataFrames.time_f_ordered(1, False)
-        55.9±4ms          264±2μs     0.00  join_merge.ConcatDataFrames.time_f_ordered(1, True)
-       91.2±50ms          317±1μs     0.00  join_merge.ConcatDataFrames.time_c_ordered(1, False)
-       90.3±30ms          265±2μs     0.00  join_merge.ConcatDataFrames.time_c_ordered(1, True)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

The results are quite stable on my benchmarking setup (I ran the above twice and got almost the same results, and within a run the different parametrizations are also quite consistent), so I think they give a good picture. Some general observations:

  • All benchmarks are on par or faster using ArrayManager, except for the concat(.., axis=0) cases.
  • Many of the join/merge/merge_asof cases show some speedup, and I expect that #39692 (PERF: optimize algos.take for repeated calls) will give a further improvement.
  • Some of the concat(.., axis=1) benchmarks show a huge difference (a ratio of "0.00" for after), but this is largely explained by copying: the BlockManager-based method copies in this no-reindexing case, while the current ArrayManager version doesn't do that (yet). If I add a copy, it becomes a more modest speedup of 2-3x (instead of >100x).
  • The concat(.., axis=0) case that shows a 2x slowdown with ArrayManager is the expected case, I think. That specific benchmark uses a DataFrame with a single float dtype (a single block), so concatenating 20 2D arrays is always going to be faster than concatenating 20 1D arrays 200 times (the benchmark case has 200 columns and does pd.concat([df] * 20, axis=0)). From profiling the operation, almost all time is spent in the actual numpy concatenation routine (in both cases), so I don't expect there is much room for improvement here; see the illustration after this list.
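To illustrate that last point (this mirrors the shape of the benchmark, not the asv code itself; the sizes are made up):

    import numpy as np

    n_rows, n_cols, n_frames = 1_000, 200, 20

    # BlockManager: one float block per frame -> a single 2D concatenate
    blocks = [np.random.randn(n_rows, n_cols) for _ in range(n_frames)]
    result_2d = np.concatenate(blocks, axis=0)

    # ArrayManager: one array per column -> n_cols separate 1D concatenates
    # (the column slices are non-contiguous views, which adds to the cost)
    columns = [[block[:, j] for block in blocks] for j in range(n_cols)]
    result_1d = [np.concatenate(cols) for cols in columns]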

Generally I am quite positive about the results.

@jorisvandenbossche (Member Author)

More comments on this?

@jorisvandenbossche (Member Author)

If there are no more comments, I am planning to merge this.

@jorisvandenbossche mentioned this pull request on Mar 24, 2021
Comment on pandas/core/internals/concat.py:

        mgrs.append(mgr)

    if concat_axis == 1:
        # concatting along the rows -> concat the reindexed arrays
        # TODO(ArrayManager) doesn't yet preserve the correct dtype
        arrays = [
-           concat_compat([mgrs[i].arrays[j] for i in range(len(mgrs))])
+           concat_arrays([mgrs[i].arrays[j] for i in range(len(mgrs))])
Contributor:

can you now remove concat_compat?

Contributor:

And reading your comment below: why is this not in the array manager if it's only used there?

jorisvandenbossche (Member Author):

> can you now remove concat_compat?

concat_compat is used in several other places as well.

> why is this not in the array manager if it's only used there?

Because this is the code for concatting managers, which for BlockManager also resides in internals/concat.py?

@jbrockmendel (Member)

> If there are no more comments, I am planning to merge this.

Again, please do not. I've given up asking on things that are Sufficiently Benign, but this is not in that category.

@jorisvandenbossche (Member Author)

> Again, please do not. I've given up asking on things that are Sufficiently Benign, but this is not in that category.

Brock, I didn't merge this without asking. I know this is a larger PR that I can't just merge, and that's exactly why I asked above, so you could reply before I would actually merge (which you did now ;)).
(Maybe my formulation was not that good; I just wanted to trigger a "review again or indicate it's good".)

But to the point: can you indicate whether you are still planning to review in more detail, or whether you still have concerns? Your last comments were all minor points, which gives me the impression that you no longer have major concerns.

@jbrockmendel (Member)

Looking at JoinUnit, it does have self.block, but it wouldn't be too hard to have it take block.values instead, at which point it would be BlockManager/ArrayManager-agnostic. _get_mgr_concatenation_plan accesses mgr.blklocs and mgr.blknos twice each, but I think the ArrayManager analogues of those are np.ones(len(self.arrays)) and np.arange(len(self.arrays)).

So I think we can adapt the existing code (and, as a result, its behavior) to be shared, rather than add a bunch of new code.

@jorisvandenbossche (Member Author) commented Mar 26, 2021

> _get_mgr_concatenation_plan accesses mgr.blklocs and mgr.blknos twice each

It indeed accesses those twice, but the rest of the function then further processes them. AFAIK this whole function is fully BlockManager-specific (the return value also includes BlockPlacement, which is BlockManager-specific).

> Looking at JoinUnit, it does have self.block, but it wouldn't be too hard to have it take block.values instead, at which point it would be BlockManager/ArrayManager-agnostic.

It's probably indeed possible to make JoinUnit directly hold the values instead of the Block. But what's your concrete idea here? That it would be used instead of the NullArrayProxy?

Personally, I think the implementation I made for ArrayManager is much cleaner and simpler to follow. And my impression is that @jbrockmendel agrees with that ("If NullArrayProxy can be used as an alternative to JoinUnit/ConcatenationPlan that'd be great").
If we don't want the temporary duplication between ArrayManager and BlockManager, then there are two options:

  • I ditch my new implementation and rewrite it using bits and pieces of the current BlockManager implementation. Personally, I would not like to do that. The new version is simpler (also because the required logic for ArrayManager is inherently less complex, as it doesn't need to deal with possibly unaligned blocks that get concatenated), and adding logic for the ArrayManager to the current implementation will only make an already complex piece of code more complex (and at the point where we move to ArrayManager as the default, we would want this simpler version anyway).

  • We refactor the whole of concatenation_plan / JoinUnit based on my NullArrayProxy idea. In principle, I think this would be a good idea, as it could mean an improvement of the current implementation.
    However, this is quite a big refactor, and I would argue that it is 1) not a short-term priority for getting the ArrayManager feature-complete (to be able to do a full evaluation / comparison with the BlockManager), and 2) given that (assuming we decide to go with ArrayManager) the BlockManager is not intended to be actively developed for a long time (e.g. only 1-2 years), a big refactor might not be worth it, as any refactor risks introducing regressions (for long-lived code that risk is typically worth it, though). Even if we want to do this, I think it can be a follow-up to this PR.

So as a summary, IMO it's worth having this temporarily duplicated implementation for ArrayManager (just as other parts of ArrayManager also duplicate parts of the BlockManager), as it allows keeping the code for ArrayManager simpler (which is the code we intend to keep long-term).

Comment on pandas/core/dtypes/concat.py:

            else:
                target_dtype = to_concat_no_proxy[0].dtype
        elif not single_dtype:
            if any(kind in ["m", "M"] for kind in kinds):
Member:

Why can't we use find_common_type here?

jorisvandenbossche (Member Author):

Good question. So, can we think of any combination of dtypes where np.find_common_type(..) (which would be used by our find_common_type in this case) would not result in object dtype if the input contains datetime or timedelta?

The most obvious case I could think of is mixing in integers, and numpy also gives object dtype there:

In [20]: np.find_common_type([np.dtype("datetime64[ns]"), np.dtype("int64")], [])
Out[20]: dtype('O')

The reason I handled it separately here is that in dtypes/concat.py we also handle datetime-likes separately, and in _concat_datetime we also "hardcode" object dtype in case of multiple input dtypes.
But if find_common_type always gives object dtype for mixed datetime-like input, this can indeed be simplified here. Will try that and see what the tests say.

jorisvandenbossche (Member Author):

@jbrockmendel so I was indeed able to simplify this; the if any_ea check was also not needed, since find_common_type handles all cases.

@jreback (Contributor) commented Mar 29, 2021

I am OK with temporary duplication of things. This is a pretty common pattern: first make things work, then move towards removing the duplication in steps.

@jbrockmendel (Member)

Agreed. I'd like to see the find_common_type thing resolved first, but won't hold this up based on that.

@jorisvandenbossche (Member Author)

Updated this PR, and resolved the find_common_type comment. The py37_32bit failing test seems to be a flaky hypothesis one.

@jorisvandenbossche (Member Author)

Any more comments?

@jbrockmendel (Member)

> Any more comments?

https://github.com/pandas-dev/pandas/pull/39612/files#r602641371, but it doesn't need to be addressed here. LGTM

@jreback (Contributor) commented Apr 2, 2021

lgtm

@jbrockmendel (Member)

LGTM, might do another rebase to be on the safe side

@jorisvandenbossche (Member Author)

> https://github.com/pandas-dev/pandas/pull/39612/files#r602641371, but it doesn't need to be addressed here. LGTM

Sorry, I missed that one earlier (GitHub's UI is not great when replying to an older comment, as it doesn't show up in the recent comments in the timeline...). I opened #40893 to track this inconsistency.

@jorisvandenbossche merged commit f0c4093 into pandas-dev:master on Apr 12, 2021
@jorisvandenbossche deleted the am-concat branch on April 12, 2021 at 10:56
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021