[ArrayManager] REF: Implement concat with reindexing #39612
Conversation
pandas/core/dtypes/concat.py (Outdated)

def concat_arrays(to_concat):
    """
    Alternative for concat_compat but specialized for use in the ArrayManager.

    Differences: only deals with 1D arrays (no axis keyword) and does not skip
    empty arrays to determine the dtype.
    In addition ensures that all NullArrayProxies get replaced with actual
    arrays.
This concat_arrays is basically a copied + adapted version of concat_compat. I initially adapted concat_compat to cover both cases, but that resulted in quite a few special cases inside concat_compat (checking the proxy objects, which is not needed when concat_compat is used outside the ArrayManager, plus not skipping empties, for which I added a keyword). So in the end I decided to split the changes into a separate function, so both concat_compat and concat_arrays are more readable / easier to follow, although it certainly gives some duplication.
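To make the contrast concrete, here is a minimal self-contained sketch of the idea; the _NullProxy class, the concat_arrays_sketch name, and the NumPy-only dtype resolution are simplifications for illustration, not the actual pandas internals:

import numpy as np


class _NullProxy:
    """Hypothetical stand-in for pandas' internal NullArrayProxy:
    it records only the length of an all-NA column, no data."""

    def __init__(self, n: int):
        self.n = n


def concat_arrays_sketch(to_concat):
    # Unlike concat_compat, empty arrays are NOT skipped when determining
    # the result dtype; proxies carry no dtype, so they are excluded here.
    real = [np.asarray(x) for x in to_concat if not isinstance(x, _NullProxy)]
    target_dtype = np.result_type(*real) if real else np.dtype("float64")
    if any(isinstance(x, _NullProxy) for x in to_concat):
        # NaN needs a float dtype; pandas uses its richer find_common_type
        # and NA machinery instead of this simplification.
        target_dtype = np.promote_types(target_dtype, np.dtype("float64"))
    # Replace each proxy with an actual all-NA array of the target dtype.
    filled = [
        np.full(x.n, np.nan, dtype=target_dtype)
        if isinstance(x, _NullProxy)
        else np.asarray(x, dtype=target_dtype)
        for x in to_concat
    ]
    return np.concatenate(filled)


print(concat_arrays_sketch([np.array([1, 2]), _NullProxy(2)]))
# [ 1.  2. nan nan]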
pandas/core/dtypes/concat.py (Outdated)

    """
    if is_extension_array_dtype(dtype):
        return dtype.construct_array_type()._from_sequence(
            [dtype.na_value] * arr.n, dtype=dtype
We should maybe consider adding a method to the ExtensionArray interface to create an empty/all-NA array of the given dtype, to avoid this rather inefficient "construct from list of NA scalars".
+1 on this
IIRC this can break for dtypes that can't hold NA. The more robust version is:

cls = dtype.construct_array_type()
empty = cls._from_sequence([], dtype=dtype)
indexer = -np.ones(arr.n, dtype=np.intp)
return empty.take(indexer, allow_fill=True)

(In internals concat we use this in one place, and when I tried to change it to what you have here something broke; I don't remember what off the top of my head.)
Do we have an example dtype that cannot hold NAs?
something like Sparse[int]
Indeed:
In [5]: dtype = pd.SparseDtype("int")
In [6]: dtype
Out[6]: Sparse[int64, 0]
In [7]: dtype.construct_array_type()._from_sequence([dtype.na_value] * 5, dtype=dtype)
...
ValueError: Cannot convert non-finite values (NA or inf) to integer
Will use your suggested code for now
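For reference, the suggested take-based construction in action on the Sparse[int] example (a sketch; the length 5 is chosen arbitrarily to match the failing example above):

import numpy as np
import pandas as pd

dtype = pd.SparseDtype("int")
cls = dtype.construct_array_type()           # SparseArray
empty = cls._from_sequence([], dtype=dtype)
# -1 positions are filled with the dtype's NA value when allow_fill=True
indexer = -np.ones(5, dtype=np.intp)
result = empty.take(indexer, allow_fill=True)
# Succeeds and yields a length-5 all-NA array, whereas constructing from
# [dtype.na_value] * 5 raised ValueError above.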
I opened #39776 for the idea of adding a method to the EA interface for creating an empty/all-NA array from a given dtype.
pandas/core/dtypes/concat.py (Outdated)

@@ -132,6 +204,75 @@ def is_nonempty(x) -> bool:
        return np.concatenate(to_concat, axis=axis)


def concat_arrays(to_concat):
add type annotations if you can
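For illustration, the annotation could look something like this (the ArrayLike alias spelled out here is an assumption; pandas also ships one in pandas._typing):

from typing import List, Union

import numpy as np
from pandas.api.extensions import ExtensionArray

ArrayLike = Union[np.ndarray, ExtensionArray]


def concat_arrays(to_concat: List[ArrayLike]) -> ArrayLike:
    ...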
I pushed some updates. Given that the review comments are relatively minor points (but useful, to be clear), does that mean that you are OK with the general approach (of using the temporary proxy class)? This should be ready for another review round.
I ran the concat/join/merge benchmarks comparing ArrayManager with BlockManager (it's running the benchmarks on this branch with the default of using BlockManager ("before") vs this branch with one additional commit changing the default setting to ArrayManager ("after"), so green is slower and red is faster for the ArrayManager):

$ asv continuous -f 1.01 -b join_merge HEAD~1 HEAD
before after ratio
[555d7ac1] [d0c94a51]
<am-concat~1> <am-concat>
+ 40.4±3ms 93.3±0.3ms 2.31 join_merge.ConcatDataFrames.time_c_ordered(0, True)
+ 40.7±3ms 93.2±0.4ms 2.29 join_merge.ConcatDataFrames.time_c_ordered(0, False)
+ 44.2±5ms 93.4±0.2ms 2.11 join_merge.ConcatDataFrames.time_f_ordered(0, True)
+ 44.6±4ms 93.0±0.2ms 2.09 join_merge.ConcatDataFrames.time_f_ordered(0, False)
+ 304±0.8ms 309±0.7ms 1.02 join_merge.MergeCategoricals.time_merge_cat
- 191±2ms 187±0.9ms 0.98 join_merge.MergeAsof.time_by_int('forward', 5)
- 288±0.4ms 282±0.7ms 0.98 join_merge.MergeAsof.time_by_object('nearest', None)
- 289±0.2ms 282±0.1ms 0.98 join_merge.MergeAsof.time_by_object('nearest', 5)
- 18.5±0.06ms 18.0±0.04ms 0.97 join_merge.Merge.time_merge_dataframe_integer_2key(True)
- 206±0.4ms 199±0.4ms 0.97 join_merge.MergeAsof.time_by_object('forward', None)
- 206±0.5ms 199±0.04ms 0.96 join_merge.MergeAsof.time_by_object('forward', 5)
- 6.98±0.09ms 6.60±0.02ms 0.95 join_merge.Merge.time_merge_dataframe_integer_2key(False)
- 90.8±1ms 85.4±0.1ms 0.94 join_merge.MergeAsof.time_by_int('backward', 5)
- 312±3μs 293±1μs 0.94 join_merge.Concat.time_concat_empty_left(0)
- 312±3μs 292±2μs 0.94 join_merge.Concat.time_concat_empty_right(0)
- 102±0.2ms 95.0±0.3ms 0.93 join_merge.MergeAsof.time_by_object('backward', None)
- 102±0.1ms 95.2±0.3ms 0.93 join_merge.MergeAsof.time_by_object('backward', 5)
- 20.2±0.7ms 18.4±0.07ms 0.91 join_merge.Merge.time_merge_2intkey(False)
- 452±3μs 397±0.9μs 0.88 join_merge.Concat.time_concat_mixed_ndims(0)
- 28.2±0.4ms 24.6±0.1ms 0.87 join_merge.Join.time_join_dataframe_index_multi(True)
- 864±3ms 745±5ms 0.86 join_merge.I8Merge.time_i8merge('right')
- 814±3ms 695±10ms 0.85 join_merge.I8Merge.time_i8merge('left')
- 17.9±0.4ms 15.1±0.4ms 0.84 join_merge.Join.time_join_dataframe_index_shuffle_key_bigger_sort(True)
- 193±1μs 161±0.7μs 0.83 join_merge.Concat.time_concat_empty_right(1)
- 194±2μs 161±0.8μs 0.83 join_merge.Concat.time_concat_empty_left(1)
- 796±7μs 662±1μs 0.83 join_merge.Concat.time_concat_mixed_ndims(1)
- 807±10ms 670±5ms 0.83 join_merge.I8Merge.time_i8merge('outer')
- 3.21±0.01ms 2.65±0ms 0.83 join_merge.Merge.time_merge_dataframe_integer_key(True)
- 106±0.3ms 87.5±0.5ms 0.83 join_merge.MergeOrdered.time_merge_ordered
- 810±10ms 667±4ms 0.82 join_merge.I8Merge.time_i8merge('inner')
- 1.54±0.08s 1.26±0s 0.82 join_merge.JoinIndex.time_left_outer_join_index
- 15.1±0.2ms 12.2±0.3ms 0.80 join_merge.Join.time_join_dataframe_index_single_key_small(True)
- 2.74±0ms 2.17±0ms 0.79 join_merge.Merge.time_merge_dataframe_integer_key(False)
- 23.1±0.2ms 17.8±0.08ms 0.77 join_merge.Join.time_join_dataframe_index_multi(False)
- 23.5±0.04ms 18.0±0.03ms 0.77 join_merge.MergeAsof.time_on_uint64('nearest', 5)
- 22.8±0.05ms 17.3±0.07ms 0.76 join_merge.MergeAsof.time_on_uint64('nearest', None)
- 23.2±0.04ms 17.2±0.06ms 0.74 join_merge.MergeAsof.time_on_int('nearest', 5)
- 419±6μs 310±1μs 0.74 join_merge.Append.time_append_homogenous
- 23.0±0.09ms 16.9±0.03ms 0.73 join_merge.MergeAsof.time_on_int32('nearest', 5)
- 22.5±0.04ms 16.5±0.06ms 0.73 join_merge.MergeAsof.time_on_int('nearest', None)
- 22.5±0.07ms 16.4±0.03ms 0.73 join_merge.MergeAsof.time_on_int32('nearest', None)
- 18.8±0.03ms 13.4±0.05ms 0.71 join_merge.MergeAsof.time_on_uint64('forward', 5)
- 15.2±0.1ms 10.8±0.04ms 0.71 join_merge.Join.time_join_dataframe_index_shuffle_key_bigger_sort(False)
- 15.6±0.3ms 11.0±0.1ms 0.71 join_merge.Join.time_join_dataframe_index_single_key_bigger(False)
- 18.5±0.05ms 13.1±0.08ms 0.71 join_merge.MergeAsof.time_on_uint64('forward', None)
- 6.61±0.01ms 4.63±0.02ms 0.70 join_merge.Join.time_join_dataframes_cross(True)
- 17.4±0.04ms 12.0±0.04ms 0.69 join_merge.MergeAsof.time_on_uint64('backward', 5)
- 6.29±0.02ms 4.28±0.01ms 0.68 join_merge.Join.time_join_dataframes_cross(False)
- 17.0±0.07ms 11.5±0.05ms 0.68 join_merge.MergeAsof.time_on_uint64('backward', None)
- 18.7±0.06ms 12.6±0.03ms 0.68 join_merge.MergeAsof.time_on_int('forward', 5)
- 14.5±0.2ms 9.79±0.01ms 0.68 join_merge.Join.time_join_dataframe_index_single_key_small(False)
- 18.4±0.1ms 12.3±0.04ms 0.67 join_merge.MergeAsof.time_on_int32('forward', 5)
- 18.2±0.05ms 12.2±0.04ms 0.67 join_merge.MergeAsof.time_on_int('forward', None)
- 18.1±0.03ms 12.0±0.02ms 0.66 join_merge.MergeAsof.time_on_int32('forward', None)
- 17.6±0.05ms 11.6±0.04ms 0.66 join_merge.MergeAsof.time_on_int32('backward', 5)
- 17.4±0.04ms 11.3±0.03ms 0.65 join_merge.MergeAsof.time_on_int32('backward', None)
- 17.1±0.02ms 11.0±0.06ms 0.64 join_merge.MergeAsof.time_on_int('backward', 5)
- 16.8±0.07ms 10.8±0.04ms 0.64 join_merge.MergeAsof.time_on_int('backward', None)
- 782±0.5ms 491±2ms 0.63 join_merge.MergeCategoricals.time_merge_object
- 977±2μs 596±2μs 0.61 join_merge.Append.time_append_mixed
- 20.9±0.07ms 12.5±0.03ms 0.60 join_merge.Concat.time_concat_small_frames(0)
- 14.5±0.06ms 8.33±0.03ms 0.57 join_merge.Concat.time_concat_small_frames(1)
- 931±3ms 330±3ms 0.35 join_merge.Merge.time_merge_dataframes_cross(True)
- 931±7ms 330±2ms 0.35 join_merge.Merge.time_merge_dataframes_cross(False)
- 58.4±1ms 318±1μs 0.01 join_merge.ConcatDataFrames.time_f_ordered(1, False)
- 55.9±4ms 264±2μs 0.00 join_merge.ConcatDataFrames.time_f_ordered(1, True)
- 91.2±50ms 317±1μs 0.00 join_merge.ConcatDataFrames.time_c_ordered(1, False)
- 90.3±30ms 265±2μs 0.00 join_merge.ConcatDataFrames.time_c_ordered(1, True)
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

The results are quite stable on my benchmarking setup (I ran the above twice and got almost the same results, and within a run the different parametrizations are also quite consistent), so I think they give a good picture. Some general observations: concatting DataFrames along axis=0 is roughly 2x slower with the ArrayManager (the ConcatDataFrames(0, ...) cases at the top), concatting along axis=1 is orders of magnitude faster (the ConcatDataFrames(1, ...) cases at the bottom), and most merge/join benchmarks are somewhat faster. Generally I am quite positive about the results.
More comments on this?

If there are no more comments, I am planning to merge this.
    mgrs.append(mgr)

    if concat_axis == 1:
        # concatting along the rows -> concat the reindexed arrays
        # TODO(ArrayManager) doesn't yet preserve the correct dtype
        arrays = [
-           concat_compat([mgrs[i].arrays[j] for i in range(len(mgrs))])
+           concat_arrays([mgrs[i].arrays[j] for i in range(len(mgrs))])
can you now remove concat_compat?
And reading your comment below, why is this not in the array manager if it's only used there?
> can you now remove concat_compat?

concat_compat is used in several other places as well.

> why is this not in the array manager if it's only used there?

Because this is the code for concatting managers, which for the BlockManager also resides in internals/concat.py?
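For context, the surrounding loop essentially does a per-column concatenation over the reindexed managers; a simplified, hypothetical rendering (the function name is illustrative, and the real code calls concat_arrays instead of np.concatenate so that extension arrays and NA proxies are handled):

import numpy as np


def concat_managers_axis1_sketch(mgrs_arrays):
    # mgrs_arrays: for each manager, its list of 1D column arrays,
    # all already reindexed to the same set of columns.
    n_cols = len(mgrs_arrays[0])
    return [
        np.concatenate([arrays[j] for arrays in mgrs_arrays])
        for j in range(n_cols)
    ]


# two managers with two columns each -> two concatenated columns
cols = concat_managers_axis1_sketch(
    [[np.array([1, 2]), np.array([3.0, 4.0])],
     [np.array([5, 6]), np.array([7.0, 8.0])]]
)
# cols[0] -> [1 2 5 6], cols[1] -> [3. 4. 7. 8.]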
Again, please do not. I've given up asking on things that are Sufficiently Benign, but this is not in that category.
Brock, I didn't merge this without asking. I know this is a larger PR that I can't just merge, and that's exactly the reason I asked above, so you could reply before I would actually merge (which you did now ;)). But to the point: can you indicate whether you are still planning to review in more detail or still have concerns? Your last comments were all minor things, which gives me the impression that you no longer do.
Looking at JoinUnit, it does have access to the block's values and dtype (each used twice). So I think we can adapt existing code (and, as a result, behavior) to be shared rather than add a bunch of new code.
It indeed accesses those twice, but the rest of the function further processes them .. AFAIK this whole function is fully BlockManager-specific (the return value also includes BlockPlacement, so it is BlockManager-specific).

It's probably indeed possible to make JoinUnit directly hold the values instead of the Block. But what's your concrete idea here? That it would be used instead of the NullProxy? Personally, I think the implementation I made right now for the ArrayManager is much cleaner and simpler to follow. And my impression is that @jbrockmendel agrees with that ("If NullArrayProxy can be used as an alternative to JoinUnit/ConcatenationPlan that'd be great").

So as a summary, IMO it's worth having this temporary duplicated implementation for the ArrayManager (just as other parts of the ArrayManager also duplicate parts of the BlockManager), as it allows keeping the code for the ArrayManager simpler (which is the code we intend to keep long-term).
pandas/core/internals/concat.py (Outdated)

        else:
            target_dtype = to_concat_no_proxy[0].dtype
    elif not single_dtype:
        if any(kind in ["m", "M"] for kind in kinds):
why can't we use find_common_type here?
Good question. So, can we think of any combination of dtypes where np.find_common_type(..) (which would get used by our find_common_type in this case) would not result in object dtype if the input contains datetime or timedelta? The most obvious case I could think of is with integers, and numpy also gives object dtype there:

In [20]: np.find_common_type([np.dtype("datetime64[ns]"), np.dtype("int64")], [])
Out[20]: dtype('O')

The reason I handled it separately here is that in dtypes/concat.py we also handle datetime-likes separately, and in _concat_datetime we also "hardcode" object dtype in case of multiple input dtypes. But if find_common_type always gives object dtype for mixed datetimelike input, this can indeed be simplified here. Will try that and see what the tests say.
@jbrockmendel so I was indeed able to simplify this; the if any_ea check was also not needed, since find_common_type handles all cases.
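A quick check of the property this relies on (find_common_type lives in pandas.core.dtypes.cast as of this PR):

import numpy as np
from pandas.core.dtypes.cast import find_common_type

# Mixed datetimelike + other input falls back to object dtype,
# so no separate "m"/"M" branch is needed.
print(find_common_type([np.dtype("datetime64[ns]"), np.dtype("int64")]))
# -> object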
I am OK with temporary duplication of things. This is a pretty common pattern: first make things work, then move towards removing the duplication in steps.
Agreed. I'd like to see the find_common_type thing resolved first, but won't hold this up based on that.
Updated this PR, and resolved the find_common_type comment. The py37_32bit failing test seems to be a flaky hypothesis one.
Any more comments?
https://github.com/pandas-dev/pandas/pull/39612/files#r602641371, but it doesn't need to be addressed here. LGTM
lgtm

LGTM, might do another rebase to be on the safe side
Sorry, I missed that one earlier (GitHub's UI is not fantastic when replying to an older comment, as it doesn't show up in the recent comments in the timeline ..). I opened #40893 to track this inconsistency.
xref #39146 (the concat work item)

This PR implements concat for DataFrames using the ArrayManager, and all tests in pandas/tests/reshape/ already pass.

Summary of the changes:

- In internals/concat.py, I added a concatenate_block_managers equivalent for ArrayManagers. This function rather simply reindexes the managers and then concats the arrays (axis=0) or combines them in a single manager (axis=1).
- The main complexity is in the reindexing step: it can introduce all-NA columns, whose resulting dtype depends on the dtypes of the other to-be-concatted arrays, following the find_common_type logic (illustrated at the end of this description).
- To handle this, such an all-NA column is first represented with a temporary proxy object (the NullArrayProxy class), only tracking its length. And when actually concatting, we can replace this with an actual all-NA array after we determined the resulting dtype. This proxy object only lives inside the concatting code (it only gets created when we will call concat_compat, and inside this function they always get replaced), so the user will never see this.
- An alternative would be to determine the target dtype up front in internals/concat.py so we can pass this to the reindex functionality, and do the actual concatenation after reindexing. But IMO that would be more complex; the concat / find_common_type logic gets involved twice (before and after reindexing), and getting those dtypes for which we want to know the common type is also quite tricky before actually reindexing the manager.

Basically, this replaces the 600 lines of code of internals/concat.py with 25 lines of concatenate_array_managers + the NullArrayProxy and _array_from_proxy code in dtypes/concat.py (but I am quite likely still missing some corner cases!).

Note: I did not explicitly test copy=True/False behaviour for concat along the columns with this new code.

cc @jreback @jbrockmendel
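A quick illustration (public API, independent of which manager is active) of the reindexing behaviour the NullArrayProxy exists for:

import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5, 6]})   # no column "b"

out = pd.concat([df1, df2])
# df2's missing "b" rows become NaN, and the common dtype for
# int64 + all-NA is float64.
print(out["b"].dtype)   # float64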