
[ArrayManager] REF: Implement concat with reindexing #39612

Merged Apr 12, 2021 (34 commits; the diff below shows the changes from 7 commits)

Commits:
6cd1c4d
[ArrayManager] Implement concat with reindexing
jorisvandenbossche Feb 5, 2021
73d9de2
fix mypy
jorisvandenbossche Feb 5, 2021
272d674
pass through allow dups
jorisvandenbossche Feb 5, 2021
555d7ac
simplify _array_from_proxy
jorisvandenbossche Feb 5, 2021
ebad8a4
Merge remote-tracking branch 'upstream/master' into am-concat
jorisvandenbossche Feb 8, 2021
ee495e5
Merge remote-tracking branch 'upstream/master' into am-concat
jorisvandenbossche Feb 9, 2021
19c7f75
fix creation empty + turn into method
jorisvandenbossche Feb 9, 2021
42e1b05
remove overriding of fill_value
jorisvandenbossche Feb 10, 2021
724be3e
Merge remote-tracking branch 'upstream/master' into am-concat
jorisvandenbossche Feb 10, 2021
db3f0ed
Merge remote-tracking branch 'upstream/master' into am-concat
jorisvandenbossche Feb 12, 2021
a2aa388
use ensure_dtype_can_hold_na
jorisvandenbossche Feb 12, 2021
6bdd175
add type annotation
jorisvandenbossche Feb 12, 2021
910e1fe
Merge remote-tracking branch 'upstream/master' into am-concat
jorisvandenbossche Feb 15, 2021
cab90f6
address review
jorisvandenbossche Feb 15, 2021
c22a010
update comment
jorisvandenbossche Feb 15, 2021
eec0161
fixup test
jorisvandenbossche Feb 15, 2021
6c69869
Merge remote-tracking branch 'upstream/master' into am-concat
jorisvandenbossche Mar 18, 2021
04ead63
update/remove skips
jorisvandenbossche Mar 18, 2021
427b6f4
move logic into internals
jorisvandenbossche Mar 18, 2021
8c10a53
fix typing
jorisvandenbossche Mar 18, 2021
ec5bd11
Merge remote-tracking branch 'upstream/master' into am-concat
jorisvandenbossche Mar 23, 2021
f0061f7
update type check
jorisvandenbossche Mar 23, 2021
9ba8854
simply casting to_concat + fix skips
jorisvandenbossche Mar 23, 2021
ad61f2f
further simplify concat_arrays
jorisvandenbossche Mar 23, 2021
d960619
Merge remote-tracking branch 'upstream/master' into am-concat
jorisvandenbossche Mar 24, 2021
a3c2662
remove redundant cast
jorisvandenbossche Mar 24, 2021
0fafb1a
Merge remote-tracking branch 'upstream/master' into am-concat
jorisvandenbossche Mar 31, 2021
f67e9e2
simplify usage of find_common_type
jorisvandenbossche Mar 31, 2021
f655e33
Merge remote-tracking branch 'upstream/master' into am-concat
jorisvandenbossche Mar 31, 2021
d21bd3a
update annotation
jorisvandenbossche Mar 31, 2021
9435c39
Merge remote-tracking branch 'upstream/master' into am-concat
jorisvandenbossche Apr 2, 2021
22ea7d2
Merge remote-tracking branch 'upstream/master' into am-concat
jorisvandenbossche Apr 7, 2021
77b05f4
fixup typing
jorisvandenbossche Apr 7, 2021
81d0954
Merge remote-tracking branch 'upstream/master' into am-concat
jorisvandenbossche Apr 12, 2021
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -157,3 +157,4 @@ jobs:
run: |
source activate pandas-dev
pytest pandas/tests/frame/methods --array-manager
pytest pandas/tests/reshape --array-manager
149 changes: 147 additions & 2 deletions pandas/core/dtypes/concat.py
@@ -5,27 +5,90 @@

import numpy as np

from pandas._libs import lib
from pandas._typing import ArrayLike, DtypeObj

from pandas.core.dtypes.cast import find_common_type
from pandas.core.dtypes.common import (
is_bool_dtype,
is_categorical_dtype,
is_dtype_equal,
is_extension_array_dtype,
is_integer_dtype,
is_sparse,
)
from pandas.core.dtypes.generic import ABCCategoricalIndex, ABCSeries
from pandas.core.dtypes.missing import na_value_for_dtype

from pandas.core.arrays import ExtensionArray
from pandas.core.arrays.sparse import SparseArray
from pandas.core.construction import array, ensure_wrapped_if_datetimelike


class NullArrayProxy:
Reviewer (Member): another option, if we don't want a whole new thing, would be to use np.empty(shape, dtype="V") (which has obj.nbytes = 0).

Author: That's indeed an option as well. It wouldn't really simplify the code, as we would still need everything that is now in to_array and all the checking for this object/dtype in concat_arrays, but it would indeed avoid the custom class. (Personally, I find the custom class a bit more explicit than reusing a dtype we otherwise don't use, so I would choose that, but I am fine either way.)

"""
Proxy object for an all-NA array.

Only stores the length of the array, and not the dtype. The dtype
will only be known when actually concatenating (after determining the
common dtype, for which this proxy is ignored).
Using this object means that internals/concat.py does not need to determine
the proper dtype and array type.
Reviewer (Member), suggested change: "Using this object simplifies the problem (in internals/concat.py) of determining the result dtype of a concatenation."

Author: Strictly speaking that is not fully correct, as internals/concat.py doesn't determine any result dtype itself. It just passes the arrays to dtypes/concat.py::concat_arrays, which determines the dtype. Was there something unclear in the original sentence that I can try to improve otherwise?

"""

ndim = 1

def __init__(self, n: int):
self.n = n

@property
def shape(self):
return (self.n,)

def to_array(self, dtype: DtypeObj, fill_value=lib.no_default) -> ArrayLike:
"""
Create the actual all-NA array from this NullArrayProxy object.

Parameters
----------
dtype : the dtype for the resulting array
fill_value : scalar NA-like value
By default uses the ExtensionDtype's na_value or np.nan. For numpy
arrays, this can be overridden to be something else (e.g. None).

Returns
-------
np.ndarray or ExtensionArray
"""
if is_extension_array_dtype(dtype):
empty = dtype.construct_array_type()._from_sequence([], dtype=dtype)
indexer = -np.ones(self.n, dtype=np.intp)
return empty.take(indexer, allow_fill=True)
else:
# when introducing missing values, int becomes float, bool becomes object
if is_integer_dtype(dtype):
dtype = np.dtype("float64")
elif is_bool_dtype(dtype):
dtype = np.dtype(object)

if fill_value is lib.no_default:
fill_value = na_value_for_dtype(dtype)

arr = np.empty(self.n, dtype=dtype)
arr.fill(fill_value)
return ensure_wrapped_if_datetimelike(arr)
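The two branches of to_array can be illustrated with a standalone sketch; materialize_all_na below is a hypothetical helper mirroring the logic above (using the public pd.array instead of the private _from_sequence), not pandas API, and it omits the datetime-like dtypes, which would need NaT rather than NaN:

```python
import numpy as np
import pandas as pd

def materialize_all_na(n, dtype):
    """Build an all-NA array of length n for the given dtype."""
    if isinstance(dtype, pd.api.extensions.ExtensionDtype):
        # taking with a -1 indexer and allow_fill=True inserts the
        # dtype's own na_value (pd.NA, NaT, ...)
        empty = pd.array([], dtype=dtype)
        indexer = -np.ones(n, dtype=np.intp)
        return empty.take(indexer, allow_fill=True)
    # NaN is not representable in int/bool dtypes, so upcast first
    if np.issubdtype(dtype, np.integer):
        dtype = np.dtype("float64")
    elif dtype == np.dtype(bool):
        dtype = np.dtype(object)
    arr = np.empty(n, dtype=dtype)
    arr.fill(np.nan)
    return arr
```

For example, an all-NA column reindexed against an int64 column comes back as float64 full of NaN, while a nullable Int64 extension dtype keeps its own NA marker.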


def _cast_to_common_type(arr: ArrayLike, dtype: DtypeObj) -> ArrayLike:
"""
Helper function for `arr.astype(common_dtype)` but handling all special
cases.
"""
if isinstance(arr, NullArrayProxy):
return arr.to_array(dtype)

if (
is_categorical_dtype(arr.dtype)
and isinstance(dtype, np.dtype)
@@ -132,6 +195,73 @@ def is_nonempty(x) -> bool:
return np.concatenate(to_concat, axis=axis)


def concat_arrays(to_concat) -> ArrayLike:
Reviewer (Contributor): can you type this List[Any]?

Author: Yes, that should be fine.

"""
Alternative for concat_compat but specialized for use in the ArrayManager.
Reviewer (Contributor): if we have different implementations for array vs block manager, then let's push this code down into those areas (so it's localized). OK for a follow-up.

Reviewer (Contributor): the problem is that the logic is too scattered and very hard to grok where things are happening.


Differences: only deals with 1D arrays (no axis keyword) and does not skip
empty arrays to determine the dtype.
In addition ensures that all NullArrayProxies get replaced with actual
arrays.

Parameters
----------
to_concat : list of arrays

Returns
-------
np.ndarray or ExtensionArray
"""
# ignore the all-NA proxies to determine the resulting dtype
to_concat_no_proxy = [x for x in to_concat if not isinstance(x, NullArrayProxy)]

kinds = {obj.dtype.kind for obj in to_concat_no_proxy}
single_dtype = len({x.dtype for x in to_concat_no_proxy}) == 1
any_ea = any(is_extension_array_dtype(x.dtype) for x in to_concat_no_proxy)

if any_ea:
Reviewer (Member): this looks really similar to the non-AM code. Can't it be shared?

Author: See my original comment about it at #39612 (comment). Yes, it's really similar, but it's also slightly different in many places. When I tried this originally, I only checked for the null proxy object where needed, which gave lots of if/elses and made it hard to read. Of course, I could also simply always check for it; for the non-ArrayManager cases the proxy object will never be present, so apart from doing a bunch of unnecessary checks, it shouldn't matter. If you are OK with complicating the base concat_compat with handling for the proxy objects (even when not needed), I am certainly fine with combining both and not writing the separate concat_arrays.

if not single_dtype:
target_dtype = find_common_type([x.dtype for x in to_concat_no_proxy])
to_concat = [_cast_to_common_type(arr, target_dtype) for arr in to_concat]
else:
target_dtype = to_concat_no_proxy[0].dtype
to_concat = [
arr.to_array(target_dtype) if isinstance(arr, NullArrayProxy) else arr
for arr in to_concat
]

if isinstance(to_concat[0], ExtensionArray):
cls = type(to_concat[0])
return cls._concat_same_type(to_concat)
else:
return np.concatenate(to_concat)

elif any(kind in ["m", "M"] for kind in kinds):
return _concat_datetime(to_concat)

if not single_dtype:
target_dtype = np.find_common_type(
[arr.dtype for arr in to_concat_no_proxy], []
)
else:
target_dtype = to_concat_no_proxy[0].dtype
to_concat = [
arr.to_array(target_dtype) if isinstance(arr, NullArrayProxy) else arr
for arr in to_concat
]

result = np.concatenate(to_concat)

# TODO(ArrayManager) this is currently inconsistent between Series and DataFrame
# so we should decide whether to keep the below special case or remove it
Reviewer (Member): is this behavior consistent between AM and BM?

Author: Hmm, not sure anymore where I saw this, as for both BlockManager/ArrayManager this gives object dtype with both Series and DataFrame (at least with the latest version of this PR). So I will remove the comment.

Author: Ah, found my notes about this: this is not directly related to Block vs ArrayManager, but something I noticed here that is inconsistent between Series vs DataFrame and empty vs non-empty. For Series, with non-empty inputs we actually coerce to numeric, while with empty bool + float we get object dtype:

In [7]: pd.concat([pd.Series([True], dtype=bool), pd.Series([1.0], dtype=float)])
Out[7]:
0    1.0
0    1.0
dtype: float64

In [8]: pd.concat([pd.Series([], dtype=bool), pd.Series([], dtype=float)])
Out[8]: Series([], dtype: object)

DataFrame, by contrast, uses object dtype in both cases. (Will open a separate issue about it.) The reason this came up specifically for ArrayManager is that we now follow the same rules for Series/DataFrame, and thus also get numeric for non-empty in the DataFrame case.

Author: Opened #39817 about this.

if len(result) == 0:
# all empties -> check for bool to not coerce to float
if len(kinds) != 1:
if "b" in kinds:
result = result.astype(object)
return result
Reviewer (Contributor): see my comments above; the non-internals code should ideally not have these hacks here. This makes it really hard to follow anything. Can you do something about this?

Author: Shall I simply cut and paste this function into internals/concat.py, and then all is "solved"? (Then the null array proxy is not used outside of internals.) Note that my question is slightly ironic, because it doesn't change anything fundamentally, just the semantics of what is considered "internal". But I don't really understand what the problem here is.
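The essential rule in concat_arrays — infer the result dtype only from the non-proxy arrays, then materialize the proxies to it — can be sketched standalone. Proxy and concat_1d below are illustrative stand-ins, not the actual pandas implementation (which routes extension dtypes and datetimes through separate paths); np.result_type stands in for the dtype promotion:

```python
import numpy as np

class Proxy:
    """Stand-in for NullArrayProxy: records only a length, no dtype."""
    def __init__(self, n):
        self.n = n

def concat_1d(to_concat):
    # proxies carry no dtype information, so they are ignored here
    real = [x for x in to_concat if not isinstance(x, Proxy)]
    target = np.result_type(*(a.dtype for a in real))
    if any(isinstance(x, Proxy) for x in to_concat):
        # an all-NA column forces an NA-capable result dtype
        if np.issubdtype(target, np.integer):
            target = np.dtype("float64")
        elif target == np.dtype(bool):
            target = np.dtype(object)
    parts = [
        np.full(x.n, np.nan, dtype=target) if isinstance(x, Proxy)
        else x.astype(target, copy=False)
        for x in to_concat
    ]
    return np.concatenate(parts)
```

So an int64 column concatenated with an all-NA placeholder yields float64, matching what the proxy's to_array produces.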



def union_categoricals(
to_union, sort_categories: bool = False, ignore_order: bool = False
):
@@ -322,20 +452,35 @@ def _concat_datetime(to_concat, axis=0):
a single array, preserving the combined dtypes
"""
to_concat = [ensure_wrapped_if_datetimelike(x) for x in to_concat]
to_concat_no_proxy = [x for x in to_concat if not isinstance(x, NullArrayProxy)]

-    single_dtype = len({x.dtype for x in to_concat}) == 1
+    single_dtype = len({x.dtype for x in to_concat_no_proxy}) == 1

# multiple types, need to coerce to object
if not single_dtype:
# ensure_wrapped_if_datetimelike ensures that astype(object) wraps
# in Timestamp/Timedelta
to_concat = [
arr.to_array(object, fill_value=None)
if isinstance(arr, NullArrayProxy)
else arr
for arr in to_concat
]

return _concatenate_2d([x.astype(object) for x in to_concat], axis=axis)

if axis == 1:
# TODO(EA2D): kludge not necessary with 2D EAs
to_concat = [x.reshape(1, -1) if x.ndim == 1 else x for x in to_concat]
else:
to_concat = [
arr.to_array(to_concat_no_proxy[0].dtype)
if isinstance(arr, NullArrayProxy)
else arr
for arr in to_concat
]

-    result = type(to_concat[0])._concat_same_type(to_concat, axis=axis)
+    result = type(to_concat_no_proxy[0])._concat_same_type(to_concat, axis=axis)

if result.ndim == 2 and is_extension_array_dtype(result.dtype):
# TODO(EA2D): kludge not necessary with 2D EAs
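Why _concat_datetime casts through ensure_wrapped_if_datetimelike before astype(object): a wrapped DatetimeArray yields pd.Timestamp scalars under astype(object), whereas a raw np.datetime64 array does not. A small illustration (not part of the diff):

```python
import numpy as np
import pandas as pd

raw = np.array(["2021-01-01"], dtype="datetime64[ns]")
wrapped = pd.array(raw)  # DatetimeArray, the "wrapped" form

as_obj_raw = raw.astype(object)          # plain scalars, not Timestamps
as_obj_wrapped = wrapped.astype(object)  # pd.Timestamp scalars
```

This is why the proxy path above materializes with fill_value=None when coercing to object: the resulting object array must hold Python-level scalars, not dtype-specific NA markers.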
26 changes: 21 additions & 5 deletions pandas/core/internals/array_manager.py
@@ -18,6 +18,7 @@
is_extension_array_dtype,
is_numeric_dtype,
)
from pandas.core.dtypes.concat import NullArrayProxy
from pandas.core.dtypes.dtypes import ExtensionDtype, PandasDtype
from pandas.core.dtypes.generic import ABCDataFrame, ABCSeries
from pandas.core.dtypes.missing import isna
@@ -725,10 +726,20 @@ def reindex_indexer(
# ignored keywords
consolidate: bool = True,
only_slice: bool = False,
# ArrayManager specific keywords
do_integrity_check=True,
use_na_proxy=False,
) -> T:
axis = self._normalize_axis(axis)
return self._reindex_indexer(
new_axis, indexer, axis, fill_value, allow_dups, copy
new_axis,
indexer,
axis,
fill_value,
allow_dups,
copy,
do_integrity_check,
use_na_proxy,
)

def _reindex_indexer(
@@ -739,6 +750,8 @@
fill_value=None,
allow_dups: bool = False,
copy: bool = True,
do_integrity_check=True,
use_na_proxy=False,
Reviewer (Member): annotate.

) -> T:
"""
Parameters
@@ -773,7 +786,9 @@
new_arrays = []
for i in indexer:
if i == -1:
arr = self._make_na_array(fill_value=fill_value)
arr = self._make_na_array(
fill_value=fill_value, use_na_proxy=use_na_proxy
)
else:
arr = self.arrays[i]
new_arrays.append(arr)
@@ -793,7 +808,7 @@
new_axes = list(self._axes)
new_axes[axis] = new_axis

return type(self)(new_arrays, new_axes)
return type(self)(new_arrays, new_axes, do_integrity_check=do_integrity_check)

def take(self, indexer, axis: int = 1, verify: bool = True, convert: bool = True):
"""
@@ -820,10 +835,11 @@ def take(self, indexer, axis: int = 1, verify: bool = True, convert: bool = True):
new_axis=new_labels, indexer=indexer, axis=axis, allow_dups=True
)

def _make_na_array(self, fill_value=None):
def _make_na_array(self, fill_value=None, use_na_proxy=False):
Reviewer (Member): do we need ArrayLike to include NullArrayProxy?

if use_na_proxy:
return NullArrayProxy(self.shape_proper[0])
if fill_value is None:
fill_value = np.nan

dtype, fill_value = infer_dtype_from_scalar(fill_value)
Reviewer (Member): is the rest of this method ndarray-only?

Author: Yes.

values = np.empty(self.shape_proper[0], dtype=dtype)
values.fill(fill_value)
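The use_na_proxy path of _reindex_indexer can be sketched as follows; Proxy and reindex_columns are illustrative stand-ins for NullArrayProxy and the manager method, not pandas API:

```python
import numpy as np

class Proxy:
    """Stand-in for NullArrayProxy: records only a length."""
    def __init__(self, n):
        self.n = n

def reindex_columns(arrays, indexer, n_rows, use_na_proxy=False):
    """Select columns by position; -1 means "missing column".

    With use_na_proxy=True, missing columns stay lazy placeholders so
    the eventual concat can pick their dtype; otherwise an all-NaN
    float column is materialized eagerly.
    """
    out = []
    for i in indexer:
        if i == -1:
            if use_na_proxy:
                out.append(Proxy(n_rows))
            else:
                fill = np.empty(n_rows, dtype="float64")
                fill.fill(np.nan)
                out.append(fill)
        else:
            out.append(arrays[i])
    return out
```

Deferring the fill is the whole point of the proxy: at reindex time the final dtype of the concatenated column is not yet known.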
61 changes: 47 additions & 14 deletions pandas/core/internals/concat.py
@@ -23,7 +23,7 @@
is_sparse,
is_timedelta64_dtype,
)
from pandas.core.dtypes.concat import concat_compat
from pandas.core.dtypes.concat import concat_arrays, concat_compat
from pandas.core.dtypes.missing import isna_all

import pandas.core.algorithms as algos
@@ -37,6 +37,51 @@
from pandas.core.arrays.sparse.dtype import SparseDtype


def concatenate_array_managers(
Reviewer (Contributor): same comment as above.

mgrs_indexers, axes: List[Index], concat_axis: int, copy: bool
) -> Manager:
"""
Concatenate array managers into one.

Parameters
----------
mgrs_indexers : list of (ArrayManager, {axis: indexer,...}) tuples
axes : list of Index
concat_axis : int
copy : bool

Returns
-------
ArrayManager
"""
# reindex all arrays
mgrs = []
for mgr, indexers in mgrs_indexers:
for ax, indexer in indexers.items():
mgr = mgr.reindex_indexer(
axes[ax],
indexer,
axis=ax,
allow_dups=True,
do_integrity_check=False,
use_na_proxy=True,
)
mgrs.append(mgr)

# concatting along the rows -> concat the reindexed arrays
if concat_axis == 1:
arrays = [
concat_arrays([mgrs[i].arrays[j] for i in range(len(mgrs))])
Reviewer (Contributor): can you now remove concat_compat?

Reviewer (Contributor): and reading your comment below, why is this not in array manager if it's only used there?

Author: concat_compat is used in several other places as well. And this is the code for concatting managers, which for the BlockManager also resides in internals/concat.py.

for j in range(len(mgrs[0].arrays))
]
return ArrayManager(arrays, [axes[1], axes[0]], do_integrity_check=False)
# concatting along the columns -> combine reindexed arrays in a single manager
else:
assert concat_axis == 0
arrays = list(itertools.chain.from_iterable([mgr.arrays for mgr in mgrs]))
return ArrayManager(arrays, [axes[1], axes[0]], do_integrity_check=False)
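The axis-1 branch above builds column j of the result by concatenating column j from every reindexed manager. A minimal sketch with plain lists of 1D numpy arrays (concat_managers_rowwise is a hypothetical helper, not pandas API):

```python
import numpy as np

def concat_managers_rowwise(managers):
    """Each manager is a list of 1D column arrays; the result's
    column j is the concatenation of column j from every manager."""
    n_cols = len(managers[0])
    assert all(len(m) == n_cols for m in managers)
    return [
        np.concatenate([m[j] for m in managers])
        for j in range(n_cols)
    ]
```

The axis-0 branch needs no concatenation at all: the column arrays of the managers are simply chained into one list, which is what the itertools.chain call above does.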


def concatenate_block_managers(
mgrs_indexers, axes: List[Index], concat_axis: int, copy: bool
) -> Manager:
@@ -55,19 +100,7 @@ def concatenate_block_managers(
BlockManager
"""
if isinstance(mgrs_indexers[0][0], ArrayManager):

if concat_axis == 1:
# TODO for now only fastpath without indexers
mgrs = [t[0] for t in mgrs_indexers]
arrays = [
concat_compat([mgrs[i].arrays[j] for i in range(len(mgrs))], axis=0)
for j in range(len(mgrs[0].arrays))
]
return ArrayManager(arrays, [axes[1], axes[0]])
elif concat_axis == 0:
mgrs = [t[0] for t in mgrs_indexers]
arrays = list(itertools.chain.from_iterable([mgr.arrays for mgr in mgrs]))
return ArrayManager(arrays, [axes[1], axes[0]])
return concatenate_array_managers(mgrs_indexers, axes, concat_axis, copy)

concat_plans = [
_get_mgr_concatenation_plan(mgr, indexers) for mgr, indexers in mgrs_indexers