ENH: add ExtensionArray.to_numpy to have control over conversion to numpy array #30322

Merged
merged 18 commits into from
Jan 7, 2020
Changes from 4 commits
35 changes: 24 additions & 11 deletions pandas/core/arrays/boolean.py
@@ -314,29 +314,42 @@ def __getitem__(self, item):
return self._data[item]
return type(self)(self._data[item], self._mask[item])

def _coerce_to_ndarray(self, dtype=None, na_value: "Scalar" = libmissing.NA):
def to_numpy(self, dtype=None, copy=False, na_value: "Scalar" = libmissing.NA):
"""
Coerce to an ndarray of object dtype or bool dtype (if force_bool=True).
Convert to a numpy array.

By default converts to a numpy object array. Specify the `dtype` and
`na_value` keywords to customize the conversion.

Parameters
----------
dtype : dtype, default object
The numpy dtype to convert to
The numpy dtype to convert to.
copy : bool, default False
Whether to ensure that the returned value is not a view on
the array. Note that ``copy=False`` does not *ensure* that
``to_numpy()`` is no-copy. Rather, ``copy=True`` ensures that
a copy is made, even if not strictly necessary.
Contributor

Can we give some guidance on when no-copy is possible? Is it only when there are no missing values and we're going to the numpy dtype (bool in this case)?

And thinking forward, can a pyarrow array with no NAs be converted to an ndarray without any copies?

Member Author

Yes, I think it is only possible with no NAs and bool dtype. I first thought int8 would also be possible, but numpy doesn't seem to do such conversion without copy.

> And thinking forward, can a pyarrow array with no NAs be converted to an ndarray without any copies?

Not for boolean (since Arrow stores booleans as bits, not bytes), but in general (e.g. for IntegerArray without nulls), yes.
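
The no-copy condition discussed above can be checked with plain NumPy. The following is a minimal sketch, assuming `data` stands in for the array's internal boolean buffer (the variable names are illustrative and not part of the PR):

```python
import numpy as np

# Stand-in for the internal bool buffer of an array without missing values.
data = np.array([True, False, True])

# Same dtype and copy=False: NumPy can hand back the input array itself.
same_dtype = data.astype(bool, copy=False)

# Different dtype: astype allocates a new array even with copy=False.
as_int8 = data.astype("int8", copy=False)

# Reinterpreting the bytes with view() is a separate operation that shares memory.
as_view = data.view("int8")

print(np.shares_memory(data, same_dtype))
print(np.shares_memory(data, as_int8))
print(np.shares_memory(data, as_view))
```

Only the first and last results share memory with `data`, which matches the observation that `astype` to int8 does not avoid the copy.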

na_value : scalar, optional
Scalar missing value indicator to use in numpy array. Defaults
to the native missing value indicator of this array (pd.NA).

Returns
-------
np.ndarray
"""
if dtype is None:
dtype = object
if is_bool_dtype(dtype):
if not self.isna().any():
return self._data
else:
if self.isna().any():
if is_bool_dtype(dtype) and na_value is libmissing.NA:
raise ValueError(
"cannot convert to bool numpy array in presence of missing values"
)
data = self._data.astype(dtype)
data[self._mask] = na_value
# don't pass copy to astype -> always need a copy since we are mutating
data = self._data.astype(dtype)
data[self._mask] = na_value
else:
data = self._data.astype(dtype, copy=copy)
return data

__array_priority__ = 1000 # higher than ndarray so ops dispatch to us
@@ -347,7 +360,7 @@ def __array__(self, dtype=None):
We return an object array here to preserve our scalar values
"""
# by default (no dtype specified), return an object array
return self._coerce_to_ndarray(dtype=dtype)
return self.to_numpy(dtype=dtype)

def __arrow_array__(self, type=None):
"""
@@ -523,7 +536,7 @@ def astype(self, dtype, copy=True):
if is_float_dtype(dtype):
na_value = np.nan
# coerce
data = self._coerce_to_ndarray(na_value=na_value)
data = self.to_numpy(na_value=na_value)
return astype_nansafe(data, dtype, copy=None)

def value_counts(self, dropna=True):
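
Because the diff above has lost its indentation and its added/removed markers, the new control flow can be hard to follow. Here is a standalone sketch of the same idea in plain NumPy, assuming a `(data, mask)` pair; `masked_to_numpy` and the `NA` sentinel are hypothetical stand-ins for the method and for `libmissing.NA`, not the actual pandas code:

```python
import numpy as np

NA = object()  # hypothetical stand-in for pd.NA / libmissing.NA


def masked_to_numpy(data, mask, dtype=None, copy=False, na_value=NA):
    """Convert a (values, mask) pair to a plain ndarray."""
    if dtype is None:
        dtype = object
    dtype = np.dtype(dtype)
    if mask.any():
        if dtype == np.dtype(bool) and na_value is NA:
            raise ValueError(
                "cannot convert to bool numpy array in presence of missing values"
            )
        # astype copies by default here, so filling na_value never mutates `data`
        result = data.astype(dtype)
        result[mask] = na_value
    else:
        # no missing values: honour the caller's copy preference
        result = data.astype(dtype, copy=copy)
    return result


data = np.array([True, False, True])
mask = np.array([False, False, True])
as_float = masked_to_numpy(data, mask, dtype="float64", na_value=np.nan)
print(as_float)
```

With missing values present a copy is always made before the fill, so `copy=False` only has an effect on the branch without missing values.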
15 changes: 14 additions & 1 deletion pandas/core/base.py
@@ -780,7 +780,7 @@ def array(self) -> ExtensionArray:

return result

def to_numpy(self, dtype=None, copy=False):
def to_numpy(self, dtype=None, copy=False, **kwargs):
Contributor

can we instead add na_value here?

Contributor

this is changing the signature of all EA, but are you adding tests? (e.g. StringArray / IntArray), or as followup?

Member Author

> this is changing the signature of all EA, but are you adding tests? (e.g. StringArray / IntArray), or as followup?

Note this is not changing the signature of the EA method (this is the Series/Index method)

I think in general it might be useful to pass through kwargs, in that way ExtensionArray authors can have more specific control over the conversion to numpy (if we do this, we should add a test for it for one of the test EAs, like DecimalArray, with an additional keyword in the to_numpy method).

Specifically for na_value, we could maybe add it to the actual signature. Although it is right now not implemented for most of the dtypes (only for boolean EA dtype).

Contributor

right, I actually meant that. yeah, I would be explicit here, I think. the question is: can you? e.g. you want the default to be libmissing.NA, but that is not the default for anything but BA now? I think it would be better to be explicit here.

Member Author

Yes, that's what makes it more complicated here. It's really an EA-specific keyword, and the general solution for that is passing through keywords (as done here), I think.
Of course, for our own EAs, we can make exceptions and explicitly add them.

Now, since the other dtypes don't yet implement this, the default of libmissing.NA might not necessarily be a problem (this default is not valid for other dtypes, but it's not used for those dtypes anyway). Still, having this keyword with such a default might be confusing, since the NA won't be used for most dtypes.

If we add it here explicitly, I would probably use the lib._no_default trick

Contributor

> If we add it here explicitly, I would probably use the lib._no_default trick

I think this would be the better long-term solution, yes?
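
For readers unfamiliar with the "no default" trick mentioned in this thread, here is a minimal pure-Python sketch. It assumes a module-level sentinel object plays the role of `lib._no_default`; `MaskedToy`, `PlainToy`, and the free function `to_numpy` are hypothetical illustrations, not pandas code:

```python
_no_default = object()  # hypothetical sentinel, standing in for lib._no_default


class MaskedToy:
    """Toy array whose own default for na_value is the string '<NA>'."""

    def __init__(self, values):
        self.values = values

    def to_numpy(self, na_value="<NA>"):
        return [na_value if v is None else v for v in self.values]


class PlainToy:
    """Toy array that does not accept na_value at all."""

    def __init__(self, values):
        self.values = values

    def to_numpy(self):
        return list(self.values)


def to_numpy(values, na_value=_no_default):
    """Container-level wrapper: forward na_value only when the caller passed it."""
    kwargs = {}
    if na_value is not _no_default:
        kwargs["na_value"] = na_value
    return values.to_numpy(**kwargs)


print(to_numpy(MaskedToy([1, None])))                 # array's own default applies
print(to_numpy(MaskedToy([1, None]), na_value=None))  # explicit None is respected
print(to_numpy(PlainToy([1, 2])))                     # na_value is never forwarded
```

The sentinel lets the wrapper tell "not passed" apart from an explicit `None`, so each array type keeps its own default and arrays without the keyword are unaffected.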

"""
A NumPy ndarray representing the values in this Series or Index.

@@ -795,6 +795,11 @@ def to_numpy(self, dtype=None, copy=False):
another array. Note that ``copy=False`` does not *ensure* that
``to_numpy()`` is no-copy. Rather, ``copy=True`` ensures that
a copy is made, even if not strictly necessary.
**kwargs
Additional keywords passed through to the ``to_numpy`` method
of the underlying array (for extension arrays).

.. versionadded:: 1.0.0

Returns
-------
@@ -864,6 +869,14 @@ def to_numpy(self, dtype=None, copy=False):
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00...'],
dtype='datetime64[ns]')
"""
if is_extension_array_dtype(self.dtype) and hasattr(self.array, "to_numpy"):
return self.array.to_numpy(dtype, copy=copy, **kwargs)
else:
if kwargs:
msg = "to_numpy() got an unexpected keyword argument '{}'".format(
list(kwargs.keys())[0]
)
raise TypeError(msg)
if is_datetime64tz_dtype(self.dtype) and dtype is None:
# note: this is going to change very soon.
# I have a WIP PR making this unnecessary, but it's
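
The keyword passthrough in the hunk above, together with the suggestion to test it with an array that takes an extra keyword, can be illustrated without pandas. This is a sketch under stated assumptions: `ToyDecimalArray`, `ToyContainer`, and the `decimals` keyword are all hypothetical, and lists stand in for ndarrays:

```python
from decimal import Decimal


class ToyDecimalArray:
    """Toy extension-style array whose to_numpy accepts an extra keyword."""

    def __init__(self, values):
        self._values = [Decimal(v) for v in values]

    def to_numpy(self, dtype=None, copy=False, decimals=None):
        values = self._values
        if decimals is not None:
            # array-specific option that the container knows nothing about
            values = [round(v, decimals) for v in values]
        return list(values)


class ToyContainer:
    """Toy Series-like wrapper that forwards unknown keywords to its array."""

    def __init__(self, array):
        self.array = array

    def to_numpy(self, dtype=None, copy=False, **kwargs):
        if hasattr(self.array, "to_numpy"):
            return self.array.to_numpy(dtype, copy=copy, **kwargs)
        if kwargs:
            # mimic Python's own error for an unknown keyword argument
            raise TypeError(
                "to_numpy() got an unexpected keyword argument '{}'".format(
                    next(iter(kwargs))
                )
            )
        return list(self.array)


rounded = ToyContainer(ToyDecimalArray(["1.234", "2.345"])).to_numpy(decimals=1)
print(rounded)
```

Arrays that define `to_numpy` receive the extra keywords unchanged, while plain data rejects them with the same `TypeError` a fixed signature would raise.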
64 changes: 64 additions & 0 deletions pandas/tests/arrays/test_boolean.py
@@ -251,6 +251,70 @@ def test_coerce_to_numpy_array():
np.array(arr, dtype="bool")


@pytest.mark.parametrize("box", [True, False], ids=["series", "array"])
def test_to_numpy(box):
con = pd.Series if box else pd.array
# default (with or without missing values) -> object dtype
arr = con([True, False, True], dtype="boolean")
result = arr.to_numpy()
expected = np.array([True, False, True], dtype="object")
tm.assert_numpy_array_equal(result, expected)

arr = con([True, False, None], dtype="boolean")
result = arr.to_numpy()
expected = np.array([True, False, pd.NA], dtype="object")
tm.assert_numpy_array_equal(result, expected)

# no missing values -> can convert to bool, otherwise raises
arr = con([True, False, True], dtype="boolean")
result = arr.to_numpy(dtype="bool")
expected = np.array([True, False, True], dtype="bool")
tm.assert_numpy_array_equal(result, expected)

arr = con([True, False, None], dtype="boolean")
with pytest.raises(ValueError, match="cannot convert to bool numpy"):
result = arr.to_numpy(dtype="bool")

# specify dtype and na_value
arr = con([True, False, None], dtype="boolean")
result = arr.to_numpy(dtype=object, na_value=None)
expected = np.array([True, False, None], dtype="object")
tm.assert_numpy_array_equal(result, expected)

result = arr.to_numpy(dtype=bool, na_value=False)
expected = np.array([True, False, False], dtype="bool")
tm.assert_numpy_array_equal(result, expected)

result = arr.to_numpy(dtype="int64", na_value=-99)
expected = np.array([1, 0, -99], dtype="int64")
tm.assert_numpy_array_equal(result, expected)

result = arr.to_numpy(dtype="float64", na_value=np.nan)
expected = np.array([1, 0, np.nan], dtype="float64")
tm.assert_numpy_array_equal(result, expected)

# converting to int or float without specifying na_value raises
with pytest.raises(TypeError):
arr.to_numpy(dtype="int64")
with pytest.raises(TypeError):
arr.to_numpy(dtype="float64")


def test_to_numpy_copy():
# to_numpy can be zero-copy if no missing values
arr = pd.array([True, False, True], dtype="boolean")
result = arr.to_numpy(dtype=bool)
result[0] = False
tm.assert_extension_array_equal(
arr, pd.array([False, False, True], dtype="boolean")
)

arr = pd.array([True, False, True], dtype="boolean")
result = arr.to_numpy(dtype=bool, copy=True)
result[0] = False
tm.assert_extension_array_equal(arr, pd.array([True, False, True], dtype="boolean"))


def test_astype():
# with missing values
arr = pd.array([True, False, None], dtype="boolean")