API: how to handle NA in conversion to numpy arrays #30038
+1 for a standard `to_numpy`. I'm really uncertain about the right defaults, though.
Great question @jorisvandenbossche. This all relates to the semantics of `pd.NA`; I think there are a few cases to consider. In general, I'm in favor of raising exceptions whenever we try to convert `pd.NA` to a NumPy dtype that cannot represent it.
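The cases at stake presumably look something like this (my sketch, not the original comment's list):

```python
import numpy as np
import pandas as pd

a = pd.array([1, 2, None], dtype="Int64")

np.asarray(a, dtype=object)  # object ndarray: which NA value should it hold?
a.astype("float64")          # float ndarray: convert pd.NA to np.nan, or raise?
a.astype("int64")            # int ndarray: no NA representation at all -- must raise
```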
`Series.values` can return an ndarray or an extension array (e.g. `Categorical`). I don't think we'll want to change the behavior of that anywhere. We have `Series.array` and `Series.to_numpy()` for explicitly requesting one or the other.
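Roughly, for a nullable-integer Series (my sketch of how the accessors behave, assuming the `Int64` dtype from the PRs above):

```python
import pandas as pd

s = pd.Series(pd.array([1, 2, None], dtype="Int64"))

s.values      # IntegerArray -- the extension array itself
s.array       # IntegerArray -- always the backing array
s.to_numpy()  # ndarray (object dtype here), with the missing values materialized
```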
At the moment, aliases like `float` refer to the numpy `float64` dtype.
So, IMHO, if someone uses `astype(float)`, they are explicitly asking for a numpy float array, and `np.nan` is the natural missing value there.
So if I understand you right, if you have `astype(float)`, you'd get `np.nan` for the missing values. But for a future nullable float dtype, the NAs would be preserved?
All the cases we are talking about are new ExtensionArrays, and for those `astype(float)` could in principle target a nullable float dtype eventually.
Yes, that is correct. But, as Tom said above, we don't have a "float newtype", so all discussion about converting to float is, at this moment, by definition conversion to numpy (until we add a float ExtensionArray).
I think I am also in favor of this.
Rereading this again, it seems the conclusion is somewhat that we want to raise for this (so no automatic conversion of `pd.NA` to `np.nan`). I am wondering how that impacts projects like e.g. scikit-learn. They will need to use something like `.to_numpy(dtype=float, na_value=np.nan)` instead of a plain `np.asarray(...)`.
Or rather, the user would need to do this before handing the DataFrame off to a scikit-learn estimator? But yes, I think the best thing for now is to not implicitly convert NA to NaN. And I don't think that matches the current behaviour; today both `astype` and `np.asarray` convert to NaN:

```python
In [13]: a = pd.array([1, None])

In [14]: a.astype("float")
Out[14]: array([ 1., nan])

In [15]: np.asarray(a, dtype="float")
Out[15]: array([ 1., nan])
```
Quickly checked how scikit-learn is handling nullable integers with 0.25.3 and master. With 0.25.3, this actually works, by being converted to floats: the conversion to array gives an object-dtype array with `np.nan`, which can then be converted to float. However, with master you don't get an object array with `np.nan` but one with `pd.NA`, which cannot be converted to float, so you get an error (see the sketch below). But this failure already happens on master due to the change of `np.nan` -> `pd.NA`, so regardless of the discussion here about being more strict or not in the conversion of `IntegerArray` to a float array.
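Reconstructing those conversions from the description (my sketch, not the original snippets; exact reprs may differ by version):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# pandas 0.25.3: converting to an array gives object dtype with np.nan ...
np.asarray(s)                  # array([1, 2, nan], dtype=object)
# ... which can then be converted to float:
np.asarray(s).astype("float")  # array([ 1.,  2., nan])

# pandas master: the object array holds pd.NA instead, and converting
# that to float raises:
np.asarray(s)                  # array([1, 2, <NA>], dtype=object)
np.asarray(s).astype("float")  # TypeError: float() argument must be a string or a number, not 'NAType'
```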
Thanks for checking. Do you think that automatically converting `pd.NA` to `np.nan` would be acceptable there? If we think NA and NaN are essentially interchangeable for downstream libraries, the implicit conversion might be fine.
I am also on the fence, but still slightly leaning towards raising. Yes, for scikit-learn, converting `pd.NA` to `np.nan` in `__array__` would be the more convenient option. Of course, raising instead means that scikit-learn would need to adapt and explicitly implement support for nullable dtypes, instead of relying on the implicit conversion as is done now.
Comparisons are probably the biggest difference between NA and NaN in an ndarray. Reductions are another one (NA rather than NaN for the result, and `nansum` raises for NA). Given that we're both on slightly different sides of the fence, raising on these conversions seems like the safest default to start with.
And by raising on `astype`, I mean:

```python
In [6]: s = pd.Series(pd.array([1, 2, None]))

In [7]: s.astype("float")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-3b1bb0b634ef> in <module>
----> 1 s.astype("float")

~/sandbox/pandas/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5663         else:
   5664             # else, only a single dtype is given
-> 5665             new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
   5666             return self._constructor(new_data).__finalize__(self)
   5667

~/sandbox/pandas/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    581
    582     def astype(self, dtype, copy: bool = False, errors: str = "raise"):
--> 583         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    584
    585     def convert(self, **kwargs):

~/sandbox/pandas/pandas/core/internals/managers.py in apply(self, f, filter, **kwargs)
    441                 applied = b.apply(f, **kwargs)
    442             else:
--> 443                 applied = getattr(b, f)(**kwargs)
    444             result_blocks = _extend_blocks(applied, result_blocks)
    445

~/sandbox/pandas/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    585         if self.is_extension:
    586             # TODO: Should we try/except this astype?
--> 587             values = self.values.astype(dtype)
    588         else:
    589             if issubclass(dtype.type, str):

~/sandbox/pandas/pandas/core/arrays/integer.py in astype(self, dtype, copy)
    567
    568         # coerce
--> 569         data = self.to_numpy(dtype=dtype)
    570         return astype_nansafe(data, dtype, copy=False)
    571

~/sandbox/pandas/pandas/core/arrays/integer.py in to_numpy(self, dtype, copy, na_value)
    395         if not is_object_dtype(dtype) and na_value is libmissing.NA:
    396             raise ValueError(
--> 397                 f"cannot convert to '{dtype}'-dtype NumPy array "
    398                 f"with missing values."
    399             )

ValueError: cannot convert to 'float64'-dtype NumPy array with missing values.
```

I think you need something like:

```python
In [8]: pd.Series(s.to_numpy(dtype="float", na_value=np.nan), index=s.index, name=s.name)
Out[8]:
0    1.0
1    2.0
2    NaN
dtype: float64
```
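To make the NA-vs-NaN comparison difference mentioned a couple of comments up concrete (a quick sketch, not from the original thread):

```python
import numpy as np
import pandas as pd

np.nan == 1  # False -- comparisons against NaN are always False
np.nan != 1  # True
pd.NA == 1   # <NA>  -- comparisons against NA propagate NA
pd.NA != 1   # <NA>
```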
Can we make an exception for conversion to float, where `pd.NA` becomes `np.nan`?
Mmm, we can certainly have `astype(float)` special-case the missing values to `np.nan`.
I see now that for `BooleanArray.astype`, we actually already do it that way (special-casing float to substitute NaNs).
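For what it's worth, a standalone sketch of that float special case (my own illustration over a raw data/mask pair, not the actual pandas source):

```python
import numpy as np

def masked_to_float(data, mask):
    # data: raw values (e.g. an int or bool ndarray); mask: True where missing
    result = data.astype("float64")
    result[mask] = np.nan
    return result

masked_to_float(np.array([1, 2, 0]), np.array([False, False, True]))
# -> array([ 1.,  2., nan])
```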
Hi, I was wondering whether it would be possible to consider the special case when an extension type doesn't actually have any missing values. E.g. `pd.Series([0, 1, 2], dtype="Int64").to_numpy()` returns `array([0, 1, 2], dtype=object)`, when it may in many cases be more convenient if the numpy dtype were integer rather than object.
We try to avoid value-dependent behavior where the metadata (shape, dtype, etc.) depend on the values of the array.
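One way to square that with the request above is to keep the object-dtype default but let the user opt in explicitly; in recent pandas the `to_numpy` keywords discussed in this issue allow exactly that:

```python
import pandas as pd

s = pd.Series([0, 1, 2], dtype="Int64")
s.to_numpy(dtype="int64")  # array([0, 1, 2]) -- fine, no missing values

s_na = pd.Series([0, None], dtype="Int64")
s_na.to_numpy(dtype="int64")  # ValueError: cannot convert ... with missing values
```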
Makes sense too. I was thinking about internal use cases we have, like aggregating a series (possibly of extension type) with a function like `unique`, i.e. creating a list-like result for each group. It can get pretty messy at the moment massaging that into a consistent form (dtype) that can be serialized with e.g. Arrow (sometimes you'll get a numpy array and sometimes an extension array, and converting that extension array to numpy depends on its specific dtype, etc.). But I understand you'll want to leave that to the user...
For your specific use case, note that extension arrays can be converted to Arrow (e.g. `IntegerArray` cleanly converts to an arrow int array; see the sketch below), so it might be an option to leave the list-like results from `unique` as they are (ndarray or pandas extension array) until converting to arrow (of course, this all depends on the specificities of your use case, whether this would make sense). It's a bit of a trade-off here between consistent behaviour vs what is most practical for users. I agree with Tom that we want to avoid value-dependent behaviour (it is good to always know that a certain pandas dtype gets converted to a certain numpy dtype regardless of the content of the array); this is also the reason we chose this behaviour in the first place. On the other hand, for all cases where you don't actually have missing values, having it convert to a numpy int dtype instead of object dtype might make interoperability with numpy a lot easier.
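For example, going straight from the nullable array to Arrow avoids the object-dtype detour entirely (assuming a pyarrow version that understands pandas' `__arrow_array__` protocol):

```python
import pandas as pd
import pyarrow as pa

arr = pd.array([1, 2, None], dtype="Int64")
pa.array(arr)  # -> pyarrow Int64Array: [1, 2, null]
```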
I agree, the current implementation returns predictable dtypes at least, which makes it feasible on the user side to map to specific numpy dtypes where possible. That might get a little messy, but if the dtypes weren't as consistent as they are, the same messiness would probably just pop up elsewhere.
In #29964 and #29961 (NA in `IntegerArray` and `BooleanArray`), the question comes up how to handle `pd.NA`'s in conversion to numpy arrays. Such conversion occurs mainly in `__array__` (for `np.(as)array(..)`) and `.astype()`. For example:
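(My reconstruction of the kind of example meant here, assuming the nullable `Int64` dtype:)

```python
import numpy as np
import pandas as pd

arr = pd.array([1, 2, None], dtype="Int64")

np.asarray(arr)        # object ndarray -- which NA value should it contain?
arr.astype("float64")  # float ndarray -- should pd.NA become np.nan, or raise?
```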
Questions that come up here:

- By default, when converting to object dtype, what "NA value" should be used? Before, this was `NaN` or `None`; now it could logically be `pd.NA`. A possible reason to choose `None` instead of `pd.NA` is that third-party code that needs a numpy array will typically not be able to handle `pd.NA`, while `None` is much more normal. On the other hand, there is also still time for such third-party code to adapt. And it will probably be good to keep `list(arr)` (iteration/getitem) and `np.array(arr, dtype=object)` consistent.
- When converting to a float dtype, are we fine to automatically convert `pd.NA` to `np.nan`? Or do we think the user should explicitly opt in to this?

We will probably want to add a `to_numpy` to those Integer/BooleanArray to be able to make those choices explicit, e.g. with the following signature (sketched below):
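A sketch of what such a signature could look like (the names and the sentinel are my assumptions, not a settled API):

```python
import numpy as np
import pandas as pd

_no_default = object()  # sentinel: "na_value was not explicitly passed"

# Illustration only: the real method would live on IntegerArray/BooleanArray;
# `_data` (raw values) and `_mask` (True where missing) mirror their internals.
def to_numpy(self, dtype=object, na_value=_no_default):
    if na_value is _no_default:
        # this default is exactly the open question discussed above
        na_value = pd.NA if dtype == object else np.nan
    result = self._data.astype(dtype)
    result[self._mask] = na_value
    return result
```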
where you can explicitly say which value to use for the NAs in the final numpy array (and `Series.to_numpy` can then forward such a keyword). That way, a user can do `arr.to_numpy(dtype=object, na_value=None)` to get a numpy array with `None` instead of `pd.NA`, or `arr.to_numpy(dtype=float, na_value=np.nan)` to get a float array with NaNs.

But even if we have that function (which I think we should), the above questions about the defaults are still to be answered (e.g. for `__array__` we cannot have such an `na_value` keyword, so we need to make a default choice).

cc @TomAugspurger @Dr-Irv