
API: how to handle NA in conversion to numpy arrays #30038

Closed · jorisvandenbossche opened this issue Dec 4, 2019 · 20 comments · Fixed by #30792
Labels
API Design · Compat (pandas objects compatibility with NumPy or Python functions) · ExtensionArray (Extending pandas with custom dtypes or arrays) · Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)
Milestone
1.0

Comments

@jorisvandenbossche (Member)

In #29964 and #29961 (NA in IntegerArray and BooleanArray), the question comes up of how to handle pd.NA in conversions to numpy arrays.

Such conversion occurs mainly in __array__ (for np.(as)array(..)) and .astype(). For example:

In [3]: arr = pd.array([1, 2, pd.NA], dtype="Int64")  

In [4]: np.asarray(arr) 
Out[4]: array([1, 2, None/pd.NA/..?], dtype=object)

In [5]: arr.astype(float)  
Out[5]: array([ 1.,  2., nan])  # <--- allow automatic NA to NaN conversion?

Questions that come up here:

  • By default, when converting to object dtype, what "NA value" should be used? Before, this was NaN or None; now it could logically be pd.NA.
    A possible reason to choose None instead of pd.NA is that third-party code that needs a numpy array will typically not be able to handle pd.NA, while None is much more common. On the other hand, there is also still time for such third-party code to adapt. And it will probably be good to keep list(arr) (iteration/getitem) and np.array(arr, dtype=object) consistent.

  • When converting to a float dtype, are we fine with automatically converting pd.NA to np.nan? Or do we think the user should explicitly opt in to this?

We will probably want to add a to_numpy method to IntegerArray/BooleanArray to be able to make those choices explicit, e.g. with the following signature:

def to_numpy(self, dtype=object, na_value=...):
    ... 

where you can explicitly say which value to use for the NAs in the final numpy array (and Series.to_numpy can then forward such a keyword).
That way, a user can do arr.to_numpy(dtype=object, na_value=None) to get a numpy array with None instead of pd.NA, or arr.to_numpy(dtype=float, na_value=np.nan) to get a float array with NaNs.
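
For concreteness, this is how those calls would look with the arr from above, assuming the proposed na_value keyword behaves as described:

In [6]: arr.to_numpy(dtype=object, na_value=None)
Out[6]: array([1, 2, None], dtype=object)

In [7]: arr.to_numpy(dtype=float, na_value=np.nan)
Out[7]: array([ 1.,  2., nan])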

But even if we have that function (which I think we should), the above questions about the defaults still need to be answered (e.g. for __array__ we cannot have such an na_value keyword, so we need to make a default choice).

cc @TomAugspurger @Dr-Irv

@jorisvandenbossche added the API Design, Compat and ExtensionArray labels on Dec 4, 2019
@jorisvandenbossche added this to the 1.0 milestone on Dec 4, 2019
@TomAugspurger (Contributor)

+1 for a standard array.to_numpy(dtype=None, na_value=None) method. Perhaps we'll want to align this with Series.to_numpy, and make the signature .to_numpy(dtype=None, copy=False, na_value=None).

I'm really uncertain about np.asarray(arr, dtype=float) / array.astype(float). On the one hand, how annoying will it be to have np.asarray(arr) be essentially useless for some time, as projects won't really know what to do with it? On the other hand, it really is inconsistent to have .astype(float) imply NA -> NaN while not allowing float(NA) to mean NaN. So at this moment I'd probably vote for IntegerArray.astype(float) / np.asarray(integer_array, dtype=float) raising when there are NA values present, but I'm not sure.

@TomAugspurger added the Missing-data label on Dec 4, 2019
@Dr-Irv (Contributor) commented Dec 4, 2019

Great question @jorisvandenbossche. This all relates to the semantics of pd.NA vs. np.nan, and the legacy issue of pandas using np.nan to mean "missing value" independent of type.

I think there are a few cases to consider:

  1. Currently (pandas 0.25.x) pd.Series.values returns a numpy array, so that has to change, right? That then has two sub-options:
    a. It still returns a numpy array, in which case I don't see how you solve the conversion issue except by assuming pd.NA values go to NaN, or by raising an exception.
    b. It returns a different iterable object (a pandas array type), in which case we don't have to worry about it.
  2. How to convert any array to a numpy array. I like the array.to_numpy(dtype=None, na_value=None) option.
  3. For astype, we are using pandas types, so I think that pd.NA should survive across type conversions. So IntegerArray.astype(float) would preserve pd.NA from the integers over to the floats, since the result is still a Series.
  4. For np.asarray, raise an exception if any pd.NA values are present, which makes users aware that pd.NA doesn't convert to a representation that numpy supports, and that the workaround is to use the to_numpy method. I don't know if that's even possible to do; if not, we return the pd.NA values and numpy will make the dtype of the result object.

In general, I'm in favor of raising exceptions whenever we try to convert pd.NA to something else, which forces users to understand that pd.NA is a different thing than None or np.nan, and then we provide new methods (or arguments, as appropriate) to force users to make a choice of how the conversion happens.
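
A minimal sketch of what that raising behaviour could look like for a masked extension array; the _data / _mask attributes here are assumptions about the internal layout, not the actual pandas implementation:

import numpy as np

# method sketch on a hypothetical masked integer/boolean array class
def __array__(self, dtype=None):
    # refuse implicit conversion when missing values are present,
    # pointing users at the explicit to_numpy(na_value=...) escape hatch
    if self._mask.any():  # assumed boolean mask marking the NA positions
        raise ValueError(
            "cannot convert to a NumPy array with missing values; "
            "use .to_numpy(na_value=...) to specify a fill value"
        )
    return np.asarray(self._data, dtype=dtype)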

@TomAugspurger (Contributor)

Series.values can return an ndarray or an extension array (e.g. Categorical). I don't think we'll want to change the behavior of that anywhere. We have .array and .to_numpy() as alternatives, and .to_numpy() will grow an option to control the NA value.

> So IntegerArray.astype(float) would preserve pd.NA from the integers over to the floats, since the result is still a Series.

At the moment, aliases like float or 'float' fall back to np.dtype(thing) when there's no extension type for them. So IntegerArray.astype(float) will return a NumPy array. And Series[Int64].astype(float) will return a Series[numpy.float64] (a Series backed by a float64-dtype NumPy array).
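
That alias fallback can be checked directly with pandas_dtype (a public pandas API; outputs as of pandas 1.x):

In [1]: from pandas.api.types import pandas_dtype

In [2]: pandas_dtype(float)    # no extension dtype registered for this alias
Out[2]: dtype('float64')

In [3]: pandas_dtype("Int64")  # extension alias resolves to the nullable dtype
Out[3]: Int64Dtype()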

@Dr-Irv (Contributor) commented Dec 4, 2019

> Series.values can return an ndarray or an extension array (e.g. Categorical). I don't think we'll want to change the behavior of that anywhere. We have .array and .to_numpy() as alternatives, and .to_numpy() will grow an option to control the NA value.

So, IMHO, if someone uses Series.values and the result is a numpy array (ndarray), and the series has pd.NA in it, I think we should raise an exception, because we shouldn't make any assumptions about how pd.NA is stored in a numpy array.

> At the moment, aliases like float or 'float' fall back to np.dtype(thing) when there's no extension type for them. So IntegerArray.astype(float) will return a NumPy array. And Series[Int64].astype(float) will return a Series[numpy.float64] (a Series backed by a float64-dtype NumPy array).

So if I understand you right, when [WhateverType]Array.astype(float) returns a numpy array, I would suggest the same behavior as above: raise an exception if the array has pd.NA in it.

But for Series.astype(newtype), the result is always a Series, so in this case we can preserve the pd.NA values independent of newtype. Right?

@jorisvandenbossche (Member, Author)

> So, IMHO, if someone uses Series.values and the result is a numpy array (ndarray)

All the cases we are talking about are new ExtensionArrays, and for those Series.values will actually always be the ExtensionArray itself. For conversion to numpy you need something like np.(as)array(..) or astype. So I don't think we need to discuss .values.
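
For illustration, on master:

In [1]: s = pd.Series([1, 2, None], dtype="Int64")

In [2]: s.values
Out[2]:
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64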

> But for Series.astype(newtype), the result is always a Series, so in this case we can preserve the pd.NA values independent of newtype. Right?

Yes, that is correct. But, as Tom said above, we don't have a "float newtype", so all discussions about converting to float are at this moment by definition about conversion to numpy (until we add a float ExtensionArray).

> In general, I'm in favor of raising exceptions whenever we try to convert pd.NA to something else, which forces users to understand that pd.NA is a different thing than None or np.nan, and then we provide new methods (or arguments, as appropriate) to force users to make a choice of how the conversion happens.

I think I am also in favor of this.

@jorisvandenbossche (Member, Author)

Rereading this again, it seems the conclusion is more or less that we want to raise for this (so no automatic conversion of pd.NA to np.nan).

I am wondering how that impacts projects like e.g. scikit-learn. They will then need to use to_numpy explicitly when they want to handle e.g. nullable integers (I should check if it actually works now; since they work on the dataframe level, you probably get an object array, which they then might try to convert to float).

@TomAugspurger (Contributor)

Or rather, the user would need to do this before handing the DataFrame off to a scikit-learn estimator?

But yes, I think the best thing for now is to not implicitly convert NA to NaN. And I don't think that np.asarray(integer_array, dtype="float") or integer_array.astype(float) is explicit enough to warrant converting NA to NaN (though this is a close call). So the following should probably raise:

In [13]: a = pd.array([1, None])

In [14]: a.astype("float")
Out[14]: array([ 1., nan])

In [15]: np.asarray(a, dtype="float")
Out[15]: array([ 1., nan])

@jorisvandenbossche (Member, Author)

I quickly checked how scikit-learn handles nullable integers with 0.25.3 and master.

With 0.25.3, this actually works by being converted to floats:

In [5]: df = pd.DataFrame({'a': [.1,.2,.3], 'b': pd.Series([1, 2, None], dtype='Int64')}) 

In [6]: from sklearn.preprocessing import StandardScaler

In [7]: scaler = StandardScaler()

In [8]: scaler.fit_transform(df)  
Out[8]: 
array([[-1.22474487e+00, -1.00000000e+00],
       [-3.39934989e-16,  1.00000000e+00],
       [ 1.22474487e+00,             nan]])

This works because the conversion to array gives an object-dtype array with np.nan, which can then be converted to float:

In [9]: np.array(df)  
Out[9]: 
array([[0.1, 1],
       [0.2, 2],
       [0.3, nan]], dtype=object)

In [10]: np.array(df, dtype=float) 
Out[10]: 
array([[0.1, 1. ],
       [0.2, 2. ],
       [0.3, nan]])

However, with master, you don't get an object array with np.nan, but with pd.NA, which cannot be converted to float. So you get an error:

In [4]: scaler.fit_transform(df)   
...
~/miniconda3/envs/dev/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    513                     array = array.astype(dtype, casting="unsafe", copy=False)
    514                 else:
--> 515                     array = np.asarray(array, order=order, dtype=dtype)
    516             except ComplexWarning:
    517                 raise ValueError("Complex data not supported\n"

~/miniconda3/envs/dev/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

TypeError: float() argument must be a string or a number, not 'NAType'

But this failure already happens on master due to the change of np.nan -> pd.NA, so it occurs regardless of the discussion here about being more strict or not in the conversion of IntegerArray to a float array.
(This is because it goes through DataFrame.__array__, which will always give object dtype if there are nullable integer columns. We could of course also change DataFrame.__array__ if we want to.)

@TomAugspurger (Contributor)

Thanks for checking. Do you think that np.asarray(integer_array, dtype="float") is explicit enough to convert NA to NaN? I'm going back and forth, but after talking to @adrinjalali I may be convinced that it is explicit enough.

If we think np.asarray(thing, dtype="float") is explicit enough, then we can implement that for the arrays / Series / DataFrame, and I think your example would work again.

@jorisvandenbossche (Member, Author)

I am also on the fence, but still slightly leaning towards raising.

Yes, for scikit-learn, converting pd.NA to np.nan in np.asarray(thing, dtype="float") is the desired behaviour, because scikit-learn uses np.nan as its missing value, and will basically not do much else with missing data than propagating it (in preprocessors), filling it (in imputers) or raising (in most models).
But in general, an array with np.nan can behave differently than an array with pd.NA; e.g. comparison operations will give different results.

Of course, that means that scikit-learn would need to adapt and explicitly implement support for nullable dtypes, instead of relying on the implicit conversion as is done now.

@TomAugspurger (Contributor)

Comparisons are probably the biggest difference between NA and NaN in an ndarray. Reductions are another one (NA rather than NaN for the result, and nansum raises for NA).
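
To make that concrete (outputs assume pd.NA semantics as implemented on master):

In [1]: np.nan == 1
Out[1]: False

In [2]: pd.NA == 1
Out[2]: <NA>

In [3]: np.array([1.0, np.nan]) == 1
Out[3]: array([ True, False])

In [4]: pd.array([1, pd.NA], dtype="Int64") == 1
Out[4]:
<BooleanArray>
[True, <NA>]
Length: 2, dtype: boolean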

Given that we're both on slightly different sides of the fence, raising on np.asarray(integer_array, dtype="float") when there are missing values is probably the safe choice for now.

@TomAugspurger (Contributor)

And by raising on .astype('float'), it's pretty awkward to go from Series[Int64] to Series[float64]:

In [6]: s = pd.Series(pd.array([1, 2, None]))

In [7]: s.astype("float")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-3b1bb0b634ef> in <module>
----> 1 s.astype("float")

~/sandbox/pandas/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5663         else:
   5664             # else, only a single dtype is given
-> 5665             new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
   5666             return self._constructor(new_data).__finalize__(self)
   5667

~/sandbox/pandas/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    581
    582     def astype(self, dtype, copy: bool = False, errors: str = "raise"):
--> 583         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    584
    585     def convert(self, **kwargs):

~/sandbox/pandas/pandas/core/internals/managers.py in apply(self, f, filter, **kwargs)
    441                 applied = b.apply(f, **kwargs)
    442             else:
--> 443                 applied = getattr(b, f)(**kwargs)
    444             result_blocks = _extend_blocks(applied, result_blocks)
    445

~/sandbox/pandas/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    585         if self.is_extension:
    586             # TODO: Should we try/except this astype?
--> 587             values = self.values.astype(dtype)
    588         else:
    589             if issubclass(dtype.type, str):

~/sandbox/pandas/pandas/core/arrays/integer.py in astype(self, dtype, copy)
    567
    568         # coerce
--> 569         data = self.to_numpy(dtype=dtype)
    570         return astype_nansafe(data, dtype, copy=False)
    571

~/sandbox/pandas/pandas/core/arrays/integer.py in to_numpy(self, dtype, copy, na_value)
    395             if not is_object_dtype(dtype) and na_value is libmissing.NA:
    396                 raise ValueError(
--> 397                     f"cannot convert to '{dtype}'-dtype NumPy array "
    398                     f"with missing values."
    399                 )

ValueError: cannot convert to 'float64'-dtype NumPy array with missing values.

I think you need something like

In [8]: pd.Series(s.to_numpy(dtype="float", na_value=np.nan), index=s.index, name=s.name)
Out[8]:
0    1.0
1    2.0
2    NaN
dtype: float64

@jorisvandenbossche (Member, Author)

Can we make an exception for astype for now (as long as we don't have a nullable float dtype)?
That of course gives an inconsistency between the different conversion paths, but astype is typically a pandas -> pandas conversion, while __array__/to_numpy is always a pandas -> numpy conversion, which will never be able to support pd.NA in a floating dtype.

@TomAugspurger (Contributor)

Mmm, we can certainly have IntegerArray.astype(float) call self.to_numpy(dtype=float, na_value=np.nan). That's a bit inconsistent, but at least more useful.
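
A rough sketch of that special-casing, assuming the masked layout and the to_numpy(dtype, na_value) signature discussed above (not the actual implementation):

import numpy as np

# method sketch; ignores extension-dtype targets for brevity
def astype(self, dtype, copy=True):
    # special-case numpy float targets so that IntegerArray.astype(float)
    # fills NA with NaN instead of raising
    dtype = np.dtype(dtype)
    if np.issubdtype(dtype, np.floating):
        return self.to_numpy(dtype=dtype, na_value=np.nan)
    # any other numpy target keeps the strict, raising behaviour
    return self.to_numpy(dtype=dtype)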

@jorisvandenbossche (Member, Author)

I see now that for BooleanArray.astype we actually already do it that way (special-casing float to substitute NAs with NaN).

@buhrmann

Hi, I was wondering whether it would be possible to consider the special case where an extension type doesn't actually have any missing values. E.g.

pd.Series([0,1,2], dtype="Int64").to_numpy()

returns

array([0, 1, 2], dtype=object)

when it may in many cases be more convenient if the numpy dtype was integer rather than object.

@TomAugspurger (Contributor)

We try to avoid value-dependent behavior where the metadata (shape, dtype, etc.) depend on the values of the array.
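
So the dtype stays object even when no values are missing; the explicit opt-in is to name the target dtype yourself (output as of pandas 1.x):

In [1]: pd.Series([0, 1, 2], dtype="Int64").to_numpy(dtype="int64")
Out[1]: array([0, 1, 2])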

@buhrmann

Makes sense too. I was thinking about internal use cases we have, like aggregating a series (possibly of extension type) with a function like unique, i.e. creating a list-like result for each group. It can get pretty messy at the moment to massage that into a consistent form (dtype) that can be serialized with Arrow, for example (sometimes you'll get a numpy array and sometimes an extension array, and converting that extension array to numpy depends on its specific dtype, etc.). But I understand you'll want to leave that to the user...

@jorisvandenbossche (Member, Author)

For your specific use case, note that extension arrays can be converted to Arrow (e.g. IntegerArray cleanly converts to an arrow int array), so it might be an option to leave the list-like results from unique as they are (ndarray or pandas extension array) until converting to arrow (whether this makes sense of course depends on the specifics of your use case).

It's a bit of a trade-off here between consistent behaviour and what is most practical for users. I agree with Tom that we want to avoid value-dependent behaviour (it is good to always know that a certain pandas dtype gets converted to a certain numpy dtype regardless of the content of the array); this is also the reason we chose this behaviour in the first place. On the other hand, for all cases where you don't actually have missing values, having it convert to a numpy int dtype instead of object dtype might make interoperability with numpy a lot easier.

@buhrmann commented Apr 15, 2020 via email
