
API: how to handle NA in conversion to numpy arrays #30038

Closed · jorisvandenbossche opened this issue Dec 4, 2019 · 20 comments · Fixed by #30792
Labels
API Design · Compat (pandas objects compatibility with NumPy or Python functions) · ExtensionArray (Extending pandas with custom dtypes or arrays) · Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)
Milestone
1.0

Comments

@jorisvandenbossche (Member)

In #29964 and #29961 (NA in IntegerArray and BooleanArray), the question comes up of how to handle pd.NA in conversions to numpy arrays.

Such conversion occurs mainly in __array__ (for np.(as)array(..)) and .astype(). For example:

In [3]: arr = pd.array([1, 2, pd.NA], dtype="Int64")  

In [4]: np.asarray(arr) 
Out[4]: array([1, 2, None/pd.NA/..?], dtype=object)

In [5]: arr.astype(float)  
Out[5]: array([ 1.,  2., nan])  # <--- allow automatic NA to NaN conversion?

Questions that come up here:

  • By default, when converting to object dtype, what "NA value" should be used? Before, this was NaN or None; now it could logically be pd.NA.
    A possible reason to choose None instead of pd.NA is that third-party code that needs a numpy array will typically not be able to handle pd.NA, while None is much more common. On the other hand, there is also still time for such third-party code to adapt. And it will probably be good to keep list(arr) (iteration/getitem) and np.array(arr, dtype=object) consistent.

  • When converting to a float dtype, are we fine with automatically converting pd.NA to np.nan? Or do we think the user should explicitly opt in to this?

We will probably want to add a to_numpy method to IntegerArray/BooleanArray to be able to make those choices explicit, e.g. with the following signature:

def to_numpy(self, dtype=object, na_value=...):
    ... 

where you can explicitly say which value to use for the NAs in the final numpy array (and Series.to_numpy can then forward such a keyword).
That way, a user can do arr.to_numpy(dtype=object, na_value=None) to get a numpy array with None instead of pd.NA, or arr.to_numpy(dtype=float, na_value=np.nan) to get a float array with NaNs.
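
For concreteness, this is how those calls would look with the arr from above, assuming the proposed na_value keyword behaves as described:

In [6]: arr.to_numpy(dtype=object, na_value=None)
Out[6]: array([1, 2, None], dtype=object)

In [7]: arr.to_numpy(dtype=float, na_value=np.nan)
Out[7]: array([ 1.,  2., nan])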

But even if we have that function (which I think we should), the above questions about the defaults still need to be answered (e.g. for __array__ we cannot have such an na_value keyword, so we need to make a default choice).

cc @TomAugspurger @Dr-Irv

@jorisvandenbossche added the API Design, Compat and ExtensionArray labels on Dec 4, 2019
@jorisvandenbossche added this to the 1.0 milestone on Dec 4, 2019
@TomAugspurger (Contributor)

+1 for a standard array.to_numpy(dtype=None, na_value=None) method. Perhaps we'll want to align this with Series.to_numpy, and make the signature .to_numpy(dtype=None, copy=False, na_value=None).

I'm really uncertain about np.asarray(arr, dtype=float) / array.astype(float). On the one hand, how annoying will it be to have np.asarray(arr) be essentially useless for some time, as projects won't really know what to do with it? On the other hand, it really is inconsistent to have .astype(float) imply NA -> NaN while not allowing float(NA) to mean NaN. So at this moment I'd probably vote for IntegerArray.astype(float) / np.asarray(integer_array, dtype=float) raising when there are NA values present, but I'm not sure.

@TomAugspurger added the Missing-data label on Dec 4, 2019
@Dr-Irv (Contributor) commented Dec 4, 2019

Great question @jorisvandenbossche. This all relates to the semantics of pd.NA vs. np.nan, and the legacy issue of pandas using np.nan to mean "missing value" independent of type.

I think there are a few cases to consider:

  1. Currently (pandas 0.25.x) pd.Series.values returns a numpy array, so that has to change, right? That then has two sub-options:
    a. It still returns a numpy array, in which case I don't see how you solve the conversion issue except by assuming pd.NA values go to NaN, or by raising an exception.
    b. It returns a different iterable object (a pandas array type), in which case we don't have to worry about it.
  2. How to convert any array to a numpy array. I like the array.to_numpy(dtype=None, na_value=None) option.
  3. For astype, we are using pandas types, so I think that pd.NA should survive across type conversions. So IntegerArray.astype(float) would preserve pd.NA from the integers over to the floats, since the result is still a Series.
  4. For np.asarray, raise an exception if any pd.NA values are present, which makes users aware that pd.NA doesn't convert to a representation that numpy supports, and that the workaround is to use the to_numpy method. I don't know if that's even possible to do; if not, we return the pd.NA values and numpy will make the dtype of the result object.

In general, I'm in favor of raising exceptions whenever we try to convert pd.NA to something else, which forces users to understand that pd.NA is a different thing than None or np.nan, and then we provide new methods (or arguments, as appropriate) to force users to make a choice of how the conversion happens.
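
A minimal sketch of what that raising behaviour could look like for a masked extension array; the _data / _mask attributes here are assumptions about the internal layout, not the actual pandas implementation:

import numpy as np

# method sketch on a hypothetical masked integer/boolean array class
def __array__(self, dtype=None):
    # refuse implicit conversion when missing values are present,
    # pointing users at the explicit to_numpy(na_value=...) escape hatch
    if self._mask.any():  # assumed boolean mask marking the NA positions
        raise ValueError(
            "cannot convert to a NumPy array with missing values; "
            "use .to_numpy(na_value=...) to specify a fill value"
        )
    return np.asarray(self._data, dtype=dtype)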

@TomAugspurger (Contributor)

Series.values can return an ndarray or an extension array (e.g. Categorical). I don't think we'll want to change the behavior of that anywhere. We have .array and .to_numpy() as alternatives, and .to_numpy() will grow an option to control the NA value.

> So IntegerArray.astype(float) would preserve pd.NA from the integers over to the floats, since the result is still a Series.

At the moment, aliases like float or 'float' fall back to np.dtype(thing) when there's no extension type for them. So IntegerArray.astype(float) will return a NumPy array. And Series[Int64].astype(float) will return a Series[numpy.float64] (a Series backed by a float64-dtype NumPy array).
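
That alias fallback can be checked directly with pandas_dtype (a public pandas API; outputs as of pandas 1.x):

In [1]: from pandas.api.types import pandas_dtype

In [2]: pandas_dtype(float)    # no extension dtype registered for this alias
Out[2]: dtype('float64')

In [3]: pandas_dtype("Int64")  # extension alias resolves to the nullable dtype
Out[3]: Int64Dtype()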

@Dr-Irv (Contributor) commented Dec 4, 2019

> Series.values can return an ndarray or an extension array (e.g. Categorical). I don't think we'll want to change the behavior of that anywhere. We have .array and .to_numpy() as alternatives, and .to_numpy() will grow an option to control the NA value.

So, IMHO, if someone uses Series.values and the result is a numpy array (ndarray), and the series has pd.NA in it, I think we should raise an exception, because we shouldn't make any assumptions about how pd.NA is stored in a numpy array.

> At the moment, aliases like float or 'float' fall back to np.dtype(thing) when there's no extension type for them. So IntegerArray.astype(float) will return a NumPy array. And Series[Int64].astype(float) will return a Series[numpy.float64] (a Series backed by a float64-dtype NumPy array).

So if I understand you right, when [WhateverType]Array.astype(float) returns a numpy array, I would suggest the same behavior as above: raise an exception if the array has pd.NA in it.

But for Series.astype(newtype), the result is always a Series, so in this case we can preserve the pd.NA values independent of newtype. Right?

@jorisvandenbossche (Member, Author)

> So, IMHO, if someone uses Series.values and the result is a numpy array (ndarray)

All the cases we are talking about are new ExtensionArrays, and for those Series.values will actually always be the ExtensionArray itself. For conversion to numpy you need something like np.(as)array(..) or astype. So I don't think we need to discuss .values.
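
For illustration, on master:

In [1]: s = pd.Series([1, 2, None], dtype="Int64")

In [2]: s.values
Out[2]:
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64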

> But for Series.astype(newtype), the result is always a Series, so in this case we can preserve the pd.NA values independent of newtype. Right?

Yes, that is correct. But, as Tom said above, we don't have a "float newtype", so all discussions about converting to float are at this moment by definition about conversion to numpy (until we add a float ExtensionArray).

> In general, I'm in favor of raising exceptions whenever we try to convert pd.NA to something else, which forces users to understand that pd.NA is a different thing than None or np.nan, and then we provide new methods (or arguments, as appropriate) to force users to make a choice of how the conversion happens.

I think I am also in favor of this.

@jorisvandenbossche (Member, Author)

Rereading this again, it seems the conclusion is more or less that we want to raise for this (so no automatic conversion of pd.NA to np.nan).

I am wondering how that impacts projects like e.g. scikit-learn. They will then need to use to_numpy explicitly when they want to handle e.g. nullable integers (I should check if it actually works now; since they work on the dataframe level, you probably get an object array, which they then might try to convert to float).

@TomAugspurger (Contributor)

Or rather, the user would need to do this before handing the DataFrame off to a scikit-learn estimator?

But yes, I think the best thing for now is to not implicitly convert NA to NaN. And I don't think that np.asarray(integer_array, dtype="float") or integer_array.astype(float) is explicit enough to warrant converting NA to NaN (though this is a close call). So the following should probably raise:

In [13]: a = pd.array([1, None])

In [14]: a.astype("float")
Out[14]: array([ 1., nan])

In [15]: np.asarray(a, dtype="float")
Out[15]: array([ 1., nan])

@jorisvandenbossche (Member, Author)

I quickly checked how scikit-learn handles nullable integers with 0.25.3 and master.

With 0.25.3, this actually works by being converted to floats:

In [5]: df = pd.DataFrame({'a': [.1,.2,.3], 'b': pd.Series([1, 2, None], dtype='Int64')}) 

In [6]: from sklearn.preprocessing import StandardScaler

In [7]: scaler = StandardScaler()

In [8]: scaler.fit_transform(df)  
Out[8]: 
array([[-1.22474487e+00, -1.00000000e+00],
       [-3.39934989e-16,  1.00000000e+00],
       [ 1.22474487e+00,             nan]])

This works because the conversion to array gives an object-dtype array with np.nan, which can then be converted to float:

In [9]: np.array(df)  
Out[9]: 
array([[0.1, 1],
       [0.2, 2],
       [0.3, nan]], dtype=object)

In [10]: np.array(df, dtype=float) 
Out[10]: 
array([[0.1, 1. ],
       [0.2, 2. ],
       [0.3, nan]])

However, with master, you don't get an object array with np.nan, but with pd.NA, which cannot be converted to float. So you get an error:

In [4]: scaler.fit_transform(df)   
...
~/miniconda3/envs/dev/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    513                     array = array.astype(dtype, casting="unsafe", copy=False)
    514                 else:
--> 515                     array = np.asarray(array, order=order, dtype=dtype)
    516             except ComplexWarning:
    517                 raise ValueError("Complex data not supported\n"

~/miniconda3/envs/dev/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

TypeError: float() argument must be a string or a number, not 'NAType'

But this failure already happens on master due to the change of np.nan -> pd.NA, so it occurs regardless of the discussion here about being more strict or not in the conversion of IntegerArray to a float array.
(This is because it goes through DataFrame.__array__, which will always give object dtype if there are nullable integer columns. We could of course also change DataFrame.__array__ if we want to.)

@TomAugspurger (Contributor)

Thanks for checking. Do you think that np.asarray(integer_array, dtype="float") is explicit enough to convert NA to NaN? I'm going back and forth, but after talking to @adrinjalali I may be convinced that it is explicit enough.

If we think np.asarray(thing, dtype="float") is explicit enough, then we can implement that for the arrays / Series / DataFrame, and I think your example would work again.

@jorisvandenbossche (Member, Author)

I am also on the fence, but still slightly leaning towards raising.

Yes, for scikit-learn, converting pd.NA to np.nan in np.asarray(thing, dtype="float") is the desired behaviour, because scikit-learn uses np.nan as its missing value, and will basically not do much else with missing data than propagating it (in preprocessors), filling it (in imputers) or raising (in most models).
But in general, an array with np.nan can behave differently than an array with pd.NA; e.g. comparison operations will give different results.

Of course, that means that scikit-learn would need to adapt and explicitly implement support for nullable dtypes, instead of relying on the implicit conversion as is done now.

@TomAugspurger (Contributor)

Comparisons are probably the biggest difference between NA and NaN in an ndarray. Reductions are another one (NA rather than NaN for the result, and nansum raises for NA).
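
To make that concrete (outputs assume pd.NA semantics as implemented on master):

In [1]: np.nan == 1
Out[1]: False

In [2]: pd.NA == 1
Out[2]: <NA>

In [3]: np.array([1.0, np.nan]) == 1
Out[3]: array([ True, False])

In [4]: pd.array([1, pd.NA], dtype="Int64") == 1
Out[4]:
<BooleanArray>
[True, <NA>]
Length: 2, dtype: boolean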

Given that we're both on slightly different sides of the fence, raising on np.asarray(integer_array, dtype="float") when there are missing values is probably the safe choice for now.

@TomAugspurger (Contributor)

And by raising on .astype('float'), it's pretty awkward to go from Series[Int64] to Series[float64]:

In [6]: s = pd.Series(pd.array([1, 2, None]))

In [7]: s.astype("float")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-3b1bb0b634ef> in <module>
----> 1 s.astype("float")

~/sandbox/pandas/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5663         else:
   5664             # else, only a single dtype is given
-> 5665             new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
   5666             return self._constructor(new_data).__finalize__(self)
   5667

~/sandbox/pandas/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    581
    582     def astype(self, dtype, copy: bool = False, errors: str = "raise"):
--> 583         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    584
    585     def convert(self, **kwargs):

~/sandbox/pandas/pandas/core/internals/managers.py in apply(self, f, filter, **kwargs)
    441                 applied = b.apply(f, **kwargs)
    442             else:
--> 443                 applied = getattr(b, f)(**kwargs)
    444             result_blocks = _extend_blocks(applied, result_blocks)
    445

~/sandbox/pandas/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    585         if self.is_extension:
    586             # TODO: Should we try/except this astype?
--> 587             values = self.values.astype(dtype)
    588         else:
    589             if issubclass(dtype.type, str):

~/sandbox/pandas/pandas/core/arrays/integer.py in astype(self, dtype, copy)
    567
    568         # coerce
--> 569         data = self.to_numpy(dtype=dtype)
    570         return astype_nansafe(data, dtype, copy=False)
    571

~/sandbox/pandas/pandas/core/arrays/integer.py in to_numpy(self, dtype, copy, na_value)
    395             if not is_object_dtype(dtype) and na_value is libmissing.NA:
    396                 raise ValueError(
--> 397                     f"cannot convert to '{dtype}'-dtype NumPy array "
    398                     f"with missing values."
    399                 )

ValueError: cannot convert to 'float64'-dtype NumPy array with missing values.

I think you need something like

In [8]: pd.Series(s.to_numpy(dtype="float", na_value=np.nan), index=s.index, name=s.name)
Out[8]:
0    1.0
1    2.0
2    NaN
dtype: float64

@jorisvandenbossche (Member, Author)

Can we make an exception for astype for now (as long as we don't have a nullable float dtype)?
That of course gives an inconsistency between the different conversion paths, but astype is typically a pandas -> pandas conversion, while __array__/to_numpy is always a pandas -> numpy conversion, which will never be able to support pd.NA in a floating dtype.

@TomAugspurger (Contributor)

Mmm, we can certainly have IntegerArray.astype(float) call self.to_numpy(dtype=float, na_value=np.nan). That's a bit inconsistent, but at least more useful.
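
A rough sketch of that special-casing, assuming the masked layout and the to_numpy(dtype, na_value) signature discussed above (not the actual implementation):

import numpy as np

# method sketch; ignores extension-dtype targets for brevity
def astype(self, dtype, copy=True):
    # special-case numpy float targets so that IntegerArray.astype(float)
    # fills NA with NaN instead of raising
    dtype = np.dtype(dtype)
    if np.issubdtype(dtype, np.floating):
        return self.to_numpy(dtype=dtype, na_value=np.nan)
    # any other numpy target keeps the strict, raising behaviour
    return self.to_numpy(dtype=dtype)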

@jorisvandenbossche (Member, Author)

I see now that for BooleanArray.astype we actually already do it that way (special-casing float to substitute NAs with NaN).

@buhrmann

Hi, I was wondering whether it would be possible to consider the special case where an extension type doesn't actually have any missing values. E.g.

pd.Series([0,1,2], dtype="Int64").to_numpy()

returns

array([0, 1, 2], dtype=object)

when it may in many cases be more convenient if the numpy dtype was integer rather than object.

@TomAugspurger (Contributor)

We try to avoid value-dependent behavior where the metadata (shape, dtype, etc.) depend on the values of the array.
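
So the dtype stays object even when no values are missing; the explicit opt-in is to name the target dtype yourself (output as of pandas 1.x):

In [1]: pd.Series([0, 1, 2], dtype="Int64").to_numpy(dtype="int64")
Out[1]: array([0, 1, 2])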

@buhrmann

Makes sense too. I was thinking about internal use cases we have, like aggregating a series (possibly of extension type) with a function like unique, i.e. creating a list-like result for each group. It can get pretty messy at the moment to massage that into a consistent form (dtype) that can be serialized with Arrow, for example (sometimes you'll get a numpy array and sometimes an extension array, and converting that extension array to numpy depends on its specific dtype, etc.). But I understand you'll want to leave that to the user...

@jorisvandenbossche (Member, Author)

For your specific use case, note that extension arrays can be converted to Arrow (e.g. IntegerArray cleanly converts to an arrow int array), so it might be an option to leave the list-like results from unique as they are (ndarray or pandas extension array) until converting to arrow (whether this makes sense of course depends on the specifics of your use case).

It's a bit of a trade-off here between consistent behaviour and what is most practical for users. I agree with Tom that we want to avoid value-dependent behaviour (it is good to always know that a certain pandas dtype gets converted to a certain numpy dtype regardless of the content of the array); this is also the reason we chose this behaviour in the first place. On the other hand, for all cases where you don't actually have missing values, having it convert to a numpy int dtype instead of object dtype might make interoperability with numpy a lot easier.

@buhrmann commented Apr 15, 2020 via email
