Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: ExtensionArrays and conversion to "native" types (eg in tolist, to_dict, iteration, ..) #29738

Open
jorisvandenbossche opened this issue Nov 20, 2019 · 18 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays.

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Nov 20, 2019

We try to consistently return python objects (instead of numpy scalars) in certain functions like tolist, to_dict, itertuples/items, .. (we have had quite some issues fixing this in several cases).

However, currently we don't do that for extension dtypes (and don't have any mechanism to ask for this):

In [33]: type(pd.Series([1, 2], dtype='int64').tolist()[0]) 
Out[33]: int

In [34]: type(pd.Series([1, 2], dtype='Int64').tolist()[0])  
Out[34]: numpy.int64

In [36]: type(pd.Series([1, 2], dtype='int64').to_dict()[0]) 
Out[36]: int

In [37]: type(pd.Series([1, 2], dtype='Int64').to_dict()[0])
Out[37]: numpy.int64
In [45]: s = pd.Series([1, 2], dtype='int64') 

In [46]: type(list(s.iteritems())[0][1])  
Out[46]: int

In [47]: s = pd.Series([1, 2], dtype='Int64')      

In [48]: type(list(s.iteritems())[0][1])  
Out[48]: numpy.int64

Should we add some API to ExtensionArray to provide this? Eg a method to iterate through the elements that returns "native" objects?

@jorisvandenbossche jorisvandenbossche added the ExtensionArray Extending pandas with custom dtypes or arrays. label Nov 20, 2019
@jorisvandenbossche
Copy link
Member Author

Actually, for Series, the __iter__ also returns native types, so maybe fixing that is enough: ensure that __iter__ on IntegerArray etc do return the python objects (so by not plainly using __getitem__, which correctly returns numpy scalars):

In [57]: s = pd.Series([1, 2], dtype='int64')  

In [58]: type(s[0])  
Out[58]: numpy.int64

In [59]: type(list(iter(s))[0])  
Out[59]: int

In [60]: s = pd.Series([1, 2], dtype='Int64')         

In [61]: type(s[0])    
Out[61]: numpy.int64

In [62]: type(list(iter(s))[0])  
Out[62]: numpy.int64   # <-- fixing this might fix the other cases?

@jorisvandenbossche jorisvandenbossche added this to the Contributions Welcome milestone Nov 20, 2019
@jorisvandenbossche
Copy link
Member Author

Let's consider this a bug in __iter__ then, because I think all mentioned cases can be solved with (for IntegerArray):

--- a/pandas/core/arrays/integer.py
+++ b/pandas/core/arrays/integer.py
@@ -456,7 +456,7 @@ class IntegerArray(ExtensionArray, ExtensionOpsMixin):
             if self._mask[i]:
                 yield self.dtype.na_value
             else:
-                yield self._data[i]
+                yield self._data[i].item()

@marco-neumann-by
Copy link

So ExtensionArray.__iter__ should return native types then? Is this also true for ExtensionArray.__getitem__ with some integer? I think it should at least be documented then so that other (external) implementations can get this right.

@jorisvandenbossche
Copy link
Member Author

Is this also true for ExtensionArray.getitem with some integer?

If we mimic what Series with plain numpy dtype does, then getitem should keep returning the numpy scalar.

@victorbr92
Copy link

Hi, @marco-neumann-jdas and @jorisvandenbossche, are you guys working on this issue ? Could I help somehow and make a PR for it (it was marked as a good first issue).
I tested the changes proposed by @jorisvandenbossche locally and it fixes the reported issue indeed.

Brosinc132 added a commit to Brosinc132/pandas that referenced this issue Dec 10, 2019
…ze the types on iteratively going through the collection to cast to the appropriate type. In addition added a basic test case test_native_calls_types to assert that the change works
@marco-neumann-by
Copy link

I am not working on it.
Keep in mind that this issue is not only about fixing the behavior of IntegerArray but also to adjust the docs of ExtensionArray.__iter__ to state that EVERY EA should return Python-native types.

@mroeschke
Copy link
Member

Similarly, should pd.NA (ExtensionDtype.na_value) be converted to None as the native equivalent?

In [1]: pd.Series([1, 2, None], dtype="Int64").tolist()
Out[1]: [1, 2, <NA>] # should <NA> be None?

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@Peque
Copy link
Contributor

Peque commented Dec 11, 2022

Stumbled upon this issue when trying to serialize a DataFrame resulted from .isocalendar():

import json

from pandas import date_range

df = date_range("2021-01-01", freq="D", periods=7).isocalendar()
json.dumps(df.to_dict(orient="list"))
# TypeError: Object of type uint32 is not JSON serializable

Of course, this can be reduced to an example like those presented above by @jorisvandenbossche:

type(pd.Series([1, 2], dtype='UInt32').tolist()[0])
# numpy.int32

Just leaving this comment in case someone looks for "isocalendar" or "not JSON serializable".

@lukemanley
Copy link
Member

@Peque - your example cases have been fixed on the main branch and will be included in 2.0. On main, your first example works without raising and your second example returns int.

@lukemanley
Copy link
Member

Similarly, should pd.NA (ExtensionDtype.na_value) be converted to None as the native equivalent?

@mroeschke - was there ever an answer to this? pd.NA is still being returned

@Peque
Copy link
Contributor

Peque commented Dec 11, 2022

@lukemanley Great to know and thanks for sharing! 😊

@mroeschke
Copy link
Member

@mroeschke - was there ever an answer to this? pd.NA is still being returned

Personally I think it makes sense to convert pd.NA to None as the Python native type. @jorisvandenbossche might have thoughts on this as well

@lukemanley
Copy link
Member

#50796 changed Series.to_dict to return None in place of pd.NA which seems to be the preferred behavior as it is consistent with returning native types. I took a look at doing the same for tolist and __iter__.

I'll note that changing the value for __iter__ would result in the following list comprehension to start raising:

arr = pd.array([1, 2, pd.NA], dtype="Int64")

[v+1 for v in arr]  # <- TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

I'm curious if that is too big of a change? Since series.tolist() == list(series) is tested behavior I suspect tolist and __iter__ need to remain consistent with each other.

Any suggestions for moving this forward from here? Its probably not ideal that to_dict and tolist are inconsistent with pd.NA at the moment.

cc @phofl @mroeschke

@lukemanley
Copy link
Member

One more comment here. I think there may be a case for making tolist return None in place of pd.NA but not __iter__. This would make tolist consistent with to_dict. It would also be consistent with how numpy and pyarrow operate. Both numpy and pyarrow have .tolist methods that return python native types and __iter__ methods that return non-native types when iterating through an array.

If we were to take this approach, the currently tested series.tolist() == list(series) behavior would change.

Example with pyarrow:

In [1]: arr = pa.array([1, None])

In [2]: arr.tolist()
Out[2]: [1, None]

In [3]: list(arr)
Out[3]: [<pyarrow.Int64Scalar: 1>, <pyarrow.Int64Scalar: None>]

@mroeschke
Copy link
Member

My initial feeling is that both __iter__ and list should both return python types with justification that if a user calls list I think they would expect the elements to behave like native Python objects

@lukemanley
Copy link
Member

My initial feeling is that both iter and list should both return python types with justification that if a user calls list I think they would expect the elements to behave like native Python objects

Sorry for all the questions.

Just want to confirm that you're talking about both Series.__iter__ and EA.__iter__ (e.g. BaseMaskedArray). I ask because with non-EA, there is a native/non-native type difference between Series.__iter__ and Series.values.__iter__:

import pandas as pd

ser  = pd.Series([1])

for v in ser:
    print(type(v))      # -> <class 'int'>

for v in ser.values:
    print(type(v))      # -> <class 'numpy.int64'>

@mroeschke
Copy link
Member

Series.values is already a numpy array, so I think there's an understanding that np.array.__iter__ is not yield similar Python native types as Series.__iter__, so yes I would expect Series.__iter__ and EA.__iter__ to align.

Note I think Series[datetime64/timedelta64] is the only time where we return pandas objects which make sense due to Timestamp and Timedelta objects holding more resolution that datetime.datetime, `datetime.timedelta.

@lukemanley
Copy link
Member

Thanks. One concern with replacing pd.NA with None in __iter__ is that code like this will start to break:

import pandas as pd

idx = pd.Index([1, 2, pd.NA], dtype="Int64")
ser = pd.Series(1, index=idx)

for v in ser.index:
    print(ser[v])       # -> KeyError: None

or simply:

import pandas as pd

idx = pd.Index([1, 2, pd.NA], dtype="Int64")

[v in idx for v in idx]    # -> [True, True, False]

e.g. test_array_iterface

import pandas as pd
import numpy as np

arr = pd.array([1, 2, pd.NA], dtype="Int64")

np.array(arr) == np.array(list(arr))    # -> [True, True False]

The test suite alone has hundreds of failures when replacing pd.NA with None in __iter__. If the test suite is indicative of what users may experience, I suspect this would be a big change and maybe not a desirable one given the examples above. Might there be a case to return native types for non-na values but still return pd.NA for missing values? That happens to be the behavior of ArrowExtensionArray at the moment:

import pandas as pd

arr = pd.array([1, pd.NA], dtype='int64[pyarrow]')

for v in arr:
    print(type(v))

# <class 'int'>
# <class 'pandas._libs.missing.NAType'>

If you still think __iter__ should return None I will go through the test suite and see how much actually needs to change to get everything to pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
8 participants