API: pd.array(index_or_series[object]) should infer like Series and Index constructors #39117

Closed
jbrockmendel opened this issue Jan 11, 2021 · 12 comments
Labels
API - Consistency Internal Consistency of API/Behavior API Design Constructors Series/DataFrame/Index/pd.array Constructors Needs Discussion Requires discussion from core team before further action

Comments

@jbrockmendel
Member

No description provided.

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 11, 2021
@TomAugspurger
Contributor

What's the motivation here? The docs in https://pandas.pydata.org/docs/reference/api/pandas.array.html state that dtype is optional, and

> If not specified, there are two possibilities:
>
> 1. When data is a Series, Index, or ExtensionArray, the dtype will be taken from the data.
> 2. Otherwise, pandas will attempt to infer the dtype from the data.

I'd prefer to avoid special casing object-dtype here, unless we have a compelling reason to (especially since object-dtype should become less common now that we have more extension types).
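
For illustration, a minimal sketch of the two documented cases quoted above. The variable names are made up for this example, and the dtypes in the comments are what recent pandas releases are expected to produce; other versions may differ.

```python
import pandas as pd

values = [1, 2, 3]

# Case 1: a Series (or Index / ExtensionArray) passes its own dtype through.
# The Series built from a list of Python ints has NumPy dtype int64.
from_series = pd.array(pd.Series(values))

# Case 2: anything else (a plain list here) goes through pandas' inference,
# which prefers the nullable extension dtypes.
from_list = pd.array(values)

print(from_series.dtype)  # int64
print(from_list.dtype)    # Int64
```

The same values therefore come back with different dtypes depending only on the container they arrive in.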

@jbrockmendel
Member Author

> What's the motivation here?

To get pd.array behavior to match Index and Series xref #27460

> I'd prefer to avoid special casing object-dtype here

I'm on board with the sentiment, but point 1 that you quoted means we're special-casing ndarray vs EA/Index/Series. Among other things, this means that pd.array(obj) and pd.array(extract_array(obj, extract_numpy=True)) behave differently.
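
A rough sketch of that difference, assuming `extract_array` is importable from `pandas.core.construction` (a private module, so the import path is an assumption that may not hold in every pandas version) and using a made-up object-dtype Series as `obj`:

```python
import pandas as pd

# Private helper: not part of the public API, location may change.
from pandas.core.construction import extract_array

obj = pd.Series([1, 2, 3], dtype=object)

# Passing the Series directly: its object dtype is taken as-is.
direct = pd.array(obj)

# Unwrapping to the underlying ndarray first: pd.array now runs inference.
unwrapped = pd.array(extract_array(obj, extract_numpy=True))

print(direct.dtype)
print(unwrapped.dtype)
```

On the versions this was reasoned against, the first result stays object dtype while the second is inferred as a nullable integer dtype.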

@jorisvandenbossche
Member

> To get pd.array behavior to match Index and Series xref #27460

It's not possible for pd.array to match Index/Series inference behaviour for now, because it's explicitly meant to infer the nullable dtypes by default.
Rather, the goal is that eventually Index/Series behaviour will match pd.array.

@jorisvandenbossche
Member

> we're special-casing ndarray vs EA/Index/Series

Although unfortunate, I think that's inevitable because ndarray cannot hold all the data types that the others can hold. Currently, e.g., nullable integer, tz-aware datetime64, or period data become an ndarray without a proper dtype when converted. So for those I think it makes sense to do more inference on ndarrays.
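
As a small sketch of that point, the snippet below converts a period array and a tz-aware datetime array to ndarray and back. The sample data and the `round_trips` list are invented for this example; nullable integers are left out because their default ndarray conversion is not guaranteed to be the same across pandas versions.

```python
import numpy as np
import pandas as pd

periods = pd.period_range("2020-01", periods=2, freq="M").array
tz_aware = pd.date_range("2020-01-01", periods=2, tz="UTC").array

round_trips = []
for ea in (periods, tz_aware):
    # NumPy has no matching dtype, so the data lands in an object ndarray.
    as_ndarray = np.asarray(ea)
    # Inference on the bare ndarray is what recovers an extension dtype.
    recovered = pd.array(as_ndarray)
    round_trips.append((ea, as_ndarray, recovered))
    print(ea.dtype, "->", as_ndarray.dtype, "->", recovered.dtype)
```

Without inference on the ndarray, both arrays would come back as plain object data.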

@jbrockmendel
Member Author

> Although unfortunate, I think that's inevitable because ndarray cannot hold all the data types that the others can hold. Currently eg nullable integer, or tz-aware datetime64, or period, etc, become an ndarray without proper dtype. So for those I think it makes sense to do more inference on ndarrays.

There's a step in the logic here I don't understand. EA can hold dtypes that ndarray cannot, but this is about dtypes that both can hold.

@jorisvandenbossche
Member

@jbrockmendel can you give some practical code examples? I think that would help a lot to clear up the misunderstanding/confusion (eg I don't know which dtypes you are speaking about)

@jbrockmendel
Member Author

> can you give some practical code examples?

The motivation comes from DatetimeLikeArrayMixin._validate_listlike, which uses pd.array for inference and is under the hood of a bunch of methods.

dti = pd.date_range("2016-01-01", periods=3)
dta = dti._data

dta._validate_listlike(dta.astype(object))  # <- works, as dta.astype(object) is ndarray
dta._validate_listlike(dti.astype(object))  # <- raises TypeError

Of course, "don't use pd.array for this" is also a viable approach.

@jbrockmendel
Member Author

> eg I don't know which dtypes you are speaking about

I am only talking about object dtype, for which we do lib.infer_dtype with ndarray but not Index/Series/PandasArray.
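
To make that concrete, here is a hedged sketch using `pandas.api.types.infer_dtype`, the public export of `lib.infer_dtype`. The sample array is invented for illustration, and the claim that `pd.array` consults this function internally comes from the comment above rather than from anything checked here.

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype  # public export of lib.infer_dtype

arr = np.array([pd.Timestamp("2020-01-01")], dtype=object)
ser = pd.Series(arr, dtype=object)

# The values are classified the same way in either container...
print(infer_dtype(arr), infer_dtype(ser))

# ...but pd.array only acts on that classification for the bare ndarray;
# for the Series it takes the object dtype from the data.
print(type(pd.array(arr)).__name__, pd.array(arr).dtype)
print(type(pd.array(ser)).__name__, pd.array(ser).dtype)
```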

@jorisvandenbossche
Member

Can you make the example even more concrete?

In the title you mention Index or Series of object dtype, if I understand correctly. With a quick test, I don't see different type inference between Series and array constructor for such input:

In [1]: s = pd.Series([1, 2, 3], dtype=object)

In [2]: pd.Series(s)
Out[2]: 
0    1
1    2
2    3
dtype: object

In [3]: pd.array(s)
Out[3]: 
<PandasArray>
[1, 2, 3]
Length: 3, dtype: object

Both preserve the object dtype when passed a Series?

It's actually when passed an ndarray that the two infer differently:

In [4]: pd.Series(np.asarray(s))
Out[4]: 
0    1
1    2
2    3
dtype: object

In [5]: pd.array(np.asarray(s))
Out[5]: 
<IntegerArray>
[1, 2, 3]
Length: 3, dtype: Int64

@jorisvandenbossche
Member

Your example was using datetimes, and also for that I don't see a difference in behaviour for Series vs array:

In [13]: arr = np.array([pd.Timestamp("2020-01-01")], dtype=object)

In [14]: s = pd.Series(arr, dtype=object)

# both Series and array infer for an object-dtype np.ndarray
In [15]: pd.Series(arr)
Out[15]: 
0   2020-01-01
dtype: datetime64[ns]

In [16]: pd.array(arr)
Out[16]: 
<DatetimeArray>
['2020-01-01 00:00:00']
Length: 1, dtype: datetime64[ns]

# and both Series and array do not infer for an object-dtype Series
In [17]: pd.Series(s)
Out[17]: 
0    2020-01-01 00:00:00
dtype: object

In [18]: pd.array(s)
Out[18]: 
<PandasArray>
[Timestamp('2020-01-01 00:00:00')]
Length: 1, dtype: object

but there is actually one for Index (both for the Index constructor and when passing an Index object to the Series constructor):

# the Index constructor infers both for ndarray and Series
In [23]: pd.Index(s)
Out[23]: DatetimeIndex(['2020-01-01'], dtype='datetime64[ns]', freq=None)

In [24]: pd.Index(arr)
Out[24]: DatetimeIndex(['2020-01-01'], dtype='datetime64[ns]', freq=None)

# and passing an Index to the Series constructor also infers (in contrast to passing a Series)
In [19]: idx = pd.Index(arr, dtype=object)

In [20]: idx
Out[20]: Index([2020-01-01 00:00:00], dtype='object')

In [21]: pd.Series(idx)
Out[21]: 
0   2020-01-01
dtype: datetime64[ns]

In [22]: pd.array(idx)
Out[22]: 
<PandasArray>
[Timestamp('2020-01-01 00:00:00')]
Length: 1, dtype: object

@jorisvandenbossche
Member

Additional example: Series constructor does not infer the type when the object-dtype Index holds integers (so in contrast with the example above of an object-dtype Index with timestamps), but the Index constructor does also infer in that case:

In [6]: s = pd.Series([1, 2, 3], dtype=object)

In [7]: idx = pd.Index(s, dtype=object)

In [8]: pd.Series(idx)
Out[8]: 
0    1
1    2
2    3
dtype: object

In [9]: pd.Index(idx)
Out[9]: Int64Index([1, 2, 3], dtype='int64')
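
The comparisons above can be rerun in one go with a small helper like the following. This is only a sketch: `constructor_dtypes` is a hypothetical name, and no expected output is shown because the Series and Index results depend on the installed pandas version (some versions may also emit deprecation warnings here).

```python
import numpy as np
import pandas as pd


def constructor_dtypes(values):
    """Return {container: {constructor: resulting dtype}} for object-dtype input."""
    containers = {
        "ndarray": np.array(values, dtype=object),
        "Series": pd.Series(values, dtype=object),
        "Index": pd.Index(values, dtype=object),
    }
    constructors = {"pd.array": pd.array, "pd.Series": pd.Series, "pd.Index": pd.Index}
    return {
        cname: {fname: str(func(data).dtype) for fname, func in constructors.items()}
        for cname, data in containers.items()
    }


samples = [("ints", [1, 2, 3]), ("timestamps", [pd.Timestamp("2020-01-01")])]
for label, values in samples:
    print(label)
    # Rows are the input containers, columns are the constructors.
    print(pd.DataFrame(constructor_dtypes(values)).T)
```

Any cell that differs within a row marks a constructor that infers differently from the others for that input container.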

@simonjayhawkins simonjayhawkins added API - Consistency Internal Consistency of API/Behavior API Design Needs Discussion Requires discussion from core team before further action and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 19, 2021
@jbrockmendel jbrockmendel added the Constructors Series/DataFrame/Index/pd.array Constructors label Nov 24, 2022
@jbrockmendel
Member Author

I can't figure out what past-me had in mind here. Best guess is it involved some of the now-deprecated-and-removed string inference that used to be done in the Series constructor. Closing.


4 participants