Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: ExtensionArray.astype to only return extension arrays ? #24877

Open
jorisvandenbossche opened this issue Jan 22, 2019 · 13 comments
Open

API: ExtensionArray.astype to only return extension arrays ? #24877

jorisvandenbossche opened this issue Jan 22, 2019 · 13 comments
Labels
API Design Astype Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action PDEP missing values Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint

Comments

@jorisvandenbossche
Copy link
Member

Related to the discussion happening in #24674 (comment), further xref #24773, #22384

The question being posed here is what the return type of ExtensionArray.astype(..) should be.

Currently, it can be both a numpy.ndarray or an ExtensionArray. And the astype method in the base class is actually advertised as "Cast to a NumPy array with 'dtype'" (this is not fully correct, as currently casting eg a DatetimeArray to period dtype will give a PeriodArray EA, not a numpy array).

However, this gives some discussion about what eg DatetimeArray.astype("datetime64[ns]") should return, as we actually have a DatetimeArray EA that supports that dtype.
You get similar inconsistencies / dubious cases in eg DatetimeArray.astype('int64') which currently returns an int64 ndarray, while pd.array(DatetimeArray(), dtype='int64') will give a PandasArray[int64].

So one idea would be to restrict ExtensionArray to be a method to convert from one type of ExtensionArray to another type of ExtensionArray, i.e. the output type would be expected to always be an ExtensionArray.
(similarly, in the end, as how ndarray.astype only gives other ndarrays, or pyarrow.Array.cast only gives other pyarrow arrays; there are other (more explicit) methods to convert to a numpy array)

In practice this would mean returning a PandasArray for numpy dtypes instead of a numpy ndarray. Regarding storing the result of it in a Series/DataFrame should not change, because then we unpack such a PandasArray anyway.

cc @TomAugspurger @jreback @jbrockmendel @shoyer @h-vetinari

@jorisvandenbossche jorisvandenbossche added API Design ExtensionArray Extending pandas with custom dtypes or arrays. labels Jan 22, 2019
@jorisvandenbossche jorisvandenbossche added this to the 0.24.0 milestone Jan 22, 2019
@TomAugspurger
Copy link
Contributor

This is worth considering. .astype is a surprisingly complex method, and anything we can do to limit its scope seems worthwhile. Reserving to_numpy() or __array__ for casting to a NumPy array seems best.

Regarding storing the result of it in a Series/DataFrame should not change, because then we unpack such a PandasArray anyway.

In theory :) In practice, see #24866 (though I'm not really happy with that PR right now).

@TomAugspurger
Copy link
Contributor

It's also worth spelling out whether we would change Categorical.astype (probably? Should be easy to deprecate).

SparseArray.astype does the "right" thing (returning an ExtensionArray) already .

@jreback
Copy link
Contributor

jreback commented Jan 22, 2019

is be in favor of this

@h-vetinari
Copy link
Contributor

h-vetinari commented Jan 22, 2019

As mentioned in the comment of #24674, I'd obviously be in favour of this. I like the consistency of calls to methods on pandas objects staying withing the pandas-universe, and getting to numpy-land needing an explicit .to_numpy().

It would also circumvent some numpy issues like numpy/numpy#12550, which I discovered while working on the still-WIP PR for #23833.

@jschendel
Copy link
Member

Would require some small adjustments to corner cases in IntervalArray.astype:

  • astype(object) currently returns an ndarray.
  • it's possible to astype a purely NA IntervalArray to a float ndarray.

@jorisvandenbossche
Copy link
Member Author

Regarding IntervalArray

astype(object) currently returns an ndarray.

Yes, that is currenlty the case for all Arrays. It would then return a PandasArray[object].

it's possible to astype a purely NA IntervalArray to a float ndarray.

That currently indeed works. But should we actually allow this? It feels strange to me to special case the all NA case, as in general, casting to float raises an error that it cannot cast IntervalArray to float64 (although I see in the code that this is actually not being special cases, but simply the result of doing np.asarray(self).astype(dtype), which makes sense.


It seems we have agreement on this? In that case, I think ideally we should do this for 0.24.0.

@jreback
Copy link
Contributor

jreback commented Jan 23, 2019

This might fine. But let's actually get 0.24.0 out the door. This is an 11-hour change.

@jreback jreback modified the milestones: 0.24.0, 0.25.0 Jan 23, 2019
@h-vetinari
Copy link
Contributor

We're again approaching the 11th hour of the release, but this issue should not be postponed again IMO.

@jreback
Copy link
Contributor

jreback commented Jun 23, 2019

@h-vetinari if there is a PR we could consider it

@jreback jreback modified the milestones: 0.25.0, 1.0 Jun 28, 2019
@TomAugspurger
Copy link
Contributor

This probably isn't a blocker for 1.0 right?

@TomAugspurger TomAugspurger modified the milestones: 1.0, Contributions Welcome Jan 6, 2020
@mroeschke mroeschke added Needs Discussion Requires discussion from core team before further action Enhancement labels Jun 25, 2021
@jorisvandenbossche
Copy link
Member Author

In light of my exploration of dtype casting (#22384), I am also looking again into this one. Started a draft PR at #43699

@jbrockmendel
Copy link
Member

I'm not wild about this.

  1. If a user passes dtype=np.dtype(np.int64), they can reasonably expect to get that dtype back, not PandasDtype("int64")

  2. This would mean we end up with PandasArray in a lot more places internally that we then need to unpack, which is pure overhead.

  3. If we implement this on EA, what happens with Series.astype

ser = pd.Series(range(5), dtype="Int64")
ser2 = ser.astype(np.int64)

So now Series.astype (or something else in the call stack) has to do something like

if isinstance(self._values, ExtensionArray) and isinstance(dtype, np.dtype):
    res = self._values.to_numpy(dtype)
else:
    res = self._values.astype(dtype)

which isn't that onerous, but de facto we have to use something like this pattern everywhere that we currently do a .astype while working with ArrayLike values.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2021

agree with @jbrockmendel here

i don't see it feasible to return PandasArray ever

i don't see why you would always return only EA when we have perfectly usable numpy dtypes

@jbrockmendel jbrockmendel added the PDEP missing values Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint label Dec 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Astype Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action PDEP missing values Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants