API: astype mechanism for extension arrays #22384
To be clear, there are a couple of levels we could take on here. At a minimum, I think … |
Let's try to summarise the (ideal) requirements:
Does that sound right? In the PR I mentioned the idea of a "negotiation protocol" (#22343 (comment)) similar to how operators work in Python:
I used … here. The other concern raised is the divergence between … However, I think it should be possible (if the EA author wants that) to make this more consistent, and we could implement a base class … Another possibility would be that we provide some helper function for EA authors (and for internal use) that basically does the above, but simply inside the … For example, the … with …

But that then counts on the EAs actually doing this. |
Sorry I missed your comment earlier, Joris. Your list of ideal requirements sounds correct. Haven't thought much about the implementation yet. Regardless, not something we can solve for 1.0, so pushing. |
@jorisvandenbossche Based on something I wrote here (#31204 (comment)) with respect to … You wrote:
I think we ought to consider doing the reverse, namely: when doing an astype operation (e.g. Series(EA).astype(dtype)), ask the target dtype whether it knows how to convert the values to its own dtype; if it knows, it returns a new EA/array, otherwise it returns NotImplemented.
In the case of … As I wrote there, I would think that the behavior of … |
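(To illustrate the negotiation protocol being discussed: a minimal sketch; the _from_other and _to_other hooks are hypothetical names, not an existing pandas API.)

def astype(arr, dtype, **kwargs):
    # mirror Python's binary-operator protocol: the target dtype gets
    # the first chance to implement the cast, the source dtype the second
    result = dtype._from_other(arr, **kwargs)    # hypothetical hook
    if result is NotImplemented:
        result = arr._to_other(dtype, **kwargs)  # hypothetical hook
    if result is NotImplemented:
        raise TypeError(f"cannot cast from {arr.dtype} to {dtype}")
    return result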
Just a few comments on this; not sure if they're helpful.

Input parameters

Should these arguments be allowed with … ?

Return
import pandas as pd
series = pd.Series(["a", "c", "b", "c", "a", "b", "a", "c"])
explicit = pd.Categorical(series)     # returns a Categorical array
implicit = series.astype("category")  # returns a Series with categorical dtype
Consistency with …
|
You would astype to a CategoricalDtype, not Categorical. |
Ok, I got confused with … |
@jorisvandenbossche some suggestions to add to the list of desired behavior:
|
I see this as exactly a use case of … Is there a viable option to punt on this before we go around in circles? |
We can at least move it to a separate discussion. Will open a new issue or PR for this. |
An overview of some dtype casting combinations, and the kind / safety of casting that they could allow:
|
General observation: it might actually be sufficient to consider only two kinds of casting: supported and unsupported casts (i.e. casts that always raise a TypeError). And for the supported casts, we could always require a safe cast by default (i.e. value preserving), with the option to turn off checks to get an unsafe cast.

If you look in the overview above at the casts that can work in some cases, numpy considers some of them safe and some same_kind or unsafe. But in practice, all of those cases can be equality-preserving (depending on the actual values). So from pandas' point of view, I would not make this safe/unsafe distinction up-front (based on the dtypes), but only handle runtime errors (out-of-bound values, values that would overflow or get truncated, etc.). |
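(A minimal sketch of that "safe by default, checked at runtime" idea, using plain numpy arrays; checked_cast is a hypothetical helper, not a pandas API.)

import numpy as np

def checked_cast(values: np.ndarray, dtype) -> np.ndarray:
    # perform the cast, then verify it was value-preserving by
    # casting back and comparing to the original values
    result = values.astype(dtype)
    if not np.array_equal(result.astype(values.dtype), values):
        raise ValueError("cast is not value-preserving for these values")
    return result

checked_cast(np.array([1.0, 2.0]), np.int64)  # fine: values round-trip
checked_cast(np.array([1.5, 2.0]), np.int64)  # raises: 1.5 would be truncated

Note that the check depends on the actual values, not the dtypes: float64 -> int64 is fine for 1.0 but not for 1.5.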
I think we should agree on a couple of principles, and work from those:
Using these rules, especially 2, I think I agree with Joris' point:
|
@seberg I'm curious if any of the types of safety discussed here are going to be handled differently in the new numpy casting stuff you're working on
I agree with this, and I would like to bring up what I wrote at #22384 (comment), since I think it is related to the current discussion: we have to decide where the …

s1 = pd.Series(somelist, dtype=dtype1)
s2 = s1.astype(dtype2)

where … I believe that in pandas today, the following happens (please correct me if I'm wrong): …

I think the second behavior should change. When … I believe that if we think of it this way, we accomplish the goal of making … This then lets us separate the discussion into questions about how each receiving type should convert objects of different types, based on the receiving type. Hope this makes sense. |
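(If I understand the "receiving dtype converts" view correctly, it implies these two spellings should be equivalent; a small sketch using the nullable Float64 dtype, stating the proposal's intent rather than guaranteed current behavior in all cases.)

import pandas as pd

s1 = pd.Series([1, 2, 3], dtype="int64")
via_astype = s1.astype("Float64")
via_constructor = pd.Series(s1.to_list(), dtype="Float64")
assert via_astype.equals(via_constructor)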
The casting safety is something I have not really attempted to fix/attack. The one general comment that I have (which is probably either useless or enlightening) is that I consider "common dtype" and casting safety as completely unrelated. The other one is that "safe" could also mean safe from a type perspective (i.e. …).

From a user perspective, I also think that an "attempt" casting safety would make sense (I guess your …). I designed parts of the new API in a way that we could add new casting safety levels at any time, but attempting a cast that may be safe is a bit different, since it requires propagating the information to the actual casting function. It is possible to grow this in NumPy, but we have to design new API!
This one is a tricky one :(. NumPy force-coerces (effectively force casts) for …

(Not sure these notes actually help; basically, I have some thoughts about it, but few API decisions. And I think NumPy should try to move a bit eventually, but I am not exactly sure how/where.)

EDIT: sorry, the comment about common dtype is esoteric, probably just ignore it... It has some reason in: … |
Maybe I should just say what I do for NumPy's normal casting: I return the casting level from its implementation. I.e. a cast is described by an object:
So the … (The distinct … |
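(My loose reading of the NEP 42 design linked later in this thread, condensed into a Python sketch; the names and the reduced signature are approximations, not numpy's actual C-level API.)

from enum import IntEnum

class Casting(IntEnum):
    # ordered so a stricter level has a smaller value, mirroring
    # numpy's equiv < safe < same_kind < unsafe ordering
    equiv = 0
    safe = 1
    same_kind = 2
    unsafe = 3

class Float64ToInt64Cast:
    def resolve_descriptors(self, dtypes):
        # the cast implementation itself reports the safety level along
        # with the resolved dtypes, instead of a global safety table
        dtype_from, dtype_to = dtypes
        return Casting.unsafe, (dtype_from, dtype_to)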
A suggestion for the list of principles: |
Per #44117, please consider adding a keyword argument, e.g. errors="coerce". Currently, the following raises a ValueError:

>>> a = pd.Series(['a', 1, 3, ''], dtype=np.int)
ValueError: invalid literal for int() with base 10: 'a'

However, it would be nice to have (something like) the following capability:

>>> a = pd.Series(['a', 1, 3, '']).astype(np.int, errors="coerce")
>>> a
0    NaN
1      1
2      3
3    NaN
dtype: int32

Such an enhancement would centralize functionality that currently exists in the multiple … (ASIDE: … |
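(For reference: today the closest spelling of that behavior goes through one of the specialized converters, which this proposal would centralize in astype.)

import pandas as pd

s = pd.Series(['a', 1, 3, ''])
pd.to_numeric(s, errors="coerce")
# 0    NaN
# 1    1.0
# 2    3.0
# 3    NaN
# dtype: float64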
xref #45151 |
Re-reading this thread, I think it can be split into two more-or-less distinct issues:
For 1), ATM the base class EA.astype dispatches to _from_sequence, so de facto I think we've landed on a Reasonable Default (tm). Is there still a need for something like a negotiation protocol? I move that we declare this problem solved until a compelling use case is spotted.

For 2), I'm increasingly skeptical of a one-size-fits-all keyword argument with a fixed set of options. I'm wavering between a) "one-size-fits-most to handle the most common 90%", b) "something customizable", and c) "the status quo isn't so bad". |
So for the second point:
I opened a new, dedicated issue (-> #45588), so we can further focus here on the first point (the potential dispatch mechanism).
@adamgranthendry yes, we have always had a bit of a divide between "specialized" conversion functions (e.g. …
On the actual topic of a "mechanism" to dispatch to the source or target dtype to perform the actual cast, I made some exploratory code mock-ups last week, to think through what a potential mechanism would look like and what the advantages / disadvantages would be.

Numpy-like CastImplementation registry

First, I tried to make a mockup of a mechanism that mimics numpy (https://numpy.org/neps/nep-0042-new-dtypes.html#the-cast-operation), but a little bit simplified. It keeps the idea of a …

The abstract API for a base class and the registry:

class CastImplementation:
    def resolve_dtypes(
        self, dtypes: Tuple[ExtensionDtype, ExtensionDtype], **kwargs
    ) -> Tuple[ExtensionDtype, ExtensionDtype]:
        """given the from/to dtypes, for which dtypes is the cast implemented"""
        pass

    def do_cast(
        self, array: ExtensionArray, dtype: ExtensionDtype, **kwargs
    ) -> ExtensionArray:
        """does the actual cast"""
        pass
cast_registry = {}

def register_cast_impl(dtypes, cast_impl: CastImplementation | None):
    """Register a cast implementation for a (source, target) dtype combination.

    Register as `None` to explicitly mark it as an unsupported cast.
    """
    if dtypes in cast_registry:
        raise ValueError("cast already registered")
    cast_registry[dtypes] = cast_impl

def get_cast_impl(dtypes) -> CastImplementation | None:
    return cast_registry.get(dtypes, DefaultFallBackCast)

(the DefaultFallBackCast …)

Dtypes can then implement and register those; here, as a simple example, a subset of what would be registered for (Float, Int):

class FloatToIntCast(CastImplementation):
    def resolve_dtypes(self, dtypes, **kwargs):
        # non-parametrized dtypes, so always return the input
        return dtypes

    def do_cast(self, array, dtype, **kwargs):
        return astype_float_to_int(array, dtype, **kwargs)

for dtype_from in [pd.Float32Dtype, pd.Float64Dtype]:
    for dtype_to in [pd.Int8Dtype, pd.Int16Dtype, pd.Int32Dtype, pd.Int64Dtype]:
        register_cast_impl((dtype_from, dtype_to), FloatToIntCast)

and then the actual astype function:

def astype(arr, dtype, **kwargs):
    # get the cast implementation class (based on dtype classes, not instances)
    cast = get_cast_impl((type(arr.dtype), type(dtype)))
    # if the cast is known to be unsupported:
    if cast is None:
        raise TypeError(f"Cannot cast from {arr.dtype} to {dtype}.")
    # in case of parametrized dtypes, check from which to which dtype we can cast
    resolved_dtype_from, resolved_dtype_to = cast.resolve_dtypes((arr.dtype, dtype))
    # if needed, first cast the input array to a different parametrization
    if arr.dtype != resolved_dtype_from:
        cast_from = get_cast_impl((type(arr.dtype), type(arr.dtype)))
        arr = cast_from.do_cast(arr, resolved_dtype_from)
    # do the main cast
    arr = cast.do_cast(arr, resolved_dtype_to)
    # if needed, cast the result array to a different parametrization
    if dtype != resolved_dtype_to:
        cast_to = get_cast_impl((type(resolved_dtype_to), type(resolved_dtype_to)))
        arr = cast_to.do_cast(arr, dtype)
    return arr

This is a quite complicated API (and also quite different from the other interfaces we currently have in pandas with the … class), and maybe too complicated for what we need. But some nice things about it: …
While in the code snippet above the cast implementation is a class, this could maybe also be simplified to registering cast functions for given dtypes (a kind of function dispatch based on the (source, target) dtypes).

Simplified cast from/to ExtensionDtype methods

Something simpler, but still with the possibility to have both the target dtype and the source dtype implement the actual cast, would be to add two methods for this to the ExtensionDtype:

class ExtensionDtype:
    def _cast_from(self, array: ExtensionArray, dtype: ExtensionDtype, **kwargs) -> ExtensionArray | None:
        """cast from this dtype to another dtype"""
        if isinstance(dtype, IntegerDtype):
            return astype_to_int(array, dtype)
        elif isinstance(dtype, StringDtype):
            result = to_string(array)
            # calling out to the string dtype's cast to ensure we have
            # the exact dtype parametrization
            return result.astype(dtype, copy=False)
        elif ...:
            ...
        return None

    def _cast_to(self, array: ExtensionArray, **kwargs) -> ExtensionArray | None:
        """cast from another dtype to this dtype"""
        ...
        return None

Those two methods either return a resulting array if they know how to cast (i.e. if they are the dtype that implements the cast), or otherwise they return None. The top-level astype then becomes:

def astype(arr, dtype, **kwargs):
    # does the target dtype know how to cast to it?
    result = dtype._cast_to(arr, **kwargs)
    if result is not None:
        return result
    # else try whether the array itself knows how to cast to the other dtype
    result = arr.dtype._cast_from(arr, dtype, **kwargs)
    if result is not None:
        return result
    # fall back to _from_sequence
    return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, **kwargs)

This is a bit simpler than the numpy-like CastImplementation mechanism, and might be sufficient for pandas. Note however that while the actual …

Formalizing the current situation

As @jbrockmendel brought up just above, formalizing what we currently already do in practice for most dtypes is also an option: …
In practice, this means that every ExtensionArray subclass needs to define its own astype:

class ExtensionArray:
    def astype(self, dtype: ExtensionDtype, **kwargs) -> ExtensionArray:
        # first handle the cases where the array itself knows how to cast
        # to the target dtype
        if isinstance(dtype, ...):
            return ...
        elif ...:
            ...
        # if we don't know ourselves, leave it to the target dtype
        return dtype.construct_array_type()._from_sequence(self, dtype=dtype)

(The above leaves out the fact that we actually convert to a numpy.array for numpy target dtypes, so it is just illustrative of the flow when handling extension dtypes.)

If we decide this is Good Enough, then I would argue for using a different method than …

But it is indeed good to ask ourselves: is a negotiation protocol actually needed? (apart from a single …) I think the main use case I can think of now for the more complicated interface (the first proposal with CastImplementation) is the nicety of not having to know each parametrization of the dtypes (the … |
I think it is "good enough". You also have to consider what happens for user-implemented extension dtypes. If we just require the extension dtypes to implement …, I don't think you then need to have … |
I will note a couple of things about my API for NumPy.
But, while I think my API choice is awesome ;), I am not sure which parts fit pandas ExtensionArrays. The important point is that NumPy provides the container, and DTypes are designed to be completely container-agnostic. So, for example, NumPy has to allocate the result array, but the …

The point of having both directions available (in some form) is that if you write a …
Right now the API is a bit restrictive about adding new casts to the dictionary, but effectively that is what defining a cast means: adding an … |
@Dr-Irv I think it's important to be able to control the cast from your dtype to any other dtype as well, and not only to your dtype. Sebastian gave an example of doing a custom float dtype. And another example that I often use to think this through is the GeometryDtype from geopandas. I mentioned above the cast between geometry and string, and in this case, as the extension dtype author, I want to be able to control both the … (now, as long as the EA can override …)

But what complicates "asking each EA to know about StringDtype" is that there is not a single StringDtype: it is parametrized (python or pyarrow storage). And the nice thing about the first (numpy-like) proposal is that it doesn't require the EA to know about every StringDtype parametrization, but allows it to know about only one.
@seberg I like your API as well :), but it's indeed not necessarily needed for pandas. In numpy it operates on a lower level, interacts with ufuncs, etc. And as you mention, we don't only have the dtype, but also the array container (the EA subclass), which is already supposed to hold some of this logic (e.g. if casting is needed in an operation, such as … |
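(To make the parametrization point from the geopandas example concrete: a hypothetical sketch plugging it into the numpy-like CastImplementation mockup above; GeometryToStringCast and geometries_to_wkt are assumed names, not geopandas API.)

class GeometryToStringCast(CastImplementation):
    def resolve_dtypes(self, dtypes, **kwargs):
        # the geometry dtype only needs to know how to produce a single
        # string parametrization; astype() then handles the remaining
        # string -> string cast to any other storage
        dtype_from, _ = dtypes
        return dtype_from, pd.StringDtype(storage="python")

    def do_cast(self, array, dtype, **kwargs):
        return geometries_to_wkt(array)  # hypothetical helper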
I think the existing pattern discussed above can do this with the exception of a corner case I'll describe below. I'm curious if anyone thinks this corner case is A Big Deal. Suppose Sally is implementing FooDtype/FooArray and wants to specify the behavior of BarDtype/BarArray.astype (of which she is not the author). Suppose that BarArray.astype uses the pattern discussed above of the form:
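(The snippet was presumably along these lines; a hedged reconstruction of the pattern, not the exact code.)

class BarArray(ExtensionArray):
    def astype(self, dtype, copy=True):
        if isinstance(dtype, BarDtype):
            return self.copy() if copy else self
        # unknown target: defer to the target dtype's array type
        return dtype.construct_array_type()._from_sequence(self, dtype=dtype)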
Then in FooArray._from_sequence she can do:
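(Again a hedged sketch of what that could look like; _from_bar is a hypothetical helper.)

class FooArray(ExtensionArray):
    @classmethod
    def _from_sequence(cls, scalars, *, dtype=None, copy=False):
        # special-case BarArray inputs, so that BarArray.astype(FooDtype())
        # lands in Sally's conversion logic via the fallback above
        if isinstance(scalars, BarArray):
            return cls._from_bar(scalars)  # hypothetical specialized path
        return cls(list(scalars), dtype=dtype, copy=copy)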
So Sally can accomplish the two-way control using the existing pattern, except for when … |
yes, that makes sense. I now see that the symmetry of … |
So if Sally implements … |
That's my understanding of _cast_to, yes. My point was that this behavior already exists with the existing pattern with _from_sequence taking the place of _cast_to. |
In #22343, we're bumping into a few problems around casting to / from extension arrays. ExtensionArray.astype doesn't handle target dtypes that are extension types (so extension_array.astype('category') fails). At the moment, each EA will have to implement its own astyping, which is error-prone and difficult.

I'm not sure how far down this road we want to go. Should we instead encourage users to use explicit constructors like .assign(x=pd.Categorical(...)) rather than .astype('category')?