
API: add EA._from_scalars / stricter casting of result values back to EA dtype #38315

Conversation

@jorisvandenbossche (Member) commented Dec 5, 2020

Closes #33254
Closes #31108

This adds an ExtensionArray._from_scalars method that is more strict than the existing _from_sequence constructor: in principle, it should only accept a sequence of scalars of its own type (i.e. the type you get with arr[0], plus NAs), or arrays with the appropriate dtype.

This can then be used in contexts where we try to re-create an EA of the original dtype (in an attempt to preserve that dtype), but don't want to coerce arbitrary values to it.
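
To illustrate the intended difference (a rough sketch of the semantics; _from_scalars only exists with this PR, so the last call is shown commented out):

import pandas as pd

# _from_sequence is permissive: integer-like floats are silently cast
pd.arrays.IntegerArray._from_sequence([1.0, 2.0, None], dtype=pd.Int64Dtype())
# -> <IntegerArray> [1, 2, <NA>]

# _from_scalars would raise instead, because floats are not scalars of an
# Int64 array (arr[0] gives an integer or pd.NA, never a float):
# pd.arrays.IntegerArray._from_scalars([1.0, 2.0, None], dtype=pd.Int64Dtype())
# -> TypeError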

Note: this is a WIP PR that mostly adds dummy implementations of this method, just to mimic the desired behaviour for now and make it possible to explore how this would work.
If this turns out to work, we will need proper implementations of _from_scalars for our own EAs.

Apart from that, this needs tests, docs, etc. But let's first discuss semantics.

@jorisvandenbossche jorisvandenbossche added API Design ExtensionArray Extending pandas with custom dtypes or arrays. labels Dec 5, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.3 milestone Dec 5, 2020
@jorisvandenbossche (Member Author) commented Dec 5, 2020

So some difficult cases I ran into:

1. Integer/BooleanArray get cast to float because of NAs and thus give float results, even for ops that are potentially dtype-preserving

In theory, if we are strict in IntegerArray._from_scalars, we should not accept floats, to avoid issues where e.g. a mean() result gets cast back to int because the results are accidentally all "integer-like" (see e.g. one of the examples in #37494, or #27071 (comment)).

However, when you have missing values and operate in numpy, you will typically operate on the array converted to a float array, and thus also get float results, even for ops that are potentially dtype-preserving.
An example of this is the test_groupby_sum_mincount test for nullable boolean data, which basically does this:

In [1]: arr = pd.array([True, True, pd.NA, pd.NA, False, False, True], dtype="boolean")

In [2]: df = pd.DataFrame({"A": [1, 1, 2, 2, 3, 3, 1], "B": arr})

In [3]: df.groupby("A").sum()
Out[3]: 
     B
A     
1  3.0
2  0.0
3  0.0

With the current implementation, but without trying to cast back, or when being strict in what IntegerArray._from_scalars accepts, the results are floats (like above).
For this case, we know the result should be Int64, which is coded here:

elif how in ["add", "cumsum", "sum"] and isinstance(dtype, BooleanDtype):
    return Int64Dtype()

But that still means that either Int64's _from_scalars needs to accept floats, or we need to force the conversion (which we could do in this case, because we know the result dtype for certain, but that's not something we know in general).

Something similar happens in the failing pandas/tests/groupby/test_nth.py::test_first_last_extension_array_keeps_dtype tests, because there we also get floats 0.0 and 1.0 that represent False and True, and we try to preserve the dtype, which no longer works. So here as well, we either need to force the result back, or be less strict in BooleanArray._from_scalars.

2. Data types with different precisions

For the integer and floating data types, we are currently not always strict about the exact dtype instance, i.e. the precision, when casting back (e.g. in _from_sequence, when no dtype is passed, we default to Int64 or Float64).

Depending on the result, we also might not want to cast back to e.g. Int8 if the results would not fit in the int8 range, but would still rather use Int64 (to preserve nullability) than fall back to numpy int64.

An example:

In [1]: df = pd.DataFrame({"key": [1, 1, 2, 2], "B": pd.array([1, 2, 3, 4], dtype="Int8")})

# "first" is a dtype-preserving op, so this is fine
In [2]: df.groupby("key")["B"].first()
Out[2]: 
key
1    1
2    3
Name: B, dtype: Int8

# "sum" should actually always use Int64, and not preserve Int8
# but this is something we know in this case, and so can force
# (this is being fixed in https://github.com/pandas-dev/pandas/pull/38291)
In [3]: df.groupby("key")["B"].sum()
Out[3]: 
key
1    3
2    7
Name: B, dtype: Int8

# For a UDF, we can try to cast back, and if it fails use the original result, in this case
# an int64 numpy array (which is what happened here). The reason it fails here is that
# np.int64 scalars are disallowed in _from_scalars with Int8 dtype (we check instances of `dtype.type`).
# However, we could also be more flexible here and detect that we got an int64 ndarray, and use Int64
# instead of strictly adhering to Int8.
In [4]: df.groupby("key")["B"].aggregate(lambda x: x.sum())
Out[4]: 
key
1    3
2    7
Name: B, dtype: int64

So in "theory", the idea of _from_scalars is to be strict about the dtype, but we should maybe explicitly say that it is fine for the EA class not to be strict about the exact dtype instance (Int8 vs Int64), as long as it is strict about its general type (for this example: integers).
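
One possible way to encode that (a rough sketch, not this PR's implementation; the kind check and the delegation to _from_sequence are my assumptions) would be, on IntegerArray:

import numpy as np
from pandas import isna

@classmethod
def _from_scalars(cls, data, dtype):
    # strict about the kind: only integer scalars (or NA) pass ...
    if not all(isinstance(v, (int, np.integer)) or isna(v) for v in data):
        raise TypeError("Requires integer scalars")
    # ... but flexible about the precision: delegate to the permissive
    # constructor (a fuller version could catch an overflow for the original
    # Int8 here and retry with Int64 instead)
    return cls._from_sequence(data, dtype=dtype)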

3. Sparse subtype

See the example described here: #33254 (comment), about SparseArray trying to preserve the "sparse" aspect, and not necessarily the exact SparseDtype.subtype. So, for example, an array with sparse[int] dtype could give sparse[bool] results (instead of falling back to plain bool):

In [3]: a = pd.arrays.SparseArray([0, 1, 0])

In [4]: s = pd.Series(a)

In [5]: s
Out[5]: 
0    0
1    1
2    0
dtype: Sparse[int64, 0]

# on master, you get:
In [31]: s.combine(1, lambda x, y: x == y)  
Out[31]: 
0    False
1     True
2    False
dtype: Sparse[bool, False]

# with this PR, however, I currently get:
In [6]: s.combine(1, lambda x, y: x == y)
Out[6]: 
0    False
1     True
2    False
dtype: bool

So this is actually the same underlying issue as case 2 (the precision of the integer/floating dtypes), but for sparse it's broader: since it is a "composite" dtype (it has a subtype), there are many more possibilities than just the precision of the integer/floating dtypes.
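
For SparseArray that could look something like this sketch (assumption: only the sparseness is preserved, with the subtype re-inferred from the result values):

import numpy as np

# would live on SparseArray; `dtype` is the original SparseDtype
@classmethod
def _from_scalars(cls, data, dtype):
    values = np.asarray(data)
    if values.dtype == object:
        raise TypeError("cannot infer a meaningful sparse subtype")
    # keep the result sparse, but let the subtype (and fill_value) be
    # re-inferred, so sparse[int] input can legitimately give sparse[bool]
    return cls(values)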

4. Object type or subdtype

In general, we are not going to be able to be strict when dealing with object dtypes.
For example, for a Categorical with string categories stored in an object dtype index, we are not going to be able to just be strict on the type, since object can hold anything. Now, in the specific case of Categorical, we can probably be strict about "scalars need to be in the categories" (so while writing this down, I realize this might not be an issue).

We do have this issue for the object-dtype PandasArray, but that's not an EA that is actually used in practice at the moment. So that's probably something we can defer until later, in case we ever get a real "object" EA dtype (and then it will probably be fine to always infer the dtype of a potential result and never try to preserve it, since it's object anyway; it would by definition always be possible to preserve the object dtype).

# Once this is fixed, the commented code can be uncommented
# -> https://github.com/pandas-dev/pandas/issues/38316
# expected = CategoricalIndex(index.values)
expected = CategoricalIndex(np.asarray(index.values))
Member Author:

I opened an issue for this -> #38316 (it's also mentioned in the comment above)

Comment on lines +323 to +328
# result_dtype = maybe_cast_result_dtype(dtype, how)
# if result_dtype is not None:
# # we know what the result dtypes needs to be -> be more permissive in casting
# # (eg ints with nans became floats)
# cls = result_dtype.construct_array_type()
# return cls._from_sequence(obj, dtype=result_dtype)
Member Author:

This is related to the first case mentioned (when we know the resulting dtype, we should maybe force the result back, instead of relying on strict casting).
So the commented code above is one possibility: we can change maybe_cast_result_dtype to only return a dtype if it knows what the dtype should be, and otherwise return None (instead of passing through the original dtype). This way, we can take a different path for "known" dtypes vs guessed dtypes.
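
Spelled out, the two-path idea would look roughly like this (sketch; assumes maybe_cast_result_dtype is changed to return None when the result dtype is not known with certainty):

result_dtype = maybe_cast_result_dtype(dtype, how)
if result_dtype is not None:
    # known result dtype (e.g. boolean + "sum" -> Int64): force the cast
    # with the permissive constructor, floats-from-NAs and all
    cls = result_dtype.construct_array_type()
    return cls._from_sequence(obj, dtype=result_dtype)
# otherwise, only try to preserve the dtype via the strict constructor
try:
    return dtype.construct_array_type()._from_scalars(obj, dtype=dtype)
except TypeError:
    return obj  # keep the inferred result (e.g. a plain float ndarray)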

@jbrockmendel (Member) commented:

Big picture, how does this relate to _from_sequence? Would we be doing this instead of making _from_sequence strict, or doing both?

@@ -188,6 +188,12 @@ class ExtensionArray:
# Constructors
# ------------------------------------------------------------------------

@classmethod
def _from_scalars(cls, data, dtype):
    if not all(isinstance(v, dtype.type) or isna(v) for v in data):
        raise TypeError("Requires dtype scalars")
    # (continuation reconstructed; the diff was truncated here in this view)
    return cls._from_sequence(data, dtype=dtype)
Member:

isna -> is_valid_nat_for_dtype? (still need to rename to is_valid_na_for_dtype)

Member:

this has been renamed to the clearer is_valid_na_for_dtype

# if not all(
#     isinstance(v, dtype.categories.dtype.type) or isna(v) for v in data
# ):
#     raise TypeError("Requires dtype scalars")
Member:

if not all (x in dtype.categories or is_valid_nat_for_dtype(x, dtype.categories.dtype) for x in data)?

Member Author:

Yes, as mentioned somewhere in #38315 (comment), for categorical we probably want to check if the values are valid categories.

We might already have some functionality for this? Like the main constructor, but then raising an error instead of coercing unknown values to NaN:

In [17]: pd.Categorical(["a", "b", "c"], categories=["a", "b"])
Out[17]: 
['a', 'b', NaN]
Categories (2, object): ['a', 'b']

The above is basically done by _get_codes_for_values, so we might want a version of that which is strict instead of coercing to NaN.
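
A strict variant could be sketched like this (assumption: using Index.get_indexer, where a code of -1 marks a value not found among the categories):

import numpy as np
import pandas as pd

values = np.asarray(["a", "b", "c"], dtype=object)
categories = pd.Index(["a", "b"])

codes = categories.get_indexer(values)  # -> array([0, 1, -1])
if ((codes == -1) & ~pd.isna(values)).any():
    # "c" is neither a category nor NA -> strict mode raises instead of
    # coercing it to NaN the way the Categorical constructor does
    raise TypeError("Requires scalars that are valid categories")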

Member:

I think Categorical._validate_setitem_value does what you're describing

# override because dtype.type is only the numpy scalar
# TODO accept float here?
if not all(
    isinstance(v, (int, dtype.type, float, np.float_)) or isna(v) for v in data
):
    # (continuation reconstructed; the diff was truncated here in this view)
    raise TypeError("Requires dtype scalars")
Member:

for floats require that v.is_integer()?

Member Author:

The coerce_to_array function that is used by _from_sequence already checks for this as well (so we can pass any float through here, as it will be caught later)

Member:

DTA/TDA/PA have a _recognized_scalars attribute that could be useful

@jorisvandenbossche (Member Author) commented:

For your inline comments (up to now): can we keep it more high-level for now (the semantics of _from_scalars and how the method is used)?
Those are mostly comments on the implementations of _from_scalars, but as mentioned above, these are mostly dummy (and slow) implementations that would never be good to merge, just there to explore the semantics of how it could work.

@jbrockmendel (Member) commented:

For your inline comments (up to now): can we keep it more high-level for now (the semantics of _from_scalars and how the method is used)?

sure

@jorisvandenbossche (Member Author) commented:

@jbrockmendel sorry about my brief comment yesterday; to be clear, your review comments are certainly useful for when we optimize those implementations (it's just that I haven't put any effort into that yet)

@jorisvandenbossche (Member Author) commented:

Big picture, how does this relate to _from_sequence? Would we be doing this instead of making _from_sequence strict, or doing both?

It relates to _from_sequence in the sense that they both serve a similar purpose (creating an array object from a sequence/array-like as input), but _from_scalars is strict and _from_sequence is not.
And we can't make the existing _from_sequence strict, because it is used for general array creation purposes (pd.array(..)) where we explicitly don't want to be strict. See #33254 for the detailed explanation.
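
For example (current behavior), pd.array happily accepts strings for a datetime dtype, which a strict constructor would have to reject:

In [1]: pd.array(["2020-01-01", None], dtype="datetime64[ns]")
Out[1]: 
<DatetimeArray>
['2020-01-01 00:00:00', 'NaT']
Length: 2, dtype: datetime64[ns]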

@jbrockmendel (Member) commented:

The astype discussion is making me more positive on this. Is it ready for another look?

@jorisvandenbossche (Member Author) commented:

is it ready for another look?

It's in the same state as the previous time you looked (I only merged master since then): feedback is mostly needed on the semantics (not necessarily on the actual implementation, as those are mostly dummy implementations that still need to be optimized), and on all the questions I raised above in #38315 (comment), before I can work on it further.
So yes, this is ready (waiting) for more feedback.

@@ -2909,7 +2909,7 @@ def transpose(self, *args, copy: bool = False) -> DataFrame:
arr_type = dtype.construct_array_type()
values = self.values

-new_values = [arr_type._from_sequence(row, dtype=dtype) for row in values]
+new_values = [arr_type._from_scalars(row, dtype=dtype) for row in values]
Member:

in this case we already know we have the correct types, so can't we go directly to _from_sequence?

@@ -965,7 +965,7 @@ def fast_xs(self, loc: int) -> ArrayLike:
result[rl] = blk.iget((i, loc))

if isinstance(dtype, ExtensionDtype):
-    result = dtype.construct_array_type()._from_sequence(result, dtype=dtype)
+    result = dtype.construct_array_type()._from_scalars(result, dtype=dtype)
Member:

here too, shouldn't we know we have the correct types?

@jbrockmendel (Member) commented:

IIUC it is mainly just in maybe_cast_to_extension_array that from_scalars is needed. Are there others?

Supposing we had a fully-general solution to #37367. That would render from_scalars unnecessary, right?

What are your thoughts on how this affects from_sequence? e.g. it could become non-strict and used for astype?

@jorisvandenbossche (Member Author) commented:

Supposing we had a fully-general solution to #37367. That would render from_scalars unnecessary, right?

Do you mean the ability to generally use infer_dtype for scalars from extension arrays? (so e.g. some way to register scalar types linked to a certain extension dtype) I seem to recall we have an issue about this, but can't directly find it.
(Note this is not directly related to #37367, as that PR only fixes the case where you already have an ExtensionArray to start with, not a list / object-dtyped array of scalars.)

I would need to think a bit more about it, but yes, that's certainly related. One concern that comes to mind is that you can have conflicting dtypes that use the same scalar type. So even then you might still want to give the original dtype the chance to "recreate" itself with a method like from_scalars.

What are your thoughts on how this affects from_sequence? e.g. it could become non-strict and used for astype?

See #38315 (comment) ?
It's already non-strict, and short term it is used for pd.array(..) and in casting routines. But, as noted in #33254, what is currently covered by _from_sequence could in principle also be done by a general casting interface (#22384)

@github-actions bot commented:

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@jorisvandenbossche (Member Author) commented:

(@jbrockmendel this is kind of waiting on more feedback from you or others)

@jbrockmendel (Member) commented:

IIUC it is mainly just in maybe_cast_to_extension_array that from_scalars is needed. Are there others?

Did you address this? Looking back on my inline comments, this looks important.

@jbrockmendel (Member) commented:

Looked back over this and #33254. Thoughts:

1. We should choose a name that makes it really clear what this is for and what distinguishes it from _from_sequence.
2. If we could avoid needing another constructor that would be nice. One way to do that would be to make _from_sequence strict. The barriers to that AFAICT are:
   a) DTA/TDA getting special treatment inside pd.array (similar behavior shows up in DatetimeIndex.__new__ and the Series constructor). Could use to_datetime/to_timedelta instead of _from_sequence for those. This has the normal downsides that come with special-casing.
   b) astype machinery might want to be more forgiving.
      i) Do we have a good idea of what cases we would want to allow in astype but not in _from_sequence?

@jorisvandenbossche (Member Author) commented Jan 29, 2021

Sorry for the slow reply here.

IIUC it is mainly just in maybe_cast_to_extension_array that from_scalars is needed. Are there others?

That certainly has been the use case driving the discussion, yes (and maybe_cast_to_extension_array is eventually used in the python fallback in groupby and in the combine method).

Now, in this PR I am using it in a bunch of other places as well: places where we know we will only have scalars of exactly the correct type. But it should maybe be discussed whether that's actually useful. Because assuming that _from_sequence will also give the correct result in those cases, using _from_scalars instead only makes sense if it is, for example, faster. In theory _from_scalars implementations could be faster because they have to deal with less variable input, but in practice making them strict might also have a performance cost.

We should choose a name that makes it really clear what this is for and what distinguishes it from _from_sequence

For me, _from_scalars is clear in that regard (it's for reconstructing from scalars as you would get them from the array). But alternative suggestions are certainly welcome.

One way to do that would be to make _from_sequence strict. The barriers to that AFAICT are .. Do we have a good idea of what cases we would want to allow in astype but not in _from_sequence?

In general, both pd.array(..) and astype(..) are not strict (and I think both cases should probably allow the same conversions). For example, we allow casting floats to ints or ints to floats, and casting ints or strings to datetime or the other way around, etc.
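
Concretely, conversions like these are allowed today and should stay allowed:

In [1]: pd.Series([1.0, 2.0]).astype("Int64")   # float -> nullable int
Out[1]: 
0    1
1    2
dtype: Int64

In [2]: pd.Series(["2020-01-01"]).astype("datetime64[ns]")   # string -> datetime
Out[2]: 
0   2020-01-01
dtype: datetime64[ns]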

@github-actions bot commented Mar 1, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Mar 1, 2021
@jbrockmendel (Member) commented:

Now, in this PR, I am using it in a bunch of other places as well. Places where we know we will only have scalars of exact correct type.

This may just be a semantic misunderstanding. Is there a distinction between, e.g. a list of Period objects vs an ndarray[object] of Period objects vs a PeriodArray? My intuition is that the PeriodArray would be a use case for from_sequence and the others would be for from_scalars.
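
For concreteness, the three shapes being contrasted (illustrative only):

import numpy as np
import pandas as pd

pers_list = [pd.Period("2020-01", freq="M"), pd.Period("2020-02", freq="M")]
pers_objarr = np.array(pers_list, dtype=object)   # ndarray[object] of Periods
pers_ea = pd.array(pers_list, dtype="period[M]")  # already a PeriodArray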

For the purposes of review, I would find it much easier if you changed _from_sequence usages to _from_scalars only in the places where the difference in behavior matters.


Will this make period_array unnecessary? I don't like the idea of ending up with 3 constructors for PeriodArray.

@jbrockmendel (Member) commented:

from #38315 (comment)

so we might want a version [of a Categorical constructor] of that which is strict instead of coercing to NaN.

#40996 proposes making the Categorical constructor stricter. On last week's call there was consensus to try changing it to see if it breaks the world. It turns out to break 220 tests.

I think we do want to support having some way of taking e.g. ["A", "B", "D"] -> Categorical(["A", "B", NaN], categories=["A", "B", "C"]). The two options that come to mind are astype or _from_sequence. In one or both (i think i prefer just one) we are going to need a keyword (or second method like in this PR) to make it either more or less strict.

ATM I'm leaning towards preferring a keyword in astype, either errors='raise' like we use elsewhere, or casting='unsafe' to follow numpy. I may revisit this opinion as I deal with more of our corner-y constructor logic.

@jorisvandenbossche Not sure if this really helps with this PR, but wanted to keep you apprised of my thinking on the matter.

@simonjayhawkins (Member) commented:

removing the 1.3 milestone

@simonjayhawkins simonjayhawkins removed this from the 1.3 milestone May 24, 2021
@jreback (Contributor) commented Nov 28, 2021

this is quite old, happy to reopen if actively worked on.

@jbrockmendel (Member) commented:

another thought: we could specify what exceptions _from_scalars is allowed to raise, to avoid except Exception in maybe_cast_to_extension_array
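
That could look like this (sketch; the idea that TypeError is the sanctioned exception is an assumption):

# in maybe_cast_to_extension_array: catch only the sanctioned exception type,
# so genuine bugs propagate instead of being silently swallowed
try:
    return cls._from_scalars(obj, dtype=dtype)
except TypeError:
    # _from_scalars signals "these scalars don't fit this dtype" via TypeError
    return obj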

@simonjayhawkins (Member) commented:

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.
