
API: add EA._from_scalars / stricter casting of result values back to EA dtype #38315

Conversation

@jorisvandenbossche (Member) commented Dec 5, 2020

Closes #33254
Closes #31108

This adds an ExtensionArray._from_scalars method that is more strict than the existing _from_sequence constructor: in principle, it should only accept a sequence of scalars of its own type (i.e. the type you get with arr[0], plus NAs), or arrays with the appropriate dtype.

This can then be used in contexts where we try to re-create an EA of the original dtype (in an attempt to preserve that dtype), but don't want to coerce arbitrary values to it.
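
To illustrate the intended difference (a rough sketch of the semantics; _from_scalars only exists with this PR, so the last call is shown commented out):

import pandas as pd

# _from_sequence is permissive: integer-like floats are silently cast
pd.arrays.IntegerArray._from_sequence([1.0, 2.0, None], dtype=pd.Int64Dtype())
# -> <IntegerArray> [1, 2, <NA>]

# _from_scalars would raise instead, because floats are not scalars of an
# Int64 array (arr[0] gives an integer or pd.NA, never a float):
# pd.arrays.IntegerArray._from_scalars([1.0, 2.0, None], dtype=pd.Int64Dtype())
# -> TypeError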

Note: this is a WIP PR that mostly adds dummy implementations of this method, just to mimic the desired behaviour for now and make it possible to explore how this would work.
If this turns out to work, we will need proper implementations of _from_scalars for our own EAs.

Apart from that, this needs tests, docs, etc. But let's first discuss semantics.

@jorisvandenbossche jorisvandenbossche added API Design ExtensionArray Extending pandas with custom dtypes or arrays. labels Dec 5, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.3 milestone Dec 5, 2020
@jorisvandenbossche (Member Author) commented Dec 5, 2020

So some difficult cases I ran into:

1. Integer/BooleanArray get cast to float because of NAs and thus give float results, even for ops that are potentially dtype-preserving

In theory, if we are strict in IntegerArray._from_scalars, we should not accept floats, to avoid issues where e.g. a mean() result gets cast back to int because the results are accidentally all "integer-like" (see e.g. one of the examples in #37494, or #27071 (comment)).

However, when you have missing values and operate in numpy, you will typically operate on the array converted to a float array, and thus also get float results, even for ops that are potentially dtype-preserving.
An example of this is the test_groupby_sum_mincount test for nullable boolean data, which basically does this:

In [1]: arr = pd.array([True, True, pd.NA, pd.NA, False, False, True], dtype="boolean")

In [2]: df = pd.DataFrame({"A": [1, 1, 2, 2, 3, 3, 1], "B": arr})

In [3]: df.groupby("A").sum()
Out[3]: 
     B
A     
1  3.0
2  0.0
3  0.0

With the current implementation, but without trying to cast back, or when being strict in what IntegerArray._from_scalars accepts, the results are floats (like above).
For this case, we know the result should be Int64, which is coded here:

elif how in ["add", "cumsum", "sum"] and isinstance(dtype, BooleanDtype):
    return Int64Dtype()

But that still means that either Int64's _from_scalars needs to accept floats, or we need to force the conversion (which we could do in this case, because we know the result dtype for certain, but that's not something we know in general).

Something similar happens in the failing pandas/tests/groupby/test_nth.py::test_first_last_extension_array_keeps_dtype tests, because there we also get floats 0.0 and 1.0 that represent False and True, and we try to preserve the dtype, which no longer works. So here as well, we either need to force the result back, or be less strict in BooleanArray._from_scalars.

2. Data types with different precisions

For the integer and floating data types, we are currently not always strict about the exact dtype instance, i.e. the precision, when casting back (e.g. in _from_sequence, when no dtype is passed, we default to Int64 or Float64).

Depending on the result, we also might not want to cast back to e.g. Int8 if the results would not fit in the int8 range, but would still rather use Int64 (to preserve nullability) than fall back to numpy int64.

An example:

In [1]: df = pd.DataFrame({"key": [1, 1, 2, 2], "B": pd.array([1, 2, 3, 4], dtype="Int8")})

# "first" is a dtype-preserving op, so this is fine
In [2]: df.groupby("key")["B"].first()
Out[2]: 
key
1    1
2    3
Name: B, dtype: Int8

# "sum" should actually always use Int64, and not preserve Int8
# but this is something we know in this case, and so can force
# (this is being fixed in https://github.com/pandas-dev/pandas/pull/38291)
In [3]: df.groupby("key")["B"].sum()
Out[3]: 
key
1    3
2    7
Name: B, dtype: Int8

# For a UDF, we can try to cast back, and if it fails use the original result, in this case
# an int64 numpy array (which is what happened here). The reason it fails here is that
# np.int64 scalars are disallowed in _from_scalars with Int8 dtype (we check instances of `dtype.type`).
# However, we could also be more flexible here and detect that we got an int64 ndarray, and use Int64
# instead of strictly adhering to Int8.
In [4]: df.groupby("key")["B"].aggregate(lambda x: x.sum())
Out[4]: 
key
1    3
2    7
Name: B, dtype: int64

So in "theory", the idea of _from_scalars is to be strict about the dtype, but we should maybe explicitly say that it is fine for the EA class not to be strict about the exact dtype instance (Int8 vs Int64), as long as it is strict about its general type (for this example: integers).
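
One possible way to encode that (a rough sketch, not this PR's implementation; the kind check and the delegation to _from_sequence are my assumptions) would be, on IntegerArray:

import numpy as np
from pandas import isna

@classmethod
def _from_scalars(cls, data, dtype):
    # strict about the kind: only integer scalars (or NA) pass ...
    if not all(isinstance(v, (int, np.integer)) or isna(v) for v in data):
        raise TypeError("Requires integer scalars")
    # ... but flexible about the precision: delegate to the permissive
    # constructor (a fuller version could catch an overflow for the original
    # Int8 here and retry with Int64 instead)
    return cls._from_sequence(data, dtype=dtype)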

3. Sparse subtype

See the example described here: #33254 (comment), about SparseArray trying to preserve the "sparse" aspect, and not necessarily the exact SparseDtype.subtype. So, for example, an array with sparse[int] dtype could give sparse[bool] results (instead of falling back to plain bool):

In [3]: a = pd.arrays.SparseArray([0, 1, 0])

In [4]: s = pd.Series(a)

In [5]: s
Out[5]: 
0    0
1    1
2    0
dtype: Sparse[int64, 0]

# on master, you get:
In [31]: s.combine(1, lambda x, y: x == y)  
Out[31]: 
0    False
1     True
2    False
dtype: Sparse[bool, False]

# with this PR, however, I currently get:
In [6]: s.combine(1, lambda x, y: x == y)
Out[6]: 
0    False
1     True
2    False
dtype: bool

So this is actually the same underlying issue as case 2 (the precision of the integer/floating dtypes), but for sparse it's broader: since it is a "composite" dtype (it has a subtype), there are many more possibilities than just the precision of the integer/floating dtypes.
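
For SparseArray that could look something like this sketch (assumption: only the sparseness is preserved, with the subtype re-inferred from the result values):

import numpy as np

# would live on SparseArray; `dtype` is the original SparseDtype
@classmethod
def _from_scalars(cls, data, dtype):
    values = np.asarray(data)
    if values.dtype == object:
        raise TypeError("cannot infer a meaningful sparse subtype")
    # keep the result sparse, but let the subtype (and fill_value) be
    # re-inferred, so sparse[int] input can legitimately give sparse[bool]
    return cls(values)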

4. Object type or subdtype

In general, we are not going to be able to be strict when dealing with object dtypes.
For example, for a Categorical with string categories stored in an object dtype index, we are not going to be able to just be strict on the type, since object can hold anything. Now, in the specific case of Categorical, we can probably be strict about "scalars need to be in the categories" (so while writing this down, I realize this might not be an issue).

We do have this issue for the object-dtype PandasArray, but that's not an EA that is actually used in practice at the moment. So that's probably something we can defer until later, in case we ever get a real "object" EA dtype (and then it will probably be fine to always infer the dtype of a potential result and never try to preserve it, since it's object anyway; it would by definition always be possible to preserve the object dtype).

# Once this is fixed, the commented code can be uncommented
# -> https://github.com/pandas-dev/pandas/issues/38316
# expected = CategoricalIndex(index.values)
expected = CategoricalIndex(np.asarray(index.values))
Member Author:

I opened an issue for this -> #38316 (it's also mentioned in the comment above)

Comment on lines +323 to +328
# result_dtype = maybe_cast_result_dtype(dtype, how)
# if result_dtype is not None:
# # we know what the result dtypes needs to be -> be more permissive in casting
# # (eg ints with nans became floats)
# cls = result_dtype.construct_array_type()
# return cls._from_sequence(obj, dtype=result_dtype)
Member Author:

This is related to the first case mentioned (when we know the resulting dtype, we should maybe force the result back, instead of relying on strict casting).
So the commented code above is one possibility: we can change maybe_cast_result_dtype to only return a dtype if it knows what the dtype should be, and otherwise return None (instead of passing through the original dtype). This way, we can take a different path for "known" dtypes vs guessed dtypes.
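
Spelled out, the two-path idea would look roughly like this (sketch; assumes maybe_cast_result_dtype is changed to return None when the result dtype is not known with certainty):

result_dtype = maybe_cast_result_dtype(dtype, how)
if result_dtype is not None:
    # known result dtype (e.g. boolean + "sum" -> Int64): force the cast
    # with the permissive constructor, floats-from-NAs and all
    cls = result_dtype.construct_array_type()
    return cls._from_sequence(obj, dtype=result_dtype)
# otherwise, only try to preserve the dtype via the strict constructor
try:
    return dtype.construct_array_type()._from_scalars(obj, dtype=dtype)
except TypeError:
    return obj  # keep the inferred result (e.g. a plain float ndarray)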

@jbrockmendel (Member) commented:

Big picture, how does this relate to _from_sequence? Would we be doing this instead of making _from_sequence strict, or doing both?

@@ -188,6 +188,12 @@ class ExtensionArray:
# Constructors
# ------------------------------------------------------------------------

@classmethod
def _from_scalars(cls, data, dtype):
    if not all(isinstance(v, dtype.type) or isna(v) for v in data):
        raise TypeError("Requires dtype scalars")
    # (continuation reconstructed; the diff was truncated here in this view)
    return cls._from_sequence(data, dtype=dtype)
Member:

isna -> is_valid_nat_for_dtype? (still need to rename to is_valid_na_for_dtype)

Member:

this has been renamed to the clearer is_valid_na_for_dtype

# if not all(
#     isinstance(v, dtype.categories.dtype.type) or isna(v) for v in data
# ):
#     raise TypeError("Requires dtype scalars")
Member:

if not all (x in dtype.categories or is_valid_nat_for_dtype(x, dtype.categories.dtype) for x in data)?

Member Author:

Yes, as mentioned somewhere in #38315 (comment), for categorical we probably want to check if the values are valid categories.

We might already have some functionality for this? Like the main constructor, but then raising an error instead of coercing unknown values to NaN:

In [17]: pd.Categorical(["a", "b", "c"], categories=["a", "b"])
Out[17]: 
['a', 'b', NaN]
Categories (2, object): ['a', 'b']

The above is basically done by _get_codes_for_values, so we might want a version of that which is strict instead of coercing to NaN.
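
A strict variant could be sketched like this (assumption: using Index.get_indexer, where a code of -1 marks a value not found among the categories):

import numpy as np
import pandas as pd

values = np.asarray(["a", "b", "c"], dtype=object)
categories = pd.Index(["a", "b"])

codes = categories.get_indexer(values)  # -> array([0, 1, -1])
if ((codes == -1) & ~pd.isna(values)).any():
    # "c" is neither a category nor NA -> strict mode raises instead of
    # coercing it to NaN the way the Categorical constructor does
    raise TypeError("Requires scalars that are valid categories")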

Member:

I think Categorical._validate_setitem_value does what you're describing

# override because dtype.type is only the numpy scalar
# TODO accept float here?
if not all(
    isinstance(v, (int, dtype.type, float, np.float_)) or isna(v) for v in data
):
    # (continuation reconstructed; the diff was truncated here in this view)
    raise TypeError("Requires dtype scalars")
Member:

for floats require that v.is_integer()?

Member Author:

The coerce_to_array function that is used by _from_sequence already checks for this as well (so we can pass any float through here, as it will be caught later)

Member:

DTA/TDA/PA have a _recognized_scalars attribute that could be useful

@jorisvandenbossche (Member Author) commented:

For your inline comments (up to now): can we keep it more high-level for now (the semantics of _from_scalars and how the method is used)?
Those are mostly comments on the implementations of _from_scalars, but as mentioned above, these are mostly dummy (and slow) implementations that would never be good to merge, just there to explore the semantics of how it could work.

@jbrockmendel (Member) commented:

For your inline comments (up to now): can we keep it more high-level for now (the semantics of _from_scalars and how the method is used)?

sure

@jorisvandenbossche (Member Author) commented:

@jbrockmendel sorry about my brief comment yesterday; to be clear, your review comments are certainly useful for when we optimize those implementations (it's just that I haven't put any effort into that yet)

@jorisvandenbossche (Member Author) commented:

Big picture, how does this relate to _from_sequence? Would we be doing this instead of making _from_sequence strict, or doing both?

It relates to _from_sequence in the sense that they both serve a similar purpose (creating an array object from a sequence/array-like as input), but _from_scalars is strict and _from_sequence is not.
And we can't make the existing _from_sequence strict, because it is used for general array creation purposes (pd.array(..)) where we explicitly don't want to be strict. See #33254 for the detailed explanation.
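
For example (current behavior), pd.array happily accepts strings for a datetime dtype, which a strict constructor would have to reject:

In [1]: pd.array(["2020-01-01", None], dtype="datetime64[ns]")
Out[1]: 
<DatetimeArray>
['2020-01-01 00:00:00', 'NaT']
Length: 2, dtype: datetime64[ns]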

@jbrockmendel (Member) commented:

The astype discussion is making me more positive on this. Is it ready for another look?

@jorisvandenbossche (Member Author) commented:

is it ready for another look?

It's in the same state as the previous time you looked (I only merged master since then): feedback is mostly needed on the semantics (not necessarily on the actual implementation, as those are mostly dummy implementations that still need to be optimized), and on all the questions I raised above in #38315 (comment), before I can work on it further.
So yes, this is ready (waiting) for more feedback.

@@ -2909,7 +2909,7 @@ def transpose(self, *args, copy: bool = False) -> DataFrame:
arr_type = dtype.construct_array_type()
values = self.values

-new_values = [arr_type._from_sequence(row, dtype=dtype) for row in values]
+new_values = [arr_type._from_scalars(row, dtype=dtype) for row in values]
Member:

in this case we already know we have the correct types, so can't we go directly to _from_sequence?

@@ -965,7 +965,7 @@ def fast_xs(self, loc: int) -> ArrayLike:
result[rl] = blk.iget((i, loc))

if isinstance(dtype, ExtensionDtype):
-    result = dtype.construct_array_type()._from_sequence(result, dtype=dtype)
+    result = dtype.construct_array_type()._from_scalars(result, dtype=dtype)
Member:

here too, shouldn't we know we have the correct types?

@jbrockmendel (Member) commented:

IIUC it is mainly just in maybe_cast_to_extension_array that from_scalars is needed. Are there others?

Supposing we had a fully-general solution to #37367. That would render from_scalars unnecessary, right?

What are your thoughts on how this affects from_sequence? e.g. it could become non-strict and used for astype?

@jorisvandenbossche (Member Author) commented:

Supposing we had a fully-general solution to #37367. That would render from_scalars unnecessary, right?

Do you mean the ability to generally use infer_dtype for scalars from extension arrays? (so e.g. some way to register scalar types linked to a certain extension dtype) I seem to recall we have an issue about this, but can't directly find it.
(Note this is not directly related to #37367, as that PR only fixes the case where you already have an ExtensionArray to start with, not a list / object-dtyped array of scalars.)

I would need to think a bit more about it, but yes, that's certainly related. One concern that comes to mind is that you can have conflicting dtypes that use the same scalar type. So even then you might still want to give the original dtype the chance to "recreate" itself with a method like from_scalars.

What are your thoughts on how this affects from_sequence? e.g. it could become non-strict and used for astype?

See #38315 (comment) ?
It's already non-strict, and short term it is used for pd.array(..) and in casting routines. But, as noted in #33254, what is currently covered by _from_sequence could in principle also be done by a general casting interface (#22384)

@github-actions bot commented:

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@jorisvandenbossche (Member Author) commented:

(@jbrockmendel this is kind of waiting on more feedback from you or others)

@jbrockmendel (Member) commented:

IIUC it is mainly just in maybe_cast_to_extension_array that from_scalars is needed. Are there others?

Did you address this? Looking back on my inline comments, this looks important.

@jbrockmendel (Member) commented:

Looked back over this and #33254. Thoughts:

1. We should choose a name that makes it really clear what this is for and what distinguishes it from _from_sequence.
2. If we could avoid needing another constructor that would be nice. One way to do that would be to make _from_sequence strict. The barriers to that AFAICT are:
   a) DTA/TDA getting special treatment inside pd.array (similar behavior shows up in DatetimeIndex.__new__ and the Series constructor). Could use to_datetime/to_timedelta instead of _from_sequence for those. This has the normal downsides that come with special-casing.
   b) astype machinery might want to be more forgiving.
      i) Do we have a good idea of what cases we would want to allow in astype but not in _from_sequence?

@jorisvandenbossche (Member Author) commented Jan 29, 2021

Sorry for the slow reply here.

IIUC it is mainly just in maybe_cast_to_extension_array that from_scalars is needed. Are there others?

That certainly has been the use case driving the discussion, yes (and maybe_cast_to_extension_array is eventually used in the python fallback in groupby and in the combine method).

Now, in this PR I am using it in a bunch of other places as well: places where we know we will only have scalars of exactly the correct type. But it should maybe be discussed whether that's actually useful. Because assuming that _from_sequence will also give the correct result in those cases, using _from_scalars instead only makes sense if it is, for example, faster. In theory _from_scalars implementations could be faster because they have to deal with less variable input, but in practice making them strict might also have a performance cost.

We should choose a name that makes it really clear what this is for and what distinguishes it from _from_sequence

For me, _from_scalars is clear in that regard (it's for reconstructing from scalars as you would get them from the array). But alternative suggestions are certainly welcome.

One way to do that would be to make _from_sequence strict. The barriers to that AFAICT are .. Do we have a good idea of what cases we would want to allow in astype but not in _from_sequence?

In general, both pd.array(..) and astype(..) are not strict (and I think both cases should probably allow the same conversions). For example, we allow casting floats to ints or ints to floats, and casting ints or strings to datetime or the other way around, etc.
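
Concretely, conversions like these are allowed today and should stay allowed:

In [1]: pd.Series([1.0, 2.0]).astype("Int64")   # float -> nullable int
Out[1]: 
0    1
1    2
dtype: Int64

In [2]: pd.Series(["2020-01-01"]).astype("datetime64[ns]")   # string -> datetime
Out[2]: 
0   2020-01-01
dtype: datetime64[ns]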

@github-actions bot commented Mar 1, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Mar 1, 2021
@jbrockmendel (Member) commented:

Now, in this PR, I am using it in a bunch of other places as well. Places where we know we will only have scalars of exact correct type.

This may just be a semantic misunderstanding. Is there a distinction between, e.g. a list of Period objects vs an ndarray[object] of Period objects vs a PeriodArray? My intuition is that the PeriodArray would be a use case for from_sequence and the others would be for from_scalars.
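
For concreteness, the three shapes being contrasted (illustrative only):

import numpy as np
import pandas as pd

pers_list = [pd.Period("2020-01", freq="M"), pd.Period("2020-02", freq="M")]
pers_objarr = np.array(pers_list, dtype=object)   # ndarray[object] of Periods
pers_ea = pd.array(pers_list, dtype="period[M]")  # already a PeriodArray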

For the purposes of review, I would find it much easier if you changed _from_sequence usages to _from_scalars only in the places where the difference in behavior matters.


Will this make period_array unnecessary? I don't like the idea of ending up with 3 constructors for PeriodArray.

@jbrockmendel (Member) commented:

from #38315 (comment)

so we might want a version [of a Categorical constructor] of that which is strict instead of coercing to NaN.

#40996 proposes making the Categorical constructor stricter. On last week's call there was consensus to try changing it to see if it breaks the world. It turns out to break 220 tests.

I think we do want to support having some way of taking e.g. ["A", "B", "D"] -> Categorical(["A", "B", NaN], categories=["A", "B", "C"]). The two options that come to mind are astype or _from_sequence. In one or both (i think i prefer just one) we are going to need a keyword (or second method like in this PR) to make it either more or less strict.

ATM I'm leaning towards preferring a keyword in astype, either errors='raise' like we use elsewhere, or casting='unsafe' to follow numpy. I may revisit this opinion as I deal with more of our corner-y constructor logic.

@jorisvandenbossche Not sure if this really helps with this PR, but wanted to keep you apprised of my thinking on the matter.

@simonjayhawkins (Member) commented:

removing the 1.3 milestone

@simonjayhawkins simonjayhawkins removed this from the 1.3 milestone May 24, 2021
@jreback (Contributor) commented Nov 28, 2021

this is quite old, happy to reopen if actively worked on.

@jbrockmendel (Member) commented:

another thought: we could specify what exceptions _from_scalars is allowed to raise, to avoid except Exception in maybe_cast_to_extension_array
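
That could look like this (sketch; the idea that TypeError is the sanctioned exception is an assumption):

# in maybe_cast_to_extension_array: catch only the sanctioned exception type,
# so genuine bugs propagate instead of being silently swallowed
try:
    return cls._from_scalars(obj, dtype=dtype)
except TypeError:
    # _from_scalars signals "these scalars don't fit this dtype" via TypeError
    return obj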

@simonjayhawkins (Member) commented:

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.
