API: add EA._from_scalars / stricter casting of result values back to EA dtype #38315
Conversation
So some difficult cases I ran into:

1. Integer/BooleanArray gets cast to floats because of NAs and thus gives float results, even for ops that are potentially dtype-preserving

In theory, if we are strict in _from_scalars, float values should not be accepted for an integer or boolean dtype. However, when you have missing values and operate in numpy, you will typically operate on the array converted to a float array, and thus also get float results, even for ops that are potentially dtype-preserving.

In [1]: arr = pd.array([True, True, pd.NA, pd.NA, False, False, True], dtype="boolean")
In [2]: df = pd.DataFrame({"A": [1, 1, 2, 2, 3, 3, 1], "B": arr})
In [3]: df.groupby("A").sum()
Out[3]:
B
A
1 3.0
2 0.0
3 0.0

With the current implementation, but without trying to cast back, or when being strict in what IntegerArray._from_scalars accepts, the results are floats (like above). See pandas/core/dtypes/cast.py, lines 364 to 365 at 7073ee1. But that still means that either Int64 … Something similar happens in the failing …

2. Data types with different precisions

For the integer and floating data types, we are currently not always strict to the exact dtype instance, i.e. the precision (eg in …). Depending on the result, we also might not want to cast back to eg Int8, if the results would not fit in the int8 range, but would still rather want to use Int64 (to preserve nullability) instead of falling back to numpy int64. An example:

In [1]: df = pd.DataFrame({"key": [1, 1, 2, 2], "B": pd.array([1, 2, 3, 4], dtype="Int8")})
# "first" is a dtype-preserving op, so this is fine
In [2]: df.groupby("key")["B"].first()
Out[2]:
key
1 1
2 3
Name: B, dtype: Int8
# "sum" should actually always use Int64, and not preserve Int8
# but this is something we know in this case, and so can force
# (this is being fixed in https://github.com/pandas-dev/pandas/pull/38291)
In [3]: df.groupby("key")["B"].sum()
Out[3]:
key
1 3
2 7
Name: B, dtype: Int8
# For a UDF, we can try to cast back, and if it fails use the original result, in this case
# an int64 numpy array (which is what happened here). The reason it fails here is because
# np.int64 scalars are disallowed in _from_scalars with Int8 dtype (we check instances of `dtype.type`).
# However, we could also be more flexible here and detect that we get an int64 ndarray, and use Int64
# instead of strict adherence to Int8
In [4]: df.groupby("key")["B"].aggregate(lambda x: x.sum())
Out[4]:
key
1 3
2 7
Name: B, dtype: int64

So in "theory", the idea of …

3. Sparse subtype

See the example described here: #33254 (comment), about SparseArray trying to preserve the "sparse" aspect, and not necessarily the exact subtype.

In [3]: a = pd.arrays.SparseArray([0, 1, 0])
In [4]: s = pd.Series(a)
In [5]: s
Out[5]:
0 0
1 1
2 0
dtype: Sparse[int64, 0]
# on master, you get:
In [31]: s.combine(1, lambda x, y: x == y)
Out[31]:
0 False
1 True
2 False
dtype: Sparse[bool, False]
# with this PR, however, I currently get:
In [6]: s.combine(1, lambda x, y: x == y)
Out[6]:
0 False
1 True
2 False
dtype: bool

So this is actually the same underlying case as case 2 (the precision of the integer/floating dtypes), but for sparse it's broader, as it is a "composite" dtype (it has a subtype), so there are many more possibilities compared to just the precision of the integer/floating dtypes.

4. Object type or subdtype

In general, we are not going to be able to be strict when dealing with object types. We do have the issue for the object-dtype PandasArray, but that's not an EA that is actually used in practice at the moment. So that's probably something we can defer until later, if at some point we get a real "object" EA dtype (and in that case, it will probably be fine to always infer the dtype of a potential result and never try to preserve it, since it's object anyway: it would by definition always be possible to preserve the object dtype).
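To make the intended usage concrete, here is a rough sketch (try_cast_back is a hypothetical helper for illustration, not the actual implementation in this PR) of how a strict _from_scalars enables "try to preserve the original dtype, but never coerce":

def try_cast_back(result, original_dtype):
    # Try to restore the original EA dtype from the result values,
    # but never coerce values that are not valid scalars for that dtype.
    cls = original_dtype.construct_array_type()
    try:
        return cls._from_scalars(result, dtype=original_dtype)
    except (TypeError, ValueError):
        # the result values are not scalars of this dtype
        # -> keep the (inferred) result as-is
        return result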
# Once this is fixed, the commented code can be uncommented
# -> https://github.com/pandas-dev/pandas/issues/38316
# expected = CategoricalIndex(index.values)
expected = CategoricalIndex(np.asarray(index.values))
I opened an issue for this -> #38316 (it's also mentioned in the comment above)
# result_dtype = maybe_cast_result_dtype(dtype, how)
# if result_dtype is not None:
#     # we know what the result dtype needs to be -> be more permissive in casting
#     # (eg ints with nans became floats)
#     cls = result_dtype.construct_array_type()
#     return cls._from_sequence(obj, dtype=result_dtype)
This is related to the first case mentioned (when we know the resulting dtype, we should maybe force the result back, instead of relying on strict casting).

So the above code is one possibility: we can change maybe_cast_result_dtype to only return a dtype if it knows what the dtype should be, and otherwise return None (instead of passing through the original dtype). This way, we can take a different path for "known" dtypes vs when guessing the dtype.
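A minimal sketch of that idea (assuming the current rules for sum-like ops on nullable integer/boolean dtypes; not the actual implementation):

from pandas import Int64Dtype
from pandas.api.extensions import ExtensionDtype
from pandas.api.types import is_bool_dtype, is_integer_dtype

def maybe_cast_result_dtype(dtype, how):
    # Return a dtype only when the result dtype for this op is known a priori
    # (here: sum-like ops on nullable integer/boolean dtypes); otherwise return
    # None, so callers fall back to the strict _from_scalars path instead of
    # forcing a cast.
    if how in ["add", "cumsum", "sum", "prod"]:
        if isinstance(dtype, ExtensionDtype) and (
            is_integer_dtype(dtype) or is_bool_dtype(dtype)
        ):
            return Int64Dtype()
    return None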
Big picture, how does this relate to _from_sequence? Would we be doing this instead of making _from_sequence strict, or do both?
@@ -188,6 +188,12 @@ class ExtensionArray:
    # Constructors
    # ------------------------------------------------------------------------

    @classmethod
    def _from_scalars(cls, data, dtype):
        if not all(isinstance(v, dtype.type) or isna(v) for v in data):
isna -> is_valid_nat_for_dtype? (still need to rename to is_valid_na_for_dtype)
this has been renamed to the clearer is_valid_na_for_dtype
# if not all(
#     isinstance(v, dtype.categories.dtype.type) or isna(v) for v in data
# ):
#     raise TypeError("Requires dtype scalars")
if not all(x in dtype.categories or is_valid_nat_for_dtype(x, dtype.categories.dtype) for x in data)?
Yes, as mentioned somewhere in #38315 (comment), for categorical we probably want to check if the values are valid categories.
We might already have some functionality for this? Like the main constructor, but then raising an error instead of coercing unknown values to NaN:
In [17]: pd.Categorical(["a", "b", "c"], categories=["a", "b"])
Out[17]:
['a', 'b', NaN]
Categories (2, object): ['a', 'b']
The above is basically done by _get_codes_for_values, so we might want a version of that which is strict instead of coercing to NaN.
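For illustration, a minimal sketch of such a strict check (strict_categorical_from_scalars is a hypothetical helper, using pd.isna instead of the internal NA helpers):

import pandas as pd

def strict_categorical_from_scalars(data, dtype):
    # Like the Categorical constructor, but raise for values that are not
    # existing categories instead of silently coercing them to NaN.
    if not all(x in dtype.categories or pd.isna(x) for x in data):
        raise TypeError("values are not all valid categories for this dtype")
    return pd.Categorical(data, dtype=dtype)

cat_dtype = pd.CategoricalDtype(categories=["a", "b"])
strict_categorical_from_scalars(["a", "b", None], cat_dtype)   # works
# strict_categorical_from_scalars(["a", "c"], cat_dtype)       # -> TypeError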
I think Categorical._validate_setitem_value does what you're describing
# override because dtype.type is only the numpy scalar
# TODO accept float here?
if not all(
    isinstance(v, (int, dtype.type, float, np.float_)) or isna(v) for v in data
for floats, require that v.is_integer()?
The coerce_to_array function that is used by _from_sequence already checks for this as well (so we can pass any float through here, as it will be caught later).
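For example (illustrating the existing coercion behaviour as I understand it, not new code in this PR):

import pandas as pd

# whole floats are accepted and cast safely by the _from_sequence path
pd.array([1.0, 2.0], dtype="Int8")

# non-integral floats are rejected by the coercion step
# pd.array([1.5], dtype="Int8")   # -> raises: cannot safely cast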
DTA/TDA/PA have a _recognized_scalars attribute that could be useful
For your inline comments (up to now): can we keep it more high-level for now (the semantics of _from_scalars), rather than the exact implementations?
sure
@jbrockmendel sorry about my brief comment yesterday; to be clear, your review comments are certainly useful for when we optimize those implementations (it's just that I haven't yet put any effort into that)
It relates to …
The astype discussion is making me more positive on this. Is it ready for another look?
It's in the same state as the previous time you looked (I only merged master since). Feedback is mostly needed on the semantics (and not necessarily on the actual implementation, as those are mostly dummy implementations that should still be optimized), and on all the questions I raised above in #38315 (comment), before I can work further on it.
@@ -2909,7 +2909,7 @@ def transpose(self, *args, copy: bool = False) -> DataFrame:
         arr_type = dtype.construct_array_type()
         values = self.values

-        new_values = [arr_type._from_sequence(row, dtype=dtype) for row in values]
+        new_values = [arr_type._from_scalars(row, dtype=dtype) for row in values]
in this case we already know we have the correct types, so can't we go directly to _from_sequence?
@@ -965,7 +965,7 @@ def fast_xs(self, loc: int) -> ArrayLike:
             result[rl] = blk.iget((i, loc))

         if isinstance(dtype, ExtensionDtype):
-            result = dtype.construct_array_type()._from_sequence(result, dtype=dtype)
+            result = dtype.construct_array_type()._from_scalars(result, dtype=dtype)
here too, shouldn't we know we have the correct types?
IIUC it is mainly just in maybe_cast_to_extension_array that _from_scalars is needed. Are there others? Supposing we had a fully-general solution to #37367, would that render _from_scalars unnecessary? What are your thoughts on how this affects _from_sequence? e.g. it could become non-strict and be used for astype?
Do you mean the ability to generally use …? I would need to think a bit more about it, but yes, that's certainly related. One concern that comes to mind is that you can have conflicting dtypes that use the same scalar type. So even then you might still want to give the original dtype the chance to "recreate" itself with a method like _from_scalars.
See #38315 (comment)?
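As an illustration of that concern (not an example from the PR itself): the same scalars can be valid for several dtypes, so the dtype cannot be recovered from the scalars alone:

import pandas as pd

scalars = [1, 2, pd.NA]

# both of these dtypes accept exactly the same scalars
pd.array(scalars, dtype="Int8")
pd.array(scalars, dtype="Int64")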
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
(@jbrockmendel this is kind of waiting on more feedback from you or others)
Did you address this? Looking back on my inline comments, this looks important.
Looked back over this and #33254. Thoughts: …
Sorry for the slow reply here.
That certainly has been the use case driving the discussion, yes (and …). Now, in this PR, I am using it in a bunch of other places as well: places where we know we will only have scalars of the exact correct type. But whether that's actually useful should maybe be discussed, because assuming that …
For me …
In general both …
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
This may just be a semantic misunderstanding. Is there a distinction between, e.g., a list of Period objects vs an ndarray[object] of Period objects vs a PeriodArray? My intuition is that the PeriodArray would be a use case for _from_sequence and the others would be for _from_scalars.

For the purposes of review, I would find it much easier if you changed _from_sequence usages to _from_scalars only in the places where the difference in behavior matters.

Will this make period_array unnecessary? I don't like the idea of ending up with 3 constructors for PeriodArray.
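To make the three inputs in that question concrete (an illustration only, using Periods as in the comment above):

import numpy as np
import pandas as pd

periods = [pd.Period("2021-01", freq="M"), pd.Period("2021-02", freq="M")]

list_of_scalars = periods                          # plain list of Period scalars
object_ndarray = np.array(periods, dtype=object)   # ndarray[object] of Period scalars
period_array = pd.array(periods)                   # already a PeriodArray (period[M] dtype)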
from #38315 (comment)
#40996 proposes making the Categorical constructor stricter. On last week's call there was consensus to try changing it to see if it breaks the world. It turns out to break 220 tests. I think we do want to support having some way of taking e.g. … ATM I'm leaning towards preferring a keyword in …

@jorisvandenbossche Not sure if this really helps with this PR, but wanted to keep you apprised of my thinking on the matter.
removing the 1.3 milestone
this is quite old; happy to reopen if actively worked on.
another thought: we could specify what exceptions _from_scalars is allowed to raise, to avoid …
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.
Closes #33254
Closes #31108
This adds an ExtensionArray._from_scalars function which is more strict than the existing _from_sequence constructor: in theory, it should only accept a sequence of scalars of its own type (i.e. the type you get with arr[0], plus NAs), or arrays with the appropriate dtype. This can then be used in a context where we try to create an EA of the original dtype (in an attempt to preserve the original dtype), but we don't want to coerce any value to that dtype.
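A hedged illustration of the intended contrast, using the nullable boolean dtype (the strict behaviour shown is the goal; the implementations in this PR are still dummies):

import pandas as pd
from pandas.core.arrays import BooleanArray

# the existing constructor (via _from_sequence / coerce_to_array) coerces
# 0/1 floats to the nullable boolean dtype:
pd.array([1.0, 0.0, pd.NA], dtype="boolean")

# the proposed _from_scalars should refuse this, since 1.0 and 0.0 are not
# boolean scalars; only True/False/NA scalars would be accepted:
# BooleanArray._from_scalars([1.0, 0.0, pd.NA], dtype=pd.BooleanDtype())  # -> TypeError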
Note: this is a WIP PR, which mostly adds dummy implementations of this method, just to mimic the desired behaviour for now, to be able to explore how this would work.
If this turns out to work, we will need to make better implementations of _from_scalars for our own EAs. Apart from that, this needs tests, docs, etc. But let's first discuss semantics.