-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement NA.__array_ufunc__ #30245
Implement NA.__array_ufunc__ #30245
Conversation
This gives us consistent comparisions with NumPy scalars.
46f2327 restores the behavior on master. We return an object-dtype ndarray filled with NA values. Not sure if that's the right choice though. |
Anybody else seeing the CI check failure when linting
Not able to reproduce locally with those versions. |
All green now. Any thoughts on what, e.g. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will review soon
Another option might also be to raise right now? (or is that also surprising?) That might be the safest option to allow choosing a more specific behaviour later. Otherwise returning an object array of NAs seems OK to me. But from numpy's point of view, such an object array is also strange, as you can't do anything boolean with it like boolean indexing. |
Right. An ndarray full of pd.NA is a fairly useless thing I think. That's my main motivation for considering anything fancy like returning an IntegerArray, etc. We do have precedence for returning non-ndarrays from In [8]: a = pd.array([2000, 2001], dtype="Period[D]")
In [9]: np.add(a, 1)
Out[9]: array([Period('2000-01-02', 'D'), Period('2001-01-02', 'D')], dtype=object) Raising for ndarrays is also fine by me. Just need to choose whether a 0-dim ndarray is an array or scalar (probably a scalar). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(only looked at the first part)
@@ -290,12 +291,22 @@ def _create_binary_propagating_op(name, divmod=False): | |||
|
|||
def method(self, other): | |||
if (other is C_NA or isinstance(other, str) | |||
or isinstance(other, (numbers.Number, np.bool_))): | |||
or isinstance(other, (numbers.Number, np.bool_, np.int64, np.int_)) | |||
or isinstance(other, np.ndarray) and not other.shape): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a bit strange, but I suppose it is to follow numpy behaviour of 0-dim arrays returning scalars from comparison operations? (if so, maybe add a comment about that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to follow numpy behaviour of 0-dim arrays returning scalars from comparison operations?
It's to NumPy scalars. Without this, we'd have
np.int64(1) == pd.NA
raise with
> out[:] = NA
E IndexError: too many indices for array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A numpy scalar (which is different from a 0-dim array) is not a ndarray, so wouldn't pass the isinstance(other, np.ndarray)
test. So I don't fully understand how that example is related.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could call lib.item_from_zerodim at the top?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger can you check this thread? I still don't understand how this relates to scalars, as they shouldn't pass the isinstance(other, np.ndarray)
check. So not sure the below comment is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this
diff --git a/pandas/_libs/missing.pyx b/pandas/_libs/missing.pyx
index 8d4d2c5568..e2bb6448cd 100644
--- a/pandas/_libs/missing.pyx
+++ b/pandas/_libs/missing.pyx
@@ -295,8 +295,7 @@ def _create_binary_propagating_op(name, is_divmod=False):
def method(self, other):
if (other is C_NA or isinstance(other, str)
- or isinstance(other, (numbers.Number, np.bool_))
- or isinstance(other, np.ndarray) and not other.shape):
+ or isinstance(other, (numbers.Number, np.bool_))):
# Need the other.shape clause to handle NumPy scalars,
# since we do a setitem on `out` below, which
# won't work for NumPy scalars.
I have
In [3]: np.int64(1) == pd.NA
/Users/taugspurger/.virtualenvs/pandas-dev/bin/ipython:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
#!/Users/taugspurger/Envs/pandas-dev/bin/python
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-3-209227154583> in <module>
----> 1 np.int64(1) == pd.NA
~/sandbox/pandas/pandas/_libs/missing.pyx in pandas._libs.missing._create_binary_propagating_op.method()
307 elif isinstance(other, np.ndarray):
308 out = np.empty(other.shape, dtype=object)
--> 309 out[:] = NA
310
311 if is_divmod:
IndexError: too many indices for array
So when we get there, it really does seem like np.int64(1)
is an ndarray. Is cython doing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So when we get there, it really does seem like np.int64(1) is an ndarray. Is cython doing something?
Possibly, no idea. But I trust you this is indeed needed to get it working
pandas/_libs/missing.pyx
Outdated
@@ -434,6 +449,34 @@ class NAType(C_NAType): | |||
|
|||
__rxor__ = __xor__ | |||
|
|||
# What else to add here? datetime / Timestamp? Period? Interval? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Timestamp et al don't implement __array_ufunc__
right? If they would, I would say they could handle NA instead of NA handling all those types..
(I also think it is not that important right now that pd.NA can handle those types, since we don't yet use NA for those dtypes)
What you show is actually returning a numpy object array (but I would agree that should/could return a PeriodArray) |
Oh... it's too early here :) I could have sworn we returned an EA there. |
Thanks for the review @jorisvandenbossche. Things should be good here. I've deferred on returning anything too exotic from NA comparisons with ndarrays. We just return an ndarray of NA now. |
should getting this in for 1.0 be a priority? |
This should be good to go, assuming CI passes. |
All green. |
Planning to merge this later today if there aren't any objections. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, except for a question.
|
||
def maybe_dispatch_ufunc_to_dunder_op( | ||
object self, object ufunc, str method, *inputs, **kwargs | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for functions that arent cdef
or cpdef
, we can use python syntax for annotations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. @jbrockmendel
comment about L482 vs L483, otherwise LGTM |
Merging in a few hours if there aren't any objections. Can revisit the line in https://github.com/pandas-dev/pandas/pull/30245/files/0f4e121a229e3ede7d3b0fa4a974f5e2dcc8e30e#diff-c4b0cf199cc6d5bc2362ceb737159458 if needed. |
Checks are failing: the benchmark run failure seems unrelated, the typing failure looks unrelated as well but I am not very knowledgeable about that. |
thanks @TomAugspurger very nice! |
This gives us consistent comparisons with NumPy scalars.
I have a few implementation questions:
maybe_dispatch_ufunc_to_dunder_op
. That was incore/ops/dispatch.py
, but NA is defend in_libs/missing.pyx
. For now I just moved it to_libs/missing.pyx
, is there a better home for it? I don't think_libs/ops.pyx
is an option, as it imports missing.a. Return an
ndarray[object]
, where each element is NAb. Really take over and return our extension arrays. So
np.array([1, 2]) == pd.NA
would return anIntegerArray([NA, NA])
. Boolean results would return a BooleanArray. And float results would return a ... PandasArray? (so maybe this isn't a good idea. But worth considering)3. What scalars should we handle (in
__array_ufunc__
and in the regular methods)? Any of date time, Timestamp, time delta, Period, Interval?One item 2, I seem to have messed something up, as we return just a scalar
NA
right now. Will want to sort that out.