Implement NA.__array_ufunc__ #30245

TomAugspurger · 2019-12-12T21:48:00Z

This gives us consistent comparisons with NumPy scalars.

I have a few implementation questions:

1. I wanted to reuse maybe_dispatch_ufunc_to_dunder_op. That was in core/ops/dispatch.py, but NA is defined in _libs/missing.pyx. For now I just moved it to _libs/missing.pyx; is there a better home for it? I don't think _libs/ops.pyx is an option, as it imports missing.
2. How should we handle comparisons with ndarrays? I see a few options:

a. Return an ndarray[object], where each element is NA
b. Really take over and return our extension arrays. So np.array([1, 2]) == pd.NA would return an IntegerArray([NA, NA]). Boolean results would return a BooleanArray. And float results would return a ... PandasArray? (so maybe this isn't a good idea. But worth considering)
3. What scalars should we handle (in __array_ufunc__ and in the regular methods)? Any of datetime, Timestamp, timedelta, Period, Interval?

On item 2, I seem to have messed something up, as we return just a scalar NA right now. Will want to sort that out.

TomAugspurger · 2019-12-13T16:24:03Z

How should we handle comparisons with ndarrays?

46f2327 restores the behavior on master. We return an object-dtype ndarray filled with NA values. Not sure if that's the right choice though.
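To make the "object-dtype ndarray filled with NA" option concrete, here is a minimal sketch using a toy sentinel. `ToyNA` and `toy_na` are hypothetical names for illustration only; this is not pandas' implementation of `pd.NA`, just one way an `__array_ufunc__` hook could produce that result. It assumes a NumPy version that provides `np.broadcast_shapes`.

```python
import numpy as np


class ToyNA:
    """Hypothetical stand-in for pd.NA; not pandas' implementation."""

    def __repr__(self):
        return "<ToyNA>"

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only handle plain calls such as np.equal(a, b); defer otherwise.
        if method != "__call__" or kwargs.get("out") is not None:
            return NotImplemented
        # Broadcast the shapes of all inputs other than the sentinel itself.
        shapes = [np.shape(x) for x in inputs if x is not self]
        shape = np.broadcast_shapes(*shapes) if shapes else ()
        if shape == ():
            # Scalars and 0-dim arrays propagate to the scalar sentinel.
            return self
        out = np.empty(shape, dtype=object)
        out[...] = self  # fill every slot with the sentinel
        return out


toy_na = ToyNA()
result = np.array([1, 2]) == toy_na      # object-dtype array of sentinels
scalar_result = np.int64(1) == toy_na    # the sentinel itself
```

The array case yields a two-element object array whose entries are all the same sentinel, which is the shape of result described above.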

TomAugspurger · 2019-12-16T19:15:48Z

Anybody else seeing the CI check failure when linting .pyx files?

flake8 --version
3.7.9 (flake8-comprehensions: 3.1.4, mccabe: 0.6.1, pycodestyle: 2.5.0, pyflakes: 2.1.1) CPython 3.7.3 on Linux
Linting .py code
Linting .py code DONE
Linting .pyx code
Traceback (most recent call last):
  File "/home/runner/miniconda3/envs/pandas-dev/bin/flake8", line 11, in <module>
    sys.exit(main())

 ...

    self.formatter.handle(error)
  File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/flake8/formatting/base.py", line 91, in handle
    line = self.format(error)
  File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/flake8/formatting/default.py", line 39, in format
    "col": error.column_number,
ValueError: unsupported format character ':' (0x3a) at index 41

Not able to reproduce locally with those versions.

TomAugspurger · 2019-12-17T17:26:09Z

All green now. Any thoughts on what, e.g. np.array([True, False]) | pd.NA should return? Our choices are an object-dtype ndarray, or a BooleanArray. Is it strange to return a non-ndarray?
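For comparison, this is roughly what the BooleanArray alternative looks like when the left operand is already a nullable boolean array rather than an ndarray. The snippet assumes a pandas release that ships the "boolean" extension dtype; it does not show what the ndarray case returns.

```python
import pandas as pd

# A nullable boolean array combined with NA follows Kleene logic:
# True | NA is True, while False | NA is NA.
mask = pd.array([True, False], dtype="boolean")
result = mask | pd.NA
print(result)
```

The result stays a BooleanArray, so it can still be used as a mask after the missing entries are dealt with (for example via `fillna`).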

jreback

will review soon

doc/source/user_guide/missing_data.rst

jorisvandenbossche · 2019-12-18T12:41:07Z

How should we handle comparisons with ndarrays? I see a few options:

Another option might also be to raise right now? (or is that also surprising?) That might be the safest option to allow choosing a more specific behaviour later.

Otherwise returning an object array of NAs seems OK to me. But from numpy's point of view, such an object array is also strange, as you can't do anything boolean with it like boolean indexing.

TomAugspurger · 2019-12-18T12:48:49Z

But from numpy's point of view, such an object array is also strange, as you can't do anything boolean with it like boolean indexing.

Right. An ndarray full of pd.NA is a fairly useless thing I think. That's my main motivation for considering anything fancy like returning an IntegerArray, etc.
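A small sketch of the limitation being discussed: an object-dtype array cannot be used as a boolean mask. The `sentinel` object below is a hypothetical stand-in for `pd.NA`, so the example needs only NumPy.

```python
import numpy as np

sentinel = object()  # hypothetical stand-in for pd.NA
mask = np.empty(3, dtype=object)
mask[:] = sentinel   # an object array "full of NA"

data = np.array([10, 20, 30])
try:
    data[mask]
    indexing_worked = True
except IndexError:
    # NumPy only accepts integer or boolean arrays as indices.
    indexing_worked = False

print(indexing_worked)
```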

We do have precedent for returning non-ndarrays from __array_ufunc__.

In [8]: a = pd.array([2000, 2001], dtype="Period[D]")

In [9]: np.add(a, 1)
Out[9]: array([Period('2000-01-02', 'D'), Period('2001-01-02', 'D')], dtype=object)

Raising for ndarrays is also fine by me. Just need to choose whether a 0-dim ndarray is an array or scalar (probably a scalar).

jorisvandenbossche

(only looked at the first part)

doc/source/user_guide/missing_data.rst

pandas/_libs/missing.pyx

jorisvandenbossche · 2019-12-18T12:50:11Z

pandas/_libs/missing.pyx

@@ -290,12 +291,22 @@ def _create_binary_propagating_op(name, divmod=False):

    def method(self, other):
        if (other is C_NA or isinstance(other, str)
-                or isinstance(other, (numbers.Number, np.bool_))):
+                or isinstance(other, (numbers.Number, np.bool_, np.int64, np.int_))
+                or isinstance(other, np.ndarray) and not other.shape):


This seems a bit strange, but I suppose it is to follow numpy behaviour of 0-dim arrays returning scalars from comparison operations? (if so, maybe add a comment about that)

to follow numpy behaviour of 0-dim arrays returning scalars from comparison operations?

It's to handle NumPy scalars. Without this, we'd have

np.int64(1) == pd.NA

raise with

>       out[:] = NA
E       IndexError: too many indices for array

A numpy scalar (which is different from a 0-dim array) is not a ndarray, so wouldn't pass the isinstance(other, np.ndarray) test. So I don't fully understand how that example is related.

could call lib.item_from_zerodim at the top?
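The distinction drawn in this thread can be checked directly. The helper `unwrap_zero_dim` below is a hypothetical, NumPy-only analogue of the idea behind `lib.item_from_zerodim` (unwrap a 0-dim array before the type checks); it is not the pandas function and may differ from it in the type it returns.

```python
import numpy as np


def unwrap_zero_dim(value):
    """Hypothetical helper: turn a 0-dim ndarray into the scalar it holds."""
    if isinstance(value, np.ndarray) and value.ndim == 0:
        return value.item()
    return value


scalar = np.int64(1)    # NumPy scalar
zero_dim = np.array(1)  # 0-dim ndarray

scalar_is_array = isinstance(scalar, np.ndarray)      # NumPy scalars are not ndarrays
zero_dim_is_array = isinstance(zero_dim, np.ndarray)  # 0-dim arrays are
empty_shape_is_falsy = not zero_dim.shape             # shape is the empty tuple ()
unwrapped = unwrap_zero_dim(zero_dim)
```

With an unwrap step like this at the top of the method, the `isinstance(other, np.ndarray) and not other.shape` clause would presumably not be needed, since 0-dim inputs would already look like scalars.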

@TomAugspurger can you check this thread? I still don't understand how this relates to scalars, as they shouldn't pass the isinstance(other, np.ndarray) check. So not sure the below comment is correct.

With this

diff --git a/pandas/_libs/missing.pyx b/pandas/_libs/missing.pyx
index 8d4d2c5568..e2bb6448cd 100644
--- a/pandas/_libs/missing.pyx
+++ b/pandas/_libs/missing.pyx
@@ -295,8 +295,7 @@ def _create_binary_propagating_op(name, is_divmod=False):
     def method(self, other):
         if (other is C_NA or isinstance(other, str)
-                or isinstance(other, (numbers.Number, np.bool_))
-                or isinstance(other, np.ndarray) and not other.shape):
+                or isinstance(other, (numbers.Number, np.bool_))):
             # Need the other.shape clause to handle NumPy scalars,
             # since we do a setitem on `out` below, which
             # won't work for NumPy scalars.

I have

In [3]: np.int64(1) == pd.NA
/Users/taugspurger/.virtualenvs/pandas-dev/bin/ipython:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
  #!/Users/taugspurger/Envs/pandas-dev/bin/python
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-3-209227154583> in <module>
----> 1 np.int64(1) == pd.NA

~/sandbox/pandas/pandas/_libs/missing.pyx in pandas._libs.missing._create_binary_propagating_op.method()
    307         elif isinstance(other, np.ndarray):
    308             out = np.empty(other.shape, dtype=object)
--> 309             out[:] = NA
    310
    311         if is_divmod:

IndexError: too many indices for array

So when we get there, it really does seem like np.int64(1) is an ndarray. Is cython doing something?

So when we get there, it really does seem like np.int64(1) is an ndarray. Is cython doing something?

Possibly, no idea. But I trust you this is indeed needed to get it working
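One way to investigate the open question is to record what NumPy actually passes to `__array_ufunc__` when a NumPy scalar is compared with an object that defines the hook. The `Recorder` class is a hypothetical probe, not part of pandas or NumPy. A possible explanation, which the thread does not confirm and which is only an assumption here, is that NumPy itself wraps the scalar in a 0-dim array before calling the ufunc; running the probe on a given NumPy version shows whether that is the case.

```python
import numpy as np


class Recorder:
    """Hypothetical probe: records what NumPy hands to __array_ufunc__."""

    def __init__(self):
        self.seen = []

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Remember the type name and shape of every input except the probe.
        for x in inputs:
            if x is not self:
                self.seen.append((type(x).__name__, np.shape(x)))
        return "handled"


probe = Recorder()
outcome = np.int64(1) == probe
print(outcome, probe.seen)
```

If the recorded type name is `ndarray` with shape `()`, the extra `not other.shape` clause is guarding against exactly that input.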

jorisvandenbossche · 2019-12-18T12:54:44Z

pandas/_libs/missing.pyx

@@ -434,6 +449,34 @@ class NAType(C_NAType):

    __rxor__ = __xor__

+    # What else to add here? datetime / Timestamp? Period? Interval?


Timestamp et al. don't implement __array_ufunc__, right? If they did, I would say they could handle NA instead of NA handling all those types.

(I also think it is not that important right now that pd.NA can handle those types, since we don't yet use NA for those dtypes)

jorisvandenbossche · 2019-12-18T12:57:15Z

We do have precedent for returning non-ndarrays from __array_ufunc__.

What you show is actually returning a numpy object array (but I would agree that should/could return a PeriodArray)

TomAugspurger · 2019-12-18T13:12:30Z

Oh... it's too early here :) I could have sworn we returned an EA there.

TomAugspurger · 2019-12-20T12:39:33Z

Thanks for the review @jorisvandenbossche. Things should be good here.

I've deferred on returning anything too exotic from NA comparisons with ndarrays. We just return an ndarray of NA now.

jbrockmendel · 2019-12-21T03:29:30Z

should getting this in for 1.0 be a priority?

TomAugspurger · 2019-12-30T15:02:45Z

This should be good to go, assuming CI passes.

TomAugspurger · 2019-12-30T17:31:19Z

All green.

TomAugspurger · 2019-12-31T12:28:46Z

Planning to merge this later today if there aren't any objections.

jreback

lgtm, except for a question.

pandas/_libs/ops_dispatch.pyx

jbrockmendel · 2019-12-31T20:22:49Z

pandas/_libs/ops_dispatch.pyx

+
+def maybe_dispatch_ufunc_to_dunder_op(
+    object self, object ufunc, str method, *inputs, **kwargs
+):


For functions that aren't cdef or cpdef, we can use Python syntax for annotations.
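For readers unfamiliar with the helper under review, here is a heavily simplified sketch of what "dispatching a ufunc to a dunder op" can mean, written with plain Python annotations as suggested. The names `dispatch_ufunc_to_dunder`, `UFUNC_TO_DUNDER`, `REVERSED`, and `Meters` are hypothetical, the mapping is deliberately tiny, and this is not the code in pandas/_libs/ops_dispatch.pyx.

```python
import numpy as np

# Hypothetical, abbreviated mapping from ufunc names to operator dunders.
UFUNC_TO_DUNDER = {"add": "__add__", "equal": "__eq__", "greater": "__gt__"}
# Method to call when `self` is the right-hand operand.
REVERSED = {"__add__": "__radd__", "__eq__": "__eq__", "__gt__": "__lt__"}


def dispatch_ufunc_to_dunder(self, ufunc, method: str, *inputs, **kwargs):
    """Route a two-input ufunc call to the matching operator method."""
    dunder = UFUNC_TO_DUNDER.get(ufunc.__name__)
    if method != "__call__" or kwargs or dunder is None or len(inputs) != 2:
        return NotImplemented
    if inputs[0] is self:
        return getattr(self, dunder)(inputs[1])
    # self is the right-hand operand, so use the reflected method.
    return getattr(self, REVERSED[dunder])(inputs[0])


class Meters:
    """Toy scalar used only to exercise the dispatcher."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        return Meters(self.value + other)

    __radd__ = __add__

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        return dispatch_ufunc_to_dunder(self, ufunc, method, *inputs, **kwargs)


total = np.add(np.int64(2), Meters(3))
print(total.value)
```

The point of the design is that the operator methods stay the single source of truth; `__array_ufunc__` only translates NumPy's call into them.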

jreback

lgtm. @jbrockmendel

pandas/_libs/missing.pyx

jbrockmendel · 2020-01-01T17:08:30Z

comment about L482 vs L483, otherwise LGTM

TomAugspurger · 2020-01-02T14:05:15Z

Fixed. Merging once https://github.com/pandas-dev/pandas/pull/30245/files/0f4e121a229e3ede7d3b0fa4a974f5e2dcc8e30e#diff-c4b0cf199cc6d5bc2362ceb737159458 is resolved @jorisvandenbossche.

TomAugspurger · 2020-01-03T13:57:25Z

Merging in a few hours if there aren't any objections. Can revisit the line in https://github.com/pandas-dev/pandas/pull/30245/files/0f4e121a229e3ede7d3b0fa4a974f5e2dcc8e30e#diff-c4b0cf199cc6d5bc2362ceb737159458 if needed.

jorisvandenbossche · 2020-01-05T17:34:08Z

Checks are failing: the benchmark run failure seems unrelated, the typing failure looks unrelated as well but I am not very knowledgeable about that.

jreback · 2020-01-05T20:43:25Z

thanks @TomAugspurger very nice!

Implement NA.__array_ufunc__

bf8680f

This gives us consistent comparisons with NumPy scalars.

TomAugspurger added the ExtensionArray, Missing-data, and Numeric Operations labels Dec 12, 2019

jorisvandenbossche mentioned this pull request Dec 13, 2019

Missing values proposal: concrete steps for 1.0 #29556

Closed

13 tasks

TomAugspurger added 2 commits December 13, 2019 09:37

Merge remote-tracking branch 'upstream/master' into na-array-ufunc

0c69bd0

ndarrays

46f2327

TomAugspurger added 6 commits December 16, 2019 10:54

move

075d58a

Merge remote-tracking branch 'upstream/master' into na-array-ufunc

b72dd1c

setup, prints

0371cf4

docs

878ef70

fixup

97af2e9

fixups

f175a34

TomAugspurger mentioned this pull request Dec 16, 2019

test CI #30293

Merged

TomAugspurger added 2 commits December 17, 2019 09:01

Merge remote-tracking branch 'upstream/master' into na-array-ufunc

f2ac945

lint

0f4e121

jreback requested changes Dec 17, 2019

doc/source/user_guide/missing_data.rst

Merge remote-tracking branch 'upstream/master' into na-array-ufunc

fe04554

jorisvandenbossche reviewed Dec 18, 2019

TomAugspurger added 2 commits December 20, 2019 06:28

Merge remote-tracking branch 'upstream/master' into na-array-ufunc

cf9ac10

Fixups

72e2b67

TomAugspurger added 4 commits December 30, 2019 08:41

Merge remote-tracking branch 'upstream/master' into na-array-ufunc

db4cc40

fix bug

7b1585a

update

b79e07f

test special

567c584

TomAugspurger added 2 commits December 30, 2019 09:03

doc

b27470d

move

c7c9184

jreback added this to the 1.0 milestone Dec 31, 2019

jreback reviewed Dec 31, 2019

pandas/_libs/ops_dispatch.pyx

jbrockmendel reviewed Dec 31, 2019

pandas/_libs/ops_dispatch.pyx

jbrockmendel reviewed Dec 31, 2019

jreback approved these changes Jan 1, 2020

jbrockmendel reviewed Jan 1, 2020

pandas/_libs/missing.pyx

TomAugspurger added 3 commits January 2, 2020 06:09

Merge remote-tracking branch 'upstream/master' into na-array-ufunc

a0dbca8

fixup

ce209f9

fixup

e4ecadb

TomAugspurger added 3 commits January 4, 2020 10:43

Merge remote-tracking branch 'upstream/master' into na-array-ufunc

8d2763d

restore

f68e178

fixup

d8c23e9

jorisvandenbossche approved these changes Jan 5, 2020

Merge remote-tracking branch 'upstream/master' into na-array-ufunc

4c30bb4

jreback merged commit 97153bf into pandas-dev:master Jan 5, 2020

jorisvandenbossche mentioned this pull request May 10, 2020

API/BUG: pd.NA == pd.NaT returns False #34104

Closed
