Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement NA.__array_ufunc__ #30245

Merged
merged 29 commits into from
Jan 5, 2020
Merged

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Dec 12, 2019

This gives us consistent comparisons with NumPy scalars.

I have a few implementation questions:

  1. I wanted to reuse maybe_dispatch_ufunc_to_dunder_op. That was in core/ops/dispatch.py, but NA is defend in _libs/missing.pyx. For now I just moved it to _libs/missing.pyx, is there a better home for it? I don't think _libs/ops.pyx is an option, as it imports missing.
  2. How should we handle comparisons with ndarrays? I see a few options:

a. Return an ndarray[object], where each element is NA
b. Really take over and return our extension arrays. So np.array([1, 2]) == pd.NA would return an IntegerArray([NA, NA]). Boolean results would return a BooleanArray. And float results would return a ... PandasArray? (so maybe this isn't a good idea. But worth considering)
3. What scalars should we handle (in __array_ufunc__ and in the regular methods)? Any of date time, Timestamp, time delta, Period, Interval?

One item 2, I seem to have messed something up, as we return just a scalar NA right now. Will want to sort that out.

This gives us consistent comparisions with NumPy scalars.
@TomAugspurger TomAugspurger added ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations labels Dec 12, 2019
@TomAugspurger
Copy link
Contributor Author

How should we handle comparisons with ndarrays?

46f2327 restores the behavior on master. We return an object-dtype ndarray filled with NA values. Not sure if that's the right choice though.

@TomAugspurger
Copy link
Contributor Author

Anybody else seeing the CI check failure when linting .pyx files?

flake8 --version
3.7.9 (flake8-comprehensions: 3.1.4, mccabe: 0.6.1, pycodestyle: 2.5.0, pyflakes: 2.1.1) CPython 3.7.3 on Linux
Linting .py code
Linting .py code DONE
Linting .pyx code
Traceback (most recent call last):
  File "/home/runner/miniconda3/envs/pandas-dev/bin/flake8", line 11, in <module>
    sys.exit(main())

 ...

    self.formatter.handle(error)
  File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/flake8/formatting/base.py", line 91, in handle
    line = self.format(error)
  File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/flake8/formatting/default.py", line 39, in format
    "col": error.column_number,
ValueError: unsupported format character ':' (0x3a) at index 41

Not able to reproduce locally with those versions.

@TomAugspurger TomAugspurger mentioned this pull request Dec 16, 2019
@TomAugspurger
Copy link
Contributor Author

All green now. Any thoughts on what, e.g. np.array([True, False]) | pd.NA should return? Our choices are an object-dtype ndarray, or a BooleanArray. Is it strange to return a non-ndarray?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will review soon

doc/source/user_guide/missing_data.rst Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

How should we handle comparisons with ndarrays? I see a few options:

Another option might also be to raise right now? (or is that also surprising?) That might be the safest option to allow choosing a more specific behaviour later.

Otherwise returning an object array of NAs seems OK to me. But from numpy's point of view, such an object array is also strange, as you can't do anything boolean with it like boolean indexing.

@TomAugspurger
Copy link
Contributor Author

But from numpy's point of view, such an object array is also strange, as you can't do anything boolean with it like boolean indexing.

Right. An ndarray full of pd.NA is a fairly useless thing I think. That's my main motivation for considering anything fancy like returning an IntegerArray, etc.

We do have precedence for returning non-ndarrays from __array_ufunc__.

In [8]: a = pd.array([2000, 2001], dtype="Period[D]")

In [9]: np.add(a, 1)
Out[9]: array([Period('2000-01-02', 'D'), Period('2001-01-02', 'D')], dtype=object)

Raising for ndarrays is also fine by me. Just need to choose whether a 0-dim ndarray is an array or scalar (probably a scalar).

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(only looked at the first part)

doc/source/user_guide/missing_data.rst Show resolved Hide resolved
doc/source/user_guide/missing_data.rst Outdated Show resolved Hide resolved
pandas/_libs/missing.pyx Outdated Show resolved Hide resolved
@@ -290,12 +291,22 @@ def _create_binary_propagating_op(name, divmod=False):

def method(self, other):
if (other is C_NA or isinstance(other, str)
or isinstance(other, (numbers.Number, np.bool_))):
or isinstance(other, (numbers.Number, np.bool_, np.int64, np.int_))
or isinstance(other, np.ndarray) and not other.shape):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit strange, but I suppose it is to follow numpy behaviour of 0-dim arrays returning scalars from comparison operations? (if so, maybe add a comment about that)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to follow numpy behaviour of 0-dim arrays returning scalars from comparison operations?

It's to NumPy scalars. Without this, we'd have

np.int64(1) == pd.NA

raise with

>   out[:] = NA
E   IndexError: too many indices for array

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A numpy scalar (which is different from a 0-dim array) is not a ndarray, so wouldn't pass the isinstance(other, np.ndarray) test. So I don't fully understand how that example is related.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could call lib.item_from_zerodim at the top?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger can you check this thread? I still don't understand how this relates to scalars, as they shouldn't pass the isinstance(other, np.ndarray) check. So not sure the below comment is correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this

diff --git a/pandas/_libs/missing.pyx b/pandas/_libs/missing.pyx
index 8d4d2c5568..e2bb6448cd 100644
--- a/pandas/_libs/missing.pyx
+++ b/pandas/_libs/missing.pyx
@@ -295,8 +295,7 @@ def _create_binary_propagating_op(name, is_divmod=False):
 
     def method(self, other):
         if (other is C_NA or isinstance(other, str)
-                or isinstance(other, (numbers.Number, np.bool_))
-                or isinstance(other, np.ndarray) and not other.shape):
+                or isinstance(other, (numbers.Number, np.bool_))):
             # Need the other.shape clause to handle NumPy scalars,
             # since we do a setitem on `out` below, which
             # won't work for NumPy scalars.

I have

In [3]: np.int64(1) == pd.NA
/Users/taugspurger/.virtualenvs/pandas-dev/bin/ipython:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
  #!/Users/taugspurger/Envs/pandas-dev/bin/python
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-3-209227154583> in <module>
----> 1 np.int64(1) == pd.NA

~/sandbox/pandas/pandas/_libs/missing.pyx in pandas._libs.missing._create_binary_propagating_op.method()
    307         elif isinstance(other, np.ndarray):
    308             out = np.empty(other.shape, dtype=object)
--> 309             out[:] = NA
    310
    311             if is_divmod:

IndexError: too many indices for array

So when we get there, it really does seem like np.int64(1) is an ndarray. Is cython doing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So when we get there, it really does seem like np.int64(1) is an ndarray. Is cython doing something?

Possibly, no idea. But I trust you this is indeed needed to get it working

@@ -434,6 +449,34 @@ class NAType(C_NAType):

__rxor__ = __xor__

# What else to add here? datetime / Timestamp? Period? Interval?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timestamp et al don't implement __array_ufunc__ right? If they would, I would say they could handle NA instead of NA handling all those types..

(I also think it is not that important right now that pd.NA can handle those types, since we don't yet use NA for those dtypes)

@jorisvandenbossche
Copy link
Member

We do have precedence for returning non-ndarrays from array_ufunc.

What you show is actually returning a numpy object array (but I would agree that should/could return a PeriodArray)

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Dec 18, 2019

Oh... it's too early here :) I could have sworn we returned an EA there.

@TomAugspurger
Copy link
Contributor Author

Thanks for the review @jorisvandenbossche. Things should be good here.

I've deferred on returning anything too exotic from NA comparisons with ndarrays. We just return an ndarray of NA now.

@jbrockmendel
Copy link
Member

should getting this in for 1.0 be a priority?

@TomAugspurger
Copy link
Contributor Author

This should be good to go, assuming CI passes.

@TomAugspurger
Copy link
Contributor Author

All green.

@TomAugspurger
Copy link
Contributor Author

Planning to merge this later today if there aren't any objections.

@jreback jreback added this to the 1.0 milestone Dec 31, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, except for a question.

pandas/_libs/ops_dispatch.pyx Show resolved Hide resolved

def maybe_dispatch_ufunc_to_dunder_op(
object self, object ufunc, str method, *inputs, **kwargs
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for functions that arent cdef or cpdef, we can use python syntax for annotations

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel
Copy link
Member

comment about L482 vs L483, otherwise LGTM

@TomAugspurger
Copy link
Contributor Author

Merging in a few hours if there aren't any objections. Can revisit the line in https://github.com/pandas-dev/pandas/pull/30245/files/0f4e121a229e3ede7d3b0fa4a974f5e2dcc8e30e#diff-c4b0cf199cc6d5bc2362ceb737159458 if needed.

@jorisvandenbossche
Copy link
Member

Checks are failing: the benchmark run failure seems unrelated, the typing failure looks unrelated as well but I am not very knowledgeable about that.

@jreback jreback merged commit 97153bf into pandas-dev:master Jan 5, 2020
@jreback
Copy link
Contributor

jreback commented Jan 5, 2020

thanks @TomAugspurger very nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants