Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP]: Indexing with BooleanArray propagates NA #30265

Closed
wants to merge 1 commit into from

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Dec 13, 2019

This implements indexing a Series with a BooleanArray for some
dtypes. NA values in the mask propagate. It's not clear that we want to propagate NAs.

This required updating each EA's __getitem__ to correctly propagate
NA. Doing the same for ndarray-backed Series is possible, but will require
some additional hacking and we first need to determine the result dtype.

No need to review in detail. The primary motivation for pushing this up
is to better understand what indexing with NA values will look like in
practice.

xref #28778, and #28778 (comment) in particular.


(copied from #28778 (comment))

On indexing, propagating NAs presents some challenging dtype casting issues. For example, what is the dtype on the output of

In [3]: s = pd.Series([1, 2, 3])

In [4]: mask = pd.array([True, False, None])

In [5]: s[mask]

Would that be an Int64 dtype, with the last value being NA? Would we cast to float64 and use np.nan? object-dtype?

And what if the original series was float dtype? A float-dtype with NaN is about the only option we have right now, since we lack a float64 array that can hold NA.

I don't think that an indexing operation that introduces missing values should change the dtype of the array. I don't think anyone would realistically want that. So... do we just raise for now?

What about cases when we are able to index without changing the dtype? Those would be

  1. Indexing with a BooleanArray with no missing values.
  2. Indexing a dtype that supports missing values (datetime, timedelta, string, object, Int64, boolean...). Basically anything but NumPy's bool, int float.

IMO, which shouldn't have value-dependent behavior, so if

>>> pd.Series([1, 2])[pd.array([True, None])

raises, then so should pd.Series([1, 2])[pd.array([True, False]) (no missing values).

I think supporting 2 is fine, since it just depends on the dtypes.

>>> pd.Series([1, 2], dtype="Int64")[pd.array([True, None])]
Series([1, NA], dtype="Int64")
>>> pd.Series([1, 2], dtype="Int64")[pd.array([True, False])]
Series([1], dtype="Int64")

This implements indexing a Series with a BooleanArray for some
dtypes. NA values in the mask propagate.

This required updating each EA's `__getitem__` to correctly propagate
NA. This isn't an option for ndarray-backed Series, which don't
understand BooleanArray.

No need to review in detail. The primary motivation for pushing this up
is to better understand what indexing with NA values will look like in
practice.
@TomAugspurger
Copy link
Contributor Author

The primary motivation for pushing this up
is to better understand what indexing with NA values will look like in
practice.

Any thoughts on this point @jorisvandenbossche, @jreback, et. al? In #28778 (comment) Joris said that

If not raising an error, skipping (interpret NA as False) feels more intuitive to me than propagating NAs (and thus introducing NAs in the filtered array). That's also what SQL and tidyverse do (base R propagates, see above)

I think that the additional dtype for Series([1, 2])[True, False]) being int, but Series([1, 2])[True, NA]) being unknown makes the case for raising even stronger.

So my proposal is to

  1. Have indexing with NA values raises with a nice error message.
  2. Implement BooleanArray.fillna() to make the common case of indexing with a BooleanArray easier.

At some point, we can discuss having NA values propagate or drop, but lets get some user feedback first.

@TomAugspurger TomAugspurger changed the title [WIP]: Indexing with BooleanArray [WIP]: Indexing with BooleanArray propagates NA Dec 16, 2019
@TomAugspurger
Copy link
Contributor Author

Closing this, since I don't think we'll want to implement NA propagation.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Dec 16, 2019

So my proposal is to

  1. Have indexing with NA values raises with a nice error message.
  2. Implement BooleanArray.fillna() to make the common case of indexing with a BooleanArray easier.

I think this is the right thing to do. IMHO, if there is a pd.NA value in a BooleanArray, then we raise to force the user to think about the issue and how they want missing values treated.

Maybe a related issue - what if someone puts pd.NA in an index and then uses .loc, passing pd.NA in the parameters to .loc?

@jorisvandenbossche
Copy link
Member

I think that the additional dtype for Series([1, 2])[True, False]) being int, but Series([1, 2])[True, NA]) being unknown makes the case for raising even stronger.

Note that is not necessarily an argument for raising. Skipping also preserves the dtype.

Maybe a related issue - what if someone puts pd.NA in an index and then uses .loc, passing pd.NA in the parameters to .loc?

For now, the EAs are not yet enabled in Index (#22861), but this is a good question for when looking into that (and a chance to re-evaluate how it now happens with NaN!).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants