[WIP]: Indexing with BooleanArray propagates NA #30265
Closed
This implements indexing a Series with a BooleanArray for some dtypes. NA values in the mask propagate. It's not clear that we want to propagate NAs.
This required updating each EA's `__getitem__` to correctly propagate NA. Doing the same for ndarray-backed Series is possible, but will require some additional hacking, and we first need to determine the result dtype.
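To make the proposed semantics concrete, here is a minimal sketch of what "NA in the mask propagates" could mean for an extension array. The helper `getitem_propagate_na` is hypothetical and is not the code in this PR. It assumes the semantics are "keep positions where the mask is True, emit NA where the mask is NA", and it uses only the public `ExtensionArray.take` with `allow_fill=True`.

```python
import numpy as np
import pandas as pd


def getitem_propagate_na(values, mask):
    """Hypothetical helper: boolean-mask indexing where NA in the mask yields NA.

    `values` is assumed to be an ExtensionArray (for example an Int64 array).
    """
    mask = pd.array(mask, dtype="boolean")
    is_na = np.asarray(mask.isna())
    # Positions to emit: where the mask is True, plus where it is NA.
    keep = mask.to_numpy(dtype=bool, na_value=True)
    positions = np.arange(len(values))
    # -1 tells take(..., allow_fill=True) to insert the array's NA value.
    indexer = np.where(is_na, -1, positions)[keep]
    return values.take(indexer, allow_fill=True)


values = pd.array([1, 2, 3], dtype="Int64")
result = getitem_propagate_na(values, [True, False, None])
print(result)  # two elements: 1 and <NA>, dtype Int64
```

This only works without a dtype change because Int64 can hold NA. An ndarray-backed int64 Series has no such slot, which is the result-dtype question raised above.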
No need to review in detail. The primary motivation for pushing this up
is to better understand what indexing with NA values will look like in
practice.
xref #28778, and #28778 (comment) in particular.
(copied from #28778 (comment))
On indexing, propagating NAs presents some challenging dtype casting issues. For example, what is the dtype on the output of
Would that be an Int64 dtype, with the last value being NA? Would we cast to float64 and use `np.nan`? object-dtype? And what if the original series was float dtype? A float-dtype with NaN is about the only option we have right now, since we lack a float64 array that can hold NA.
I don't think that an indexing operation that introduces missing values should change the dtype of the array. I don't think anyone would realistically want that. So... do we just raise for now?
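As an analogy only (this uses `reindex`, not boolean-mask indexing, and is not taken from the PR), the following sketch shows the kind of dtype change being discussed: an existing pandas operation that introduces a missing value into an ndarray-backed integer Series casts it to float64, while the nullable Int64 dtype is preserved.

```python
import pandas as pd

# ndarray-backed int64: introducing a missing value forces a cast to float64.
s = pd.Series([1, 2, 3])
out = s.reindex([0, 1, 5])  # label 5 does not exist, so it becomes NaN
print(out.dtype)  # float64

# Nullable Int64: the same operation keeps the dtype and stores NA.
s2 = pd.Series([1, 2, 3], dtype="Int64")
out2 = s2.reindex([0, 1, 5])
print(out2.dtype)  # Int64
```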
What about cases when we are able to index without changing the dtype? Those would be
IMO, which shouldn't have value-dependent behavior, so if
raises, then so should `pd.Series([1, 2])[pd.array([True, False])]` (no missing values). I think supporting 2 is fine, since it just depends on the dtypes.