DOC/TST: Indexing with NA raises #30308

TomAugspurger · 2019-12-17T18:00:10Z

We're already doing the right thing on master. This just documents that behavior, and adds a handful of tests.

I'm not sure if there are existing tests I should be parameterizing. I only found a couple in tests/indexing/, which I parametrized over bool and boolean dtype. Will add more if I've missed any. In the meantime, I've written new tests.

doc/source/user_guide/boolean.rst

pandas/core/arrays/boolean.py

WillAyd · 2019-12-17T18:27:41Z

pandas/tests/indexing/test_na_indexing.py

+
+    if frame:
+        s = s.to_frame()
+    mask = pd.array([True, False, None], dtype="boolean")


Can we parametrize this? IIRC nulls_fixture wasn't appropriate but maybe need a nulls_scalar_fixture for these purposes

What would be parametrized? The boolean array with missing values?

jorisvandenbossche

This indeed seems to already work for Series etc, but not yet for the EAs itself:

In [22]: arr = pd.array([True, False, True], dtype="boolean") 

In [23]: arr[arr]  
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-23-9b15d846534b> in <module>
----> 1 arr[arr]

~/scipy/pandas/pandas/core/arrays/boolean.py in __getitem__(self, item)
    295                 return self.dtype.na_value
    296             return self._data[item]
--> 297         return type(self)(self._data[item], self._mask[item])
    298 
    299     def to_numpy(self, dtype=None, copy=False, na_value: "Scalar" = libmissing.NA):

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

So we need to add some check there that converts BooleanArray to numpy bool array if possible, or otherwise raise the informative error (we should already have some utility somewhere, since it is being done in the indexing machinery)
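A minimal sketch of the kind of check described above. The helper name `mask_to_bool_ndarray` and the error text are assumptions made for illustration; this is not the utility the pandas indexing machinery actually uses.

```python
import numpy as np
import pandas as pd


def mask_to_bool_ndarray(mask):
    """Convert a boolean mask to a NumPy bool array, raising on missing values.

    Hypothetical helper for illustration only.
    """
    # Normalise lists, ndarrays and BooleanArray to the nullable boolean dtype.
    mask = pd.array(mask, dtype="boolean")
    if mask.isna().any():
        # Illustrative message, not the wording pandas uses.
        raise ValueError("cannot index with a boolean mask containing missing values")
    # No missing values remain, so the conversion to a plain bool ndarray is lossless.
    return mask.to_numpy(dtype=bool)


data = np.array([10, 20, 30])
mask = pd.array([True, False, True], dtype="boolean")
print(data[mask_to_bool_ndarray(mask)])
```

With a helper along these lines, `BooleanArray.__getitem__` could convert `item` before it reaches `self._data[item]`, which is the call that raises the `IndexError` in the traceback.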

pandas/core/arrays/boolean.py

TomAugspurger · 2019-12-18T14:22:46Z

but not yet for the EAs itself:

Good point. I think I'll add a base test for this, since all EAs should probably be able to handle it. Or is that too opinionated?

(we should already have some utility somewhere, since it is being done in the indexing machinery)

It's pandas.core.indexing.check_bool_indexer, which converts the array-like to a boolean ndarray. I can make public somehow, if we want 3rd party EAs to handle these...

jorisvandenbossche · 2019-12-18T14:29:22Z

I think I'll add a base test for this, since all EAs should probably be able to handle it. Or is that too opinionated?

It will probably cause failures for external projects that implement EAs. But maybe that is good, as ideally they would indeed handle this.

I can make public somehow, if we want 3rd party EAs to handle these...

That sounds as a good idea to me!

TomAugspurger · 2019-12-18T15:01:22Z

I can make public somehow, if we want 3rd party EAs to handle these...

That sounds as a good idea to me!

Do people have an initial preference on where to put this? I was thinking a new module in pandas.api, perhaps pandas.api.indexing. OTOH, this is mostly going to be used by EAs, and we already have take in pandas.api.extensions, so perhaps we put it with take?

jorisvandenbossche · 2019-12-18T15:06:17Z

I would put it like take in api.extensions

TomAugspurger · 2019-12-18T15:15:52Z

Added a couple base tests for ExtensionArray.__getitem__(booleanarray), and updated our EAs to pass.

TomAugspurger · 2019-12-18T18:04:16Z

Fixed the mypy issues.

On performance, a BooleanArray mask takes about 1.45x as long as an ndarray mask for a 10,000-element array:

In [5]: s = pd.Series(np.arange(10000))
   ...:
   ...: m1 = np.zeros(len(s), dtype="bool")
   ...: m2 = pd.array(m1, dtype="boolean")

In [6]: %timeit s[m1]
270 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: %timeit s[m2]
391 µs ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: 391/270
Out[8]: 1.4481481481481482

I suspect this is due to the additional .isna().any() check we need for BooleanArray, though there may be an unnecessary copy.

Actually... we're doing two isna().any() checks, one in is_bool_indexer and one in check_bool_array_indexer. I didn't realize that is_bool_indexer also converted to an ndarray and checked for missing values. I'm guessing we'll want to combine those checks.
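One way to see how much of the gap a single NA check accounts for is to time it on its own next to the two indexing calls. The script below is a rough sketch using the standard-library `timeit` module. The `best` helper and the repeat counts are assumptions, and the numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.arange(10_000))
m1 = np.zeros(len(s), dtype="bool")
m2 = pd.array(m1, dtype="boolean")


def best(func, number=200, repeat=5):
    """Return the best observed time per call, in microseconds."""
    timer = timeit.Timer(func)
    return min(timer.repeat(repeat=repeat, number=number)) / number * 1e6


t_ndarray = best(lambda: s[m1])                # plain NumPy bool mask
t_boolean = best(lambda: s[m2])                # nullable BooleanArray mask
t_na_check = best(lambda: m2.isna().any())     # one NA check in isolation

print(f"ndarray mask:      {t_ndarray:8.1f} us")
print(f"BooleanArray mask: {t_boolean:8.1f} us")
print(f"isna().any():      {t_na_check:8.1f} us")
print(f"ratio:             {t_boolean / t_ndarray:8.2f}x")
```

If the isolated check is small relative to the difference between the two masks, the remaining overhead is coming from somewhere else in the indexing path.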

TomAugspurger · 2019-12-19T16:34:56Z

Edit: reverted this for now. Will investigate more later, but probably out of scope for this PR.

Pushed a perf / bugfix update, will see if it breaks things. I restructured is_bool_indexer to not do NA checking. But as I write this, I wonder why we even have is_bool_indexer? I'd rather just use is_bool_dtype and check_bool_indexer. Will look into that now.

Anyway, that turned up a bug in DatetimeArray.__getitem__ which has been fixed and documented.

I'm probably going to stop looking at perf things now, but there's likely more to be done. We still take about 1.3x as long as with an ndarray, but only 8% of the time is spent in check_array doing the isna().any(), which should be the extent of the overhead. I think we're leaving a lot on the table: #30349

This reverts commit 151bdfe.

jorisvandenbossche

This looks good to me

pandas/tests/extension/base/getitem.py

pandas/core/common.py

TomAugspurger · 2020-01-02T13:44:46Z

@jorisvandenbossche can you take a look at 21fd589? That updates DecimalArray.__getitem__ to handle boolean masks, and I think it is representative of what 3rd party arrays would need to do.

If we're OK with that, I'll add check_bool_array_indexer to the public API. I'm not worried about its implementation changing. I'm still not sure where it should go though, but probably api.extensions.

jreback · 2020-01-02T13:47:43Z

we have api.indexers now -
seems like a good place

TomAugspurger · 2020-01-02T13:51:04Z

Yeah, just saw that. It seems reasonable.

TomAugspurger · 2020-01-02T15:33:19Z

All green.

jreback · 2020-01-03T02:21:37Z

very nice @TomAugspurger

lots of testing!

jorisvandenbossche

@TomAugspurger thanks, that function looks good!

On the location of pandas.api.indexers: I have no strong opinion on putting it there per se, but I am not sure this new function necessarily fits together with the rolling-window functionality. (Then again, I also find api.indexers a strange name for rolling-window functionality that has nothing to do with indexing; I will comment about that elsewhere.)

jorisvandenbossche · 2020-01-06T08:44:36Z

pandas/core/indexers.py

+
+    See Also
+    --------
+    api.extensions.is_bool_indexer : Check if `key` is a boolean indexer.


this does not exist now (will do a PR)

jorisvandenbossche · 2020-01-06T08:50:18Z

pandas/core/indexers.py

+    >>> pd.api.extensions.check_bool_array_indexer(arr, mask)
+    Traceback (most recent call last):
+    ...
+    ValueError: cannot convert to bool numpy array in presence of missing values


Should we try to improve this error message? We know this is called in terms of indexing, and then something like "Cannot do boolean indexing with missing values, use fillna(True/False) ..." would be a much more useful error message than the message about conversion to numpy array.

jorisvandenbossche · 2020-01-06T13:55:01Z

While implementing support for this in GeoPandas, I ran into an issue with your example implementation for DecimalArray, basically being:

        def __getitem__(self, item):
            ....
            # array, slice.
            if pd.api.types.is_list_like(item):
                if not pd.api.types.is_array_like(item):
                    item = pd.array(item)
                dtype = item.dtype
                if pd.api.types.is_bool_dtype(dtype):
                    item = check_bool_array_indexer(self, item)
            return type(self)(self._data[item])

The problem with the above is that this also converts integers to IntegerArray, which is then not supported in the numpy indexing call (not fully sure why this doesn't come up in the decimal array's tests).
Opened #30738 for this about indexing with IntegerArray in general.

That issue is maybe not a blocker for 1.0 (it also doesn't work on 0.25, and I can work around this in GeoPandas), but we should think for a moment about whether this might change how we want to expose an array indexer check function.
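For illustration, here is one possible way an array's `__getitem__` could avoid the IntegerArray problem: coerce plain lists with `np.asarray` rather than `pd.array`, and convert integer-dtype indexers to a NumPy integer array before indexing. `ToyArray` is a hypothetical stand-in, not DecimalArray or GeometryArray, and the NA handling and error text are assumptions rather than what pandas or GeoPandas ended up doing.

```python
import numpy as np
import pandas as pd
from pandas.api.types import (
    is_array_like,
    is_bool_dtype,
    is_integer_dtype,
    is_list_like,
)


class ToyArray:
    """Hypothetical ndarray-backed container, for illustration only."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __getitem__(self, item):
        if isinstance(item, (int, np.integer)):
            # Scalar position: return the element itself.
            return self._data[item]
        if is_list_like(item):
            if not is_array_like(item):
                # np.asarray keeps an integer list as a plain integer ndarray,
                # whereas pd.array would turn it into an IntegerArray.
                item = np.asarray(item)
            if is_bool_dtype(item.dtype):
                mask = pd.array(item, dtype="boolean")
                if mask.isna().any():
                    # Illustrative message only.
                    raise ValueError("cannot index with a boolean mask containing NA")
                item = mask.to_numpy(dtype=bool)
            elif is_integer_dtype(item.dtype):
                # Also covers an IntegerArray that has no missing values.
                item = np.asarray(item, dtype=np.intp)
        # Slices and validated array indexers go straight to NumPy.
        return type(self)(self._data[item])


arr = ToyArray([10, 20, 30])
print(arr[[0, 2]]._data)
print(arr[pd.array([0, 2], dtype="Int64")]._data)
print(arr[pd.array([True, False, True], dtype="boolean")]._data)
```

The generalized `check_array_indexer` referenced later in this thread (#31150) is aimed at this same validation step, so third-party arrays would not need to hand-roll it.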

jorisvandenbossche · 2020-01-06T13:59:08Z

not fully sure why this doesn't come up in the decimal array's tests

Because you added a fix for IntegerArray in DecimalArray.__getitem__ ;)

jbrockmendel · 2020-10-07T15:52:55Z

pandas/core/arrays/sparse/array.py

@@ -766,7 +769,9 @@ def __getitem__(self, key):
                else:
                    key = np.asarray(key)

-            if com.is_bool_indexer(key) and len(self) == len(key):
+            if com.is_bool_indexer(key):
+                key = check_bool_indexer(self, key)


Why is this check_bool_indexer when all the others use check_bool_array_indexer? I found this because mypy is complaining that the first arg should be an Index.

DOC/TST: Indexing with NA raises

492f904

TomAugspurger added the Indexing and ExtensionArray labels Dec 17, 2019

TomAugspurger added this to the 1.0 milestone Dec 17, 2019

jorisvandenbossche mentioned this pull request Dec 17, 2019

Missing values proposal: concrete steps for 1.0 #29556

Closed

13 tasks

WillAyd requested changes Dec 17, 2019

Merge remote-tracking branch 'upstream/master' into na-indexing-raises

6444aa0

jorisvandenbossche reviewed Dec 18, 2019

pandas/core/arrays/boolean.py (outdated)

Handle BooleanArray in all EAs

53f4f63

TomAugspurger added 2 commits December 18, 2019 09:09

update

3bbf868

fixups

a5ac457

TomAugspurger added 4 commits December 18, 2019 09:56

type

0dfe761

fix benchmark

dac111d

fixup

d1f08d9

typo

3dd59ca

updates

151bdfe

TomAugspurger added 4 commits December 19, 2019 12:01

Revert "updates"

d57b0ac

This reverts commit 151bdfe.

examples

36be0f6

restore datetime fix

7bd6c2f

Merge remote-tracking branch 'upstream/master' into na-indexing-raises

c5f3afb

jorisvandenbossche reviewed Dec 21, 2019

pandas/tests/extension/base/getitem.py (outdated)

pandas/tests/extension/base/getitem.py (outdated)

pandas/core/common.py (outdated)

TomAugspurger added 2 commits December 28, 2019 10:45

Merge branch 'master' of https://github.com/pandas-dev/pandas into na-indexing-raises

76bb6ce

update error message

505112e

TomAugspurger added 2 commits January 2, 2020 07:36

Merge remote-tracking branch 'upstream/master' into na-indexing-raises

816a47c

update arrayo

21fd589

TomAugspurger added 5 commits January 2, 2020 07:55

doc

3637070

integer

61599f2

Merge remote-tracking branch 'upstream/master' into na-indexing-raises

6a0eda6

fixup

e622826

fixup

5004d91

jreback approved these changes Jan 3, 2020

jreback merged commit 59b431f into pandas-dev:master Jan 3, 2020

jorisvandenbossche reviewed Jan 6, 2020

jorisvandenbossche mentioned this pull request Jan 6, 2020

DOC: fix see also in docstring of check_bool_array_indexer #30725

Merged

This was referenced Jan 6, 2020

BUG: reflect changes in bool(geom) in shapely 1.7 geopandas/geopandas#1244

Merged

BUG: Indexing GeometryArray with boolean array mask geopandas/geopandas#1257

Closed

TomAugspurger deleted the na-indexing-raises branch January 6, 2020 12:23

This was referenced Jan 6, 2020

Fix GeometryArray indexing with pandas boolean array (pandas 1.0 compat) geopandas/geopandas#1258

Merged

PERF: Categorical indexing performance regression #30744

Closed

PERF: Categorical getitem perf #30747

Merged

Add read_parquet xhochy/fletcher#95

Merged

TomAugspurger mentioned this pull request Jan 9, 2020

CLN: Removed "# noqa: F401" comments #30832

Merged

5 tasks

jorisvandenbossche mentioned this pull request Jan 20, 2020

API: generalized check_array_indexer for validating array-like getitem indexers #31150

Merged

jorisvandenbossche mentioned this pull request Jan 30, 2020

REGR: Array.__setitem__ failing with nullable boolean mask #31446

Closed

jbrockmendel reviewed Oct 7, 2020

jorisvandenbossche mentioned this pull request Dec 3, 2021

CLN: TODOs #44733

Merged

4 tasks
