Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC/TST: Indexing with NA raises #30308

Merged
merged 32 commits into from
Jan 3, 2020

Conversation

TomAugspurger
Copy link
Contributor

xref #29556, #28778

We're already doing the right thing on master. This just documents that behavior, and adds a handful of tests.

I'm not sure if there are existing tests I should be parameterizing. I only found a couple in tests/indexing/, which I parametrized over bool and boolean dtype. Will add more if I've missed any. In the meantime, I've written new tests.

@TomAugspurger TomAugspurger added Indexing Related to indexing on series/frames, not to indexes themselves ExtensionArray Extending pandas with custom dtypes or arrays. labels Dec 17, 2019
@TomAugspurger TomAugspurger added this to the 1.0 milestone Dec 17, 2019
doc/source/user_guide/boolean.rst Outdated Show resolved Hide resolved
pandas/core/arrays/boolean.py Show resolved Hide resolved

if frame:
s = s.to_frame()
mask = pd.array([True, False, None], dtype="boolean")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we parametrize this? IIRC nulls_fixture wasn't appropriate but maybe need a nulls_scalar_fixture for these purposes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be parametrized? The boolean array with missing values?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This indeed seems to already work for Series etc, but not yet for the EAs itself:

In [22]: arr = pd.array([True, False, True], dtype="boolean") 

In [23]: arr[arr]  
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-23-9b15d846534b> in <module>
----> 1 arr[arr]

~/scipy/pandas/pandas/core/arrays/boolean.py in __getitem__(self, item)
    295                 return self.dtype.na_value
    296             return self._data[item]
--> 297         return type(self)(self._data[item], self._mask[item])
    298 
    299     def to_numpy(self, dtype=None, copy=False, na_value: "Scalar" = libmissing.NA):

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

So we need to add some check there that converts BooleanArray to numpy bool array if possible, or otherwise raise the informative error (we should already have some utility somewhere, since it is being done in the indexing machinery)

pandas/core/arrays/boolean.py Outdated Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor Author

but not yet for the EAs itself:

Good point. I think I'll add a base test for this, since all EAs should probably be able to handle it. Or is that too opinionated?

(we should already have some utility somewhere, since it is being done in the indexing machinery)

It's pandas.core.indexing.check_bool_indexer, which converts the array-like to a boolean ndarray. I can make public somehow, if we want 3rd party EAs to handle these...

@jorisvandenbossche
Copy link
Member

I think I'll add a base test for this, since all EAs should probably be able to handle it. Or is that too opinionated?

It will probably give failures for external projects that implement EAs .. But maybe that is good, as indeed ideally they would handle this.

I can make public somehow, if we want 3rd party EAs to handle these...

That sounds as a good idea to me!

@TomAugspurger
Copy link
Contributor Author

I can make public somehow, if we want 3rd party EAs to handle these...

That sounds as a good idea to me!

Do people have an initial preference on where to put this? I was thinking a new module in pandas.api, perhaps pandas.api.indexing. OTOH, this is mostly going to be used by EAs, and we already have take in pandas.api.extensions, so perhaps we put it with take?

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 18, 2019

I would put it like take in api.extensions

@TomAugspurger
Copy link
Contributor Author

Added a couple base tests for ExtensionArray.__getitem__(booleanarray), and updated our EAs to pass.

@TomAugspurger
Copy link
Contributor Author

Fixed the mypy issues.

On performance, a BooleanArray takes about about 1.50x as long as an ndarray for a 10,000 element array

In [5]: s = pd.Series(np.arange(10000))
   ...:
   ...: m1 = np.zeros(len(s), dtype="bool")
   ...: m2 = pd.array(m1, dtype="boolean")

In [6]: %timeit s[m1]
270 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: %timeit s[m2]
391 µs ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: 391/270
Out[8]: 1.4481481481481482

I suspect this is do to the additional .isna().any() check we need for BooleanArray, though there may be an unnecessary copy.

Actually... we're doing two isna().any() checks, one in is_bool_indexer and one in check_bool_array_indexer. I didn't realize that is_bool_indexer also converted to an ndarray and checked for missing values. I'm guessing we'll want to combine those checks.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Dec 19, 2019

Edit: reverted this for now. Will investigate more later, but probably out of scope for this PR.

Pushed a perf / bugfix update, will see if it breaks things. I restructured is_bool_indexer to not do NA checking. But as I write this, I wonder why we even have is_bool_indexer? I'd rather just use is_bool_dtype and check_bool_indexer. Will look into that now.

Anyway, that turned up a bug in DatetimeArray.__getitem__ which has been fixed and documented.

I'm probably going to stop looking at perf things now, but there's likely more to be done. We're still about 1.3x as long as ndarray, but only 8% of the time is spent in check_array doing the isna().any(), which should be the extent of the overhead. We're leaving a lot on the table I think: #30349

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me

pandas/tests/extension/base/getitem.py Outdated Show resolved Hide resolved
pandas/tests/extension/base/getitem.py Outdated Show resolved Hide resolved
pandas/core/common.py Outdated Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor Author

@jorisvandenbossche can you take a look at 21fd589? That's updated DecimalArray.__getitem__ to handle boolean masks, and I think is representative of what 3rd party arrays would need to do.

If we're OK with that, I'll add check_bool_array_indexer to the public API. I'm not worried about its implementation changing. I'm still not sure where it should go though, but probably api.extensions.

@jreback
Copy link
Contributor

jreback commented Jan 2, 2020

we have api.indexers now -
seems like a good place

@TomAugspurger
Copy link
Contributor Author

Yeah, just saw that. It seems reasonable.

@TomAugspurger
Copy link
Contributor Author

All green.

@jreback jreback merged commit 59b431f into pandas-dev:master Jan 3, 2020
@jreback
Copy link
Contributor

jreback commented Jan 3, 2020

very nice @TomAugspurger

lots of testing!

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger thanks, that function looks good!

On the location of pandas.api.indexers, no strong opinion on putting it there in se, but I am not sure this new function necessarily fits together with a rolling-window functionality (but then I also find api.indexers a strange name for rolling window functionality that has nothing to do with indexing, will comment about that elsewhere)


See Also
--------
api.extensions.is_bool_indexer : Check if `key` is a boolean indexer.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not exist now (will do a PR)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> pd.api.extensions.check_bool_array_indexer(arr, mask)
Traceback (most recent call last):
...
ValueError: cannot convert to bool numpy array in presence of missing values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we try to improve this error message? We know this is called in terms of indexing, and then something like "Cannot do boolean indexing with missing values, use fillna(True/False) ..." would be a much more useful error message than the message about conversion to numpy array.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 6, 2020

While implementing support for this in GeoPandas, I ran into an issue with your example implementation for DecimalArray, basically being:

        def __getitem__(self, item):
            ....
            # array, slice.
            if pd.api.types.is_list_like(item):
                if not pd.api.types.is_array_like(item):
                    item = pd.array(item)
                dtype = item.dtype
                if pd.api.types.is_bool_dtype(dtype):
                    item = check_bool_array_indexer(self, item)
            return type(self)(self._data[item])

The problem with the above is that this also converts integers to IntegerArray, which is then not supported in the numpy indexing call (not fully sure why this doesn't come up in the decimal array's tests).
Opened #30738 for this about indexing with IntegerArray in general.

That issue is maybe not a blocker for 1.0 (it also doesn't work on 0.25, and I can workaround this in GeoPandas), but we should think a moment if this might change how we want to expose an array indexer check function.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 6, 2020

not fully sure why this doesn't come up in the decimal array's tests).

Because you added a fix for IntegerArray in DecimalArray.getitem ;)

@@ -766,7 +769,9 @@ def __getitem__(self, key):
else:
key = np.asarray(key)

if com.is_bool_indexer(key) and len(self) == len(key):
if com.is_bool_indexer(key):
key = check_bool_indexer(self, key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this check_bool_indexer when in all the others its check_bool_array_indexer? found this bc mypy is complaining about that the first arg should be an Index

@jorisvandenbossche jorisvandenbossche mentioned this pull request Dec 3, 2021
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants