DOC/TST: Indexing with NA raises #30308
if frame:
    s = s.to_frame()
mask = pd.array([True, False, None], dtype="boolean")
Can we parametrize this? IIRC nulls_fixture wasn't appropriate, but maybe we need a nulls_scalar_fixture for these purposes.
What would be parametrized? The boolean array with missing values?
This indeed seems to already work for Series etc., but not yet for the EAs themselves:
In [22]: arr = pd.array([True, False, True], dtype="boolean")
In [23]: arr[arr]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-23-9b15d846534b> in <module>
----> 1 arr[arr]
~/scipy/pandas/pandas/core/arrays/boolean.py in __getitem__(self, item)
295 return self.dtype.na_value
296 return self._data[item]
--> 297 return type(self)(self._data[item], self._mask[item])
298
299 def to_numpy(self, dtype=None, copy=False, na_value: "Scalar" = libmissing.NA):
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
So we need to add a check there that converts a BooleanArray to a numpy bool array if possible, and otherwise raises an informative error (we should already have a utility for this somewhere, since the indexing machinery does it).
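A minimal sketch of the kind of check described above, assuming a standalone helper. The name `to_bool_ndarray` and the error text are hypothetical and are not the utility that pandas ships:

```python
import numpy as np
import pandas as pd


def to_bool_ndarray(mask):
    """Hypothetical helper: return `mask` as a NumPy bool array, or raise."""
    # Normalise lists, ndarrays and BooleanArray to the nullable boolean dtype.
    mask = pd.array(mask, dtype="boolean")
    if mask.isna().any():
        # Error text is illustrative only, not the message pandas emits.
        raise ValueError("cannot do boolean indexing with missing values in the mask")
    return mask.to_numpy(dtype=bool)


data = np.array([10, 20, 30])
clean = pd.array([True, False, True], dtype="boolean")
selected = data[to_bool_ndarray(clean)]

try:
    to_bool_ndarray(pd.array([True, False, None], dtype="boolean"))
    error = None
except ValueError as exc:
    error = str(exc)
```

With a mask free of missing values the result is a plain bool ndarray that NumPy accepts as an index; with a missing value the helper fails early with a message about indexing rather than about array conversion.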
Good point. I think I'll add a base test for this, since all EAs should probably be able to handle it. Or is that too opinionated?
It's |
It will probably give failures for external projects that implement EAs. But maybe that is good, as ideally they would indeed handle this.
That sounds like a good idea to me!
Do people have an initial preference on where to put this? I was thinking a new module in |
I would put it like |
Added a couple base tests for |
Fixed the mypy issues. On performance, a BooleanArray takes about 1.45x as long as an ndarray for a 10,000 element array:
In [5]: s = pd.Series(np.arange(10000))
   ...:
   ...: m1 = np.zeros(len(s), dtype="bool")
   ...: m2 = pd.array(m1, dtype="boolean")
In [6]: %timeit s[m1]
270 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: %timeit s[m2]
391 µs ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: 391/270
Out[8]: 1.4481481481481482
I suspect this is due to the additional
Actually... we're doing two
Edit: reverted this for now. Will investigate more later, but probably out of scope for this PR.
Pushed a perf / bugfix update, will see if it breaks things. I restructured
Anyway, that turned up a bug in
I'm probably going to stop looking at perf things now, but there's likely more to be done. We're still about 1.3x as long as ndarray, but only 8% of the time is spent in
This looks good to me
@jorisvandenbossche can you take a look at 21fd589? That's updated If we're OK with that, I'll add |
we have api.indexers now - |
Yeah, just saw that. It seems reasonable. |
All green. |
very nice @TomAugspurger lots of testing! |
@TomAugspurger thanks, that function looks good!
On the location of pandas.api.indexers, no strong opinion on putting it there per se, but I am not sure this new function necessarily fits together with rolling-window functionality (but then I also find api.indexers a strange name for rolling-window functionality that has nothing to do with indexing; I will comment about that elsewhere).
See Also
--------
api.extensions.is_bool_indexer : Check if `key` is a boolean indexer.
this does not exist now (will do a PR)
>>> pd.api.extensions.check_bool_array_indexer(arr, mask)
Traceback (most recent call last):
...
ValueError: cannot convert to bool numpy array in presence of missing values
Should we try to improve this error message? We know this is called in terms of indexing, and then something like "Cannot do boolean indexing with missing values, use fillna(True/False) ..." would be a much more useful error message than the message about conversion to numpy array.
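For illustration, the `fillna(True/False)` suggestion in that proposed message could look like the sketch below on the caller's side. The variable names are made up, and the explicit `to_numpy(dtype=bool)` step is an assumption added so the example does not depend on how a given pandas version treats missing values in a mask:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
mask = pd.array([True, False, None], dtype="boolean")

# Decide explicitly what a missing entry in the mask should mean,
# then hand pandas a plain NumPy bool array.
keep_if_missing = mask.fillna(True).to_numpy(dtype=bool)
drop_if_missing = mask.fillna(False).to_numpy(dtype=bool)

kept = s[keep_if_missing].tolist()
dropped = s[drop_if_missing].tolist()
```

Filling with `True` keeps the row whose mask entry is missing, while filling with `False` drops it, so the choice is visible at the call site.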
While implementing support for this in GeoPandas, I ran into an issue with your example implementation for DecimalArray, basically being:
The problem with the above is that this also converts integers to IntegerArray, which is then not supported in the numpy indexing call (not fully sure why this doesn't come up in the decimal array's tests). That issue is maybe not a blocker for 1.0 (it also doesn't work on 0.25, and I can work around this in GeoPandas), but we should think for a moment about whether this might change how we want to expose an array indexer check function.
Because you added a fix for IntegerArray in DecimalArray.__getitem__ ;)
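One way an extension array's `__getitem__` might guard against this is to turn list-like keys back into plain NumPy arrays before indexing. This is only a sketch under assumptions: `normalize_key` is a hypothetical name, it is not the DecimalArray or GeoPandas code, and it relies on `to_numpy` raising when the key contains missing values:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_bool_dtype, is_integer_dtype, is_list_like


def normalize_key(key):
    """Hypothetical helper: make `key` safe to pass to ndarray.__getitem__."""
    if not is_list_like(key):
        return key  # scalars and slices pass through unchanged
    arr = pd.array(key)  # may infer a nullable boolean or integer array
    if is_bool_dtype(arr.dtype):
        return arr.to_numpy(dtype=bool)  # raises if the mask has missing values
    if is_integer_dtype(arr.dtype):
        return arr.to_numpy(dtype="int64")  # back to a plain positional indexer
    raise IndexError("unsupported indexer dtype: {}".format(arr.dtype))


values = np.array([1.5, 2.5, 3.5])
by_position = values[normalize_key([0, 2])].tolist()
by_mask = values[normalize_key(pd.array([True, False, True], dtype="boolean"))].tolist()
by_slice = values[normalize_key(slice(0, 2))].tolist()
```

Dispatching on the inferred dtype keeps integer keys as integer ndarrays, so only boolean masks go through the missing-value check.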
@@ -766,7 +769,9 @@ def __getitem__(self, key):
         else:
             key = np.asarray(key)

-        if com.is_bool_indexer(key) and len(self) == len(key):
+        if com.is_bool_indexer(key):
+            key = check_bool_indexer(self, key)
Why is this check_bool_indexer when in all the others it's check_bool_array_indexer? Found this because mypy is complaining that the first arg should be an Index.
xref #29556, #28778
We're already doing the right thing on master. This just documents that behavior, and adds a handful of tests.
I'm not sure if there are existing tests I should be parametrizing. I only found a couple in tests/indexing/, which I parametrized over bool and boolean dtype. Will add more if I've missed any. In the meantime, I've written new tests.