DOC/TST: Indexing with NA raises #30308

Merged
merged 32 commits into from
Jan 3, 2020
Changes from 2 commits
Commits
492f904
DOC/TST: Indexing with NA raises
TomAugspurger Dec 16, 2019
6444aa0
Merge remote-tracking branch 'upstream/master' into na-indexing-raises
TomAugspurger Dec 18, 2019
53f4f63
Handle BooleanArray in all EAs
TomAugspurger Dec 18, 2019
3bbf868
update
TomAugspurger Dec 18, 2019
a5ac457
fixups
TomAugspurger Dec 18, 2019
0dfe761
type
TomAugspurger Dec 18, 2019
dac111d
fix benchmark
TomAugspurger Dec 18, 2019
d1f08d9
fixup
TomAugspurger Dec 18, 2019
3dd59ca
typo
TomAugspurger Dec 18, 2019
151bdfe
updates
TomAugspurger Dec 19, 2019
d57b0ac
Revert "updates"
TomAugspurger Dec 19, 2019
36be0f6
examples
TomAugspurger Dec 20, 2019
7bd6c2f
restore datetime fix
TomAugspurger Dec 20, 2019
c5f3afb
Merge remote-tracking branch 'upstream/master' into na-indexing-raises
TomAugspurger Dec 20, 2019
76bb6ce
Merge branch 'master' of https://github.com/pandas-dev/pandas into na…
TomAugspurger Dec 28, 2019
505112e
update error message
TomAugspurger Dec 28, 2019
c73ae8e
checks
TomAugspurger Dec 28, 2019
3efe359
Merge remote-tracking branch 'upstream/master' into na-indexing-raises
TomAugspurger Dec 30, 2019
f94483f
update for error message
TomAugspurger Dec 30, 2019
953938d
Merge remote-tracking branch 'upstream/master' into na-indexing-raises
TomAugspurger Dec 30, 2019
8b1e567
update isort
TomAugspurger Dec 30, 2019
f317c64
isort
TomAugspurger Dec 30, 2019
c656292
fixup
TomAugspurger Dec 30, 2019
d4f0adc
Merge branch 'master' of https://github.com/pandas-dev/pandas into na…
TomAugspurger Dec 31, 2019
37ea95e
fixup
TomAugspurger Dec 31, 2019
816a47c
Merge remote-tracking branch 'upstream/master' into na-indexing-raises
TomAugspurger Jan 2, 2020
21fd589
update arrayo
TomAugspurger Jan 2, 2020
3637070
doc
TomAugspurger Jan 2, 2020
61599f2
integer
TomAugspurger Jan 2, 2020
6a0eda6
Merge remote-tracking branch 'upstream/master' into na-indexing-raises
TomAugspurger Jan 2, 2020
e622826
fixup
TomAugspurger Jan 2, 2020
5004d91
fixup
TomAugspurger Jan 2, 2020
4 changes: 4 additions & 0 deletions asv_bench/benchmarks/indexing.py
@@ -131,6 +131,7 @@ def setup(self):
        self.col_scalar = columns[10]
        self.bool_indexer = self.df[self.col_scalar] > 0
        self.bool_obj_indexer = self.bool_indexer.astype(object)
        self.boolean_indexer = (self.df[self.col_scalar] > 0).astype("boolean")

    def time_loc(self):
        self.df.loc[self.idx_scalar, self.col_scalar]
@@ -144,6 +145,9 @@ def time_boolean_rows(self):
    def time_boolean_rows_object(self):
        self.df[self.bool_obj_indexer]

    def time_boolean_rows_boolean(self):
        self.df[self.boolean_indexer]


class DataFrameNumericIndexing:
    def setup(self):
23 changes: 23 additions & 0 deletions doc/source/user_guide/boolean.rst
@@ -14,6 +14,29 @@ Nullable Boolean Data Type

.. versionadded:: 1.0.0


.. _boolean.indexing:

Indexing with NA values
-----------------------

pandas does not allow indexing with NA values. Attempting to do so
will raise a ``ValueError``.

.. ipython:: python
   :okexcept:

   s = pd.Series([1, 2, 3])
   mask = pd.array([True, False, None])
   s[mask]

Fill the missing values explicitly with ``True`` or ``False`` before using
the array as a mask.

.. ipython:: python

   s[mask.fillna(False)]
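
As an illustration of the documented pattern, here is a minimal sketch that fills the NA entries both ways before indexing. It assumes a pandas version that provides the nullable `"boolean"` dtype (1.0 or later). It only indexes with filled masks, so it does not depend on how the installed pandas treats a mask that still contains NA.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
mask = pd.array([True, False, None], dtype="boolean")

# Decide explicitly how the missing entry should behave before indexing.
na_as_false = s[mask.fillna(False)]  # the NA position is dropped
na_as_true = s[mask.fillna(True)]    # the NA position is kept

print(na_as_false.tolist())
print(na_as_true.tolist())
```

Choosing the fill value is a modelling decision: `False` silently drops rows with unknown membership, while `True` keeps them.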

.. _boolean.kleene:

Kleene Logical Operations
13 changes: 10 additions & 3 deletions pandas/core/arrays/boolean.py
@@ -289,6 +289,13 @@ def _from_factorized(cls, values, original: "BooleanArray"):
    def _formatter(self, boxed=False):
        return str

    @property
    def _hasnans(self):
        # Note: this is expensive right now! The hope is that we can
        # make this faster by having an optional mask, but not have to change
        # source code using it..
Review comment (Contributor):
this could easily be cached (and then updated on setitem / other mutation)

        return self._mask.any()

    def __getitem__(self, item):
        if is_integer(item):
            if self._mask[item]:
@@ -311,7 +318,7 @@ def _coerce_to_ndarray(self, dtype=None, na_value: "Scalar" = libmissing.NA):
        if dtype is None:
            dtype = object
        if is_bool_dtype(dtype):
-           if not self.isna().any():
+           if not self._hasnans:
                return self._data
            else:
                raise ValueError(
@@ -485,7 +492,7 @@ def astype(self, dtype, copy=True):

        if is_bool_dtype(dtype):
            # astype_nansafe converts np.nan to True
-           if self.isna().any():
+           if self._hasnans:
                raise ValueError("cannot convert float NaN to bool")
            else:
                return self._data.astype(dtype, copy=copy)
@@ -497,7 +504,7 @@ def astype(self, dtype, copy=True):
            )
        # for integer, error if there are missing values
        if is_integer_dtype(dtype):
-           if self.isna().any():
+           if self._hasnans:
                raise ValueError("cannot convert NA to integer")
        # for float dtype, ensure we use np.nan before casting (numpy cannot
        # deal with pd.NA)
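
The review comment on `_hasnans` suggests caching the result and refreshing it on mutation. Below is a minimal sketch of that idea. The `MaskedBools` class is hypothetical, built on plain Python lists, and is not how pandas' `BooleanArray` is implemented. It only shows the invalidate-on-`__setitem__` pattern.

```python
from typing import List, Optional


class MaskedBools:
    """Hypothetical masked boolean container (not pandas' BooleanArray)."""

    def __init__(self, values: List[Optional[bool]]) -> None:
        # Store the data and a parallel mask marking missing entries.
        self._data = [bool(v) if v is not None else False for v in values]
        self._mask = [v is None for v in values]
        self._hasnans_cache: Optional[bool] = None

    @property
    def hasnans(self) -> bool:
        # Scan the mask only when no cached answer is available.
        if self._hasnans_cache is None:
            self._hasnans_cache = any(self._mask)
        return self._hasnans_cache

    def __setitem__(self, index: int, value: Optional[bool]) -> None:
        self._mask[index] = value is None
        self._data[index] = bool(value) if value is not None else False
        # Any mutation may change the answer, so drop the cached value.
        self._hasnans_cache = None


arr = MaskedBools([True, False, None])
print(arr.hasnans)
arr[2] = False
print(arr.hasnans)
```

Dropping the cache on every write is the simplest correct policy. A finer-grained update (for example, setting the cache to `True` directly when a missing value is written) would avoid a rescan, at the cost of more bookkeeping.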
3 changes: 3 additions & 0 deletions pandas/tests/indexing/test_loc.py
@@ -373,6 +373,9 @@ def test_loc_index(self):
        result = df.loc[mask.values]
        tm.assert_frame_equal(result, expected)

        result = df.loc[pd.array(mask, dtype="boolean")]
        tm.assert_frame_equal(result, expected)

    def test_loc_general(self):

        df = DataFrame(
79 changes: 79 additions & 0 deletions pandas/tests/indexing/test_na_indexing.py
@@ -0,0 +1,79 @@
import pytest

import pandas as pd
import pandas.util.testing as tm


@pytest.mark.parametrize(
    "values, dtype",
    [
        ([1, 2, 3], "int64"),
        ([1.0, 2.0, 3.0], "float64"),
        (["a", "b", "c"], "object"),
        (["a", "b", "c"], "string"),
        ([1, 2, 3], "datetime64[ns]"),
        ([1, 2, 3], "datetime64[ns, CET]"),
        ([1, 2, 3], "timedelta64[ns]"),
        (["2000", "2001", "2002"], "Period[D]"),
        ([1, 0, 3], "Sparse"),
        ([pd.Interval(0, 1), pd.Interval(1, 2), pd.Interval(3, 4)], "interval"),
    ],
)
@pytest.mark.parametrize(
    "mask", [[True, False, False], [True, True, True], [False, False, False]]
)
@pytest.mark.parametrize("box_mask", [True, False])
@pytest.mark.parametrize("frame", [True, False])
def test_series_mask_boolean(values, dtype, mask, box_mask, frame):
    ser = pd.Series(values, dtype=dtype, index=["a", "b", "c"])
    if frame:
        ser = ser.to_frame()
    mask = pd.array(mask, dtype="boolean")
    if box_mask:
        mask = pd.Series(mask, index=ser.index)

    expected = ser[mask.astype("bool")]

    result = ser[mask]
    tm.assert_equal(result, expected)

    if not box_mask:
        # Series.iloc[Series[bool]] isn't allowed
        result = ser.iloc[mask]
        tm.assert_equal(result, expected)

    result = ser.loc[mask]
    tm.assert_equal(result, expected)

    # empty
    mask = mask[:0]
    ser = ser.iloc[:0]
    expected = ser[mask.astype("bool")]
    result = ser[mask]
    tm.assert_equal(result, expected)

    if not box_mask:
        # Series.iloc[Series[bool]] isn't allowed
        result = ser.iloc[mask]
        tm.assert_equal(result, expected)

    result = ser.loc[mask]
    tm.assert_equal(result, expected)


@pytest.mark.parametrize("frame", [True, False])
def test_indexing_with_na_raises(frame):
    s = pd.Series([1, 2, 3], name="name")

    if frame:
        s = s.to_frame()
    mask = pd.array([True, False, None], dtype="boolean")
Review comment (Member):
Can we parametrize this? IIRC nulls_fixture wasn't appropriate but maybe need a nulls_scalar_fixture for these purposes

Reply (Contributor Author):
What would be parametrized? The boolean array with missing values?

    match = "cannot index with vector containing NA / NaN values"
    with pytest.raises(ValueError, match=match):
        s[mask]

    with pytest.raises(ValueError, match=match):
        s.loc[mask]

    with pytest.raises(ValueError, match=match):
        s.iloc[mask]
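
One possible reading of the parametrization question in the review thread is to vary the null scalar used to build the mask. The sketch below is only an illustration of that reading: `NULL_SCALARS` and the test name are hypothetical and are not fixtures or tests from pandas. It checks that each null scalar ends up as a missing entry in the boolean mask, and deliberately makes no claim about what indexing with such a mask does.

```python
import numpy as np
import pandas as pd
import pytest

# Hypothetical list of null scalars; not an existing pandas fixture.
NULL_SCALARS = [None, np.nan, pd.NA]


@pytest.mark.parametrize("null", NULL_SCALARS)
def test_null_scalar_becomes_na_in_boolean_mask(null):
    mask = pd.array([True, False, null], dtype="boolean")
    # Whatever null scalar was passed in, the last position is masked.
    assert mask.isna().tolist() == [False, False, True]
```

The same `null` parameter could then feed the `s[mask]`, `s.loc[mask]` and `s.iloc[mask]` checks, so that each null scalar is exercised against every indexer.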