-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: when Index is numeric and indexer is boolean (#16877) #17343
BUG: when Index is numeric and indexer is boolean (#16877) #17343
Conversation
47340c3
to
44ffaf1
Compare
Codecov Report
@@ Coverage Diff @@
## master #17343 +/- ##
==========================================
- Coverage 91.03% 91.02% -0.02%
==========================================
Files 162 162
Lines 49567 49569 +2
==========================================
- Hits 45125 45118 -7
- Misses 4442 4451 +9
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #17343 +/- ##
==========================================
- Coverage 91.23% 91.21% -0.02%
==========================================
Files 163 163
Lines 49808 49810 +2
==========================================
- Hits 45442 45435 -7
- Misses 4366 4375 +9
Continue to review full report at Codecov.
|
I'm a bit in favour of raising an error ( But more in general, I wonder whether a In [2]: pd.MultiIndex.from_product([[1,2], [True, False]]).levels[1].dtype
Out[2]: dtype('O') but that's probably just coherent with BooleanIndex not existing ( Anyway, I'm afraid the current proposal is missing the following: In [2]: pd.Series([1,2,3], index=[0, 1, False]).drop([False])
Out[2]:
1 2
dtype: int64 |
We generally try to avoid going from no error message / warning to direct error message because for downstream users, that could be too drastic of a change. A warning should suffice for now I would think. |
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -358,6 +358,7 @@ Indexing | |||
- Allow unicode empty strings as placeholders in multilevel columns in Python 2 (:issue:`17099`) | |||
- Bug in ``.iloc`` when used with inplace addition or assignment and an int indexer on a ``MultiIndex`` causing the wrong indexes to be read from and written to (:issue:`17148`) | |||
- Bug in ``.isin()`` in which checking membership in empty ``Series`` objects raised an error (:issue:`16991`) | |||
- Warns when a numeric ``Index``'s ``get_indexer`` is passed a boolean indexer (:issue:`16877`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you might be able to use func
here ? @jorisvandenbossche @jreback
pandas/tests/indexes/test_base.py
Outdated
tm.assert_numpy_array_equal(actual, expected, check_dtype=False) | ||
|
||
numeric_idx = pd.Index(range(4)) | ||
with pytest.warns(UserWarning): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use our own tm.assert_produces_warning
instead from pandas.util.testing
.
pandas/core/indexes/base.py
Outdated
if target.is_boolean() and self.is_numeric(): | ||
warnings.warn('Index is numeric and indexer is boolean, ' | ||
'possible unintended conversion from boolean ' | ||
'to numeric', UserWarning, stacklevel=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can rephrase the prose here. Proper grammar and slightly more formality in the writing would be good here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'Passing a boolean indexer to an Index containing numeric indices may cause the indexer to behave as being of numeric type'
@gfyoung Thanks for the feedback. Would the above work? Any pointers would be welcome.
OK: let me just emphasize that in this specific case it's very likely that anyone indexing a numeric index with boolean labels is doing something wrong/assuming a behaviour (e.g. in |
i am ok in raising in this case |
@toobaz The example index [0, 1, False] is a duplicate index (0 and False are equivalent), which I described as not being covered by this PR. However, you are right that the PR does not cover the case with unique mixed-integer indices (e.g. [0, True, 2]). I will add a "self.holds_integer()" to the condition to cover this, since the function is already implemented. This would not cover "mixed-float" (e.g. [1.5, True]), which is 'mixed' inferred type in pandas but 'mixed' could also be ['ab', True] in which case it would be an incorrect warning. So I'll leave that one out. Edit: I will only add a self.holds_integer() condition in case it's a warning. Since a user with [0, True, 2] index should be able to drop the True without an Error. Even if that user may drop the 0 by passing [False]. |
Actually, thinking to my example again I realize that you are right... but that this PR might not be the fix we need. My example should not raise an error (neither a warning); however, it should give a different result (not dropping the 0). So I'm afraid the fix is a bit deeper (wherever booleans are automatically casted to integers). That said, waiting for a better fix, your PR certainly won't harm. |
Just for context: In [2]: pd.Series([1,2,3], index=[0, 1, 's']).loc[True]
Out[2]: 2
In [3]: pd.Series([1,2,3], index=[0, 1, True]).loc[True]
Out[3]:
1 2
True 3
dtype: int64
In [4]: pd.Series([1,2,3], index=[0, 1, 2]).loc[True]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/home/nobackup/repo/pandas/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2457 try:
-> 2458 return self._engine.get_loc(key)
2459 except KeyError:
/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5272)()
128
--> 129 cpdef get_loc(self, object val):
130 if is_definitely_invalid_key(val):
/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5077)()
147
--> 148 self._check_type(val)
149
/home/nobackup/repo/pandas/pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type (pandas/_libs/index.c:16899)()
135 if util.is_bool_object(val):
--> 136 raise KeyError(val)
137 elif util.is_float_object(val):
KeyError: True
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
/home/nobackup/repo/pandas/pandas/core/indexing.py in _get_label(self, label, axis)
129 try:
--> 130 return self.obj._xs(label, axis=axis)
131 except:
/home/nobackup/repo/pandas/pandas/core/generic.py in xs(self, key, axis, level, drop_level)
2242 else:
-> 2243 loc = self.index.get_loc(key)
2244
/home/nobackup/repo/pandas/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2459 except KeyError:
-> 2460 return self._engine.get_loc(self._maybe_cast_indexer(key))
2461
/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5272)()
128
--> 129 cpdef get_loc(self, object val):
130 if is_definitely_invalid_key(val):
/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5077)()
147
--> 148 self._check_type(val)
149
/home/nobackup/repo/pandas/pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type (pandas/_libs/index.c:16899)()
135 if util.is_bool_object(val):
--> 136 raise KeyError(val)
137 elif util.is_float_object(val):
KeyError: True
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-4-b1a86c8195a6> in <module>()
----> 1 pd.Series([1,2,3], index=[0, 1, 2]).loc[True]
/home/nobackup/repo/pandas/pandas/core/indexing.py in __getitem__(self, key)
1333 else:
1334 key = com._apply_if_callable(key, self.obj)
-> 1335 return self._getitem_axis(key, axis=0)
1336
1337 def _is_scalar_access(self, key):
/home/nobackup/repo/pandas/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1557 # fall thru to straight lookup
1558 self._has_valid_type(key, axis)
-> 1559 return self._get_label(key, axis=axis)
1560
1561
/home/nobackup/repo/pandas/pandas/core/indexing.py in _get_label(self, label, axis)
130 return self.obj._xs(label, axis=axis)
131 except:
--> 132 return self.obj[label]
133 elif isinstance(label, tuple) and isinstance(label[axis], slice):
134 raise IndexingError('no slices here, handle elsewhere')
/home/nobackup/repo/pandas/pandas/core/series.py in __getitem__(self, key)
617 key = com._apply_if_callable(key, self)
618 try:
--> 619 result = self.index.get_value(self, key)
620
621 if not is_scalar(result):
/home/nobackup/repo/pandas/pandas/core/indexes/base.py in get_value(self, series, key)
2491 try:
2492 return self._engine.get_value(s, k,
-> 2493 tz=getattr(series.dtype, 'tz', None))
2494 except KeyError as e1:
2495 if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4396)()
93 return val in self.mapping
94
---> 95 cpdef get_value(self, ndarray arr, object key, object tz=None):
96 """
97 arr : 1-dimensional ndarray
/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4079)()
101 void* data_ptr
102
--> 103 loc = self.get_loc(key)
104 if PySlice_Check(loc) or cnp.PyArray_Check(loc):
105 return arr[loc]
/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5077)()
146 return self._get_loc_duplicates(val)
147
--> 148 self._check_type(val)
149
150 try:
/home/nobackup/repo/pandas/pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type (pandas/_libs/index.c:16899)()
134 hash(val)
135 if util.is_bool_object(val):
--> 136 raise KeyError(val)
137 elif util.is_float_object(val):
138 raise KeyError(val)
KeyError: True |
44ffaf1
to
661bfb6
Compare
Hello @andrejonasson! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on September 25, 2017 at 14:07 Hours UTC |
661bfb6
to
4a10a11
Compare
@toobaz I agree. This PR can prevent cases like the issues referenced until the core of the problem is solved. I have changed it to correctly return an array of repeated -1 when the Index is numeric and the indexer is boolean. |
0aaaad8
to
085b9a0
Compare
>>> a = pd.Index([0,1,2,3])
>>> a.drop([True,False])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/andredotj/Projects/pandas/pandas/pandas/core/indexes/base.py", line 3641, in drop
labels[mask])
ValueError: labels [ True False] not contained in axis
>>> a.difference([True, False])
Int64Index([0, 1, 2, 3], dtype='int64') New update let's the client functions handle the missing values as they see fit e.g. For the example in #16877 >>> df = pd.DataFrame( data = {
... 'acol' : np.arange(4),
... 'bcol' : 2*np.arange(4)
... })
>>> df.drop(df.bcol > 2, axis=0, inplace=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/andredotj/Projects/pandas/pandas/pandas/core/generic.py", line 2402, in drop
new_axis = axis.drop(labels, errors=errors)
File "/home/andredotj/Projects/pandas/pandas/pandas/core/indexes/base.py", line 3641, in drop
labels[mask])
ValueError: labels [False False True True] not contained in axis It correctly raises when passed the boolean target. |
Just wanted to say "shouldn't this rather return -1 than raising an error", but I see you already did that!
@toobaz I am not sure we will ever be able to solve this. When you have an 'object' index, I think we can only rely on the python object's equality / hashing, and True and 1 are equal for Python (only their identity is different, but we don't rely on that) |
Is there anything that should be amended in the updated PR? Also, do I just push again to get AppVeyor to pass or will it automatically rebuild at some point? Thanks! |
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -358,6 +358,7 @@ Indexing | |||
- Allow unicode empty strings as placeholders in multilevel columns in Python 2 (:issue:`17099`) | |||
- Bug in ``.iloc`` when used with inplace addition or assignment and an int indexer on a ``MultiIndex`` causing the wrong indexes to be read from and written to (:issue:`17148`) | |||
- Bug in ``.isin()`` in which checking membership in empty ``Series`` objects raised an error (:issue:`16991`) | |||
- When a numeric ``Index``'s :func:`get_indexer` is passed a boolean target it now returns the correct indexer (:issue:`16877`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback @jorisvandenbossche : I don't quite remember the formatting for :func:
and stuff, but somehow the beginning of the whatsnew
looks incorrectly formatted. However, correct me if I'm wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would actually re-write this message so it is more user-oriented (the typical user does not know Index.get_indexer). In the end it is about boolean values that now raise when indexing a numeric index or dropping from such an index?
@gfyoung for :func:
, to have it working, in this case you would need to write :func:`Index.get_indexer`
(but as I said, I wouldn't add that, as typical user does not need to use that function)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche : Ah, gotcha, thanks for clarifying!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, this issue is all about passing a boolean indexer to DataFrame.drop()
, so say that.
pandas/tests/indexes/test_base.py
Outdated
expected = np.array([0, 0, 1]) | ||
tm.assert_numpy_array_equal(actual, expected, check_dtype=False) | ||
|
||
def test_get_indexer_numeric_index_boolean_target(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reference issue number for both of your tests.
pandas/tests/indexes/test_base.py
Outdated
boolean_idx = pd.Index([True, False]) | ||
actual = boolean_idx.get_indexer([True, True, False]) | ||
expected = np.array([0, 0, 1]) | ||
tm.assert_numpy_array_equal(actual, expected, check_dtype=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we check dtype
here (we generally prefer to)? I think if you set the dtype
of expected
to be np.intp
, you should be okay. Same for your test below.
@andrejonasson : Don't worry about |
pandas/tests/indexes/test_base.py
Outdated
@@ -1131,6 +1131,18 @@ def test_get_indexer_strings(self): | |||
with pytest.raises(TypeError): | |||
idx.get_indexer(['a', 'b', 'c', 'd'], method='pad', tolerance=2) | |||
|
|||
def test_get_indexer_boolean_index_boolean_target(self): | |||
boolean_idx = pd.Index([True, False]) | |||
actual = boolean_idx.get_indexer([True, True, False]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use result =
rather than actual.
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -358,6 +358,7 @@ Indexing | |||
- Allow unicode empty strings as placeholders in multilevel columns in Python 2 (:issue:`17099`) | |||
- Bug in ``.iloc`` when used with inplace addition or assignment and an int indexer on a ``MultiIndex`` causing the wrong indexes to be read from and written to (:issue:`17148`) | |||
- Bug in ``.isin()`` in which checking membership in empty ``Series`` objects raised an error (:issue:`16991`) | |||
- When a numeric ``Index``'s :func:`get_indexer` is passed a boolean target it now returns the correct indexer (:issue:`16877`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, this issue is all about passing a boolean indexer to DataFrame.drop()
, so say that.
pandas/tests/indexes/test_base.py
Outdated
@@ -1131,6 +1131,18 @@ def test_get_indexer_strings(self): | |||
with pytest.raises(TypeError): | |||
idx.get_indexer(['a', 'b', 'c', 'd'], method='pad', tolerance=2) | |||
|
|||
def test_get_indexer_boolean_index_boolean_target(self): | |||
boolean_idx = pd.Index([True, False]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number here
pandas/tests/indexes/test_base.py
Outdated
numeric_idx = pd.Index(range(4)) | ||
actual = numeric_idx.get_indexer([True, False, True]) | ||
expected = np.repeat(-1, 3) | ||
tm.assert_numpy_array_equal(actual, expected, check_dtype=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add a test that directly exercises .drop()
as in the original issue
085b9a0
to
5f877fa
Compare
Hi, thanks for the feedback everyone. I removed I added a direct test to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some comments and pls rebase
pandas/core/indexes/base.py
Outdated
@@ -2589,6 +2589,9 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None): | |||
if tolerance is not None: | |||
tolerance = self._convert_tolerance(tolerance) | |||
|
|||
if target.is_boolean() and self.is_numeric(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment here why
pandas/tests/indexes/test_base.py
Outdated
# GH 16877 | ||
numeric_idx = pd.Index(range(4)) | ||
result = numeric_idx.get_indexer([True, False, True]) | ||
expected = np.repeat(-1, 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to be intp_
dtype
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -360,6 +360,7 @@ Indexing | |||
- Bug in ``.iloc`` when used with inplace addition or assignment and an int indexer on a ``MultiIndex`` causing the wrong indexes to be read from and written to (:issue:`17148`) | |||
- Bug in ``.isin()`` in which checking membership in empty ``Series`` objects raised an error (:issue:`16991`) | |||
- Bug in ``CategoricalIndex`` reindexing in which specified indices containing duplicates were not being respected (:issue:`17323`) | |||
- Bug in :func:`DataFrame.drop` caused boolean labels ``False`` and ``True`` to be treated as labels 0 and 1 respectively when dropping indices from a numeric index (:issue:`16877`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add this will now raise.and move to api other changes section (even though its actually a bug fix, its highlited a bit more there).
5f877fa
to
c0a2156
Compare
c0a2156
to
5b22400
Compare
@jreback Is there anything else that needs editing? Also is there some way to make sure I don't get merge conflicts due to the whatsnew line I've added (keeps on being the reason for my conflicts)? |
Really not much you can do here. People will modify the |
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -433,7 +433,7 @@ Other API Changes | |||
- :class:`Period` is now immutable, and will now raise an ``AttributeError`` when a user tries to assign a new value to the ``ordinal`` or ``freq`` attributes (:issue:`17116`). | |||
- :func:`to_datetime` when passed a tz-aware ``origin=`` kwarg will now raise a more informative ``ValueError`` rather than a ``TypeError`` (:issue:`16842`) | |||
- Renamed non-functional ``index`` to ``index_col`` in :func:`read_stata` to improve API consistency (:issue:`16342`) | |||
|
|||
- Bug in :func:`DataFrame.drop` caused boolean labels ``False`` and ``True`` to be treated as labels 0 and 1 respectively when dropping indices from a numeric index, this will now raise (:issue:`16877`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is grammatically incorrect. "This will now raise" should be its own sentence.
5b22400
to
9c914ff
Compare
can you rebase / update |
9abe2a1
to
6e60ee1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. minor doc-comments
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -481,7 +481,7 @@ Other API Changes | |||
- :class:`Period` is now immutable, and will now raise an ``AttributeError`` when a user tries to assign a new value to the ``ordinal`` or ``freq`` attributes (:issue:`17116`). | |||
- :func:`to_datetime` when passed a tz-aware ``origin=`` kwarg will now raise a more informative ``ValueError`` rather than a ``TypeError`` (:issue:`16842`) | |||
- Renamed non-functional ``index`` to ``index_col`` in :func:`read_stata` to improve API consistency (:issue:`16342`) | |||
|
|||
- Bug in :func:`DataFrame.drop` caused boolean labels ``False`` and ``True`` to be treated as labels 0 and 1 respectively when dropping indices from a numeric index. This will now raise (:issue:`16877`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add what its raising (ValueError)
pandas/core/indexes/base.py
Outdated
@@ -2609,6 +2609,11 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None): | |||
if tolerance is not None: | |||
tolerance = self._convert_tolerance(tolerance) | |||
|
|||
# Treat boolean labels passed to a numeric index as not found. Without | |||
# this fix False and True would be treated as 0 and 1 respectively. | |||
if target.is_boolean() and self.is_numeric(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number here as a comment
6e60ee1
to
0997ade
Compare
There is a windows failure (int32 vs int64) |
pandas/core/indexes/base.py
Outdated
# this fix False and True would be treated as 0 and 1 respectively. | ||
# (GH #16877) | ||
if target.is_boolean() and self.is_numeric(): | ||
return np.repeat(-1, target.size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to put _ensure_patform_int
around this before returning (similar to the final return in this function)
…d target is boolean (pandas-dev#16877)
0997ade
to
1e6c013
Compare
I restarted the failed build (compile error, that I think has been fixed on master). Ping on green. |
@andrejonasson Thanks! |
…d target is boolean (pandas-dev#16877) (pandas-dev#17343)
…d target is boolean (pandas-dev#16877) (pandas-dev#17343)
git diff upstream/master -u -- "*.py" | flake8 --diff
This warns for both tickets:
#6189
#16877
There's one particular flaw: non-unique indices never run get_indexer and deals with dropping in a way where it is difficult to inspect types. I think it should be OK to leave that one for now though.
One of the reasons I chose to warn instead of Error is that running some_numeric_index.difference([True, False]) should possibly still be allowed, while still warning the user that the difference they are getting may just be due to conversion from True to 1 (or 1.0) and False to 0 (or 0.0). This probably is rare enough that an Error might still be the best approach.
The questions remaining to me are:
Is this severe enough that an Error should be raised? What type in this case?
Is the message sufficient?
Thanks.