BUG: boolean indexing error with .drop() #16877

Closed
danparshall opened this issue Jul 10, 2017 · 8 comments · Fixed by #17343
Labels
Indexing Related to indexing on series/frames, not to indexes themselves

@danparshall

Code Sample, a copy-pastable example if possible

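import numpy as np
import pandas as pd

df = pd.DataFrame(data={
    'acol': np.arange(4),
    'bcol': 2 * np.arange(4),
})
# Attempt to drop the rows where bcol > 2 by passing a boolean mask to drop()
df.drop(df.bcol > 2, axis=0, inplace=True)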

print(df)

Expected Output

	acol	bcol
0	0	0
1	1	2

Observed Output

	acol	bcol
2	2	4
3	3	6

Problem description

The anticipated behavior was that rows with bcol > 2 would be dropped. The actual behavior is that the booleans are converted to 0/1 and then treated as index labels, so the rows labeled 0 and/or 1 are dropped and all other rows are kept.

The documentation did not make it clear what was happening.

Solutions might include documentation clarifying that .drop() cannot be used with boolean indexing, or a warning when receiving the (attempted) boolean index.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-573.12.1.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 33.1.1.post20170320
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.1
xarray: 0.9.6
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.5.0a1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.5.3
html5lib: 0.9999999
sqlalchemy: 1.1.11
pymysql: 0.7.9.None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.5
s3fs: 0.1.1
pandas_gbq: None
pandas_datareader: None

@gfyoung
Member

gfyoung commented Jul 11, 2017

From the current docs:

Return new object with labels in requested axis removed.

When you call df.bcol > 2, your labels are Series([False, False, True, True]), which pandas (and Python) would interpret as the labels 0 and 1 on the index.

I know that I am repeating part of what you said, but the documentation IMO seems to align with what it's supposed to do. Nowhere does it say that it filters by a conditional, which is what you were aiming to do.

To perform the filtering that you want, one recommended way is this:

df = df[df.bcol <= 2]

Note that using inplace=True is generally not considered good practice because it makes code more prone to bugs (we will likely deprecate and remove this option at some point).
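For readers who still want to go through .drop(), here is a minimal sketch of two ways to get the expected output from the original report. It assumes the same four-row frame as the code sample; the variable names mask, kept, and dropped are only illustrative. The key point is that .drop() wants labels, so the mask is first turned into the index labels it selects.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"acol": np.arange(4), "bcol": 2 * np.arange(4)})
mask = df.bcol > 2  # boolean Series: False, False, True, True

# Option 1: boolean indexing, keeping the rows where the condition is False.
kept = df[~mask]

# Option 2: convert the mask into index labels, then drop those labels.
dropped = df.drop(df.index[mask])

# Both approaches should leave the same rows (labels 0 and 1).
print(kept.equals(dropped))
print(kept)
```

Neither option uses inplace=True; each returns a new frame and leaves df untouched.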

@jreback
Contributor

jreback commented Jul 11, 2017

This is a duplicate of #6189, but I will keep this issue open. This is pretty easy to fix by raising on a boolean indexer. PRs welcome!

jreback added the Difficulty Novice, Error Reporting (Incorrect or improved errors from pandas), and Indexing (Related to indexing on series/frames, not to indexes themselves) labels Jul 11, 2017
jreback added this to the Next Major Release milestone Jul 11, 2017
jreback changed the title from "boolean indexing error with .drop()" to "ERR: boolean indexing error with .drop()" Jul 11, 2017
@gfyoung
Member

gfyoung commented Jul 11, 2017

@jreback : I'm not sure what the problem is here. The documentation looks pretty clear on this. #6189 demonstrates that clearly. True and False are the labels 1 and 0 respectively, which is why it works the first time but fails the second time, so I don't think there is anything to fix here. Also, you can't raise on a boolean indexer because you can have booleans as indices!
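The point about True and False acting as the labels 1 and 0 can be checked in plain Python, without pandas. The sketch below uses a made-up dict called rows as a stand-in for an integer-labeled index; it is an illustration of the language behavior, not of pandas internals.

```python
# bool is a subclass of int, so True/False compare and hash like 1/0.
assert issubclass(bool, int)
assert True == 1 and False == 0
assert hash(True) == hash(1) and hash(False) == hash(0)

# A stand-in for an integer-labeled index: four rows labeled 0..3.
rows = {0: "row 0", 1: "row 1", 2: "row 2", 3: "row 3"}

# Looking up a boolean mask as if it were a list of labels matches 0 and 1 only.
mask_as_labels = [False, False, True, True]
matched = sorted(k for k in rows if k in mask_as_labels)
print(matched)  # [0, 1]
```

Because equality and hashing agree, any hash-based label lookup will have trouble telling a boolean from the integer it equals unless it checks the type explicitly.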

@jreback
Contributor

jreback commented Jul 13, 2017

Well a boolean indexer doesn't make sense here and should raise an error. Having boolean indices is quite rare and you can also detect that case.

@gfyoung
Member

gfyoung commented Jul 13, 2017

@jreback : Is that not special-casing? True and False are interpreted by Python as the labels 1 and 0 respectively, regardless of the type of index you are operating with.

@jreback
Contributor

jreback commented Jul 13, 2017

It's a fail-fast error check: if a boolean indexer is passed in, it should raise unless the axis is in fact a boolean index (and the shapes match).
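A rough sketch of what such a fail-fast check could look like in user code is below. The function drop_failing_fast is hypothetical and is not the change merged in #17343; it only illustrates the idea of rejecting boolean labels when the index itself does not hold booleans, and it skips the shape comparison mentioned above.

```python
import numpy as np
import pandas as pd


def drop_failing_fast(df, labels):
    """Hypothetical wrapper around DataFrame.drop that rejects boolean masks."""
    labels_are_bool = np.asarray(labels).dtype == bool
    # Assumption: the index counts as "boolean" only if every label is a bool.
    index_holds_bools = all(isinstance(x, (bool, np.bool_)) for x in df.index)
    if labels_are_bool and not index_holds_bools:
        raise TypeError(
            "drop() expects labels, not a boolean mask; "
            "use df[~mask] or df.drop(df.index[mask]) instead"
        )
    return df.drop(labels)


df = pd.DataFrame({"acol": np.arange(4), "bcol": 2 * np.arange(4)})

try:
    drop_failing_fast(df, df.bcol > 2)
except TypeError as exc:
    print("rejected:", exc)

# Real labels still go through to DataFrame.drop unchanged.
print(drop_failing_fast(df, [2, 3]))
```

The choice of TypeError is arbitrary here; the thread does not say which exception should be raised.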

@gfyoung
Member

gfyoung commented Jul 14, 2017

Fair enough. I feel like this should just be allowed, but given the confusion it's generated amongst users (two independent issues), I concede 😄

@andrejonasson
Contributor

Hi, I'm working on this issue.

andrejonasson added a commit to andrejonasson/pandas that referenced this issue Aug 19, 2017
andrejonasson added a commit to andrejonasson/pandas that referenced this issue Aug 26, 2017
andrejonasson added a commit to andrejonasson/pandas that referenced this issue Aug 27, 2017
jreback modified the milestones: 0.21.0, Next Major Release Aug 30, 2017
jreback removed the Error Reporting (Incorrect or improved errors from pandas) label Aug 30, 2017
jreback changed the title from "ERR: boolean indexing error with .drop()" to "BUG: boolean indexing error with .drop()" Aug 30, 2017
andrejonasson added a commit to andrejonasson/pandas that referenced this issue Aug 30, 2017
andrejonasson added a commit to andrejonasson/pandas that referenced this issue Sep 7, 2017
andrejonasson added a commit to andrejonasson/pandas that referenced this issue Sep 18, 2017
jreback modified the milestones: 0.21.0, Next Major Release Sep 23, 2017
andrejonasson added a commit to andrejonasson/pandas that referenced this issue Sep 24, 2017
jreback modified the milestones: Next Major Release, 0.21.0 Sep 24, 2017
andrejonasson added a commit to andrejonasson/pandas that referenced this issue Sep 24, 2017
andrejonasson added a commit to andrejonasson/pandas that referenced this issue Sep 25, 2017
alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017
toobaz added a commit to toobaz/pandas that referenced this issue Jun 29, 2019
jreback pushed a commit that referenced this issue Jun 30, 2019
…dex (#27119)

* TST: actually test #16877 on numeric index (not just RangeIndex)

* PERF: do not instantiate IndexEngine for standard lookup over RangeIndex

closes #16685