BUG: Inconsistent behavior of pd.DataFrame.drop #33438

FSpanhel · 2020-04-09T18:46:41Z

Problem description
Until an hour ago I thought that I could safely avoid inplace = True (which is not recommended, see e.g. #30484) by using the default inplace = False and assigning the result directly.
For example, I thought that

df.drop(columns = ['a'], inplace = True)

can be replaced by

df = df.drop(columns = ['a'])

where df is a pd.DataFrame.

However, this does not seem to be the case when I use these operations within a function.

# 1) direct assignment, .drop
import pandas as pd

df = pd.DataFrame([1])
def tfun(df):
    df['a'] = 2
    df = df.drop(columns = ['a'])
tfun(df)
print(df.columns)
>>> Index([0, 'a'], dtype='object')

Compare this with

# 2) direct assignment, .drop with inplace (or del)
df = pd.DataFrame([1])
def tfun(df):
    df['a'] = 2
    df.drop(columns = ['a'], inplace = True) # using del df['a'] leads to the same result
tfun(df)
print(df.columns)
>>> Index([0], dtype='object')

The result of 2) is as expected (we add column 'a' and immediately remove it). However, I am very confused about the result of 1). The removal of column 'a' which is done inside tfun is not reflected in df outside after tfun is applied.

It gets even stranger when we use .assign to add column 'a' to df inside tfun:

# 3) .assign, .drop
df= pd.DataFrame([1])
def tfun(df):
    df= df.assign(a = 2)
    df= df.drop(columns = ['a'])
tfun(df)
print(df.columns)
>>> RangeIndex(start=0, stop=1, step=1)

Now column 'a' is gone, although the columns are now reported as a RangeIndex.
I would definitely expect the result of 3) to equal the result of 1). What is going on here?

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : None.None

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.15
pytest : 5.3.5
hypothesis : 5.5.4
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.48.0

FSpanhel · 2020-04-10T09:53:49Z

Solution
The problem in 1) disappears if a copy of df is made before column 'a' is assigned.

# 4) copy before assignment
df= pd.DataFrame([1])
def tfun(df):
    df= df.copy()
    df['a'] = 2
    df= df.drop(columns = ['a'])
tfun(df)
print(df.columns)
>>> RangeIndex(start=0, stop=1, step=1)

This is consistent with 3) because .assign returns a copy.
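
The RangeIndex printed in 3) and 4) is the original, untouched column index of df. As a small sketch (the names fresh and touched are made up for illustration): a frame built without column labels starts with a RangeIndex, and only adding and dropping a column on the same object, as in 2), turns it into a regular Index.

```python
import pandas as pd

# A frame built without explicit column labels gets a RangeIndex.
fresh = pd.DataFrame([1])
print(type(fresh.columns).__name__)

# Adding 'a' mixes the labels 0 and 'a'; dropping 'a' in place
# leaves a rebuilt index on the same object, no longer a RangeIndex.
touched = pd.DataFrame([1])
touched['a'] = 2
touched.drop(columns=['a'], inplace=True)
print(type(touched.columns).__name__)
```

So seeing a RangeIndex after tfun indicates that the outer df was never modified at all.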

The reason for the behavior in 1) is not related to pandas but to how Python deals with views/copies. To make it short:

After df['a'] = 2, df inside tfun still has the same id as df outside tfun.
After df = df.drop(columns = ['a']), df inside tfun has a new id.

Therefore, the addition of column 'a' is reflected in df outside tfun but the deletion of 'a' is not considered because it is done on a different object.
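
The same effect can be reproduced without pandas. A minimal sketch with a plain list (the function name mutate_then_rebind is made up for illustration): the in-place change is visible to the caller, while the rebinding only affects the local name.

```python
def mutate_then_rebind(items):
    items.append("a")    # mutates the object the caller also holds
    items = items[:-1]   # builds a new list and rebinds only the local name
    return id(items)     # id of the new, local-only list

outer = [0]
local_id = mutate_then_rebind(outer)

print(outer)                  # [0, 'a']
print(local_id == id(outer))  # False
```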

So my confusion has arisen because I implicitly assumed that pd.DataFrame.drop() returns a view of the DataFrame in any case.
The documentation is not really explicit about this. It basically says "Drop specified labels from rows or columns." and only mentions what happens if inplace = True. Looking at #30484, it appears that other users also do not know what happens in the default case.
I think it would be really helpful if the documentation read "Drop specified labels from rows or columns and return a copy" or "inplace : bool, default False. If False, return a copy. If True, do operation inplace and return None." This also applies to other methods that have the inplace parameter.
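
To make the two cases concrete, here is a short check of what drop returns in each mode (the variable names result and returned are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Default (inplace=False): a new DataFrame comes back, df is unchanged.
result = df.drop(columns=['a'])
print(result is df)       # False
print(list(df.columns))   # ['a', 'b']

# inplace=True: df itself is modified and the call returns None.
returned = df.drop(columns=['a'], inplace=True)
print(returned)           # None
print(list(df.columns))   # ['b']
```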

I've opened #33451 to improve the documentation in this regard.

KenilMehta · 2020-04-10T12:02:37Z

I would like to help to solve this issue.
After investigating a little, I found that this happens not only for the "drop" function but also for other functions like "replace".

What I found is that when the result is assigned back to the data frame's name inside a function, a new data frame object is created and the local name points to it. The following code snippet demonstrates it:

import pandas as pd

def demoDrop(df):
    print("address before drop : ", hex(id(df)))
    df = df.drop(columns = ['a'])
    print("address after drop : ", hex(id(df)))

def demoDropinplace(df):
    print("address before drop inplace : ", hex(id(df)))
    df.drop(columns = ['a'], inplace=True)
    print("address after drop inplace: ", hex(id(df)))

df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
print("address of original frame : ", hex(id(df)))
demoDrop(df)
demoDropinplace(df)

Output:

address of original frame : 0x7f35cb4c33c8
address before drop : 0x7f35cb4c33c8
address after drop : 0x7f35b3f89ba8
address before drop inplace : 0x7f35cb4c33c8
address after drop inplace: 0x7f35cb4c33c8

As one can see, the address is maintained if inplace is used, but in the other case the local name df ends up referring to a new object at a different address, while the original dataframe stays where it was.
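
One way to work with this, sketched here under the assumption that the caller is free to rebind its own name (the helper without_a is hypothetical), is to return the new frame from the function and assign it at the call site:

```python
import pandas as pd

def without_a(frame):
    # Returns a new DataFrame; the frame passed in is left untouched.
    return frame.drop(columns=['a'])

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
original = df          # second reference to the original object

df = without_a(df)     # rebind the outer name to the returned frame

print(list(df.columns))        # ['b']
print(list(original.columns))  # ['a', 'b']
```

Any other reference to the original object, such as original here, still sees column 'a'.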

FSpanhel added the Bug and Needs Triage labels Apr 9, 2020

FSpanhel mentioned this issue Apr 10, 2020

Strange behaviour of pd.DataFrame.drop() with inplace argument #30484

Closed

FSpanhel changed the title ~~BUG: Inconsistent behavior of pd.DataFrame.drop~~ DOC: Explain inplace = True in pd.DataFrame.drop etc. Apr 10, 2020

FSpanhel mentioned this issue Apr 10, 2020

DOC: Be explicit whether a view or copy is returned #33451

Closed

FSpanhel changed the title ~~DOC: Explain inplace = True in pd.DataFrame.drop etc.~~ BUG: Inconsistent behavior of pd.DataFrame.drop Apr 10, 2020

FSpanhel closed this as completed Apr 10, 2020

bashtage removed the Needs Triage label Aug 21, 2020
