BUG: Inconsistent behavior of pd.DataFrame.drop #33438

Closed
FSpanhel opened this issue Apr 9, 2020 · 2 comments
FSpanhel commented Apr 9, 2020

Problem description
Until an hour ago I thought that I could safely avoid inplace = True (which is not recommended, see e.g. #30484) and instead use the default inplace = False and assign the result directly.
For example, I thought that

df.drop(columns = ['a'], inplace = True)

can be replaced by

df = df.drop(columns = ['a']) 

where df is a pd.DataFrame.

However, this does not seem to be the case when I use these operations within a function.

# 1) direct assignment, .drop
df = pd.DataFrame([1])
def tfun(df):
    df['a'] = 2
    df = df.drop(columns = ['a'])
tfun(df)
print(df.columns)
>>> Index([0, 'a'], dtype='object')

Compare this with

# 2) direct assignment, .drop with inplace (or del)
df = pd.DataFrame([1])
def tfun(df):
    df['a'] = 2
    df.drop(columns = ['a'], inplace = True) # using del df['a'] leads to the same result
tfun(df)
print(df.columns)
>>> Index([0], dtype='object')

The result of 2) is as expected (we add column 'a' and immediately remove it). However, I am very confused about the result of 1). The removal of column 'a' which is done inside tfun is not reflected in df outside after tfun is applied.

It gets even stranger when we use .assign to add column 'a' to df inside tfun:

# 3) .assign, .drop
df= pd.DataFrame([1])
def tfun(df):
    df= df.assign(a = 2)
    df= df.drop(columns = ['a'])
tfun(df)
print(df.columns)
>>> RangeIndex(start=0, stop=1, step=1)

Now column 'a' is removed, although df.columns is now a RangeIndex rather than an Index of dtype object.
I definitely would expect that the result of 3) is equal to the result of 1). What is going on here?

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : None.None

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.15
pytest : 5.3.5
hypothesis : 5.5.4
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.48.0

@FSpanhel FSpanhel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 9, 2020

FSpanhel commented Apr 10, 2020

Solution
The problem in 1) disappears if a copy of df is made before column 'a' is assigned.

# 4) copy before assignment
df= pd.DataFrame([1])
def tfun(df):
    df= df.copy()
    df['a'] = 2
    df= df.drop(columns = ['a'])
tfun(df)
print(df.columns)
>>> RangeIndex(start=0, stop=1, step=1)

This is consistent with 3) because .assign returns a copy.
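
The claim that .assign returns a copy can be checked in isolation. This is a minimal sketch, assuming pandas is imported as pd; the variable names original and with_a are made up for illustration.

```python
import pandas as pd

original = pd.DataFrame([1])
with_a = original.assign(a=2)   # new DataFrame; original is not modified

print(with_a is original)       # False
print(list(original.columns))   # [0]
print(list(with_a.columns))     # [0, 'a']
```

Because the new column only ever exists on the returned object, nothing inside tfun in 3) or 4) touches the DataFrame that the caller holds.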

The reason for the behavior in 1) is not pandas itself but how Python handles object references: assigning to a name inside a function only rebinds the local name. In short:

  • After df['a'] = 2, df inside tfun still has the same id as df outside tfun.
  • After df = df.drop(columns = ['a']), df inside tfun has a new id.

Therefore, the addition of column 'a' is reflected in df outside tfun but the deletion of 'a' is not considered because it is done on a different object.
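
The two points above can be verified with id(). This is a rough sketch, assuming pandas is imported as pd; the function tfun_traced and the ids dictionary are hypothetical names used only to record object identities, they are not part of pandas or of the original example.

```python
import pandas as pd

def tfun_traced(df):
    """Hypothetical variant of tfun from 1) that records object identities."""
    ids = {"on_entry": id(df)}
    df["a"] = 2                    # mutates the caller's object in place
    ids["after_setitem"] = id(df)
    df = df.drop(columns=["a"])    # rebinds the local name to a new object
    ids["after_drop"] = id(df)
    return ids

outer = pd.DataFrame([1])
ids = tfun_traced(outer)

print(ids["on_entry"] == id(outer))        # True: same object as outside
print(ids["after_setitem"] == id(outer))   # True: still the same object
print(ids["after_drop"] == id(outer))      # False: a different object
print(list(outer.columns))                 # [0, 'a']
```

The column added by df["a"] = 2 survives on outer, while the result of drop lives only in the local name and is discarded when the function returns.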

So my confusion has arisen because I implicitly assumed that pd.DataFrame.drop() returns a view of the DataFrame in any case.
The documentation is not really explicit about this. It basically says "Drop specified labels from rows or columns." and only mentions what happens if inplace = True. Judging from #30484, other users also seem not to know what happens in the default case.
I think it would be really helpful if the documentation read "Drop specified labels from rows or columns and return a copy" or "inplace : bool, default False. If False, return a copy. If True, do operation inplace and return None." This also applies to other methods that have the inplace parameter.
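
The behavior that the proposed wording describes can be seen in a short example. It is only a sketch, assuming pandas is imported as pd; the names dropped and result are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

dropped = df.drop(columns=["a"])               # inplace=False is the default
print(dropped is df)                           # False: a new DataFrame
print(list(df.columns))                        # ['a', 'b']: df is unchanged

result = df.drop(columns=["a"], inplace=True)
print(result is None)                          # True
print(list(df.columns))                        # ['b']: df itself was modified
```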

I've opened #33451 to improve the documentation in this regard.

@FSpanhel FSpanhel changed the title BUG: Inconsistent behavior of pd.DataFrame.drop DOC: Explain inplace = True in pd.DataFrame.drop etc. Apr 10, 2020
@FSpanhel FSpanhel changed the title DOC: Explain inplace = True in pd.DataFrame.drop etc. BUG: Inconsistent behavior of pd.DataFrame.drop Apr 10, 2020
@KenilMehta
Contributor

I would like to help to solve this issue.
After investigating a little, I found that this happens not only for "drop" but also for other functions such as "replace".

The problem I found is that when the result is assigned back to the data frame's name inside a function, that name then refers to a new copy of the data frame. The following code snippet demonstrates it:

import pandas as pd

def demoDrop(df):
    print("address before drop : ", hex(id(df)))
    df = df.drop(columns = ['a'])
    print("address after drop : ", hex(id(df)))

def demoDropinplace(df):
    print("address before drop inplace : ", hex(id(df)))
    df.drop(columns = ['a'], inplace=True)
    print("address after drop inplace: ", hex(id(df)))

df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
print("address of original frame : ", hex(id(df)))
demoDrop(df)
demoDropinplace(df)

Output:

address of original frame : 0x7f35cb4c33c8
address before drop : 0x7f35cb4c33c8
address after drop : 0x7f35b3f89ba8
address before drop inplace : 0x7f35cb4c33c8
address after drop inplace: 0x7f35cb4c33c8

As one can see, the address stays the same when inplace is used. In the other case the name df inside the function ends up pointing to a new object at a different address, while the original data frame is left untouched.
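
If the non-inplace form is preferred, one common pattern is to return the new object and rebind the name at the call site. This is a minimal sketch, assuming pandas is imported as pd; drop_a is a hypothetical helper name and not part of pandas.

```python
import pandas as pd

def drop_a(df):
    """Hypothetical helper: returns a new frame without column 'a'."""
    return df.drop(columns=["a"])

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df = drop_a(df)            # rebind the caller's name to the returned object
print(list(df.columns))    # ['b']
```

This way the caller decides explicitly whether to keep the original frame or replace it.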

@bashtage bashtage removed the Needs Triage Issue that has not been reviewed by a pandas team member label Aug 21, 2020