BUG: Inconsistent behavior of pd.DataFrame.drop #33438
Comments
Solution #4) copy before assignment
import pandas as pd

df = pd.DataFrame([1])

def tfun(df):
    df = df.copy()  # work on a copy so the caller's DataFrame is untouched
    df['a'] = 2
    df = df.drop(columns=['a'])

tfun(df)
print(df.columns)
>>> RangeIndex(start=0, stop=1, step=1)
This is consistent with 3) because .assign returns a copy. The reason for the behavior in 1) is not related to pandas but to how Python deals with views and copies. To make it short:
Therefore, the addition of column 'a' is reflected in df outside tfun, but the deletion of 'a' is not, because it is done on a different object. My confusion arose because I implicitly assumed that pd.DataFrame.drop() returns a view of the DataFrame in any case. I've opened #33451 to improve the documentation in this regard.
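The distinction between mutating an object and rebinding a local name can be shown without pandas at all. The following is a minimal sketch using a plain dict; the function name `mutate_then_rebind` is hypothetical and only serves the illustration.

```python
def mutate_then_rebind(d):
    # Mutation: the caller's dict is changed, because d refers to the same object.
    d["a"] = 2
    # Rebinding: a new dict is built and bound to the local name d only.
    d = {k: v for k, v in d.items() if k != "a"}
    return d

data = {"x": 1}
result = mutate_then_rebind(data)

print(data)    # {'x': 1, 'a': 2}
print(result)  # {'x': 1}
```

The caller keeps the key added by the mutation, while the removal lives only in the new object returned from the function. This mirrors df['a'] = 2 followed by df = df.drop(...) inside tfun.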
I would like to help solve this issue. The problem I found is that when a value is assigned to a DataFrame inside a function, another copy of the DataFrame is created. The following code snippet demonstrates it:
Output:
As one can see, the address is preserved if inplace is used, but otherwise the address of the original DataFrame changes.
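The commenter's snippet and its output are not preserved in this text. As a rough sketch of the same observation (not the original code), object identity can be compared before and after each style of drop. The column names 'x' and 'a' and the variable names are assumptions chosen for the illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [1], 'a': [2]})
original = df  # second reference to the same object

# inplace=True modifies the existing object and returns None.
df.drop(columns=['a'], inplace=True)
same_after_inplace = df is original

# Without inplace, drop returns a new DataFrame; the assignment rebinds df.
df = df.drop(columns=['x'])
same_after_reassign = df is original

print(same_after_inplace, same_after_reassign)  # True False
```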
Problem description
Until an hour ago I thought that I could safely avoid inplace = True (its use is not recommended; see, e.g., #30484) and instead use inplace = False and directly assign the result.
For example, I thought that
can be replaced by
where df is a pd.DataFrame.
However, this does not seem to be the case when I use these operations within a function.
Compare this with
The result of 2) is as expected (we add column 'a' and immediately remove it). However, I am very confused about the result of 1). The removal of column 'a', which is done inside tfun, is not reflected in df outside after tfun is applied.
It gets even stranger when we use .assign to add column 'a' to df inside tfun:
Now, column 'a' is removed, although the remaining columns index is now a RangeIndex.
I definitely would expect that the result of 3) is equal to the result of 1). What is going on here?
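The snippets for cases 1) to 3) are not preserved in this text. The sketch below is a reconstruction based only on the description above and on the tfun pattern used elsewhere in the thread, so the exact code is an assumption. The function names tfun_1, tfun_2 and tfun_3 and the cols dict are hypothetical.

```python
import pandas as pd

def tfun_1(df):
    # Assumed shape of case 1): in-place column assignment, then rebinding.
    df['a'] = 2                   # mutates the caller's DataFrame
    df = df.drop(columns=['a'])   # new DataFrame, bound to the local name only

def tfun_2(df):
    # Assumed shape of case 2): both steps modify the caller's DataFrame.
    df['a'] = 2
    df.drop(columns=['a'], inplace=True)

def tfun_3(df):
    # Assumed shape of case 3): .assign already returns a new DataFrame,
    # so the caller's DataFrame is never touched.
    df = df.assign(a=2)
    df = df.drop(columns=['a'])

cols = {}
for name, fun in [('1', tfun_1), ('2', tfun_2), ('3', tfun_3)]:
    df = pd.DataFrame([1])
    fun(df)
    cols[name] = df.columns
    print(name, list(df.columns), type(df.columns).__name__)
```

Under these assumptions, only case 1) leaves column 'a' behind in the caller's DataFrame, and only case 3) keeps the original RangeIndex, since the caller's object is never modified there.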
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : None.None
pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.15
pytest : 5.3.5
hypothesis : 5.5.4
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.48.0