BUG: groupby().ffill() adds group labels as extra column #21521
Comments
@adbull I think you should try building from master. The problem no longer occurs in the current version of pandas:
>>> pd.DataFrame(0, [1,2], [3,4]).groupby([5, 6]).ffill()
   3  4
1  0  0
2  0  0 |
@uds5501 : Thanks for looking into this! Would you like to add a test to close? |
@gfyoung Actually, no. I just fetched the latest master and can now reproduce this error. I am sorry for the confusion.
>>> import pandas as pd
>>> pd.__version__
'0.24.0.dev0+122.g6131a59'
>>> pd.DataFrame(0, [1,2], [3,4]).groupby([5, 6]).ffill()
   NaN  3  4
1    5  0  0
2    6  0  0
I honestly have no idea how I got the output I reported before, but the error is reproducible. |
Hey, can I work on this issue? This is my first time contributing to an open source project. I have some experience using pandas DataFrames. |
@aggarwalvinayak : By all means! Go for it! |
Can I get some help getting started with this? For example, where can I find the implementation of |
@aggarwalvinayak : Oh, that is very useful information! You can proceed as follows:
The goal is to figure out the commit where this example starts to fail. Then we can debug from there. |
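One way to locate that commit is `git bisect`, which needs a check that separates good builds from bad ones. Below is a minimal sketch of such a check, assuming the reproduction posted earlier in this thread. The helper names `has_extra_column` and `check_ffill` are hypothetical and not part of pandas.

```python
import pandas as pd


def has_extra_column(original: pd.DataFrame, result: pd.DataFrame) -> bool:
    """Return True if `result` does not have exactly the columns of `original`."""
    return list(result.columns) != list(original.columns)


def check_ffill() -> bool:
    """Run the reproduction from this thread; True means the build is bad."""
    df = pd.DataFrame(0, index=[1, 2], columns=[3, 4])
    result = df.groupby([5, 6]).ffill()
    return has_extra_column(df, result)


# Hypothetical use with `git bisect run`: exit status 1 marks a commit as bad,
# exit status 0 marks it as good.
#     raise SystemExit(1 if check_ffill() else 0)
print("extra column present:", check_ffill())
```

Comparing the column lists, rather than looking for a specific label, keeps the check independent of what the extra column happens to be called (it shows up as `NaN` in the outputs above).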
As in the original report, this is a regression between 0.22.0 and 0.23.0, presumably due to the reimplementation of |
@gfyoung |
@aggarwalvinayak Sure. As you look into it, feel free to ask any questions. |
Just as a heads up - this "good first issue" label was added when we thought this was just going to be a test case. I've removed it as it appears to be a little more complex than that. Absolutely welcome to diagnose and debug but just want to be clear that it may not be as simple as originally thought |
The error occurs when the number of keys passed to groupby equals the number of rows in the dataframe and at least one of those keys is not a column of the dataframe. If the number of keys differs from the number of rows, a KeyError is raised. Why is it desired for nothing to happen in the first case, as opposed to raising a KeyError? I couldn't find the reason in the groupby docs. |
Marking this for |
IMO this is not a regression. If you do:
pd.DataFrame(0, [1,2], [3,4]).groupby([3, 4]).ffill()
then you do not get the additional column. 5 and 6 are not valid labels, so if anything this should be raising a |
This is a regression.
>>> pd.DataFrame([[1,np.nan,np.nan,np.nan]]).T.groupby([1,1,2,2]).ffill()
   NaN    0
0    1  1.0
1    1  1.0
2    2  NaN
3    2  NaN
In 0.22.0, the correct result was returned, with no extra column; in 0.23.0, the extra column gets added.
@HarryVolek The special logic when the argument is the same length as the df, and contains an entry that is not a column key, is to support this use case. Possibly this could be better documented? |
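To illustrate the special case described above, here is a small sketch that reuses the frame from the earlier reproduction. It uses `sum()` rather than `ffill()` so that it only shows how the keys are interpreted; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(0, index=[1, 2], columns=[3, 4])

# Same length as the frame and not all column keys: the list is used as
# one array of group labels, one label per row.
by_list = df.groupby([5, 6]).sum()

# Passing an array states that intent explicitly and gives the same groups.
by_array = df.groupby(np.array([5, 6])).sum()

# A list of a different length cannot be per-row labels, so each entry is
# looked up as a key, and the lookup fails.
try:
    df.groupby([5, 6, 7]).sum()
    raised = False
except KeyError:
    raised = True

print(by_list)
print("KeyError for mismatched length:", raised)
```

Passing the labels as an array or Series avoids relying on the length-matching rule when the intent is to group rows by external labels.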
Looks like the issue also occurs when grouping by an index level:
>>> pd.DataFrame([[1,np.nan,np.nan,np.nan]],[0],[1,1,2,2]).T.groupby(level=0).ffill()
   NaN    0
1    1  1.0
1    1  1.0
2    2  NaN
2    2  NaN |
Code Sample, a copy-pastable example if possible
Input:
Output:
Problem description
groupby().ffill() adds an additional column to the dataframe, containing a copy of the group labels. This is a regression in pandas v0.23.0 (#19673).
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.14-200.fc26.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: C
LOCALE: None.None
pandas: 0.23.1
pytest: 3.6.0
pip: 10.0.1
setuptools: 39.2.0
Cython: 0.28.3
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 4.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.2
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.8
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None