You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am currently working on a PR to add numeric_only support to DataFrameGroupBy.agg() as it relates to some work on Geopandas I am currently trying to resolve. #56946, #58132
I noted there is an edge case that is currently not working as intended. This edge case occurs under the following conditions:
A single callable function is submitted as func
A kwarg is included
There is more than one column in the dataframe.
The first 2 conditions is what would lead it to use: DataFrameGroupBy._aggregate_frame
The third condition causes the bug in DataFrameGroupBy._aggregate_frame because the aggfunc would be applied to the entire dataframe, as opposed to column by column.
This has actually been touched upon in an existing test:
deftest_pass_args_kwargs_duplicate_columns(tsframe, as_index):
# go through _aggregate_frame with self.axis == 0 and duplicate columnstsframe.columns= ["A", "B", "A", "C"]
gb=tsframe.groupby(lambdax: x.month, as_index=as_index)
res=gb.agg(np.percentile, 80, axis=0)
ex_data= {
1: tsframe[tsframe.index.month==1].quantile(0.8),
2: tsframe[tsframe.index.month==2].quantile(0.8),
}
expected=DataFrame(ex_data).Tifnotas_index:
expected.insert(0, "index", [1, 2])
expected.index=Index(range(2))
tm.assert_frame_equal(res, expected)
Where the kwarg axis=0 was fed into agg(). Three comments I have on this is that:
it isn't necessarily obvious that this argument is required based on the documentation.
not putting in that argument doesn't raise an error nor warning, but takes the 80th percentile in that group for all columns
if using a UDF, such an argument would not be practical to be required in all cases
The root issue is how the aggfunc is applied in GroupByDataFrame._aggregate_frame() when there is more than one column. Without the axis=0 kwarg, it applies the function on all columns to get a single value.
Below is a very braindead example of something that might work for _aggregate_frame:
def_aggregate_frame(self, func, *args, **kwargs) ->DataFrame:
ifself._grouper.nkeys!=1:
raiseAssertionError("Number of keys must be 1")
obj=self._obj_with_exclusionsresult_index=self._grouper.result_indexout=self.obj._constructor([], index=obj.columns, columns=result_index)
forname, grp_dfinself._grouper.get_iterator(obj):
aggregated_row_data=list()
forcounter, _inenumerate(grp_df.columns):
fres=func(grp_df.iloc[:,counter], *args, **kwargs)
aggregated_row_data.append(fres)
out[name] =aggregated_row_dataout=out.Treturnout
There are a number of things that need to be accounted for:
Need to consider that we can have duplicate column names, therefore we can not right out to the resultdict() as dictionaries don't allow duplicates. Hence why the object is created to at the start and written to in the for loop
If calling a dataframe[column_name], where there are duplicates names, all the data in the columns will show up and the function will be applied to all of them. Hence why iloc is used.
To address wanting the function to be operated on only in the column-by-column, it's going through a for-loop.
This was just a naive approach to the issue but would like to confirm that this would be considered an issue before giving serious thought about a more appropriate fix.
Expected Behavior
A B
groupby1
diamond 6.5 20.0
spade 9.0 45.0
Installed Versions
INSTALLED VERSIONS
commit : 6f39c4f
python : 3.10.14.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-26-generic
Version : #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 12 10:22:43 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8
Not a problem at all - with 3500+ issues, it can be quite difficult.
Also - the PR I linked to currently makes the changes to agg as breaking changes. I'm going to be looking at reworking it to instead introduce deprecations - so this bug will likely still exist in 3.0. All this is to say that you will need to work around this issue if we want to get #58132 into 3.0, and you shouldn't be waiting on #57706.
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
I am currently working on a PR to add
numeric_only
support toDataFrameGroupBy.agg()
as it relates to some work on Geopandas I am currently trying to resolve. #56946, #58132I noted there is an edge case that is currently not working as intended. This edge case occurs under the following conditions:
func
kwarg
is includedThe first 2 conditions is what would lead it to use:
DataFrameGroupBy._aggregate_frame
The third condition causes the bug in
DataFrameGroupBy._aggregate_frame
because the aggfunc would be applied to the entire dataframe, as opposed to column by column.This has actually been touched upon in an existing test:
Where the kwarg
axis=0
was fed intoagg()
. Three comments I have on this is that:The root issue is how the aggfunc is applied in
GroupByDataFrame._aggregate_frame()
when there is more than one column. Without theaxis=0
kwarg, it applies the function on all columns to get a single value.Below is a very braindead example of something that might work for
_aggregate_frame
:There are a number of things that need to be accounted for:
result
dict()
as dictionaries don't allow duplicates. Hence why the object is created to at the start and written to in the for loopThis was just a naive approach to the issue but would like to confirm that this would be considered an issue before giving serious thought about a more appropriate fix.
Expected Behavior
Installed Versions
INSTALLED VERSIONS
commit : 6f39c4f
python : 3.10.14.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-26-generic
Version : #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 12 10:22:43 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8
pandas : 3.0.0.dev0+666.g6f39c4fa1f.dirty
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
setuptools : 69.2.0
pip : 24.0
Cython : 3.0.10
pytest : 8.1.1
hypothesis : 6.100.0
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : 3.1.9
lxml.etree : 5.1.0
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 8.22.2
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.8
fastparquet : 2024.2.0
fsspec : 2024.3.1
gcsfs : 2024.3.1
matplotlib : 3.8.3
numba : 0.59.1
numexpr : 2.9.0
odfpy : None
openpyxl : 3.1.2
pyarrow : 15.0.2
pyreadstat : 1.2.7
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.3.1
scipy : 1.12.0
sqlalchemy : 2.0.29
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.3.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2024.1
qtpy : None
pyqt5 : None
The text was updated successfully, but these errors were encountered: