API: pass ddof down to Cythonised std / var in groupby.agg #60591

Open
MarcoGorelli opened this issue Dec 20, 2024 · 1 comment

MarcoGorelli commented Dec 20, 2024

Say I want to do a groupby and perform various aggregations, e.g. I want to find mean and std of b. Easy:

import pandas as pd

df = pd.DataFrame({'a': [1,1,2], 'b': [4,5,6]})
df.groupby('a').agg({'b': ['mean', 'std']})

What if I want to do the same with ddof=0? If I were computing a single aggregation, I could do:

print(df.groupby('a')['b'].std(ddof=0))

and that uses the Cythonized path.

However, I think the current pandas API offers no way of passing ddof to 'std' when it is used in .agg. The workaround often suggested on Stack Overflow is (😭 ):

import numpy as np

# np.std defaults to ddof=0
print(df.groupby('a').agg({'b': ['mean', lambda x: np.std(x)]}))

but that bypasses the Cythonized path, which is a missed opportunity.
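
For comparison, here is a minimal sketch of one way to keep both aggregations on the built-in groupby reductions today: call them separately and join the results. It does not address the request itself (passing ddof through .agg), and the column labels 'mean' and 'std' are just chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [4, 5, 6]})
grouped = df.groupby('a')['b']

# Each call is a built-in groupby reduction, so ddof can be passed directly
# instead of going through a Python lambda.
result = pd.concat(
    {'mean': grouped.mean(), 'std': grouped.std(ddof=0)},
    axis=1,
)
print(result)
```

This gets verbose quickly with several columns or aggregations, which is part of why a way to pass ddof inside .agg would be nicer.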

rhshadrach commented Dec 20, 2024

I'm supportive of creating a way to do this. We currently have NamedAgg, and could add a kwargs argument there. This would not support the OP's use case of acting on all columns. We could also expand NamedAgg to allow acting on multiple and possibly all columns. Or we could introduce a new class, e.g. pd.Op(name, kwargs), specifically for the purpose of acting on all columns.
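
To make the starting point concrete, here is a rough sketch of how NamedAgg is used with just a column and an aggfunc. Getting ddof=0 in this form means wrapping the call in a callable, so it hits the same slow path as the lambda workaround. The kwargs argument and pd.Op discussed above are proposals only and are not used here; the output names b_mean and b_std are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [4, 5, 6]})

result = df.groupby('a').agg(
    # String aggfunc: dispatched to the built-in reduction.
    b_mean=pd.NamedAgg(column='b', aggfunc='mean'),
    # ddof=0 has to be baked into a callable here, so this is a
    # Python-level call per group rather than the built-in 'std'.
    b_std=pd.NamedAgg(column='b', aggfunc=lambda x: x.std(ddof=0)),
)
print(result)
```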

More generally, there has been a desire for some time to add an expression system to pandas. If we also take the idea of Polars' selectors, then pd.selectors.all().std(ddof=0) becomes possible. If we choose to go this route, then I would be opposed to making other short-term enhancements to support this.

In any case, I think we should strive for consistency of all UDF methods (apply, agg, transform, filter, map) in Series, DataFrame, GroupBy, Window, Resample.
