ENH: add columns attribute to DataFrameGroupBy #53583

grisaitis · 2023-06-10T02:08:10Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

I wish I could easily inspect the columns present in a DataFrameGroupBy.

Feature Description

I.e., after dfg = df.groupby([..]), I wish I could do dfg.columns such that dfg.columns == df.columns.

Alternative Solutions

The best I'm aware of is:

dfg.get_group(next(iter(dfg.groups.keys()))).columns

cc https://stackoverflow.com/q/76444424/781938

Additional Context

I find myself needing this sometimes when passing around DataFrameGroupBy objects. Maybe this is counterintuitive? Is my coding approach suboptimal here? Just an idea. It seems pretty basic.


rhshadrach · 2023-06-12T01:57:21Z

Unfortunately I don't think this is well-defined at this time, although we've been working to fix this.

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': [6, 7, 8]})
gb = df.groupby('a')

print(gb.sum())
#    b   c
# a       
# 1  7  13
# 2  5   8

print(gb.apply(lambda x: x.sum()))
#    a  b   c
# a          
# 1  2  7  13
# 2  2  5   8

Would you expect ['a', 'b', 'c'] here or ['b', 'c']? Most groupby ops will return ['b', 'c'], whereas apply and __iter__ (and a few others) include the groupers.
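The two answers can be compared directly by looking at the columns a reducer returns next to the columns of the sub-frames produced by iteration. This is a minimal sketch on the same example frame; the variable names are illustrative, and what iteration yields reflects pandas 2.x as discussed in this thread, so it may differ in other versions.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': [6, 7, 8]})
gb = df.groupby('a')

# Reducers such as sum() move the grouper into the index,
# so it does not appear among the result's columns.
reducer_columns = list(gb.sum().columns)

# Iterating yields (key, sub-frame) pairs; collect each sub-frame's columns.
iterated_columns = [list(group.columns) for _, group in gb]

print(reducer_columns)
print(iterated_columns)
```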

rhshadrach · 2023-06-12T01:59:27Z

@grisaitis: Do you not have access to df itself? Why not just use df?

topper-123 · 2023-06-13T06:33:58Z

To add to the answer from @rhshadrach, the df is already accessible from the dfg.obj attribute, so in that example you could get the original columns from dfg.obj.columns. The columns without the grouping labels are accessible from dfg._obj_with_exclusions.columns.

Neither of the above is part of the public API. IMO it would be beneficial to make the hidden attributes public, for example as attributes of dfg.attrs, so users could do e.g. dfg.attrs.axis to get the aggregation axis.
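As a rough illustration of the two attributes mentioned above, here is a sketch on the example frame from earlier in the thread. It assumes a pandas version where both attributes exist; _obj_with_exclusions is private and could change without notice.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': [6, 7, 8]})
dfg = df.groupby('a')

# dfg.obj is the frame the groupby was built from.
original_columns = list(dfg.obj.columns)

# Private attribute (leading underscore): the frame minus the grouping columns.
non_grouping_columns = list(dfg._obj_with_exclusions.columns)

print(original_columns)      # ['a', 'b', 'c']
print(non_grouping_columns)  # ['b', 'c']
```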

grisaitis · 2023-06-15T18:15:59Z

@rhshadrach thank you for your feedback and questions.

> Would you expect ['a', 'b', 'c'] here or ['b', 'c']? Most groupby ops will return ['b', 'c'], whereas apply and __iter__ (and a few others) include the groupers.

personally, i would expect the columns of the original DataFrame, regardless of groupby keys. the reason for this is, i'm interested in what columns are available when i apply aggregation functions on each group. In other words, when I do dfg.apply(aggfunc), where aggfunc accepts a DataFrame, I want to know what the columns of that DataFrame will be.

> @grisaitis: Do you not have access to df itself? Why not just use df?

i do have access to df, but i sometimes find it easier to handle dfg objects exclusively. I have a big data processing DAG where i'm computing various aggregations on the same groupby keys. so, i find it cleaner to first define a DataFrameGroupBy and pass that to the various aggregation tasks. i could change my code so that i keep a reference to the df or just df.columns, but my feature request is to make the df.columns attribute accessible via the DataFrameGroupBy object instead.

also thank you @topper-123 for that info!
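The alternative mentioned above, keeping a reference to df alongside the grouping keys, could look roughly like the following. GroupedFrame is a hypothetical helper made up for illustration, not part of pandas; it simply carries the frame and keys together so that downstream tasks can read the columns and build the groupby themselves.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True, eq=False)
class GroupedFrame:
    """Hypothetical container pairing a DataFrame with its grouping keys."""

    frame: pd.DataFrame
    keys: list

    @property
    def columns(self) -> pd.Index:
        # Columns of the original frame, independent of the grouping keys.
        return self.frame.columns

    def grouped(self):
        # Build the DataFrameGroupBy on demand.
        return self.frame.groupby(self.keys)


df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': [6, 7, 8]})
gf = GroupedFrame(df, ['a'])

available_columns = list(gf.columns)
totals = gf.grouped().sum()
print(available_columns)
print(totals)
```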

rhshadrach · 2023-06-15T20:50:19Z

> personally, i would expect the columns of the original DataFrame, regardless of groupby keys. the reason for this is, i'm interested in what columns are available when i apply aggregation functions on each group.

This makes sense - it would be the same columns that are available for selection after groupby, e.g. df.groupby('a')[['a', 'b', 'c']].

> In other words, when I do dfg.apply(aggfunc), where aggfunc admits a DataFrame, I want know what the columns will be of that DataFrame.

While this currently includes the grouping columns, that will likely not be the case in pandas 3.0. We are looking to deprecate this behavior. See #7155.

grisaitis · 2023-06-16T00:56:37Z

> it would be the same columns that are available for selection after groupby

exactly. and i think this would be useful to know sometimes.

> While this currently includes the grouping columns

i believe in pandas 2 even this is not necessarily true. e.g., if the groupby keys include an index level name:

>>> df.set_index("b").groupby(["a", "b"])[["a", "b", "c"]]
...
KeyError: "Columns not found: 'b'"

in short, i think knowing what these columns are can be useful, and also tricky to determine without the original dataframe.
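A small self-contained version of the case above, using the example frame from earlier in the thread: once "b" is an index level, it is a valid grouping key but no longer a selectable column. This is a sketch against pandas 2.x behavior; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': [6, 7, 8]})

# 'b' becomes an index level, yet it can still be used as a grouping key by name.
gb = df.set_index('b').groupby(['a', 'b'])

try:
    gb[['a', 'b', 'c']]
    selection_error = None
except KeyError as exc:
    # 'b' is not among the columns of the indexed frame, so selection fails.
    selection_error = exc

print(repr(selection_error))

# Selecting a column that really exists in the indexed frame still works.
c_sum = gb[['c']].sum()
print(c_sum)
```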

rhshadrach · 2023-06-16T02:15:41Z

> i believe in pandas 2 even this is not necessarily true

This is true, but I think it misses the issue I was addressing. You made the statements

> personally, i would expect the columns of the original DataFrame

> In other words, when I do dfg.apply(aggfunc), where aggfunc admits a DataFrame, I want know what the columns will be of that DataFrame.

The use of "in other words" suggests to me you think these two are equivalent; they are not. In some cases in 2.0.2 they aren't the same (to your point), and in 3.0 there will be even more cases where they are not the same.

Where I think we agree is that the columns returned should be that of the original DataFrame, or of the selected columns if selection was used (e.g. df.groupby('a')[['a', 'b', 'c']]).

I'm personally +0 on this. Assuming we do go forward, I think some care needs to be taken on how it's exposed to the user, including whether the attribute can be modified and/or mutated. @topper-123 - do you have any thoughts there?

tpaxman · 2023-06-19T04:38:53Z

Interesting issue here. I can see that there is value in accessing the columns associated with a groupby object. @topper-123's suggestion to simply make .obj (and maybe ._obj_with_exclusions as well) public seems like a good approach.

Regarding whether to include "all" columns or just the non-index columns:
If we were to have it accessible as @grisaitis suggested, i.e., with DataFrameGroupBy.columns, would it make sense to have the return value be dictated by the as_index parameter? I.e., in the original example, if as_index=True then dfg.columns returns ['b', 'c'], otherwise ['a', 'b', 'c']? This would seem to align with what DataFrame.columns returns (that is, only the non-index column names).

If DataFrameGroupBy.columns were implemented in that way, then it might make sense to also add an index attribute, to mirror the behaviour of DataFrame.index. For example, it might look like this:

dfg1 = df.groupby('a', as_index=True)
dfg1.columns         # -> ['b', 'c']
dfg1.index.names     # -> FrozenList(['a'])

dfg2 = df.groupby('a', as_index=False)
dfg2.columns         # -> ['a', 'b', 'c']
dfg2.index.names     # -> FrozenList([None])

Anyway, not sure on the best approach but just thought it was worth mentioning that as_index might be a factor in how the hypothetical .columns attribute on a groupby object might behave.

rhshadrach · 2023-06-19T11:49:52Z

as_index only impacts the result of reducers, not all groupby methods. As such, I don't think we should consider its state for an attribute that isn't specifically about reducers.
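A quick way to see this is to compare a reducer (sum) with a transformation (cumsum) under both settings. The sketch below uses the example frame from earlier in the thread; the results dictionary is only there for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': [6, 7, 8]})

results = {}
for as_index in (True, False):
    gb = df.groupby('a', as_index=as_index)
    # sum() is a reducer: as_index decides whether 'a' is the index or a column.
    reduced = list(gb.sum().columns)
    # cumsum() is a transformation: its columns do not depend on as_index.
    transformed = list(gb.cumsum().columns)
    results[as_index] = (reduced, transformed)

print(results[True])   # (['b', 'c'], ['b', 'c'])
print(results[False])  # (['a', 'b', 'c'], ['b', 'c'])
```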

tpaxman · 2023-06-19T18:23:49Z

That's a good point. It makes it seem even clearer that the groupby object itself should not have a columns attribute. Having a member of attrs instead, as suggested by @topper-123, feels like the most intuitive implementation, but that's just my opinion.

grisaitis · 2023-06-23T04:38:18Z

Thanks for everyone's consideration and time here. I'm less convinced now that this is a great idea after reading replies and thinking it over. Open to closing this out.

rhshadrach · 2023-06-26T21:16:27Z

@topper-123: Would you include .columns as part of gb.attrs? If so, we can close as a duplicate of #53642.

topper-123 · 2023-06-27T08:28:34Z

I wouldn't, because it will be accessible through dfg.attrs.obj.columns or dfg.attrs.obj_with_exclusions.columns, so it seems redundant.

topper-123 · 2023-06-27T08:29:56Z

@rhshadrach, do you have an opinion on #53642?

topper-123 · 2023-06-27T08:30:17Z

Closed as duplicate of #53642.

grisaitis added the Enhancement and Needs Triage labels Jun 10, 2023

rhshadrach added the Groupby label Jun 12, 2023

topper-123 closed this as not planned Jun 27, 2023
