ENH: add columns attribute to DataFrameGroupBy #53583

Closed
1 of 3 tasks
grisaitis opened this issue Jun 10, 2023 · 15 comments
Labels
Enhancement Groupby Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@grisaitis

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish I could easily inspect the columns present in a DataFrameGroupBy.

Feature Description

I.e., after dfg = df.groupby([..]), I wish I could do dfg.columns such that dfg.columns == df.columns.

Alternative Solutions

The best I'm aware of is:

dfg.get_group(next(iter(dfg.groups.keys()))).columns
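A minimal runnable sketch of the workaround above. The frame `df` and its column names are illustrative assumptions, not taken from the original post:

```python
import pandas as pd

# Illustrative data (an assumption for this sketch).
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]})
dfg = df.groupby("a")

# Take any one group key, materialize that group, and read its columns.
first_key = next(iter(dfg.groups))
cols = dfg.get_group(first_key).columns
print(list(cols))
```

This materializes a whole group just to read its column labels, which is part of why it feels roundabout.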

cc https://stackoverflow.com/q/76444424/781938

Additional Context

I find myself needing this sometimes when passing around DataFrameGroupBy objects. Maybe this is counterintuitive? Is my coding approach suboptimal here? Just an idea. It seems pretty basic.

@grisaitis grisaitis added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 10, 2023
@rhshadrach
Member

rhshadrach commented Jun 12, 2023

Unfortunately I don't think this is well-defined at this time, although we've been working to fix this.

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': [6, 7, 8]})
gb = df.groupby('a')

print(gb.sum())
#    b   c
# a       
# 1  7  13
# 2  5   8

print(gb.apply(lambda x: x.sum()))
#    a  b   c
# a          
# 1  2  7  13
# 2  2  5   8

Would you expect ['a', 'b', 'c'] here or ['b', 'c']? Most groupby ops will return ['b', 'c'], whereas apply and __iter__ (and a few others) include the groupers.
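To make the two answers concrete, here is a small sketch that compares the columns produced by a reducer with the columns seen when iterating over the groups. It reuses the example frame; the variable names `reduced_cols` and `iter_cols` are just for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]})
gb = df.groupby("a")

# Reducers such as sum() move the grouper "a" into the result's index.
reduced_cols = list(gb.sum().columns)

# Iterating yields (key, sub-DataFrame) pairs; collect each sub-frame's columns.
iter_cols = [list(group.columns) for _, group in gb]

print(reduced_cols)
print(iter_cols)
```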

@rhshadrach
Member

@grisaitis: Do you not have access to df itself? Why not just use df?

@topper-123
Contributor

To add to the answer from @rhshadrach, the df is already accessible from the dfg.obj attribute, so in that example you could get the original columns from dfg.obj.columns. The columns without the grouping labels are accessible from dfg._obj_with_exclusions.columns.
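A short sketch of the two attributes mentioned above, using the example frame from earlier in the thread. Note that `_obj_with_exclusions` has a leading underscore, so it is private and may change between pandas versions without notice:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]})
dfg = df.groupby("a")

# The original frame is reachable through .obj, so its columns are too.
all_cols = list(dfg.obj.columns)

# Private attribute: the frame with the grouping column(s) dropped.
# Relying on it is an assumption that it keeps existing in your pandas version.
non_group_cols = list(dfg._obj_with_exclusions.columns)

print(all_cols)
print(non_group_cols)
```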

Neither of the above is part of the public API. IMO it would be beneficial to make the hidden attributes public, for example as attributes of dfg.attrs, so users could do e.g. dfg.attrs.axis to get the aggregation axis.

@grisaitis
Author

@rhshadrach thank you for your feedback and questions.

Would you expect ['a', 'b', 'c'] here or ['b', 'c']? Most groupby ops will return ['b', 'c'], whereas apply and __iter__ (and a few others) include the groupers.

personally, i would expect the columns of the original DataFrame, regardless of groupby keys. the reason for this is, i'm interested in what columns are available when i apply aggregation functions on each group. In other words, when I do dfg.apply(aggfunc), where aggfunc admits a DataFrame, I want to know what the columns will be of that DataFrame.

@grisaitis: Do you not have access to df itself? Why not just use df?

i do have access to df, but i sometimes find it easier to handle dfg objects exclusively. I have a big data processing DAG where i'm computing various aggregations on the same groupby keys. so, i find it cleaner to first define a DataFrameGroupBy and pass that to the various aggregation tasks. i could change my code so that i keep a reference to the df or just df.columns, but my feature request is to make the df.columns attribute accessible via the DataFrameGroupBy object instead.

also thank you @topper-123 for that info!

@rhshadrach
Member

personally, i would expect the columns of the original DataFrame, regardless of groupby keys. the reason for this is, i'm interested in what columns are available when i apply aggregation functions on each group.

This makes sense - it would be the same columns that are available for selection after groupby, e.g. df.groupby('a')[['a', 'b', 'c']].

In other words, when I do dfg.apply(aggfunc), where aggfunc admits a DataFrame, I want to know what the columns will be of that DataFrame.

While this currently includes the grouping columns, that will likely not be the case in pandas 3.0. We are looking to deprecate this behavior. See #7155.
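One way to avoid depending on whether apply passes the grouping columns is to select the columns explicitly before calling it. This is only a sketch of that idea, not something proposed in the thread; `aggfunc` and the `seen` list are hypothetical names used to record what each call receives:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]})

seen = []  # columns observed inside each call, for inspection only


def aggfunc(group: pd.DataFrame) -> pd.Series:
    # Record which columns this group frame actually carries.
    seen.append(list(group.columns))
    return group.sum()


# Explicit selection pins down the columns handed to aggfunc.
result = df.groupby("a")[["b", "c"]].apply(aggfunc)

print(seen)
print(list(result.columns))
```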

@grisaitis
Author

it would be the same columns that are available for selection after groupby

exactly. and i think this would be useful to know sometimes.

While this currently includes the grouping columns

i believe in pandas 2 even this is not necessarily true. e.g., if the groupby keys include an index level name:

>>> df.set_index("b").groupby(["a", "b"])[["a", "b", "c"]]
...
KeyError: "Columns not found: 'b'"

in short, i think knowing what these columns are can be useful, and also tricky to determine without the original dataframe.
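The index-level case can be reproduced with a small self-contained sketch. The frame is the example from earlier in the thread, and the `remaining` and `error` names are just for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]})

# Group by a column ("a") and an index level name ("b").
gb = df.set_index("b").groupby(["a", "b"])

# "b" now lives in the index, so it is not among the grouped frame's columns.
remaining = list(gb.obj.columns)

try:
    gb[["a", "b", "c"]]  # selecting "b" as a column fails
    error = None
except KeyError as exc:
    error = exc

print(remaining)
print(repr(error))
```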

@rhshadrach
Member

i believe in pandas 2 even this is not necessarily true

This is true, but I think it misses the issue I was addressing. You made the statements

personally, i would expect the columns of the original DataFrame

In other words, when I do dfg.apply(aggfunc), where aggfunc admits a DataFrame, I want to know what the columns will be of that DataFrame.

The use of "in other words" suggests to me you think these two are equivalent; they are not. In some cases in 2.0.2 they aren't the same (to your point), and in 3.0 there will be even more cases where they are not the same.

Where I think we agree is that the columns returned should be that of the original DataFrame, or of the selected columns if selection was used (e.g. df.groupby('a')[['a', 'b', 'c']]).

I'm personally +0 on this. Assuming we do go forward, I think some care needs to be taken on how it's exposed to the user, including if the attribute can be modified and/or mutated. @topper-123 - do you have any thoughts there?

@tpaxman
Contributor

tpaxman commented Jun 19, 2023

Interesting issue here. I can see that there is value in accessing the columns associated with a groupby object. @topper-123's suggestion to simply make the attributes .obj (and maybe ._obj_with_exclusions as well) public seems like a good approach.

Regarding whether to include "all" columns or just the non-index columns:
If we were to have it accessible as @grisaitis suggested, i.e., with DataFrameGroupBy.columns, would it make sense to have the return value be dictated by the as_index parameter? I.e., in the original example, if as_index=True then dfg.columns returns ['b', 'c'], otherwise ['a', 'b', 'c']? This would seem to align with what DataFrame.columns returns (that is, only the non-index column names).

If DataFrameGroupBy.columns were implemented in that way, then it might make sense to also add an index attribute, to mirror the behaviour of DataFrame.index. For example, it might look like this:

dfg1 = df.groupby('a', as_index=True)
dfg1.columns         # -> ['b', 'c']
dfg1.index.names     # -> FrozenList(['a'])

dfg2 = df.groupby('a', as_index=False)
dfg2.columns         # -> ['a', 'b', 'c']
dfg2.index.names     # -> FrozenList([None])

Anyway, not sure on the best approach but just thought it was worth mentioning that as_index might be a factor in how the hypothetical .columns attribute on a groupby object might behave.

@rhshadrach
Member

rhshadrach commented Jun 19, 2023

as_index only impacts the result of reducers, not all groupby methods. As such, I don't think we should consider its state for an attribute that isn't specifically about reducers.
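A quick sketch of that distinction, using the example frame from earlier in the thread: sum() stands in for a reducer and cumsum() for a non-reducing method. The variable names are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]})

# Reducer: as_index decides whether "a" ends up in the index or the columns.
reduced_true = df.groupby("a", as_index=True).sum()
reduced_false = df.groupby("a", as_index=False).sum()

# Non-reducer: the result keeps the original row index either way.
transformed_true = df.groupby("a", as_index=True).cumsum()
transformed_false = df.groupby("a", as_index=False).cumsum()

print(list(reduced_true.columns), list(reduced_false.columns))
print(list(transformed_true.columns), list(transformed_false.columns))
```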

@tpaxman
Contributor

tpaxman commented Jun 19, 2023

That's a good point. Makes it seem even more clear that maybe the groupby object itself should not have a columns attribute. Having a member of attrs instead feels like the most intuitive implementation, as suggested by @topper-123, but that's just my opinion.

@grisaitis
Author

Thanks for everyone's consideration and time here. I'm less convinced now that this is a great idea after reading replies and thinking it over. Open to closing this out.

@rhshadrach
Member

@topper-123: Would you include .columns as part of gb.attrs? If so, we can close as a duplicate of #53642.

@topper-123
Contributor

I wouldn't, because it will be accessible through dfg.attrs.obj.columns or dfg.attrs.obj_with_exclusions.columns, so it seems redundant.

@topper-123
Contributor

@rhshadrach, do you have an opinion on #53642?

@topper-123
Contributor

Closed as duplicate of #53642.

@topper-123 closed this as not planned Jun 27, 2023

4 participants