Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: DataFrameGroupBy.apply operating on the group keys #54950

Merged
merged 9 commits into from
Sep 7, 2023

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Sep 2, 2023

This is a 2nd go at #52477 which was reverted in #52921.

Deprecates groupby.apply sometimes operating on the grouping columns. This only occurs when (a) the grouping columns are part of the DataFrame and (b) when the operation on the grouping columns does not raise a TypeError. Otherwise we fallback to excluding the grouping columns. Other operations (excluding filters) do not operate on the grouping columns. Users can still operate on the grouping columns by including them in selection (this works across groupby), e.g.

df.groupby('a')[['a', 'b', 'c']].sum()
df.groupby('a')[['a', 'b', 'c']].apply(lambda x: x.sum())

We do our best to not warn if we can guarantee that the grouping columns were not used. Unfortunately, we can't see if the supplied UDF subsets columns itself. To avoid the noise of the deprecation and adopt future behavior, users can specify include_groups=False. This will become the default and only valid value of this argument in pandas 3.0.

All of the above also applies to .groupby(...).resample since that uses .groupby(...).apply under the hood.

cc @phofl @jorisvandenbossche @topper-123

…_apply_on_groupings

� Conflicts:
�	doc/source/whatsnew/v2.2.0.rst
�	pandas/core/groupby/groupby.py
@rhshadrach rhshadrach added Groupby Deprecate Functionality to remove in pandas Apply Apply, Aggregate, Transform, Map labels Sep 2, 2023
@rhshadrach rhshadrach added this to the 2.2 milestone Sep 2, 2023
@topper-123
Copy link
Contributor

I think there was a previous PR on this, can you link to it?

@@ -1781,10 +1791,25 @@ def f(g):
else:
f = func

if not include_groups:
return self._python_apply_general(f, self._obj_with_exclusions)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, getting rid of the try/except here will be great.

@topper-123
Copy link
Contributor

Is this issue present in resample etc. also, or is it only in groupby?

@rhshadrach
Copy link
Member Author

I think there was a previous PR on this, can you link to it?

Added to the OP.

Is this issue present in resample etc. also, or is it only in groupby?

Yes - added to the OP. That is handled here as well.

Copy link
Contributor

@topper-123 topper-123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be very nice to get in, thx for picking it back up. A few comments from me though.

@@ -77,9 +77,52 @@ Previously you would have to do this to get a rolling window mean per-group:
df = pd.DataFrame({"A": [1] * 20 + [2] * 12 + [3] * 8, "B": np.arange(40)})
df
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change this DataFrame to make the below output be of reasonable size: df = pd.DataFrame({"A": [1] * 10 + [2] * 6 + [3] * 4, "B": np.arange(20)})

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the motivation here? In general, I don't think we should be rewriting old release notes unless there is an issue we're fixing.

pandas/core/frame.py Show resolved Hide resolved
pandas/core/groupby/groupby.py Show resolved Hide resolved
pandas/core/groupby/groupby.py Outdated Show resolved Hide resolved
pandas/core/groupby/groupby.py Show resolved Hide resolved
pandas/core/groupby/groupby.py Outdated Show resolved Hide resolved
pandas/core/groupby/groupby.py Show resolved Hide resolved
pandas/core/resample.py Outdated Show resolved Hide resolved
pandas/tests/extension/base/groupby.py Show resolved Hide resolved
Copy link
Contributor

@topper-123 topper-123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comment.

BTW, in #54747 (comment) you mentioned that groupby.apply has magical behaviors. Did you think about this try/except here, or is there other code locations you think about also?

doc/source/whatsnew/v2.2.0.rst Outdated Show resolved Hide resolved
@rhshadrach
Copy link
Member Author

BTW, in #54747 (comment) you mentioned that groupby.apply has magical behaviors. Did you think about this try/except here, or is there other code locations you think about also?

No, that comment is about how apply infers the results of the UDF. We can always control what is passed to the UDF, so in my mind that is not a serious issue.

Copy link
Contributor

@topper-123 topper-123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@topper-123
Copy link
Contributor

@phofl.

@@ -146,12 +146,12 @@ Deprecations
- Deprecated allowing non-keyword arguments in :meth:`DataFrame.to_pickle` except ``path``. (:issue:`54229`)
- Deprecated allowing non-keyword arguments in :meth:`DataFrame.to_string` except ``buf``. (:issue:`54229`)
- Deprecated downcasting behavior in :meth:`Series.where`, :meth:`DataFrame.where`, :meth:`Series.mask`, :meth:`DataFrame.mask`, :meth:`Series.clip`, :meth:`DataFrame.clip`; in a future version these will not infer object-dtype columns to non-object dtype, or all-round floats to integer dtype. Call ``result.infer_objects(copy=False)`` on the result for object inference, or explicitly cast floats to ints. To opt in to the future version, use ``pd.set_option("future.downcasting", True)`` (:issue:`53656`)
- Deprecated including the groups in computations when using :meth:`DataFrameGroupBy.apply` and :meth:`DataFrameGroupBy.resample`; pass ``include_groups=False`` to exclude the groups (:issue:`7155`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the groupby rolling/expanding/ewm ops affected by this too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not believe so. The only tests that hit groupby.apply in pandas/tests/window are where they directly call groupby.apply to produce the expected result. I also grepped the code and could not find any calls to groupby.apply in pandas.core.window.

@mroeschke mroeschke merged commit cf6100b into pandas-dev:main Sep 7, 2023
@mroeschke
Copy link
Member

Thanks @rhshadrach

@rhshadrach rhshadrach deleted the depr_apply_on_groupings branch September 7, 2023 20:24
@phofl
Copy link
Member

phofl commented Sep 8, 2023

@rhshadrach just found this case:

df = pd.DataFrame({"a": [1,2 ,3 ], "b": 1})
df.groupby(df.b).apply(lambda x: x)

I know we are doing some magic here, but I think here is a case for including b in apply as well?

@rhshadrach
Copy link
Member Author

Not sure I understand. Are you saying after the deprecation is enforced we should still be including b in the result?

@rhshadrach
Copy link
Member Author

rhshadrach commented Sep 9, 2023

Assuming that is the case, it's contrary to the behavior of other ops.

df = pd.DataFrame({"a": [1,2 ,3 ], "b": 1})
print(df.groupby(df.b).sum())
#    a
# b
# 1  6
print(df.groupby(df.b).transform(lambda x: x))
#    a
# 0  1
# 1  2
# 2  3

In general, the docs state that df.groupby('b') is syntactic sugar for df.groupby(df.b).

@phofl
Copy link
Member

phofl commented Sep 9, 2023

Not sure I understand. Are you saying after the deprecation is enforced we should still be including b in the result?

Yes that was my idea, this syntax looks like the users wants b in apply as well.

In general, the docs state that df.groupby('b') is syntactic sugar for df.groupby(df.b).

I did not know that, then you can ignore what I've said. Current behavior is good then!

mroeschke pushed a commit to mroeschke/pandas that referenced this pull request Sep 11, 2023
…54950)

* DEPR: DataFrameGroupBy.apply operating on the group keys

* fixups

* Improvements

* Add DataFrameGroupBy.resample to the whatsnew; mypy fixup

* Ignore wrong parameter order

* Ignore groupby.resample in docstring validation

* Fixup docstring
@jorisvandenbossche
Copy link
Member

@rhshadrach how annoying would it be to keep this deprecation around for a bit longer? So we can start with a DeprecationWarning for this one?
(I don't know how much it interferes with other potential improvements or cleanups that are planned)

(of course, if we do another 2.3 that just bumps some deprecation->future warnings, that could do that for this one as well, and still change the behaviour in 3.0)

@rhshadrach
Copy link
Member Author

This is quite bad behavior - it is inconsistent with the rest of groupby, produces self-inconsistent and confusing results, and can lead to confusion when debugging. It would not, however, interfere with future improvements I have planned. Still, I think we should weigh heavily the impact of keeping this behavior for another 1.5 years.

@jorisvandenbossche
Copy link
Member

But if we would bump it to FutureWarning in a 2.3 (I know we haven't yet decided on that idea though, but assuming we do that), we can still change the behaviour in 3.0, and no need for keeping it another 1.5 years?

I just have the feeling (from the seaborn example, among others) that this is a quite widespread warning (basically whenever you use groupby.apply, even also when you are not actually using the group columns, which might be the majority of the cases), and so doing a deprecation warning first would reduce some of the noise from usage in libraries.

@phofl
Copy link
Member

phofl commented Jan 17, 2024

We already tagged 3.0 and I don’t think we should do a 2.3 generally speaking

@jorisvandenbossche
Copy link
Member

(I mean your idea of doing a 2.3 on the 2.2.x branch, definitely not from main)

@phofl
Copy link
Member

phofl commented Jan 17, 2024

That’s something I wouldn’t object to but I have moved away from this idea over the last few weeks tbh. I’ve objected to this deprecation in the past, but @rhshadrach addressed this with the keyword, the fix is a one line change…

@rhshadrach
Copy link
Member Author

rhshadrach commented Jan 18, 2024

I don't have any opposition to 2.3 assuming it doesn't delay 3.0. Likewise, assuming 2.3, I have no opposition to changing this to a DeprecationWarning.

@jorisvandenbossche
Copy link
Member

I would prefer starting with a DeprecationWarning then

@jorisvandenbossche
Copy link
Member

Another question here: you added the include_groups keyword, to allow users to already get the future behaviour (and silence the warning) by passing include_groups=False.
But this doesn't allow to specify include_groups=True at the moment (you just still get the warning, since that is the default value for that option). You also mentioned that in the top post:

This will become the default and only valid value of this argument in pandas 3.0.

Is there a technical reason to not allow that option in the future? Or just because it is deemed unnecessary? (in the idea that most? users would expect it without the group columns)

@rhshadrach
Copy link
Member Author

Is there a technical reason to not allow that option in the future? Or just because it is deemed unnecessary? (in the idea that most? users would expect it without the group columns)

From a user standpoint, primarily for consistency with other groupby operations. We have _selected_obj and _obj_with_exclusions that differ in subtle ways and have caused bugs in the past. I would like to keep only one of these.

@rhshadrach
Copy link
Member Author

@jorisvandenbossche @phofl I plan to change this to a FutureWarning in 2.3. Let me know if there is any opposition.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Deprecate Functionality to remove in pandas Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: way to exclude the grouped column with apply
5 participants