DEPR: DataFrameGroupBy.apply operating on the group keys #54950

rhshadrach · 2023-09-02T10:24:40Z

Closes #7155 (API: way to exclude the grouped column with apply)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This is a 2nd go at #52477 which was reverted in #52921.

Deprecates groupby.apply sometimes operating on the grouping columns. This only occurs when (a) the grouping columns are part of the DataFrame and (b) the operation on the grouping columns does not raise a TypeError. Otherwise we fall back to excluding the grouping columns. Other operations (excluding filters) do not operate on the grouping columns. Users can still operate on the grouping columns by including them in the selection (this works across groupby), e.g.

df.groupby('a')[['a', 'b', 'c']].sum()
df.groupby('a')[['a', 'b', 'c']].apply(lambda x: x.sum())
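
A runnable version of the two calls above, as a minimal sketch. The sample frame and its values are made up for illustration and are not part of the PR.

```python
import pandas as pd

# Illustrative data (an assumption): "a" is the grouping column.
df = pd.DataFrame({"a": [1, 1, 2], "b": [1, 2, 3], "c": [4, 5, 6]})

# Explicitly selecting "a" keeps it in the data handed to the operation,
# so it is summed like any other column.
summed = df.groupby("a")[["a", "b", "c"]].sum()
applied = df.groupby("a")[["a", "b", "c"]].apply(lambda x: x.sum())

print(summed)
print(applied)
```

In both results "a" should appear as a regular column in addition to labelling the index, because it was part of the selection.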

We do our best to not warn if we can guarantee that the grouping columns were not used. Unfortunately, we can't see if the supplied UDF subsets columns itself. To avoid the noise of the deprecation and adopt future behavior, users can specify include_groups=False. This will become the default and only valid value of this argument in pandas 3.0.

All of the above also applies to .groupby(...).resample since that uses .groupby(...).apply under the hood.

cc @phofl @jorisvandenbossche @topper-123

topper-123 · 2023-09-02T11:01:03Z

I think there was a previous PR on this, can you link to it?

topper-123 · 2023-09-02T11:09:16Z

pandas/core/groupby/groupby.py

@@ -1781,10 +1791,25 @@ def f(g):
        else:
            f = func

+        if not include_groups:
+            return self._python_apply_general(f, self._obj_with_exclusions)


Yeah, getting rid of the try/except here will be great.
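
For readers unfamiliar with the try/except in question, the behavior described in the OP can be pictured roughly as below. This is a simplified sketch and not the pandas implementation: apply_with_fallback is a hypothetical helper, and the data is made up.

```python
import pandas as pd


def apply_with_fallback(df, key, func):
    """Hypothetical helper (not pandas code) mimicking the described fallback."""
    groups = list(df.groupby(key))
    try:
        # First attempt: each group still contains the grouping column.
        return {name: func(group) for name, group in groups}
    except TypeError:
        # Fallback: retry with the grouping column dropped.
        return {name: func(group.drop(columns=key)) for name, group in groups}


# Illustrative data: "a" holds strings (object dtype), so arithmetic on it fails.
df = pd.DataFrame({"a": pd.Series(["x", "x", "y"], dtype=object), "b": [2, 4, 6]})

# Dividing the string column raises TypeError, so "a" is silently excluded.
halved = apply_with_fallback(df, "a", lambda g: g / 2)

# Counting columns never raises, so "a" is silently included.
counted = apply_with_fallback(df, "a", lambda g: g.shape[1])

print(list(halved["x"].columns), counted)
```

Whether the grouping column is seen by the function depends on whether the function happens to raise, which is the kind of inconsistency the deprecation is meant to remove.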

topper-123 · 2023-09-02T11:10:17Z

Is this issue present in resample etc. also, or is it only in groupby?

rhshadrach · 2023-09-02T11:16:34Z

> I think there was a previous PR on this, can you link to it?

Added to the OP.

> Is this issue present in resample etc. also, or is it only in groupby?

Yes - added to the OP. That is handled here as well.

topper-123

This will be very nice to get in, thx for picking it back up. A few comments from me though.

topper-123 · 2023-09-02T15:19:53Z

doc/source/whatsnew/v0.18.1.rst

@@ -77,9 +77,52 @@ Previously you would have to do this to get a rolling window mean per-group:
   df = pd.DataFrame({"A": [1] * 20 + [2] * 12 + [3] * 8, "B": np.arange(40)})
   df


Change this DataFrame to make the below output be of reasonable size: df = pd.DataFrame({"A": [1] * 10 + [2] * 6 + [3] * 4, "B": np.arange(20)})

Can you explain the motivation here? In general, I don't think we should be rewriting old release notes unless there is an issue we're fixing.

pandas/core/frame.py

pandas/core/groupby/groupby.py

pandas/core/resample.py

pandas/tests/extension/base/groupby.py

topper-123

A few comments.

BTW, in #54747 (comment) you mentioned that groupby.apply has magical behaviors. Did you think about this try/except here, or is there other code locations you think about also?

doc/source/whatsnew/v2.2.0.rst

rhshadrach · 2023-09-04T11:41:23Z

> BTW, in #54747 (comment) you mentioned that groupby.apply has magical behaviors. Did you think about this try/except here, or is there other code locations you think about also?

No, that comment is about how apply infers the results of the UDF. We can always control what is passed to the UDF, so in my mind that is not a serious issue.

topper-123

LGTM.

topper-123 · 2023-09-04T17:05:03Z

@phofl.

mroeschke · 2023-09-07T01:39:55Z

doc/source/whatsnew/v2.2.0.rst

@@ -146,12 +146,12 @@ Deprecations
 - Deprecated allowing non-keyword arguments in :meth:`DataFrame.to_pickle` except ``path``. (:issue:`54229`)
 - Deprecated allowing non-keyword arguments in :meth:`DataFrame.to_string` except ``buf``. (:issue:`54229`)
 - Deprecated downcasting behavior in :meth:`Series.where`, :meth:`DataFrame.where`, :meth:`Series.mask`, :meth:`DataFrame.mask`, :meth:`Series.clip`, :meth:`DataFrame.clip`; in a future version these will not infer object-dtype columns to non-object dtype, or all-round floats to integer dtype. Call ``result.infer_objects(copy=False)`` on the result for object inference, or explicitly cast floats to ints. To opt in to the future version, use ``pd.set_option("future.downcasting", True)`` (:issue:`53656`)
+- Deprecated including the groups in computations when using :meth:`DataFrameGroupBy.apply` and :meth:`DataFrameGroupBy.resample`; pass ``include_groups=False`` to exclude the groups (:issue:`7155`)


Are the groupby rolling/expanding/ewm ops affected by this too?

I do not believe so. The only tests that hit groupby.apply in pandas/tests/window are where they directly call groupby.apply to produce the expected result. I also grepped the code and could not find any calls to groupby.apply in pandas.core.window.

mroeschke · 2023-09-07T16:00:09Z

Thanks @rhshadrach

phofl · 2023-09-08T13:55:41Z

@rhshadrach just found this case:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": 1})
df.groupby(df.b).apply(lambda x: x)

I know we are doing some magic here, but I think there is a case here for including b in apply as well?

rhshadrach · 2023-09-09T21:40:51Z

Not sure I understand. Are you saying after the deprecation is enforced we should still be including b in the result?

rhshadrach · 2023-09-09T21:46:53Z

Assuming that is the case, it's contrary to the behavior of other ops.

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": 1})
print(df.groupby(df.b).sum())
#    a
# b
# 1  6
print(df.groupby(df.b).transform(lambda x: x))
#    a
# 0  1
# 1  2
# 2  3

In general, the docs state that df.groupby('b') is syntactic sugar for df.groupby(df.b).
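
A quick check of that equivalence for a reduction, as a minimal sketch using the same shape of data as above (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 1, 1]})

# Group by column name versus by the column itself.
by_name = df.groupby("b").sum()
by_series = df.groupby(df.b).sum()

# Both spellings use "b" only as the index and aggregate just "a".
print(by_name.equals(by_series))
print(list(by_series.columns))
```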

phofl · 2023-09-09T21:49:16Z

> Not sure I understand. Are you saying after the deprecation is enforced we should still be including b in the result?

Yes, that was my idea; this syntax looks like the user wants b in apply as well.

> In general, the docs state that df.groupby('b') is syntactic sugar for df.groupby(df.b).

I did not know that, then you can ignore what I've said. Current behavior is good then!

…54950) * DEPR: DataFrameGroupBy.apply operating on the group keys * fixups * Improvements * Add DataFrameGroupBy.resample to the whatsnew; mypy fixup * Ignore wrong parameter order * Ignore groupby.resample in docstring validation * Fixup docstring

jorisvandenbossche · 2024-01-17T16:23:24Z

@rhshadrach how annoying would it be to keep this deprecation around for a bit longer? So we can start with a DeprecationWarning for this one?
(I don't know how much it interferes with other potential improvements or cleanups that are planned)

(of course, if we do another 2.3 that just bumps some deprecation->future warnings, that could do that for this one as well, and still change the behaviour in 3.0)

rhshadrach · 2024-01-17T21:11:19Z

This is quite bad behavior - it is inconsistent with the rest of groupby, produces self-inconsistent and confusing results, and can lead to confusion when debugging. It would not, however, interfere with future improvements I have planned. Still, I think we should weigh heavily the impact of keeping this behavior for another 1.5 years.

jorisvandenbossche · 2024-01-17T21:58:40Z

But if we would bump it to FutureWarning in a 2.3 (I know we haven't yet decided on that idea though, but assuming we do that), we can still change the behaviour in 3.0, and no need for keeping it another 1.5 years?

I just have the feeling (from the seaborn example, among others) that this is quite a widespread warning (basically whenever you use groupby.apply, even when you are not actually using the group columns, which might be the majority of the cases), and so doing a deprecation warning first would reduce some of the noise from usage in libraries.
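
To see what a given pandas installation emits in that situation, one could capture warnings around a UDF that never touches the grouping column. This is only a sketch: the data and the helper name total_b are made up, and which warning category shows up (if any) depends on the installed pandas version.

```python
import warnings

import pandas as pd

# Illustrative data (an assumption, not from the PR).
df = pd.DataFrame({"a": [1, 1, 2], "b": [1, 2, 3]})


def total_b(group):
    # Touches only "b"; the grouping column "a" is never used.
    return group["b"].sum()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = df.groupby("a").apply(total_b)

# The categories recorded here vary with the pandas version in use.
categories = sorted({w.category.__name__ for w in caught})
print(result.tolist(), categories)
```

The computed result is the same either way; only the presence and category of the warning differs between versions.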

phofl · 2024-01-17T22:00:27Z

We already tagged 3.0 and I don’t think we should do a 2.3 generally speaking

jorisvandenbossche · 2024-01-17T22:10:46Z

(I mean your idea of doing a 2.3 on the 2.2.x branch, definitely not from main)

phofl · 2024-01-17T22:13:33Z

That’s something I wouldn’t object to but I have moved away from this idea over the last few weeks tbh. I’ve objected to this deprecation in the past, but @rhshadrach addressed this with the keyword, the fix is a one line change…

rhshadrach · 2024-01-18T01:11:39Z

I don't have any opposition to 2.3 assuming it doesn't delay 3.0. Likewise, assuming 2.3, I have no opposition to changing this to a DeprecationWarning.

jorisvandenbossche · 2024-01-18T23:02:35Z

I would prefer starting with a DeprecationWarning then

jorisvandenbossche · 2024-01-19T16:43:48Z

Another question here: you added the include_groups keyword to allow users to get the future behaviour already (and silence the warning) by passing include_groups=False.
But this doesn't allow specifying include_groups=True at the moment (you still get the warning, since that is the default value for that option). You also mentioned that in the top post:

> This will become the default and only valid value of this argument in pandas 3.0.

Is there a technical reason to not allow that option in the future? Or just because it is deemed unnecessary? (in the idea that most? users would expect it without the group columns)

rhshadrach · 2024-01-19T21:19:32Z

> Is there a technical reason to not allow that option in the future? Or just because it is deemed unnecessary? (in the idea that most? users would expect it without the group columns)

From a user standpoint, primarily for consistency with other groupby operations. We have _selected_obj and _obj_with_exclusions that differ in subtle ways and have caused bugs in the past. I would like to keep only one of these.

…15006) Matching pandas-dev/pandas#54950 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #15006

rhshadrach · 2024-09-07T13:05:45Z

@jorisvandenbossche @phofl I plan to change this to a FutureWarning in 2.3. Let me know if there is any opposition.

rhshadrach added 2 commits September 2, 2023 06:19

DEPR: DataFrameGroupBy.apply operating on the group keys

31dd661

Merge branch 'main' of https://github.com/pandas-dev/pandas into depr…

7f8d7b4

…_apply_on_groupings Conflicts: doc/source/whatsnew/v2.2.0.rst, pandas/core/groupby/groupby.py

rhshadrach added the Groupby, Deprecate (Functionality to remove in pandas), and Apply (Apply, Aggregate, Transform, Map) labels Sep 2, 2023

rhshadrach added this to the 2.2 milestone Sep 2, 2023

topper-123 reviewed Sep 2, 2023

fixups

123acd6

topper-123 reviewed Sep 2, 2023

Improvements

bc13835

topper-123 reviewed Sep 3, 2023

Add DataFrameGroupBy.resample to the whatsnew; mypy fixup

39f019a

topper-123 approved these changes Sep 4, 2023

rhshadrach added 3 commits September 5, 2023 17:58

Ignore wrong parameter order

717e7bc

Merge branch 'main' of https://github.com/pandas-dev/pandas into depr…

c045cda

…_apply_on_groupings

Ignore groupby.resample in docstring validation

98156e9

rhshadrach requested a review from mroeschke as a code owner September 7, 2023 01:09

mroeschke reviewed Sep 7, 2023

Fixup docstring

6d9d7e9

mroeschke approved these changes Sep 7, 2023

mroeschke merged commit cf6100b into pandas-dev:main Sep 7, 2023

github-actions bot mentioned this pull request Sep 7, 2023

DEPR: List of deprecations to be removed in 3.0 #50578

Open

rhshadrach deleted the depr_apply_on_groupings branch September 7, 2023 20:24

rhshadrach mentioned this pull request Jan 19, 2024

DEPR: Make FutureWarning into DeprecationWarning for groupby.apply #56952

Merged

5 tasks

mroeschke mentioned this pull request Feb 8, 2024

Add groupby.apply(include_groups=) to match pandas 2.2 deprecation rapidsai/cudf#15006

Merged

3 tasks

rhshadrach mentioned this pull request Mar 2, 2024

BUG: GROUPING UNPOPULATED DATAFRAME raises exception - index name clashes with duplicate column name #44350

Closed

3 tasks

rhshadrach mentioned this pull request Sep 8, 2024

DEPR: Update groupby.apply DeprecationWarning to FutureWarning #59751

Merged

5 tasks
