DEPR: DataFrameGroupBy.apply operating on the group keys #54950
…_apply_on_groupings Conflicts: doc/source/whatsnew/v2.2.0.rst, pandas/core/groupby/groupby.py
I think there was a previous PR on this, can you link to it?
@@ -1781,10 +1791,25 @@ def f(g):
else:
    f = func

if not include_groups:
    return self._python_apply_general(f, self._obj_with_exclusions)
Yeah, getting rid of the try/except here will be great.
Is this issue present in resample etc. also, or is it only in groupby?
Added to the OP.
Yes - added to the OP. That is handled here as well.
This will be very nice to get in, thx for picking it back up. A few comments from me though.
@@ -77,9 +77,52 @@ Previously you would have to do this to get a rolling window mean per-group:
df = pd.DataFrame({"A": [1] * 20 + [2] * 12 + [3] * 8, "B": np.arange(40)})
df
Change this DataFrame so that the output below is a reasonable size: df = pd.DataFrame({"A": [1] * 10 + [2] * 6 + [3] * 4, "B": np.arange(20)})
Can you explain the motivation here? In general, I don't think we should be rewriting old release notes unless there is an issue we're fixing.
A few comments.
BTW, in #54747 (comment) you mentioned that groupby.apply has magical behaviors. Were you thinking of this try/except here, or are there other code locations you had in mind as well?
No, that comment is about how apply infers the results of the UDF. We can always control what is passed to the UDF, so in my mind that is not a serious issue.
LGTM.
@@ -146,12 +146,12 @@ Deprecations
- Deprecated allowing non-keyword arguments in :meth:`DataFrame.to_pickle` except ``path``. (:issue:`54229`)
- Deprecated allowing non-keyword arguments in :meth:`DataFrame.to_string` except ``buf``. (:issue:`54229`)
- Deprecated downcasting behavior in :meth:`Series.where`, :meth:`DataFrame.where`, :meth:`Series.mask`, :meth:`DataFrame.mask`, :meth:`Series.clip`, :meth:`DataFrame.clip`; in a future version these will not infer object-dtype columns to non-object dtype, or all-round floats to integer dtype. Call ``result.infer_objects(copy=False)`` on the result for object inference, or explicitly cast floats to ints. To opt in to the future version, use ``pd.set_option("future.downcasting", True)`` (:issue:`53656`)
- Deprecated including the groups in computations when using :meth:`DataFrameGroupBy.apply` and :meth:`DataFrameGroupBy.resample`; pass ``include_groups=False`` to exclude the groups (:issue:`7155`)
Are the groupby rolling/expanding/ewm ops affected by this too?
I do not believe so. The only tests that hit groupby.apply in pandas/tests/window are where they directly call groupby.apply to produce the expected result. I also grepped the code and could not find any calls to groupby.apply in pandas.core.window.
Thanks @rhshadrach
@rhshadrach just found this case:
I know we are doing some magic here, but I think here is a case for including b in apply as well?
Not sure I understand. Are you saying after the deprecation is enforced we should still be including
Assuming that is the case, it's contrary to the behavior of other ops.
In general, the docs state that
Yes that was my idea, this syntax looks like the user wants b in apply as well.
I did not know that, then you can ignore what I've said. Current behavior is good then!
…54950)
* DEPR: DataFrameGroupBy.apply operating on the group keys
* fixups
* Improvements
* Add DataFrameGroupBy.resample to the whatsnew; mypy fixup
* Ignore wrong parameter order
* Ignore groupby.resample in docstring validation
* Fixup docstring
@rhshadrach how annoying would it be to keep this deprecation around for a bit longer? So we can start with a DeprecationWarning for this one? (of course, if we do another 2.3 that just bumps some deprecation->future warnings, that could do that for this one as well, and still change the behaviour in 3.0)
This is quite bad behavior - it is inconsistent with the rest of groupby, produces self-inconsistent and confusing results, and can lead to confusion when debugging. It would not, however, interfere with future improvements I have planned. Still, I think we should weigh heavily the impact of keeping this behavior for another 1.5 years.
But if we would bump it to FutureWarning in a 2.3 (I know we haven't yet decided on that idea though, but assuming we do that), we can still change the behaviour in 3.0, and no need for keeping it another 1.5 years? I just have the feeling (from the seaborn example, among others) that this is quite a widespread warning (basically whenever you use groupby.apply, even when you are not actually using the group columns, which might be the majority of the cases), and so doing a deprecation warning first would reduce some of the noise from usage in libraries.
We already tagged 3.0 and I don’t think we should do a 2.3 generally speaking
(I mean your idea of doing a 2.3 on the 2.2.x branch, definitely not from main)
That’s something I wouldn’t object to but I have moved away from this idea over the last few weeks tbh. I’ve objected to this deprecation in the past, but @rhshadrach addressed this with the keyword, the fix is a one line change…
I don't have any opposition to 2.3 assuming it doesn't delay 3.0. Likewise, assuming 2.3, I have no opposition to changing this to a DeprecationWarning.
I would prefer starting with a DeprecationWarning then
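To make the noise argument above concrete, here is a rough sketch of how Python's documented default warning filters treat the two categories. It uses only the standard library; the `is_shown` helper and the module name `some_library` are hypothetical, and it assumes the warning is attributed to the module that called groupby.apply rather than to pandas itself.

```python
import warnings


def is_shown(category, module_name):
    """Return True if a warning of `category`, attributed to `module_name`,
    gets through filters that mirror Python's documented defaults."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.resetwarnings()
        # filterwarnings() inserts at the front of the filter list, so the
        # more specific "__main__" rule is added last to take precedence.
        warnings.filterwarnings("ignore", category=DeprecationWarning)
        warnings.filterwarnings(
            "default", category=DeprecationWarning, module="__main__"
        )
        warnings.warn_explicit(
            "illustrative deprecation message",  # not the real pandas text
            category,
            filename="example.py",
            lineno=1,
            module=module_name,
            registry={},
        )
    return len(caught) == 1


# A DeprecationWarning is visible in a user's own script ...
in_script = is_shown(DeprecationWarning, "__main__")
# ... but hidden when triggered from inside a third-party library.
in_library = is_shown(DeprecationWarning, "some_library")
# A FutureWarning is visible in both situations.
future_in_library = is_shown(FutureWarning, "some_library")
```

Under these assumptions, end users of a library that calls groupby.apply would only start seeing the message once the category is bumped to FutureWarning.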
Another question here: you added the
Is there a technical reason to not allow that option in the future? Or just because it is deemed unnecessary? (in the idea that most? users would expect it without the group columns)
From a user standpoint, primarily for consistency with other groupby operations. We have
…15006) Matching pandas-dev/pandas#54950 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #15006
@jorisvandenbossche @phofl I plan to change this to a FutureWarning in 2.3. Let me know if there is any opposition.
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
This is a second attempt at #52477, which was reverted in #52921.
Deprecates groupby.apply sometimes operating on the grouping columns. This only occurs when (a) the grouping columns are part of the DataFrame and (b) the operation on the grouping columns does not raise a TypeError. Otherwise we fall back to excluding the grouping columns. Other operations (excluding filters) do not operate on the grouping columns. Users can still operate on the grouping columns by including them in the selection (this works across groupby), e.g.
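A minimal sketch of what such a selection might look like, assuming pandas 2.2 or later. The frame, the column names `a` and `b`, and the `udf` helper are made up for illustration and are not taken from the PR.

```python
import pandas as pd

# Hypothetical data: "a" is the grouping column, "b" holds the values.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})

seen_columns = []  # records the columns of each group handed to the UDF


def udf(group):
    seen_columns.append(list(group.columns))
    return group.sum()


# Selecting "a" explicitly keeps the grouping column in what the UDF receives.
result = df.groupby("a")[["a", "b"]].apply(udf)
```

Because the grouping column is requested explicitly, the intent is unambiguous and no inference about whether to include it is needed.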
We do our best to not warn if we can guarantee that the grouping columns were not used. Unfortunately, we can't see if the supplied UDF subsets columns itself. To avoid the noise of the deprecation and adopt future behavior, users can specify `include_groups=False`. This will become the default and only valid value of this argument in pandas 3.0.

All of the above also applies to `.groupby(...).resample` since that uses `.groupby(...).apply` under the hood.

cc @phofl @jorisvandenbossche @topper-123