Skip to content

Commit

Permalink
DEPR: DataFrameGroupBy.apply operating on the group keys (pandas-dev#…
Browse files Browse the repository at this point in the history
…52477)

* DEPR: DataFrameGroupBy.apply operating on the group keys

* Reorder whatsnew

* Remove warnings from pivot, minor refinements

* Handle warning in docs

* Improve warning message

* Add note to user guide

* Improve whatsnew

* Adjust docstrings
  • Loading branch information
rhshadrach authored Apr 12, 2023
1 parent 7eeec0d commit 9b20759
Show file tree
Hide file tree
Showing 31 changed files with 705 additions and 256 deletions.
4 changes: 2 additions & 2 deletions doc/source/user_guide/cookbook.rst
Original file line number Diff line number Diff line change
Expand Up @@ -459,7 +459,7 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to
df
# List the size of the animals with the highest weight.
df.groupby("animal").apply(lambda subf: subf["size"][subf["weight"].idxmax()])
df.groupby("animal")[["size", "weight"]].apply(lambda subf: subf["size"][subf["weight"].idxmax()])
`Using get_group
<https://stackoverflow.com/questions/14734533/how-to-access-pandas-groupby-dataframe-by-key>`__
Expand All @@ -482,7 +482,7 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to
return pd.Series(["L", avg_weight, True], index=["size", "weight", "adult"])
expected_df = gb.apply(GrowUp)
expected_df = gb[["size", "weight"]].apply(GrowUp)
expected_df
`Expanding apply
Expand Down
14 changes: 10 additions & 4 deletions doc/source/user_guide/groupby.rst
Original file line number Diff line number Diff line change
Expand Up @@ -430,6 +430,12 @@ This is mainly syntactic sugar for the alternative, which is much more verbose:
Additionally, this method avoids recomputing the internal grouping information
derived from the passed key.

You can also include the grouping columns if you want to operate on them.

.. ipython:: python
grouped[["A", "B"]].sum()
.. _groupby.iterating-label:

Iterating through groups
Expand Down Expand Up @@ -1067,7 +1073,7 @@ missing values with the ``ffill()`` method.
).set_index("date")
df_re
df_re.groupby("group").resample("1D").ffill()
df_re.groupby("group")[["val"]].resample("1D").ffill()
.. _groupby.filter:

Expand Down Expand Up @@ -1233,13 +1239,13 @@ the argument ``group_keys`` which defaults to ``True``. Compare

.. ipython:: python
df.groupby("A", group_keys=True).apply(lambda x: x)
df.groupby("A", group_keys=True)[["B", "C", "D"]].apply(lambda x: x)
with

.. ipython:: python
df.groupby("A", group_keys=False).apply(lambda x: x)
df.groupby("A", group_keys=False)[["B", "C", "D"]].apply(lambda x: x)
Numba Accelerated Routines
Expand Down Expand Up @@ -1722,7 +1728,7 @@ column index name will be used as the name of the inserted column:
result = {"b_sum": x["b"].sum(), "c_mean": x["c"].mean()}
return pd.Series(result, name="metrics")
result = df.groupby("a").apply(compute_metrics)
result = df.groupby("a")[["b", "c"]].apply(compute_metrics)
result
Expand Down
22 changes: 17 additions & 5 deletions doc/source/whatsnew/v0.14.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -328,13 +328,25 @@ More consistent behavior for some groupby methods:

- groupby ``head`` and ``tail`` now act more like ``filter`` rather than an aggregation:

.. ipython:: python
.. code-block:: ipython
df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
g = df.groupby('A')
g.head(1) # filters DataFrame
In [1]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [2]: g = df.groupby('A')
In [3]: g.head(1) # filters DataFrame
Out[3]:
A B
0 1 2
2 5 6
In [4]: g.apply(lambda x: x.head(1)) # used to simply fall-through
Out[4]:
A B
A
1 0 1 2
5 2 5 6
g.apply(lambda x: x.head(1)) # used to simply fall-through
- groupby head and tail respect column selection:

Expand Down
93 changes: 87 additions & 6 deletions doc/source/whatsnew/v0.18.1.rst
Original file line number Diff line number Diff line change
Expand Up @@ -77,9 +77,52 @@ Previously you would have to do this to get a rolling window mean per-group:
df = pd.DataFrame({"A": [1] * 20 + [2] * 12 + [3] * 8, "B": np.arange(40)})
df
.. ipython:: python
.. code-block:: ipython
df.groupby("A").apply(lambda x: x.rolling(4).B.mean())
In [1]: df.groupby("A").apply(lambda x: x.rolling(4).B.mean())
Out[1]:
A
1 0 NaN
1 NaN
2 NaN
3 1.5
4 2.5
5 3.5
6 4.5
7 5.5
8 6.5
9 7.5
10 8.5
11 9.5
12 10.5
13 11.5
14 12.5
15 13.5
16 14.5
17 15.5
18 16.5
19 17.5
2 20 NaN
21 NaN
22 NaN
23 21.5
24 22.5
25 23.5
26 24.5
27 25.5
28 26.5
29 27.5
30 28.5
31 29.5
3 32 NaN
33 NaN
34 NaN
35 33.5
36 34.5
37 35.5
38 36.5
39 37.5
Name: B, dtype: float64
Now you can do:

Expand All @@ -101,15 +144,53 @@ For ``.resample(..)`` type of operations, previously you would have to:
df
.. ipython:: python
.. code-block:: ipython
df.groupby("group").apply(lambda x: x.resample("1D").ffill())
In[1]: df.groupby("group").apply(lambda x: x.resample("1D").ffill())
Out[1]:
group val
group date
1 2016-01-03 1 5
2016-01-04 1 5
2016-01-05 1 5
2016-01-06 1 5
2016-01-07 1 5
2016-01-08 1 5
2016-01-09 1 5
2016-01-10 1 6
2 2016-01-17 2 7
2016-01-18 2 7
2016-01-19 2 7
2016-01-20 2 7
2016-01-21 2 7
2016-01-22 2 7
2016-01-23 2 7
2016-01-24 2 8
Now you can do:

.. ipython:: python
.. code-block:: ipython
df.groupby("group").resample("1D").ffill()
In[1]: df.groupby("group").resample("1D").ffill()
Out[1]:
group val
group date
1 2016-01-03 1 5
2016-01-04 1 5
2016-01-05 1 5
2016-01-06 1 5
2016-01-07 1 5
2016-01-08 1 5
2016-01-09 1 5
2016-01-10 1 6
2 2016-01-17 2 7
2016-01-18 2 7
2016-01-19 2 7
2016-01-20 2 7
2016-01-21 2 7
2016-01-22 2 7
2016-01-23 2 7
2016-01-24 2 8
.. _whatsnew_0181.enhancements.method_chain:

Expand Down
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -200,6 +200,7 @@ Other API changes

Deprecations
~~~~~~~~~~~~
- Deprecated :meth:`.DataFrameGroupBy.apply` and methods on the objects returned by :meth:`.DataFrameGroupBy.resample` operating on the grouping column(s); select the columns to operate on after groupby to either explicitly include or exclude the groupings and avoid the ``FutureWarning`` (:issue:`7155`)
- Deprecated silently dropping unrecognized timezones when parsing strings to datetimes (:issue:`18702`)
- Deprecated :meth:`DataFrame._data` and :meth:`Series._data`, use public APIs instead (:issue:`33333`)
- Deprecated :meth:`.Groupby.all` and :meth:`.GroupBy.any` with datetime64 or :class:`PeriodDtype` values, matching the :class:`Series` and :class:`DataFrame` deprecations (:issue:`34479`)
Expand Down
26 changes: 13 additions & 13 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -8595,20 +8595,20 @@ def update(
>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
... 'Parrot', 'Parrot'],
... 'Max Speed': [380., 370., 24., 26.]})
>>> df.groupby("Animal", group_keys=True).apply(lambda x: x)
Animal Max Speed
>>> df.groupby("Animal", group_keys=True)[['Max Speed']].apply(lambda x: x)
Max Speed
Animal
Falcon 0 Falcon 380.0
1 Falcon 370.0
Parrot 2 Parrot 24.0
3 Parrot 26.0
>>> df.groupby("Animal", group_keys=False).apply(lambda x: x)
Animal Max Speed
0 Falcon 380.0
1 Falcon 370.0
2 Parrot 24.0
3 Parrot 26.0
Falcon 0 380.0
1 370.0
Parrot 2 24.0
3 26.0
>>> df.groupby("Animal", group_keys=False)[['Max Speed']].apply(lambda x: x)
Max Speed
0 380.0
1 370.0
2 24.0
3 26.0
"""
)
)
Expand Down
80 changes: 50 additions & 30 deletions pandas/core/groupby/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -260,7 +260,7 @@ class providing the base-class of operations.
each group together into a Series, including setting the index as
appropriate:
>>> g1.apply(lambda x: x.C.max() - x.B.min())
>>> g1[['B', 'C']].apply(lambda x: x.C.max() - x.B.min())
A
a 5
b 2
Expand Down Expand Up @@ -1487,6 +1487,16 @@ def f(g):
with option_context("mode.chained_assignment", None):
try:
result = self._python_apply_general(f, self._selected_obj)
if (
not isinstance(self.obj, Series)
and self._selection is None
and self._selected_obj.shape != self._obj_with_exclusions.shape
):
warnings.warn(
message=_apply_groupings_depr.format(type(self).__name__),
category=FutureWarning,
stacklevel=find_stack_level(),
)
except TypeError:
# gh-20949
# try again, with .apply acting as a filtering
Expand Down Expand Up @@ -2645,55 +2655,55 @@ def resample(self, rule, *args, **kwargs):
Downsample the DataFrame into 3 minute bins and sum the values of
the timestamps falling into a bin.
>>> df.groupby('a').resample('3T').sum()
a b
>>> df.groupby('a')[['b']].resample('3T').sum()
b
a
0 2000-01-01 00:00:00 0 2
2000-01-01 00:03:00 0 1
5 2000-01-01 00:00:00 5 1
0 2000-01-01 00:00:00 2
2000-01-01 00:03:00 1
5 2000-01-01 00:00:00 1
Upsample the series into 30 second bins.
>>> df.groupby('a').resample('30S').sum()
a b
>>> df.groupby('a')[['b']].resample('30S').sum()
b
a
0 2000-01-01 00:00:00 0 1
2000-01-01 00:00:30 0 0
2000-01-01 00:01:00 0 1
2000-01-01 00:01:30 0 0
2000-01-01 00:02:00 0 0
2000-01-01 00:02:30 0 0
2000-01-01 00:03:00 0 1
5 2000-01-01 00:02:00 5 1
0 2000-01-01 00:00:00 1
2000-01-01 00:00:30 0
2000-01-01 00:01:00 1
2000-01-01 00:01:30 0
2000-01-01 00:02:00 0
2000-01-01 00:02:30 0
2000-01-01 00:03:00 1
5 2000-01-01 00:02:00 1
Resample by month. Values are assigned to the month of the period.
>>> df.groupby('a').resample('M').sum()
a b
>>> df.groupby('a')[['b']].resample('M').sum()
b
a
0 2000-01-31 0 3
5 2000-01-31 5 1
0 2000-01-31 3
5 2000-01-31 1
Downsample the series into 3 minute bins as above, but close the right
side of the bin interval.
>>> df.groupby('a').resample('3T', closed='right').sum()
a b
>>> df.groupby('a')[['b']].resample('3T', closed='right').sum()
b
a
0 1999-12-31 23:57:00 0 1
2000-01-01 00:00:00 0 2
5 2000-01-01 00:00:00 5 1
0 1999-12-31 23:57:00 1
2000-01-01 00:00:00 2
5 2000-01-01 00:00:00 1
Downsample the series into 3 minute bins and close the right side of
the bin interval, but label each bin using the right edge instead of
the left.
>>> df.groupby('a').resample('3T', closed='right', label='right').sum()
a b
>>> df.groupby('a')[['b']].resample('3T', closed='right', label='right').sum()
b
a
0 2000-01-01 00:00:00 0 1
2000-01-01 00:03:00 0 2
5 2000-01-01 00:03:00 5 1
0 2000-01-01 00:00:00 1
2000-01-01 00:03:00 2
5 2000-01-01 00:03:00 1
"""
from pandas.core.resample import get_resampler_for_grouping

Expand Down Expand Up @@ -4309,3 +4319,13 @@ def _insert_quantile_level(idx: Index, qs: npt.NDArray[np.float64]) -> MultiInde
else:
mi = MultiIndex.from_product([idx, qs])
return mi


# GH#7155
_apply_groupings_depr = (
"{}.apply operated on the grouping columns. This behavior is deprecated, "
"and in a future version of pandas the grouping columns will be excluded "
"from the operation. Select the columns to operate on after groupby to"
"either explicitly include or exclude the groupings and silence "
"this warning."
)
Loading

0 comments on commit 9b20759

Please sign in to comment.