More docs
dcherian committed Apr 18, 2024
1 parent 32e70d4 commit 738fc5a
Showing 3 changed files with 92 additions and 25 deletions.
1 change: 1 addition & 0 deletions doc/conf.py
@@ -166,6 +166,7 @@
"CategoricalIndex": "~pandas.CategoricalIndex",
"TimedeltaIndex": "~pandas.TimedeltaIndex",
"DatetimeIndex": "~pandas.DatetimeIndex",
"IntervalIndex": "~pandas.IntervalIndex",
"Series": "~pandas.Series",
"DataFrame": "~pandas.DataFrame",
"Categorical": "~pandas.Categorical",
72 changes: 60 additions & 12 deletions doc/user-guide/groupby.rst
@@ -1,3 +1,5 @@
.. currentmodule:: xarray

.. _groupby:

GroupBy: Group and Bin Data
@@ -15,19 +17,20 @@ __ https://www.jstatsoft.org/v40/i01/paper
- Apply some function to each group.
- Combine your groups back into a single data object.

Group by operations work on both :py:class:`~xarray.Dataset` and
:py:class:`~xarray.DataArray` objects. Most of the examples focus on grouping by
Group by operations work on both :py:class:`Dataset` and
:py:class:`DataArray` objects. Most of the examples focus on grouping by
a single one-dimensional variable, although support for grouping
over a multi-dimensional variable has recently been implemented. Note that for
one-dimensional data, it is usually faster to rely on pandas' implementation of
the same pipeline.

.. tip::

To substantially improve the performance of GroupBy operations, particularly
with dask `install the flox package <https://flox.readthedocs.io>`_. flox
`Install the flox package <https://flox.readthedocs.io>`_ to substantially improve the performance
of GroupBy operations, particularly with dask. flox
`extends Xarray's in-built GroupBy capabilities <https://flox.readthedocs.io/en/latest/xarray.html>`_
by allowing grouping by multiple variables, and lazy grouping by dask arrays. If installed, Xarray will automatically use flox by default.
by allowing grouping by multiple variables, and lazy grouping by dask arrays.
If installed, Xarray will automatically use flox by default.

Split
~~~~~
@@ -87,7 +90,7 @@ Binning
Sometimes you don't want to use all the unique values to determine the groups
but instead want to "bin" the data into coarser groups. You could always create
a customized coordinate, but xarray facilitates this via the
:py:meth:`~xarray.Dataset.groupby_bins` method.
:py:meth:`Dataset.groupby_bins` method.

.. ipython:: python
@@ -110,7 +113,7 @@ Apply
~~~~~

To apply a function to each group, you can use the flexible
:py:meth:`~xarray.core.groupby.DatasetGroupBy.map` method. The resulting objects are automatically
:py:meth:`core.groupby.DatasetGroupBy.map` method. The resulting objects are automatically
concatenated back together along the group axis:

.. ipython:: python
@@ -121,8 +124,8 @@ concatenated back together along the group axis:
arr.groupby("letters").map(standardize)
GroupBy objects also have a :py:meth:`~xarray.core.groupby.DatasetGroupBy.reduce` method and
methods like :py:meth:`~xarray.core.groupby.DatasetGroupBy.mean` as shortcuts for applying an
GroupBy objects also have a :py:meth:`core.groupby.DatasetGroupBy.reduce` method and
methods like :py:meth:`core.groupby.DatasetGroupBy.mean` as shortcuts for applying an
aggregation function:

.. ipython:: python
@@ -183,7 +186,7 @@ Iterating and Squeezing
Previously, Xarray defaulted to squeezing out dimensions of size one when iterating over
a GroupBy object. This behaviour is being removed.
You can always squeeze explicitly later with the Dataset or DataArray
:py:meth:`~xarray.DataArray.squeeze` methods.
:py:meth:`DataArray.squeeze` methods.

.. ipython:: python
@@ -217,7 +220,7 @@ __ https://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_two_dime
da.groupby("lon").map(lambda x: x - x.mean(), shortcut=False)
Because multidimensional groups have the ability to generate a very large
number of bins, coarse-binning via :py:meth:`~xarray.Dataset.groupby_bins`
number of bins, coarse-binning via :py:meth:`Dataset.groupby_bins`
may be desirable:

.. ipython:: python
@@ -238,4 +241,49 @@ applying your function, and then unstacking the result:
Extending GroupBy: Grouper Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

...
The first step in executing a GroupBy analysis is to *identify* the groups and create an intermediate array where each group member is identified
by a unique integer code. Commonly this step is executed using :py:func:`pandas.factorize` for grouping by a categorical variable (e.g. ``['a', 'b', 'a', 'b']``)
and :py:func:`pandas.cut` or :py:func:`numpy.digitize` or :py:func:`numpy.searchsorted` for binning a numeric variable.
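
As a rough illustration of this encoding step, the sketch below uses only NumPy. Here ``numpy.unique`` with ``return_inverse=True`` stands in for ``pandas.factorize`` (it returns sorted uniques, whereas ``pandas.factorize`` keeps the order of first appearance), and ``numpy.digitize`` assigns bin codes. The sample values and bin edges are made up for the example.

```python
import numpy as np

# Categorical grouping: each label is replaced by an integer code.
labels = np.array(["a", "b", "a", "b"])
uniques, codes = np.unique(labels, return_inverse=True)

# Binned grouping: each value is replaced by the index of the bin it falls in.
values = np.array([0.5, 1.5, 2.5, 3.5])
edges = np.array([0.0, 2.0, 4.0])  # two bins: [0, 2) and [2, 4)
# digitize returns 1-based bin numbers for in-range values,
# so subtract 1 to get 0-based codes.
bin_codes = np.digitize(values, edges) - 1

print(codes)      # integer code per label
print(bin_codes)  # integer code per value
```

In both cases the result is an integer array with the same shape as the input, which is the intermediate representation the rest of the GroupBy machinery works from.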

Much of the difficulty in more involved GroupBy problems can be abstracted into a specialized "factorize" operation that identifies the necessary groups.
:py:class:`groupers.Grouper` and :py:class:`groupers.Resampler` objects provide an extension point allowing Xarray's GroupBy machinery
to use specialized "factorization" operations.
Eventually, they will also provide a natural way to extend GroupBy to grouping by multiple variables: ``ds.groupby(x=BinGrouper(...), t=Resampler(freq="M", ...)).mean()``.

Xarray provides three Grouper objects today:

1. :py:class:`groupers.UniqueGrouper` for categorical grouping
2. :py:class:`groupers.BinGrouper` for binned grouping
3. :py:class:`groupers.TimeResampler` for resampling along a datetime coordinate

These objects mean that

- ``ds.groupby("categories")`` is identical to ``ds.groupby(categories=UniqueGrouper())``
- ``ds.groupby_bins("values", bins=5)`` is identical to ``ds.groupby(values=BinGrouper(bins=5))``.
- ``ds.resample(time="H")`` is identical to ``ds.groupby(time=TimeResampler(freq="H"))``.

For example, consider a seasonal grouping ``ds.groupby("time.season")``. This approach treats ``ds.time.dt.season`` as a categorical variable to group by and is naive
to the many complexities of time grouping. Specialized ``SeasonGrouper`` and ``SeasonResampler`` objects would allow:

- Supporting seasons that span a year-end.
- Only including seasons with complete data coverage.
- Grouping over seasons of unequal length.
- Returning results with seasons in the appropriate chronological order.
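
To make the year-end problem concrete, here is a minimal NumPy sketch, not the API of any existing ``SeasonGrouper``. It labels hypothetical monthly timestamps with a season and a "season year", assigning December to the following year so that a single DJF season stays contiguous. All variable names are illustrative.

```python
import numpy as np

# Hypothetical monthly timestamps: Nov 2000 through Mar 2001.
time = np.arange("2000-11", "2001-04", dtype="datetime64[M]")

# Months since 1970-01 -> calendar month (1..12) and calendar year.
month = time.astype("int64") % 12 + 1
year = time.astype("datetime64[Y]").astype("int64") + 1970

# Map each month to a three-month season; December wraps around to DJF.
season_names = np.array(["DJF", "MAM", "JJA", "SON"])
season = season_names[(month % 12) // 3]

# Assign December to the following year so Dec 2000, Jan 2001 and Feb 2001
# all belong to the same DJF season.
season_year = np.where(month == 12, year + 1, year)

print(list(zip(season.tolist(), season_year.tolist())))
```

Grouping on ``time.season`` alone would merge every DJF month regardless of year; pairing the season with ``season_year`` is one way to keep each winter separate.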

To define a custom grouper, subclass either the :py:class:`Grouper` or :py:class:`Resampler` abstract base class
and provide a customized ``factorize`` method. This method must accept a :py:class:`DataArray` to group by and return
an instance of :py:class:`EncodedGroups`.

.. ipython:: python
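
    import numpy as np
    import pandas as pd

    from xarray.groupers import Grouper, EncodedGroups


    class YearGrouper(Grouper):
        def factorize(self, group) -> EncodedGroups:
            assert np.issubdtype(group.dtype, np.datetime64)
            year = group.dt.year.data
            codes_, uniques = pd.factorize(year)
            # ``codes`` must be a DataArray with the same shape as ``group``.
            codes = group.copy(data=codes_).rename("year")
            return EncodedGroups(codes=codes, full_index=pd.Index(uniques))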
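
For readers who want to see what such a ``factorize`` step computes without going through Xarray, the following NumPy-only sketch reproduces the year encoding on a hypothetical array of dates. It is an illustration of the idea, not the code path Xarray uses, and ``numpy.unique`` returns the years sorted rather than in order of first appearance.

```python
import numpy as np

# Hypothetical datetime64 values standing in for the data of ``group``.
times = np.array(["2001-03-01", "2002-07-15", "2001-11-30"], dtype="datetime64[D]")

# Truncate to year resolution, then convert to calendar years.
years = times.astype("datetime64[Y]").astype("int64") + 1970

# One integer code per element, indexing into the sorted unique years.
unique_years, codes = np.unique(years, return_inverse=True)

print(unique_years, codes)
```

Elements that share a year receive the same code, which is what lets the GroupBy machinery collect them into one group.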
44 changes: 31 additions & 13 deletions xarray/core/groupers.py
@@ -44,7 +44,7 @@ class EncodedGroups:
Dataclass for storing intermediate values for GroupBy operation.
Returned by the ``factorize`` method on Grouper objects.
Parameters
Attributes
----------
codes: DataArray
Same shape as the DataArray to group by. Values consist of a unique integer code for each group.
@@ -65,18 +65,23 @@ class EncodedGroups:


class Grouper(ABC):
"""Base class for Grouper objects that allow specializing GroupBy instructions."""
"""Abstract base class for Grouper objects that allow specializing GroupBy instructions."""

@property
def can_squeeze(self) -> bool:
"""TODO: delete this when the `squeeze` kwarg is deprecated. Only `UniqueGrouper`
should override it."""
"""
Do not use.
.. deprecated:: 2023.03.0
This is a deprecated property. It will be deleted when the `squeeze` kwarg is deprecated.
Only ``UniqueGrouper`` should override it.
"""
return False

@abstractmethod
def factorize(self, group: T_Group) -> EncodedGroups:
"""
Takes the group, and creates intermediates necessary for GroupBy.
Creates intermediates necessary for GroupBy.
Parameters
----------
@@ -91,7 +96,8 @@ def factorize(self, group: T_Group) -> EncodedGroups:


class Resampler(Grouper):
"""Base class for Grouper objects that allow specializing resampling-type GroupBy instructions.
"""
Abstract base class for Grouper objects that allow specializing resampling-type GroupBy instructions.
Currently only used for TimeResampler, but could be used for SpaceResampler in the future.
"""

@@ -175,12 +181,19 @@ def _factorize_dummy(self) -> EncodedGroups:

@dataclass
class BinGrouper(Grouper):
"""Grouper object for binning numeric data."""
"""
Grouper object for binning numeric data.
Attributes
----------
bins: int, sequence of scalars, or IntervalIndex
Specification for bins, either as an integer number of bins or as bin edges.
cut_kwargs: dict
Keyword arguments forwarded to :py:func:`pandas.cut`.
"""

bins: Any # TODO: What is the typing?
cut_kwargs: Mapping = field(default_factory=dict)
binned: Any = None
name: Any = None

def __post_init__(self) -> None:
if duck_array_ops.isnull(self.bins).all():
@@ -219,11 +232,15 @@ class TimeResampler(Resampler):
"""
Grouper object specialized to resampling the time coordinate.
Parameters
Attributes
----------
freq : str
Frequency to resample to. See `Pandas frequency
aliases <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`_
for a list of possible values.
closed : {"left", "right"}, optional
Side of each interval to treat as closed.
label : {"left", "right"}, optional
Side of each interval to use for labeling.
base : int, optional
For frequencies that evenly subdivide 1 day, the "origin" of the
@@ -235,7 +252,7 @@ class TimeResampler(Resampler):
of the ``origin`` and ``offset`` parameters, and will be removed
in a future version of xarray.
origin : {'epoch', 'start', 'start_day', 'end', 'end_day'}, pd.Timestamp, datetime.datetime, np.datetime64, or cftime.datetime, default 'start_day'
origin : {'epoch', 'start', 'start_day', 'end', 'end_day'}, pandas.Timestamp, datetime.datetime, numpy.datetime64, or cftime.datetime, default 'start_day'
The datetime on which to adjust the grouping. The timezone of origin
must match the timezone of the index.
@@ -255,6 +272,7 @@ class TimeResampler(Resampler):
Following pandas, the ``loffset`` parameter is deprecated in favor
of using time offset arithmetic, and will be removed in a future
version of xarray.
"""

freq: str
