More docs

pydata · Apr 18, 2024 · 738fc5a
1 parent 32e70d4
Showing 3 changed files with 92 additions and 25 deletions.
diff --git a/doc/conf.py b/doc/conf.py
@@ -166,6 +166,7 @@
     "CategoricalIndex": "~pandas.CategoricalIndex",
     "TimedeltaIndex": "~pandas.TimedeltaIndex",
     "DatetimeIndex": "~pandas.DatetimeIndex",
+    "IntervalIndex": "~pandas.IntervalIndex",
     "Series": "~pandas.Series",
     "DataFrame": "~pandas.DataFrame",
     "Categorical": "~pandas.Categorical",

diff --git a/doc/user-guide/groupby.rst b/doc/user-guide/groupby.rst
@@ -1,3 +1,5 @@
+.. currentmodule:: xarray
+
 .. _groupby:
 
 GroupBy: Group and Bin Data
@@ -15,19 +17,20 @@ __ https://www.jstatsoft.org/v40/i01/paper
 - Apply some function to each group.
 - Combine your groups back into a single data object.
 
-Group by operations work on both :py:class:`~xarray.Dataset` and
-:py:class:`~xarray.DataArray` objects. Most of the examples focus on grouping by
+Group by operations work on both :py:class:`Dataset` and
+:py:class:`DataArray` objects. Most of the examples focus on grouping by
 a single one-dimensional variable, although support for grouping
 over a multi-dimensional variable has recently been implemented. Note that for
 one-dimensional data, it is usually faster to rely on pandas' implementation of
 the same pipeline.
 
 .. tip::
 
-   To substantially improve the performance of GroupBy operations, particularly
-   with dask `install the flox package <https://flox.readthedocs.io>`_. flox
+   `Install the flox package <https://flox.readthedocs.io>`_ to substantially improve the performance
+   of GroupBy operations, particularly with dask. flox
    `extends Xarray's in-built GroupBy capabilities <https://flox.readthedocs.io/en/latest/xarray.html>`_
-   by allowing grouping by multiple variables, and lazy grouping by dask arrays. If installed, Xarray will automatically use flox by default.
+   by allowing grouping by multiple variables, and lazy grouping by dask arrays.
+   If installed, Xarray will automatically use flox by default.
 
 Split
 ~~~~~
@@ -87,7 +90,7 @@ Binning
 Sometimes you don't want to use all the unique values to determine the groups
 but instead want to "bin" the data into coarser groups. You could always create
 a customized coordinate, but xarray facilitates this via the
-:py:meth:`~xarray.Dataset.groupby_bins` method.
+:py:meth:`Dataset.groupby_bins` method.
 
 .. ipython:: python
 
@@ -110,7 +113,7 @@ Apply
 ~~~~~
 
 To apply a function to each group, you can use the flexible
-:py:meth:`~xarray.core.groupby.DatasetGroupBy.map` method. The resulting objects are automatically
+:py:meth:`core.groupby.DatasetGroupBy.map` method. The resulting objects are automatically
 concatenated back together along the group axis:
 
 .. ipython:: python
@@ -121,8 +124,8 @@ concatenated back together along the group axis:
 
     arr.groupby("letters").map(standardize)
 
-GroupBy objects also have a :py:meth:`~xarray.core.groupby.DatasetGroupBy.reduce` method and
-methods like :py:meth:`~xarray.core.groupby.DatasetGroupBy.mean` as shortcuts for applying an
+GroupBy objects also have a :py:meth:`core.groupby.DatasetGroupBy.reduce` method and
+methods like :py:meth:`core.groupby.DatasetGroupBy.mean` as shortcuts for applying an
 aggregation function:
 
 .. ipython:: python
@@ -183,7 +186,7 @@ Iterating and Squeezing
 Previously, Xarray defaulted to squeezing out dimensions of size one when iterating over
 a GroupBy object. This behaviour is being removed.
 You can always squeeze explicitly later with the Dataset or DataArray
-:py:meth:`~xarray.DataArray.squeeze` methods.
+:py:meth:`DataArray.squeeze` methods.
 
 .. ipython:: python
 
@@ -217,7 +220,7 @@ __ https://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_two_dime
     da.groupby("lon").map(lambda x: x - x.mean(), shortcut=False)
 
 Because multidimensional groups have the ability to generate a very large
-number of bins, coarse-binning via :py:meth:`~xarray.Dataset.groupby_bins`
+number of bins, coarse-binning via :py:meth:`Dataset.groupby_bins`
 may be desirable:
 
 .. ipython:: python
@@ -238,4 +241,49 @@ applying your function, and then unstacking the result:
 Extending GroupBy: Grouper Objects
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-...
+The first step in executing a GroupBy analysis is to *identify* the groups and create an intermediate array where each group member is identified
+by a unique integer code. Commonly this step is executed using :py:func:`pandas.factorize` for grouping by a categorical variable (e.g. ``['a', 'b', 'a', 'b']``)
+and :py:func:`pandas.cut` or :py:func:`numpy.digitize` or :py:func:`numpy.searchsorted` for binning a numeric variable.
+
+Much of the difficulty in more complex GroupBy problems can be abstracted into a specialized "factorize" operation that identifies the necessary groups.
+:py:class:`groupers.Grouper` and :py:class:`groupers.Resampler` objects provide an extension point allowing Xarray's GroupBy machinery
+to use specialized "factorization" operations.
+Eventually, they will also provide a natural way to extend GroupBy to grouping by multiple variables: ``ds.groupby(x=BinGrouper(...), t=Resampler(freq="M", ...)).mean()``.
+
+Xarray provides three Grouper objects today:
+
+1. :py:class:`groupers.UniqueGrouper` for categorical grouping
+2. :py:class:`groupers.BinGrouper` for binned grouping
+3. :py:class:`groupers.TimeResampler` for resampling along a datetime coordinate
+
+These objects mean that:
+
+- ``ds.groupby("categories")`` is identical to ``ds.groupby(categories=UniqueGrouper())``.
+- ``ds.groupby_bins("values", bins=5)`` is identical to ``ds.groupby(values=BinGrouper(bins=5))``.
+- ``ds.resample(time="H")`` is identical to ``ds.groupby(time=TimeResampler(freq="H"))``.
+
+For example, consider a seasonal grouping ``ds.groupby("time.season")``. This approach treats ``ds.time.dt.season`` as a categorical variable to group by and is naive
+to the many complexities of time grouping. Specialized ``SeasonGrouper`` and ``SeasonResampler`` objects would allow:
+
+- Supporting seasons that span a year-end.
+- Only including seasons with complete data coverage.
+- Grouping over seasons of unequal length.
+- Returning results with seasons in the appropriate chronological order.
+
+To define a custom grouper, subclass either the :py:class:`groupers.Grouper` or :py:class:`groupers.Resampler` abstract base class
+and provide a customized ``factorize`` method. This method must accept a :py:class:`DataArray` to group by and return
+an instance of :py:class:`groupers.EncodedGroups`.
+
+.. ipython:: python
+
+    import numpy as np
+    import pandas as pd
+    from xarray.groupers import Grouper, EncodedGroups
+
+    class YearGrouper(Grouper):
+        def factorize(self, group) -> EncodedGroups:
+            assert np.issubdtype(group.dtype, np.datetime64)
+            # One integer code per element; ``codes`` must be a DataArray shaped like ``group``.
+            codes, uniques = pd.factorize(group.dt.year.data)
+
+            return EncodedGroups(codes=group.copy(data=codes))
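
Reviewer note: the "identify the groups" step described in the documentation added above can be sketched with pandas and NumPy alone. This is only an illustration of the integer-code idea using `pandas.factorize` and `pandas.cut`; it does not use Xarray's Grouper classes, and the sample arrays are made up for the example.

```python
import numpy as np
import pandas as pd

# Categorical grouping: every distinct label is mapped to an integer code.
labels = np.array(["a", "b", "a", "b"], dtype=object)
codes, uniques = pd.factorize(labels)
# codes is [0, 1, 0, 1]; uniques holds the labels in order of first appearance.

# Binned grouping: pd.cut assigns each value to an interval, and the
# resulting Categorical exposes one integer code per interval.
values = np.array([0.5, 1.5, 2.5, 3.5])
binned = pd.cut(values, bins=[0, 2, 4])
bin_codes = binned.codes

# The same idea applied to the year of a datetime index, which is the
# factorization that a year-based grouper would need.
times = pd.to_datetime(["2000-01-01", "2000-06-01", "2001-01-01"])
year_codes, years = pd.factorize(times.year)

print(codes, list(uniques))
print(bin_codes, list(binned.categories))
print(year_codes, list(years))
```

In each case the output is an array of integer codes with the same length as the input, plus the unique group labels, which is the pair of intermediates the prose above refers to.
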
diff --git a/xarray/core/groupers.py b/xarray/core/groupers.py
@@ -44,7 +44,7 @@ class EncodedGroups:
     Dataclass for storing intermediate values for GroupBy operation.
     Returned by the ``factorize`` method on Grouper objects.
 
-    Parameters
+    Attributes
     ----------
     codes: DataArray
         Same shape as the DataArray to group by. Values consist of a unique integer code for each group.
@@ -65,18 +65,23 @@ class EncodedGroups:
 
 
 class Grouper(ABC):
-    """Base class for Grouper objects that allow specializing GroupBy instructions."""
+    """Abstract base class for Grouper objects that allow specializing GroupBy instructions."""
 
     @property
     def can_squeeze(self) -> bool:
-        """TODO: delete this when the `squeeze` kwarg is deprecated. Only `UniqueGrouper`
-        should override it."""
+        """
+        Do not use.
+
+        .. deprecated:: 2023.03.0
+            This is a deprecated method. It will be deleted when the ``squeeze`` kwarg is deprecated.
+            Only ``UniqueGrouper`` should override it.
+        """
         return False
 
     @abstractmethod
     def factorize(self, group: T_Group) -> EncodedGroups:
         """
-        Takes the group, and creates intermediates necessary for GroupBy.
+        Creates intermediates necessary for GroupBy.
 
         Parameters
         ----------
@@ -91,7 +96,8 @@ def factorize(self, group: T_Group) -> EncodedGroups:
 
 
 class Resampler(Grouper):
-    """Base class for Grouper objects that allow specializing resampling-type GroupBy instructions.
+    """
+    Abstract base class for Grouper objects that allow specializing resampling-type GroupBy instructions.
     Currently only used for TimeResampler, but could be used for SpaceResampler in the future.
     """
 
@@ -175,12 +181,19 @@ def _factorize_dummy(self) -> EncodedGroups:
 
 @dataclass
 class BinGrouper(Grouper):
-    """Grouper object for binning numeric data."""
+    """
+    Grouper object for binning numeric data.
+
+    Attributes
+    ----------
+    bins: int, sequence of scalars, or IntervalIndex
+        Specification for bins, either as an integer or as bin edges.
+    cut_kwargs: dict
+        Keyword arguments forwarded to :py:func:`pandas.cut`.
+    """
 
     bins: Any  # TODO: What is the typing?
     cut_kwargs: Mapping = field(default_factory=dict)
-    binned: Any = None
-    name: Any = None
 
     def __post_init__(self) -> None:
         if duck_array_ops.isnull(self.bins).all():
@@ -219,11 +232,15 @@ class TimeResampler(Resampler):
     """
     Grouper object specialized to resampling the time coordinate.
 
-    Parameters
+    Attributes
     ----------
+    freq : str
+        Frequency to resample to. See `Pandas frequency
+        aliases <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`_
+        for a list of possible values.
     closed : {"left", "right"}, optional
-            Side of each interval to treat as closed.
-        label : {"left", "right"}, optional
+        Side of each interval to treat as closed.
+    label : {"left", "right"}, optional
         Side of each interval to use for labeling.
     base : int, optional
         For frequencies that evenly subdivide 1 day, the "origin" of the
@@ -235,7 +252,7 @@ class TimeResampler(Resampler):
             of the ``origin`` and ``offset`` parameters, and will be removed
             in a future version of xarray.
 
-    origin : {'epoch', 'start', 'start_day', 'end', 'end_day'}, pd.Timestamp, datetime.datetime, np.datetime64, or cftime.datetime, default 'start_day'
+    origin : {'epoch', 'start', 'start_day', 'end', 'end_day'}, pandas.Timestamp, datetime.datetime, numpy.datetime64, or cftime.datetime, default 'start_day'
         The datetime on which to adjust the grouping. The timezone of origin
         must match the timezone of the index.
 
@@ -255,6 +272,7 @@ class TimeResampler(Resampler):
             Following pandas, the ``loffset`` parameter is deprecated in favor
             of using time offset arithmetic, and will be removed in a future
             version of xarray.
+
     """
 
     freq: str
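
Reviewer note: the `closed` and `label` attributes documented above follow the pandas resampling semantics. The sketch below uses `pandas.Series.resample` directly, not `TimeResampler`, on made-up half-daily data, to show how the two options change which samples fall in each daily bin and which bin edge names the result.

```python
import pandas as pd

# Four samples, 12 hours apart, starting exactly on a day boundary.
index = pd.to_datetime(
    [
        "2000-01-01 00:00",
        "2000-01-01 12:00",
        "2000-01-02 00:00",
        "2000-01-02 12:00",
    ]
)
series = pd.Series([1, 2, 3, 4], index=index)

# Left-closed bins [start, end), labelled by their left edge.
left = series.resample("D", closed="left", label="left").sum()

# Right-closed bins (start, end], labelled by their right edge.
# A sample that sits exactly on a boundary now belongs to the earlier bin.
right = series.resample("D", closed="right", label="right").sum()

print(left)
print(right)
```

With left-closed bins the midnight samples open a new day, while with right-closed bins they close the previous one, so the second call produces one more bin than the first.
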