Merge branch 'main' into zip_subtree
shoyer committed Oct 15, 2024
2 parents 23da8ca + 33ead65 commit 4480e11
Showing 28 changed files with 1,783 additions and 225 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/ci-additional.yaml
Original file line number Diff line number Diff line change
@@ -122,7 +122,7 @@ jobs:
python xarray/util/print_versions.py
- name: Install mypy
run: |
python -m pip install "mypy" --force-reinstall
python -m pip install "mypy==1.11.2" --force-reinstall
- name: Run mypy
run: |
@@ -176,7 +176,7 @@ jobs:
python xarray/util/print_versions.py
- name: Install mypy
run: |
python -m pip install "mypy" --force-reinstall
python -m pip install "mypy==1.11.2" --force-reinstall
- name: Run mypy
run: |
42 changes: 19 additions & 23 deletions doc/api.rst
@@ -687,7 +687,7 @@ For manipulating, traversing, navigating, or mapping over the tree structure.
DataTree.relative_to
DataTree.iter_lineage
DataTree.find_common_ancestor
DataTree.map_over_subtree
DataTree.map_over_datasets
DataTree.pipe
DataTree.match
DataTree.filter
@@ -828,30 +828,26 @@ Index into all nodes in the subtree simultaneously.
.. DataTree.polyfit
.. DataTree.curvefit
.. Aggregation
.. -----------
Aggregation
-----------

.. Aggregate data in all nodes in the subtree simultaneously.
Aggregate data in all nodes in the subtree simultaneously.

.. .. autosummary::
.. :toctree: generated/
.. autosummary::
:toctree: generated/

.. DataTree.all
.. DataTree.any
.. DataTree.argmax
.. DataTree.argmin
.. DataTree.idxmax
.. DataTree.idxmin
.. DataTree.max
.. DataTree.min
.. DataTree.mean
.. DataTree.median
.. DataTree.prod
.. DataTree.sum
.. DataTree.std
.. DataTree.var
.. DataTree.cumsum
.. DataTree.cumprod
DataTree.all
DataTree.any
DataTree.max
DataTree.min
DataTree.mean
DataTree.median
DataTree.prod
DataTree.sum
DataTree.std
DataTree.var
DataTree.cumsum
DataTree.cumprod

.. ndarray methods
.. ---------------
@@ -958,7 +954,7 @@ DataTree methods

open_datatree
open_groups
map_over_subtree
map_over_datasets
DataTree.to_dict
DataTree.to_netcdf
DataTree.to_zarr
4 changes: 2 additions & 2 deletions doc/getting-started-guide/quick-overview.rst
@@ -307,11 +307,11 @@ We can get a copy of the :py:class:`~xarray.Dataset` including the inherited coo
ds_inherited = dt["simulation/coarse"].to_dataset()
ds_inherited
And you can get a copy of just the node local values of :py:class:`~xarray.Dataset` by setting the ``inherited`` keyword to ``False``:
And you can get a copy of just the node local values of :py:class:`~xarray.Dataset` by setting the ``inherit`` keyword to ``False``:

.. ipython:: python
ds_node_local = dt["simulation/coarse"].to_dataset(inherited=False)
ds_node_local = dt["simulation/coarse"].to_dataset(inherit=False)
ds_node_local
.. note::
7 changes: 4 additions & 3 deletions doc/user-guide/data-structures.rst
@@ -771,7 +771,7 @@ Here there are four different coordinate variables, which apply to variables in
``station`` is used only for ``weather`` variables
``lat`` and ``lon`` are only used for ``satellite`` images

Coordinate variables are inherited to descendent nodes, which means that
Coordinate variables are inherited to descendent nodes, which is only possible because
variables at different levels of a hierarchical DataTree are always
aligned. Placing the ``time`` variable at the root node automatically indicates
that it applies to all descendent nodes. Similarly, ``station`` is in the base
@@ -792,14 +792,15 @@ automatically includes coordinates from higher levels (e.g., ``time`` and
dt2["/weather/temperature"].dataset
Similarly, when you retrieve a Dataset through :py:func:`~xarray.DataTree.to_dataset` , the inherited coordinates are
included by default unless you exclude them with the ``inherited`` flag:
included by default unless you exclude them with the ``inherit`` flag:

.. ipython:: python
dt2["/weather/temperature"].to_dataset()
dt2["/weather/temperature"].to_dataset(inherited=False)
dt2["/weather/temperature"].to_dataset(inherit=False)
For more examples and further discussion see :ref:`alignment and coordinate inheritance <hierarchical-data.alignment-and-coordinate-inheritance>`.

.. _coordinates:

159 changes: 153 additions & 6 deletions doc/user-guide/hierarchical-data.rst
@@ -1,7 +1,7 @@
.. _hierarchical-data:
.. _userguide.hierarchical-data:

Hierarchical data
==============================
=================

.. ipython:: python
:suppress:
@@ -15,6 +15,8 @@ Hierarchical data
%xmode minimal
.. _why:

Why Hierarchical Data?
----------------------

@@ -547,13 +549,13 @@ See that the same change (fast-forwarding by adding 10 years to the age of each
Mapping Custom Functions Over Trees
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can map custom computation over each node in a tree using :py:meth:`xarray.DataTree.map_over_subtree`.
You can map custom computation over each node in a tree using :py:meth:`xarray.DataTree.map_over_datasets`.
You can map any function, so long as it takes :py:class:`xarray.Dataset` objects as one (or more) of the input arguments,
and returns one (or more) xarray datasets.

.. note::

Functions passed to :py:func:`~xarray.DataTree.map_over_subtree` cannot alter nodes in-place.
Functions passed to :py:func:`~xarray.DataTree.map_over_datasets` cannot alter nodes in-place.
Instead they must return new :py:class:`xarray.Dataset` objects.

For example, we can define a function to calculate the Root Mean Square of a timeseries
@@ -567,11 +569,11 @@ Then calculate the RMS value of these signals:

.. ipython:: python
voltages.map_over_subtree(rms)
voltages.map_over_datasets(rms)
.. _multiple trees:

We can also use the :py:meth:`~xarray.map_over_subtree` decorator to promote a function which accepts datasets into one which
We can also use the :py:meth:`~xarray.map_over_datasets` decorator to promote a function which accepts datasets into one which
accepts datatrees.
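
The idea of mapping a dataset-level function over every node can be sketched without xarray at all. The following is a toy model, not xarray's implementation: the tree is assumed to be a plain dict from node path to a "dataset" (itself a dict of lists), and ``map_over_nodes`` is a hypothetical helper named only for this illustration. It shows the two properties described above: the function sees one dataset at a time, and it returns new objects rather than modifying nodes in place.

```python
import math

# Toy stand-in for a tree: node path -> "dataset" (a dict of plain lists).
# This is NOT xarray's data model, only an illustration of the mapping idea.
voltages = {
    "/": {},
    "/sensor_a": {"signal": [3.0, 4.0]},
    "/sensor_b": {"signal": [1.0, 1.0, 1.0, 1.0]},
}


def rms(dataset):
    """Return a new dataset holding the root mean square of each variable."""
    return {
        name: math.sqrt(sum(v * v for v in values) / len(values))
        for name, values in dataset.items()
    }


def map_over_nodes(func, tree):
    """Hypothetical helper: apply func to every node, keeping the same paths."""
    return {path: func(dataset) for path, dataset in tree.items()}


result = map_over_nodes(rms, voltages)
```

Because ``rms`` builds a fresh dict for each node, the input tree is left untouched, which mirrors the requirement that mapped functions return new datasets.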

Operating on Multiple Trees
@@ -644,3 +646,148 @@ We could use this feature to quickly calculate the electrical power in our signa
power = currents * voltages
power
.. _hierarchical-data.alignment-and-coordinate-inheritance:

Alignment and Coordinate Inheritance
------------------------------------

.. _data-alignment:

Data Alignment
~~~~~~~~~~~~~~

The data in different datatree nodes are not totally independent. In particular dimensions (and indexes) in child nodes must be exactly aligned with those in their parent nodes.
Exact alignment means that shared dimensions must be the same length, and indexes along those dimensions must be equal.

.. note::
If you were a previous user of the prototype `xarray-contrib/datatree <https://github.com/xarray-contrib/datatree>`_ package, this is different from what you're used to!
In that package the data model was that the data stored in each node actually was completely unrelated. The data model is now slightly stricter.
This allows us to provide features like :ref:`coordinate-inheritance`.

To demonstrate, let's first generate some example datasets which are not aligned with one another:

.. ipython:: python
# (drop the attributes just to make the printed representation shorter)
ds = xr.tutorial.open_dataset("air_temperature").drop_attrs()
ds_daily = ds.resample(time="D").mean("time")
ds_weekly = ds.resample(time="W").mean("time")
ds_monthly = ds.resample(time="ME").mean("time")
These datasets have different lengths along the ``time`` dimension, and are therefore not aligned along that dimension.

.. ipython:: python
ds_daily.sizes
ds_weekly.sizes
ds_monthly.sizes
We cannot store these non-alignable variables on a single :py:class:`~xarray.Dataset` object, because they do not exactly align:

.. ipython:: python
:okexcept:
xr.align(ds_daily, ds_weekly, ds_monthly, join="exact")
But we :ref:`previously said <why>` that multi-resolution data is a good use case for :py:class:`~xarray.DataTree`, so surely we should be able to store these in a single :py:class:`~xarray.DataTree`?
If we first try to create a :py:class:`~xarray.DataTree` with these different-length time dimensions present in both parents and children, we will still get an alignment error:

.. ipython:: python
:okexcept:
xr.DataTree.from_dict({"daily": ds_daily, "daily/weekly": ds_weekly})
This is because DataTree checks that data in child nodes align exactly with their parents.

.. note::
This requirement of aligned dimensions is similar to netCDF's concept of `inherited dimensions <https://www.unidata.ucar.edu/software/netcdf/workshops/2007/groups-types/Introduction.html>`_, as in netCDF-4 files dimensions are `visible to all child groups <https://docs.unidata.ucar.edu/netcdf-c/current/groups.html>`_.

This alignment check is performed up through the tree, all the way to the root, and is therefore equivalent to requiring that this :py:func:`~xarray.align` command succeeds:

.. code:: python
xr.align(child.dataset, *(parent.dataset for parent in child.parents), join="exact")
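
The same walk up through the ancestors can be pictured with plain dictionaries of dimension sizes. This is a simplified sketch under stated assumptions: ``check_exact_alignment`` is a hypothetical helper, the sizes are made up for illustration rather than taken from the tutorial dataset, and only dimension lengths are compared, whereas real exact alignment also requires the index values to be equal.

```python
def check_exact_alignment(child_sizes, ancestors_sizes):
    """Raise ValueError if any dimension shared with an ancestor differs in length.

    Toy model: real exact alignment also compares index values, which this
    sketch ignores.
    """
    for level, parent_sizes in enumerate(ancestors_sizes, start=1):
        for dim, size in child_sizes.items():
            if dim in parent_sizes and parent_sizes[dim] != size:
                raise ValueError(
                    f"dimension {dim!r} has length {size} in the child but "
                    f"{parent_sizes[dim]} in the ancestor {level} level(s) up"
                )


# Illustrative sizes only, not taken from the tutorial dataset.
daily = {"time": 14, "lat": 3, "lon": 4}
weekly = {"time": 2, "lat": 3, "lon": 4}
root = {"lat": 3, "lon": 4}

# Passes: the child shares only lat/lon with the root, and they match.
check_exact_alignment(daily, [root])

# Fails: nesting weekly under daily makes "time" a shared dimension.
try:
    check_exact_alignment(weekly, [daily, root])
    nested_ok = True
except ValueError:
    nested_ok = False
```

In this toy model, the nested layout is rejected only because ``time`` appears in both the child and one of its ancestors; dimensions that an ancestor does not define are never compared.
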
To represent our unalignable data in a single :py:class:`~xarray.DataTree`, we must instead place all variables which are a function of these different-length dimensions into nodes that are not direct descendents of one another, e.g. organize them as siblings.

.. ipython:: python
dt = xr.DataTree.from_dict(
{"daily": ds_daily, "weekly": ds_weekly, "monthly": ds_monthly}
)
dt
Now we have a valid :py:class:`~xarray.DataTree` structure which contains all the data at each different time frequency, stored in a separate group.

This is a useful way to organise our data because we can still operate on all the groups at once.
For example we can extract all three timeseries at a specific lat-lon location:

.. ipython:: python
dt.sel(lat=75, lon=300)
or compute the standard deviation of each timeseries to find out how it varies with sampling frequency:

.. ipython:: python
dt.std(dim="time")
.. _coordinate-inheritance:

Coordinate Inheritance
~~~~~~~~~~~~~~~~~~~~~~

Notice that in the trees we constructed above there is some redundancy - the ``lat`` and ``lon`` variables appear in each sibling group, but are identical across the groups.

.. ipython:: python
dt
We can use "Coordinate Inheritance" to define them only once in a parent group and remove this redundancy, whilst still being able to access those coordinate variables from the child groups.

.. note::
This is also a new feature relative to the prototype `xarray-contrib/datatree <https://github.com/xarray-contrib/datatree>`_ package.

Let's instead place only the time-dependent variables in the child groups, and put the non-time-dependent ``lat`` and ``lon`` variables in the parent (root) group:

.. ipython:: python
dt = xr.DataTree.from_dict(
{
"/": ds.drop_dims("time"),
"daily": ds_daily.drop_vars(["lat", "lon"]),
"weekly": ds_weekly.drop_vars(["lat", "lon"]),
"monthly": ds_monthly.drop_vars(["lat", "lon"]),
}
)
dt
This is preferred to the previous representation because it now makes it clear that all of these datasets share common spatial grid coordinates.
Defining the common coordinates just once also ensures that the spatial coordinates for each group cannot become out of sync with one another during operations.

We can still access the coordinates defined in the parent groups from any of the child groups as if they were actually present on the child groups:

.. ipython:: python
dt.daily.coords
dt["daily/lat"]
As we can still access them, we say that the ``lat`` and ``lon`` coordinates in the child groups have been "inherited" from their common parent group.
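
One way to picture this lookup is as a chain of mappings in which the node's own coordinates are searched first and each ancestor is consulted in turn. The sketch below uses ``collections.ChainMap`` from the Python standard library; it is an analogy only and makes no claim about how xarray stores inherited coordinates. The ``visible_coords`` helper and the sample values are hypothetical, and its ``inherit`` argument merely imitates the ``inherit`` keyword of ``to_dataset`` mentioned in this guide.

```python
from collections import ChainMap

# Hypothetical coordinate values, chosen only for illustration.
root_coords = {"lat": [75.0, 72.5], "lon": [200.0, 202.5]}
daily_local = {"time": ["2013-01-01", "2013-01-02"]}


def visible_coords(local, *ancestors, inherit=True):
    """Return the coordinates visible from a node.

    ancestors are ordered from the nearest parent up to the root. With
    inherit=False only the node-local coordinates are returned.
    """
    if not inherit:
        return dict(local)
    # ChainMap searches `local` first, then each ancestor in order.
    return ChainMap(local, *ancestors)


with_inherited = visible_coords(daily_local, root_coords)
node_local_only = visible_coords(daily_local, root_coords, inherit=False)
```

Here ``with_inherited["lat"]`` resolves through the root mapping, while ``node_local_only`` contains just ``time``. Nothing is copied into the child, so the shared coordinates exist in a single place.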

If we print just one of the child nodes, it will still display inherited coordinates, but explicitly mark them as such:

.. ipython:: python
print(dt["/daily"])
This helps to differentiate which variables are defined on the datatree node that you are currently looking at, and which were defined somewhere above it.

We can also still perform all the same operations on the whole tree:

.. ipython:: python
dt.sel(lat=[75], lon=[300])
dt.std(dim="time")
2 changes: 1 addition & 1 deletion doc/whats-new.rst
@@ -23,7 +23,7 @@ New Features
~~~~~~~~~~~~
- ``DataTree`` related functionality is now exposed in the main ``xarray`` public
API. This includes: ``xarray.DataTree``, ``xarray.open_datatree``, ``xarray.open_groups``,
``xarray.map_over_subtree``, ``xarray.register_datatree_accessor`` and
``xarray.map_over_datasets``, ``xarray.register_datatree_accessor`` and
``xarray.testing.assert_isomorphic``.
By `Owen Littlejohns <https://github.com/owenlittlejohns>`_,
`Eni Awowale <https://github.com/eni-awowale>`_,
4 changes: 2 additions & 2 deletions xarray/__init__.py
@@ -34,7 +34,7 @@
from xarray.core.dataarray import DataArray
from xarray.core.dataset import Dataset
from xarray.core.datatree import DataTree
from xarray.core.datatree_mapping import TreeIsomorphismError, map_over_subtree
from xarray.core.datatree_mapping import TreeIsomorphismError, map_over_datasets
from xarray.core.extensions import (
register_dataarray_accessor,
register_dataset_accessor,
@@ -86,7 +86,7 @@
"load_dataarray",
"load_dataset",
"map_blocks",
"map_over_subtree",
"map_over_datasets",
"merge",
"ones_like",
"open_dataarray",
6 changes: 3 additions & 3 deletions xarray/backends/h5netcdf_.py
@@ -3,7 +3,7 @@
import functools
import io
import os
from collections.abc import Callable, Iterable
from collections.abc import Iterable
from typing import TYPE_CHECKING, Any

import numpy as np
@@ -465,7 +465,7 @@ def open_datatree(
use_cftime=None,
decode_timedelta=None,
format=None,
group: str | Iterable[str] | Callable | None = None,
group: str | None = None,
lock=None,
invalid_netcdf=None,
phony_dims=None,
@@ -511,7 +511,7 @@ def open_groups_as_dict(
use_cftime=None,
decode_timedelta=None,
format=None,
group: str | Iterable[str] | Callable | None = None,
group: str | None = None,
lock=None,
invalid_netcdf=None,
phony_dims=None,