API for N-dimensional combine (#2616)

* concatenates along a single dimension

* Wrote function to find correct tile_IDs from nested list of datasets

* Wrote function to check that combined_tile_ids structure is valid

* Added test of 2d-concatenation

* Tests now check that dataset ordering is correct

* Test concatenation along a new dimension

* Started generalising auto_combine to N-D by integrating the N-D concatenation algorithm

* All unit tests now passing

* Fixed a failing test which I didn't notice because I don't have pseudoNetCDF

* Began updating open_mfdataset to handle N-D input

* Refactored to remove duplicate logic in open_mfdataset & auto_combine

* Implemented Shoyer's suggestion in #2553 to rewrite the recursive nested list traverser as an iterator

* --amend

* Now raises ValueError if input not ordered correctly before concatenation

* Added some more prototype tests defining desired behaviour more clearly

* Now raises informative errors on invalid forms of input

* Refactoring to also merge along each dimension

* Refactored to literally just apply the old auto_combine along each dimension

* Added unit tests for open_mfdataset

* Removed TODOs

* Removed format strings

* test_get_new_tile_ids now doesn't assume dicts are ordered

* Fixed failing tests on python3.5 caused by accidentally assuming dict was ordered

* Test for getting new tile id

* Fixed itertoolz import so that it's compatible with older versions

* Increased test coverage

* Added toolz as an explicit dependency to pass tests on python2.7

* Updated 'what's new'

* No longer attempts to shortcut all concatenation at once if concat_dims=None

* Rewrote using itertools.groupby instead of toolz.itertoolz.groupby to remove hidden dependency on toolz

* Fixed erroneous removal of utils import

* Updated docstrings to include an example of multidimensional concatenation

* Clarified auto_combine docstring for N-D behaviour

* Added unit test for nested list of Datasets with different variables

* Minor spelling and pep8 fixes

* Started working on a new api with both auto_combine and manual_combine

* Wrote basic function to infer concatenation order from coords.

Needs better error handling though.

* Attempt at finalised version of public-facing API.

All the internals still need to be redone to match though.

* No longer uses entire old auto_combine internally, only concat or merge

* Updated what's new

* Removed unneeded addition to what's new for old release

* Fixed incomplete merge in docstring for open_mfdataset

* Tests for manual combine passing

* Tests for auto_combine now passing

* xfailed weird behaviour with manual_combine trying to determine concat_dim

* Add auto_combine and manual_combine to API page of docs

* Tests now passing for open_mfdataset

* Completed merge so that #2648 is respected, and added tests.

Also moved concat to its own file to avoid a circular dependency

* Separated the tests for concat and both combines

* Some PEP8 fixes

* Pre-empting a test which will fail when opening uamiv format

* Satisfy pep8speaks bot

* Python 3.5 compatible after changing some error string formatting

* Order coords using pandas.Index objects

* Fixed performance bug from GH #2662

* Removed ToDos about natural sorting of string coords

* Generalized auto_combine to handle monotonically-decreasing coords too

* Added more examples to docstring for manual_combine

* Added note about globbing aspect of open_mfdataset

* Removed auto-inferring of concatenation dimension in manual_combine

* Added example to docstring for auto_combine

* Minor correction to docstring

* Another very minor docstring correction

* Added test to guard against issue #2777

* Started deprecation cycle for auto_combine

* Fully reverted open_mfdataset tests

* Updated what's new to match deprecation cycle

* Reverted uamiv test

* Removed dependency on itertools

* Deprecation tests fixed

* Satisfy pycodestyle

* Started deprecation cycle of auto_combine

* Added specific error for edge case combine_manual can't handle

* Check that global coordinates are monotonic

* Highlighted weird behaviour when concatenating with no data variables

* Added test for impossible-to-auto-combine coordinates

* Removed unneeded test

* Satisfy linter

* Added airspeedvelocity benchmark for combining functions

* Benchmark will take longer now

* Updated version numbers in deprecation warnings to fit with recent release of 0.12

* Updated api docs for new function names

* Fixed docs build failure

* Revert "Fixed docs build failure"

This reverts commit ddfc6dd.

* Updated documentation with section explaining new functions

* Suppressed deprecation warnings in test suite

* Resolved ToDo by pointing to issue with concat, see #2975

* Various docs fixes

* Slightly renamed tests to match new name of tested function

* Included minor suggestions from shoyer

* Removed trailing whitespace

* Simplified error message for case combine_manual can't handle

* Removed filter for deprecation warnings, and added test for if user doesn't supply concat_dim

* Simple fixes suggested by shoyer

* Change deprecation warning behaviour

* linting
TomNicholas authored and shoyer committed Jun 25, 2019
1 parent 76adf13 commit 6b33ad8
Showing 13 changed files with 2,066 additions and 1,077 deletions.
37 changes: 37 additions & 0 deletions asv_bench/benchmarks/combine.py
@@ -0,0 +1,37 @@
import numpy as np
import xarray as xr


class Combine:
    """Benchmark concatenating and merging large datasets"""

    def setup(self):
        """Create 4 datasets with two different variables"""

        t_size, x_size, y_size = 100, 900, 800
        t = np.arange(t_size)
        data = np.random.randn(t_size, x_size, y_size)

        self.dsA0 = xr.Dataset(
            {'A': xr.DataArray(data, coords={'T': t},
                               dims=('T', 'X', 'Y'))})
        self.dsA1 = xr.Dataset(
            {'A': xr.DataArray(data, coords={'T': t + t_size},
                               dims=('T', 'X', 'Y'))})
        self.dsB0 = xr.Dataset(
            {'B': xr.DataArray(data, coords={'T': t},
                               dims=('T', 'X', 'Y'))})
        self.dsB1 = xr.Dataset(
            {'B': xr.DataArray(data, coords={'T': t + t_size},
                               dims=('T', 'X', 'Y'))})

    def time_combine_manual(self):
        datasets = [[self.dsA0, self.dsA1], [self.dsB0, self.dsB1]]

        # merge the two variables (outer level of the nested list), then
        # concatenate along the existing 'T' dimension (inner level)
        xr.combine_manual(datasets, concat_dim=[None, 'T'])

    def time_auto_combine(self):
        """Also has to load and arrange t coordinate"""
        datasets = [self.dsA0, self.dsA1, self.dsB0, self.dsB1]

        xr.combine_auto(datasets)
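
For a quick sanity check outside of ``asv``, the benchmark methods can be called
directly (a hypothetical driver, not part of the suite):

    # hypothetical driver; asv normally calls setup() and the time_* methods
    bench = Combine()
    bench.setup()
    bench.time_combine_manual()
    bench.time_auto_combine()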
3 changes: 3 additions & 0 deletions doc/api.rst
@@ -19,6 +19,9 @@ Top-level functions
broadcast
concat
merge
auto_combine
combine_auto
combine_manual
where
set_options
full_like
78 changes: 76 additions & 2 deletions doc/combining.rst
@@ -11,9 +11,10 @@ Combining data
import xarray as xr
np.random.seed(123456)
* For combining datasets or data arrays along a dimension, see concatenate_.
* For combining datasets or data arrays along a single dimension, see concatenate_.
* For combining datasets with different variables, see merge_.
* For combining datasets or data arrays with different indexes or missing values, see combine_.
* For combining datasets or data arrays along multiple dimensions, see combining.multi_.

.. _concatenate:

@@ -77,7 +78,7 @@ Merge
~~~~~

To combine variables and coordinates between multiple ``DataArray`` and/or
``Dataset`` object, use :py:func:`~xarray.merge`. It can merge a list of
``Dataset`` objects, use :py:func:`~xarray.merge`. It can merge a list of
``Dataset``, ``DataArray`` or dictionaries of objects convertible to
``DataArray`` objects:

@@ -237,3 +238,76 @@ coordinates as long as any non-missing values agree or are disjoint:
Note that due to the underlying representation of missing values as floating
point numbers (``NaN``), variable data type is not always preserved when merging
in this manner.

.. _combining.multi:

Combining along multiple dimensions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

    There are currently three combining functions with similar names:
    :py:func:`~xarray.auto_combine`, :py:func:`~xarray.combine_auto`, and
    :py:func:`~xarray.combine_manual`. This is because ``auto_combine`` is in
    the process of being deprecated in favour of the other two functions, which
    are more general. If your code currently relies on ``auto_combine``, then
    you will be able to get similar functionality by using ``combine_manual``.
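
    For example, a minimal migration sketch (``datasets`` stands in for an
    ordered one-dimensional list of datasets, and ``'time'`` for the real
    concatenation dimension):

    .. code-block:: python

        # deprecated
        combined = xr.auto_combine(datasets)

        # roughly equivalent for a 1-D ordered list of datasets
        combined = xr.combine_manual(datasets, concat_dim='time')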

For combining many objects along multiple dimensions xarray provides
:py:func:`~xarray.combine_manual` and :py:func:`~xarray.combine_auto`. These
functions use a combination of ``concat`` and ``merge`` across different
variables to combine many objects into one.

:py:func:`~xarray.combine_manual` requires specifying the order in which the
objects should be combined, while :py:func:`~xarray.combine_auto` attempts to
infer this ordering automatically from the coordinates in the data.

:py:func:`~xarray.combine_manual` is useful when you know the spatial
relationship between the objects in advance. The datasets must be provided in
the form of a nested list, which specifies their relative position and
ordering. A common task is collecting data from a parallelized simulation where
each processor wrote out data to a separate file. A domain which was decomposed
into 4 parts, 2 along each of the x and y axes, requires organising the
datasets into a doubly-nested list, e.g.:

.. ipython:: python

    arr = xr.DataArray(name='temperature',
                       data=np.random.randint(5, size=(2, 2)),
                       dims=['x', 'y'])
    arr
    ds_grid = [[arr, arr], [arr, arr]]
    xr.combine_manual(ds_grid, concat_dim=['x', 'y'])

:py:func:`~xarray.combine_manual` can also be used to explicitly merge datasets
with different variables. For example, if we have 4 datasets which are divided
along two times and contain two different variables, we can pass ``None``
to ``concat_dim`` to specify the dimension of the nested list over which
we wish to use ``merge`` instead of ``concat``:

.. ipython:: python

    temp = xr.DataArray(name='temperature', data=np.random.randn(2), dims=['t'])
    precip = xr.DataArray(name='precipitation', data=np.random.randn(2), dims=['t'])
    ds_grid = [[temp, precip], [temp, precip]]
    xr.combine_manual(ds_grid, concat_dim=['t', None])

:py:func:`~xarray.combine_auto` is for combining objects whose dimension
coordinates specify their order relative to one another, for example a
linearly-increasing 'time' dimension coordinate.

Here we combine two datasets using their common dimension coordinates. Notice
they are concatenated in order based on the values in their dimension
coordinates, not on their position in the list passed to ``combine_auto``.

.. ipython:: python
    :okwarning:

    x1 = xr.DataArray(name='foo', data=np.random.randn(3), coords=[('x', [0, 1, 2])])
    x2 = xr.DataArray(name='foo', data=np.random.randn(3), coords=[('x', [3, 4, 5])])
    xr.combine_auto([x2, x1])

These functions can be used by :py:func:`~xarray.open_mfdataset` to open many
files as one dataset. The particular function used is specified by setting the
``combine`` argument to ``'auto'`` or ``'manual'``. This is useful for
situations where your data is split across many files in multiple locations,
which have some known relationship to one another.
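
As a sketch (the file paths below are hypothetical placeholders), a 2x2 grid of
files could be opened with either approach:

.. code-block:: python

    # known layout: the outer list varies along 'x', the inner lists along 'y'
    ds = xr.open_mfdataset([['x0y0.nc', 'x0y1.nc'],
                            ['x1y0.nc', 'x1y1.nc']],
                           combine='manual', concat_dim=['x', 'y'])

    # or let the dimension coordinates in the files determine the order
    ds = xr.open_mfdataset('data/*.nc', combine='auto')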
8 changes: 6 additions & 2 deletions doc/io.rst
@@ -766,7 +766,10 @@ Combining multiple files

NetCDF files are often encountered in collections, e.g., with different files
corresponding to different model runs. xarray can straightforwardly combine such
files into a single Dataset by making use of :py:func:`~xarray.concat`.
files into a single Dataset by making use of :py:func:`~xarray.concat`,
:py:func:`~xarray.merge`, :py:func:`~xarray.combine_manual` and
:py:func:`~xarray.combine_auto`. For details on the difference between these
functions, see :ref:`combining data`.

.. note::

@@ -779,7 +782,8 @@ files into a single Dataset by making use of :py:func:`~xarray.concat`.
This function automatically concatenates and merges multiple files into a
single xarray dataset.
It is the recommended way to open multiple files with xarray.
For more details, see :ref:`dask.io` and a `blog post`_ by Stephan Hoyer.
For more details, see :ref:`combining.multi`, :ref:`dask.io` and a
`blog post`_ by Stephan Hoyer.

.. _dask: http://dask.pydata.org
.. _blog post: http://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/
21 changes: 21 additions & 0 deletions doc/whats-new.rst
@@ -56,6 +56,23 @@ Enhancements
helpful for avoiding file-lock errors when trying to write to files opened
using ``open_dataset()`` or ``open_dataarray()``. (:issue:`2887`)
By `Dan Nowacki <https://github.com/dnowacki-usgs>`_.
- Combining datasets along N dimensions:
Datasets can now be combined along any number of dimensions,
instead of just a one-dimensional list of datasets.

The new ``combine_manual`` will accept the datasets as a nested
list-of-lists, and combine by applying a series of concat and merge
operations. The new ``combine_auto`` will instead use the dimension
coordinates of the datasets to order them.
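
A minimal sketch of the two new entry points (``ds00`` etc. are placeholder
datasets laid out on a 2x2 grid):

.. code-block:: python

    combined = xr.combine_manual([[ds00, ds01], [ds10, ds11]],
                                 concat_dim=['x', 'y'])
    combined = xr.combine_auto([ds00, ds01, ds10, ds11])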

``open_mfdataset`` can use either ``combine_manual`` or ``combine_auto`` to
combine datasets along multiple dimensions, by specifying the argument
``combine='manual'`` or ``combine='auto'``.

This means that the original function ``auto_combine`` is being deprecated.
To avoid ``FutureWarning``s, switch to using ``combine_manual`` or ``combine_auto``
(or set the ``combine`` argument in ``open_mfdataset``). (:issue:`2159`)
By `Tom Nicholas <http://github.com/TomNicholas>`_.
- Better warning message when supplying invalid objects to ``xr.merge``
(:issue:`2948`). By `Mathias Hauser <https://github.com/mathause>`_.
- Added ``strftime`` method to ``.dt`` accessor, making it simpler to hand a
@@ -203,6 +220,10 @@ Other enhancements
report showing what exactly differs between the two objects (dimensions /
coordinates / variables / attributes) (:issue:`1507`).
By `Benoit Bovy <https://github.com/benbovy>`_.
- Resampling of standard and non-standard calendars indexed by
:py:class:`~xarray.CFTimeIndex` is now possible. (:issue:`2191`).
By `Jwen Fai Low <https://github.com/jwenfai>`_ and
`Spencer Clark <https://github.com/spencerkclark>`_.
- Add ``tolerance`` option to ``resample()`` methods ``bfill``, ``pad``,
``nearest``. (:issue:`2695`)
By `Hauke Schulz <https://github.com/observingClouds>`_.
3 changes: 2 additions & 1 deletion xarray/__init__.py
@@ -7,7 +7,8 @@

from .core.alignment import align, broadcast, broadcast_arrays
from .core.common import full_like, zeros_like, ones_like
from .core.combine import concat, auto_combine
from .core.concat import concat
from .core.combine import combine_auto, combine_manual, auto_combine
from .core.computation import apply_ufunc, dot, where
from .core.extensions import (register_dataarray_accessor,
register_dataset_accessor)