map_blocks #3276

Merged Oct 10, 2019 (82 commits)

Commits
b090a9e
map_block attempt 2
dcherian Sep 2, 2019
3948798
Address reviews: errors, args + kwargs support.
dcherian Sep 2, 2019
4f159c8
Works with datasets!
dcherian Sep 3, 2019
9179f0b
remove wrong comment.
dcherian Sep 8, 2019
20c5d5b
Support chunks.
dcherian Sep 8, 2019
b16b237
infer template.
dcherian Sep 8, 2019
43ef2b7
cleanup
dcherian Sep 8, 2019
5ebf738
cleanup2
dcherian Sep 8, 2019
8a460bb
api.rst
dcherian Sep 8, 2019
505f3f0
simple shape change error check.
dcherian Sep 8, 2019
fe1982f
Make test more complicated.
dcherian Sep 8, 2019
066eb59
Fix for when user function doesn't set DataArray.name
dcherian Sep 9, 2019
83eb310
Now _to_temp_dataset works.
dcherian Sep 9, 2019
008ce29
Add whats-new
dcherian Sep 9, 2019
adbe48e
chunks kwarg makes no sense right now.
dcherian Sep 9, 2019
924bf69
review feedback:
dcherian Sep 19, 2019
8aed8e7
Support nondim coords in make_meta.
dcherian Sep 19, 2019
d0797f6
Add Dataset.unify_chunks
dcherian Sep 19, 2019
599b70a
Merge branch 'master' into map_blocks_2
dcherian Sep 19, 2019
765ca5d
doc updates.
dcherian Sep 19, 2019
180bbf2
Merge remote-tracking branch 'upstream/master' into map_blocks_2
dcherian Sep 19, 2019
f0de1db
minor.
dcherian Sep 19, 2019
1251a5d
update comment.
dcherian Sep 19, 2019
47a0e39
More complicated test dataset. Tests fail :X
dcherian Sep 19, 2019
fa44d32
Don't know why compute is needed.
dcherian Sep 19, 2019
a6e84ef
work with DataArray nondim coords.
dcherian Sep 19, 2019
c28b402
fastpath unify_chunks
dcherian Sep 20, 2019
1694d03
comment.
dcherian Sep 20, 2019
cf04ec8
much improved tests.
dcherian Sep 20, 2019
3e9db26
Change args, kwargs syntax.
dcherian Sep 20, 2019
20fdde6
Add dataset, dataarray methods.
dcherian Sep 20, 2019
22e9c4e
api.rst
dcherian Sep 20, 2019
b145787
docstrings.
dcherian Sep 20, 2019
f600c4a
Fix unify_chunks.
dcherian Sep 23, 2019
4af5a67
Move assert_chunks_equal to xarray.testing.
dcherian Sep 23, 2019
3ca4b7b
minor changes.
dcherian Sep 23, 2019
3345d25
Better error handling when inferring returned object
dcherian Sep 23, 2019
54c77dd
wip
dcherian Sep 23, 2019
fb1ff0b
Docstrings + nicer error message.
dcherian Sep 26, 2019
bad0855
wip
dcherian Sep 23, 2019
291e6e6
better to_array
dcherian Sep 23, 2019
b31537c
remove unify_chunks in map_blocks + better tests.
dcherian Sep 26, 2019
72e7913
typing for unify_chunks
dcherian Sep 26, 2019
0a6bbed
address more review comments.
dcherian Sep 26, 2019
210987e
more unify_chunks tests.
dcherian Sep 26, 2019
582e0d5
Just use dask.core.utils.meta_from_array
dcherian Sep 26, 2019
d0fd87e
get tests working. assert_equal needs a lot of fixing.
dcherian Sep 28, 2019
875264a
more unify_chunks test.
dcherian Sep 28, 2019
0f03e37
assert_chunks_equal fixes.
dcherian Sep 28, 2019
8175d73
copy over meta_from_array.
dcherian Sep 28, 2019
6ab8737
minor fixes.
dcherian Sep 28, 2019
08c41b9
raise chunks error earlier and test for map_blocks raising chunk error
dcherian Sep 28, 2019
76bc23c
fix.
dcherian Sep 28, 2019
49d3899
Type annotations
Oct 1, 2019
ae53b85
py35 compat
Oct 1, 2019
f6dfb12
make sure unify_chunks does not compute.
dcherian Oct 1, 2019
c73eda1
Make tests functional by call compute before assert_equal
dcherian Oct 1, 2019
8ad882b
Update whats-new
dcherian Oct 3, 2019
aa4ea00
Merge remote-tracking branch 'upstream/master' into map_blocks_2
dcherian Oct 3, 2019
3cda5ac
Work with attributes.
dcherian Oct 3, 2019
49969a7
Support attrs and name changes.
dcherian Oct 3, 2019
6faf79e
more assert_equal
dcherian Oct 3, 2019
47baf76
test changing coord attribute
dcherian Oct 3, 2019
1295499
Merge remote-tracking branch 'upstream/master' into map_blocks_2
dcherian Oct 3, 2019
ce252f2
fix whats new
dcherian Oct 3, 2019
50ae13f
rework tests to use fixtures (kind of)
dcherian Oct 7, 2019
cdcf221
more review changes.
dcherian Oct 7, 2019
f167537
cleanup
dcherian Oct 7, 2019
4390f73
more review feedback.
dcherian Oct 7, 2019
c936557
fix unify_chunks.
dcherian Oct 8, 2019
e34aafe
Merge remote-tracking branch 'upstream/master' into map_blocks_2
dcherian Oct 9, 2019
2c7938a
read dask_array_compat :)
dcherian Oct 9, 2019
08ed873
Dask 1.2.0 compat.
dcherian Oct 10, 2019
67663aa
Merge remote-tracking branch 'upstream/master' into map_blocks_2
crusaderky Oct 10, 2019
99d61fc
documentation polish
crusaderky Oct 10, 2019
687689e
make_meta reflow
crusaderky Oct 10, 2019
f588cb6
cosmetic
crusaderky Oct 10, 2019
d476e2f
polish
crusaderky Oct 10, 2019
26a6a0d
Fix tests
crusaderky Oct 10, 2019
6491753
isort
crusaderky Oct 10, 2019
b227bea
isort
crusaderky Oct 10, 2019
2a41906
Add func call to docstrings.
dcherian Oct 10, 2019
6 changes: 6 additions & 0 deletions doc/api.rst
@@ -30,6 +30,7 @@ Top-level functions
zeros_like
ones_like
dot
map_blocks

Dataset
=======
@@ -499,6 +500,8 @@ Dataset methods
Dataset.persist
Dataset.load
Dataset.chunk
Dataset.unify_chunks
Dataset.map_blocks
Dataset.filter_by_attrs
Dataset.info

@@ -529,6 +532,8 @@ DataArray methods
DataArray.persist
DataArray.load
DataArray.chunk
DataArray.unify_chunks
DataArray.map_blocks

GroupBy objects
===============
@@ -629,6 +634,7 @@ Testing
testing.assert_equal
testing.assert_identical
testing.assert_allclose
testing.assert_chunks_equal

Exceptions
==========
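For reference, a quick sketch of the new testing helper listed above (an illustration only, assuming, per this PR, that it accepts both Datasets and DataArrays and compares their chunk layouts):

```python
import numpy as np
import xarray as xr

# Two arrays with the same chunk layout: ((2, 2), (4,)).
a = xr.DataArray(np.zeros((4, 4)), dims=("x", "y")).chunk({"x": 2})
b = xr.DataArray(np.ones((4, 4)), dims=("x", "y")).chunk({"x": 2})

xr.testing.assert_chunks_equal(a, b)  # passes; raises AssertionError on mismatch
```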
5 changes: 5 additions & 0 deletions doc/whats-new.rst
@@ -49,6 +49,11 @@ Breaking changes
New functions/methods
~~~~~~~~~~~~~~~~~~~~~

- Added :py:func:`~xarray.map_blocks`, modeled after :py:func:`dask.array.map_blocks`.
Also added :py:meth:`Dataset.unify_chunks`, :py:meth:`DataArray.unify_chunks` and
:py:meth:`testing.assert_chunks_equal`. By `Deepak Cherian <https://github.com/dcherian>`_
and `Guido Imperiale <https://github.com/crusaderky>`_.

Enhancements
~~~~~~~~~~~~

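For orientation, a minimal sketch of the new top-level function in use (requires dask; the `demean` helper is illustrative, and the call follows the docstrings in this PR):

```python
import numpy as np
import xarray as xr

# A small DataArray split into two chunks along "x".
da = xr.DataArray(
    np.arange(12.0).reshape(4, 3), dims=("x", "y")
).chunk({"x": 2})

# func receives one in-memory DataArray per chunk and must return a
# DataArray or Dataset without resizing existing dimensions.
def demean(block):
    return block - block.mean()

result = xr.map_blocks(demean, da)  # lazy: one task per chunk
print(result.compute())
```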
1 change: 1 addition & 0 deletions xarray/__init__.py
@@ -17,6 +17,7 @@
from .core.dataarray import DataArray
from .core.merge import merge, MergeError
from .core.options import set_options
from .core.parallel import map_blocks

from .backends.api import (
    open_dataset,
1 change: 0 additions & 1 deletion xarray/coding/times.py
@@ -22,7 +22,6 @@
    unpack_for_encoding,
)


# standard calendars recognized by cftime
_STANDARD_CALENDARS = {"standard", "gregorian", "proleptic_gregorian"}

91 changes: 91 additions & 0 deletions xarray/core/dask_array_compat.py
@@ -0,0 +1,91 @@
from distutils.version import LooseVersion

import dask.array as da
import numpy as np
from dask import __version__ as dask_version

if LooseVersion(dask_version) >= LooseVersion("2.0.0"):
    meta_from_array = da.utils.meta_from_array
else:
    # Copied from dask v2.4.0
    # Used under the terms of Dask's license, see licenses/DASK_LICENSE.
    import numbers

    def meta_from_array(x, ndim=None, dtype=None):
        """Normalize an array to an appropriate meta object

        Parameters
        ----------
        x: array-like, callable
            Either an object that looks sufficiently like a Numpy array,
            or a callable that accepts shape and dtype keywords
        ndim: int
            Number of dimensions of the array
        dtype: Numpy dtype
            A valid input for ``np.dtype``

        Returns
        -------
        array-like with zero elements of the correct dtype
        """
        # If using x._meta, x must be a Dask Array; some libraries (e.g. zarr)
        # implement a _meta attribute that is incompatible with Dask Array._meta
        if hasattr(x, "_meta") and isinstance(x, da.Array):
            x = x._meta

        if dtype is None and x is None:
            raise ValueError("You must specify the meta or dtype of the array")

        if np.isscalar(x):
            x = np.array(x)

        if x is None:
            x = np.ndarray

        if isinstance(x, type):
            x = x(shape=(0,) * (ndim or 0), dtype=dtype)

        if (
            not hasattr(x, "shape")
            or not hasattr(x, "dtype")
            or not isinstance(x.shape, tuple)
        ):
            return x

        if isinstance(x, list) or isinstance(x, tuple):
            ndims = [
                0
                if isinstance(a, numbers.Number)
                else a.ndim
                if hasattr(a, "ndim")
                else len(a)
                for a in x
            ]
            a = [a if nd == 0 else meta_from_array(a, nd) for a, nd in zip(x, ndims)]
            return a if isinstance(x, list) else tuple(a)

        if ndim is None:
            ndim = x.ndim

        try:
            meta = x[tuple(slice(0, 0, None) for _ in range(x.ndim))]
            if meta.ndim != ndim:
                if ndim > x.ndim:
                    meta = meta[
                        (Ellipsis,) + tuple(None for _ in range(ndim - meta.ndim))
                    ]
                    meta = meta[tuple(slice(0, 0, None) for _ in range(meta.ndim))]
                elif ndim == 0:
                    meta = meta.sum()
                else:
                    meta = meta.reshape((0,) * ndim)
        except Exception:
            meta = np.empty((0,) * ndim, dtype=dtype or x.dtype)

        if np.isscalar(meta):
            meta = np.array(meta)

        if dtype and meta.dtype != dtype:
            meta = meta.astype(dtype)

        return meta
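To make the vendored helper above concrete, a small usage sketch (the import path matches this diff; the output follows from the code above):

```python
import numpy as np
from xarray.core.dask_array_compat import meta_from_array

# A "meta" is a zero-element array that carries only dtype/ndim information.
meta = meta_from_array(np.ones((4, 5)), ndim=3, dtype="float32")
print(meta.shape, meta.dtype)  # (0, 0, 0) float32
```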
75 changes: 75 additions & 0 deletions xarray/core/dataarray.py
@@ -14,6 +14,7 @@
Optional,
Sequence,
Tuple,
TypeVar,
Union,
cast,
overload,
@@ -63,6 +64,8 @@
)

if TYPE_CHECKING:
    T_DSorDA = TypeVar("T_DSorDA", "DataArray", Dataset)

    try:
        from dask.delayed import Delayed
    except ImportError:
@@ -3038,6 +3041,78 @@ def integrate(
        ds = self._to_temp_dataset().integrate(dim, datetime_unit)
        return self._from_temp_dataset(ds)

    def unify_chunks(self) -> "DataArray":
        """Unify chunk size along all chunked dimensions of this DataArray.

        Returns
        -------
        DataArray with consistent chunk sizes for all dask-array variables

        See Also
        --------
        dask.array.core.unify_chunks
        """
        ds = self._to_temp_dataset().unify_chunks()
        return self._from_temp_dataset(ds)

    def map_blocks(
        self,
        func: "Callable[..., T_DSorDA]",
        args: Sequence[Any] = (),
        kwargs: Mapping[str, Any] = None,
    ) -> "T_DSorDA":
        """
        Apply a function to each chunk of this DataArray. This method is experimental
        and its signature may change.

        Parameters
        ----------
        func: callable
            User-provided function that accepts a DataArray as its first parameter. The
            function will receive a subset of this DataArray, corresponding to one chunk
            along each chunked dimension.

            The function will first be run on mocked-up data that looks like this array
            but has sizes 0, to determine properties of the returned object such as
            dtype, variable names, new dimensions and new indexes (if any).

            This function must return either a single DataArray or a single Dataset.

            This function cannot change the size of existing dimensions, or add new
            chunked dimensions.
        args: Sequence
            Passed verbatim to func after unpacking, after the sliced DataArray. xarray
            objects, if any, will not be split by chunks. Passing dask collections is
            not allowed.
        kwargs: Mapping
            Passed verbatim to func after unpacking. xarray objects, if any, will not be
            split by chunks. Passing dask collections is not allowed.

        Returns
        -------
        A single DataArray or Dataset with a dask backend, reassembled from the outputs
        of the function.

        Notes
        -----
        This method is designed for when one needs to manipulate a whole xarray object
        within each chunk. In the more common case where one can work on numpy arrays,
        it is recommended to use apply_ufunc.

        If none of the variables in this DataArray is backed by dask, calling this
        method is equivalent to calling ``func(self, *args, **kwargs)``.

        See Also
        --------
        dask.array.map_blocks, xarray.apply_ufunc, xarray.map_blocks,
        xarray.Dataset.map_blocks
        """
        from .parallel import map_blocks

        return map_blocks(func, self, args, kwargs)

    # this needs to be at the end, or mypy will confuse with `str`
    # https://mypy.readthedocs.io/en/latest/common_issues.html#dealing-with-conflicting-names
    str = property(StringAccessor)
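Finally, a hedged sketch of the method forms added in this file, chained on a chunked DataArray (the `scale` helper and its `factor` kwarg are illustrative):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.random.rand(10), dims="time", name="t").chunk({"time": 5})

# unify_chunks makes chunk sizes consistent across dask variables;
# map_blocks then applies func chunk-by-chunk, forwarding kwargs unsplit.
def scale(block, factor=1.0):
    return block * factor

out = da.unify_chunks().map_blocks(scale, kwargs={"factor": 2.0})
print(out.compute())
```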