Support Dask interface #1674
Conversation
I've only done Variable so far. Hopefully what's here seems straightforward. I'll do DataArray and Dataset next and then look at what legacy code I can clean up within XArray. I'll be working on this while on a long flight and so may not respond quickly.
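For context, a minimal sketch (illustrative only, not the PR's exact code) of what implementing the dask collection interface on Variable looks like; the toy constructor and the isinstance check against dask.array.Array are assumptions standing in for xarray's internals:

import dask.array as da

class Variable:
    # Toy stand-in for xarray.core.variable.Variable, not its real constructor.
    def __init__(self, data):
        self._data = data

    # dask.compute()/dask.persist()/dask.visualize() dispatch to these
    # dunder methods when deciding how to handle the object.
    def __dask_graph__(self):
        if isinstance(self._data, da.Array):
            return self._data.__dask_graph__()
        return None  # None tells dask this object is not a dask collection

    def __dask_keys__(self):
        return self._data.__dask_keys__()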
xarray/core/variable.py
Outdated
def visualize(self, **kwargs):
    import dask
    return dask.visualize(self, **kwargs)
My inclination would be to leave this out and require using dask.visualize(). My concern is that it could be easily confused with .plot().
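As a usage sketch of the suggested alternative (the dataset below is a placeholder, and rendering the graph needs the optional graphviz dependency):

import dask
import dask.array as da
import xarray as xr

# With the collection interface from this PR, the top-level dask.visualize()
# accepts xarray objects directly, so no .visualize() method is needed.
ds = xr.Dataset({'foo': ('x', da.arange(6, chunks=3))})
dask.visualize(ds, filename='graph.png')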
Removed
Also test distributed computing
For the distributed work this now also uses dask/distributed#1513
I've updated this to cover DataArray and Dataset as well
@mrocklin - thanks for getting this started. Curious, does the test suite pass when you combine this with dask/dask#2847?
Generally yes, things work. There are a few xfailed failures in tests that used mock on functions that are no longer being used. Also
OK, this is now backwards compatible. Tests should pass.
Generally this looks great, thanks for putting this together!
xarray/core/dataset.py
Outdated
    if dask.is_dask_collection(v) else
    (False, k, v) for k, v in self._variables.items()]
return self._dask_postcompute, (info, self._coord_names, self._dims,
    self._attrs, self._file_obj, self._encoding)
nit: please indent to match the opening ( on the previous line
Fixed. Is it possible to add this style concern to the flake8 tests?
    return None

def __dask_keys__(self):
    return self._data.__dask_keys__()
Is it OK if these methods error (with AttributeError) when self._data is not a dask array?
Yes, we always check if the object is a dask collection first by calling __dask_graph__
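To illustrate the check being referred to, dask.base.is_dask_collection behaves roughly like this simplified sketch:

def is_dask_collection(x):
    # Simplified sketch: an object counts as a dask collection when it
    # implements __dask_graph__ and that call returns a non-None graph.
    # Numpy-backed xarray objects return None (or lack the method) and
    # are therefore skipped.
    try:
        return x.__dask_graph__() is not None
    except (AttributeError, TypeError):
        return False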
xarray/core/dataarray.py
Outdated
@@ -576,6 +576,33 @@ def reset_coords(self, names=None, drop=False, inplace=False):
        dataset[self.name] = self.variable
        return dataset

    def __dask_graph__(self):
        return self._variable.__dask_graph__()
It's actually possible to have multiple dask arrays in an xarray.DataArray, if there are dask arrays in the coordinates. So it would be better to handle DataArray by converting to a Dataset than to a Variable. We use the _to_temp_dataset/_from_temp_dataset methods as a shortcut for these types of cases, e.g., see the current implementation of DataArray.persist().
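A minimal sketch of that delegation, assuming the private _to_temp_dataset helper mentioned above (the PR's final code may differ):

class DataArray:
    # Route the dask collection protocol through a temporary single-variable
    # Dataset so that dask arrays stored in the coordinates are part of the
    # graph too; _to_temp_dataset() is the private helper mentioned above.
    def __dask_graph__(self):
        return self._to_temp_dataset().__dask_graph__()

    def __dask_keys__(self):
        return self._to_temp_dataset().__dask_keys__()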
OK, that will be a bit tricky. My guess is that it might be simpler to just account for all of the possible dask things explicitly, as we do in dataset. Otherwise we're converting to and from datasets in each of the __dask_foo__ methods, and I would not be surprised to run into oddness there. I'm not sure though.
Can you recommend a test case that includes dask arrays in the coordinates?
For some reason this fails when I use dask arrays for coordinates:
coord = da.arange(8, chunks=(4,))
data = da.random.random((8, 8), chunks=(4, 4)) + 1
array = DataArray(data,
                  coords={'x': coord, 'y': coord},
                  dims=['x', 'y'])
Replacing coords with a numpy array works fine.
@mrocklin - what is the failure you are referring to?
In [1]: import xarray
In [2]: import dask.array as da
In [3]: coord = da.arange(8, chunks=(4,))
...: data = da.random.random((8, 8), chunks=(4, 4)) + 1
...: array = xarray.DataArray(data,
...: coords={'x': coord, 'y': coord},
...: dims=['x', 'y'])
...:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-b90a33ebf436> in <module>()
3 array = xarray.DataArray(data,
4 coords={'x': coord, 'y': coord},
----> 5 dims=['x', 'y'])
/home/mrocklin/workspace/xarray/xarray/core/dataarray.py in __init__(self, data, coords, dims, name, attrs, encoding, fastpath)
227
228 data = as_compatible_data(data)
--> 229 coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
230 variable = Variable(dims, data, attrs, encoding, fastpath=True)
231
/home/mrocklin/workspace/xarray/xarray/core/dataarray.py in _infer_coords_and_dims(shape, coords, dims)
68 if utils.is_dict_like(coords):
69 for k, v in coords.items():
---> 70 new_coords[k] = as_variable(v, name=k)
71 elif coords is not None:
72 for dim, coord in zip(dims, coords):
/home/mrocklin/workspace/xarray/xarray/core/variable.py in as_variable(obj, name)
94 '{}'.format(obj))
95 elif utils.is_scalar(obj):
---> 96 obj = Variable([], obj)
97 elif getattr(obj, 'name', None) is not None:
98 obj = Variable(obj.name, obj)
/home/mrocklin/workspace/xarray/xarray/core/variable.py in __init__(self, dims, data, attrs, encoding, fastpath)
275 """
276 self._data = as_compatible_data(data, fastpath=fastpath)
--> 277 self._dims = self._parse_dimensions(dims)
278 self._attrs = None
279 self._encoding = None
/home/mrocklin/workspace/xarray/xarray/core/variable.py in _parse_dimensions(self, dims)
439 raise ValueError('dimensions %s must have the same length as the '
440 'number of data dimensions, ndim=%s'
--> 441 % (dims, self.ndim))
442 return dims
443
ValueError: dimensions () must have the same length as the number of data dimensions, ndim=1
My objective here is to produce a case where a data array has dask arrays in its coordinates so that I can write code to handle such cases.
@mrocklin - something funny is going on here. I'm going to open a separate issue.
In the short term, this may help you move forward:
In [21]: x = xr.Variable('x', da.arange(8, chunks=(4,)))
...: y = xr.Variable('y', da.arange(8, chunks=(4,)) * 2)
...: data = da.random.random((8, 8), chunks=(4, 4)) + 1
...: array = xr.DataArray(data,
...: coords={'xx': x, 'yy': y},
...: dims=['x', 'y'])
...:
In [22]: array
Out[22]:
<xarray.DataArray 'add-a034ba104341d3cca6b28ad7bf059b14' (x: 8, y: 8)>
dask.array<shape=(8, 8), dtype=float64, chunksize=(4, 4)>
Coordinates:
xx (x) int64 dask.array<shape=(8,), chunksize=(4,)>
yy (y) int64 dask.array<shape=(8,), chunksize=(4,)>
Dimensions without coordinates: x, y
xarray/core/dataset.py
Outdated
    if dask.is_dask_collection(v) else
    (False, k, v) for k, v in self._variables.items()]
return self._dask_postpersist, (info, self._coord_names, self._dims,
    self._attrs, self._file_obj, self._encoding)
nit: please indent
xarray/tests/test_dask.py
Outdated
lambda x: x.persist(),
pytest.mark.skipif(LooseVersion(dask.__version__) < '0.16',
lambda x: dask.persist(x)[0],
reason='Need Dask 0.16+')
This is pretty confusing at first glance, unless you already deeply understand how pytest marks work.
I don't really have a suggestion for how to make this better, but maybe a comment is in order?
Added a comment on the first such use of this parametrization
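For illustration, a hypothetical sketch of such a commented parametrization, written with the modern pytest.param API rather than whatever exact style the PR uses; the test name and body are placeholders:

import dask
import dask.array as da
import pytest
import xarray as xr
from distutils.version import LooseVersion

@pytest.mark.parametrize('persist', [
    # The method form works on any dask version xarray supports.
    lambda x: x.persist(),
    # The top-level dask.persist() needs the collection interface that
    # landed in dask 0.16, so skip this variant on older dask releases.
    pytest.param(lambda x: dask.persist(x)[0],
                 marks=pytest.mark.skipif(
                     LooseVersion(dask.__version__) < '0.16',
                     reason='Need Dask 0.16+')),
])
def test_persist(persist):
    ds = xr.Dataset({'foo': ('x', da.arange(6, chunks=3))})
    assert dask.is_dask_collection(persist(ds))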
xarray/tests/test_distributed.py
Outdated
assert dask.is_dask_collection(y)
assert dask.is_dask_collection(y.var1)
assert dask.is_dask_collection(y.var2)
# assert not dask.is_dask_collection(y.var3)  # TODO: avoid chunking unnecessarily in dataset.py::maybe_chunk
We could probably argue about whether .chunk() should chunk variables that don't use the supplied dimension. But the default behavior is to chunk everything when given an empty argument, so I think it's actually correct to do it this way (it certainly makes the return value easier to understand).
Probably a better way to do this would be to construct the dataset by hand from dask arrays, e.g.,
ds = Dataset({'foo': ('x', da.arange(3, chunks=(3,))), 'bar': ('x', np.arange(3))})
assert dask.is_dask_collection(ds)
assert dask.is_dask_collection(ds.foo)
assert not dask.is_dask_collection(ds.bar)
In a distributed context there is more cost to this behavior than with the threaded scheduler because we communicate the array around the network, rather than do things immediately/locally.
Looks good to me, though it still needs the note in "What's new". This is pretty low-risk in its current state (just adding new methods), so I would be OK including it in v0.10.
xarray/tests/test_distributed.py
Outdated
assert dask.is_dask_collection(z)
assert dask.is_dask_collection(z.var1)
assert dask.is_dask_collection(z.var2)
# assert not dask.is_dask_collection(z.var3)
remove?
Removed
So, we're currently labeling the Dask collection interface (what we've implemented here) as experimental and subject to change without a deprecation cycle. I don't foresee much changing, but it might be wise to let people experiment with this in master for a while without putting it in the release. I would not be surprised if we find issues with it after moderate use.
I just tried things and persisting datasets seems to work well for me in practice.
Thank you for stepping in @shoyer. From my perspective this is good to go. However I'm also not in any rush.
Thanks Matt. I was just waiting for CI to pass. I've indicated that this is experimental in the release notes, so as long as we keep the messaging consistent on the Dask side I think we have room to change this up if needed.
Great. I'm glad to see this in. Thanks for the help!
This integrates the new dask interface methods into XArray. This will place XArray as a first-class dask collection and help in particular with newer dask.distributed features.
- Passes git diff upstream/master **/*py | flake8 --diff
- Fully documented, including whats-new.rst for all changes and api.rst for new API

Builds on work from @jcrist here: dask/dask#2748
Depends on dask/dask#2847
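For reference, a condensed sketch of the dask collection protocol that this PR wires into Variable, DataArray and Dataset; method names follow dask's documentation, and the bodies are placeholders rather than xarray's implementation:

import dask.threaded

class MyCollection:
    def __dask_graph__(self):
        ...  # the task graph (a mapping), or None if the object is not dask-backed

    def __dask_keys__(self):
        ...  # the output keys dask should compute from the graph

    @staticmethod
    def __dask_optimize__(dsk, keys, **kwargs):
        return dsk  # optionally prune/rewrite the graph before execution

    # default scheduler used for this type's .compute()/.persist()
    __dask_scheduler__ = staticmethod(dask.threaded.get)

    def __dask_postcompute__(self):
        ...  # (finalize_function, extra_args) used to rebuild the concrete object

    def __dask_postpersist__(self):
        ...  # like __dask_postcompute__, but rebuilds a lazy, still-chunked object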