Disabled auto-caching on dask; new .compute() method #1018

Closed
crusaderky wants to merge 6 commits into master from no_dask_resolve

Conversation

crusaderky (Contributor)

Disabled auto-caching of dask arrays when pickling and when invoking the .values property.
Added new method .compute() to Variable, DataArray and Dataset.

@shoyer (Member) left a comment:

I have some minor structural suggestions, but generally this looks very nice.

Take a look at the tests in test_dask.py and see if you can remove the hacky _copy_at_variable_level function now that computing values no longer mutates.

"""
# Can't just do the below, because new.variables[k] is the
# same object as self.variables[k]!
# new = self.copy(deep=False)
@shoyer (Member) commented on Sep 26, 2016:

Let's fix Dataset.copy() at the source instead so this works, e.g.,

    def copy(self, deep=False):
        """Returns a copy of this dataset.

        If `deep=True`, a deep copy is made of each of the component arrays.
        Otherwise, a shallow copy is made, so each array in the new dataset
        is also an array in the original dataset.
        """
        variables = OrderedDict((k, v.copy(deep=deep))
                                for k, v in iteritems(self._variables))
        # skip __init__ to avoid costly validation
        return self._construct_direct(variables, self._coord_names.copy(),
                                      self._dims.copy(), self._attrs_copy())

I no longer have any idea why I thought it made sense to not copy variables in Dataset.copy(). Copying variables is cheap, and exposing details about the Variable object in the public API makes the Dataset abstraction leaky.
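
To illustrate with a minimal sketch (standard xarray shallow-copy semantics, not code from this PR): a shallow copy duplicates the Variable wrapper but shares the underlying data, which is why copying variables is cheap.

    import numpy as np
    from xarray import Variable

    v = Variable('x', np.arange(3))
    w = v.copy(deep=False)
    assert w is not v        # a new Variable wrapper...
    assert w.data is v.data  # ...sharing the same underlying array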

crusaderky (Contributor, Author) replied:

That was actually my first try, but all hell broke loose in the regression tests when I did. I'll try again later...

crusaderky (Contributor, Author) replied:

It's done. However, please see the caveat I wrote in the test script.

Further explanation of what I mean:

import numpy as np
from xarray import Variable, IndexVariable
from xarray.test import source_ndarray

def test_copy(data, cls):
    data = np.array(data)
    a = cls('x', data)
    print(source_ndarray(data) is source_ndarray(a.data))
    b = a.copy(deep=False)
    print(source_ndarray(a.data) is source_ndarray(b.data))

test_copy([1, 2, 3], Variable)
test_copy(['a', 'b', 'c'], Variable)
test_copy([1, 2, 3], IndexVariable)
test_copy(['a', 'b', 'c'], IndexVariable)

Output:
True
True
True
True
True
True
False
False

(The two False lines come from the last call: an IndexVariable constructed from string data no longer shares the source ndarray, not even via a shallow copy; this is the caveat referred to above.)

@@ -306,7 +309,7 @@ def load(self):
         """
         # access .data to coerce everything to numpy or dask arrays
         all_data = dict((k, v.data) for k, v in self.variables.items())
-        lazy_data = dict((k, v) for k, v in all_data.items()
+        lazy_data = OrderedDict((k, v) for k, v in all_data.items()
shoyer (Member) commented:

I don't think this matters. Variables are already assigned in the proper order in self.variables, and we never return this object to the user.

crusaderky (Contributor, Author) replied:

On line 318 you iterate over the values of lazy_data to generate the evaluated_data list, and then on line 320 you recombine the two by iterating over the keys of lazy_data.
Typically, if you iterate over a dict twice in a row you get the same order, but there's nothing guaranteeing that. So unless you use an ordered container, you're effectively relying on an implementation detail of CPython.

@shoyer (Member) commented on Sep 30, 2016:

Iterating over different dicts is not guaranteed to happen in the same order, but it is guaranteed for the same dict without any modifications. See:

"If items(), keys(), values(), iteritems(), iterkeys(), and itervalues() are called with no intervening modifications to the dictionary, the lists will directly correspond."
https://docs.python.org/2/library/stdtypes.html#dict.items

"Keys and values are iterated over in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions. If keys, values and items views are iterated over with no intervening modifications to the dictionary, the order of items will directly correspond."
https://docs.python.org/3.6/library/stdtypes.html#dict-views
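
To make the guarantee concrete, a tiny stand-in example (hypothetical values, not the PR's code): one pass over values() to compute, a second pass over keys() to recombine, with no modification in between.

    # Stand-in for the load() pattern: keys() and values() of an
    # unmodified dict are guaranteed to iterate in corresponding order.
    lazy_data = {'a': [1, 2], 'b': [3, 4]}
    evaluated_data = [sum(v) for v in lazy_data.values()]
    evaluated = dict(zip(lazy_data.keys(), evaluated_data))
    assert evaluated == {'a': 3, 'b': 7}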

if isinstance(self._data, (np.ndarray, PandasIndexAdapter)):
    new_data = self._data
else:
    new_data = np.asarray(self._data)
shoyer (Member) commented:

Can we make a separate method for the logic above? e.g.,

def _data_cast(self):
    if isinstance(self._data, (np.ndarray, PandasIndexAdapter)):
        return self._data
    else:
        return np.asarray(self._data)

Then both Variable and IndexVariable can share the caching logic in _data_cached and only reimplement _data_cast.
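
A minimal sketch of that split, using the method names proposed above; the class bodies are simplified stand-ins, not the PR's actual implementation.

    import numpy as np
    import pandas as pd

    class Variable(object):
        def __init__(self, data):
            self._data = data

        def _data_cast(self):
            # per-class casting step
            if isinstance(self._data, np.ndarray):
                return self._data
            return np.asarray(self._data)

        def _data_cached(self):
            # shared caching logic: cast once, store, return
            self._data = self._data_cast()
            return self._data

    class IndexVariable(Variable):
        def _data_cast(self):
            # only the cast differs: index data goes through pandas
            return pd.Index(self._data)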

crusaderky (Contributor, Author) replied:

fixing it

@@ -992,6 +1020,38 @@ def test_deterministic_names(self):
         self.assertIn(tmp, dask_name)
         self.assertEqual(original_names, repeat_names)

+    def test_dataarray_compute(self):
shoyer (Member) commented:

These methods would be better placed in test_dask.py.

crusaderky (Contributor, Author) replied:

I don't think we should move test_dataarray_compute() to a different module, as there's a test_dataset_compute() that applies to all backends.
I'm moving the pickle and values tests.

computed = actual.compute()

# indexes are going to be loaded in memory anyway, even
# with compute(). This is by design.
shoyer (Member) commented:

I think I'm missing something here -- why does this happen? I don't see any data_vars specific logic above.

crusaderky (Contributor, Author) replied:

Sorry, that was a comment from when I was trying to have IndexVariable do the caching but not Variable. Then I realised that in some cases the data of a dask-backed array is actually an IndexVariable. Removing the comment.

- It is now possible to set ``concat_dim=None`` explicitly in
:py:func:`~xarray.open_mfdataset` to disable inferring a dimension along
which to concatenate.
By `Stephan Hoyer <https://github.com/shoyer>`_.
- Added :py:meth:`compute` method to :py:class:`DataArray`, :py:class:`Dataset`,
and :py:class:`Variable` as a non-destructive alternative to :py:meth:`load`.
shoyer (Member) commented:

I would call it "non-mutating" rather than "non-destructive".
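
For context, a rough sketch of the behavioural difference (illustration only, assuming a small dask-backed dataset):

    import numpy as np
    import xarray as xr

    ds = xr.Dataset({'a': ('x', np.arange(3))}).chunk()  # dask-backed

    computed = ds.compute()    # evaluates into a new, numpy-backed object
    assert computed is not ds  # ds itself still holds lazy dask arrays

    loaded = ds.load()         # evaluates in place (mutates ds)...
    assert loaded is ds        # ...and returns the same, now eager, object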

@shoyer (Member) commented on Sep 26, 2016:

Also: I don't think this works:

:py:meth:`compute`

I think it needs the Dataset/DataArray qualification to be a correct link, e.g.,

:py:meth:`~DataArray.compute`

The ~ hides the DataArray prefix in the generated HTML.

crusaderky (Contributor, Author) replied:

fixing it

@@ -25,6 +25,10 @@ Breaking changes
merges will now succeed in cases that previously raised
``xarray.MergeError``. Set ``compat='broadcast_equals'`` to restore the
previous default.
- Pickling an xarray object based on the dask backend, or reading its
:py:meth:`values` property, won't automatically convert the backend to numpy
@shoyer (Member) commented on Sep 26, 2016:

Instead of "convert the backend to numpy" let's say "convert the array from dask to numpy"

Somewhat confusingly, we use the name backends in xarray to refer to IO methods (e.g., netCDF libraries), not dask vs numpy.

crusaderky (Contributor, Author) replied:

fixing it

@crusaderky (Contributor, Author):

All comments answered; I think I'm done now.

@shoyer (Member) left a comment:

Looks good, I have a few minor comments.

Did you see this from last time?

Take a look at the tests in test_dask.py and see if you can remove the hacky _copy_at_variable_level function now that computing values no longer mutates.


@@ -404,11 +420,8 @@ def copy(self, deep=False):
Otherwise, a shallow copy is made, so each variable in the new dataset
shoyer (Member) commented:

Can you update this to say it makes a shallow copy of each of the component variables?

@@ -288,11 +299,26 @@ def load(self):
         because all xarray functions should either work on deferred data or
         load data automatically.
         """
-        self._data_cached()
+        new_data = self._data_cached()
@shoyer (Member) commented on Sep 30, 2016:

I would just assign unambiguously here, e.g.,

self._data = self._data_cast()

It does the same thing, just a little simpler (no need to account for dask separately).
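
A minimal sketch of how load() could then look, assuming the _data_cast helper proposed earlier (simplified, not the PR's actual code):

    def load(self):
        # assign unambiguously: the cast handles numpy and dask alike
        self._data = self._data_cast()
        return self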

@shoyer shoyer mentioned this pull request Sep 30, 2016
@crusaderky (Contributor, Author):

All done

expected_copy = _copy_at_variable_level(expected)
actual_copy = _copy_at_variable_level(actual)
expected.name = None
actual.name = None
shoyer (Member) commented:

Was there a failing test that required setting this? I would rather disable this on a case-by-case basis than across the board (not all objects passed to this method even have a name attribute).

crusaderky (Contributor, Author) replied:

Yes, in several cases the test was failing because on one side the object automatically got its name from the underlying dask.name, while on the other side it was blank.

shoyer (Member) commented:

Can you simply make an assertLazyAndEqual for those cases or set name='foo'? Should be pretty quick.

crusaderky (Contributor, Author) replied:

Fixed, and resolved the merge conflict.

@crusaderky crusaderky closed this Oct 1, 2016
@crusaderky crusaderky deleted the no_dask_resolve branch October 1, 2016 11:02
@crusaderky crusaderky restored the no_dask_resolve branch October 1, 2016 11:07
@crusaderky (Contributor, Author):

Sorry, still fighting against git #1024
