Disabled auto-caching on dask; new .compute() method #1018
Disabled auto-caching dask arrays when pickling and when invoking the .values property. Added new method .compute() to Variable, DataArray and Dataset.
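For illustration, a minimal sketch of the resulting behaviour (illustrative names; assumes a dask-backed DataArray):

import dask.array as da
import xarray as xr

arr = xr.DataArray(da.ones(3, chunks=3), dims='x')

loaded = arr.compute()    # new object holding numpy data
print(type(arr.data))     # still a dask array: no auto-caching
print(type(loaded.data))  # numpy.ndarray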
I have some minor structural suggestions, but generally this looks very nice. Take a look at the tests in test_dask.py and see if you can remove the hacky _copy_at_variable_level function now that computing values no longer mutates.
""" | ||
# Can't just do the below, because new.variables[k] is the | ||
# same object as self.variables[k]! | ||
# new = self.copy(deep=False) |
Let's fix Dataset.copy() at the source instead so this works, e.g.:
def copy(self, deep=False):
    """Returns a copy of this dataset.

    If `deep=True`, a deep copy is made of each of the component arrays.
    Otherwise, a shallow copy is made, so each array in the new dataset
    is also an array in the original dataset.
    """
    variables = OrderedDict((k, v.copy(deep=deep))
                            for k, v in iteritems(self._variables))
    # skip __init__ to avoid costly validation
    return self._construct_direct(variables, self._coord_names.copy(),
                                  self._dims.copy(), self._attrs_copy())
I no longer have any idea why I thought it made sense to not copy variables in Dataset.copy(). Copying variables is cheap, and exposing details about the Variable object in the public API makes the Dataset abstraction leaky.
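For illustration, a minimal sketch of the intended shallow-copy semantics after such a fix (illustrative data; not the PR's test code):

import numpy as np
import xarray as xr

ds = xr.Dataset({'a': ('x', np.arange(3))})
shallow = ds.copy(deep=False)

# the Variable wrappers are now distinct objects...
assert shallow.variables['a'] is not ds.variables['a']
# ...while the underlying numpy data is still shared
assert np.shares_memory(shallow['a'].values, ds['a'].values)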
That was actually my first try, but all hell broke loose in the regression tests when I did. I'll try again later...
It's done. However, please see the caveat I wrote in the test script. Further explanation of what I mean:
import numpy as np
from xarray import Variable, IndexVariable
from xarray.test import source_ndarray

def test_copy(data, cls):
    data = np.array(data)
    a = cls('x', data)
    # does the new variable wrap the original array without copying?
    print(source_ndarray(data) is source_ndarray(a.data))
    b = a.copy(deep=False)
    # does a shallow copy still share the underlying array?
    print(source_ndarray(a.data) is source_ndarray(b.data))

test_copy([1, 2, 3], Variable)
test_copy(['a', 'b', 'c'], Variable)
test_copy([1, 2, 3], IndexVariable)
test_copy(['a', 'b', 'c'], IndexVariable)
Output:
True
True
True
True
True
True
False
False

(Note the last case: a string-backed IndexVariable doesn't share its source array, not even at construction time.)
@@ -306,7 +309,7 @@ def load(self):
         """
         # access .data to coerce everything to numpy or dask arrays
         all_data = dict((k, v.data) for k, v in self.variables.items())
-        lazy_data = dict((k, v) for k, v in all_data.items()
+        lazy_data = OrderedDict((k, v) for k, v in all_data.items()
I don't think this matters. Variables are already assigned in the proper order in self.variables, and we never return this object to the user.
On line 318, you're iterating over the values of lazy_data to generate the evaluated_data list, and then you're recombining the two by iterating over the keys of lazy_data on line 320.
Typically, if you iterate over a dict twice in a row you get the same order, but nothing guarantees that. So unless you use an ordered container, you're effectively relying on an implementation detail of CPython.
Iterating over different dicts is not guaranteed to happen in the same order, but it is guaranteed for the same dict without any modifications. See:
"If items(), keys(), values(), iteritems(), iterkeys(), and itervalues() are called with no intervening modifications to the dictionary, the lists will directly correspond."
https://docs.python.org/2/library/stdtypes.html#dict.items
"Keys and values are iterated over in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions. If keys, values and items views are iterated over with no intervening modifications to the dictionary, the order of items will directly correspond."
https://docs.python.org/3.6/library/stdtypes.html#dict-views
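For illustration, a condensed sketch of the two-pass pattern under discussion (hypothetical variables; not the verbatim xarray source):

from collections import OrderedDict
import dask
import dask.array as da

variables = OrderedDict(a=da.ones(3, chunks=3), b=da.zeros(3, chunks=3))

# first pass over the values computes everything in one call...
evaluated = dask.compute(*variables.values())
# ...a second pass over the keys recombines; the pairing only holds
# because the dict is not modified between the two iterations
loaded = OrderedDict(zip(variables.keys(), evaluated))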
if isinstance(self._data, (np.ndarray, PandasIndexAdapter)):
    new_data = self._data
else:
    new_data = np.asarray(self._data)
Can we make a separate method for the logic above? e.g.:

def _data_cast(self):
    if isinstance(self._data, (np.ndarray, PandasIndexAdapter)):
        return self._data
    else:
        return np.asarray(self._data)
Then both Variable and IndexVariable can share the caching logic in _data_cached and only reimplement _data_cast.
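A hedged sketch of the proposed split (class internals heavily simplified; the PandasIndexAdapter import path is an assumption about xarray's internals):

import numpy as np
from xarray.core.indexing import PandasIndexAdapter  # assumed internal path

class Variable(object):
    def __init__(self, data):
        self._data = data

    def _data_cast(self):
        if isinstance(self._data, (np.ndarray, PandasIndexAdapter)):
            return self._data
        return np.asarray(self._data)

    def _data_cached(self):
        # shared caching logic: cast once, store, and return the result
        self._data = self._data_cast()
        return self._data

class IndexVariable(Variable):
    def _data_cast(self):
        # only the cast differs: index data always stays wrapped in memory
        return PandasIndexAdapter(self._data)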
fixing it
@@ -992,6 +1020,38 @@ def test_deterministic_names(self):
        self.assertIn(tmp, dask_name)
        self.assertEqual(original_names, repeat_names)

+    def test_dataarray_compute(self):
These methods would fit better in test_dask.py
I don't think we should move test_dataarray_compute() to a different module, as there's a test_dataset_compute() that applies to all backends. I'm moving the pickle and values tests.
computed = actual.compute()

# indexes are going to be loaded in memory anyway, even
# with compute(). This is by design.
I think I'm missing something here -- why does this happen? I don't see any data_vars-specific logic above.
Sorry, that was a comment from when I was trying to have IndexVariable do the caching but not Variable. But then I realised that in some cases the data of a dask-backed array is actually an IndexVariable. Removing the comment.
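For illustration, a minimal sketch of the behaviour being tested (illustrative names; assumes a dask-backed data variable):

import dask.array as da
import numpy as np
import xarray as xr

ds = xr.Dataset({'v': ('x', da.ones(3, chunks=3))},
                coords={'x': [10, 20, 30]})
computed = ds.compute()

assert isinstance(computed['v'].data, np.ndarray)  # loaded on the copy
assert isinstance(ds['v'].data, da.Array)          # original stays lazy
# the index coordinate 'x' is in memory on both objects -- by design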
- It is now possible to set ``concat_dim=None`` explicitly in
  :py:func:`~xarray.open_mfdataset` to disable inferring a dimension along
  which to concatenate.
  By `Stephan Hoyer <https://github.com/shoyer>`_.
- Added :py:meth:`compute` method to :py:class:`DataArray`, :py:class:`Dataset`,
  and :py:class:`Variable` as a non-destructive alternative to :py:meth:`load`.
I would call it "non-mutating" rather than "non-destructive".
Also: I don't think this works: :py:meth:`compute`. It needs the Dataset/DataArray qualification to be a correct link, e.g., :py:meth:`~DataArray.compute`. The ~ hides the DataArray prefix in the generated HTML.
fixing it
@@ -25,6 +25,10 @@ Breaking changes
   merges will now succeed in cases that previously raised
   ``xarray.MergeError``. Set ``compat='broadcast_equals'`` to restore the
   previous default.
+- Pickling an xarray object based on the dask backend, or reading its
+  :py:meth:`values` property, won't automatically convert the backend to numpy
Instead of "convert the backend to numpy", let's say "convert the array from dask to numpy". Somewhat confusingly, we use the name "backends" in xarray to refer to IO methods (e.g., netCDF libraries), not dask vs. numpy.
fixing it
All comments answered; I think I'm done now.
Looks good, I have a few minor comments. Did you see this from last time? "Take a look at the tests in test_dask.py and see if you can remove the hacky _copy_at_variable_level function now that computing values no longer mutates."
@@ -404,11 +420,8 @@ def copy(self, deep=False):
        Otherwise, a shallow copy is made, so each variable in the new dataset
Can you update this to say it makes a shallow copy of each of the component variables?
@@ -288,11 +299,26 @@ def load(self):
         because all xarray functions should either work on deferred data or
         load data automatically.
         """
-        self._data_cached()
+        new_data = self._data_cached()
I would just assign unambiguously here, e.g.:

self._data = self._data_cast()

It does the same thing, just a little simpler (no need to account for dask separately).
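A hedged sketch of how load() and the new compute() can then relate (method bodies condensed; not the verbatim source):

# methods on Variable (sketch)
def load(self):
    """Load data into memory, in place, and return this variable."""
    self._data = self._data_cast()
    return self

def compute(self):
    """Return a new variable with the data in memory; self is unchanged."""
    return self.copy(deep=False).load()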
All done.
expected_copy = _copy_at_variable_level(expected)
actual_copy = _copy_at_variable_level(actual)
expected.name = None
actual.name = None
Was there a failing test that required setting this? I would rather disable this on a case-by-case basis than across the board (not all objects passed to this method even have a name attribute).
Yes, in several cases the test was failing because on one side the object automatically got its name from the underlying dask.name, while on the other side it was blank.
Can you simply make an assertLazyAndEqual for those cases, or set name='foo'? Should be pretty quick.
Fixed and resolved conflict.

# Conflicts:
#   xarray/test/test_dask.py

Sorry, still fighting against git #1024