Disabled auto-caching on dask; new .compute() method #1018

Closed
crusaderky wants to merge 6 commits into master from no_dask_resolve

Conversation

crusaderky (Contributor)

Disabled auto-caching of dask arrays when pickling and when invoking the .values property.
Added new method .compute() to Variable, DataArray and Dataset.

@shoyer (Member) left a comment:

I have some minor structural suggestions, but generally this looks very nice.

Take a look at the tests in test_dask.py and see if you can remove the hacky _copy_at_variable_level function now that computing values no longer mutates.

"""
# Can't just do the below, because new.variables[k] is the
# same object as self.variables[k]!
# new = self.copy(deep=False)
@shoyer (Member) commented on Sep 26, 2016:

Let's fix Dataset.copy() at the source instead so this works, e.g.,

    def copy(self, deep=False):
        """Returns a copy of this dataset.

        If `deep=True`, a deep copy is made of each of the component arrays.
        Otherwise, a shallow copy is made, so each array in the new dataset
        is also an array in the original dataset.
        """
        variables = OrderedDict((k, v.copy(deep=deep))
                                for k, v in iteritems(self._variables))
        # skip __init__ to avoid costly validation
        return self._construct_direct(variables, self._coord_names.copy(),
                                      self._dims.copy(), self._attrs_copy())

I no longer have any idea why I thought it made sense to not copy variables in Dataset.copy(). Copying variables is cheap, and exposing details about the Variable object in the public API makes the Dataset abstraction leaky.
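
To illustrate with a minimal sketch (standard xarray shallow-copy semantics, not code from this PR): a shallow copy duplicates the Variable wrapper but shares the underlying data, which is why copying variables is cheap.

    import numpy as np
    from xarray import Variable

    v = Variable('x', np.arange(3))
    w = v.copy(deep=False)
    assert w is not v        # a new Variable wrapper...
    assert w.data is v.data  # ...sharing the same underlying array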

crusaderky (Contributor, Author) replied:

That was actually my first try, but all hell broke loose in the regression tests when I did. I'll try again later...

crusaderky (Contributor, Author) replied:

It's done. However, please see the caveat I wrote in the test script.

Further explanation of what I mean:

import numpy as np
from xarray import Variable, IndexVariable
from xarray.test import source_ndarray

def test_copy(data, cls):
    data = np.array(data)
    a = cls('x', data)
    print(source_ndarray(data) is source_ndarray(a.data))
    b = a.copy(deep=False)
    print(source_ndarray(a.data) is source_ndarray(b.data))

test_copy([1, 2, 3], Variable)
test_copy(['a', 'b', 'c'], Variable)
test_copy([1, 2, 3], IndexVariable)
test_copy(['a', 'b', 'c'], IndexVariable)

Output:
True
True
True
True
True
True
False
False

(The two False lines come from the last call: an IndexVariable constructed from string data no longer shares the source ndarray, not even via a shallow copy; this is the caveat referred to above.)

@@ -306,7 +309,7 @@ def load(self):
         """
         # access .data to coerce everything to numpy or dask arrays
         all_data = dict((k, v.data) for k, v in self.variables.items())
-        lazy_data = dict((k, v) for k, v in all_data.items()
+        lazy_data = OrderedDict((k, v) for k, v in all_data.items()
shoyer (Member) commented:

I don't think this matters. Variables are already assigned in the proper order in self.variables, and we never return this object to the user.

crusaderky (Contributor, Author) replied:

On line 318 you iterate over the values of lazy_data to generate the evaluated_data list, and then on line 320 you recombine the two by iterating over the keys of lazy_data.
Typically, if you iterate over a dict twice in a row you get the same order, but there's nothing guaranteeing that. So unless you use an ordered container, you're effectively relying on an implementation detail of CPython.

@shoyer (Member) commented on Sep 30, 2016:

Iterating over different dicts is not guaranteed to happen in the same order, but it is guaranteed for the same dict without any modifications. See:

"If items(), keys(), values(), iteritems(), iterkeys(), and itervalues() are called with no intervening modifications to the dictionary, the lists will directly correspond."
https://docs.python.org/2/library/stdtypes.html#dict.items

"Keys and values are iterated over in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions. If keys, values and items views are iterated over with no intervening modifications to the dictionary, the order of items will directly correspond."
https://docs.python.org/3.6/library/stdtypes.html#dict-views
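
To make the guarantee concrete, a tiny stand-in example (hypothetical values, not the PR's code): one pass over values() to compute, a second pass over keys() to recombine, with no modification in between.

    # Stand-in for the load() pattern: keys() and values() of an
    # unmodified dict are guaranteed to iterate in corresponding order.
    lazy_data = {'a': [1, 2], 'b': [3, 4]}
    evaluated_data = [sum(v) for v in lazy_data.values()]
    evaluated = dict(zip(lazy_data.keys(), evaluated_data))
    assert evaluated == {'a': 3, 'b': 7}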

if isinstance(self._data, (np.ndarray, PandasIndexAdapter)):
    new_data = self._data
else:
    new_data = np.asarray(self._data)
shoyer (Member) commented:

Can we make a separate method for the logic above? e.g.,

def _data_cast(self):
    if isinstance(self._data, (np.ndarray, PandasIndexAdapter)):
        return self._data
    else:
        return np.asarray(self._data)

Then both Variable and IndexVariable can share the caching logic in _data_cached and only reimplement _data_cast.
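
A minimal sketch of that split, using the method names proposed above; the class bodies are simplified stand-ins, not the PR's actual implementation.

    import numpy as np
    import pandas as pd

    class Variable(object):
        def __init__(self, data):
            self._data = data

        def _data_cast(self):
            # per-class casting step
            if isinstance(self._data, np.ndarray):
                return self._data
            return np.asarray(self._data)

        def _data_cached(self):
            # shared caching logic: cast once, store, return
            self._data = self._data_cast()
            return self._data

    class IndexVariable(Variable):
        def _data_cast(self):
            # only the cast differs: index data goes through pandas
            return pd.Index(self._data)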

crusaderky (Contributor, Author) replied:

fixing it

@@ -992,6 +1020,38 @@ def test_deterministic_names(self):
         self.assertIn(tmp, dask_name)
         self.assertEqual(original_names, repeat_names)

+    def test_dataarray_compute(self):
shoyer (Member) commented:

These methods would be better placed in test_dask.py.

crusaderky (Contributor, Author) replied:

I don't think we should move test_dataarray_compute() to a different module, as there's a test_dataset_compute() that applies to all backends.
I'm moving the pickle and values tests.

computed = actual.compute()

# indexes are going to be loaded in memory anyway, even
# with compute(). This is by design.
shoyer (Member) commented:

I think I'm missing something here -- why does this happen? I don't see any data_vars specific logic above.

crusaderky (Contributor, Author) replied:

Sorry, that was a comment from when I was trying to have IndexVariable do the caching but not Variable. Then I realised that in some cases the data of a dask-backed array is actually an IndexVariable. Removing the comment.

- It is now possible to set ``concat_dim=None`` explicitly in
:py:func:`~xarray.open_mfdataset` to disable inferring a dimension along
which to concatenate.
By `Stephan Hoyer <https://github.com/shoyer>`_.
- Added :py:meth:`compute` method to :py:class:`DataArray`, :py:class:`Dataset`,
and :py:class:`Variable` as a non-destructive alternative to :py:meth:`load`.
shoyer (Member) commented:

I would call it "non-mutating" rather than "non-destructive".
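
For context, a rough sketch of the behavioural difference (illustration only, assuming a small dask-backed dataset):

    import numpy as np
    import xarray as xr

    ds = xr.Dataset({'a': ('x', np.arange(3))}).chunk()  # dask-backed

    computed = ds.compute()    # evaluates into a new, numpy-backed object
    assert computed is not ds  # ds itself still holds lazy dask arrays

    loaded = ds.load()         # evaluates in place (mutates ds)...
    assert loaded is ds        # ...and returns the same, now eager, object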

@shoyer (Member) commented on Sep 26, 2016:

Also: I don't think this works:

:py:meth:`compute`

I think it needs the Dataset/DataArray qualification to be a correct link, e.g.,

:py:meth:`~DataArray.compute`

The ~ hides the DataArray prefix in the generated HTML.

crusaderky (Contributor, Author) replied:

fixing it

@@ -25,6 +25,10 @@ Breaking changes
merges will now succeed in cases that previously raised
``xarray.MergeError``. Set ``compat='broadcast_equals'`` to restore the
previous default.
- Pickling an xarray object based on the dask backend, or reading its
:py:meth:`values` property, won't automatically convert the backend to numpy
@shoyer (Member) commented on Sep 26, 2016:

Instead of "convert the backend to numpy" let's say "convert the array from dask to numpy"

Somewhat confusingly, we use the name backends in xarray to refer to IO methods (e.g., netCDF libraries), not dask vs numpy.

crusaderky (Contributor, Author) replied:

fixing it

@crusaderky (Contributor, Author):

All comments answered; I think I'm done now.

@shoyer (Member) left a comment:

Looks good, I have a few minor comments.

Did you see this from last time?

Take a look at the tests in test_dask.py and see if you can remove the hacky _copy_at_variable_level function now that computing values no longer mutates.


@@ -404,11 +420,8 @@ def copy(self, deep=False):
Otherwise, a shallow copy is made, so each variable in the new dataset
shoyer (Member) commented:

Can you update this to say it makes a shallow copy of each of the component variables?

@@ -288,11 +299,26 @@ def load(self):
         because all xarray functions should either work on deferred data or
         load data automatically.
         """
-        self._data_cached()
+        new_data = self._data_cached()
@shoyer (Member) commented on Sep 30, 2016:

I would just assign unambiguously here, e.g.,

self._data = self._data_cast()

It does the same thing, just a little simpler (no need to account for dask separately).
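
A minimal sketch of how load() could then look, assuming the _data_cast helper proposed earlier (simplified, not the PR's actual code):

    def load(self):
        # assign unambiguously: the cast handles numpy and dask alike
        self._data = self._data_cast()
        return self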

@shoyer shoyer mentioned this pull request Sep 30, 2016
@crusaderky (Contributor, Author):

All done

expected_copy = _copy_at_variable_level(expected)
actual_copy = _copy_at_variable_level(actual)
expected.name = None
actual.name = None
shoyer (Member) commented:

Was there a failing test that required setting this? I would rather disable this on a case-by-case basis than across the board (not all objects passed to this method even have a name attribute).

crusaderky (Contributor, Author) replied:

Yes, in several cases the test was failing because on one side the object automatically got its name from the underlying dask.name, while on the other side it was blank.

shoyer (Member) commented:

Can you simply make an assertLazyAndEqual for those cases or set name='foo'? Should be pretty quick.

crusaderky (Contributor, Author) replied:

Fixed, and resolved the merge conflict.

@crusaderky crusaderky closed this Oct 1, 2016
@crusaderky crusaderky deleted the no_dask_resolve branch October 1, 2016 11:02
@crusaderky crusaderky restored the no_dask_resolve branch October 1, 2016 11:07
@crusaderky (Contributor, Author):

Sorry, still fighting against git #1024
