Remove caching logic from xarray.Variable #1128
Conversation
This is a follow-up to generalize the changes from pydata#1024:

- Caching and copy-on-write behavior has been moved to separate array classes that are explicitly used in `open_dataset` to wrap arrays loaded from disk (if `cache=True`).
- Dask-specific logic has been removed from the caching/loading logic on `xarray.Variable`.
- Pickle no longer caches automatically under any circumstances.

Still needs tests for the `cache` argument to `open_dataset`, but everything else seems to be working.
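For readers skimming the diff, the caching wrapper described above behaves roughly like this minimal sketch (the class name and details are illustrative, not the exact code in this PR):

```python
import numpy as np


class CachedArraySketch:
    """Sketch: wrap a lazy on-disk array and cache it in memory on first access."""

    def __init__(self, array):
        self.array = array

    def _ensure_cached(self):
        # replace the lazy array with an in-memory ndarray the first time it is needed
        if not isinstance(self.array, np.ndarray):
            self.array = np.asarray(self.array)

    def __array__(self, dtype=None):
        self._ensure_cached()
        return np.asarray(self.array, dtype=dtype)

    def __getitem__(self, key):
        self._ensure_cached()
        return self.array[key]
```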
Other than the case mentioned below, this seems to preserve lazy data in the few cases I tried.

In the long run I think it would be more robust to check for attributes (duck-type style) rather than types in the various places. One option could be to equip all array wrappers with a method like to_ndarray (which would be a no-op for lazy wrappers) that would be called instead of explicitly casting to np.ndarray. That would also open the door to using lazy arrays other than dask. Does that sound feasible, or am I missing something here?

Regardless, that shouldn't prevent this PR from going in.
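A rough sketch of that duck-typing idea (to_ndarray is just the proposed name from the comment above, not an existing xarray method, and as_ndarray is a hypothetical helper):

```python
import numpy as np


class EagerWrapper:
    def __init__(self, array):
        self.array = array

    def to_ndarray(self):
        # eager wrappers materialize their data
        return np.asarray(self.array)


class LazyWrapper:
    def __init__(self, array):
        self.array = array

    def to_ndarray(self):
        # lazy wrappers keep deferring the load (a no-op)
        return self


def as_ndarray(obj):
    # call the duck-typed hook if present, instead of casting with np.asarray
    if hasattr(obj, "to_ndarray"):
        return obj.to_ndarray()
    return np.asarray(obj)
```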
for name, variable in dataset.variables.items():
    if name not in variable.dims:
        # no need to protect IndexVariable objects
        data = indexing.CopyOnWriteArray(variable._data)
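For context, a copy-on-write wrapper of this kind behaves roughly like the following sketch (illustrative only, not the actual indexing.CopyOnWriteArray implementation):

```python
import numpy as np


class CopyOnWriteArraySketch:
    """Wrap an array and copy it the first time it is written to."""

    def __init__(self, array):
        self.array = array
        self._copied = False

    def _ensure_copied(self):
        if not self._copied:
            # forces a copy (and hence a load, if the wrapped array is lazy)
            self.array = np.array(self.array)
            self._copied = True

    def __getitem__(self, key):
        return self.array[key]

    def __setitem__(self, key, value):
        self._ensure_copied()
        self.array[key] = value
```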
I ran a test where I create a Dataset from a custom data store which initializes Variables using dask arrays for data. In this case the dask array is still converted to an ndarray when accessing the Variable's data property: the property checks for a dask array type, but here the array is wrapped in a CopyOnWriteArray, which means Variable.values is called, which loads eagerly.
Indeed, you would need to use cache=False in such a case.

Xarray's decoding logic in conventions.py uses its own array objects instead of dask arrays (for reasons I could get into), which unfortunately makes using dask.array objects to produce variables in a custom data store non-ideal. The problem is that the graphs from such dask arrays don't get linked up into xarray, which means that even if you rechunk the arrays in the xarray Dataset, they still get executed separately by dask. Duck typing for dask objects would probably help here (dask/dask#1068).
if not isinstance(v.data, dask_array_type):
    v.load()

"""Get this object's state for pickling"""
# we need a custom method to avoid
Incomplete
Indeed, in particular I'm not very happy with the NoPickleMixin … It exists so that … As for type checking for dask arrays in …
Hi Stephan, I think that NoPickleMixin is conceptually a bad idea -- see my comments on the code.
# self.__dict__ is the default pickle object, we don't need to
# implement our own __setstate__ method to make pickle work
return self.__dict__
I can't understand the purpose of this; the comment where you explain it is truncated.
# throw away any references to datastores in the pickle
state['_file_obj'] = None
return state
return self.__dict__
Why don't you remove this method entirely?
'cannot pickle objects of type %r: call .compute() or .load() '
'to load data into memory first.' % type(self))
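For context, the guard that raises this message is presumably along these lines (a sketch; the mixin name is taken from the discussion above, and the real code may differ):

```python
class NoPickleMixin:
    """Refuse to pickle objects that wrap open files or lazy on-disk data."""

    def __getstate__(self):
        # raising here makes pickle.dumps() fail loudly instead of
        # producing an object that cannot be restored
        raise TypeError(
            'cannot pickle objects of type %r: call .compute() or .load() '
            'to load data into memory first.' % type(self))
```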
I'm really not a fan of this design.
A better approach to pickling file-based objects is to save the file path on getstate and open a new file descriptor on setstate. Relative vs. absolute paths should be preserved.
Of course there's no guarantee that the file will be there when you unpickle, or that it will be identical to the one that you pickled - but this is a caveat that's easily explained in the documentation.
The benefits of this are huge when you work interactively in Jupyter with plugins that pickle all session variables at once.
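A rough sketch of that idea, assuming a store object that only needs a path and mode to reopen (names are illustrative, not xarray's actual data store classes):

```python
class PicklableFileStore:
    def __init__(self, filename, mode="rb"):
        self.filename = filename
        self.mode = mode
        self._file = open(filename, mode)

    def __getstate__(self):
        # save only what is needed to reopen the file, not the descriptor itself
        state = self.__dict__.copy()
        del state["_file"]
        return state

    def __setstate__(self, state):
        # reopen the file on unpickle; it must still exist at the same path
        self.__dict__.update(state)
        self._file = open(self.filename, self.mode)
```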
This actually hurts my xarray-based financial simulation project at my company. I start with a DataArray loaded from disk through dask, and then heavily manipulate it. I end up with ~3000 dask-based xarrays, each with up to 80,000 key:value pairs. At that point, since just generating the final objects took 5-10 minutes (no actual calculations yet), I pickle them, save them to disk and send them over the network.
So far I've had to load bespoke data from disk (numpy.fromfile manually wrapped with dask). In the future, however, I'd love to move to NetCDF for my on-disk storage... in other words I would have a huge dask tree with, at the very bottom of it, a dataset created by open_dataset(). And I would need to pickle the whole thing, with the obvious caveat that all the NetCDF files have to be in the same place, or pickle.load will fail with FileNotFoundError.
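That kind of manual wrapping might look something like this (a sketch with a made-up file name, shape and dtype):

```python
import numpy as np
import dask
import dask.array as da

# assumed: a flat binary file containing 1000 float64 values
shape, dtype = (1000,), np.float64

# defer the read, then expose it as a dask array with known shape/dtype
lazy_read = dask.delayed(np.fromfile)("data.bin", dtype=dtype)
arr = da.from_delayed(lazy_read, shape=shape, dtype=dtype)
```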
I agree -- adding pickle support to xarray's data stores would be better (it would also enable dask distributed). I added this only because it's better to give a sensible error message than for it to be silently broken, which is the existing state of affairs. For example, until Unidata/netcdf4-python#604, netCDF4 objects wouldn't even raise an error when pickled, even though they cannot be restored.

I will take a look at enabling pickle, but I think it may be tricky to get right (lots of related discussion in #798).

Can you clarify how this hurts your simulation project compared to the present state of affairs? I thought it would be possible to restore the current behavior by simply calling .compute() or .load() prior to pickle.dump.
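In other words, something like this sketch (the file name is illustrative):

```python
import pickle
import xarray as xr

ds = xr.open_dataset("example.nc")
ds.load()                    # pull everything into memory first
payload = pickle.dumps(ds)   # now safe to pickle under this change
```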
Sorry, I wasn't too clear -- this change isn't a regression as far as my project is concerned; it just keeps the netcdf backend unusable. I agree that proper pickling support for file-based stores can be added later on.
I removed the custom pickle override on …
This isn't yet working with dask multiprocessing for reading a netCDF4 file with in-memory compression. I'm not quite sure why:
with close_on_error(ds):
    self.ds = _nc4_group(ds, group, mode)
opener = functools.partial(_open_h5netcdf_group, filename, mode=mode,
                           group=group)
Might be worth noting that this is only cloud-picklable, not stdlib pickleable.
Can you clarify why this won't work with stdlib pickle? Is the issue doing the h5netcdf import inside the function definition? My understanding was that functools.partial is picklable.
Ah, you're right. I'm surprised:
In [1]: from operator import add
In [2]: from functools import partial
In [3]: from pickle import dumps, loads
In [4]: loads(dumps(partial(add, 1)))
Out[4]: functools.partial(<built-in function add>, 1)
In [5]: loads(dumps(partial(add, 1)))(2)
Out[5]: 3
engine=engine)
assert isinstance(restored.var1.data, da.Array)
computed = restored.compute()
assert_dataset_allclose(original, computed)
This test looks good to me.

Does your failing case work with the following spawning pool in Python 3?

In [1]: import multiprocessing

In [2]: ctx = multiprocessing.get_context('spawn')

In [3]: ctx.Pool(4)
Out[3]: <multiprocessing.pool.Pool at 0x7fec70afca20>
Why, yes it does -- and it shows a nice speedup as well! What was I missing here?
Spawn is only available in Python 3, so it's not a full solution. Something isn't fork-safe, possibly something within the HDF5 library? You might also want to try …
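For anyone trying this later: a hedged sketch of pointing dask's multiprocessing scheduler at a spawn-based pool. The configuration interface shown here (dask.config.set) postdates this thread and the exact knobs have changed across dask versions, so treat this purely as illustrative:

```python
import multiprocessing

import dask

# use a spawn-based pool (Python 3 only) instead of the default fork-based one
ctx = multiprocessing.get_context("spawn")

# assumed: `dataset` is a dask-backed xarray.Dataset opened elsewhere
with dask.config.set(scheduler="processes", pool=ctx.Pool(4)):
    result = dataset.mean().compute()
```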
roundtripped.close()
unpickled_ds = pickle.loads(raw_pickle)
self.assertDatasetIdentical(expected, unpickled_ds)
unpickled_ds.close()
This test is only failing on Windows in NetCDF4ViaDaskDataTest now.

I'm not quite sure how we are missing cleanup there. In that case, the file/DataStore object ends up in the dask graph of the arrays in roundtripped. We do close the file with .close(), but maybe those references to closed files hanging around are causing trouble?
OK, I'm ready to give up on the remaining test failures and merge this anyway (marking them as expected failures). They are specific to our test suite and to Windows only, due to the inability to delete files that are not closed. If these manifest as issues for real users, I am happy to revisit, especially if someone who uses Windows can help debug. The 5-minute feedback cycle of pushing a commit and then seeing what AppVeyor says is too painful.
@mrocklin OK, so one option is to just ignore the permission errors and not remove the files on Windows. But is it really better to make the test suite leak temp files?
I agree that it's not great. This was more a show of solidarity that we've also run into this same issue and had to resort to similar hacks.
I decided that between the choices of not running these tests on Windows and leaking a few temp files, I would rather leak some temp files. So that's exactly what I've done in the latest commit, for explicitly whitelisted tests.
@kynan @crusaderky Do you have concerns about merging this in the current state?
All looks good, go on
No objections, go ahead!
OK, in it goes!
I'm trying out the latest code to subset a set of netcdf4 files with dask.multiprocessing using …
@mangecoeur You still need to use … The former should be updated internally, and the latter should be a documentation note.
@shoyer thanks, with a little testing it seems …
@crusaderky @kynan please review