-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dask Collection Interface #2748
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't started reviewing code yet, but I had a few small comments on the documentation (which is great by the way).
docs/source/custom-collections.rst
Outdated
>>> from dask.threaded import get as threaded_get | ||
>>> class MyCollection(object): | ||
... # Use the threaded scheduler by default | ||
... __dask_default_get__ = staticmethod(threaded_get) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dask.threaded.get
docs/source/custom-collections.rst
Outdated
|
||
.. method:: __dask_postcompute__(self) | ||
|
||
Finalizer and (optional) extra arguments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unable to understand the intent of this function from this description
|
||
.. note:: It's also recommended to define ``__dask_tokenize__``, | ||
see :ref:`deterministic-hashing`. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be helpful to see all of these methods laid out in a mock compute
function, just to get a context of how they are likely to be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strong 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may eventually choose to rename this function from postpersist
to something like collection_from_graph
. I can imagine it being used in other contexts. I'm not suggesting any action now. Just rambling.
Note that with the new design for |
I'll be interested to hear from @shoyer on his take on the benefits and effort required to factor this into xarray. It does seem like a much more streamlined interface than what we have now. |
This would give us things like |
result = [(name,) + args + (i,) for i in range(numblocks[ind])] | ||
else: | ||
result = [keys(*(args + (i,))) for i in range(numblocks[ind])] | ||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the nested function here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want a deprecation cycle around _keys
. I would not be surprised if this has leaked out into semi-public API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, from the way it's documented it's not entirely clear: http://dask.pydata.org/en/latest/array-design.html#keys-of-the-dask-graph
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah have even mucked with _keys
some myself to optimize things. ( #2472 ) That said, the general rule of thumb is that _
means not public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the nested function here
Array keys are implemented recursively. Since __dask_keys__
takes no parameters I moved the recursive function to be internal. Could also be moved external to a helper function, but this seemed cleaner and keeps things located together.
We may want a deprecation cycle around
_keys
:
All the old methods still exist and work, and will issue a deprecation warning on use. _keys
calls __dask_keys__
.
dask/base.py
Outdated
"""Base class for dask collections""" | ||
|
||
def is_dask_collection(x): | ||
"""Returns if ``x`` is a dask collection""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return True if ...
dask/base.py
Outdated
def is_dask_collection(x): | ||
"""Returns if ``x`` is a dask collection""" | ||
dask_graph = getattr(x, '__dask_graph__', None) | ||
return False if dask_graph is None else (dask_graph() is not None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style nit for clarity
try:
return x.__dask_graph__() is not None
except AttributeError:
return False
dask/base.py
Outdated
class Base(DaskMethodsMixin): | ||
"""DEPRECATED. The recommended way to create a custom dask object now is to | ||
implement the dask collection interface (see the docs), and optionally | ||
subclass from ``DaskMethodsMixin`` if desired.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we point to docs here?
dask/compatibility.py
Outdated
@@ -265,3 +265,20 @@ class to receive bound method | |||
setattr(cls, name, types.MethodType(func, None, cls)) | |||
else: | |||
setattr(cls, name, func) | |||
|
|||
|
|||
# Borrowed from six |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to include a license?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends on your reading of substantial portions. 😜 Though probably best to be safe and include it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this code completely to something simpler, no need anymore.
dask/base.py
Outdated
|
||
if name in ('eq', 'gt', 'ge', 'lt', 'le', 'ne', 'getitem'): | ||
return | ||
Allows for applying the same optimizations and default scheduler.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name didn't make sense to me at first. I thought it meant something more like "compute this but then return a dask collection". I don't yet have an idea for a better name though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the name doesn't show up in the PR, but is in the file diff, adding the name to the comments to make it easier to follow. The name is compute_as_collection
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps compute_as_if_collection
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the docs. Generally looks great -- this seems like it should be a great fit for xarray.
docs/source/custom-collections.rst
Outdated
keys : list | ||
\*\*kwargs | ||
Extra keyword arguments forwarded from the call to ``compute`` or | ||
``persist``. Can be used or ignored as needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For xarray, I suppose we should simply forward to the optimize method from dask.array? e.g., da.Array.__dask_optimize__(dsk, **kwargs)
?
Is there a guarantee that keys
refers to only a single dask object of this type, or could they be taken from a collection of objects of this type or even objects of different dask types? (I think the last one, but that's not immediately obvious.)
Some pseudo-code for how this is called could by the scheduler could be helpful in thinking this all through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For xarray, I suppose we should simply forward to the optimize method from dask.array? e.g., da.Array.__dask_optimize__(dsk, **kwargs)
. Could also do something like:
class YourClass(...):
@property
def __dask_optimize__(self):
return da.Array.__dask_optimize__
Is there a guarantee that keys refers to only a single dask object of this type, or could they be taken from a collection of objects of this type or even objects of different dask types? (I think the last one, but that's not immediately obvious.)
The latter. I added a section documenting how these methods are used in the dask core methods (e.g. compute
) that hopefully clarifies this. Also clarified the method docs.
|
||
.. note:: It's also recommended to define ``__dask_tokenize__``, | ||
see :ref:`deterministic-hashing`. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strong 👍
Returns | ||
------- | ||
finalize : callable | ||
A function with the signature ``finalize(results, *extra_args)``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly is the structure of results
? I'm guessing it matches the (nested) structure of __dask_keys__
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I've clarified this in the docs, let me know if it's not sufficient.
docs/source/custom-collections.rst
Outdated
A function with the signature ``finalize(results, *extra_args)``. | ||
Called with the computed keys from ``__dask_keys__`` and any extra | ||
arguments as specified in ``extra_args``. Should perform any necessary | ||
finalization before returning from ``compute``. For example, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to say something like "Should return an equivalent in-memory collection."
docs/source/custom-collections.rst
Outdated
To create your own dask collection, you need to fullfill the following | ||
interface. Note that there is no required base class: | ||
|
||
.. method:: __dask_graph__(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to name things like _dask_graph_
instead? Double underscore invokes name mangling in some cases, and I'm not sure that's actually helpful here.
Also, although obviously Python is never going to name methods with the name "dask" in them, the double underscore prefix+postfix is technically reserved for language (I've gotten occasional push-back about this before, e.g., when I proposed an ad-hoc protocol for pandas on python-ideas).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing this is done to match what NumPy does with its array interface.
Double underscore before and after is different from just double underscore before (without the double underscore after). Naming mangling only applies to the latter case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, this was done to mirror how numpy and other libraries implement interfaces. I see no issue with using __foo__
methods here, but happy to change if there's considerable pushback.
Personally I feel that private methods feel wrong here (usually indicate private interfaces that shouldn't be used), while dunder methods look more intentionally thought out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am more than happy for us to mirror NumPy. :)
Dask implements its own deterministic hash function to generate keys based on | ||
the value of arguments. This function is available as ``dask.base.tokenize``. | ||
Many common types already have implementations of ``tokenize``, which can be | ||
found in ``dask/base.py``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense for tokenize
to be available from the dask
namespace directly if it is being advertised as a public API function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think tokenize
should be top-level, as most users won't use it at all. It is public api though.
|
||
Where possible, it's recommended to define the ``__dask_tokenize__`` method. | ||
This method takes no arguments and should return a value fully | ||
representative of the object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean everything that was handled by dask.base.tokenize
will now get a __dask_tokenize__
method instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. This is just a potentially cleaner way to make tokenize
work with classes you control. Either adding a method to the dispatch or implementing __dask_tokenize__
works. For classes you don't control (e.g. numpy.ndarray
) the dispatch will still be needed.
This is similar to implementing the pickle methods or registering in copyreg
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry by "everything" I meant all objects in Dask proper or is this a no for that case too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you look at the diff, classes internal to dask now use the method. The dispatch is still used elsewhere. I'd say this is a case-by-case decision, similar to using the pickle interface methods vs the copyreg
dispatch.
Related to this, it would be very useful to have an interface method to check whether two Dask objects point to the same thing. |
@jakirkham can you provide some motivation for that? Perhaps in a different issue? |
I am pleasantly surprised to learn that this doesn't seem to affect the dask/distributed test suite (other than deprecation warnings (which are themselves welcome)) |
I believe all comments have been addressed. |
I think I have worked around the initial problem that raised this issue. Though it is still nice periodically to be able to check what things share identity. Could give some more thought to other use cases if you would like. |
@jhamman it might be interesting to try to apply this to XArray. Are you still game for this? Regardless, it would be useful to have your perspective on the docs here. You are probably the target audience for them so it would be useful to hear any points that you find less than clear. |
@mrocklin - yes. I'm still game. I'd like to wait until after we release xarray 0.10. I'll give the docs another review, then let's loop back on this next week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pseudo-code and description of how all these methods work. I'm feeling pretty good about this.
docs/source/custom-collections.rst
Outdated
- If ``optimize_graph`` is ``True`` (default) then the collections are first | ||
grouped by their ``__dask_optimize__`` methods. All collections with the | ||
same ``__dask_optimize__`` method have their graphs merged and keys | ||
concatenated, and then a single call to ``__dask_optimize__`` is made with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are multiple types of collections, which __dask_optimize__
method is called once? An arbitrary one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also don't see this single call in the pseudo-code implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that we group our graphs the optimization function, merge each collection of graphs and call the optimize function on those graphs together. This happens separately for each optimization function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Matt is correct. I've updated this section to hopefully clarify this.
I think that we should add appropriate disclaimers to the docs that this might change without notice, merge, and then try implementing this in XArray, providing feedback to Dask. |
This adds a standard protocol for the dask collections implemented via a set of methods (and no required base class). The methods are similar to the existing non-public api, but not exactly the same. - Specified collection interface - Switched all internals to use new interface - Deprecated previous interface - Broke out the core methods (e.g. `.compute` into a mixin) - Deprecated `Base` class - Documented interface
45396dd
to
4df46fd
Compare
I had to fix some hairy merge conflicts, which was easiest to do by squashing this PR into one large commit beforehand. Apologies for any lost comments :/. At this point I think this is good to go in. I've added a warning to the docs that parts of this protocol might change without deprecations, as well as clarified a few points in the docs. |
Looks good to me, at least from a docs perspective! |
I'll take another look through this shortly, but last time I went through it I had only minor comments so I'm fairly optimistic. |
docs/source/custom-collections.rst
Outdated
The optimized dask graph. | ||
|
||
|
||
.. staticmethod:: __dask_default_get__(dsk, keys, \*\*kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to change the name of get
in the future to something more informative. Is there a different name that we can use here that would be more future-proof? __dask_scheduler__
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, done.
I'm going to merge this now, we can add touch-ups later as needed. Thanks everyone for the reviews. |
This adds a standard protocol for the dask collections implemented via a set of methods (and no required base class). The methods are similar to the existing non-public api, but not exactly the same. I recommend the new doc page (
docs/source/custom-collections.rst
) as a starting point.Supersedes #1068.