Parallel map/apply powered by dask.array #585
Comments
But do the xray objects have to exist in memory? I was thinking this could also work along with

Like the idea of applying this to groupby objects. I wonder if it could be done transparently to the user...
Indeed, there's no need to load the entire dataset into memory first. I think open_mfdataset is the model to emulate here -- it's parallelism that just works. I'm not quite sure how to do this transparently in groupby operations yet. The problem is that you do want to apply some groupby operations on dask arrays without loading the entire group into memory, if there are only a few groups on a large dataset and the function itself is written in terms of dask operations. I think we will probably need some syntax to disambiguate that scenario.
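To make that disambiguation concrete, here is a minimal sketch (variable names are made up; it assumes current xarray/dask APIs) of the case where a groupby-apply can stay lazy, because the applied function is itself written in dask-aware xarray operations:

```python
import numpy as np
import xarray as xr

# Chunked (dask-backed) data: two groups of three along "time".
arr = xr.DataArray(
    np.arange(6.0), dims="time", coords={"time": np.arange(6)}
).chunk({"time": 3})

# Grouping key built from an in-memory array, so the groupby machinery
# itself loads nothing.
year = xr.DataArray([0, 0, 0, 1, 1, 1], dims="time", name="year")

# The applied function uses only dask-aware xarray ops, so each group's
# result remains a lazy dask array until .compute() is called. A function
# needing plain NumPy input would instead force each group into memory --
# the scenario that needs explicit opt-in syntax.
anom = arr.groupby(year).map(lambda g: g - g.mean("time"))
result = anom.compute()
```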
👍 Very useful idea!
With the single machine version of dask, we need to run one block first to infer the appropriate metadata for constructing the combined dataset. Potentially a better approach would be to optionally leverage dask.distributed, which has the ability to run computation at the same time as graph construction.
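A rough sketch of that trial-run approach (the helper name is hypothetical; only dask.array's public `blocks` and `map_blocks` APIs are assumed): evaluate the function eagerly on a single block to learn the output dtype, then build the full lazy graph with the inferred metadata.

```python
import numpy as np
import dask.array as da

def infer_and_map(func, arr):
    """Sketch: run `func` on one block eagerly to infer output metadata,
    then apply it lazily across all blocks."""
    # Trial run: load only the first block and apply the function to it.
    sample = func(np.asarray(arr.blocks[(0,) * arr.ndim]))
    # Build the real, fully lazy graph using the inferred dtype.
    return arr.map_blocks(func, dtype=sample.dtype)

x = da.ones((4, 4), chunks=(2, 2))
y = infer_and_map(lambda b: (b * 2).astype("float32"), x)
```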
I'm adding this note to express an interest in the functionality described in Stephan's original description, i.e. a
Does #964 help on this?
I think #964 provides a viable path forward here. Previously, I was imagining the user provides a function that maps

In contrast, with a user defined function

In principle, we could do this automatically -- especially if dask had a way to parallelize arbitrary NumPy generalized universal functions. Then the user could write something like
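For illustration, here is a minimal sketch of that gufunc-style wrapping, assuming the `apply_ufunc` interface that grew out of #964 with its `dask="parallelized"` mode (the example function and data are made up):

```python
import numpy as np
import xarray as xr

# A plain NumPy function: reduces over its last axis and knows nothing
# about dask or xarray.
def np_mean(x):
    return x.mean(axis=-1)

arr = xr.DataArray(
    np.arange(8.0).reshape(2, 4),
    dims=("space", "time"),
).chunk({"space": 1})

# "time" is declared a core dimension, so it is moved to the last axis
# (and must lie within a single chunk); everything else is parallelized
# chunk by chunk without loading the whole array.
result = xr.apply_ufunc(
    np_mean,
    arr,
    input_core_dims=[["time"]],
    dask="parallelized",
    output_dtypes=[arr.dtype],
)
```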
This is good news for me as the functions I will apply take an ndarray as
I have a preliminary implementation up in #1517
@rabernat Agreed, let's open a new issue for that. |
Dask is awesome, but it isn't always easy to use for parallel operations. In many cases, especially when wrapping routines from external libraries, it is most straightforward to express operations in terms of a function that expects and returns xray objects loaded into memory.

Dask array has a map_blocks function/method, but its applicability is limited because dask.array doesn't have axis names for unambiguously identifying dimensions. da.atop can handle many of these cases, but it's not the easiest to use. Fortunately, we have sufficient metadata in xray that we could probably parallelize many atop operations automatically by inferring result dimensions and dtypes from applying the function once. See here for more discussion on the dask side: dask/dask#702

So I would like to add some convenience methods for automatic parallelization with dask of a function defined on xray objects loaded into memory. In addition to a map_blocks method/function, it would be useful to add some sort of parallel_apply method to groupby objects that works very similarly, by lazily applying a function that takes and returns xray objects loaded into memory.
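For reference, a map_blocks interface of this kind does exist in current xarray as `xr.map_blocks`; a minimal sketch (the example function and data are made up) showing that the user's function receives one ordinary in-memory object per chunk:

```python
import numpy as np
import xarray as xr

# User-level function: written against plain, in-memory xarray objects.
def block_anomaly(block):
    # Called once per chunk, so this mean is per-block, not global.
    return block - block.mean()

arr = xr.DataArray(np.arange(8.0), dims="x", name="v").chunk({"x": 4})

# Each chunk is handed to the function as an in-memory DataArray, and
# the per-block results are reassembled into one lazy result.
lazy = xr.map_blocks(block_anomaly, arr)
out = lazy.compute()
```

Note the per-block semantics: with two chunks of four, each chunk is de-meaned independently rather than against the global mean.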