Support out-of-core computation using dask #328
Comments
|
@shoyer can you clarify this one? Would the
In [1]: import numpy as np
In [2]: a = np.arange(4).reshape(2, 2)
In [3]: a
Out[3]:
array([[0, 1],
       [2, 3]])
In [4]: x = np.array([[True, False], [True, True]])
In [5]: np.choose(x, [-10, a])
Out[5]:
array([[  0, -10],
       [  2,   3]]) |
Turns out what I was thinking of here can be written as a one-liner in terms of
So I've crossed that one off the list.
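As a rough illustration of the interleaved concatenation idea (concatenate the per-group results, then take to restore the original order), here is a minimal NumPy sketch. The data, the labels, and the per-group transformation are made up for the example, and it is not necessarily the one-liner the comment had in mind.

```python
import numpy as np

# Hypothetical example data: one group label per element.
values = np.array([10, 20, 30, 40, 50, 80])
labels = np.array([0, 1, 0, 2, 1, 0])

# Positions belonging to each group, listed group by group.
group_positions = [np.flatnonzero(labels == g) for g in np.unique(labels)]

# Some per-group transformation (here: subtract the group mean).
transformed = [values[pos] - values[pos].mean() for pos in group_positions]

# Interleaved concatenation: concatenate the pieces, then take with the
# inverse permutation so every element returns to its original position.
order = np.concatenate(group_positions)
inverse = np.argsort(order)
result = np.take(np.concatenate(transformed), inverse)
```

No array is modified in place here, which is what makes the approach a reasonable fit for immutable arrays.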
What I need here is something similar to the private
(In xray, I implement this a little differently so that I can take along multiple axes simultaneously using array indexing, but this version would suffice.) |
Am I right in thinking that this is almost equivalent to fancy indexing with a list of indices? |
Yes, take_nd is very similar to fancy indexing but only non-negative indices are valid (-1 means insert NaN). |
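To make the difference from plain fancy indexing concrete, here is a small NumPy sketch of the behaviour described above. The helper name take_with_fill is hypothetical, it only handles 1-D input, and it is not the pandas implementation.

```python
import numpy as np

def take_with_fill(values, indexer, fill_value=np.nan):
    """Hypothetical 1-D helper: like np.take, but -1 marks a missing entry."""
    values = np.asarray(values, dtype=float)  # float so that NaN can be stored
    indexer = np.asarray(indexer)
    missing = indexer == -1
    # Replace -1 by a valid index (0) for the take, then mask those slots.
    taken = np.take(values, np.where(missing, 0, indexer))
    return np.where(missing, fill_value, taken)

data = np.array([10, 20, 30])
filled = take_with_fill(data, [2, -1, 0])  # -1 becomes NaN
wrapped = data[[2, -1, 0]]                 # plain fancy indexing: -1 is the last element
```

The masking is done with np.where rather than in-place assignment, so the same pattern does not rely on mutation.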
@mrocklin It occurs to me now that a much simpler version of the functionality I'm looking for with |
Basic support for dask.array is merged on master. Continued in #394 |
Dask is a library for out-of-core computation somewhat similar to biggus in conception, but with slightly grander aspirations. For examples of how Dask could be applied to weather data, see this blog post by @mrocklin: http://matthewrocklin.com/blog/work/2015/02/13/Towards-OOC-Slicing-and-Stacking/
It would be interesting to explore using dask internally in xray, so that we can implement lazy/out-of-core aggregations, concat and groupby to complement the existing lazy indexing. This functionality would be quite useful for xray, even more so than merely supporting datasets-on-disk (#199).
A related issue is #79: we can easily imagine using Dask with groupby/apply to power out-of-core and multi-threaded computation.
Todos for xray:
- Variable.concat to make use of functions like concatenate and stack instead of in-place array modification (Dask arrays do not support mutation, for good reasons)
- reindex_variables to not make direct use of mutation (e.g., by using da.insert below)
- numpy.ndarray objects (done: this is the data attribute)
- reblock in the public API
- some sort of API for user controlled lazy apply on dask arrays (using groupby, most likely) (not necessary for initial release)
- sin and sqrt
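As a minimal sketch of the first todo item, the snippet below contrasts building an array by in-place assignment with the functional concatenate and stack equivalents. It uses NumPy only; dask.array provides functions with the same names, but the pieces here are made-up example data, not xray's actual Variable.concat code.

```python
import numpy as np

# Hypothetical pieces that should be combined into one array.
pieces = [np.arange(3), np.arange(3, 6)]

# Mutation-based approach: allocate, then fill in place.
# This pattern is not available on arrays that do not support mutation.
out = np.empty(6, dtype=int)
out[:3] = pieces[0]
out[3:] = pieces[1]

# Functional equivalents that never modify an existing array.
joined = np.concatenate(pieces)  # join along the existing axis
stacked = np.stack(pieces)       # join along a new leading axis
```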
Todos for dask (to be clear, none of these are blockers for a proof of concept):
- support for interleaved concatenation (necessary for transformations by group, which are quite common) (turns out to be a one-liner with concatenate and take, see below)
- support for something like take_nd from pandas: like np.take, but with -1 as a sentinel value for "missing" (necessary for many alignment operations). da.insert, modeled after np.insert, would solve this problem.
- support "orthogonal" MATLAB-like array-based indexing along multiple dimensions (taking along one axis at a time is close enough)
- broadcast_to: see "ENH: add np.broadcast_to and reimplement np.broadcast_arrays", numpy/numpy#5371
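The remark that taking along one axis at a time is close enough to orthogonal indexing can be illustrated with a short NumPy sketch. The array and index lists are made up for the example.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)
rows = [0, 2]
cols = [1, 3]

# Orthogonal (MATLAB-like) selection: every chosen row crossed with every
# chosen column, built by taking along one axis at a time.
orthogonal = np.take(np.take(x, rows, axis=0), cols, axis=1)

# NumPy's default behaviour for two index lists is pointwise instead:
# it pairs rows[i] with cols[i].
pointwise = x[rows, cols]

# np.ix_ builds the same orthogonal selection in a single indexing step.
same = x[np.ix_(rows, cols)]
```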