Set `allow_rechunk=True` in `apply_ufunc` #4372
(this is causing downstream test failures: NCAR/pop-tools#59; thanks @mnlevy1981)

Copying over that comment...
@shoyer could you please clarify what you meant? For example, this works with v0.16.0 but fails on master:

```python
import operator

import numpy as np
import xarray as xr

a = xr.DataArray(np.ones((10, 10)), dims=("a", "b")).chunk({"a": 2, "b": 1})
b = xr.DataArray(np.ones((10, 10)), dims=("a", "b")).chunk({"a": -1, "b": 4})

xr.apply_ufunc(
    operator.add, a, b, dask="parallelized", output_dtypes=[a.dtype],
).compute().equals(a.compute() + b.compute())
```
One solution would be to catch this ValueError and issue a FutureWarning, modifying xarray/xarray/core/computation.py (lines 646 to 657 at 9c85dd5) like so:

```python
def func(*arrays):
    import dask.array as da

    gufunc = functools.partial(
        da.apply_gufunc,
        numpy_func,
        signature.to_gufunc_string(exclude_dims),
        *arrays,
        vectorize=vectorize,
        output_dtypes=output_dtypes,
    )

    try:
        res = gufunc(**dask_gufunc_kwargs)
    except ValueError as exc:
        if "with different chunksize present" in str(exc):
            warnings.warn(
                f"``allow_rechunk=True`` needs to be set explicitly in the "
                f"``dask_gufunc_kwargs`` parameter. Not setting it will raise the "
                f"dask ValueError ``{str(exc)}`` in a future version.",
                FutureWarning,
                stacklevel=2,
            )
            dask_gufunc_kwargs["allow_rechunk"] = True
            res = gufunc(**dask_gufunc_kwargs)
        else:
            raise
```

I could make a PR out of this. The message wording can surely be improved. WDYT @dcherian and @shoyer?
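The catch-and-retry pattern proposed above can be exercised in isolation. This is a minimal sketch using a hypothetical `fake_gufunc` as a stand-in for `dask.array.apply_gufunc`; it raises the same ValueError text when `allow_rechunk` is left at its default:

```python
import warnings

def fake_gufunc(allow_rechunk=False):
    # Hypothetical stand-in for ``dask.array.apply_gufunc``: mimics the
    # error dask raises on mismatched chunks when rechunking is disallowed.
    if not allow_rechunk:
        raise ValueError("Dimension `'a'` with different chunksize present")
    return "result"

dask_gufunc_kwargs = {}
try:
    res = fake_gufunc(**dask_gufunc_kwargs)
except ValueError as exc:
    if "with different chunksize present" in str(exc):
        # Warn once, then retry with rechunking allowed (the old behavior).
        warnings.warn(
            "``allow_rechunk=True`` needs to be set explicitly in "
            "``dask_gufunc_kwargs``; not setting it will raise in a future version.",
            FutureWarning,
            stacklevel=2,
        )
        dask_gufunc_kwargs["allow_rechunk"] = True
        res = fake_gufunc(**dask_gufunc_kwargs)
    else:
        raise
```

The `else: raise` branch matters: any ValueError that does not match the chunk-size message is re-raised unchanged, so only the one known failure mode gets the deprecation treatment.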
Maybe we do want to set `allow_rechunk=True`? It seems that I was just mistaken about the current behavior.
@shoyer In this case: should we warn the user that data might be loaded into memory? Further questions: why does this kwarg exist in dask at all, and why does dask not rechunk by default?
From the dask `apply_gufunc` docstring:

```
allow_rechunk: Optional, bool, keyword only
    Allows rechunking, otherwise chunk sizes need to match and core
    dimensions are to consist only of one chunk.
    Warning: enabling this can increase memory usage significantly.
    Defaults to False.
```

Current handling in dask:

```python
if not allow_rechunk:
    chunksizes = chunksizess[dim]
    #### Check if core dimensions consist of only one chunk
    if (dim in core_shapes) and (chunksizes[0][0] < core_shapes[dim]):
        raise ValueError(
            "Core dimension `'{}'` consists of multiple chunks. To fix, "
            "rechunk into a single chunk along this dimension or set "
            "`allow_rechunk=True`, but beware that this may increase "
            "memory usage significantly.".format(dim)
        )
    #### Check if loop dimensions consist of same chunksizes, when they have sizes > 1
    relevant_chunksizes = list(
        unique(c for s, c in zip(sizes, chunksizes) if s > 1)
    )
    if len(relevant_chunksizes) > 1:
        raise ValueError(
            "Dimension `'{}'` with different chunksize present".format(dim)
        )
```

IIUC, `allow_rechunk=True` not only rechunks non-core dimensions but also skips the check that core dimensions consist of a single chunk. Would this be intended, given this check in xarray's `apply_ufunc`:

```python
# core dimensions cannot span multiple chunks
for axis, dim in enumerate(core_dims, start=-len(core_dims)):
    if len(data.chunks[axis]) != 1:
        raise ValueError(
            "dimension {!r} on {}th function argument to "
            "apply_ufunc with dask='parallelized' consists of "
            "multiple chunks, but is also a core dimension. To "
            "fix, rechunk into a single dask array chunk along "
            "this dimension, i.e., ``.chunk({})``, but beware "
            "that this may significantly increase memory usage.".format(
                dim, n, {dim: -1}
            )
        )
```

That means setting `allow_rechunk=True` would change this behaviour for core dimensions, too.
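The dask behaviour described above can be observed directly with `dask.array.apply_gufunc`. A small sketch (assuming dask is installed; an elementwise `"(),()->()"` signature, so both dimensions are loop dimensions with mismatched chunks):

```python
import numpy as np
import dask.array as da

# Two arrays with mismatched chunks along both (loop) dimensions.
a = da.ones((10, 10), chunks=(2, 1))
b = da.ones((10, 10), chunks=(10, 4))

# Without allow_rechunk, dask refuses to proceed:
try:
    da.apply_gufunc(np.add, "(),()->()", a, b)
except ValueError as exc:
    print(exc)  # Dimension ... with different chunksize present

# With allow_rechunk=True, dask rechunks the inputs itself:
res = da.apply_gufunc(np.add, "(),()->()", a, b, allow_rechunk=True)
```

Note this mirrors the regression report: on 0.16.0 `apply_ufunc` went through `blockwise`, which unifies chunks like the second call here does.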
Trying to answer this from looking at the dask code.
So to maintain backward compatibility, we should add that same check:

```python
# core dimensions cannot span multiple chunks
for axis, dim in enumerate(core_dims, start=-len(core_dims)):
    if len(data.chunks[axis]) != 1:
        raise ValueError(
            "dimension {!r} on {}th function argument to "
            "apply_ufunc with dask='parallelized' consists of "
            "multiple chunks, but is also a core dimension. To "
            "fix, rechunk into a single dask array chunk along "
            "this dimension, i.e., ``.chunk({})``, but beware "
            "that this may significantly increase memory usage.".format(
                dim, n, {dim: -1}
            )
        )
```

and set `allow_rechunk=True`. We could deprecate and remove this check in a couple of versions, but I don't know if it's worth the effort...
What happened: `blockwise` calls `unify_chunks` by default but `apply_gufunc` does not, so we have a regression in `apply_ufunc` now that we've switched from `blockwise` to `apply_gufunc`.

Minimal Complete Verifiable Example: raises on master but works with 0.16.0.

I think we need to do `dask_gufunc_kwargs.setdefault("allow_rechunk", True)`. If we want to avoid that, we'll need to go through a deprecation cycle.
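The `setdefault` suggestion matters because it only fills in the value when the caller has not supplied one, so an explicit user choice always wins. A minimal sketch of the dict semantics:

```python
# No user setting: the default kicks in.
dask_gufunc_kwargs = {}
dask_gufunc_kwargs.setdefault("allow_rechunk", True)

# Explicit user setting: setdefault leaves it untouched.
user_kwargs = {"allow_rechunk": False}
user_kwargs.setdefault("allow_rechunk", True)
```

This is why `setdefault` is a gentler default than unconditionally assigning `dask_gufunc_kwargs["allow_rechunk"] = True`: users who deliberately passed `allow_rechunk=False` keep the strict chunk checks.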