Stack: avoid re-chunking (dask) and insert new coordinates arbitrarily #4389

Closed
chrisroat opened this issue Aug 30, 2020 · 5 comments

@chrisroat
Contributor

The behavior of stack was not quite intuitive to me, and I'd like to understand if this was an explicit technical decision or if it can be changed.

First, with regard to chunking:

import dask.array as da
import xarray as xr

arr = xr.DataArray(da.zeros((2, 3, 4), dtype=int, chunks=(1, 1, 1)), dims=['z', 'y', 'x'])
stacked = arr.stack(v=('y', 'x'))
print(stacked)
--
<xarray.DataArray 'zeros-6eb2edd0fca7ec97141e1310bd303988' (z: 2, v: 12)>
dask.array<reshape, shape=(2, 12), dtype=int64, chunksize=(1, 4), chunktype=numpy.ndarray>
Coordinates:
  * v        (v) MultiIndex
  - y        (v) int64 0 0 0 0 1 1 1 1 2 2 2 2
  - x        (v) int64 0 1 2 3 0 1 2 3 0 1 2 3
Dimensions without coordinates: z

Why did the number of chunks change in this case? Couldn't the chunksize be (1,1)?

Next, why is it necessary to put the new dimension at the end? It seems there are often more natural (perhaps just to my naive thought process) placements. One example would be that same array above, but stacked on the first two dimensions. I would want the new dimension to be the first dimension (again without the rechunking above). To accomplish this, I do:

arr = xr.DataArray(da.zeros((2, 3, 4), dtype=int, chunks=(1, 1, 1)), dims=['z', 'y', 'x'])
stacked = arr.stack(v=('z', 'y')).transpose('v', ...).chunk({'v': 1})
print(stacked)
--
<xarray.DataArray 'zeros-6eb2edd0fca7ec97141e1310bd303988' (v: 6, x: 4)>
dask.array<rechunk-merge, shape=(6, 4), dtype=int64, chunksize=(1, 1), chunktype=numpy.ndarray>
Coordinates:
  * v        (v) MultiIndex
  - z        (v) int64 0 0 0 1 1 1
  - y        (v) int64 0 1 2 0 1 2
Dimensions without coordinates: x

The dask graph for this last example inserts a rechunk and two transposes, but my intent was for none of the underlying chunks to change at all. Here is 1 of 8 pieces of the graph (with optimization off; optimization combines operations, but doesn't change the topology or the operations):

[Image: one piece of the unoptimized dask task graph]

Is it technically feasible for stack to avoid rechunking, and for the user to determine where the new dimensions should go?

@chrisroat changed the title from "Stack: avoid re-chunking (dask) and insert new coordinates naturally where possible" to "Stack: avoid re-chunking (dask) and insert new coordinates arbitrarily" on Aug 30, 2020

@dcherian
Contributor

dask seems to rechunk when reshaping

dask.visualize(arr.data.reshape((6, 4)))
[Image: dask task graph for arr.data.reshape((6, 4))]
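The effect of the reshape can also be checked without rendering the graph. The following is a minimal sketch, assuming only that dask is installed; it rebuilds the array from the issue directly in dask and prints the chunk layout before and after, plus the names of the graph layers. No expected output is shown because the result depends on the installed dask version.

```python
import dask.array as da

# Same layout as the array in the issue: shape (2, 3, 4), one element per chunk.
arr = da.zeros((2, 3, 4), dtype="int64", chunks=(1, 1, 1))

# Merge the first two axes, as in the reshape call above.
reshaped = arr.reshape((6, 4))

# Compare the block layout before and after the reshape.
print("before:", arr.chunks)
print("after: ", reshaped.chunks)

# Layer names of the high-level graph; a name containing "rechunk"
# indicates that dask inserted a rechunk step.
layer_names = list(reshaped.dask.layers)
print("layers:", layer_names)
print("rechunk inserted:", any("rechunk" in name for name in layer_names))
```

If the "after" chunks along the last axis are no longer all of size 1, the reshape itself changed the chunking, independent of anything xarray does.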

The core stack operation is here:

xarray/xarray/core/variable.py

Lines 1456 to 1478 in 2acd0fc

    def _stack_once(self, dims, new_dim):
        if not set(dims) <= set(self.dims):
            raise ValueError("invalid existing dimensions: %s" % dims)

        if new_dim in self.dims:
            raise ValueError(
                "cannot create a new dimension with the same "
                "name as an existing dimension"
            )

        if len(dims) == 0:
            # don't stack
            return self.copy(deep=False)

        other_dims = [d for d in self.dims if d not in dims]
        dim_order = other_dims + list(dims)
        reordered = self.transpose(*dim_order)

        new_shape = reordered.shape[: len(other_dims)] + (-1,)
        new_data = reordered.data.reshape(new_shape)
        new_dims = reordered.dims[: len(other_dims)] + (new_dim,)

        return Variable(new_dims, new_data, self._attrs, self._encoding, fastpath=True)

I think the dimension is at the end for coding convenience...
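To illustrate the transpose-then-reshape steps in `_stack_once`, here is a rough NumPy-only sketch. The function `stack_axes` and its `position` parameter are hypothetical names made up for this example and are not part of xarray or NumPy; the only addition over the logic quoted above is a final `np.moveaxis` that moves the new trailing axis somewhere else.

```python
import numpy as np


def stack_axes(a, axes, position=-1):
    """Collapse `axes` of `a` into one axis and place it at `position`.

    Hypothetical helper for illustration; not an xarray or NumPy API.
    """
    axes = [ax % a.ndim for ax in axes]
    other = [ax for ax in range(a.ndim) if ax not in axes]
    # Same steps as _stack_once: move the stacked axes to the end, then reshape.
    reordered = a.transpose(tuple(other + axes))
    stacked = reordered.reshape(reordered.shape[: len(other)] + (-1,))
    # Extra step: move the new trailing axis to the requested position.
    return np.moveaxis(stacked, -1, position)


a = np.arange(24).reshape(2, 3, 4)
print(stack_axes(a, (0, 1)).shape)              # (4, 6): new axis at the end
print(stack_axes(a, (0, 1), position=0).shape)  # (6, 4): new axis first
```

With `position=0` and the first two axes stacked, the result equals `a.reshape(6, 4)`, which is the layout asked for in the original post.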

@chrisroat
Contributor Author

There has been some discussion on the dask chunking issue here:
dask/dask#3650
dask/dask#5544

Regarding the position of the inserted variable, it is not related to the chunking. It seems possible to do this. Would this be an acceptable change? If so, the first problem is the signature, as multiple dimensions may be passed in. :/
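Until the signature question is settled, the placement can be handled in user code. The sketch below is one possible workaround, not a proposed xarray API: `stack_in_place` is a hypothetical helper that handles a single new dimension (which sidesteps the multiple-dimension signature problem) and puts it where the first of the stacked dimensions was, using only `stack` and `transpose`. It still adds a transpose to the graph and does nothing about chunking.

```python
import numpy as np
import xarray as xr


def stack_in_place(arr, new_dim, dims):
    """Stack `dims` of a DataArray into `new_dim`, placed where the first of `dims` was.

    Hypothetical helper built on existing xarray calls; not an xarray API.
    """
    remaining = [d for d in arr.dims if d not in dims]
    # Every dimension before the first stacked one is kept, so its index
    # is also the insertion point among the remaining dimensions.
    insert_at = min(arr.dims.index(d) for d in dims)
    order = remaining[:insert_at] + [new_dim] + remaining[insert_at:]
    return arr.stack({new_dim: list(dims)}).transpose(*order)


arr = xr.DataArray(np.zeros((2, 3, 4)), dims=["z", "y", "x"])
print(stack_in_place(arr, "v", ("z", "y")).dims)  # ('v', 'x')
print(stack_in_place(arr, "v", ("y", "x")).dims)  # ('z', 'v')
```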

@stale

stale bot commented Apr 28, 2022

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

@stale bot added the stale label on Apr 28, 2022
@dcherian removed the stale label on Apr 28, 2022

@phofl
Contributor

phofl commented Aug 19, 2024

The rechunking is now gone with the latest Dask release.

Here is the task graph
[Image: task graph for stack]

@dcherian
Contributor

Amazing. Thanks @phofl
