
Variable.stack constructs extremely large chunks #5754

Closed
dcherian opened this issue Sep 1, 2021 · 6 comments

Comments

@dcherian
Contributor

dcherian commented Sep 1, 2021

Minimal Complete Verifiable Example:

Here's a small array with too-small chunk sizes, just as an example:

import dask.array
import xarray as xr

var = xr.Variable(("x", "y", "z"), dask.array.random.random((4, 18483, 1000), chunks=(1, 183, -1)))


Now stack two dimensions; this is a 100x increase in chunk size (in my actual code, 85MB chunks become 8.5GB chunks =) )

var.stack(new=("x", "y"))

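The ~100x factor checks out with simple arithmetic. A sketch (assuming float64 data, and assuming the stacked chunk grows to span the full y axis):

```python
# Back-of-the-envelope chunk sizes for the example array (float64 = 8 bytes).
bytes_per_element = 8

# Original chunk (1, 183, 1000):
chunk_before = 1 * 183 * 1000 * bytes_per_element  # ~1.46 MB

# After stacking, a chunk spans all of y (18483 = 101 chunks of 183):
chunk_after = 1 * 18483 * 1000 * bytes_per_element  # ~148 MB

print(chunk_after // chunk_before)  # 101: the ~100x increase reported above
```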

But calling reshape on the dask array preserves the original chunk size:

var.data.reshape((4*18483, -1))


Solution

Ah, found it: we transpose then reshape in Variable._stack_once.

xarray/xarray/core/variable.py

Lines 1521 to 1527 in f915515

other_dims = [d for d in self.dims if d not in dims]
dim_order = other_dims + list(dims)
reordered = self.transpose(*dim_order)
new_shape = reordered.shape[: len(other_dims)] + (-1,)
new_data = reordered.data.reshape(new_shape)
new_dims = reordered.dims[: len(other_dims)] + (new_dim,)

Writing those steps with pure dask yields the same 100x increase in chunk size:

var.data.transpose([2, 0, 1]).reshape((-1, 4*18483))

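The contrast can be reproduced end to end with plain dask. A sketch with shrunken, illustrative sizes (split_large_chunks is disabled so the raw merging behavior this issue reports is visible):

```python
import dask
import dask.array as da

# Same chunk layout as the example above, scaled down: 1830 = 10 chunks of 183.
a = da.ones((4, 1830, 100), chunks=(1, 183, 100))

# Reshape alone keeps the original 183-row chunks along the merged axis.
b = a.reshape((4 * 1830, -1))
print(max(b.chunks[0]))  # 183

# Transpose first, then reshape: per this issue, dask then merges the
# y-chunks along the stacked axis into much larger chunks.
with dask.config.set({"array.slicing.split_large_chunks": False}):
    c = a.transpose([2, 0, 1]).reshape((-1, 4 * 1830))
print(c.chunks[1])
```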

Anything else we need to know?:

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.6 | packaged by conda-forge | (default, Jan 25 2021, 23:21:18)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1127.18.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.19.0
pandas: 1.3.1
numpy: 1.21.1
scipy: 1.5.3
netCDF4: 1.5.6
pydap: installed
h5netcdf: 0.11.0
h5py: 3.3.0
Nio: None
zarr: 2.8.3
cftime: 1.5.0
nc_time_axis: 1.3.1
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: 3.0.4
bottleneck: 1.3.2
dask: 2021.07.2
distributed: 2021.07.2
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: 0.11.1
numbagg: None
pint: 0.17
setuptools: 49.6.0.post20210108
pip: 21.2.2
conda: 4.10.3
pytest: 6.2.4
IPython: 7.26.0
sphinx: 4.1.2

@dcherian
Contributor Author

dcherian commented Sep 1, 2021

Ah this is dask/dask#5544 again. It looks like dask needs to break up the potentially-very-large intermediate chunks.

That said our strategy of transposing first means that the optimization implemented in dask/dask#5544 (comment) doesn't kick in in this case.

@dcherian
Contributor Author

Fixed upstream

@yucsong

yucsong commented Mar 22, 2023

Sorry, is this fixed?

@dcherian
Contributor Author

It was fixed in dask, but we're still sub-optimal.

Do you have an example of a problem? Please open a new issue with a reproducible example if you do.

@yucsong

yucsong commented Mar 22, 2023

/srv/conda/envs/notebook/lib/python3.10/site-packages/xarray/core/variable.py:1721: PerformanceWarning: Reshaping is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array.reshape(shape)

I simply tested var.stack(new=("x", "y")) and got the above message. I don't understand why lines 1521 to 1527 of xarray/core/variable.py do a reshape?

@dcherian
Contributor Author

This is fine. That warning means dask is fixing the issue reported here.
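For anyone who does want to accept the large chunks and silence the warning, the option from the warning text can be applied around the stack call. A minimal sketch (array sizes shrunk for speed; the variable name is illustrative):

```python
import dask
import dask.array
import xarray as xr

var = xr.Variable(
    ("x", "y", "z"),
    dask.array.random.random((4, 1848, 100), chunks=(1, 184, -1)),
)

# Accept large chunks and silence the PerformanceWarning, as the
# warning message itself suggests:
with dask.config.set({"array.slicing.split_large_chunks": False}):
    stacked = var.stack(new=("x", "y"))

print(stacked.dims)  # ('z', 'new')
```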

@yt87 yt87 mentioned this issue Oct 19, 2023