Specify chunks in bytes #8021

Open · mrocklin opened this issue Jul 26, 2023 · 4 comments

@mrocklin (Contributor)

Is your feature request related to a problem?

I'm playing around with xarray performance and would like a way to easily tweak chunk sizes. I'm able to do this by backing out what xarray chooses in an open_zarr call and then providing the right chunks= argument. I'll admit, though, that I wouldn't mind giving Xarray a value like "1 GiB" and having it use that when determining "auto" chunk sizes.

Dask array does this in two ways. We can provide a value in chunks like the following:

x = da.random.random(..., chunks="1 GiB")

We also fall back to a default value in the Dask config:

In [1]: import dask

In [2]: dask.config.get("array.chunk-size")
Out[2]: '128MiB'
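
For example (a small sketch; the array shape here is arbitrary), chunks="auto" resolves against that config value:

import dask
import dask.array as da

with dask.config.set({"array.chunk-size": "1GiB"}):
    x = da.random.random((20000, 20000))  # chunks defaults to "auto"
print(x.chunksize)  # chunk shape chosen against the 1 GiB target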

This is not very important (I'm unblocked) but I thought I'd mention it in case someone is looking for some fun work 🙂

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@jhamman (Member) commented Jul 26, 2023

I like this suggestion! The trick will be to find a general way to map the chunk specification efficiently over the underlying storage backend's "preferred chunks" (e.g. #7948).

Note that you can get most of what you want today with the following syntax:

xr.open_dataset(..., chunks=None).chunk("1 GiB")

In the future, I think it would be quite nice if we supported:

xr.open_dataset(..., chunks="1 GiB")

where the resulting chunks would be constructed, as closely as possible, to align with the chunks of the underlying dataset (e.g. Zarr, HDF5).

xref: #1440
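
To illustrate the kind of alignment meant here (purely a sketch with made-up numbers, not anything xarray or Dask implements), one could scale the on-disk chunk shape up by integer factors until a byte budget is reached:

import math

def align_chunks(disk_chunks, shape, itemsize, target_bytes):
    # hypothetical helper: grow the on-disk chunks by whole multiples,
    # last axis first, without letting a single chunk exceed target_bytes
    chunks = list(disk_chunks)
    for axis in reversed(range(len(chunks))):
        bytes_per_chunk = math.prod(chunks) * itemsize
        factor = max(1, int(target_bytes // bytes_per_chunk))
        chunks[axis] = min(shape[axis], chunks[axis] * factor)
    return tuple(chunks)

# e.g. 1 MiB float64 chunks on disk, 1 GiB target per dask chunk
align_chunks((1, 512, 256), (8760, 2048, 1024), 8, 2**30)  # -> (64, 2048, 1024)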

@denis-bz commented Oct 1, 2023

Hi @mrocklin

I'm playing around with xarray performance and would like a way to easily tweak chunk sizes.

Me too. What happens in the null case, NO chunking?
What I want to do is copy a 2 GB slow.nc to fast.nc once, right after download,
so that subsequent xr.open_dataset( "fast.nc", chunks=❓ ).to_numpy() calls
are as fast as possible. A simple testbench along these lines

import xarray as xr

cube = Randomcube( gb=2 )  # helper (not shown): random data, shape (8760, 247, 247), 2039 mb
xar = toxarray( cube.values, name="Randomcube" )  # helper (not shown): wrap as a DataArray
# xar.encoding = ❓
nc = "/tmp/tmp.nc"
xar.to_netcdf( nc, format="NETCDF4", engine="netcdf4" )

for chunks in [ "auto", {}, None ]:  # last fastest ??
    ptime()  # helper (not shown): print elapsed time
    with xr.load_dataarray( nc, chunks=chunks ) as load:  # load / open ?
        tonumpy = load.to_numpy()
    ptime( f"xr.load_dataarray( {chunks = }  {nc} ) .to_numpy" )

shows odd results on my old iMac.
Can you comment (or move this to Discussions),
or do you know of other testbenches?
Thanks, cheers
-- denis

@jhamman (Member) commented Oct 3, 2023

What happens in the null case, NO chunking ?

The first thing to consider is whether your netCDF4 file is chunked or contiguous on disk. If it is not chunked on disk, Xarray and Dask cannot do much to optimize partial array decompression. If it is chunked on disk, you'll likely find the best performance by aligning your read chunks to the chunks on disk. #7948 recently added support for chunks='auto' and chunks={}, so make sure you are using the latest release.
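
As a minimal sketch of that first check (the variable name "Randomcube" and the path come from the testbench above):

import xarray as xr

with xr.open_dataset("/tmp/tmp.nc") as ds:
    enc = ds["Randomcube"].encoding
    # the netCDF4 backend records the on-disk layout in the variable's encoding
    print(enc.get("contiguous"), enc.get("chunksizes"))

# if "chunksizes" is set, chunks={} (the backend's preferred chunks) is a good starting point:
# xr.open_dataset("/tmp/tmp.nc", chunks={})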

Can you comment (move to discussions) or know of other testbenches ?

I'd like to leave this issue here because the feature described above still applies. I would encourage you to open a discussion for a more detailed conversation.

@denis-bz commented Oct 6, 2023

@jhamman, apart from chunking, this test case shows chunks="auto" much slower than {} and None
on my old iMac. Does anyone have the time or interest to run this on a modern machine or two?
Then I'd put the code in a gist and open a discussion.

Multilevel caches (SSD, L2, L1) vary a LOT
=> many timing tests > 1 GB are for the Journal of Irreproducible Results.
