Specify chunks in bytes #8021

Open · mrocklin opened this issue Jul 26, 2023 · 4 comments

@mrocklin (Contributor)

Is your feature request related to a problem?

I'm playing around with xarray performance and would like a way to easily tweak chunk sizes. I'm able to do this by backing out what xarray chooses in an open_zarr call and then providing the right chunks= argument. I'll admit, though, that I wouldn't mind giving Xarray a value like "1 GiB" and having it use that when determining "auto" chunk sizes.

Dask array does this in two ways. We can provide a value in chunks like the following:

x = da.random.random(..., chunks="1 GiB")

We also fall back to a default value in the Dask config:

In [1]: import dask

In [2]: dask.config.get("array.chunk-size")
Out[2]: '128MiB'
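
For example (a small sketch; the array shape here is arbitrary), chunks="auto" resolves against that config value:

import dask
import dask.array as da

with dask.config.set({"array.chunk-size": "1GiB"}):
    x = da.random.random((20000, 20000))  # chunks defaults to "auto"
print(x.chunksize)  # chunk shape chosen against the 1 GiB target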

This is not very important (I'm unblocked) but I thought I'd mention it in case someone is looking for some fun work 🙂

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@jhamman (Member) commented Jul 26, 2023

I like this suggestion! The trick will be to find a general way to map the chunk specification efficiently over the underlying storage backend's "preferred chunks" (e.g. #7948).

Note that you can get most of what you want today with the following syntax:

xr.open_dataset(..., chunks=None).chunk("1 GiB")

In the future, I think it would be quite nice if we supported:

xr.open_dataset(..., chunks="1 GiB")

where the resulting chunks would be constructed, as closely as possible, to align with the chunks of the underlying dataset (e.g. Zarr, HDF5).

xref: #1440
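
To illustrate the kind of alignment meant here (purely a sketch with made-up numbers, not anything xarray or Dask implements), one could scale the on-disk chunk shape up by integer factors until a byte budget is reached:

import math

def align_chunks(disk_chunks, shape, itemsize, target_bytes):
    # hypothetical helper: grow the on-disk chunks by whole multiples,
    # last axis first, without letting a single chunk exceed target_bytes
    chunks = list(disk_chunks)
    for axis in reversed(range(len(chunks))):
        bytes_per_chunk = math.prod(chunks) * itemsize
        factor = max(1, int(target_bytes // bytes_per_chunk))
        chunks[axis] = min(shape[axis], chunks[axis] * factor)
    return tuple(chunks)

# e.g. 1 MiB float64 chunks on disk, 1 GiB target per dask chunk
align_chunks((1, 512, 256), (8760, 2048, 1024), 8, 2**30)  # -> (64, 2048, 1024)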

@denis-bz commented Oct 1, 2023

Hi @mrocklin

I'm playing around with xarray performance and would like a way to easily tweak chunk sizes.

Me too. What happens in the null case, NO chunking?
What I want to do is copy a 2 GB slow.nc to fast.nc once, right after download,
so that subsequent xr.open_dataset( "fast.nc", chunks=❓ ).to_numpy() calls
are as fast as possible. A simple testbench along these lines

import xarray as xr

cube = Randomcube( gb=2 )  # helper (not shown): random data, shape (8760, 247, 247), 2039 mb
xar = toxarray( cube.values, name="Randomcube" )  # helper (not shown): wrap as a DataArray
# xar.encoding = ❓
nc = "/tmp/tmp.nc"
xar.to_netcdf( nc, format="NETCDF4", engine="netcdf4" )

for chunks in [ "auto", {}, None ]:  # last fastest ??
    ptime()  # helper (not shown): print elapsed time
    with xr.load_dataarray( nc, chunks=chunks ) as load:  # load / open ?
        tonumpy = load.to_numpy()
    ptime( f"xr.load_dataarray( {chunks = }  {nc} ) .to_numpy" )

shows odd results on my old iMac.
Can you comment (or move this to Discussions),
or do you know of other testbenches?
Thanks, cheers
-- denis

@jhamman (Member) commented Oct 3, 2023

What happens in the null case, NO chunking ?

The first thing to consider is whether your netCDF4 file is chunked or contiguous on disk. If it is not chunked on disk, Xarray and Dask cannot do much to optimize partial array decompression. If it is chunked on disk, you'll likely find the best performance by aligning your read chunks to the chunks on disk. #7948 recently added support for chunks='auto' and chunks={}, so make sure you are using the latest release.
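
As a minimal sketch of that first check (the variable name "Randomcube" and the path come from the testbench above):

import xarray as xr

with xr.open_dataset("/tmp/tmp.nc") as ds:
    enc = ds["Randomcube"].encoding
    # the netCDF4 backend records the on-disk layout in the variable's encoding
    print(enc.get("contiguous"), enc.get("chunksizes"))

# if "chunksizes" is set, chunks={} (the backend's preferred chunks) is a good starting point:
# xr.open_dataset("/tmp/tmp.nc", chunks={})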

Can you comment (move to discussions) or know of other testbenches ?

I'd like to leave this issue here because the feature described above still applies. I would encourage you to open a discussion for a more detailed conversation.

@denis-bz commented Oct 6, 2023

@jhamman, apart from chunking, this test case shows chunks="auto" much slower than {} and None
on my old iMac. Does anyone have the time or interest to run this on a modern machine or two?
Then I'd put the code in a gist and open a discussion.

Multilevel caches (SSD, L2, L1) vary a LOT
=> many timing tests > 1 GB are for the Journal of Irreproducible Results.
