opening a zarr dataset taking so much time #8902

DarshanSP19 · 2024-04-02T13:01:52Z

What is your issue?

I have an ERA5 dataset stored as Zarr in a GCS bucket. It contains 273 weather-related variables and 4 dimensions, with hourly data from 1940 to 2023.
When I open it with ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"), the open call takes 90 seconds to finish.
The chunk scheme is { 'time': 1 }.


welcome · 2024-04-02T13:01:54Z

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

slevang · 2024-04-02T18:50:52Z

I've felt the pain on this particular store as well; it's a nice test case: 3PB and ~300,000,000 chunks in total.

Looks like this is a dask problem though. All the time is spent in single-threaded code creating the array chunks.

If we skip dask with xr.open_zarr(..., chunks=None) it takes 1.5s.

We currently have a drop_variables arg. When you have a dataset with 273 variables and only want a couple, an inverse keep_variables arg would be a lot easier. It looks like drop_variables gets applied before we create the dask chunks for the arrays, so by reading the store once and then, on a second read, passing drop_variables=[v for v in ds.data_vars if v != "geopotential"], I recover a ~1.5s read time.
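A minimal sketch of that two-pass workaround, assuming xarray is installed and the store is reachable. The helper names variables_to_drop and open_keeping are hypothetical and not part of xarray; only xr.open_zarr with its chunks and drop_variables arguments comes from the library.

```python
def variables_to_drop(all_vars, keep):
    """Return every name in all_vars that is not in keep (order preserved)."""
    keep = set(keep)
    missing = keep - set(all_vars)
    if missing:
        raise KeyError(f"variables not in store: {sorted(missing)}")
    return [v for v in all_vars if v not in keep]


def open_keeping(store, keep):
    """Hypothetical helper emulating a keep_variables argument."""
    # Imported lazily so the pure helper above works without xarray installed.
    import xarray as xr

    # First pass: no dask, so no chunk graph is built; only metadata is read.
    meta = xr.open_zarr(store, chunks=None)
    drop = variables_to_drop(list(meta.data_vars), keep)
    # Second pass: dask chunks are only created for the variables that remain.
    return xr.open_zarr(store, drop_variables=drop)
```

Usage would look like open_keeping("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3", ["geopotential"]). The cost is that the store metadata is read twice.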

dcherian · 2024-04-02T18:57:40Z

Nice profile!

@jrbourbeau dask/dask#10648 probably improves this by a lot. Can that be reviewed/merged please?

edit: xref dask/dask#10269

slevang · 2024-04-02T19:59:27Z

Interestingly, things are almost 4x worse on both of those PRs, but both are out of date, so they may be missing other recent improvements.

riley-brady · 2024-04-03T18:10:01Z

@slevang, do you mind sharing how you are generating these profiles and associated graphs? I've struggled in the past to do this effectively with a dask cluster. This looks great!

dcherian · 2024-04-03T18:19:08Z

This is snakeviz: https://jiffyclub.github.io/snakeviz/

slevang · 2024-04-03T18:24:24Z

^yep. Probably not useful for distributed profiling but I haven't really tried. It's just a visualization layer for cProfile.

In this case the creation of this monster task graph would happen serially in the main process even if your goal was to eventually use a distributed client to run the processing. The graph (only describing the array chunks) is close to a billion objects in this case, so you would run into issues even trying to serialize it out to workers.
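A minimal sketch of how such a profile can be captured with cProfile and then viewed in snakeviz. The helper names profile_to_file and top_cumulative and the file name open_zarr.prof are hypothetical; the cProfile and pstats calls are from the standard library.

```python
import cProfile
import pstats


def profile_to_file(func, path, *args, **kwargs):
    """Run func under cProfile, dump the stats to path, and return func's result."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    profiler.dump_stats(path)
    return result


def top_cumulative(path, limit=10):
    """Print the entries with the largest cumulative time from a stats file."""
    stats = pstats.Stats(path)
    stats.sort_stats("cumulative").print_stats(limit)
```

For the case in this issue, something like profile_to_file(xr.open_zarr, "open_zarr.prof", store) would write the stats file, and running snakeviz open_zarr.prof on the command line would open the visualization.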

max-sixty · 2024-04-03T18:56:22Z

FYI something like:

ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    chunks=dict(time=-1, level=-1, latitude="auto", longitude="auto"),
)

...may give a better balance between dask task graph size and chunk size.

(I use this "make the dask chunks bigger than the zarr chunks" approach a lot, because even simple dask graphs can become huge with a 50TB dataset, since the default zarr encoding has a maximum chunk size of 2GB. Not sure it's necessarily the best way; very open to ideas...)

slevang · 2024-04-03T19:08:51Z

Edit: never mind, I actually have no idea where this profile came from. Disregard.

DarshanSP19 · 2024-04-04T07:49:51Z

How do I get my work done?

I opened the dataset with chunks=None.
Then I filtered it as required: selecting only a few data variables, some fixed time ranges, and some fixed lat/lon ranges.
Then I chunked only that small dataset.

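import xarray as xr

# Open without dask so no chunk graph is built for the full store.
ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3", chunks=None)
data_vars = ['surface_pressure', 'temperature']
ds = ds[data_vars]
ds = ds.sel(time=slice('2013-01-01', '2013-12-31'))
# ... and a few more filters ...
# Only now convert to dask arrays, for the filtered subset.
ds = ds.chunk()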

This will only generate chunks for a filtered dataset.
These steps worked for me, so I'm closing the issue.

martindurant · 2024-11-20T19:29:42Z

> This will only generate chunks for a filtered dataset.

This should absolutely be the way that xarray works! There is no need to create chunks for variables that never are accessed.

phofl · 2024-11-21T09:19:34Z

So this takes 130 seconds on my machine; two things account for 80 seconds of that:

tokenising the chunks (yikes...)
this warning (yikes too, 25-30 seconds)

I have a PR here that will cut 25 seconds off the runtime; the other parts are probably a bit trickier.

cc @dcherian

dcherian · 2024-11-21T18:34:54Z

Yup, saving 30s with #9808. The cache is quite effective: CacheInfo(hits=826, misses=4, maxsize=None, currsize=4)
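For readers unfamiliar with that CacheInfo output: functools.cache exposes hit and miss counters through cache_info(). A minimal sketch, using a hypothetical is_aligned check (not xarray's actual code from #9808) and illustrative chunk tuples, shows why many variables sharing the same chunking lead to mostly cache hits:

```python
from functools import cache


@cache
def is_aligned(requested, on_disk):
    """Hypothetical check: each requested chunk is a whole multiple of the on-disk chunk."""
    return all(r % d == 0 for r, d in zip(requested, on_disk))


# Illustrative values only: many variables that share one chunking scheme.
on_disk = (1, 721, 1440)
requested = (1, 721, 1440)
results = [is_aligned(requested, on_disk) for _ in range(273)]

# Only the first call computes anything; the rest are served from the cache.
info = is_aligned.cache_info()
print(info)
```

The arguments must be hashable (tuples rather than lists) for functools.cache to work.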

dcherian · 2024-12-16T16:45:09Z

Belatedly realizing that Xarray's call to normalize_chunks is a major time waster here, given that chunks contains a tuple with O(1 million) elements, hehe.
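A rough sketch of where a tuple of that size comes from, assuming the time axis runs hourly from 1940-01-01 through the end of 2023 (taken from the issue description, not read from the store). The explicit_chunks helper is hypothetical; it only imitates the explicit per-dimension form, with one entry per block, and is not dask's normalize_chunks.

```python
from datetime import datetime


def explicit_chunks(dim_len, chunk_size):
    """Expand a uniform chunk size into one tuple entry per block along a dimension."""
    full, remainder = divmod(dim_len, chunk_size)
    return (chunk_size,) * full + ((remainder,) if remainder else ())


# Assumed extent of the hourly time axis.
start = datetime(1940, 1, 1)
end = datetime(2024, 1, 1)
n_hours = int((end - start).total_seconds() // 3600)

# With the { 'time': 1 } chunk scheme, every hour is its own block.
time_chunks = explicit_chunks(n_hours, 1)
print(n_hours, len(time_chunks))
```

Any per-variable work that walks or hashes this tuple is repeated for each of the 273 variables, which is consistent with caching the result being worthwhile.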

xref pydata#8902 xref pydata#1525

DarshanSP19 added the needs triage label Apr 2, 2024

TomNicholas added the topic-performance and topic-zarr labels and removed the needs triage label Apr 2, 2024

dcherian added the upstream issue and topic-dask labels Apr 2, 2024

DarshanSP19 closed this as completed Apr 4, 2024

dcherian mentioned this issue Nov 20, 2024

Low memory way to write parquet refs for an existing Zarr dataset? fsspec/kerchunk#529

Closed

phofl mentioned this issue Nov 21, 2024

Speed up ArraySliceDep tokenization dask/dask#11551

Merged

3 tasks

This was referenced Nov 21, 2024

Use functools.cache more dask/dask#11554

Open

Faster chunk checking for backend datasets #9808

Merged

dcherian mentioned this issue Nov 22, 2024

Cache svg-representation for arrays dask/dask#11557

Closed

3 tasks

dcherian reopened this Dec 16, 2024

dcherian added a commit to dcherian/xarray that referenced this issue Dec 16, 2024

Try using uuids

de010ea

xref pydata#8902 xref pydata#1525

dcherian mentioned this issue Dec 16, 2024

Cache the result of DaskManager.normalize_chunks #9897

Draft

4 tasks
