Opening a zarr dataset takes a very long time #8902
Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
I've felt the pain on this particular store as well; it's a nice test case: 3PB total, ~300,000,000 chunks in total. Looks like this is a dask problem though: all the time is spent in single-threaded code creating the array chunks, and we can skip dask entirely by passing `chunks=None`.
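For concreteness, a minimal sketch of that comparison, assuming access to the public ARCO-ERA5 bucket (`chunks=None` is xarray's documented way of returning lazily indexed arrays without building a dask graph):

```python
import xarray as xr

# With chunks=None no dask task graph is built at all, so open_zarr only
# needs to read the store's metadata; the ~300M-chunk graph never exists.
ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    chunks=None,
)
```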
Nice profile! @jrbourbeau dask/dask#10648 probably improves this by a lot. Can that be reviewed/merged please?

edit: xref dask/dask#10269
@slevang, do you mind sharing how you are generating these profiles and associated graphs? I've struggled in the past to do this effectively with a distributed scheduler.
This is snakeviz: https://jiffyclub.github.io/snakeviz/
^yep. Probably not useful for distributed profiling, but I haven't really tried; it's just a visualization layer for `cProfile` output.

In this case, the creation of this monster task graph would happen serially in the main process even if your goal was eventually to run the processing on a distributed client. The graph (describing only the array chunks) is close to a billion objects here, so you would run into issues even trying to serialize it out to workers.
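For reference, a hedged sketch of that profiling workflow (plain `cProfile` from the standard library; snakeviz is a separate `pip install snakeviz` and is run from the shell afterwards):

```python
import cProfile

import xarray as xr

# Capture a profile of just the open call; the single-threaded graph
# construction shows up as one long stack in the visualization.
with cProfile.Profile() as prof:
    ds = xr.open_zarr(
        "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"
    )

prof.dump_stats("open_zarr.prof")
# Then, from the shell: `snakeviz open_zarr.prof` opens an interactive view.
```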
FYI something like:

```python
ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    chunks=dict(time=-1, level=-1, latitude="auto", longitude="auto"),
)
```

...may give a better balance between dask task graph size and chunk size. (I do this "make the dask chunks bigger than the zarr chunks" trick a lot, because simple dask graphs can become huge with a 50TB dataset, since the default zarr encoding has a maximum chunk size of 2GB. Not sure it's necessarily the best way; very open to ideas...)
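A quick way to sanity-check that trade-off is to inspect the chunking dask actually chose (a small sketch; the variable name `2m_temperature` is only an assumption about this store's naming):

```python
# Chunk sizes per dimension after the "auto" rechunking above.
print(ds.chunksizes)

# Number of dask blocks per dimension for one variable; the task graph
# scales with the product of these counts, summed over all variables.
print(ds["2m_temperature"].data.numblocks)
```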
Edit: never mind, I actually have no idea where this profile came from. Disregard.
How do I get my work done?
This will only generate chunks for a filtered dataset.
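The snippet that accompanied this reply isn't preserved above; a plausible sketch of the workaround (the variable name and chunk sizes are illustrative, not from the thread):

```python
import xarray as xr

url = "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"

# Open without dask so no task graph is built for the full 273-variable store...
ds = xr.open_zarr(url, chunks=None)

# ...then keep only what you need and chunk just those variables.
ds = ds[["2m_temperature"]].chunk({"time": 24})
```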
This should absolutely be the way that xarray works! There is no need to create chunks for variables that are never accessed.
So this runs for 130 seconds on my machine, and two things take up 80 seconds of that:

I have a PR here that will cut 25 seconds off the runtime; the other parts are probably a bit trickier. cc @dcherian
Yup, saving 30s with #9808. The cache is quite effective.
Belatedly realizing that Xarray's call to
xref pydata#8902 xref pydata#1525
What is your issue?
I have an ERA5 dataset stored in a GCS bucket as zarr. It contains 273 weather-related variables and 4 dimensions, with hourly data from 1940 to 2023.
When I try to open it with

```python
ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3")
```

it takes 90 seconds for the open call to finish. The chunk scheme is `{'time': 1}`.
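For anyone wanting to reproduce the number, a minimal timing sketch using only the standard library (the 90 s figure above is the author's; actual time will vary with network and machine):

```python
import time

import xarray as xr

start = time.perf_counter()
ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"
)
print(f"open_zarr took {time.perf_counter() - start:.1f}s")
```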