Low memory way to write parquet refs for an existing Zarr dataset? #529
It takes several minutes to open this ERA5 Zarr dataset on Google Cloud, so I thought I'd try speeding it up by writing the references to parquet (the idea being to then load the dataset with kerchunk). The only problem is that generating the references blew out the memory on my 256 GB machine after four hours of chugging away. Is there another way to do this, instead of loading the entire reference dict and then writing the parquet?
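For concreteness, the workflow described above (build the complete reference dict in memory, then convert it to parquet) looks roughly like the sketch below. The helper names and the placeholder URL are assumptions for illustration; the issue does not show the exact code that was run.

```python
import kerchunk.zarr
import kerchunk.df

# Placeholder URL -- the actual store is not spelled out in the issue text.
url = "gs://<era5-zarr-store>"

# Build one reference entry per chunk, all held in a single in-memory dict.
# For a store with millions of tiny chunks, this dict is what exhausts RAM.
refs = kerchunk.zarr.single_zarr(url)

# Only after the whole dict exists is it converted to a parquet reference set.
kerchunk.df.refs_to_dataframe(refs, "era5_refs.parq")
```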
Yes, it ought to be entirely possible to write parquet references by iterating over the contents of a zarr; it just hasn't been done yet, as the anticipated use case is combining several zarrs rather than speeding up the opening of a single existing zarr. However, the dataset already has a .zmetadata file, so the question is: what is taking the time when opening the dataset? If it's loading many small coordinate arrays that could have been inlined, then kerchunk might indeed help you. If it's producing the dask graph, however (this has been an issue), then kerchunk won't help. You can test by removing the …
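One conceivable way to do the iteration described here, sketched below, is to stream references into fsspec's LazyReferenceMapper, which flushes record batches to parquet as it fills up instead of holding everything in memory. This is only an illustration under assumptions (a recent fsspec that provides LazyReferenceMapper, a gs:// store, placeholder paths); as the comment notes, it is not an existing kerchunk feature.

```python
import fsspec
from fsspec.implementations.reference import LazyReferenceMapper

# Placeholders / assumptions throughout -- not code from this thread.
src = fsspec.filesystem("gs")
store = "<bucket>/<era5-zarr-store>"           # path of the existing zarr, without protocol

out_fs = fsspec.filesystem("file")
out_fs.makedirs("era5_refs.parq", exist_ok=True)
out = LazyReferenceMapper.create(root="era5_refs.parq", fs=out_fs, record_size=100_000)

for path in src.find(store):
    key = path[len(store):].lstrip("/")        # chunk/metadata key relative to the store root
    name = key.rsplit("/", 1)[-1]
    if name.startswith("."):                   # .zarray / .zattrs / .zgroup / .zmetadata
        out[key] = src.cat_file(path)          # inline small metadata as bytes
    else:
        size = src.info(path)["size"]
        out[key] = [f"gs://{path}", 0, size]   # each chunk is a whole object at this URL
    # The mapper writes completed record batches to parquet as it goes,
    # so the full reference set never has to sit in memory at once.

out.flush()
```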
Okay, removing …
And to be clear: there is no good reason that dask should take any time to do this. I think it's da.from_numpy internally. I forget where the issue discussing this is, but it's been like this since forever.
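The test suggested earlier is truncated above; assuming it referred to opening the store without the dask layer (my assumption, not necessarily what was actually written), the comparison might look something like this:

```python
import time
import xarray as xr

url = "gs://<era5-zarr-store>"  # placeholder for the dataset in question

t0 = time.perf_counter()
ds = xr.open_zarr(url)               # default: every variable becomes a dask array
print("with dask layer:   ", time.perf_counter() - t0)

t0 = time.perf_counter()
ds = xr.open_zarr(url, chunks=None)  # no dask graph; lazily indexed numpy-backed arrays
print("without dask layer:", time.perf_counter() - t0)
```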
Indeed, I was pondering what this dask graph is even for, since we are just opening the dataset. Is the problem that it's loading the time coordinate as 1M+ dask tasks (since there are 1M+ time steps and each chunk holds only one time step)?
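A minimal, self-contained illustration of that kind of overhead (the sizes are made up to mirror the 1M+ single-step chunks described above):

```python
import time
import numpy as np
import dask.array as da

t = np.arange(1_000_000)

t0 = time.perf_counter()
# One element per chunk mirrors a time coordinate stored with one step per chunk:
# the resulting dask array has ~1,000,000 chunks, and just building and handling
# a graph that size is where much of the "open" time can go.
x = da.from_array(t, chunks=1)
print(x.numblocks, "chunks; construction took", time.perf_counter() - t0, "s")
```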
Indeed! It would be awesome if this could be fixed, but obviously no rush, as it's been this way forever! Should a new issue be created somewhere?
I don't mind a "still bad" issue on dask/dask... perhaps linking to the old one, if it can be found.
This has been known for a while, though I can't find the upstream issues :/ cc @phofl
I can take a look.
Found it: pydata/xarray#8902