
Low memory way to write parquet refs for an existing Zarr dataset? #529

Closed
rsignell opened this issue Nov 20, 2024 · 10 comments

Comments

@rsignell

rsignell commented Nov 20, 2024

It takes several minutes to open this ERA5 Zarr dataset on Google Cloud:

import xarray as xr

era5 = xr.open_dataset(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    engine='zarr',
    chunks={})

So I thought I'd try speeding it up by writing the references to parquet (the idea being to then load the dataset via kerchunk).

import kerchunk.zarr
import kerchunk.df

refs = kerchunk.zarr.single_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3", inline_threshold=300)
kerchunk.df.refs_to_dataframe(refs, 'era5.parq')

The only problem is... generating the references blew out the memory on my 256 GB machine (after 4 hours of chugging away)!

Is there another way to do this instead of loading the entire reference dict and then writing the parquet?

@martindurant
Member

Yes, it ought to be totally possible to write parquet references by iterating over the contents of a zarr; it just hasn't been done yet, as the anticipated use case is combining several zarrs rather than speeding up the opening of a single existing zarr.
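
To make that concrete, here is a minimal sketch of the streaming pattern that already exists for the combine path: kerchunk's MultiZarrToZarr can write into fsspec's LazyReferenceMapper, which spills references to parquet incrementally instead of holding the full reference dict in memory. The input reference sets and concat_dims below are placeholders, and the record_size value is arbitrary; presumably a single-zarr path would reuse the same mechanism.

import fsspec
from fsspec.implementations.reference import LazyReferenceMapper
from kerchunk.combine import MultiZarrToZarr

# Parquet-backed reference store: records are flushed out in batches
# rather than accumulating in one giant in-memory dict.
fs = fsspec.filesystem("file")
out = LazyReferenceMapper.create(root="era5.parq", fs=fs, record_size=100_000)

mzz = MultiZarrToZarr(
    ["refs_part1.json", "refs_part2.json"],  # hypothetical per-piece reference sets
    concat_dims=["time"],
    out=out,  # references go to the parquet store as they are generated
)
mzz.translate()
out.flush()  # write any remaining buffered records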

However, the dataset already has a .zmetadata file, so the question is: what is taking the time when opening the dataset? If it's loading many small coordinate arrays that could have been inlined, then kerchunk might indeed help you. If it's producing the dask graph, however (this has been an issue), then kerchunk won't help. You can test by removing the chunks={}, as sketched below.
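
For concreteness, a rough sketch of that comparison (timings will vary; gcsfs is assumed to be installed for the gs:// URL):

import time
import xarray as xr

url = "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"

# With chunks={}, xarray wraps every variable in a dask array, so a task
# graph covering every chunk of every variable is built at open time.
t0 = time.perf_counter()
ds_dask = xr.open_dataset(url, engine="zarr", chunks={})
print("with chunks={}:", time.perf_counter() - t0, "s")

# Without chunks, variables stay as xarray's lazy backend arrays and no
# dask graph is built, so mostly just the consolidated metadata is read.
t0 = time.perf_counter()
ds_lazy = xr.open_dataset(url, engine="zarr")
print("without chunks:", time.perf_counter() - t0, "s")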

@rsignell
Author

Okay, removing chunks={}, it opens in 4s, so indeed it must be the production of the dask task graph that takes all that time. Thanks for the lesson @martindurant!

@martindurant
Member

And to be clear: there is no good reason that dask should take any time to do this. I think it's da.from_numpy internally. I forget where the issue discussing this is, but it's been like this since forever.

@rsignell
Author

rsignell commented Nov 20, 2024

Indeed, I was pondering what this dask graph is for, since we are just opening the dataset. Is it a problem that it's loading the time coordinate as 1M+ dask tasks (since there are 1M+ time steps, and each chunk has only 1 time step)?
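
As a quick (hypothetical) way to see where the tasks come from, one could count the dask chunks backing each data variable after opening with chunks={}; the era5 name refers to the dataset opened at the top of this issue.

import numpy as np

# Each dask-backed variable contributes roughly one task per chunk to the
# graph that is built at open time.
for name, var in era5.data_vars.items():
    data = var.data
    if hasattr(data, "numblocks"):  # dask array
        print(name, int(np.prod(data.numblocks)), "chunks")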

@martindurant
Member

No, I think it's producing tasks for each chunk of each variable. I suppose some of this is on xarray: the dask arrays backing each variable shouldn't even be constructed until that variable is selected!

I ran this myself just now, and indeed the variables are backed by MaterializedLayers; but it seems that generating the layer objects takes 18% of the time and a whopping 38% is in tokenize (?!?!).

[Screenshot 2024-11-20: profiler output showing layer generation at ~18% and tokenize at ~38% of the open time]

I'd say maybe I can fix this... but not until after PyData Global.
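
A minimal sketch of how one might reproduce that kind of profile with the standard library (the exact hotspots will depend on the dask and xarray versions installed):

import cProfile
import pstats
import xarray as xr

url = "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"

# Profile just the open call with chunks={}, where the graph gets built.
with cProfile.Profile() as prof:
    xr.open_dataset(url, engine="zarr", chunks={})

# Sort by cumulative time to see whether layer construction and tokenize
# dominate, as in the screenshot above.
pstats.Stats(prof).sort_stats("cumulative").print_stats(20)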

@rsignell
Author

No, I think it's producing tasks for each chunk of each variable. I suppose some of this is on xarray: the dask arrays backing each variable shouldn't even be constructed until that variable is selected!

Indeed! Well, that would be awesome if this could be fixed, but OBVI no rush, as it's been this way forever! Should a new issue be created somewhere?

@martindurant
Member

I don't mind a "still bad" issue on dask/dask... perhaps linking to the old one, if it can be found.

@dcherian
Contributor

dcherian commented Nov 20, 2024

This has been known for a while, though I can't find the upstream issues :/

cc @phofl

@phofl

phofl commented Nov 20, 2024

I can take a look

@dcherian
Contributor

Found it: pydata/xarray#8902
