
Low memory way to write parquet refs for an existing Zarr dataset? #529

Closed
rsignell opened this issue Nov 20, 2024 · 10 comments

Comments

@rsignell

rsignell commented Nov 20, 2024

It takes several minutes to open this ERA5 Zarr dataset on Google Cloud:

import xarray as xr

era5 = xr.open_dataset(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    engine='zarr',
    chunks={})

So I thought I'd try speeding it up by writing the references to parquet (the idea being to then load the dataset via kerchunk).

import kerchunk.zarr
import kerchunk.df

refs = kerchunk.zarr.single_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3", inline_threshold=300)
kerchunk.df.refs_to_dataframe(refs, 'era5.parq')

The only problem is... generating the references blew out the memory on my 256 GB machine (after 4 hours of chugging away)!

Is there another way to do this instead of loading the entire reference dict and then writing the parquet?

@martindurant
Member

Yes, it ought to be totally possible to write parquet references by iterating over the contents of a zarr; it just hasn't been done yet, as the anticipated use case is combining several zarrs rather than speeding up the opening of a single existing zarr.
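
To make that concrete, here is a minimal sketch of the streaming pattern that already exists for the combine path: kerchunk's MultiZarrToZarr can write into fsspec's LazyReferenceMapper, which spills references to parquet incrementally instead of holding the full reference dict in memory. The input reference sets and concat_dims below are placeholders, and the record_size value is arbitrary; presumably a single-zarr path would reuse the same mechanism.

import fsspec
from fsspec.implementations.reference import LazyReferenceMapper
from kerchunk.combine import MultiZarrToZarr

# Parquet-backed reference store: records are flushed out in batches
# rather than accumulating in one giant in-memory dict.
fs = fsspec.filesystem("file")
out = LazyReferenceMapper.create(root="era5.parq", fs=fs, record_size=100_000)

mzz = MultiZarrToZarr(
    ["refs_part1.json", "refs_part2.json"],  # hypothetical per-piece reference sets
    concat_dims=["time"],
    out=out,  # references go to the parquet store as they are generated
)
mzz.translate()
out.flush()  # write any remaining buffered records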

However, the dataset already has a .zmetadata file, so the question is: what is taking the time when opening the dataset? If it's loading many small coordinate arrays that could have been inlined, then kerchunk might indeed help you. If it's producing the dask graph, however (this has been an issue), then kerchunk won't help. You can test by removing the chunks={}, as sketched below.
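
For concreteness, a rough sketch of that comparison (timings will vary; gcsfs is assumed to be installed for the gs:// URL):

import time
import xarray as xr

url = "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"

# With chunks={}, xarray wraps every variable in a dask array, so a task
# graph covering every chunk of every variable is built at open time.
t0 = time.perf_counter()
ds_dask = xr.open_dataset(url, engine="zarr", chunks={})
print("with chunks={}:", time.perf_counter() - t0, "s")

# Without chunks, variables stay as xarray's lazy backend arrays and no
# dask graph is built, so mostly just the consolidated metadata is read.
t0 = time.perf_counter()
ds_lazy = xr.open_dataset(url, engine="zarr")
print("without chunks:", time.perf_counter() - t0, "s")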

@rsignell
Author

Okay, removing chunks={}, it opens in 4s, so indeed it must be the production of the dask task graph that takes all that time. Thanks for the lesson @martindurant!

@martindurant
Member

And to be clear: there is no good reason that dask should take any time to do this. I think it's da.from_numpy internally. I forget where the issue discussing this is, but it's been like this since forever.

@rsignell
Author

rsignell commented Nov 20, 2024

Indeed, I was pondering what this dask graph is for, since we are just opening the dataset. Is it a problem that it's loading the time coordinate as 1M+ dask tasks (since there are 1M+ time steps, and each chunk has only 1 time step)?
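
As a quick (hypothetical) way to see where the tasks come from, one could count the dask chunks backing each data variable after opening with chunks={}; the era5 name refers to the dataset opened at the top of this issue.

import numpy as np

# Each dask-backed variable contributes roughly one task per chunk to the
# graph that is built at open time.
for name, var in era5.data_vars.items():
    data = var.data
    if hasattr(data, "numblocks"):  # dask array
        print(name, int(np.prod(data.numblocks)), "chunks")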

@martindurant
Member

No, I think it's producing tasks for each chunk of each variable. I suppose some of this is on xarray: the dask arrays backing each variable shouldn't even be constructed until that variable is selected!

I ran this myself just now, and indeed the variables are backed by MaterializedLayers; but it seems that generating the layer objects takes 18% of the time and a whopping 38% is in tokenize (?!?!).

[Screenshot 2024-11-20: profiler output showing layer generation at ~18% and tokenize at ~38% of the open time]

I'd say maybe I can fix this... but not until after PyData Global.
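
A minimal sketch of how one might reproduce that kind of profile with the standard library (the exact hotspots will depend on the dask and xarray versions installed):

import cProfile
import pstats
import xarray as xr

url = "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"

# Profile just the open call with chunks={}, where the graph gets built.
with cProfile.Profile() as prof:
    xr.open_dataset(url, engine="zarr", chunks={})

# Sort by cumulative time to see whether layer construction and tokenize
# dominate, as in the screenshot above.
pstats.Stats(prof).sort_stats("cumulative").print_stats(20)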

@rsignell
Author

No, I think it's producing tasks for each chunk of each variable. I suppose some of this is on xarray: the dask arrays backing each variable shouldn't even be constructed until that variable is selected!

Indeed! Well, that would be awesome if this could be fixed, but OBVI no rush, as it's been this way forever! Should a new issue be created somewhere?

@martindurant
Member

I don't mind a "still bad" issue on dask/dask... perhaps linking to the old one, if it can be found.

@dcherian
Contributor

dcherian commented Nov 20, 2024

This has been known for a while, though I can't find the upstream issues :/

cc @phofl

@phofl

phofl commented Nov 20, 2024

I can take a look

@dcherian
Contributor

Found it: pydata/xarray#8902
