Parallelization via dask #7

Open
TomNicholas opened this issue Mar 8, 2024 · 1 comment
Labels: performance, references generation (Reading byte ranges from archival files), xarray (Requires changes to xarray upstream)

Comments

@TomNicholas

There are two places we could use xarray's machinery for parallelization to potentially speed up the generation of references.

  1. Using parallel=True in xr.open_mfdataset, which would then use dask.delayed to parallelize the generation of the byte ranges from each file. This could be a big speedup, as it would parallelize the opening of the legacy files.

  2. In theory we could also wrap the ManifestArray objects with dask.Array, then use dask's tree-reduce to do the concatenation. I think this is roughly what kerchunk.combine.auto_dask is approximating. However, I'm not totally confident that (a) this is set up to work right now in dask.array or (b) the concatenation step is actually a performance bottleneck in practice.
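
As a rough illustration of the pattern behind option 1 (not xarray's actual implementation), here is a minimal sketch that wraps a per-file function in dask.delayed and computes all files together. The function generate_references, its return value, and the file names are hypothetical placeholders; a real version would open each legacy file and read its byte ranges.

```python
import dask
from dask import delayed


def generate_references(path):
    # Hypothetical stand-in for opening one legacy file and extracting
    # its byte ranges. The single (offset, length) entry is made up.
    return {path: {"offset": 0, "length": 100}}


paths = [f"file_{i}.nc" for i in range(4)]  # placeholder file names

# Build one lazy task per file; nothing runs yet.
tasks = [delayed(generate_references)(p) for p in paths]

# Run all tasks on the threaded scheduler. dask.compute returns a tuple
# with one result per task, in the same order as the inputs.
per_file_refs = dask.compute(*tasks, scheduler="threads")

# Combine serially on the client. This is the step that option 2 would
# replace with a tree-reduce.
combined = {}
for refs in per_file_refs:
    combined.update(refs)
```

Note that in this sketch every per-file result is sent back to the client before being combined, which is fine for small reference sets but may not scale to very many files.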

@TomNicholas TomNicholas added the xarray Requires changes to xarray upstream label Mar 10, 2024
@TomNicholas

earthaccess wraps its kerchunk SingleHdf5ToZarr calls in dask.delayed:

https://github.com/nsidc/earthaccess/blob/main/earthaccess/kerchunk.py#L47

@dcherian instead used dask.bag to do a tree-reduce inside xr.open_mfdataset without shipping all the datasets back to the head node, which might be exactly what we need:

pydata/xarray#8523
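
To make the tree-reduce idea concrete, here is a minimal sketch using dask.bag's reduction method with split_every. This is not the code from the xarray PR above. The names generate_references, combine_many and paths are hypothetical, and plain list concatenation stands in for combining real per-file references.

```python
import dask.bag as db


def generate_references(path):
    # Hypothetical per-file step: returns a list of reference records.
    return [path]


def combine_many(parts):
    # Merge an iterable of partial results into one. It is used both
    # within a partition and when aggregating across partitions.
    out = []
    for part in parts:
        out = out + part
    return out


paths = [f"file_{i}.nc" for i in range(8)]  # placeholder file names

# One bag element per file, spread over several partitions.
bag = db.from_sequence(paths, npartitions=4).map(generate_references)

# split_every=2 makes dask merge partial results two at a time in a tree,
# rather than gathering every per-file result in one place first.
combined = bag.reduction(
    perpartition=combine_many,
    aggregate=combine_many,
    split_every=2,
).compute(scheduler="threads")
```

With a distributed scheduler the intermediate merges would run on the workers, so only the final combined result needs to travel back to the client. A real combine step would also need to keep track of file order, which this sketch does not rely on.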
