Parallelization via dask #7

TomNicholas · 2024-03-08T23:04:46Z

There are two places we could use xarray's machinery for parallelization to potentially speed up the generation of references.

1. Using `parallel=True` in `xr.open_mfdataset`, which would then use `dask.delayed` to parallelize the generation of the byte ranges from each file. This could be a big speedup, as it would parallelize the opening of the legacy files.
2. In theory we could also wrap the `ManifestArray` objects with `dask.Array`, then use dask's tree-reduce to do the concatenation. I think this is roughly what `kerchunk.combine.auto_dask` is approximating. However, I'm not totally confident that (a) this is set up to work right now in `dask.array` or (b) this is actually a performance bottleneck in practice.
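
For illustration, here is a minimal sketch of the two stages described above: generating references for each file in parallel, then combining them pairwise in a tree. It deliberately uses only the standard library (`concurrent.futures`) as a stand-in for `dask.delayed` and dask's tree-reduce, so it shows the shape of the computation rather than the actual xarray or dask code path. `make_reference`, `combine`, and the file names are hypothetical placeholders, not part of any library.

```python
from concurrent.futures import ThreadPoolExecutor


def make_reference(path):
    """Hypothetical stand-in for opening one legacy file and reading its byte ranges."""
    # A real implementation would inspect the file; this just returns a dummy entry.
    return {path: {"offset": 0, "length": 100}}


def combine(left, right):
    """Hypothetical stand-in for concatenating two sets of references."""
    return {**left, **right}


def tree_reduce(items, combine_fn, executor):
    """Combine items pairwise, level by level; merges within a level run concurrently."""
    while len(items) > 1:
        pairs = [items[i:i + 2] for i in range(0, len(items), 2)]
        # Submit one merge per full pair; an odd item out is carried to the next level.
        futures = [
            executor.submit(combine_fn, *pair) if len(pair) == 2 else None
            for pair in pairs
        ]
        items = [
            fut.result() if fut is not None else pair[0]
            for fut, pair in zip(futures, pairs)
        ]
    return items[0]


paths = [f"file_{i}.nc" for i in range(5)]
with ThreadPoolExecutor(max_workers=4) as executor:
    # Stage 1: per-file reference generation, in parallel.
    per_file = list(executor.map(make_reference, paths))
    # Stage 2: pairwise combination instead of one big serial concatenation.
    combined = tree_reduce(per_file, combine, executor)

print(sorted(combined))
```

With a distributed scheduler the point of the tree shape is that intermediate results can be merged where they were produced; this thread-based sketch does not demonstrate that, only the ordering of the work.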


TomNicholas · 2024-06-11T19:30:30Z

earthaccess uses dask.delayed to parallelize its kerchunk SingleHdf5ToZarr calls:

https://github.com/nsidc/earthaccess/blob/main/earthaccess/kerchunk.py#L47

@dcherian instead used dask.bag to do a tree-reduce inside xr.open_mfdataset without shipping all of the datasets back to the head node, which might be exactly what we need:

pydata/xarray#8523

TomNicholas added the "xarray" (Requires changes to xarray upstream) label Mar 10, 2024

TomNicholas mentioned this issue Jun 11, 2024

Aspirational use case: [C]Worthy mCDR OAE Atlas dataset #132

Open

21 tasks

TomNicholas added the "references generation" (Reading byte ranges from archival files) and "performance" labels Nov 20, 2024

TomNicholas mentioned this issue Nov 20, 2024

PetaByte-scale virtual ref performance benchmark earth-mover/icechunk#401

Open

This was referenced Dec 12, 2024

Opening virtual datasets (dmr-adapter) nsidc/earthaccess#606

Merged

open_virtual_mfdataset #345

Open
