
ENH: Enable xr.open_mfdataset like functionality #1750

Open · mortonjt opened this issue Jul 31, 2021 · 1 comment

mortonjt (Contributor) commented Jul 31, 2021
Tell us about it

There are some use cases where a model can be decomposed into many parallel runs -- for instance, breaking a high-dimensional dataset into many univariate datasets, where each dimension is fitted independently.

When fitting these models via MCMC or similar, the resulting fits can produce many az.InferenceData objects.
In my own use cases, this can be anywhere from hundreds to tens of thousands of dimensions.
If the standard xr.concat were used to combine these models, it could take days to merge them all, even with dask enabled. There is a workaround, namely loading all of the az.InferenceData objects into memory and rechunking everything into dask, but it is non-trivial (a minimal sketch is given below).
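For concreteness, a minimal sketch of that workaround, assuming a placeholder list of file paths paths and a made-up "feature" dimension along which the univariate fits are stacked:

import arviz as az
import xarray as xr

# Naive workaround: load everything eagerly, concatenate the posterior group
# along a new dimension, then rechunk so downstream operations run on dask.
idatas = [az.from_netcdf(p) for p in paths]  # paths is a placeholder list
posterior = xr.concat([idata.posterior for idata in idatas], dim="feature")
combined = az.InferenceData(posterior=posterior.chunk({"feature": 100}))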

Thoughts on implementation

It would be useful if xr.open_mfdataset-like functionality were available to merge many az.InferenceData objects from many files. Furthermore, xr.open_mfdataset has out-of-the-box support for dask, which helps better leverage parallelism when reading these files into memory.
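For reference, the analogous xarray call looks roughly like this (the glob pattern and the "feature" dimension are purely illustrative):

import xarray as xr

# Open many netCDF files as a single dask-backed Dataset; with parallel=True
# the per-file reads are wrapped in dask.delayed.
ds = xr.open_mfdataset(
    "fits/*.nc", combine="nested", concat_dim="feature", parallel=True
)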

The actual implementation

If I had to guesstimate what this would look like (based on the xarray source), it would be something like the following:

from typing import List

import arviz as az
import dask


def open_mf_inference_data(
    inf_paths: List[str], coords: dict, concatenate_name: str,
    group_kwargs: dict, parallel: bool = True,
) -> az.InferenceData:
    # Read each file lazily via dask.delayed when parallel=True.
    open_f = dask.delayed(az.from_netcdf) if parallel else az.from_netcdf
    inf_list = [open_f(p, group_kwargs=group_kwargs) for p in inf_paths]
    if parallel:
        inf_list = list(dask.compute(*inf_list))  # materialize delayed objects
    combined = concatenate_inferences(inf_list, coords, concatenate_name)
    return combined
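Usage might then look something like this (the file pattern, coordinate values, and dimension name are hypothetical):

import glob

# Merge many per-feature fits along a new "feature" dimension.
paths = sorted(glob.glob("fits/feature_*.nc"))
idata = open_mf_inference_data(
    paths, coords={"feature": list(range(len(paths)))},
    concatenate_name="feature", group_kwargs={}, parallel=True,
)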

The concatenate_inferences helper referenced above is already defined here. This implementation already works with hundreds of files, but it breaks down at tens of thousands of files. I have been able to get an implementation working at that scale, but it is quite a hack, requiring the problem to be subset into smaller, manageable chunks; a rough sketch follows.
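A rough sketch of that chunked hack, reusing open_mf_inference_data from above (the batch size of 500 is arbitrary):

def open_in_batches(inf_paths, coords, concatenate_name, group_kwargs,
                    batch_size=500):
    # Merge files in manageable batches, then merge the per-batch results,
    # so no single concatenation step sees tens of thousands of objects.
    partial = []
    for i in range(0, len(inf_paths), batch_size):
        batch = inf_paths[i:i + batch_size]
        partial.append(open_mf_inference_data(
            batch, coords, concatenate_name, group_kwargs))
    return concatenate_inferences(partial, coords, concatenate_name)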

Relevant posts
#1749

pydata/xarray#5657

https://github.com/gibsramen/BIRDMAn/issues/57

Relevant papers that leverage this type of approach
https://science.sciencemag.org/content/364/6435/89.abstract
https://www.biorxiv.org/content/10.1101/757096v1

I imagine this will become increasingly common when dealing with high-dimensional biological data.

@OriolAbril, do you anticipate that this sort of functionality would be useful to have in arviz?

CC @gibsramen

ahartikainen (Contributor) commented
Hi, this is an interesting idea.

We have an az.concat function, but it is not really meant for anything this big. To support something like this, I think the new functionality should work against az.concat(dim="chain") (or dim="draw"). I'm not sure whether it should also work across groups.
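For example, az.concat currently supports something like this (idata1 and idata2 being two runs of the same model, named here only for illustration):

import arviz as az

# Join two InferenceData objects along the chain dimension
# (dim="draw" would stack them along draws instead).
combined = az.concat(idata1, idata2, dim="chain")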

I wonder if there is a way to create the final xarray object without creating many intermediate xarray objects along the way.
