
ENH: Enable xr.open_mfdataset like functionality #1750

Open · mortonjt opened this issue Jul 31, 2021 · 1 comment

mortonjt (Contributor) commented Jul 31, 2021
Tell us about it

There are some use cases where a model can be decomposed into many parallel runs -- for instance, breaking a high-dimensional dataset into many univariate datasets, where each dimension is fitted independently.

When fitting these models via MCMC or similar, the resulting fits can produce many az.InferenceData objects.
In my own use cases, this can be anywhere from hundreds to tens of thousands of dimensions.
If the standard xr.concat were used to combine these models, it could take days to merge them all, even with dask enabled. There is a workaround, namely loading all of the az.InferenceData objects into memory and rechunking everything into dask, but it is non-trivial (a minimal sketch is given below).
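For concreteness, a minimal sketch of that workaround, assuming a placeholder list of file paths paths and a made-up "feature" dimension along which the univariate fits are stacked:

import arviz as az
import xarray as xr

# Naive workaround: load everything eagerly, concatenate the posterior group
# along a new dimension, then rechunk so downstream operations run on dask.
idatas = [az.from_netcdf(p) for p in paths]  # paths is a placeholder list
posterior = xr.concat([idata.posterior for idata in idatas], dim="feature")
combined = az.InferenceData(posterior=posterior.chunk({"feature": 100}))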

Thoughts on implementation

It would be useful if xr.open_mfdataset-like functionality were available to merge many az.InferenceData objects from many files. Furthermore, xr.open_mfdataset has out-of-the-box support for dask, which helps better leverage parallelism when reading these files into memory.
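For reference, the analogous xarray call looks roughly like this (the glob pattern and the "feature" dimension are purely illustrative):

import xarray as xr

# Open many netCDF files as a single dask-backed Dataset; with parallel=True
# the per-file reads are wrapped in dask.delayed.
ds = xr.open_mfdataset(
    "fits/*.nc", combine="nested", concat_dim="feature", parallel=True
)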

The actual implementation

If I had to guesstimate what this would look like (based on the xarray source), it would be something like the following:

from typing import List

import arviz as az
import dask


def open_mf_inference_data(
    inf_paths: List[str], coords: dict, concatenate_name: str,
    group_kwargs: dict, parallel: bool = True,
) -> az.InferenceData:
    # Read each file lazily via dask.delayed when parallel=True.
    open_f = dask.delayed(az.from_netcdf) if parallel else az.from_netcdf
    inf_list = [open_f(p, group_kwargs=group_kwargs) for p in inf_paths]
    if parallel:
        inf_list = list(dask.compute(*inf_list))  # materialize delayed objects
    combined = concatenate_inferences(inf_list, coords, concatenate_name)
    return combined
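Usage might then look something like this (the file pattern, coordinate values, and dimension name are hypothetical):

import glob

# Merge many per-feature fits along a new "feature" dimension.
paths = sorted(glob.glob("fits/feature_*.nc"))
idata = open_mf_inference_data(
    paths, coords={"feature": list(range(len(paths)))},
    concatenate_name="feature", group_kwargs={}, parallel=True,
)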

The concatenate_inferences helper referenced above is already defined here. This implementation already works with hundreds of files, but it breaks down at tens of thousands of files. I have been able to get an implementation working at that scale, but it is quite a hack, requiring the problem to be subset into smaller, manageable chunks; a rough sketch follows.
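A rough sketch of that chunked hack, reusing open_mf_inference_data from above (the batch size of 500 is arbitrary):

def open_in_batches(inf_paths, coords, concatenate_name, group_kwargs,
                    batch_size=500):
    # Merge files in manageable batches, then merge the per-batch results,
    # so no single concatenation step sees tens of thousands of objects.
    partial = []
    for i in range(0, len(inf_paths), batch_size):
        batch = inf_paths[i:i + batch_size]
        partial.append(open_mf_inference_data(
            batch, coords, concatenate_name, group_kwargs))
    return concatenate_inferences(partial, coords, concatenate_name)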

Relevant posts
#1749

pydata/xarray#5657

https://github.com/gibsramen/BIRDMAn/issues/57

Relevant papers that leverage this type of approach
https://science.sciencemag.org/content/364/6435/89.abstract
https://www.biorxiv.org/content/10.1101/757096v1

I imagine this will become increasingly common when dealing with high-dimensional biological data.

@OriolAbril, do you anticipate that this sort of functionality would be useful to have in arviz?

CC @gibsramen

ahartikainen (Contributor) commented
Hi, this is an interesting idea.

We have an az.concat function, but it is not really meant for anything this big. To support something like this, I think the new functionality should work against az.concat(dim="chain") (or dim="draw"). I'm not sure whether it should also work across groups.
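For example, az.concat currently supports something like this (idata1 and idata2 being two runs of the same model, named here only for illustration):

import arviz as az

# Join two InferenceData objects along the chain dimension
# (dim="draw" would stack them along draws instead).
combined = az.concat(idata1, idata2, dim="chain")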

I wonder if there is a way to create the final xarray object without creating many intermediate xarray objects along the way.
