There are some use cases where a model can be decomposed into a bunch of parallel runs -- for instance, breaking up a high-dimensional dataset into many univariate datasets, where each dimension is fitted independently.
When fitting these models via MCMC or similar methods, the result can be many az.InferenceData objects.
In my own use cases, this could be anywhere from hundreds to tens of thousands of dimensions.
If the standard xr.concat was used to combine these models, it could take days to merge everything -- even with dask enabled. There is a workaround -- load all of the az.InferenceData objects into memory and rechunk everything into dask -- but it is non-trivial.
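For concreteness, the slow path looks roughly like this (a sketch; the file paths, the "feature" dimension name, and concatenating only the posterior group are illustrative assumptions, not the actual pipeline):

```python
from glob import glob

import arviz as az
import xarray as xr

# One netCDF file per fitted dimension (path pattern is made up).
paths = sorted(glob("fits/*.nc"))

# Read every file into its own InferenceData object (cheap compared to
# the concatenation below).
idatas = [az.from_netcdf(path) for path in paths]

# Concatenate the per-dimension posteriors along a new "feature" dimension;
# with tens of thousands of small objects this xr.concat call is the bottleneck.
posterior = xr.concat([idata.posterior for idata in idatas], dim="feature")
```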
Thoughts on implementation
It could be useful if xr.open_mfdataset-like functionality were available to merge many az.InferenceData objects from many files. Furthermore, there is out-of-the-box support in xr.open_mfdataset for dask, which helps better leverage parallelism when reading these files into memory.
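For comparison, this is roughly what xarray already offers for plain datasets (the path pattern and concat dimension here are made up):

```python
import xarray as xr

# Open many netCDF files as a single (dask-backed) Dataset in one call.
ds = xr.open_mfdataset(
    "runs/*.nc",
    combine="nested",      # combine files by position rather than by coords
    concat_dim="feature",  # hypothetical dimension to concatenate along
    parallel=True,         # open/preprocess files in parallel via dask
)
```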
The actual implementation
If I had to guesstimate what this would look like (based on the xarray source), it would be something like the following.
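A minimal sketch of the idea, assuming netCDF files on disk and a concatenate_inferences helper that merges a list of az.InferenceData objects into one (names like open_mf_inference_data are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

import arviz as az


def open_mf_inference_data(paths, combine, max_workers=4):
    """Hypothetical xr.open_mfdataset-style helper for InferenceData files."""
    # Read each netCDF file into its own InferenceData object; a thread pool
    # gives some parallelism for the I/O-bound reads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        idatas = list(pool.map(az.from_netcdf, paths))
    # Merge everything into a single InferenceData object.
    return combine(idatas)


# e.g. idata = open_mf_inference_data(sorted(glob("fits/*.nc")), combine=concatenate_inferences)
```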
where concatenate_inferences is already defined here. This implementation works with hundreds of files, but it breaks down with tens of thousands of files. I have been able to get an implementation working at that scale, but it is quite a hack, requiring splitting the problem into smaller, manageable chunks.
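One way such chunk-wise merging could look (a sketch, not the actual hack; combine stands in for something like concatenate_inferences):

```python
import arviz as az


def combine_in_batches(paths, combine, batch_size=100):
    """Merge many InferenceData files without one huge concatenation call."""
    partial = []
    for start in range(0, len(paths), batch_size):
        # Merge one manageable batch of files at a time...
        batch = [az.from_netcdf(p) for p in paths[start:start + batch_size]]
        partial.append(combine(batch))
    # ...then merge the per-batch results into the final object.
    return combine(partial)
```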
We have an az.concat function, but it is not really meant for anything this big. To support something like this, I think the new functionality should work against az.concat(dim="chain") (or dim="draw"). I'm not sure whether it should also work across groups?
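For reference, a minimal example of what az.concat along the chain dimension does today (the toy values are made up):

```python
import arviz as az

# Two single-chain fits of the same model (1 chain x 3 draws each).
idata_a = az.from_dict(posterior={"mu": [[0.1, 0.2, 0.3]]})
idata_b = az.from_dict(posterior={"mu": [[0.4, 0.5, 0.6]]})

# Stack them along the chain dimension -> posterior with 2 chains x 3 draws.
combined = az.concat(idata_a, idata_b, dim="chain")
```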
I wonder if there is a way to create the final xarray object without creating many intermediate xarray objects along the way.
Relevant posts
#1749
pydata/xarray#5657
https://github.com/gibsramen/BIRDMAn/issues/57
Relevant papers that leverage this type of approach
https://science.sciencemag.org/content/364/6435/89.abstract
https://www.biorxiv.org/content/10.1101/757096v1
I imagine this will become increasingly common when dealing with high dimensional biological data.
@OriolAbril, do you anticipate that this sort of functionality would be useful to have in arviz?
CC @gibsramen