SMC sample_stats cannot be saved to netcdf #5263
Comments
That sounds like a cleaner solution. Would you like to open a PR to this effect?
We do need to work on the ArviZ side to support SMC properly instead of forcing its output into an HMC-based structure; however, none of us has the time to do it. I think it is important to note, though, that (at the date of writing) renaming the fake draw dimension to stage, or making chains different variables (and thus having no chain or draw dimension, as they would then be stage_chain1, stage_chain2...), will break everything and you won't be able to plot SMC data or run rhat on it. Whatever the approach, I'd recommend testing the "chosen structure" beforehand on fake data to see what would still work and what would not.
I was actually unaware that xarray already supported ragged data and assumed the nan filling was the default behaviour in such cases. That is interesting. I will play around a bit if I have time.
netCDF does support boolean data and I have never had any problem with that. In fact, if you do |
Thank you both for your comments on this. I see that a solution to this might not be as easy as I thought.
I am not super familiar with all of arviz' functionality, but I do not think it should be a problem with respect to plotting and calculating rhat (correct me if I am mistaken). A stage dimension should only occur in the sample_stats group. In the posterior, only data from the last stage is saved, so the variables would still have a chain and draw dimension. The variables in the sample_stats group (beta, accept_rate) really do not have (and should not have) a draw dimension, as far as I see. I don't know if arviz actually uses the information stored in the SMC sample_stats?
I don't think it does support this, and that is part of the problem. The ragged array is forced into xarray, but then core xarray functionality like label-based indexing does not work properly.
Boolean variables (as in the rugby example) are not a problem. I think that Boolean values for the attrs don't work, however. I will try to verify this.
rhat was a bad example, yes; it would not be affected. I sometimes plot traces, for example for lp, accept rate or stepsize (during warmup), but I don't use SMC and don't know if there would be something equivalent. It wasn't a "definitely don't do this" comment, sorry if it came off this way; it was more of a warning.
Currently only divergences and energy from hmc/nuts are used by ArviZ, other things like what I mentioned above are more manual.
Ooh, I thought it was a variable, my bad. Yeah, attributes in netcdf are very limited; we also have to JSON-encode arrays and dictionaries, as those aren't supported either. The converter should probably make the conversion to string you mentioned to prevent this from happening. As a side note, if keeping the correct types in attributes is important, I would recommend using zarr instead of netcdf. It is still very new but will most probably end up replacing netcdf: https://arviz-devs.github.io/arviz/api/generated/arviz.InferenceData.from_zarr.html#arviz.InferenceData.from_zarr and https://arviz-devs.github.io/arviz/api/generated/arviz.InferenceData.to_zarr.html#arviz.InferenceData.to_zarr
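For illustration, the string/JSON conversion of attributes discussed here could look roughly like the sketch below. It is not the actual ArviZ or PyMC converter: the helper name sanitize_attrs and the example keys threshold and kernel_kwargs are hypothetical; only tune_steps comes from this issue.

```python
import json


def sanitize_attrs(attrs):
    """Return a copy of attrs with values that netCDF attributes may reject
    converted to strings (hypothetical helper, not part of ArviZ)."""
    clean = {}
    for key, value in attrs.items():
        # bool must be checked explicitly: it is a subclass of int in Python
        if isinstance(value, bool):
            clean[key] = str(value)
        # containers are JSON-encoded so they can be decoded again on load
        elif isinstance(value, (dict, list, tuple)):
            clean[key] = json.dumps(value)
        else:
            clean[key] = value
    return clean


# Made-up attributes resembling what a sample_stats group might carry
attrs = {"tune_steps": True, "threshold": 0.5, "kernel_kwargs": {"a": 1}}
clean = sanitize_attrs(attrs)
print(clean)
```

The trade-off is that the original type is lost on a round trip: a reader of the file gets the string "True" back and has to know to convert it.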
I think it's fine to give a new chain unique dim to these sample stats that happen per stage or substage (not sure if we are outputting any substage stats at the moment), and still leave the "fake" chain, draw, for the posterior samples. CC @aloctavodia
When sampling several chains with SMC, the different chains sometimes run a different number of stages. As a consequence, the beta, accept_rate and log_marginal_likelihood variables in the sample_stats of the inference data are non-square. PyMC currently deals with this by giving them an object data type (see this comment in the code).
While this works for converting to xarray I get an error when trying to save the InferenceData to netcdf.
The following example is not guaranteed to reproduce the error because I cannot force the two chains to run a different number of stages. However, I tried to pick an example where the number of stages is large, so that it is not very likely that both chains need the same number of stages.
Complete error traceback
In my opinion it would be nicer to save the sample stats as a square array, even if that means filling up with NaNs in the case that the chains have a different number of stages. Then it could be represented properly in xarray with a stage dimension (which currently does not exist; the sample_stats falsely have "chain" and "draw" dimensions, even though they do not depend on the draw). I think that filling up with NaNs is not too bad because the number of stages does not differ hugely between the chains anyway.
Alternatively, sample stats could be saved as separate variables for each chain, with a separate stage dimension for each chain.
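To make the NaN-filling proposal concrete, here is a minimal sketch using only NumPy. The helper name pad_stages and the beta values are made up for illustration; this is not how PyMC currently stores these statistics.

```python
import numpy as np


def pad_stages(per_chain_values):
    """Stack per-chain sequences of unequal length into a float array of
    shape (chain, stage), filling missing stages with NaN."""
    n_chains = len(per_chain_values)
    max_stages = max(len(values) for values in per_chain_values)
    padded = np.full((n_chains, max_stages), np.nan, dtype=float)
    for chain, values in enumerate(per_chain_values):
        padded[chain, : len(values)] = values
    return padded


# Hypothetical beta schedules: chain 0 ran four stages, chain 1 only three
beta_per_chain = [
    [0.0, 0.1, 0.4, 1.0],
    [0.0, 0.2, 1.0],
]
beta = pad_stages(beta_per_chain)
print(beta.shape)
print(beta.dtype)
```

The result is a regular float array rather than an object array, so it could be wrapped as an xarray.DataArray with dims ("chain", "stage"). The cost is that downstream code has to use NaN-aware reductions such as np.nanmax.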
Or would there be other solutions? Please let me know what you think about this and I would be happy to provide a pull request.
Besides, there is one more problem when saving to netcdf: even if the chains do have the same number of stages, I get an error because the tune_steps attribute of the sample stats has a Boolean data type, which does not seem to be supported in netcdf. When I convert it to a string, saving works.
Complete error traceback
Versions and main components