SMC sample_stats cannot be saved to netcdf #5263

Open
Tracked by #7053
astoeriko opened this issue Dec 15, 2021 · 6 comments
Labels
beginner friendly enhancements SMC Sequential Monte Carlo trace-backend Traces and ArviZ stuff

Comments

@astoeriko

When sampling several chains with SMC, the different chains sometimes run a different number of stages. As a consequence, the beta, accept_rate and log_marginal_likelihood variables in the sample_stats of the inference data are ragged (non-rectangular). PyMC currently deals with this by giving them an object data type (see this comment in the code).
While this works for converting to xarray, I get an error when trying to save the InferenceData to netCDF.

The following example is not guaranteed to reproduce the error, because I cannot force the two chains to run a different number of stages. However, I tried to pick an example where the number of stages is large, so that it is unlikely that both chains end up with the same number of stages.

import pymc as pm

with pm.Model() as model:
    # Create a model that leads to many stages in SMC because the posterior is
    # far from the prior.
    bar = pm.Normal("bar", sigma=0.5)
    pm.Normal("foo", mu=bar, observed=100, sigma=0.5)
    trace = pm.sample_smc(cores=1, draws=100)
trace.to_netcdf("test_sample_stats.nc")
Complete error traceback
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'list'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_106373/4238811079.py in <module>
----> 1 trace.to_netcdf("test_sample_stats.nc")

~/mambaforge/envs/pymc4-dev/lib/python3.9/site-packages/arviz/data/inference_data.py in to_netcdf(self, filename, compress, groups)
    390                 if compress:
    391                     kwargs["encoding"] = {var_name: {"zlib": True} for var_name in data.variables}
--> 392                 data.to_netcdf(filename, mode=mode, group=group, **kwargs)
    393                 data.close()
    394                 mode = "a"

~/mambaforge/envs/pymc4-dev/lib/python3.9/site-packages/xarray/core/dataset.py in to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   1900         from ..backends.api import to_netcdf
   1901 
-> 1902         return to_netcdf(
   1903             self,
   1904             path,

~/mambaforge/envs/pymc4-dev/lib/python3.9/site-packages/xarray/backends/api.py in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
   1070         # TODO: allow this work (setting up the file for writing array data)
   1071         # to be parallelized with dask
-> 1072         dump_to_store(
   1073             dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
   1074         )

~/mambaforge/envs/pymc4-dev/lib/python3.9/site-packages/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1117         variables, attrs = encoder(variables, attrs)
   1118 
-> 1119     store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
   1120 
   1121 

~/mambaforge/envs/pymc4-dev/lib/python3.9/site-packages/xarray/backends/common.py in store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    259             writer = ArrayWriter()
    260 
--> 261         variables, attributes = self.encode(variables, attributes)
    262 
    263         self.set_attributes(attributes)

~/mambaforge/envs/pymc4-dev/lib/python3.9/site-packages/xarray/backends/common.py in encode(self, variables, attributes)
    348         # All NetCDF files get CF encoded by default, without this attempting
    349         # to write times, for example, would fail.
--> 350         variables, attributes = cf_encoder(variables, attributes)
    351         variables = {k: self.encode_variable(v) for k, v in variables.items()}
    352         attributes = {k: self.encode_attribute(v) for k, v in attributes.items()}

~/mambaforge/envs/pymc4-dev/lib/python3.9/site-packages/xarray/conventions.py in cf_encoder(variables, attributes)
    853     _update_bounds_encoding(variables)
    854 
--> 855     new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
    856 
    857     # Remove attrs from bounds variables (issue #2921)

~/mambaforge/envs/pymc4-dev/lib/python3.9/site-packages/xarray/conventions.py in <dictcomp>(.0)
    853     _update_bounds_encoding(variables)
    854 
--> 855     new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
    856 
    857     # Remove attrs from bounds variables (issue #2921)

~/mambaforge/envs/pymc4-dev/lib/python3.9/site-packages/xarray/conventions.py in encode_cf_variable(var, needs_copy, name)
    273     var = maybe_default_fill_value(var)
    274     var = maybe_encode_bools(var)
--> 275     var = ensure_dtype_not_object(var, name=name)
    276 
    277     for attr_name in CF_RELATED_DATA:

~/mambaforge/envs/pymc4-dev/lib/python3.9/site-packages/xarray/conventions.py in ensure_dtype_not_object(var, name)
    231             data[missing] = fill_value
    232         else:
--> 233             data = _copy_with_dtype(data, dtype=_infer_dtype(data, name))
    234 
    235         assert data.dtype.kind != "O" or data.dtype.metadata

~/mambaforge/envs/pymc4-dev/lib/python3.9/site-packages/xarray/conventions.py in _copy_with_dtype(data, dtype)
    189     """
    190     result = np.empty(data.shape, dtype)
--> 191     result[...] = data
    192     return result
    193 

ValueError: setting an array element with a sequence.

In my opinion, it would be nicer to save the sample stats as a rectangular array, even if that means padding with NaNs when the chains have a different number of stages. They could then be represented properly in xarray with a stage dimension (which currently does not exist; the sample stats falsely carry "chain" and "draw" dimensions, even though they do not depend on the draw). I think that padding with NaNs is not too bad because the number of stages does not differ hugely between the chains anyway.
Alternatively, the sample stats could be saved as separate variables for each chain, with a separate stage dimension per chain.
Or are there other solutions? Please let me know what you think about this; I would be happy to provide a pull request.
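A minimal sketch of the NaN-padding idea (the beta values are made up for illustration; this is not PyMC's actual code):

```python
import numpy as np

# Hypothetical per-chain SMC stats: chain 0 ran 3 stages, chain 1 ran 5.
betas = [[0.1, 0.5, 1.0], [0.05, 0.2, 0.4, 0.8, 1.0]]

# Pad every chain's stats with NaN up to the longest chain, so the
# result is a rectangular (chain, stage) array instead of a ragged
# object array.
max_stages = max(len(b) for b in betas)
beta = np.full((len(betas), max_stages), np.nan)
for chain, b in enumerate(betas):
    beta[chain, : len(b)] = b

print(beta.shape)                    # (2, 5)
print(np.isnan(beta[0, 3:]).all())   # True: chain 0 is padded after stage 3
```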

Besides, there is one more problem when saving to netCDF: even if the chains do have the same number of stages, I get an error because the tune_steps attribute of the sample stats has a Boolean data type, which does not seem to be supported in netCDF. When I convert it to a string, saving works.

import pymc as pm

with pm.Model() as model:
    # Create a model that leads to many stages in SMC because the posterior is
    # far from the prior.
    bar = pm.Normal("bar", sigma=0.5)
    pm.Normal("foo", mu=bar, observed=3, sigma=0.5)
    trace = pm.sample_smc(cores=1, draws=100)

trace.to_netcdf("test_sample_stats.nc")
Complete error traceback
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_106373/1825203837.py in <module>
      8     trace = pm.sample_smc(cores=1, draws=100)
      9 
---> 10 trace.to_netcdf("test_sample_stats.nc")

~/mambaforge/envs/pymc4-dev/lib/python3.9/site-packages/arviz/data/inference_data.py in to_netcdf(self, filename, compress, groups)
    390                 if compress:
    391                     kwargs["encoding"] = {var_name: {"zlib": True} for var_name in data.variables}
--> 392                 data.to_netcdf(filename, mode=mode, group=group, **kwargs)
    393                 data.close()
    394                 mode = "a"

~/mambaforge/envs/pymc4-dev/lib/python3.9/site-packages/xarray/core/dataset.py in to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   1900         from ..backends.api import to_netcdf
   1901 
-> 1902         return to_netcdf(
   1903             self,
   1904             path,

~/mambaforge/envs/pymc4-dev/lib/python3.9/site-packages/xarray/backends/api.py in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
   1070         # TODO: allow this work (setting up the file for writing array data)
   1071         # to be parallelized with dask
-> 1072         dump_to_store(
   1073             dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
   1074         )

~/mambaforge/envs/pymc4-dev/lib/python3.9/site-packages/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1117         variables, attrs = encoder(variables, attrs)
   1118 
-> 1119     store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
   1120 
   1121 

~/mambaforge/envs/pymc4-dev/lib/python3.9/site-packages/xarray/backends/common.py in store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    261         variables, attributes = self.encode(variables, attributes)
    262 
--> 263         self.set_attributes(attributes)
    264         self.set_dimensions(variables, unlimited_dims=unlimited_dims)
    265         self.set_variables(

~/mambaforge/envs/pymc4-dev/lib/python3.9/site-packages/xarray/backends/common.py in set_attributes(self, attributes)
    278         """
    279         for k, v in attributes.items():
--> 280             self.set_attribute(k, v)
    281 
    282     def set_variables(self, variables, check_encoding_set, writer, unlimited_dims=None):

~/mambaforge/envs/pymc4-dev/lib/python3.9/site-packages/xarray/backends/netCDF4_.py in set_attribute(self, key, value)
    446             self.ds.setncattr_string(key, value)
    447         else:
--> 448             self.ds.setncattr(key, value)
    449 
    450     def encode_variable(self, variable):

src/netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.setncattr()

src/netCDF4/_netCDF4.pyx in netCDF4._netCDF4._set_att()

TypeError: illegal data type for attribute b'tune_steps', must be one of dict_keys(['S1', 'i1', 'u1', 'i2', 'u2', 'i4', 'u4', 'i8', 'u8', 'f4', 'f8']), got b1

Versions and main components

  • PyMC/PyMC3 Version: '4.0.0'
  • Aesara/Theano Version: '2.3.1'
  • Python Version: 3.9.7
  • Operating system: Linux
  • How did you install PyMC/PyMC3: conda
@ricardoV94
Member

Alternatively, sample stats could be saved as separate variables for each chain with a separate stage dimension for each chain.

That sounds like a cleaner solution. Would you like to open a PR to this effect?

@ricardoV94 ricardoV94 added enhancements SMC Sequential Monte Carlo labels Dec 15, 2021
@OriolAbril
Member

OriolAbril commented Dec 17, 2021

Alternatively, sample stats could be saved as separate variables for each chain with a separate stage dimension for each chain.

That sounds like a more clean solution. Would you like to open a PR to this effect?

We do need to work on the ArviZ side to support SMC properly instead of forcing its output into an HMC-based structure; however, none of us has the time to do it. I think it is important to note, though, that (at the time of writing) renaming the fake draw dimension to stage, or making the chains separate variables (and thus having neither a chain dimension nor a draw one, as they would become stage_chain1, stage_chain2, ...), will break everything, and you won't be able to plot SMC data or run rhat on it.

Whatever the approach, I'd recommend testing the "chosen structure" beforehand on fake data to see what would still work and what would not.
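For instance, a fake sample_stats dataset with the proposed (chain, stage) layout could be built directly in xarray to see which downstream functionality survives (the values and variable names here are hypothetical):

```python
import numpy as np
import xarray as xr

# Hypothetical "chosen structure": a rectangular (chain, stage) layout,
# NaN-padded where a chain finished early, instead of the current fake
# (chain, draw) layout with object dtype.
beta = np.array([[0.1, 0.5, 1.0, np.nan],
                 [0.05, 0.3, 0.7, 1.0]])
fake_stats = xr.Dataset(
    {"beta": (("chain", "stage"), beta)},
    coords={"chain": [0, 1], "stage": np.arange(beta.shape[1])},
)

# Label-based selection works on the rectangular layout:
print(float(fake_stats["beta"].sel(chain=1, stage=2)))  # 0.7
```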

In my opinion it would be nicer to save the sample stats as a square array even if that means filling up with NaNs in the case that the chains have a different number of stages

I was actually unaware that xarray already supported ragged data and assumed the nan filling was the default behaviour in such cases. That is interesting. I will play around a bit if I have time.

Besides, there is one more problem when saving to netcdf: Even if the chains do have the same number of stages, I get an error because the tune_steps attribute of the sample stats has a Boolean data type which does not seem to be supported in netcdf. When I convert it to a string saving works.

netCDF does support boolean data, and I have never had any problem with it. In fact, if you do az.load_arviz_data("rugby").sample_stats, which loads an example InferenceData from a netCDF file, you'll see that the diverging variable is boolean. You might have something wrong with your installation, in addition to the ragged-data limitation 🤔

@astoeriko
Author

astoeriko commented Dec 17, 2021

Thank you both for your comments on this. I see that a solution might not be as easy as I thought.

I think it important to note though that (at the date of writing) renaming the fake draw dimension to stage or making chains different variables (and thus no chain dimension nor draw one as hey would then be stage_chain1, stage_chain2...) will break everything and you won't be able to plot SMC data or run rhat on it.

I am not super familiar with all of ArviZ's functionality, but I do not think this should be a problem for plotting or calculating rhat (correct me if I am mistaken). A stage dimension would only occur in the sample_stats group. In the posterior, only data from the last stage is saved, so those variables would still have chain and draw dimensions. The variables in the sample_stats group (beta, accept_rate) really do not have (and should not have) a draw dimension, as far as I can see. I don't know whether ArviZ actually uses the information stored in the SMC sample_stats?

I was actually unaware that xarray already supported ragged data

I don't think it does support it, and that is part of the problem. The ragged array is forced into xarray, but then core xarray functionality like label-based indexing does not work properly.

@astoeriko
Author

netCDF does support boolean data and I have never had any problem with that. In fact, if you do az.load_arviz_data("rugby").sample_stats which loads an example inferencedata from a netcdf file you'll see that the diverging variable is boolean.

Boolean variables (as in the rugby example) are not a problem. I think that Boolean values in the attrs don't work, however. I will try to verify this.
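A minimal sketch of the string-conversion workaround mentioned above (the sanitize_attrs helper is hypothetical, not an existing PyMC or ArviZ function):

```python
# netCDF attributes only accept strings and numeric types, so cast
# unsupported attribute values (e.g. Python bools) to strings before
# calling to_netcdf. This mirrors the manual "convert tune_steps to a
# string" workaround from the issue description.
def sanitize_attrs(attrs):
    return {
        key: str(value) if isinstance(value, (bool, type(None))) else value
        for key, value in attrs.items()
    }

attrs = {"tune_steps": True, "p_acc_rate": 0.85}
print(sanitize_attrs(attrs))  # {'tune_steps': 'True', 'p_acc_rate': 0.85}
```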

@OriolAbril
Member

A stage dimension should only occur in the sample_stats group. In the posterior, only data from the last stage is saved, so the variables would still have a chain and draw dimension. The variables in the sample_stats group (beta, accept_rate) really do not have (and should not have) a draw dimension, as far as I see.

rhat was a bad example, yes; it would not be affected. I sometimes plot traces, for example for lp, the accept rate, or the step size (during warmup), but I don't use SMC and don't know if there would be something equivalent. It wasn't a "definitely don't do this" comment, sorry if it came off that way; it was more of a warning.

I don't know if arviz actually uses the information stored in the SMC sample_stats?

Currently, only divergences and energy from HMC/NUTS are used by ArviZ; other things like what I mentioned above are more manual.

I think that Boolean values for the attrs don't work, however. I will try to verify this.

Ooh, I thought it was a variable, my bad. Yeah, attributes in netCDF are very limited; we also have to JSON-encode arrays and dictionaries because those aren't supported either. The converter should probably make the conversion to string you mentioned to prevent this from happening.
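A rough sketch of that kind of attribute encoding (encode_attr is a hypothetical helper for illustration, not ArviZ's actual implementation):

```python
import json
import numpy as np

# Values that netCDF attributes cannot store directly (dicts, lists,
# arrays, bools) get serialized to a JSON string; everything else
# passes through unchanged.
def encode_attr(value):
    if isinstance(value, np.ndarray):
        value = value.tolist()
    if isinstance(value, (dict, list, bool)):
        return json.dumps(value)
    return value

print(encode_attr({"target_accept": 0.9}))  # '{"target_accept": 0.9}'
print(encode_attr(True))                    # 'true'
```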

As a bit of a side note: if keeping the correct types in attributes is important, I would recommend using zarr instead of netCDF. It is still very new but will most probably end up replacing netCDF: https://arviz-devs.github.io/arviz/api/generated/arviz.InferenceData.from_zarr.html#arviz.InferenceData.from_zarr and https://arviz-devs.github.io/arviz/api/generated/arviz.InferenceData.to_zarr.html#arviz.InferenceData.to_zarr

@ricardoV94
Member

I think it's fine to give these sample stats, which happen per stage or sub-stage (not sure we are outputting any sub-stage stats at the moment), a new chain-unique dim, and still leave the "fake" chain and draw dims for the posterior samples.

CC @aloctavodia
