
Fix modelbuilder load #229

Open
AuguB wants to merge 10 commits into main
Conversation

@AuguB commented Aug 15, 2023

A suggestion for changing the way ModelBuilder saves and loads the model.

This addresses two issues:

  1. In hierarchical models, the predict data may contain only a subset of the groups present in the fit data, so the coords of the fit data need to be stored so that the group indices in the predict data can be matched to them.
  2. The fit data cannot be saved with the model, as this could lead to privacy issues if models are shared.

Stijn de Boer (imec-OnePlanet) added 2 commits August 15, 2023 10:27
- `generate_and_preprocess_data` is replaced by `preprocess_data` and `save_model_coords`. Both are called in `fit`; `predict` calls only `preprocess_data` (see the sketch after this list).
- self.X and self.y are no longer used
- fit data is not stored in idata, because this could lead to data-sharing issues
- 'model_coords' are now saved in `idata.attrs`. This enables reconstruction of the correct computation graph when only test X is provided.
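
Roughly, the intended call flow reads like this (a simplified sketch based on the bullets above, not the actual diff; the method name follows the diff, where preprocessing is called `preprocess_model_data`, and sampling details are elided):

    def fit(self, X, y, **kwargs):
        X_prep, y_prep = self.preprocess_model_data(X, y)
        self.save_model_coords(X_prep, y_prep)   # remember the fit-time coords
        self.build_model(X_prep, y_prep)
        # ... sample and store the result in self.idata ...

    def predict(self, X_pred, **kwargs):
        X_prep, _ = self.preprocess_model_data(X_pred)   # no coords are saved here
        # ... rebuild/update the model with X_prep and sample the posterior predictive ...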

Stijn de Boer (imec-OnePlanet) added 6 commits August 15, 2023 15:50
self.model_config = model_config # parameters for priors etc.
self.model_coords = None
Collaborator:
I think this property should be included in child classes only IF they utilise coords. Otherwise it forces the property on all classes that inherit from MB, even those that don't have a hierarchical nature.

Author:
Thanks for your suggestion. This change does not force any non-hierarchical model to populate the coords attribute; if a model has no coords, it can simply stay None.

@@ -258,6 +254,23 @@ def build_model(
"""
raise NotImplementedError

def save_model_coords(self, X: Union[pd.DataFrame, pd.Series], y: pd.Series):
Collaborator:
This should be an abstract method, since it doesn't actually implement anything (it only assigns None to a property that is already initialised to None).

Author:
Yes, this method does not do anything by itself. I don't know exactly how Python abstract methods work, but this way I hoped that a specific implementation would stay optional, needed only where the model coords actually have to be set. For non-hierarchical models this method would not need to be touched.

Collaborator:
The reason I didn't include it here is exactly what you're mentioning: a lot of models won't use coords at all, and this would force every model to carry a property that is never used, along with its setter method. In my opinion it's best not to include it at all and let developers of specific child models add it, as was done for DelayedSaturatedMMM in pymc-marketing.
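
For reference, the two shapes being weighed in this thread look roughly like this (an illustrative sketch, not code from the diff; the class names are placeholders):

    from abc import ABC, abstractmethod

    class WithDefault(ABC):
        # this PR: a do-nothing default, overridden only by models that use coords
        def save_model_coords(self, X, y=None):
            self.model_coords = None

    class WithAbstract(ABC):
        # the alternative raised above: every subclass must provide an implementation
        @abstractmethod
        def save_model_coords(self, X, y=None):
            ...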

@@ -682,7 +684,11 @@ def predict_proba(
**kwargs,
) -> xr.DataArray:
"""Alias for `predict_posterior`, for consistency with scikit-learn probabilistic estimators."""
return self.predict_posterior(X_pred, extend_idata, combined, **kwargs)
synth_y = pd.Series(np.zeros(len(X_pred)))
X_pred_prep, y_synth_prep = self.preprocess_model_data(X_pred, synth_y)
Collaborator:
preprocess_model_data should not be defined in a way that forces creating an entire dummy dataset for y when we only want to process X. So: preprocess_model_data(X_pred: pd.DataFrame, y_pred: Optional[pd.Series] = None), with logic that first checks whether y was provided.
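
A minimal sketch of that suggestion (the body is a placeholder; only the signature and the None check come from the comment):

    from typing import Optional, Tuple
    import pandas as pd

    def preprocess_model_data(
        self, X_pred: pd.DataFrame, y_pred: Optional[pd.Series] = None
    ) -> Tuple[pd.DataFrame, Optional[pd.Series]]:
        X_prep = X_pred.copy()  # whatever feature preprocessing the model needs
        y_prep = y_pred.copy() if y_pred is not None else None
        return X_prep, y_prep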

@@ -258,6 +254,23 @@ def build_model(
"""
raise NotImplementedError

def save_model_coords(self, X: Union[pd.DataFrame, pd.Series], y: pd.Series):
Collaborator:
Why does save_model_coords require both X and y as inputs?

Author:
I did that because I thought there may be some highly unusual cases in which y is also required. I would have no problem with it myself if this function only takes X.

Collaborator:
In that case y should be an optional parameter, with a default value of None.

def generate_and_preprocess_model_data(
self, X: Union[pd.DataFrame, pd.Series], y: pd.Series
) -> None:
def preprocess_model_data(self, X: Union[pd.DataFrame, pd.Series], y: pd.Series = None) -> None:
@michaelraczycki (Collaborator) commented Aug 15, 2023:
The signature doesn't match the return value in the docstring; implementing it in a child class the way the docstring suggests will result in a mypy error about breaking the Liskov substitution principle.

@michaelraczycki (Collaborator) commented Aug 15, 2023

@AuguB, the example I mentioned in our discussion (DelayedSaturatedMMM) is a hierarchical model that uses coordinates. My point is just that defining a property and its dedicated setter isn't justifiable when a lot of models simply don't use them.

I'm consulting other PyMC devs about your concerns regarding fit_data. I'm not sure, though, whether the case you're describing is a frequent one; in a scenario where data privacy is an issue, you can simply erase the fit_data by calling <your_model_instance>.idata.fit_data = None, and only then save the model. The id generated by ModelBuilder, which is later used to make sure the details specifying the model were preserved to allow replication, doesn't include fit_data, so the privacy issue is addressed with no consequences for the model or its ability to load.
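
For example, such a pre-save cleanup could look roughly like this (a hedged sketch; whether a group can be dropped with del depends on the arviz version, and "model.nc" is a placeholder filename):

    # drop the stored training data from the trace before sharing the model
    if hasattr(model.idata, "fit_data"):
        del model.idata.fit_data
    model.save("model.nc")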

@michaelraczycki (Collaborator):
Not that closely related to the topic, but pymc-marketing is already integrated with this API, so any change here would take quite some effort to adjust there (the models themselves, tests, example notebooks, tutorials). Separating data generation from preprocessing will probably be addressed in the future, but this PR shouldn't be merged unless those models are adjusted as well. Nevertheless, thank you for your effort in preparing this PR; it shows where we can focus our efforts in the future to make the ModelBuilder API more user-friendly.

@AuguB (Author) commented Aug 15, 2023

@AuguB, the example I mentioned in our discussion (DelayedSaturatedMMM) is a hierarchical model that uses coordinates. My point is just that defining a property and its dedicated setter isn't justifiable when a lot of models simply don't use them. I'm consulting other PyMC devs about your concerns regarding fit_data. I'm not sure, though, whether the case you're describing is a frequent one; in a scenario where data privacy is an issue, you can simply erase the fit_data by calling <your_model_instance>.idata.fit_data = None, and only then save the model. The id generated by ModelBuilder, which is later used to make sure the details specifying the model were preserved to allow replication, doesn't include fit_data, so the privacy issue is addressed with no consequences for the model or its ability to load.

I don't see how omitting the fit_data from the idata would have no consequences for the model's ability to load; load clearly depends on this data, for example here:

        dataset = idata.fit_data.to_dataframe()  # load() reads the training data back from idata
        X = dataset.drop(columns=[model.output_var])
        y = dataset[model.output_var].values
        model.build_model(X, y)  # ...and rebuilds the model from it

But besides that, to be honest I think that shipping the full training dataset with a model just to make sure the computation graph is correctly constructed feels a bit cumbersome.

@michaelraczycki (Collaborator):
You're right, I completely missed it while writing the response. @ricardoV94, @twiecki, do you see this as an issue that will affect a lot of common business cases? If data privacy is an edge case, I don't think it should be addressed here, but rather by overloading the load function in a child class defined for that specific customer.

@AuguB (Author) commented Aug 15, 2023

Not that closely related to the topic, but pymc-marketing is already integrated with this API, so any change here would take quite some effort to adjust there (the models themselves, tests, example notebooks, tutorials). Separating data generation from preprocessing will probably be addressed in the future, but this PR shouldn't be merged unless those models are adjusted as well. Nevertheless, thank you for your effort in preparing this PR; it shows where we can focus our efforts in the future to make the ModelBuilder API more user-friendly.

Yes, I understand that this is an issue. I would love to see this pattern used in a later version; for now I will just use my fork.

Thanks for all the helpful comments.

@ricardoV94 (Member) commented Aug 16, 2023

@AuguB it is generally better to first open an issue to explain what functionality is missing/broken before jumping to a PR. This will ensure there is consensus before work is invested on your part.

Anyway, about having coords stuff by default, I think that is totally fine. We promote coords as part of a good workflow.

About the privacy issues. That is interesting, but not really unique to ModelBuilder? When you do pm.sample by default the observed data (and any Constant/MutableData) are stored in the arviz trace. This is considered good practice for "reproducibility".

I agree with @michaelraczycki that the user should manually delete the observed data if they want to get rid of it (or use a subclass of ModelBuilder that does that).

@ricardoV94 (Member):
RE: Model coords. Was it the case that loaded models would lose coords information?

@michaelraczycki (Collaborator):
RE: Model coords. Was it the case that loaded models would lose coords information?

Nope. I just left the implementation of coords and their preservation to the child classes, as part of generate_and_preprocess_data (coords being one of the 'generation' parts). I did that somewhat on purpose: if half of the models aren't hierarchical, I believe we shouldn't force them to implement or inherit methods that exist purely to handle coords.

@michaelraczycki (Collaborator):
Once again, for hierarchical models I'm treating DelayedSaturatedMMM as the example implementation.

@ricardoV94 (Member):
I don't follow what hierarchical models have to do with coords. You can (and should) use coords for any kind of model

@michaelraczycki (Collaborator):
Okay, in that case we can move the parameter definition to MB, but the setter function should be an abstract one here.

@ricardoV94 (Member):
@AuguB it is generally better to first open an issue to explain what functionality is missing/broken before jumping to a PR. This will ensure there is consensus before work is invested on your part.

Apologies, I see you started a discussion (not issue) on that. Ignore that comment :)

@ricardoV94 (Member):
Okay, in that case we can move the parameter definition to MB, but the setter function should be an abstract one here.

By abstract you mean an abc.abstractmethod (which must be implemented by every subclass)?

@ricardoV94 (Member) commented Aug 16, 2023

Aren't the coords saved in the fitted idata though? Why is something extra needed?

@AuguB (Author) commented Aug 16, 2023

About the privacy issues. That is interesting, but not really unique to ModelBuilder? When you do pm.sample by default the observed data (and any Constant/MutableData) are stored in the arviz trace. This is considered good practice for "reproducibility". I agree with @michaelraczycki that the user should manually delete the observed data if they want to get rid of it (or use a subclass of ModelBuilder that does that).

Yes, the data is stored by default, but removing the data from the trace would break ModelBuilder.load as it is now.

RE: Model coords. Was it the case that loaded models would lose coords information?

No, the coords information is still there. I agree that it is suboptimal to force the user to inherit methods that aren't used (and I don't think we should force the user to implement it by making it abstract). However, right now, a user with privacy-sensitive data whose coords may not overlap completely (like me) is forced to overload fit, predict, save, and load. That seems like a lot just to avoid the save_coords method.

Aren't the coords saved in the fitted idata though? Why is something extra needed?

Because the coords dict is needed in build_model when called from fit, and I thought it would be useful to store the dict explicitly alongside the model config, so that it can be loaded and reused in build_model when called from predict, without forcing the user to write the boilerplate that extracts the coords from the idata.
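
For concreteness, storing the coords alongside the fitted idata could look roughly like this (a sketch, assuming the coords dict is JSON-serialisable; not necessarily the exact code in this PR):

    import json

    # in fit(), after sampling
    self.idata.attrs["model_coords"] = json.dumps(self.model_coords)

    # in load(), before rebuilding the model
    self.model_coords = json.loads(self.idata.attrs["model_coords"])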

@michaelraczycki (Collaborator):
I may be wrong again (I mostly deal with the implementation, not the actual usage), but you shouldn't have to build the model again to make predictions. The _data_setter method should be implemented using the pymc.set_data function, which replaces the data stored in the model's data variables.
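
A minimal sketch of that pattern (the data variable names "X_data" and "y_data" are placeholders for whatever pm.MutableData containers the child model defines):

    import numpy as np
    import pymc as pm

    def _data_setter(self, X, y=None):
        # swap the values held by the model's MutableData variables in place
        with self.model:
            pm.set_data({"X_data": np.asarray(X)})
            if y is not None:
                pm.set_data({"y_data": np.asarray(y)})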

@AuguB (Author) commented Aug 16, 2023

I don't follow what hierarchical models have to do with coords. You can (and should) use coords for any kind of model

You're right. The point is that the predict data may not contain instances of all the coords that were in the fit data, so to load the model correctly we need access to the coords of the training data. In the current version of ModelBuilder, the model is loaded using the training data itself, but that is not possible in all applications (I work in the medical domain). In my opinion, the overhead of saving the coords independently of the data is acceptable to enable this use case.

@AuguB (Author) commented Aug 16, 2023

I may be wrong again (I mostly deal with the implementation, not the actual usage), but you shouldn't have to build the model again to make predictions. The _data_setter method should be implemented using the pymc.set_data function, which replaces the data stored in the model's data variables.

Not always, but I do fit -> save -> load -> predict, and then I have to build the model again in load

@AuguB AuguB closed this Aug 16, 2023
@AuguB AuguB reopened this Aug 16, 2023
@michaelraczycki (Collaborator) commented Aug 16, 2023

Okay, now I get it. Regarding the setter function, in my opinion it's simple: either it's common enough that we should force every class to implement it, or, if it's not used commonly enough to justify making it abstract, it doesn't belong here. The third SOLID principle (Liskov substitution) states that you should be able to replace the parent class with any of its child classes without causing problems:

- Derived classes must be substitutable for their base classes.
- A subclass should enhance (or at least not reduce) the behavior of the parent class.
- It's not just about method signatures; the behavior of the subclass should align with the expectations set by the parent class.

So in this case a setter method without a concrete implementation would not fulfil those requirements, as replacing it from a subclass would change the outcome.

I'll try to get a PR making the coords a property of ModelBuilder out this week, and try to address the load issue so that a model whose data has been wiped can still be loaded. Any thoughts on that, @ricardoV94? I can include something more; we'd probably make a release afterwards.

@AuguB (Author) commented Aug 16, 2023

Okay, great.

One related thing that has not been addressed, and which guided my thinking in this issue, is the following: I want to apply the same preprocessing to the predict_data as I did to the fit_data. So the steps for the two methods (assuming the model has been saved and loaded in between) are:

for fit:
preprocess -> save coords -> build model -> sample
for predict:
preprocess -> build model -> sample

If we decide against save_coords, the user would have to write two cases in preprocess: one where the coords are extracted from the fit_data, and one where they are extracted from the idata. Saving the model_coords would take away that case distinction.

@michaelraczycki (Collaborator):
In that case please put the preprocessing logic in _data_setter; it should contain everything needed to successfully replace the data in your model for predictions. I understand your reasoning, but you're talking about two different edge cases that can't all be addressed here. Even the other PyMC models we've integrated MB with so far each needed some functions overloaded, simply because it's almost impossible to write a template class that addresses every single scenario, unless you want to enforce a lot of things down the road.

@ricardoV94 (Member) commented Aug 16, 2023

My only vague issue with saving coords separately is that there are now two sources for the same information? This can eventually become a subtle source of bugs if/when they diverge (e.g., different results in arviz plots and summary statistics). Usually this is a bad practice, although generalities don't always apply.

Can't we have a default self.extract_coords that retrieves them from self.idata? A user can then override this method if they don't have all the coords in idata for whatever reason (and that user-defined method can of course retrieve them from the config). Otherwise I would use InferenceData to carry this information by default, because it is in a sense the standard vehicle.

For example, if arviz changes how coords are stored internally we don't have to do anything in ModelBuilder. But if we are now responsible for encoding/decoding coords we are duplicating this work and potentially diverging from arviz.
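
A rough sketch of such a default, assuming the fitted idata has a posterior group (chain and draw are filtered out because they are sampler dims rather than model coords; this is illustrative, not the actual proposal):

    def extract_coords(self):
        # read the model coords back from the posterior group of the fitted idata
        return {
            name: coord.values.tolist()
            for name, coord in self.idata.posterior.coords.items()
            if name not in ("chain", "draw")
        }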

@ricardoV94 (Member) commented Aug 16, 2023

Somewhat related, some methods like sample_posterior_predictive (used in predict) will behave differently if the model has MutableData variables that are missing or have changed from the idata (it will resample intermediate variables from the prior instead of using the trace posterior). The same happens for mutable_coords. So you have to be quite careful if you want to do predictions with an artificially mutated dataset.

This can certainly be overcome, but just wanted to raise that concern so you don't get surprised down the road.

@AuguB (Author) commented Aug 16, 2023

In that case please put the preprocessing logic in _data_setter; it should contain everything needed to successfully replace the data in your model for predictions. I understand your reasoning, but you're talking about two different edge cases that can't all be addressed here. Even the other PyMC models we've integrated MB with so far each needed some functions overloaded, simply because it's almost impossible to write a template class that addresses every single scenario, unless you want to enforce a lot of things down the road.

But wouldn't that mean that we have the same preprocessing logic in preprocess_data and _data_setter?

Can't we have a default self.extract_coords that retrieves them from self.idata? A user can then override this method if they don't have all the coords in idata for whatever reason (and that user-defined method can of course retrieve them from the config). Otherwise I would use InferenceData to carry this information by default, because it is in a sense the standard vehicle.

I agree that it would be cleaner to read the coords from the idata, so I think extract_coords is a good idea. But I also believe we will need to separate preprocess from save_coords. We could cover both cases with something like this:

def extract_coords(self, X, y):
    if self.is_fitted:
        self.model_coords = ...  # extract coords from idata
    else:
        self.model_coords = self.extract_coords_from_data(X)

where extract_coords_from_data is abstract.

Then for both fit and predict, we would have preprocess -> extract coords -> build model -> sample
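
For a hierarchical model, the child-class implementation could then be as small as this (assuming, purely for illustration, that X carries a "group" column used as the model dimension):

    def extract_coords_from_data(self, X):
        # one coord value per group label observed in the data
        return {"group": X["group"].unique().tolist()}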

@ricardoV94 (Member):
Sounds reasonable at first glance. I would probably pass both X and y but that's a minor detail at this point.

where extract_coords_from_data is abstract.

I would make it raise NotImplementedError not abstract (so far we haven't needed it, so I wouldn't require every subclass to implement it)

Before we go down that road, it would be good to check that this actually solves your problem. See my message above about sample_posterior_predictive and missing/modified mutable data/coords from the idata.

@AuguB (Author) commented Aug 16, 2023

Somewhat related, some methods like sample_posterior_predictive (used in predict) will behave differently if the model has MutableData variables that are missing or have changed from the idata (it will resample intermediate variables from the prior instead of using the trace posterior). The same happens for mutable_coords. So you have to be quite careful if you want to do predictions with an artificially mutated dataset.

This can certainly be overcome, but just wanted to raise that concern so you don't get surprised down the road.

I'm not quite sure if I understand exactly what you mean. Do you have an example where this happens?

@ricardoV94 (Member) commented Aug 16, 2023

Somewhat related, some methods like sample_posterior_predictive (used in predict) will behave differently if the model has MutableData variables that are missing or have changed from the idata (it will resample intermediate variables from the prior instead of using the trace posterior). The same happens for mutable_coords. So you have to be quite careful if you want to do predictions with an artificially mutated dataset.
This can certainly be overcome, but just wanted to raise that concern so you don't get surprised down the road.

I'm not quite sure if I understand exactly what you mean. Do you have an example where this happens?

If you have a model parameter that depends on mutable data or has mutable dims that are missing (or have changed value), then those will be resampled from the prior, as if you had not learned anything about them.

import pymc as pm

with pm.Model() as m:
    x = pm.MutableData("x", 0)
    a = pm.Normal("a", x)
    b = pm.Normal("b", a, observed=[100] * 10)
    
    idata = pm.sample()
    pm.sample_posterior_predictive(idata)  # Sampling: [b]
    del idata.constant_data["x"]
    pm.sample_posterior_predictive(idata)  # Sampling: [a, b]

@AuguB (Author) commented Aug 16, 2023

I would make it raise NotImplementedError not abstract (so far we haven't needed it, so I wouldn't require every subclass to implement it)

Wouldn't this then raise the error for every subclass that hasn't implemented it, forcing the user to implement it anyway and having the same effect as making it abstract? I think it wasn't needed so far because the coords were extracted in generate_and_preprocess_model_data.

@ricardoV94 (Member):
Wouldn't this then raise the error for every subclass that hasn't implemented it, forcing the user to implement it anyway and having the same effect as making it abstract? I think it wasn't needed so far because the coords were extracted in generate_and_preprocess_model_data.

Ah okay, I missed that

@AuguB (Author) commented Aug 16, 2023

If you have a model parameter that depends on mutable data or has mutable dims that are missing (or have changed value), then those will be resampled from the prior, as if you had not learned anything about them.

Right, makes sense, thanks. I don't believe I ran into this issue before, but good to know. I think I will just do idata.constant_data['X'] *= 0

@ricardoV94 (Member) commented Aug 16, 2023

Right, makes sense, thanks. I don't believe I ran into this issue before, but good to know. I think I will just do idata.constant_data['X'] *= 0

When you rebuild the model, X must also be set to zero (it has to match, or the same thing will happen). This is an edge case; most models don't have mutable data/coords on the inferred parameters.

Anyway, setting everything to zero (or nan) may be a good way to get around the privacy issues without having to remove things.
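
In practice that could be as simple as the following before saving (a sketch; "model.nc" is a placeholder filename and "X" stands for whatever variable the model keeps in constant_data):

    # wipe the stored training inputs, keeping the structure/coords intact
    model.idata.constant_data["X"] *= 0
    model.save("model.nc")
    # when the model is rebuilt for predictions, its data must match (i.e. also zeros)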

Stijn de Boer (imec-OnePlanet) and others added 2 commits August 16, 2023 17:14
Revised version addressing all (or at least most) of the feedback.

Comments:
* I decided to split `extract_model_coords` into `set_model_coords_from_data` and `set_model_coords_from_idata`. The former is abstract, the latter has a sensible default implementation. Both are now setters, but could also return the dict; in the end it made no sense to combine them.
* `set_model_coords_from_idata` is called in `load`, not in `predict`.
* Wrote a little `is_fitted` property (sketched below).
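
A property along those lines might look like this (a sketch assuming self.idata is None until fit has run; not necessarily the exact code in the commit):

    @property
    def is_fitted(self) -> bool:
        # the model counts as fitted once an idata with a posterior group exists
        return self.idata is not None and "posterior" in self.idata.groups()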
@@ -13,6 +13,8 @@
# limitations under the License.


# Modified by Stijn de Boer
Member:
We don't keep this info in the files; it's in the release notes and commit history.
