
Fix modelbuilder load #229

Open
AuguB wants to merge 10 commits into main
Conversation

@AuguB commented Aug 15, 2023

A suggestion for changing the way ModelBuilder saves and loads the model.

This addresses two issues:

  1. In hierarchical models, the predict data may contain only a subset of the groups present in the fit data, so the coords of the fit data need to be stored so that the group indices in the predict data can be matched to them.
  2. The fit data cannot be saved with the model, as this could lead to privacy issues if models are shared.

Stijn de Boer (imec-OnePlanet) added 2 commits August 15, 2023 10:27
- `generate_and_preprocess_data` is replaced by `preprocess_data` and `save_model_coords`. Both are called in `fit`; `predict` calls only `preprocess_data` (see the sketch after this list).
- self.X and self.y are no longer used
- fit data is not stored in idata, because this could lead to data-sharing issues
- 'model_coords' are now saved in `idata.attrs`. This enables reconstruction of the correct computation graph when only test X is provided.
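
Roughly, the intended call flow reads like this (a simplified sketch based on the bullets above, not the actual diff; the method name follows the diff, where preprocessing is called `preprocess_model_data`, and sampling details are elided):

    def fit(self, X, y, **kwargs):
        X_prep, y_prep = self.preprocess_model_data(X, y)
        self.save_model_coords(X_prep, y_prep)   # remember the fit-time coords
        self.build_model(X_prep, y_prep)
        # ... sample and store the result in self.idata ...

    def predict(self, X_pred, **kwargs):
        X_prep, _ = self.preprocess_model_data(X_pred)   # no coords are saved here
        # ... rebuild/update the model with X_prep and sample the posterior predictive ...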

Stijn de Boer (imec-OnePlanet) added 6 commits August 15, 2023 15:50
self.model_config = model_config # parameters for priors etc.
self.model_coords = None
Collaborator:
I think this property should be included in child classes only IF they utilise coords. Otherwise it forces the property on all classes that inherit from MB, even those that don't have a hierarchical nature.

Author:
Thanks for your suggestion. This change does not force any non-hierarchical model to populate the coords attribute; if a model has no coords, it can simply stay None.

@@ -258,6 +254,23 @@ def build_model(
"""
raise NotImplementedError

def save_model_coords(self, X: Union[pd.DataFrame, pd.Series], y: pd.Series):
Collaborator:
This should be an abstract method, since it doesn't actually implement anything (it only assigns None to a property that is already initialised to None).

Author:
Yes, this method does not do anything by itself. I don't know exactly how Python abstract methods work, but this way I hoped that a specific implementation would stay optional, needed only where the model coords actually have to be set. For non-hierarchical models this method would not need to be touched.

Collaborator:
The reason I didn't include it here is exactly what you're mentioning: a lot of models won't use coords at all, and this would force every model to carry a property that is never used, along with its setter method. In my opinion it's best not to include it at all and let developers of specific child models add it, as was done for DelayedSaturatedMMM in pymc-marketing.
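
For reference, the two shapes being weighed in this thread look roughly like this (an illustrative sketch, not code from the diff; the class names are placeholders):

    from abc import ABC, abstractmethod

    class WithDefault(ABC):
        # this PR: a do-nothing default, overridden only by models that use coords
        def save_model_coords(self, X, y=None):
            self.model_coords = None

    class WithAbstract(ABC):
        # the alternative raised above: every subclass must provide an implementation
        @abstractmethod
        def save_model_coords(self, X, y=None):
            ...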

@@ -682,7 +684,11 @@ def predict_proba(
**kwargs,
) -> xr.DataArray:
"""Alias for `predict_posterior`, for consistency with scikit-learn probabilistic estimators."""
return self.predict_posterior(X_pred, extend_idata, combined, **kwargs)
synth_y = pd.Series(np.zeros(len(X_pred)))
X_pred_prep, y_synth_prep = self.preprocess_model_data(X_pred, synth_y)
Collaborator:
preprocess_model_data should not be defined in a way that forces creating an entire dummy dataset for y when we only want to process X. So: preprocess_model_data(X_pred: pd.DataFrame, y_pred: Optional[pd.Series] = None), with logic that first checks whether y was provided.
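
A minimal sketch of that suggestion (the body is a placeholder; only the signature and the None check come from the comment):

    from typing import Optional, Tuple
    import pandas as pd

    def preprocess_model_data(
        self, X_pred: pd.DataFrame, y_pred: Optional[pd.Series] = None
    ) -> Tuple[pd.DataFrame, Optional[pd.Series]]:
        X_prep = X_pred.copy()  # whatever feature preprocessing the model needs
        y_prep = y_pred.copy() if y_pred is not None else None
        return X_prep, y_prep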

@@ -258,6 +254,23 @@ def build_model(
"""
raise NotImplementedError

def save_model_coords(self, X: Union[pd.DataFrame, pd.Series], y: pd.Series):
Collaborator:
Why does save_model_coords require both X and y as inputs?

Author:
I did that because I thought there may be some highly unusual cases in which y is also required. I would have no problem with it myself if this function only takes X.

Collaborator:
In that case y should be an optional parameter, with a default value of None.

def generate_and_preprocess_model_data(
self, X: Union[pd.DataFrame, pd.Series], y: pd.Series
) -> None:
def preprocess_model_data(self, X: Union[pd.DataFrame, pd.Series], y: pd.Series = None) -> None:
@michaelraczycki (Collaborator) commented Aug 15, 2023:
The signature doesn't match the return value in the docstring; implementing it in a child class the way the docstring suggests will result in a mypy error about breaking the Liskov substitution principle.

@michaelraczycki (Collaborator) commented Aug 15, 2023

@AuguB, the example I mentioned in our discussion (DelayedSaturatedMMM) is a hierarchical model that uses coordinates. My point is just that defining a property and its dedicated setter isn't justifiable when a lot of models simply don't use them.

I'm consulting other PyMC devs about your concerns regarding fit_data. I'm not sure, though, whether the case you're describing is a frequent one; in a scenario where data privacy is an issue, you can simply erase the fit_data by calling <your_model_instance>.idata.fit_data = None, and only then save the model. The id generated by ModelBuilder, which is later used to make sure the details specifying the model were preserved to allow replication, doesn't include fit_data, so the privacy issue is addressed with no consequences for the model or its ability to load.
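
For example, such a pre-save cleanup could look roughly like this (a hedged sketch; whether a group can be dropped with del depends on the arviz version, and "model.nc" is a placeholder filename):

    # drop the stored training data from the trace before sharing the model
    if hasattr(model.idata, "fit_data"):
        del model.idata.fit_data
    model.save("model.nc")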

@michaelraczycki (Collaborator):
Not that closely related to the topic, but pymc-marketing is already integrated with this API, so any change here would take quite some effort to adjust there (the models themselves, tests, example notebooks, tutorials). Separating data generation from preprocessing will probably be addressed in the future, but this PR shouldn't be merged unless those models are adjusted as well. Nevertheless, thank you for your effort in preparing this PR; it shows where we can focus our efforts in the future to make the ModelBuilder API more user-friendly.

@AuguB (Author) commented Aug 15, 2023

@AuguB, the example I mentioned in our discussion (DelayedSaturatedMMM) is a hierarchical model that uses coordinates. My point is just that defining a property and its dedicated setter isn't justifiable when a lot of models simply don't use them. I'm consulting other PyMC devs about your concerns regarding fit_data. I'm not sure, though, whether the case you're describing is a frequent one; in a scenario where data privacy is an issue, you can simply erase the fit_data by calling <your_model_instance>.idata.fit_data = None, and only then save the model. The id generated by ModelBuilder, which is later used to make sure the details specifying the model were preserved to allow replication, doesn't include fit_data, so the privacy issue is addressed with no consequences for the model or its ability to load.

I don't see how omitting the fit_data from the idata would have no consequences for the model's ability to load; load clearly depends on this data, for example here:

        dataset = idata.fit_data.to_dataframe()  # load() reads the training data back from idata
        X = dataset.drop(columns=[model.output_var])
        y = dataset[model.output_var].values
        model.build_model(X, y)  # ...and rebuilds the model from it

But besides that, to be honest I think that shipping the full training dataset with a model just to make sure the computation graph is correctly constructed feels a bit cumbersome.

@michaelraczycki (Collaborator):
You're right, I completely missed it while writing the response. @ricardoV94, @twiecki, do you see this as an issue that will affect a lot of common business cases? If data privacy is an edge case, I don't think it should be addressed here, but rather by overloading the load function in a child class defined for that specific customer.

@AuguB (Author) commented Aug 15, 2023

Not that closely related to the topic, but pymc-marketing is already integrated with this API, so any change here would take quite some effort to adjust there (the models themselves, tests, example notebooks, tutorials). Separating data generation from preprocessing will probably be addressed in the future, but this PR shouldn't be merged unless those models are adjusted as well. Nevertheless, thank you for your effort in preparing this PR; it shows where we can focus our efforts in the future to make the ModelBuilder API more user-friendly.

Yes, I understand that this is an issue. I would love to see this pattern used in a later version; for now I will just use my fork.

Thanks for all the helpful comments.

@ricardoV94 (Member) commented Aug 16, 2023

@AuguB it is generally better to first open an issue to explain what functionality is missing/broken before jumping to a PR. This will ensure there is consensus before work is invested on your part.

Anyway, about having coords stuff by default, I think that is totally fine. We promote coords as part of a good workflow.

About the privacy issues. That is interesting, but not really unique to ModelBuilder? When you do pm.sample by default the observed data (and any Constant/MutableData) are stored in the arviz trace. This is considered good practice for "reproducibility".

I agree with @michaelraczycki that the user should manually delete the observed data if they want to get rid of it (or use a subclass of ModelBuilder that does that).

@ricardoV94 (Member):
RE: Model coords. Was it the case that loaded models would lose coords information?

@michaelraczycki (Collaborator):
RE: Model coords. Was it the case that loaded models would lose coords information?

Nope. I just left the implementation of coords and their preservation to the child classes, as part of generate_and_preprocess_data (coords being one of the 'generation' parts). I did that somewhat on purpose: if half of the models aren't hierarchical, I believe we shouldn't force them to implement or inherit methods that exist purely to handle coords.

@michaelraczycki (Collaborator):
Once again, for hierarchical models I'm treating DelayedSaturatedMMM as the example implementation.

@ricardoV94 (Member):
I don't follow what hierarchical models have to do with coords. You can (and should) use coords for any kind of model

@michaelraczycki (Collaborator):
Okay, in that case we can move the parameter definition to MB, but the setter function should be an abstract one here.

@ricardoV94 (Member):
@AuguB it is generally better to first open an issue to explain what functionality is missing/broken before jumping to a PR. This will ensure there is consensus before work is invested on your part.

Apologies, I see you started a discussion (not issue) on that. Ignore that comment :)

@ricardoV94 (Member):
Okay, in that case we can move the parameter definition to MB, but the setter function should be an abstract one here.

By abstract you mean an abc.abstractmethod (which must be implemented by every subclass)?

@ricardoV94 (Member) commented Aug 16, 2023

Aren't the coords saved in the fitted idata though? Why is something extra needed?

@AuguB (Author) commented Aug 16, 2023

About the privacy issues. That is interesting, but not really unique to ModelBuilder? When you do pm.sample by default the observed data (and any Constant/MutableData) are stored in the arviz trace. This is considered good practice for "reproducibility". I agree with @michaelraczycki that the user should manually delete the observed data if they want to get rid of it (or use a subclass of ModelBuilder that does that).

Yes, the data is stored by default, but removing the data from the trace would break ModelBuilder.load as it is now.

RE: Model coords. Was it the case that loaded models would lose coords information?

No, the coords information is still there. I agree that it is suboptimal to force the user to inherit methods that aren't used (and I don't think we should force the user to implement it by making it abstract). However, right now, a user with privacy-sensitive data whose coords may not overlap completely (like me) is forced to overload fit, predict, save, and load. That seems like a lot just to avoid the save_coords method.

Aren't the coords saved in the fitted idata though? Why is something extra needed?

Because the coords dict is needed in build_model when called from fit, and I thought it would be useful to store the dict explicitly alongside the model config, so that it can be loaded and reused in build_model when called from predict, without forcing the user to write the boilerplate that extracts the coords from the idata.
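
For concreteness, storing the coords alongside the fitted idata could look roughly like this (a sketch, assuming the coords dict is JSON-serialisable; not necessarily the exact code in this PR):

    import json

    # in fit(), after sampling
    self.idata.attrs["model_coords"] = json.dumps(self.model_coords)

    # in load(), before rebuilding the model
    self.model_coords = json.loads(self.idata.attrs["model_coords"])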

@michaelraczycki (Collaborator):
I may be wrong again (I mostly deal with the implementation, not the actual usage), but you shouldn't have to build the model again to make predictions. The _data_setter method should be implemented using the pymc.set_data function, which replaces the data stored in the model's data variables.
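
A minimal sketch of that pattern (the data variable names "X_data" and "y_data" are placeholders for whatever pm.MutableData containers the child model defines):

    import numpy as np
    import pymc as pm

    def _data_setter(self, X, y=None):
        # swap the values held by the model's MutableData variables in place
        with self.model:
            pm.set_data({"X_data": np.asarray(X)})
            if y is not None:
                pm.set_data({"y_data": np.asarray(y)})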

@AuguB (Author) commented Aug 16, 2023

I don't follow what hierarchical models have to do with coords. You can (and should) use coords for any kind of model

You're right. The point is that the predict data may not contain instances of all the coords that were in the fit data, so to load the model correctly we need access to the coords of the training data. In the current version of ModelBuilder, the model is loaded using the training data itself, but that is not possible in all applications (I work in the medical domain). In my opinion, the overhead of saving the coords independently of the data is acceptable to enable this use case.

@AuguB (Author) commented Aug 16, 2023

I may be wrong again (I mostly deal with the implementation, not the actual usage), but you shouldn't have to build the model again to make predictions. The _data_setter method should be implemented using the pymc.set_data function, which replaces the data stored in the model's data variables.

Not always, but I do fit -> save -> load -> predict, and then I have to build the model again in load

@AuguB AuguB closed this Aug 16, 2023
@AuguB AuguB reopened this Aug 16, 2023
@michaelraczycki (Collaborator) commented Aug 16, 2023

Okay, now I get it. Regarding the setter function, in my opinion it's simple: either it's common enough that we should force every class to implement it, or, if it's not used commonly enough to justify making it abstract, it doesn't belong here. The third SOLID principle (Liskov substitution) states that you should be able to replace the parent class with any of its child classes without causing problems:

- Derived classes must be substitutable for their base classes.
- A subclass should enhance (or at least not reduce) the behavior of the parent class.
- It's not just about method signatures; the behavior of the subclass should align with the expectations set by the parent class.

So in this case a setter method without a concrete implementation would not fulfil those requirements, as replacing it from a subclass would change the outcome.

I'll try to get a PR making the coords a property of ModelBuilder out this week, and try to address the load issue so that a model whose data has been wiped can still be loaded. Any thoughts on that, @ricardoV94? I can include something more; we'd probably make a release afterwards.

@AuguB (Author) commented Aug 16, 2023

Okay, great.

One related thing that has not been addressed, and which guided my thinking in this issue, is the following: I want to apply the same preprocessing to the predict_data as I did to the fit_data. So the steps for the two methods (assuming the model has been saved and loaded in between) are:

for fit:
preprocess -> save coords -> build model -> sample
for predict:
preprocess -> build model -> sample

If we decide against save_coords, the user would have to write two cases in preprocess: one where the coords are extracted from the fit_data, and one where they are extracted from the idata. Saving the model_coords would take away that case distinction.

@michaelraczycki (Collaborator):
In that case please put the preprocessing logic in _data_setter; it should contain everything needed to successfully replace the data in your model for predictions. I understand your reasoning, but you're talking about two different edge cases that can't all be addressed here. Even the other PyMC models we've integrated MB with so far each needed some functions overloaded, simply because it's almost impossible to write a template class that addresses every single scenario, unless you want to enforce a lot of things down the road.

@ricardoV94 (Member) commented Aug 16, 2023

My only vague issue with saving coords separately is that there are now two sources for the same information? This can eventually become a subtle source of bugs if/when they diverge (e.g., different results in arviz plots and summary statistics). Usually this is a bad practice, although generalities don't always apply.

Can't we have a default self.extract_coords that retrieves them from self.idata? A user can then override this method if they don't have all the coords in idata for whatever reason (and that user-defined method can of course retrieve them from the config). Otherwise I would use InferenceData to carry this information by default, because it is in a sense the standard vehicle.

For example, if arviz changes how coords are stored internally we don't have to do anything in ModelBuilder. But if we are now responsible for encoding/decoding coords we are duplicating this work and potentially diverging from arviz.
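
A rough sketch of such a default, assuming the fitted idata has a posterior group (chain and draw are filtered out because they are sampler dims rather than model coords; this is illustrative, not the actual proposal):

    def extract_coords(self):
        # read the model coords back from the posterior group of the fitted idata
        return {
            name: coord.values.tolist()
            for name, coord in self.idata.posterior.coords.items()
            if name not in ("chain", "draw")
        }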

@ricardoV94 (Member) commented Aug 16, 2023

Somewhat related, some methods like sample_posterior_predictive (used in predict) will behave differently if the model has MutableData variables that are missing or have changed from the idata (it will resample intermediate variables from the prior instead of using the trace posterior). The same happens for mutable_coords. So you have to be quite careful if you want to do predictions with an artificially mutated dataset.

This can certainly be overcome, but just wanted to raise that concern so you don't get surprised down the road.

@AuguB (Author) commented Aug 16, 2023

In that case please put the preprocessing logic in _data_setter; it should contain everything needed to successfully replace the data in your model for predictions. I understand your reasoning, but you're talking about two different edge cases that can't all be addressed here. Even the other PyMC models we've integrated MB with so far each needed some functions overloaded, simply because it's almost impossible to write a template class that addresses every single scenario, unless you want to enforce a lot of things down the road.

But wouldn't that mean that we have the same preprocessing logic in preprocess_data and _data_setter?

Can't we have a default self.extract_coords that retrieves them from self.idata? A user can then override this method if they don't have all the coords in idata for whatever reason (and that user-defined method can of course retrieve them from the config). Otherwise I would use InferenceData to carry this information by default, because it is in a sense the standard vehicle.

I agree that it would be cleaner to read the coords from the idata, so I think extract_coords is a good idea. But I also believe we will need to separate preprocess from save_coords. We could cover both cases with something like this:

def extract_coords(self, X, y):
    if self.is_fitted:
        self.model_coords = ...  # extract coords from idata
    else:
        self.model_coords = self.extract_coords_from_data(X)

where extract_coords_from_data is abstract.

Then for both fit and predict, we would have preprocess -> extract coords -> build model -> sample
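
For a hierarchical model, the child-class implementation could then be as small as this (assuming, purely for illustration, that X carries a "group" column used as the model dimension):

    def extract_coords_from_data(self, X):
        # one coord value per group label observed in the data
        return {"group": X["group"].unique().tolist()}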

@ricardoV94 (Member):
Sounds reasonable at first glance. I would probably pass both X and y but that's a minor detail at this point.

where extract_coords_from_data is abstract.

I would make it raise NotImplementedError not abstract (so far we haven't needed it, so I wouldn't require every subclass to implement it)

Before we go down that road, it would be good to check that this actually solves your problem. See my message above about sample_posterior_predictive and missing/modified mutable data/coords from the idata.

@AuguB (Author) commented Aug 16, 2023

Somewhat related, some methods like sample_posterior_predictive (used in predict) will behave differently if the model has MutableData variables that are missing or have changed from the idata (it will resample intermediate variables from the prior instead of using the trace posterior). The same happens for mutable_coords. So you have to be quite careful if you want to do predictions with an artificially mutated dataset.

This can certainly be overcome, but just wanted to raise that concern so you don't get surprised down the road.

I'm not quite sure if I understand exactly what you mean. Do you have an example where this happens?

@ricardoV94 (Member) commented Aug 16, 2023

Somewhat related, some methods like sample_posterior_predictive (used in predict) will behave differently if the model has MutableData variables that are missing or have changed from the idata (it will resample intermediate variables from the prior instead of using the trace posterior). The same happens for mutable_coords. So you have to be quite careful if you want to do predictions with an artificially mutated dataset.
This can certainly be overcome, but just wanted to raise that concern so you don't get surprised down the road.

I'm not quite sure if I understand exactly what you mean. Do you have an example where this happens?

If you have a model parameter that depends on mutable data or has mutable dims that are missing (or have changed value), then those will be resampled from the prior, as if you had not learned anything about them.

import pymc as pm

with pm.Model() as m:
    x = pm.MutableData("x", 0)
    a = pm.Normal("a", x)
    b = pm.Normal("b", a, observed=[100] * 10)
    
    idata = pm.sample()
    pm.sample_posterior_predictive(idata)  # Sampling: [b]
    del idata.constant_data["x"]
    pm.sample_posterior_predictive(idata)  # Sampling: [a, b]

@AuguB (Author) commented Aug 16, 2023

I would make it raise NotImplementedError not abstract (so far we haven't needed it, so I wouldn't require every subclass to implement it)

Wouldn't this then raise the error for every subclass that hasn't implemented it, forcing the user to implement it anyway and having the same effect as making it abstract? I think it wasn't needed so far because the coords were extracted in generate_and_preprocess_model_data.

@ricardoV94 (Member):
Wouldn't this then raise the error for every subclass that hasn't implemented it, forcing the user to implement it anyway and having the same effect as making it abstract? I think it wasn't needed so far because the coords were extracted in generate_and_preprocess_model_data.

Ah okay, I missed that

@AuguB (Author) commented Aug 16, 2023

If you have a model parameter that depends on mutable data or has mutable dims that are missing (or have changed value), then those will be resampled from the prior, as if you had not learned anything about them.

Right, makes sense, thanks. I don't believe I ran into this issue before, but good to know. I think I will just do idata.constant_data['X'] *= 0

@ricardoV94 (Member) commented Aug 16, 2023

Right, makes sense, thanks. I don't believe I ran into this issue before, but good to know. I think I will just do idata.constant_data['X'] *= 0

When you rebuild the model, X must also be set to zero (it has to match, or the same thing will happen). This is an edge case; most models don't have mutable data/coords on the inferred parameters.

Anyway, setting everything to zero (or nan) may be a good way to get around the privacy issues without having to remove things.
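
In practice that could be as simple as the following before saving (a sketch; "model.nc" is a placeholder filename and "X" stands for whatever variable the model keeps in constant_data):

    # wipe the stored training inputs, keeping the structure/coords intact
    model.idata.constant_data["X"] *= 0
    model.save("model.nc")
    # when the model is rebuilt for predictions, its data must match (i.e. also zeros)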

Stijn de Boer (imec-OnePlanet) and others added 2 commits August 16, 2023 17:14
Revised version addressing all (or at least most) of the feedback.

Comments:
* I decided to split `extract_model_coords` into `set_model_coords_from_data` and `set_model_coords_from_idata`. The former is abstract, the latter has a sensible default implementation. Both are now setters, but could also return the dict; in the end it made no sense to combine them.
* `set_model_coords_from_idata` is called in `load`, not in `predict`.
* Wrote a little `is_fitted` property (sketched below).
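
A property along those lines might look like this (a sketch assuming self.idata is None until fit has run; not necessarily the exact code in the commit):

    @property
    def is_fitted(self) -> bool:
        # the model counts as fitted once an idata with a posterior group exists
        return self.idata is not None and "posterior" in self.idata.groups()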
@@ -13,6 +13,8 @@
# limitations under the License.


# Modified by Stijn de Boer
Member:
We don't keep this info in the files; it's in the release notes and commit history.
