-
-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix modelbuilder load #229
base: main
Are you sure you want to change the base?
Conversation
- `generate_and_preprocess_data` is replaced by `preprocess_data` and `save_model_coords`. Both are called in `fit`, and `predict` calls only `preprocess_data`. - self.X and self.y are no longer used - fit data is not stored in idata, because this could lead to data sharing issues - 'model_coords' are now saved in `idata.attrs`. This enables reconstruction of the correct computation graph if correct test X is provided.
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
- `generate_and_preprocess_data` is replaced by `preprocess_data` and `save_model_coords`. Both are called in `fit`, and `predict` calls only `preprocess_data`. - self.X and self.y are no longer used - fit data is not stored in idata, because this could lead to data sharing issues - 'model_coords' are now saved in `idata.attrs`. This enables reconstruction of the correct computation graph if correct test X is provided.
…-experimental into fix_modelbuilder_load
self.model_config = model_config # parameters for priors etc. | ||
self.model_coords = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this property should be included in child classes IF they utilise coords. Otherwise it forces all classes that would inherit from MB even if they don't have Hierarchical nature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your suggestion. The suggestion does not force any non-hierarchical model to populate the coords
attribute. If a model has no coords, this can stay None
pymc_experimental/model_builder.py
Outdated
@@ -258,6 +254,23 @@ def build_model( | |||
""" | |||
raise NotImplementedError | |||
|
|||
def save_model_coords(self, X: Union[pd.DataFrame, pd.Series], y: pd.Series): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be an abstract method, since it doesn't actually implement anything (the assignment of None, to a property which is originally initialised to None)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this method does not do anything. I don't know exactly how python abstract methods work, but like this I hoped the specific implementation would be optional for those cases where the model coords actually need to be set. For non-hierarchical models, this method does not need to be touched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I didn't even include it here is exactly what you're mentioning here: A lot of models won't use the coords at all, but this here would force the models to contain property that will never be used, along with its setter method. In my opinion is best to not include it at all, and allow for developers of specific child model to include this, as it was done in case of DelayedSaturatedMMM from pymc-marketing
@@ -682,7 +684,11 @@ def predict_proba( | |||
**kwargs, | |||
) -> xr.DataArray: | |||
"""Alias for `predict_posterior`, for consistency with scikit-learn probabilistic estimators.""" | |||
return self.predict_posterior(X_pred, extend_idata, combined, **kwargs) | |||
synth_y = pd.Series(np.zeros(len(X_pred))) | |||
X_pred_prep, y_synth_prep = self.preprocess_model_data(X_pred, synth_y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
preprocess_model_data should not be defined in a way that forces creation of entire dummy dataset for y if we want to process X, so: preprocess_model_data(X_pred: pd.DataFrame, y_pred:Optional[pd.Series]= None) and the logic that checks first if the y was provided.
pymc_experimental/model_builder.py
Outdated
@@ -258,6 +254,23 @@ def build_model( | |||
""" | |||
raise NotImplementedError | |||
|
|||
def save_model_coords(self, X: Union[pd.DataFrame, pd.Series], y: pd.Series): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does save_model_coords require both X and y as inputs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did that because I thought there may be some highly unusual cases in which y is also required. I would have no problem with it myself if this function only takes X.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case y should be optional param, with default value being None
pymc_experimental/model_builder.py
Outdated
def generate_and_preprocess_model_data( | ||
self, X: Union[pd.DataFrame, pd.Series], y: pd.Series | ||
) -> None: | ||
def preprocess_model_data(self, X: Union[pd.DataFrame, pd.Series], y: pd.Series = None) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signature doesn't match the return value from docstring, will result in mypy error in child class about breaking Liskov's principle if it was implemented in a way suggested in a docstring
@AuguB, the example I've included in our discussion (DelayedSaturatedMMM) is an Hierarchical model using coordinates. Just in my opinion defining a property and its dedicated setter isn't justifiable when a lot of models simply don't use them. |
Also not that much related to the topic, but pymc-marketing is already integrated with this API, so any change done here would result in quite an effort there to adjust (models themselves, tests, example notebooks, tutorials), so probably the separation of generating data based on the input dataset, and preprocessing will be addressed in the future, but this PR shouldn't be merged, unless those models would be adjusted. Nevertheless thank you for your effort with preparing this PR, it shows where we can focus our efforts in the future to make ModelBuilder API more user-friendly |
I don't see how omitting the fit_data from the idata would have no consequences for the model's ability to load. I see a clear dependence on this data in the example here. dataset = idata.fit_data.to_dataframe()
X = dataset.drop(columns=[model.output_var])
y = dataset[model.output_var].values
model.build_model(X, y) But besides that, to be honest I think that shipping the full training dataset with a model just to make sure the computation graph is correctly constructed feels a bit cumbersome. |
you're right, I completely missed it while writing the response. @ricardoV94 , @twiecki do you see this as an issue that will impact a lot of usual business cases? If the data privacy is an edge case I don't think it should be addressed here, but by overloading the load function in the child class that would be defined for that specific customer |
Yes, I understand that this is an issue. I would love to see this pattern be used in a later version, for now I will just use my fork. Thanks for all the helpful comments. |
@AuguB it is generally better to first open an issue to explain what functionality is missing/broken before jumping to a PR. This will ensure there is consensus before work is invested on your part. Anyway, about having coords stuff by default, I think that is totally fine. We promote coords as part of a good workflow. About the privacy issues. That is interesting, but not really unique to ModelBuilder? When you do I agree with @michaelraczycki that the user should manually delete the observed data if they want to get rid of it (or use a subclass of ModelBuilder that does that). |
RE: Model coords. Was it the case that loaded models would lose coords information? |
Nope, just the implementation of coords and their preservation I left for child classes, as part of the generate_and_preprocess_data (coords being one of the generation parts). I kinda did it on purpose, because I believe that if half of the models won't be using hierarchical models, we shouldn't force them to implement, or inherit methods that are purely there to take care of them |
once again, for Hierarchical models I'm treating DelayedSaturatedMMM as an example of implementation |
I don't follow what hierarchical models have to do with coords. You can (and should) use coords for any kind of model |
okey, in that case we can move the parameter definition to MB, but the setter function should be an abstract one here |
Apologies, I see you started a discussion (not issue) on that. Ignore that comment :) |
By abstract you mean an |
Aren't the coords saved in the fitted idata though? Why is something extra needed? |
Yes, the data is stored by default, but removing the data from the trace would break the modelbuilder.load as it is now.
No, the coords information is still there. I agree that it is suboptimal to force the user to inherit methods that aren't used (and I don't think we should force the user to implement it by making it abstract). However, right now, if the user has privacy-sensitive data with coords that may not overlap completely (like me), he is forced to overload fit, predict, save, and load. That, to me, seems like a bit much just to avoid the
Because the coords dict will be needed in |
I may be wrong again (mostly dealing with implementation, not the actual usage) but you shouldn't build model again if you want to make predictions. _data_setter method should be implemented using pymc.set_data function, which replaces the data stored within the model variables |
You're right. The point is that the |
Not always, but I do |
Okey, now I get it. Regarding the setter function, in my opinion it's easy: either it's common enough that we should force every class to implement it, if it's not so commonly used to justify the abstract part of it, it doesn't belong here. In SOLID principles, the 3rd states that you should be able to replace any method defined by the parent function with a Childs and it should not cause problems Derived classes must be substitutable for their base classes. So in this case the setter method here (without a concrete implementation) would not fulfil those requirements, as the replacement from subclass would change the outcome. I'll try to get the PR making the coords a property of ModelBuilder this week, and try to address the load issue, to make it possible to load a model which data has been wiped out. Any thoughts on that @ricardoV94 ? I can include something more, probably we'd make a release afterwards |
Okay, great. One related thing that has not been adressed, and which guided my thinking in this issue, is the following: I want to apply the same preprocessing on the for fit: Deciding against |
in that case please put the preprocessing logic in _data_setter, it should contain everything needed to successfully replace the data in your model for predictions. I understand your reasoning, but you're talking about 2 different edge cases that can't all be addressed here. Even with other pymc models we've integrated MB with so far each of them needed overloading some functions, simply because it's almost impossible to make a template class that will address every single scenario, unless you want to enforce a lot of stuff down the road. |
My only vague issue with saving coords separately is that there are now two sources for the same information? This can eventually become a subtle source of bugs if/when they diverge (e.g., different results in arviz plots and summary statistics). Usually this is a bad practice, although generalities don't always apply. Can't we have a default For example, if arviz changes how coords are stored internally we don't have to do anything in ModelBuilder. But if we are now responsible for encoding/decoding coords we are duplicating this work and potentially diverging from arviz. |
Somewhat related, some methods like This can certainly be overcome, but just wanted to raise that concern so you don't get surprised down the road. |
But wouldn't that mean that we have the same preprocessing logic in
I agree that it would be cleaner to read the coords from the def extract_coords(self, X, y):
if self.is_fitted:
self.model_coords = # Extract coords from idata
else:
self.model_coords = self.extract_coords_from_data(X) where Then for both |
Sounds reasonable at first glance. I would probably pass both
I would make it raise Before we go down that road, it would be good to check this actually solves your problem? See my message above about |
I'm not quite sure if I understand exactly what you mean. Do you have an example where this happens? |
If you have a model parameter that depends on mutable data or has mutable dims that are missing (or have changed value), then those will be resampled from the prior, as if you had not learned anything about them. import pymc as pm
with pm.Model() as m:
x = pm.MutableData("x", 0)
a = pm.Normal("a", x)
b = pm.Normal("b", a, observed=[100] * 10)
idata = pm.sample()
pm.sample_posterior_predictive(idata) # Sampling: [b]
del idata.constant_data["x"]
pm.sample_posterior_predictive(idata) # Sampling: [a, b] |
Wouldn't this then raise the error for every subclass that hasn't implemented it, forcing the user to implement, having the same effect as making it abstract? I think so far it wasn't needed because the coords were extracted in |
Ah okay, I missed that |
Right, makes sense, thanks. I don't believe I ran into this issue before, but good to know. I think I will just do |
When you rebuild the model Anyway, setting everything to zero (or nan) may be a good way to get around the privacy issues without having to remove things. |
Revised version addressing all (or at least most) of the feedback. Comments: * I decided to split the `extract_model_coords` into `set_model_coords_from_data` and `set_model_coords_from_idata`. The former is abstract, the latter has a sensible default implementation. Both are now setters, but could also return the dict. It made no sense to combine them in the end. * The `set_model_coords_from_idata` is called in `load`, and not in `predict`. * Wrote a little `is_fitted` property
@@ -13,6 +13,8 @@ | |||
# limitations under the License. | |||
|
|||
|
|||
# Modified by Stijn de Boer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have this info in files, it's in the release notes and commit history.
A suggestion for changing the way modelbuilder saves and loads the model.
The two issues this addresses is: