
Allow passing masked arrays with missing values to pm.Data() #6645

Draft · wants to merge 5 commits into main

Conversation

kamicollo

@kamicollo kamicollo commented Apr 2, 2023

This PR implements the changes required to support passing masked arrays to pm.Data() as discussed in issue #6626.

(rest of PR description TBD)
...

Checklist

Major / Breaking Changes

  • ...

New features

  • ...

Bugfixes

  • ...

Documentation

  • ...

Maintenance

  • ...

📚 Documentation preview 📚: https://pymc--6645.org.readthedocs.build/en/6645/

passed to pm.Data() and pm.Model().set_data()
- Integer masked arrays trigger an error message and provide suggested alternatives
Member

@ricardoV94 ricardoV94 left a comment


Thanks for opening the PR.

We need to test if automatic imputation is working as that is the point of passing nan to observed in PyMC.

We also need to decide what to do if a user passes nan to a MutableData that is used as observed only after defining the observed variable, as that won't trigger the automatic imputation routine. Maybe raise in that case?

I am also afraid MutableData + imputation won't ever be useful during posterior predictive, so maybe we shouldn't allow nan in MutableData at all? That avoids the problem above.


def test_masked_integer_data():
    with pm.Model():
        data = np.ma.MaskedArray([1, 2, 3], [0, 0, 1])
Member


Integers should be fine, otherwise we can't input discrete variables?

Author


Unfortunately, we can't: you cannot have an integer NumPy array with nan values, i.e. this throws an error: np.array([1, 2, 3, np.nan], dtype=int). That's because nan is strictly a float concept. So yes, we would not be able to allow users to pass an integer masked array into pm.Data(). If they want to benefit from automatic imputation, they can today (and still will be able to after this PR) pass a masked integer array directly into the observed parameter of an RV. I have an error message that explains the options in the code:

28d15f8#diff-823b37f218229d363550b4cc387cfffa180c5c6e0e5ad0e174f2f0be7aa4692aR102

if isinstance(data, np.ma.MaskedArray):
    if "int" in str(data.dtype):
        raise TypeError(
            "Masked integer arrays (integer type datasets with missing values) are not supported by pm.Data() / pm.Model.set_data() at this time.\n"
            "Consider if using a float type fits your use case.\n"
            "Alternatively, if you want to benefit from automatic imputation in pyMC, pass a masked array directly to `observed=` parameter when defining a distribution."
        )
    else:
        ret = data.filled(fill_value=np.nan)
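The float-only nature of nan is easy to demonstrate with plain NumPy, independent of PyMC. A minimal sketch of why the integer case has to be rejected, and how a masked integer array sidesteps the problem:

```python
import numpy as np

# nan is strictly a float concept: forcing it into an integer array raises
try:
    np.array([1, 2, 3, np.nan], dtype=int)
    nan_fits_in_int = True
except ValueError:
    nan_fits_in_int = False

# A masked array marks missing entries via the mask rather than via nan,
# so the underlying dtype can stay integer
masked = np.ma.MaskedArray([1, 2, 3], mask=[0, 0, 1])

# Filling with nan forces a float dtype, which is why pm.Data()
# cannot represent an integer masked array as a nan-filled array
filled = masked.astype(float).filled(fill_value=np.nan)
```

This is exactly the trade-off the error message above describes: switch to a float dtype, or pass the masked integer array straight to `observed=`.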

Member


I wasn't clear. Is any error raised if a user passes float observed values to discrete variables? I think it works just fine.

@kamicollo
Author

Thanks for the comments.

Agreed, the PR still needs the part that makes Mutable/Constant data usable in the automatic imputation process (I just opened a draft PR, as per the guidelines, to indicate I started working on it). I planned to extend the code that does automatic imputation to work directly with nan values (instead of masks) when the observed variable is a Mutable/Constant data one. After all, that was the overall goal: to enable usage of pm.Data() with automatic imputation.

I'm not 100% sure what will happen with posterior predictive, but I think it should be the same as when passing a masked array to observed directly. I would have assumed that works. Could you explain why you think it will be an issue with MutableData? (I can also test it myself later today.)

@ricardoV94
Member

ricardoV94 commented Apr 2, 2023

The issue is what happens when you change MutableData between sampling and posterior predictive. PyMC will resample any variable that depends on a MutableData variable that has changed. This would include the imputed variables.

If that makes sense then it's fine. I am just not sure what people need when they use MutableData for both imputation and prediction.

@ricardoV94
Member

And apologies, I didn't see you marked the PR as draft. No worries anyway :)

twiecki and others added 4 commits April 2, 2023 23:51
* ⬆️ UPGRADE: Autoupdate pre-commit config

Co-authored-by: pymc-bot <[email protected]>
@kamicollo
Author

kamicollo commented Apr 3, 2023

No problem at all :)

I have added the rest of the PR which enables automatic imputation support for pm.MutableData() / pm.ConstantData() variables. Testing-wise, I added extra tests to the make_obs_var function (which performs automatic imputation), and went through all existing automatic imputation tests and parameterized them to run 3 times (original numpy masked/nan array, MutableData variable, ConstantData variable). It would be great if you could let me know if you think any other tests are needed.

Regarding posterior predictive - good question. I looked at how PyMC does it right now (when a masked numpy array is passed to observed) for a simple model where a covariate is missing, and I can see that:

  • sample_prior_predictive yields: X_missing and X in prior trace, and X_observed in prior_predictive trace.
  • sample yields X_missing and X in posterior trace.
  • sample_posterior_predictive yields: X_observed in observed_data trace; X and X_observed in posterior_predictive trace.

The X in the posterior trace is the same as the X in the posterior_predictive trace for all indices corresponding to missing data; it differs for non-missing values. That, by the way, makes me wonder whether predicted values for Y are based on a mix of imputed and observed values of X (which seems right) or on predicted values of X (which seems wrong)...

So in a simple case, we don't seem to re-sample imputed values. However, if a user provides a new set of (missing) data, I would intuitively say it would make sense to resample the imputations, too. The missing entries in the new data are not the same as the old ones, after all.
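As a conceptual sketch (plain NumPy, not PyMC internals), a masked array and a nan-filled float array carry the same information: a set of known values plus the positions that need imputation, and the two representations round-trip into each other.

```python
import numpy as np

# A float observation vector with one missing entry
data = np.ma.MaskedArray([1.0, 2.0, 4.0], mask=[0, 0, 1])

observed_part = data.compressed()        # values where the mask is False
missing_idx = np.flatnonzero(data.mask)  # positions that would be imputed

as_nan = data.filled(fill_value=np.nan)  # the equivalent nan representation
recovered_mask = np.isnan(as_nan)        # round-trips back to the mask
```

This equivalence only holds for float dtypes, which is why the nan-based path in this PR cannot cover integer masked arrays.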

Having said that - my current PR does not seem to work with sample_posterior_predictive and passing new data to a variable defined with pm.MutableData():

  • If I run the initial model with missing data, and then pass new missing data via model.set_data(), it seems it gets ignored in the posterior predictive sampling
  • If I run the initial model without missing data and then pass new missing data, the posterior predictive sampling errors out.

I suspect it's because the automatic imputation logic only gets triggered when the model is created, and so in the second case posterior predictive sampling fails because of the nan values. I am unsure why the first case doesn't work - possibly because set_data() doesn't propagate to the underlying sub-tensors behind pm.MutableData()? Would you have immediate thoughts on possible solutions?

Edit: actually, this even applies to situations where a plain sample() call is used after set_data(). So my current PR effectively only enables passing Mutable/Constant data to observed, but does not work when the MutableData is updated.

@jonsedar
Contributor

jonsedar commented Nov 7, 2024

Stumbling into this 18 months later, I'd like to be able to support the usual workflow, and allow for data to be missing in specified features:

  1. Sample the posterior (and posterior predictive, let's live a little) on the in-sample dataset
  2. Replace data with out-of-sample dataset and make forecast predictions

The auto-impute works great for the in-sample dataset, but when I want to replace data in the Data containers, there doesn't seem to be a way to replace the masked array.

E.g. inside the model context I want to replace dfx_train with dfx_holdout, but there seems to be no straightforward way to do this:

...
xk_ma = np.ma.masked_array(dfx_train[fts_xk].values, mask=np.isnan(dfx_train[fts_xk].values))
xk_mu = pm.Normal('xk_mu', mu=0.0, sigma=1, dims='xk_nm')
xk = pm.Normal('xk', mu=xk_mu, sigma=1.0, observed=xk_ma, dims=('oid', 'xk_nm'))
...

Any ideas of alternative methods?
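One small convenience, sketched here with hypothetical frames standing in for dfx_train / dfx_holdout: a helper that derives the mask from nan positions, mirroring the snippet above, so the in-sample and out-of-sample arrays are at least built identically. This does not solve the container-replacement problem itself.

```python
import numpy as np
import pandas as pd

def to_masked(df: pd.DataFrame, cols: list[str]) -> np.ma.MaskedArray:
    """Build a float masked array whose mask marks the nan cells."""
    values = df[cols].to_numpy(dtype=float)
    return np.ma.masked_array(values, mask=np.isnan(values))

# Hypothetical stand-ins for the real training / holdout frames
dfx_train = pd.DataFrame({"x1": [1.0, np.nan], "x2": [3.0, 4.0]})
dfx_holdout = pd.DataFrame({"x1": [np.nan, 5.0], "x2": [6.0, np.nan]})

xk_ma_train = to_masked(dfx_train, ["x1", "x2"])
xk_ma_holdout = to_masked(dfx_holdout, ["x1", "x2"])
```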
