
Allow passing masked arrays with missing values to pm.Data() #6645

Draft · wants to merge 5 commits into main

Conversation

kamicollo

@kamicollo kamicollo commented Apr 2, 2023

This PR implements the changes required to support passing masked arrays to pm.Data() as discussed in issue #6626.

(rest of PR description TBD)
...

Checklist

Major / Breaking Changes

  • ...

New features

  • ...

Bugfixes

  • ...

Documentation

  • ...

Maintenance

  • ...

📚 Documentation preview 📚: https://pymc--6645.org.readthedocs.build/en/6645/

passed to pm.Data() and pm.Model().set_data()
- Integer masked arrays trigger an error message and provide suggested alternatives
Member

@ricardoV94 ricardoV94 left a comment


Thanks for opening the PR.

We need to test if automatic imputation is working as that is the point of passing nan to observed in PyMC.

We also need to decide what to do if a user passes nan to a MutableData that is used as observed only after defining the observed variable, as that won't trigger the automatic imputation routine. Maybe raise in that case?

I am also afraid MutableData + imputation won't ever be useful during posterior predictive, so maybe we shouldn't allow nan in MutableData at all? That avoids the problem above.


def test_masked_integer_data():
    with pm.Model():
        data = np.ma.MaskedArray([1, 2, 3], [0, 0, 1])
Member


Integers should be fine, otherwise we can't input discrete variables?

Author


Unfortunately, we can't: you cannot have an integer NumPy array with nan values, i.e. this throws an error: np.array([1, 2, 3, np.nan], dtype=int). That's because nan is strictly a float concept. So yes, we would not be able to allow users to pass an integer masked array into pm.Data(). If they want to benefit from automatic imputation, they can today (and still will be able to after this PR) pass a masked integer array directly into the observed parameter of an RV. I have an error message that explains the options in the code:

28d15f8#diff-823b37f218229d363550b4cc387cfffa180c5c6e0e5ad0e174f2f0be7aa4692aR102

if isinstance(data, np.ma.MaskedArray):
    if "int" in str(data.dtype):
        raise TypeError(
            "Masked integer arrays (integer type datasets with missing values) are not supported by pm.Data() / pm.Model.set_data() at this time.\n"
            "Consider if using a float type fits your use case.\n"
            "Alternatively, if you want to benefit from automatic imputation in pyMC, pass a masked array directly to `observed=` parameter when defining a distribution."
        )
    else:
        ret = data.filled(fill_value=np.nan)
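The float-only nature of nan is easy to demonstrate with plain NumPy, independent of PyMC. A minimal sketch of why the integer case has to be rejected, and how a masked integer array sidesteps the problem:

```python
import numpy as np

# nan is strictly a float concept: forcing it into an integer array raises
try:
    np.array([1, 2, 3, np.nan], dtype=int)
    nan_fits_in_int = True
except ValueError:
    nan_fits_in_int = False

# A masked array marks missing entries via the mask rather than via nan,
# so the underlying dtype can stay integer
masked = np.ma.MaskedArray([1, 2, 3], mask=[0, 0, 1])

# Filling with nan forces a float dtype, which is why pm.Data()
# cannot represent an integer masked array as a nan-filled array
filled = masked.astype(float).filled(fill_value=np.nan)
```

This is exactly the trade-off the error message above describes: switch to a float dtype, or pass the masked integer array straight to `observed=`.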

Member


I wasn't clear. Is any error raised if a user passes float observed values to discrete variables? I think it works just fine.

@kamicollo
Author

Thanks for the comments.

Agreed, the PR still needs the part that makes Mutable/Constant data usable in the automatic imputation process (I just opened a draft PR, as per the guidelines, to indicate I started working on it). I planned to extend the code that does automatic imputation to work directly with nan values (instead of masks) when the observed variable is a Mutable/Constant data one. After all, that was the overall goal: to enable usage of pm.Data() with automatic imputation.

I'm not 100% sure what will happen with posterior predictive, but I think it should be the same as when passing a masked array to observed directly. I would have assumed that works. Could you explain why you think it will be an issue with MutableData? (I can also test it myself later today.)

@ricardoV94
Member

ricardoV94 commented Apr 2, 2023

The issue is what happens when you change MutableData between sampling and posterior predictive. PyMC will resample any variable that depends on a MutableData variable that has changed. This would include the imputed variables.

If that makes sense then it's fine. I am just not sure what people need when they use MutableData for both imputation and prediction.

@ricardoV94
Member

And apologies, I didn't see you marked the PR as draft. No worries anyway :)

twiecki and others added 4 commits April 2, 2023 23:51
* ⬆️ UPGRADE: Autoupdate pre-commit config

Co-authored-by: pymc-bot <[email protected]>
@kamicollo
Author

kamicollo commented Apr 3, 2023

No problem at all :)

I have added the rest of the PR which enables automatic imputation support for pm.MutableData() / pm.ConstantData() variables. Testing-wise, I added extra tests to the make_obs_var function (which performs automatic imputation), and went through all existing automatic imputation tests and parameterized them to run 3 times (original numpy masked/nan array, MutableData variable, ConstantData variable). It would be great if you could let me know if you think any other tests are needed.

Regarding posterior predictive - good question. I looked at how PyMC does it right now (when a masked numpy array is passed to observed) for a simple model where a covariate is missing, and I can see that:

  • sample_prior_predictive yields: X_missing and X in prior trace, and X_observed in prior_predictive trace.
  • sample yields X_missing and X in posterior trace.
  • sample_posterior_predictive yields: X_observed in observed_data trace; X and X_observed in posterior_predictive trace.

The X in the posterior trace is the same as the X in the posterior_predictive trace for all indices corresponding to missing data; it differs for non-missing values. That, by the way, makes me wonder whether predicted values for Y are based on a mix of imputed and observed values of X (which seems right) or on predicted values of X (which seems wrong)...

So in a simple case, we don't seem to re-sample imputed values. However, if a user provides a new set of (missing) data, I would intuitively say it would make sense to resample the imputations, too. The missing entries in the new data are not the same as the old ones, after all.
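As a conceptual sketch (plain NumPy, not PyMC internals), a masked array and a nan-filled float array carry the same information: a set of known values plus the positions that need imputation, and the two representations round-trip into each other.

```python
import numpy as np

# A float observation vector with one missing entry
data = np.ma.MaskedArray([1.0, 2.0, 4.0], mask=[0, 0, 1])

observed_part = data.compressed()        # values where the mask is False
missing_idx = np.flatnonzero(data.mask)  # positions that would be imputed

as_nan = data.filled(fill_value=np.nan)  # the equivalent nan representation
recovered_mask = np.isnan(as_nan)        # round-trips back to the mask
```

This equivalence only holds for float dtypes, which is why the nan-based path in this PR cannot cover integer masked arrays.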

Having said that - my current PR does not seem to work with sample_posterior_predictive and passing new data to a variable defined with pm.MutableData():

  • If I run the initial model with missing data, and then pass new missing data via model.set_data(), it seems it gets ignored in the posterior predictive sampling
  • If I run the initial model without missing data and then pass new missing data, the posterior predictive sampling errors out.

I suspect it's because the automatic imputation logic only gets triggered when the model is created, and so in the second case posterior predictive sampling fails because of the nan values. I am unsure why the first case doesn't work - possibly because set_data() doesn't propagate to the underlying sub-tensors behind pm.MutableData()? Would you have immediate thoughts on possible solutions?

Edit: actually, this even applies to situations where a plain sample() call is used after set_data(). So my current PR effectively only enables passing Mutable/Constant data to observed, but does not work when the MutableData is updated.

@jonsedar
Contributor

jonsedar commented Nov 7, 2024

Stumbling into this 18 months later, I'd like to be able to support the usual workflow, and allow for data to be missing in specified features:

  1. Sample the posterior (and posterior predictive, let's live a little) on the in-sample dataset
  2. Replace data with out-of-sample dataset and make forecast predictions

The auto-impute works great for the in-sample dataset, but when I want to replace data in the Data containers, there doesn't seem to be a way to replace the masked array.

E.g. inside the model context I want to replace dfx_train with dfx_holdout, but there seems to be no straightforward way to do this:

...
xk_ma = np.ma.masked_array(dfx_train[fts_xk].values, mask=np.isnan(dfx_train[fts_xk].values))
xk_mu = pm.Normal('xk_mu', mu=0.0, sigma=1, dims='xk_nm')
xk = pm.Normal('xk', mu=xk_mu, sigma=1.0, observed=xk_ma, dims=('oid', 'xk_nm'))
...

Any ideas of alternative methods?
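One small convenience, sketched here with hypothetical frames standing in for dfx_train / dfx_holdout: a helper that derives the mask from nan positions, mirroring the snippet above, so the in-sample and out-of-sample arrays are at least built identically. This does not solve the container-replacement problem itself.

```python
import numpy as np
import pandas as pd

def to_masked(df: pd.DataFrame, cols: list[str]) -> np.ma.MaskedArray:
    """Build a float masked array whose mask marks the nan cells."""
    values = df[cols].to_numpy(dtype=float)
    return np.ma.masked_array(values, mask=np.isnan(values))

# Hypothetical stand-ins for the real training / holdout frames
dfx_train = pd.DataFrame({"x1": [1.0, np.nan], "x2": [3.0, 4.0]})
dfx_holdout = pd.DataFrame({"x1": [np.nan, 5.0], "x2": [6.0, np.nan]})

xk_ma_train = to_masked(dfx_train, ["x1", "x2"])
xk_ma_holdout = to_masked(dfx_holdout, ["x1", "x2"])
```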
