Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Imputation does not work in combination with pm.Data #4441

Open
michaelosthege opened this issue Jan 25, 2021 · 14 comments
Open

Imputation does not work in combination with pm.Data #4441

michaelosthege opened this issue Jan 25, 2021 · 14 comments

Comments

@michaelosthege
Copy link
Member

michaelosthege commented Jan 25, 2021

Description

Even after #4439 imputations don't work in combination with pm.Data.
This is because pm.Data creates a SharedVariable that currently does not support a np.ma.MaskedArray.

Almost identical to the example from #4437:

data = numpy.array([
    [1,2,3],
    [4,5,float("nan")],
    [7,8,9],
])
print(data)
with pymc3.Model():
    pymc3.Normal(
        "L",
        mu=pymc3.Normal("x", shape=data.shape),
        sd=10,
        observed=pm.Data("D", data),
        shape=data.shape
    )
    pymc3.sample()

Please provide the full traceback.

SamplingError                             Traceback (most recent call last)
<ipython-input-32-e837139c32a4> in <module>
     13         shape=data.shape
     14     )
---> 15     pymc3.sample()

c:\users\osthege\repos\pymc3-dev\pymc3\sampling.py in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, callback, jitter_max_retries, return_inferencedata, idata_kwargs, mp_ctx, pickle_backend, **kwargs)
    425     model = modelcontext(model)
    426     if start is None:
--> 427         check_start_vals(model.test_point, model)
    428     else:
    429         if isinstance(start, dict):

c:\users\osthege\repos\pymc3-dev\pymc3\util.py in check_start_vals(start, model)
    236                 "Initial evaluation of model at starting point failed!\n"
    237                 "Starting values:\n{}\n\n"
--> 238                 "Initial evaluation results:\n{}".format(elem, str(initial_eval))
    239             )
    240 

SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'x': array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])}

Initial evaluation results:
x   -8.27
L     NaN
Name: Log-probability of test_point, dtype: float64

Versions and main components

@ricardoV94
Copy link
Member

ricardoV94 commented Jan 25, 2021

Is this something that might be problematic if automatically inputed (setting new Data with more/less missing values than the initial data)?

@michaelosthege
Copy link
Member Author

The whole implementation for switching the data in an existing model is broken.
Shape issues are one thing, but @ricardoV94 is right that with imputation it could get even worse. Imputation is realized by automatically changing the model graph. Switching out the data afterwards will almost certainly break it unless the mask is identical.

I see pm.Data primarily as a tool to get the data & coords nicely represented in the model graph and resulting InferenceData.
We should probably separate those two use cases into something like pm.Data and pm.MutableData.

@michaelosthege
Copy link
Member Author

I have a hunch that we'll have to revisit the whole imputation feature under the new RandomVariable paradigm. After merging #4439 I'm fine with doing observed=pm.Data(...).container.data as a workaround.

Let's label this "wontfix" and revisit it for PyMC3 >=4.0.

@AlexAndorra
Copy link
Contributor

Agreed 👌
Out of curiosity, what do you mean by "doing observed=pm.Data(...).container.data as a workaround" ?

@michaelosthege
Copy link
Member Author

Agreed 👌
Out of curiosity, what do you mean by "doing observed=pm.Data(...).container.data as a workaround" ?

This way I can use the pm.Data to include my data into the InferenceData, but also use it for imputation. Only the graph I get from pm.model_to_graphviz now has the Data node disconnected, but I can live with that.

@ricardoV94
Copy link
Member

After learning about imputed variables, this feature would require a considerable change in the internals, since all the imputation logic is happening during the model definition.

@michaelosthege
Copy link
Member Author

Yes, we can't have support for the combination of SharedVariable+imputation.
That brings me back to the proposal of distinguishing between pm.ConstantData and pm.MutableData or something like that.

@ricardoV94
Copy link
Member

Isn't the default observed ConstantData?

@michaelosthege
Copy link
Member Author

Observed are not automatically tracked with dims/coords and don't show up in model_to_graphviz.
Also it is not always the case that these arrays become observed. Sometimes you need a vector of x values as an input to the regression and so far only by making it a pm.Data you can get it into the InferenceData..

@ricardoV94
Copy link
Member

I see... should we check for nans in pm.set_data and just prohibit it? Or too much of an edge case to be worth bothering?

@michaelosthege
Copy link
Member Author

michaelosthege commented Dec 22, 2021

Let's implement pm.ConstantData and pm.MutableData. Then we can check and warn on NaN and point users to ConstantData which creates a TensorConstant and registers it for dims/coords with the model. .

Could give us some speed-up too, because with constant data the shape is known..

michaelosthege added a commit to michaelosthege/pymc that referenced this issue Dec 30, 2021
By passing `pm.Data(mutable=False)` one can create a `TensorConstant` instead of a `SharedVariable`.
Data variables with known, fixed shape can enhance performance and compatibility in some situations.
`pm.ConstantData` or `pm.MutableData` wrappers are provided as alternative syntax.

This is the basis for solving pymc-devs#4441.
twiecki pushed a commit that referenced this issue Jan 4, 2022
By passing `pm.Data(mutable=False)` one can create a `TensorConstant` instead of a `SharedVariable`.
Data variables with known, fixed shape can enhance performance and compatibility in some situations.
`pm.ConstantData` or `pm.MutableData` wrappers are provided as alternative syntax.

This is the basis for solving #4441.
@michaelosthege
Copy link
Member Author

I think this can be closed since one can now use pm.ConstantData

@ricardoV94
Copy link
Member

I think imputation will still fail with ConstantData

@ricardoV94 ricardoV94 reopened this Jun 18, 2022
@michaelosthege
Copy link
Member Author

I think imputation will still fail with ConstantData

Oh because we're not yet checking NaN in tensors? Yeah sorry, I forgot about that. Good call to open it again 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants