Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardising APIs between compiler and inference methods #16

Closed
xukai92 opened this issue Dec 20, 2018 · 35 comments
Closed

Standardising APIs between compiler and inference methods #16

xukai92 opened this issue Dec 20, 2018 · 35 comments

Comments

@xukai92
Copy link
Member

xukai92 commented Dec 20, 2018

See https://github.com/TuringLang/Turing.jl/issues/634#issuecomment-471339521 for a summary.

@yebai
Copy link
Member

yebai commented Dec 22, 2018

@willtebbutt @xukai92
how about we always drop all observe statements in gdemo()(), ie, we always sample all variables from the prior, even for the following case raised by @willtebbutt (slightly modified)

@model demo(y) = begin
   5 ~ Normal(0,1)   # rv1
   y ~ Normal(0,1) . # rv2
end

gdemo()()  # sample both `rv1` and `rv2` from the prior, ingoring `rv1=5`
gdemo(2)() # sample from the posterior

This leads to consistent behaviour and should be intuitive too.

@xukai92
Copy link
Member Author

xukai92 commented Dec 22, 2018

I see. But what I mean by this issue is to actually have some concrete functions generated by evaluating gdemo() (I think @willtebbutt also proposed this some time ago).

Currently behaviour:

@model demo(y) = begin
   5 ~ Normal(0,1)   # rv1
   y ~ Normal(0,1) . # rv2
end
mf = gdemo()
mf() # => sampling from prior

But instead we could have

@model demo(y) = begin
   5 ~ Normal(0,1)   # rv1
   y ~ Normal(0,1) . # rv2
end
mf = gdemo(???) # this returns mf::UniqueRuntimeModelType
                # and create some function definitions:
logpdf(::UniqueRuntimeModelType, vi) = ... # (1) return logpdf by evaluating using vi
rand(::UniqueRuntimeModelType) = ... # (2) gdemo()() in your example
sample(::UniqueRuntimeModelType) = ... # (3) gdemo(2)() in your example

I guess it could also potentially give more space for performance because during sample we are explicitly using the specific version of logpdf and rand.

@willtebbutt
Copy link
Member

willtebbutt commented Dec 23, 2018

@yebai your proposal has the issue that the behaviour of the following programme becomes ambiguous:

@model demo() = begin
    5 ~ Normal(0, 1)
end
demo() # what does this do?

This makes me nervous. On the one hand, this programme is weird, on the other it's a valid Turing programme, so we should ensure that the behaviour is well-defined. My proposal below will mostly resolve this.

@xukai92 I completely agree regarding the format of logpdf. Regarding rand and sample: certainly we could define one to sample from the prior and the other the posterior, but this wouldn't be particularly intuitive (there's nothing in the names to suggest the proposed behaviour). Instead, I propose the following:

  1. Don't define sample, just use rand and logpdf (or maybe logp, since logpdf isn't defined for all models supported by Turing?).
  2. The current proposal regarding gdemo(missing) to indicate that the prior should be used for y seems sensible to me.
  3. We implement gdemo(Prior()) to indicate that all random variables should be sampled from the prior. In this case, both rv1 and rv2 would be sample from the prior.
  4. rand(mf) produces a VarInfo. rand(mf, N) produces a Vector{<:VarInfo} of length N.
  5. logpdf(mf, vi) works as proposed by @xukai92.

@mohamed82008
Copy link
Member

mohamed82008 commented Dec 23, 2018

One the one hand, this programme is weird, on the other it's a valid Turing programme, so we should ensure that the behaviour is well-defined.

We can give an error if the model has no variables.

UniqueRuntimeModelType

This can be just const PriorModel{pvars, F} = Model{pvars, Tuple{}, F, NamedTuple{(), Tuple{}}} which means no data, all parameters. For this case, no special sampling alg is required. More generally however, we can make use of the existing Sampler type, which we can add model (and vi) to. Then we can simply do: rand(Sampler(model, alg)) or for models without parameters, it can be rand(Sampler(model)).

Btw, after the new refactor, if you call gdemo(), it will give you a PriorModel{pvars, F}, with pvars as Tuple{:y}.

@willtebbutt
Copy link
Member

We can give an error if the model has no variables.

Agreed, but this programme definitely has a random variable. It's anonymous, but it's definitely there, and if we define sample-from-prior functionality, then presumably something should happen.

Btw, after the new refactor, if you call gdemo(), it will give you a PriorModel{pvars, F}, with pvars as Tuple{:y}.

This is probably fine for now, but I'm not sure it's appropriate long-term for the same reasons as above.

This can be just const PriorModel{pvars, F} = Model{pvars, Tuple{}, F, NamedTuple{(), Tuple{}}} which means no data, all parameters. For this case, no special sampling alg is required. More generally however, we can make use of the existing Sampler type, which we can add model (and vi) to. Then we can simply do: rand(Sampler(model, alg)) or for models without parameters, it can be rand(Sampler(model)).

I like this. We would have to think carefully about how the rand(Sampler(model, alg)) would work for MCMC though. We should also probably implement a few convenience functions as well i.e. rand(model) = rand(Sampler(model)) and similar.

@mohamed82008
Copy link
Member

mohamed82008 commented Dec 23, 2018

It's anonymous, but it's definitely there, and if we define sample-from-prior functionality, then presumably something should happen.

Is there a practical application of sampling this anonymous random variable, or is it just for theoretical appeal? Note that implementation-wise, there is nothing random about a model with no explicit random variables; every time you run the model, you will get exactly the same thing, hence my difficulty digesting this anonymous random variable concept. I can see how 5 plays the role of data. So perhaps more generally this falls under likelihood models, where the only sensible thing to do with it, that I can think of, is to calculate the likelihood.

We should also probably implement a few convenience functions as well i.e. rand(model) = rand(Sampler(model)) and similar.

Sounds good.

@mohamed82008
Copy link
Member

mohamed82008 commented Dec 23, 2018

Note that currently, we just ignore any numbers or arrays on the LHS of a ~. If you want, we can assign an anonymous dvar to it instead and assign the value on the LHS of the ~ to this anonymous dvar. This would probably be more theoretically accurate. And it allows us to view the model with 5 ~ Normal(0, 1) only as a LikelihoodModel with 1 dvar.

Edit: by ignore I mean that we don't treat it as an explicit variable, but it still affects the logp calculation.

@willtebbutt
Copy link
Member

willtebbutt commented Dec 23, 2018

Is there a practical application of sampling this anonymous random variable, or is it just for theoretical appeal?

I'm just looking for consistency. I don't have a particular application in mind :)

Note that implementation-wise, there is nothing random about a model with no explicit random variables; every time you run the model, you will get exactly the same thing.

Not if you run the model in sample-from-prior mode.

If you want, we can assign an anonymous dvar to it instead and assign the value on the LHS of the ~ to this anonymous dvar. This would probably be more theoretically accurate. And it allows us to view the model with 5 ~ Normal(0, 1) only as a LikelihoodModel with 1 dvar.

This general kind of thing would be a good change. If the model is constructed in the regular way, it should be a dvar as you suggest, but if the model is run in sample-from-prior mode, then it should presumably magically become a dvar.

@mohamed82008
Copy link
Member

Not if you run the model in sample-from-prior mode.

@model demo() = begin
   5 ~ Normal(0,1)
end
model = demo()
logp = model(Turing.VarInfo(), Turing.SampleFromPrior())
all(isequal(logp), (model(Turing.VarInfo(), Turing.SampleFromPrior()) for i in 1:100))

vi is empty in every case, with a logp only.

@mohamed82008
Copy link
Member

then it should presumably magically become a dvar.

This should be totally possible when constructing the Sampler.

@willtebbutt
Copy link
Member

Sorry, I should have been clearer. I was referring to what I think should be going on rather than what actually happens at the minute. You're totally correct regarding the current behaviour. It's my feeling that if we implement a sampler-from-prior mode, then we should take these anonymous random variables more seriously and treat them more like named random-variables.

@mohamed82008
Copy link
Member

then we should take these anonymous random variables more seriously and treat them more like named random-variables.

I don't foresee a problem handling these. I can try a proof-of-concept after the VarInfo refactor. Perhaps @trappmartin can also help with this after he reads the new compiler code.

@trappmartin
Copy link
Member

Thanks for cc‘ing me. I’m pretty much in favour of the vi = rand(Sampler(model)) and the logpdf(model, vi) concept. Though, I believe it would be good if we keep the sample function. It’s semantically more clear to me what this does and that I can expect it to generate samples using a MCMC sampler. I would not expect it to draw from the prior as rand.

However, I’m not sure how inference with VI would fit into those syntaxes to stay consistent without having too many entry points for the user.

@willtebbutt
Copy link
Member

However, I’m not sure how inference with VI would fit into those syntaxes to stay consistent without having too many entry points for the user.

My thinking here is that VI works fine with the rand(sampler(model, alg), N) design, where the alg parameter would be a representation of the approximate posterior distribution. Semantically this reads as: draw N samples from the posterior distribution over the random variables in model (each of which should be a VarInfo?) using alg. This works for both VI and MCMC, so provides a single point of entry for both kinds of samplers.

Though, I believe it would be good if we keep the sample function. It’s semantically more clear to me what this does and that I can expect it to generate samples using a MCMC sampler.

Again, my feeling here is that we wouldn't need the sample function, as the creation of the sampler, e.g. rand(sampler(model, HMC(...)), N), makes it quite clear that we're drawing N samples from model using the specified HMC algorithm. Although I could live with the sample function being an alias for rand, I suspect that rand is more consistent with existing stuff.

I’m pretty much in favour of the vi = rand(Sampler(model)) and the logpdf(model, vi) concept.

As am I :)

A minor stylistic thing, @yebai what would your thoughts be on making the "correct" thing to do be sampler(model, alg) rather than Sampler(model, alg)? (At least to me) the latter says make me an object of type Sampler whereas the former says make me a thing that can generate samples from model using alg, which may or may not be an object of type Sampler. The distinction is important because the Sampler option commits us to always producing an object of type Sampler if we want to produce samples, whereas the former allows us to spit out whatever object we like provided that it can be used to generate samples via rand.

@xukai92
Copy link
Member Author

xukai92 commented Feb 7, 2019

Reading through your explanation of sample v.s. Sampler, I feel sample is better, although in most of the cases (at this moment), sample will return a Sampler to us. Also I think using a function rather than a type conversion syntax is a more standard Julia style.

@yebai
Copy link
Member

yebai commented Mar 10, 2019

Here is a summary of some proposals in this issue (biased towards personal aesthetic)

  • logp(m::Model, v::VarInfo, s::Selector=undef) computes log density/mass for model m at value v::VarInfo
  • Sampler(m, alg) creates a sampler for model using alg as the sampling method
  • rand(s::Sampler(m, alg), N) produces a Vector{<:VarInfo} of length N.
    • rand(s::Sampler(m, SampleFromPrior()), N) produces a Vector{<:VarInfo} of length N, by sampling from the prior distribution while ignoring all observations (including something like 5~Normal(0,1)).
    • sample(m, alg; N=N, args...) provides an alias to rand(Sampler(model, alg), N, args...)
    • v = rand(m::Model, N=1) is an alias to v = rand(Sampler(m, SampleFromPrior()), N)

This would provide a relatively well-defined interface between the compiler and inference methods (e.g. variational methods and sampling methods). For variational methods, the most important API is the logp function and its gradients. While for sampling methods, both the logp function and the rand function are necessary. This would allow us to clean up the code in many sampler's implementations. Such as examples below:

A long term goal is to build up a clear separation between the compiler and inference methods (see TuringLang/Turing.jl#456). Such that inference methods can be implemented in separate packages with a dependency on Turing.

Anything missing?

yebai referenced this issue in TuringLang/Turing.jl Mar 11, 2019
@xukai92
Copy link
Member Author

xukai92 commented Mar 19, 2019

v = rand(m::Model, N=1) is an alias to v = rand(Sampler(m, SampleFromPrior()), N)

How about enabling this alias with support to sampling algorithm as well.

v = rand(m::Model, N=1; alg=SampleFromPrior()) = rand(Sampler(m, alg), N)

@yebai yebai changed the title Forbid mf() and implement explicit rand and logpdf functions for mf Standardising APIs between compiler and inference methods Mar 20, 2019
@xukai92
Copy link
Member Author

xukai92 commented Mar 22, 2019

How about AbstractRunner -> AbstractTilde

@willtebbutt
Copy link
Member

I don't know that we actually need a supertype here. I mean, is there going to be a fallback that can be applied to Abstract<Thing>? If not, we can just create concrete types as required and avoid the hassle of having to choose and name for an abstract type, and forcing things to inherit from it.

@yebai
Copy link
Member

yebai commented Mar 27, 2019

Let's stick with AbstractRunner type for now and change it to something else when we got better ideas.

@yebai yebai pinned this issue Mar 27, 2019
@mohamed82008
Copy link
Member

I suggest during the model compilation we generate 2 functions inside Model:

  1. One for the log likelihood which replaces all the a ~ dist lines with a ~ Fixed() where the assume of Fixed returns the variable value in vi and 0 for the logpdf. This is useful for maximum likelihood estimation.
  2. The second is the logpdf of the joint probability distribution that we have now. This is to be used for MAP and inference.

@trappmartin
Copy link
Member

@mohamed82008 I think this is a good idea. Thanks! I'll take a closer look into this asap.

@trappmartin
Copy link
Member

trappmartin commented Apr 4, 2019

@yebai I had some time to read through your proposal. It sound quite reasonable to me. However, I have a few questions / remarks.

  • What is the purpose of the Selector in the logp function? Do we need that?

In favour of what Mohamed suggested and the latest discussions on model criticism, it might be good to have:

  • logjoint(m::Model, v::VarInfo, s::Selector=undef) == logp
  • logpdf(m::Model, v::VarInfo, s::Selector=undef)
  • logppd(m::Model, v::Vector{VarInfo}, s::Selector=undef)

As proposed by Mohamed we could do this by generating respective functions in the model macro.

@willtebbutt
Copy link
Member

willtebbutt commented Apr 4, 2019

Could someone please link the relevant discussions here?

@trappmartin could you please explain the above proposal in a bit more depth? What are the differences between each of the functions? What are logjoin and logppd short for? Thanks!

@mohamed82008 could you please expand on your proposal? It's not obvious to me why we need a separate model construction to perform maximum likelihood / when we would actually be interested in performing maximum likelihood in our context: if the user has gone to the trouble to specify priors over parameters, why would they wish to ignore them?

@trappmartin
Copy link
Member

trappmartin commented Apr 4, 2019

logjoint = log p(\theta, x), i.e. the log of the joint probability function of the model.
logppd= \sum_i^N log 1/T \sum_t^T p(x_i | \theta_t), i.e. the log pointwise predictive density function.

@trappmartin
Copy link
Member

I don't think we want maximum likelihood. But I agree it would be good to provide an interface to compute the log likelihood.

@willtebbutt
Copy link
Member

Sorry, I'm not familiar with the logppd, could you explain why it's useful?

logjoin = log p(\theta, x), i.e. the log of the joint probability function of the model.

If we're going to rename logp to something, could we please write logjoint? For the cost of an extra character it's quite a bit less ambiguous haha

@trappmartin
Copy link
Member

trappmartin commented Apr 4, 2019

Uh, sorry that was a typo. I meant logjoint. :D
The log pointwise predictive density is useful for model criticism and model selection.

@willtebbutt
Copy link
Member

Uh, sorry that was a typo. I meant logjoint. :D

haha fair enough. Glad we're on the same page with this.

The log pointwise predictive density is useful for model criticism and model selection.

Is there a good reference for this, or is this something that should be obvious?

@trappmartin
Copy link
Member

trappmartin commented Apr 4, 2019

See the paper by A. Gelman, J. Hwang, and A. Vehtari on Understanding predictive information criteria for Bayesian models for example.

@mohamed82008
Copy link
Member

@mohamed82008 could you please expand on your proposal? It's not obvious to me why we need a separate model construction to perform maximum likelihood / when we would actually be interested in performing maximum likelihood in our context: if the user has gone to the trouble to specify priors over parameters, why would they wish to ignore them?

I just thought it would be cool to compare ML, MAP, VI and MCMC without changing the model definition! But @trappmartin has a lot more interesting use cases :)

@yebai
Copy link
Member

yebai commented Apr 4, 2019

What is the purpose of the Selector in the logp function? Do we need that?

@trappmartin

you're right - Selector here is not necessary. This might be taken from some interface discussed in https://github.com/TuringLang/Turing.jl/issues/634#issuecomment-44965726, i.e.

# s::Sampler ==> s::Selector
logp = model(Turing.VarInfo(), s::Sampler) 

Since logp will always compute the log joint probability in the new design, we don't really need s::Selector to select subsets of random variables.

I suggest during the model compilation we generate 2 functions inside Model:
One for the log likelihood which replaces all the a ~ dist lines with a ~ Fixed() where the assume of Fixed returns the variable value in vi and 0 for the logpdf. This is useful for maximum likelihood estimation.
The second is the logpdf of the joint probability distribution that we have now. This is to be used for MAP and inference.
I just thought it would be cool to compare ML, MAP, VI and MCMC without changing the model definition! But @trappmartin has a lot more interesting use cases :)

@mohamed82008 I'm not sure I understand the point here. For MAP and VI based learning, all we need is logp and its gradient. We don't need to modify the compiler.

logjoint = log p(\theta, x), i.e. the log of the joint probability function of the model.
logppd= \sum_i^N log 1/T \sum_t^T p(x_i | \theta_t), i.e. the log pointwise predictive density function.
See the paper by A. Gelman, J. Hwang, and A. Vehtari on Understanding predictive information criteria for Bayesian models for example.

@trappmartin

It seems logjoint is the same as logp?

I agree it's helpful to have some support for model evaluation and criticism, but I prefer to keep the standard API between compiler and inference minimal. These additional features can be supported in a separate module (e.g. MCMCChains). In fact, it appears that logppd can be implemented in the form of a new Runner or even InferenceAlgorithm.

@trappmartin
Copy link
Member

It seems logjoint is the same as logo?

Yes, seems more telling to me than logp

I agree it's helpful to have some support for model evaluation and critism, but I prefer to keep the standard API between compiler and inference minimal. These additional features in a seprate module. In fact, it apears that logppd can be implemented in the form of a new Runner or even InferenceAlgorithm.

Good point. I'll try to do so as it seems relevant for several people.

@phipsgabler
Copy link
Member

Is there anything here that has not already implemented in the meantime, or couldn't be moved to the discussions at AbstractPPL?

@yebai
Copy link
Member

yebai commented Feb 3, 2021

Nothing that I know of. Please feel free to close, and/or move some useful bits to AbstractPPL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants