Non-parametric Bayesian Model
- GP
- DP
A random process is a collection of random variables, defined on a common probability space, taking values in a common state space W, and indexed (labeled) by a set T that can be discrete or continuous. As we move along the index axis, different random variables (each with its own sample space) appear and produce random outputs.
- A random process is a probability distribution over trajectories of θ (e.g., random walks), such as a Markov chain.
In a DP, each random variable takes a distribution as its value rather than a single number. The process generates a new distribution by first selecting a center distribution from the previous sequence of distributions, and then randomly perturbing this center distribution using G_0 (the amount of perturbation is controlled by α). This iterative procedure of selecting a center distribution and perturbing it makes the model a stochastic process, because the sequence of distributions it generates is random and depends on the previously generated distributions: the Markov property.
- Discrete state: a different distribution at each step.
- Discrete time: a different sigma algebra at each step?
Before using Gaussian Processes, you have two main questions:
- Q1. "Is your dataset reasonably small?" (less than 1,000 is fine)
- Gaussian Processes don't do well with large numbers of points.
- Q2. "Do you have a good idea about how two points' labels are related?"
  - ex) If you know f(1) = A, do you have any good intuition about what that makes you think f(4) is? This is crucial for picking a reasonable covariance function. A GP is useful when you want a measure of uncertainty about your predictions instead of just a point estimate: "What's the probability that f(4) is between 2 and 7?" For a roughly linear relationship, a Gaussian process with a linear kernel would make sense.
A GP on S refers to a collection of random variables f(x) indexed by the members of the set S, with the defining property that any finite subset of them is jointly multivariate normally distributed. A GP is therefore specified by a mean function and a covariance (kernel) function; for any finite set of indices it is parameterized by the corresponding mean vector and covariance matrix.
The kernel lets us obtain customized samples from the random process. If we keep increasing the bandwidth (lengthscale) of the kernel, the sampled functions become almost constant, since distant points become almost perfectly correlated and are effectively weighted evenly.
GP is a distribution over functions.
To make a prediction, take the points (x, f(x)) and produce a mean and a confidence interval (or covariance): given the training set, we combine the data with the prior over functions (its mean and covariance) to make the resulting functions smooth. We use a similarity (covariance) matrix so that when two inputs are close, the function approximator also makes their two heights close when we fit new data.
- Case 01> Too noisy observations:
  - Add independent Gaussian noise to all RVs.
    - We no longer have zero-variance data points, and the mean function becomes a bit smoother.
  - We can also change the parameters (kernel bandwidth, GP variance, noise variance) a bit and find the optimal values for them in this particular case.
    - Parameter optimization using Maximum Likelihood Estimation (see the log marginal likelihood after this list).
- Case 02> Inducing Inputs
  - T.B.D.
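As a reference for the MLE step above, the kernel and noise parameters of a GP regression model are typically chosen by maximizing the log marginal likelihood of the noisy observations y at inputs X. This is the standard textbook expression (θ denotes the kernel hyperparameters and σ_n² the noise variance), not a formula taken from these notes:

```latex
\log p(\mathbf{y} \mid X, \theta, \sigma_n^2)
  = -\tfrac{1}{2}\,\mathbf{y}^{\top}\bigl(K_\theta + \sigma_n^2 I\bigr)^{-1}\mathbf{y}
    \;-\; \tfrac{1}{2}\,\log\bigl|K_\theta + \sigma_n^2 I\bigr|
    \;-\; \tfrac{n}{2}\,\log 2\pi
```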
from __future__ import division  # from the "__future__" module
import numpy as np
import matplotlib.pyplot as plt

# Gaussian Process squared-exponential kernel
def kernel(a, b):
    sq_dist = np.sum(a**2, 1).reshape(-1, 1) + np.sum(b**2, 1) - 2*np.dot(a, b.T)
    return np.exp(-0.5 * sq_dist)

n = 50
X_test = np.linspace(-5, 5, n).reshape(-1, 1)   # test inputs
K_ = kernel(X_test, X_test)                     # prior covariance over the test inputs

# Draw samples from the prior
L = np.linalg.cholesky(K_ + 1e-6*np.eye(n))     # add jitter for numerical stability
f_prior = np.dot(L, np.random.normal(size=(n, 10)))  # draw 10 samples (functions)
plt.plot(X_test, f_prior)
plt.show()
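Building on the prior-sampling code above, here is a minimal sketch of GP regression with noisy observations (Case 01). The training data `X_train`, `y_train` and the noise variance `noise_var` are illustrative assumptions, not values from the notes; the sketch reuses `kernel`, `X_test`, and `K_` defined above, and computes the posterior mean, the posterior variance, and the log marginal likelihood that the MLE step would optimize.

```python
# Hypothetical noisy training data (sketch only)
X_train = np.array([-4.0, -3.0, -1.0, 0.0, 2.0]).reshape(-1, 1)
y_train = np.sin(X_train) + 0.1 * np.random.normal(size=X_train.shape)
noise_var = 0.01                       # assumed observation-noise variance

K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
K_s = kernel(X_train, X_test)          # cross-covariance: train vs test
L_train = np.linalg.cholesky(K)

# Posterior mean: K_s^T K^{-1} y, computed via Cholesky solves
alpha = np.linalg.solve(L_train.T, np.linalg.solve(L_train, y_train))
mu_post = np.dot(K_s.T, alpha)

# Posterior variance: diag(K_test - K_s^T K^{-1} K_s)
v = np.linalg.solve(L_train, K_s)
var_post = np.diag(K_) - np.sum(v**2, axis=0)

# Log marginal likelihood (the quantity maximized when tuning kernel/noise parameters)
log_ml = (-0.5 * np.dot(y_train.T, alpha)
          - np.sum(np.log(np.diag(L_train)))
          - 0.5 * len(X_train) * np.log(2 * np.pi))

plt.plot(X_test, mu_post, 'b-', label='posterior mean')
plt.fill_between(X_test.ravel(),
                 mu_post.ravel() - 2 * np.sqrt(var_post),
                 mu_post.ravel() + 2 * np.sqrt(var_post),
                 alpha=0.2, label='±2 std')
plt.plot(X_train, y_train, 'k+', markersize=12, label='noisy observations')
plt.legend()
plt.show()
```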
- α, β are shape parameters. A Beta(α, β) prior takes a Bin(n, θ) likelihood.
- The Dirichlet distribution is a generalized Beta distribution. α_1, α_2, α_3, ... are shape parameters. A Dirichlet(α_1, α_2, α_3, ...) prior takes a Multinom(n; θ_1, θ_2, θ_3, ...) likelihood.
- Its parameter α will be a shape vector.
  - If α_1, α_2, α_3 are all the same, the outcomes (θ_i) are centered on the uniform proportions.
  - If α_1, α_2, α_3 are large (> 1), the sampled θ_i sit in the interior of the simplex, close to their means: low sampling variance.
  - If α_1, α_2, α_3 are small (< 1), the sampled θ_i concentrate near the corners of the simplex (one component dominates): high sampling variance.
  - Thus α controls the mixture of outcomes and the sampling variance (see the sampling sketch after this list).
    - Turn it down, and draws from the resulting categorical will likely repeat the same few outcomes.
    - Turn it up, and draws will likely spread across different outcomes.
- Its outcome θ will be a k-dimensional vector such as c(θ_1, θ_2, θ_3) where θ_1 + θ_2 + θ_3 = 1.
  - Each θ_i has its own α_i: a weight (shape) for the marginal distribution of θ_i.
  - Each θ_i has its own marginal distribution (a Beta distribution).
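A minimal numpy sketch of the effect described above (the specific α values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw θ = (θ_1, θ_2, θ_3) from symmetric Dirichlet priors with different α
for alpha in [0.1, 1.0, 10.0]:
    theta = rng.dirichlet([alpha] * 3, size=5)
    print(f"alpha = {alpha}")
    print(np.round(theta, 2))
    # small alpha: rows are sparse (one θ_i dominates)
    # large alpha: rows stay close to the uniform (1/3, 1/3, 1/3)
```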
Gaussian Mixture example
- How do we get control over the multiple memberships dynamically?
  - Note that the plate notation refers to random variables (the multiple memberships).
- 1. Collection of the cluster parameters μ and Σ
- 2. Collection of the cluster proportions π
- Gaussian Mixture vs Dirichlet Mixture:
  - Likelihood (Gaussian mixture)
  - Prior (Gaussian mixture)
  - Likelihood (Dirichlet mixture)
  - Prior (Dirichlet mixture)
- The Multinomial + Dirichlet conjugate relation tells us that our parameter value (posterior) can be updated by the introduction of new data (likelihood). We can get all the latent variables with the help of sampling θ from the Dirichlet prior. So it seems we can easily get the posterior, and thus our θ (the "mixing coefficients", i.e., the observation proportions for every Gaussian) at the end, as sketched below. Done and dusted! We now have the model describing our data. Wait! However, is their occurrence accurate? How do we address the hyperparameter α that affects the sampling result?
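A minimal sketch of that conjugate update (the counts and α values are illustrative assumptions): with a Dirichlet(α_1, ..., α_k) prior on the mixing coefficients and observed cluster-membership counts n_1, ..., n_k, the posterior is simply Dirichlet(α_1 + n_1, ..., α_k + n_k).

```python
import numpy as np

rng = np.random.default_rng(1)

alpha_prior = np.array([1.0, 1.0, 1.0])   # assumed symmetric Dirichlet prior
counts = np.array([12, 3, 0])             # hypothetical cluster-membership counts

alpha_post = alpha_prior + counts         # conjugate update: Dirichlet posterior
theta_samples = rng.dirichlet(alpha_post, size=1000)

print("posterior mean of mixing coefficients:", alpha_post / alpha_post.sum())
print("Monte Carlo check:", theta_samples.mean(axis=0).round(3))
```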
[Idea]: **infinitely many latent-variable parameter values** can be controlled by a Random Process that can address **α**
Note: Random Variable & Random Process
- An RV is different from a variable in algebra: an RV has a whole set of values and can take any of those randomly, whereas a variable used in algebra cannot have more than a single value at a time.
  - ex) random variable_X = {0,1,2,3}, variable_K = 1.
- A random (stochastic) process is an event or experiment that has a random outcome, so you can't predict it accurately. In a deterministic process, if we know the initial condition (starting point) of a series of events, we can predict the next step in the series. In a stochastic process, even though we know the initial condition, we cannot determine with full confidence what the next steps will be, because there are so many (possibly infinitely many) ways the process might evolve. How do smoke particles collide with each other? Their unpredictable movements and collisions are random and are referred to as Brownian motion. An interest rate is a variable that changes its value over time, and it is not straightforward to forecast its movements.
  - ex) Gaussian process, Dirichlet process, Poisson process, Brownian motion, Markov decision process, etc. A Markov chain is also a random process (producing random output) in which the effect of the past on the future is summarized only by the current state.
A collection of cluster information (parameters) follows a Dirichlet distribution with a k-dimensional parameter vector. A DP can sample all possible, highly likely scenarios of mixture setup (mixtures of clusters) that describe your complex data. First, assume you have data that follows some unknown mixture distribution, so we want to estimate the mixing coefficients (proportions) and the other distribution-specific parameters for each mixture component (cluster). For the time being, forget about labelling the memberships, because if K goes to infinity, our "Z case" (the scenario of assignments) becomes a random variable. Now it's time for a random process! OK... what on earth is a DP? Both the Dirichlet distribution and the Dirichlet process are distributions over such scenarios.
- Remarks: The α controls the total number of clusters, while G_0 determines the distribution of the data points within each cluster.
- Remarks: G_0 is the joint distribution of the parameters for a potential new cluster that an observation could enter. The new cluster is initially empty, so there is no data that could be used to determine a posterior estimate of its parameters. Hence, we instead draw parameters from the prior distribution G_0 to obtain estimates, which then get used in the calculation of the probability of entering that cluster. If a new cluster is indeed selected, G_0 is set aside and a new δ_Φ is created for the brand new cluster. This new δ_Φ is then used as the distribution for probability calculations, parameter updates, etc. for future observations that may want to enter that cluster, and G_0 would be used again for another new cluster if necessary. G_0 is like a real estate agent: it keeps suggesting new cluster locations to the data point, and the data point decides whether to stay or move out based on the probability built from the proportion and likelihood of its current cluster vs the new cluster. So the importance weight (proportion) and cluster parameter (location) information are the key criteria from which your data is understood.
- Remarks: Then... the initial importance weight for G_0 is determined by α? α can be fixed or itself sampled, and it acts like the reliability of G_0's suggestion. A higher α gives a higher proportion to the new cluster, so data points choose the new cluster more often. α also controls the concentration of G around the base measure: a smaller α concentrates mass on fewer atoms, thus producing fewer clusters, while a larger α makes G resemble G_0 more closely. In this sense α is analogous to an inverse sampling variance: Var[G(A)] = G_0(A)(1 - G_0(A)) / (α + 1), so a larger α means less variability around G_0. Hence, a larger α allows more clusters and gives a higher proportion to G_0's suggestion.
- Remarks: G_0 is a joint prior distribution. Φ is a random variable drawn from G_0. G_0(A) is an area under the density (a probability), but it is not an importance weight (cluster proportion); the area itself is just a representation. G_0(A) vs G(A) :: prior parameter samples vs the posterior cluster density that is equipped with the posterior parameters and the new likelihood data entering. G(A) is a component of G, and G, as a single output of the Dirichlet process, can be represented in two ways (see the block below): 1) a collection of proportions times cluster components (the atoms δ_Φ), and 2) a solid, single, stand-alone predictive mixture model using the weighted average.
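For reference, the textbook way to write these two facts (the stick-breaking form of a DP draw and the first two moments of G; standard results, not formulas taken from these notes):

```latex
G = \sum_{k=1}^{\infty} \omega_k \, \delta_{\Phi_k},
\qquad \Phi_k \sim G_0, \qquad (\omega_k)_{k \ge 1} \sim \mathrm{GEM}(\alpha),
```
```latex
\mathbb{E}[G(A)] = G_0(A), \qquad
\mathrm{Var}[G(A)] = \frac{G_0(A)\,\bigl(1 - G_0(A)\bigr)}{\alpha + 1}.
```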
The following three steps make one iteration (investigating every corner of parameter space)
- [Step 1] Clustering (accept/reject G_0's suggestion)
  - The unique cluster proportions, i.e. the probability of each cluster (ω_1, ω_2, ω_3, ..., ω_k), are defined as we go; then the possibility of a new cluster with probability ω_{k+1} is considered when we assign a new observation to a brand new cluster (a parameter-free joint of Y and X), and the data point decides to move in or not based on a comparison of the cluster probabilities (a generic form of this comparison is sketched after the three steps).
  - In this stage, the cluster memberships (k) are fully defined.
- [Step 2] Outcome Model and Covariate Model Refinement
  - This prepares the parameter update: developing the unique cluster probability terms required to determine the later selection criteria.
  - Given the data points in each cluster, some extra strategies (such as imputation for missing data) can be implemented to refine them.
    - For the Outcome Model, we impute values for N/A entries using the imputation model (the outcome density times the covariate densities that have N/A); then, to obtain the final outcome model, we take the imputation model and integrate out the covariates with missing data. So at first glance it looks like the outcome model ignores the missing data, but it doesn't: it integrates it out.
    - For the Covariate Model, we simply ignore any covariates that have N/A.
- [Step 3] Parameter Re-Estimation (posterior development based on the likelihood)
  - Note that in the parameter estimation stage, you need complete data!
  - The parameters of each cluster (Xβ, σ^2) are re-calculated (re-sampled) based on the data points they contain and the selection criteria built upon the two cluster probabilities.
  - Once all missing data in all covariates has been imputed, the parameters of each cluster (Xβ, σ^2) are re-calculated. After these parameters have been updated, the clustering process is performed; then, in the parameter re-estimation stage of the next iteration, the previously imputed data is discarded and the sampling for the imputation starts over. This means missing data do not impact the clustering process whatsoever within an iteration. Also note that when calculating the predictive distribution, we integrate out the covariates that are missing, thus removing the two missing covariates from the x term being conditioned on: p(y|x, θ) => p(y|x1, x4, θ).
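As referenced in Step 1, the generic DP-mixture form of the cluster-probability comparison looks like the following. This is the textbook form with an outcome density f and a covariate density g; the exact terms used in these notes may differ:

```latex
P(z_i = k \mid \text{rest}) \;\propto\; \omega_k \, f(y_i \mid \theta_k)\, g(x_i \mid \phi_k)
\quad \text{(existing cluster } k\text{)},
\qquad
P(z_i = k+1 \mid \text{rest}) \;\propto\; \omega_{k+1} \int f(y_i \mid \theta)\, g(x_i \mid \phi)\, dG_0(\theta, \phi)
\quad \text{(new cluster)}.
```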
The goal is to find the distribution of the clean covariate!
t.b.d....
Escobar and West developed the posterior distribution for the DP concentration parameter α as follows:
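The equation itself did not survive in these notes; for reference, the commonly cited form of the Escobar and West (1995) update, assuming a Gamma(a, b) prior on α, with k occupied clusters among n observations and an auxiliary variable η, is:

```latex
\eta \mid \alpha, k \sim \mathrm{Beta}(\alpha + 1,\; n),
\qquad
\alpha \mid \eta, k \sim \pi_\eta \,\mathrm{Gamma}\!\left(a + k,\; b - \log \eta\right)
 + (1 - \pi_\eta)\,\mathrm{Gamma}\!\left(a + k - 1,\; b - \log \eta\right),
\quad \text{where} \quad
\frac{\pi_\eta}{1 - \pi_\eta} = \frac{a + k - 1}{n\,(b - \log \eta)}.
```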
We can construct the DP prior with concentration parameter α (the "suggestion reliability") using a non-parametric prior construction scheme:
- Sol 1) Stick-Breaking scheme (creating the "G" distribution)
  - A small α results in a few big sticks, while a big α results in many small sticks.
- Sol 2) Chinese-Restaurant scheme (assigning membership to a new point)
  - A customer is more likely to sit at a table if there are already many people sitting there. However, with probability proportional to α, the customer will ask for a new table.
- Creating a decent distribution: G(A_i), "each single stick", an element of G.
  - How do we obtain candidate probability values when the pizza is sliced infinitely many times?
  - Use the "adjusted Beta values": GEM(hyperparameter α), which gives adjusted probability values (see the sketch below).
  - Based on the properties of Beta(1, α):
    - A small hyperparameter results in a few big sticks.
    - A big hyperparameter results in many small sticks.
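A minimal stick-breaking (GEM) sketch in numpy, truncated at a finite number of sticks for illustration (the truncation level and the α values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def stick_breaking(alpha, n_sticks=20):
    """Return truncated stick-breaking weights ω_1..ω_n for a DP with concentration alpha."""
    betas = rng.beta(1.0, alpha, size=n_sticks)      # stick fractions V_k ~ Beta(1, α)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining                          # ω_k = V_k * Π_{j<k} (1 - V_j)

for alpha in [0.5, 10.0]:
    w = stick_breaking(alpha)
    print(f"alpha = {alpha}: largest stick = {w.max():.2f}, "
          f"mass covered by 20 sticks = {w.sum():.2f}")
```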
- Assigning a membership to a new point (a simulation sketch follows this list)
  - Here, the number of tables (and hence table parameters θ) grows with N, which does not occur in a finite model. This is the essence of the "non-parametric" aspect.
  - Rich get richer: popular tables attract more customers.
  - There is no fixed number of labels for a fixed number of data instances.
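A minimal Chinese Restaurant Process simulation sketch (the α values and the number of customers are assumptions), showing how the number of tables grows with N and with α:

```python
import numpy as np

rng = np.random.default_rng(3)

def chinese_restaurant(n_customers, alpha):
    """Simulate CRP seating; return the list of per-table customer counts."""
    tables = []                                   # tables[k] = number of customers at table k
    for n in range(n_customers):
        probs = np.array(tables + [alpha], dtype=float)
        probs /= n + alpha                        # existing table: n_k / (n + α); new table: α / (n + α)
        choice = rng.choice(len(probs), p=probs)
        if choice == len(tables):
            tables.append(1)                      # open a new table
        else:
            tables[choice] += 1                   # rich get richer
    return tables

for alpha in [0.5, 5.0]:
    print(f"alpha = {alpha}: {len(chinese_restaurant(500, alpha))} tables for 500 customers")
```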
The main goal of clustering is to find the posterior distribution P(z|x) of the cluster assignments! Computing this is intractable due to the sum in the denominator and the growing number of partitions, which is why we use Gibbs sampling. Given the previous partition, we remove one data point x from the partition (prior) and then re-add it to the partition (likelihood) to obtain the posterior P(z|x). This gives a new partition (prior)!
- de Finetti's theorem + Gibbs Sampling
G from a DP is discrete with probability 1, so a DP would not be a suitable prior for situations with continuous data, because in that case we want a continuous G. Let's think about mixture models. Mixture models are widely used for density estimation and classification problems. A natural idea is to create a prior for continuous densities via a mixture where the mixing distribution G is given a Dirichlet process prior. As a natural way to increase the applicability of DP-based modeling, we can use a DP as the prior for the mixing distribution in a mixture model with a parametric kernel distribution.
The posterior under a DPMM is effectively finite-dimensional, though the dimension is adaptive, determined by data, as opposed to fixed like in the parametric models. This adaptive dimensionality is what gives the model its flexibility and its effective finite-dimensionality is what makes posterior computation possible.
A DP gives a discrete distribution (a step-function CDF), while a DP mixture gives a continuous density.
- One hurdle we encounter here is sampling from G, which has countably many atoms (sticks). There is also an exact approach that generates atoms "on the fly" as needed, exploiting the fact that only finitely many atoms are needed in practice.
- For the sampling algorithm, it is convenient to include a table assignment variable z_i to indicate which table our data point belongs to: z_i ~ Cat(ω), where Cat(·) refers to a categorical (multinoulli) distribution.
- We have the joint, so we perform Gibbs sampling over the state space of {ω, Φ} and {z}. At each iteration, we choose one of these variables and re-sample it from its conditional distribution given all the other variables.
Our Dirichlet process provides a discrete distribution over objects, and we take i.i.d. samples θ_1, ..., θ_n from this distribution. Analogous to the beta-binomial and Dirichlet-multinomial conjugacy, we suspect the posterior of the DP, after observing the samples, is also a DP. We make this precise (the posterior formula is given below):
- Suppose we have a partition (A_1, ..., A_K) of the space.
- The indicator vector for the index k such that θ ∈ A_k records which cell a draw θ ~ G falls into, and this event (conditioned on G) has probability G(A_k).
- Thus, this vector (conditioned on G) is a categorical/multinoulli random variable with parameters (G(A_1), ..., G(A_K)).
- And these random parameters (G(A_1), ..., G(A_K)) follow a Dirichlet distribution! This is the primary reason for the name "Dirichlet Process".
- Since the Dirichlet distribution is conjugate to the multinomial distribution, we can obtain the posterior!
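For reference, the standard statement of this conjugacy (the textbook DP posterior, not a formula reproduced from these notes): after observing θ_1, ..., θ_n drawn from G with G ~ DP(α, G_0),

```latex
G \mid \theta_1, \dots, \theta_n \;\sim\;
\mathrm{DP}\!\left(\alpha + n,\;
\frac{\alpha\, G_0 + \sum_{i=1}^{n} \delta_{\theta_i}}{\alpha + n}\right).
```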
- Step I. Initial labeling (assign a table) to every point.
- Step II. Gibbs sampling iterations (a compact sketch follows this list):
  - For each data instance:
    - Remove the instance from its current label (table).
    - Calculate the prior for each label.
    - Calculate the likelihood.
    - Calculate the posterior.
    - Sample the label from the posterior.
    - Update the component parameters.
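The loop above can be made concrete with a collapsed Gibbs sampler for a DP mixture of 1-D Gaussians. Everything here is an illustrative assumption (synthetic data, known within-cluster variance, a conjugate Normal prior on the cluster means, α = 1); it is a sketch of the technique, not the exact model used in these notes.

```python
import numpy as np

rng = np.random.default_rng(4)

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Hypothetical 1-D data from two well-separated groups (sketch only)
data = np.concatenate([rng.normal(-4.0, 1.0, 30), rng.normal(4.0, 1.0, 30)])

alpha = 1.0               # assumed DP concentration
sigma2 = 1.0              # assumed known within-cluster variance
mu0, tau0_2 = 0.0, 10.0   # assumed conjugate Normal prior on cluster means

def predictive(x, members):
    """Posterior-predictive density of x given the points already in a cluster."""
    n = len(members)
    tau_n2 = 1.0 / (1.0 / tau0_2 + n / sigma2)
    mu_n = tau_n2 * (mu0 / tau0_2 + np.sum(members) / sigma2)
    return normal_pdf(x, mu_n, sigma2 + tau_n2)

# Step I: initial labeling -- everyone starts at the same table
z = np.zeros(len(data), dtype=int)

# Step II: collapsed Gibbs sweeps over the data instances
for sweep in range(20):
    for i in range(len(data)):
        z[i] = -1                                  # remove the instance from its label
        labels = [k for k in np.unique(z) if k != -1]
        weights = []
        for k in labels:
            members = data[z == k]
            # prior (CRP count) * likelihood (posterior predictive of that cluster)
            weights.append(len(members) * predictive(data[i], members))
        # brand-new table: alpha * prior predictive (empty cluster)
        weights.append(alpha * predictive(data[i], np.array([])))
        weights = np.array(weights) / np.sum(weights)
        choice = rng.choice(len(weights), p=weights)
        z[i] = labels[choice] if choice < len(labels) else max(labels, default=-1) + 1

print("number of clusters found:", len(np.unique(z)))
```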