---
title: "LAtent Mixture of Bayesian Clustering (Lamb) - User Manual"
author: "Noirrit Kiran Chandra"
date: "8/19/2021"
output:
  github_document:
    pandoc_args: --webtex
---
## For `Microsoft Windows` Users
The `RcppGSL` package is required to compile the code. `MS Windows` users may run into issues setting up this package; please refer to [this Stack Overflow link](https://stackoverflow.com/questions/55976547/linking-gsl-libraries-to-rcppgsl-in-windows-10) for solutions.
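As a quick diagnostic (a sketch, not part of the original pipeline), you can check that `RcppGSL` loads before attempting to compile the `C++` source:

```r
## Check whether RcppGSL is available before compiling the C++ source.
## This is only a diagnostic; it does not install GSL itself.
if (requireNamespace("RcppGSL", quietly = TRUE)) {
  message("RcppGSL is available; compilation should work.")
} else {
  message("RcppGSL not found; see the Stack Overflow link above for setup help.")
}
```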
## Compiling the Source `C++` File and Loading Requisite Libraries
```{r Compile the source C++ file, echo=TRUE}
library("Rcpp")
library("pracma")
library("tidyverse")
library("ggpubr")
library("uwot")
library("irlba")
library("mclust")
library("mcclust")
sourceCpp("DL_linear_split_merge_package.cpp") ## This is the source C++ file
source("simulate_data_fxns.R") ## Load an R file with supplementary functions
```
## Simulate Data from the Lamb Model
#### Simulate Data
Set the

1. observed dimension (`p`),
2. sample size (`n`),
3. dimension (`d`) of the unobserved latent variables `eta`,
4. number of true clusters (`k`), which is also unobserved.
```{r Simulation}
set.seed(1)
p = 2500 # Observed dimension
n = 2000 # Sample size
d = 20   # Latent dimension
k = 15   # Number of true clusters
```
Set the cluster membership probabilities so that roughly 2/3 of the clusters have equal probabilities, while the remaining 1/3 together share the probability of one such big cluster, split equally among them.
```{r Set cluster probabilities}
pis <- c(rep(1/(round(k *2/3)+1), (round(k *2/3))),
rep((1/(round(k *2/3)+1))/(k - (round(k *2/3))), k - (round(k*2/3))))
pis <- pis/sum(pis) ##cluster weights
```
Generate the observed `p`-dimensional data `y`
```{r Generate data}
lambda = simulate_lambda(d,p,1) # Generate the loading matrix lambda
sc=quantile(diag(tcrossprod(lambda) ) ,.75); lambda=lambda/sqrt(sc) # Scaling lambda
s = sample.int(n=k,size=n, prob=pis, replace = TRUE) ##Cluster membership indicators
eta = matrix(rnorm(n*d), nrow=n, ncol=d) # Latent factors
shifts=2*seq(-k,k,length.out = k)
for(i in 1:k){
inds = which(s==i)
vars = sqrt(rchisq(d,1))
m = pracma::randortho(d) %*% diag(vars)
eta[inds,]= eta[inds,] %*% t(m)+shifts[i] # The i-th cluster is centered around rep(shifts[i],d) in the latent eta space
}
y <- tcrossprod(eta,lambda) + matrix(rnorm(p*n, sd=sqrt(.5)), nrow=n, ncol=p) # Observed data
```
#### UMAP Plot of the Simulated Data
```{r UMAP plot, fig.height=4.5, fig.width=5, dpi=300}
umap_data=uwot::umap(y, min_dist=.8, n_threads = parallel::detectCores()/2) #2-dimensional UMAP scores of the original data
dat.true=data.frame(umap_data, Cluster=as.factor(s)); colnames(dat.true)= c("UMAP1","UMAP2", "Cluster")
p.true<-ggplot(data = dat.true, mapping = aes(x = UMAP1, y = UMAP2 ,colour=Cluster)) + geom_point()
p.true
```
------------------------------------------------------------------------
> **To analyze any other dataset, replace the simulated `y` with your data matrix and follow the subsequent steps.**
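As a minimal sketch, a user-supplied dataset could be loaded into `y` as follows. The file name `my_data.csv` is a hypothetical placeholder; `y` must be an `n x p` numeric matrix with samples in rows and variables in columns:

```r
## Hypothetical example: load your own data into `y`
## ("my_data.csv" is a placeholder file name)
# y <- as.matrix(read.csv("my_data.csv", header = TRUE))

## Illustration with a temporary file so the snippet is self-contained:
tmp <- tempfile(fileext = ".csv")
write.csv(matrix(rnorm(6 * 4), nrow = 6), tmp, row.names = FALSE)
y_user <- as.matrix(read.csv(tmp, header = TRUE))
dim(y_user) # 6 rows (samples) x 4 columns (variables)
```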
## Pipeline of Fitting the Lamb Model
#### Pre-processing Data
We center *all* variables by the *median of the column means* and scale them by the *median of the column standard deviations* of the observed data.
```{r Pre-processing}
y_original = y
centering.var = median(colMeans(y_original))
scaling.var = median(apply(y_original, 2, sd))
y = (y_original - centering.var)/scaling.var
```
#### Empirically Estimating the Latent Dimension `d` and Initializing `eta` & `lambda`
Perform a fast partial SVD of the pre-processed data using `irlba`.
```{r Sparse PCA}
pca.results=irlba::irlba(y, nv=40)
```
`d` is set to the minimum number of leading singular values explaining at least 95% of the variability in the data.
```{r Estimation of d}
cum_eigs= cumsum(pca.results$d)/sum(pca.results$d)
d=min(which(cum_eigs>.95))
d=ifelse(d<=15,15,d) # Set d= at least 15
```
The left and right singular vectors of `y` are used to initialize `eta` and `lambda`, respectively, for the MCMC.
```{r Singular value initialization}
eta = pca.results$u %*% diag(pca.results$d)
eta = eta[, 1:d]              # Left singular vectors of y initialize eta
lambda = pca.results$v[, 1:d] # Right singular vectors of y initialize lambda
```
#### Initializing Cluster Allocations Using `kmeans`
```{r KMeans Initialization}
cluster.start = kmeans(y, 80)$cluster - 1 # Initial allocation into 80 clusters; subtract 1 so labels are 0-indexed for the C++ sampler
```
### Set Prior Parameters of `Lamb`
##### Parameter Descriptions in the Notation of Chandra et al. (2021+):
| Parameters | Description |
|---|---|
| `as`=![formula](https://render.githubusercontent.com/render/math?math=a_%5Csigma); `bs`=![formula](https://render.githubusercontent.com/render/math?math=b_%5Csigma) | Gamma hyperparameters for the residual precisions ![formula](https://render.githubusercontent.com/render/math?math=%5Csigma_j%5E%7B-2%7D) for j=1,...,p |
| `a`=![formula](https://render.githubusercontent.com/render/math?math=a) | Dirichlet-Laplace parameter on `lambda` |
| `diag_psi_iw`=![formula](https://render.githubusercontent.com/render/math?math=%5Cxi%5E2); `niw_kap`=![formula](https://render.githubusercontent.com/render/math?math=%5Ckappa_0); `nu`=![formula](https://render.githubusercontent.com/render/math?math=%5Cnu_0) | Hyperparameters for the Normal-Inverse-Wishart base measure ![formula](https://render.githubusercontent.com/render/math?math=G_0) of the Dirichlet process prior on `eta` |
| `a_dir`=![formula](https://render.githubusercontent.com/render/math?math=a_%5Calpha); `b_dir`=![formula](https://render.githubusercontent.com/render/math?math=b_%5Calpha) | Gamma hyperparameters for the conjugate Gamma(![formula](https://render.githubusercontent.com/render/math?math=a_%5Calpha,b_%5Calpha)) prior on the Dirichlet-process concentration parameter 𝛼 |
##### Set Prior Hyperparameters
```{r Set prior hyperparameters}
as = 1; bs = 0.3
a=0.5
diag_psi_iw=20
niw_kap=1e-3
nu=d+50
a_dir=.1
b_dir=.1
```
#### Fit the `Lamb` Model
```{r Fit Lamb}
result.lamb <- DL_mixture(a_dir, b_dir, diag_psi_iw=diag_psi_iw, niw_kap=niw_kap, niw_nu=nu,
as=as, bs=bs, a=a,
nrun=5e3, burn=1e3, thin=5,
nstep = 5, prob = 0.5, # With probability `prob`, either the split-merge sampler with `nstep` Gibbs scans or the Gibbs sampler is performed
lambda, eta, y,
del = cluster.start,
dofactor = 1 # If dofactor is set to 0, clustering is done on the initial input eta values only; eta is not updated in the MCMC
)
```
### Posterior Summary of the MCMC Samples
We compute summaries of the MCMC samples of cluster allocation variables following *Wade and Ghahramani (2018, Bayesian Analysis)*.
```{r Posterior Summary}
burn = 2e2 # Additional burn-in on the thinned MCMC samples (i.e., in units of burn/thin)
post.samples=result.lamb[-(1:burn),]+1
sim.mat.lamb <- mcclust::comp.psm(post.samples) # Compute posterior similarity matrix
clust.lamb <- minbinder(sim.mat.lamb)$cl # Minimizing Binder loss across MCMC estimates
```
### Adjusted Rand Index
```{r adjustedRandIndex}
adjustedRandIndex( clust.lamb, s)
```
### UMAP Plot of Lamb Clustering
```{r, UMAP Plot of Lamb, fig.height=4.5, fig.width=5, dpi=300}
dat.lamb = data.frame(umap_data, Cluster = as.factor(clust.lamb)); colnames(dat.lamb) = c("UMAP1", "UMAP2", "Cluster")
p.lamb <- ggplot(data = dat.lamb, mapping = aes(x = UMAP1, y = UMAP2, colour = Cluster)) + geom_point()
p.lamb
```
### Comparison with the True Clustering
```{r Comparison with True UMAP, fig.height=4.5, fig.width=9, dpi=300}
ggpubr::ggarrange(p.true, p.lamb, nrow = 1, ncol = 2, labels = c("True", "Lamb"))
```
------------------------------------------------------------------------
## Citation
If you use the Lamb method, kindly cite *Chandra et al. (2021+). Escaping the curse of dimensionality in Bayesian model based clustering*, <https://arxiv.org/abs/2006.02700>.