---
title: "Zero-Inflated Poisson - Simple Multilevel model"
author: "Anders Sundelin"
date: "2022-10-25"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rethinking)
```
## Refresher from chapter 13.2 (simulating tadpoles in ponds)
```{r}
a_bar <- 1.5 # overall alpha
sigma <- 1.5 # overall stddev
nponds <- 60 # number of ponds, i.e. clusters
Ni <- as.integer(rep(c(5, 10, 25, 35), each=15)) # integers of initial tadpole counts, one dimension, "long format"
set.seed(5005)
a_pond <- rnorm(nponds, mean=a_bar, sd=sigma) # a value for each pond, comes from the general a_bar population level estimate
dsim <- data.frame(pond=1:nponds, Ni=Ni, true_a=a_pond) # just for organizing the data - each pond is an own row in the data frame
dsim$Si <- rbinom(nponds, prob=logistic(dsim$true_a), size=dsim$Ni) # generate simulated survivor count, for each row in the data, using the parameters from that row
# each individual (dsim$true_a, dsim$Ni) value is used to generate a random survivor count with the appropriate probability of survival and maximum count
dsim$p_nopool <- dsim$Si / dsim$Ni # proportion of survivors in each pond
dat=list(Si=dsim$Si, Ni=dsim$Ni, pond=dsim$pond)
m13.3 <- ulam(
alist(
Si ~ dbinom(Ni, p),
logit(p) <- a_pond[pond],
a_pond[pond] ~ dnorm(a_bar, sigma),
a_bar ~ dnorm(0, 1.5),
sigma ~ dexp(1)
), data=dat, chains=4
)
precis(m13.3, depth=2)
```
```{r}
post <- extract.samples(m13.3)
dsim$p_partpool <- apply(inv_logit(post$a_pond), 2, mean)
dsim$p_true <- inv_logit(dsim$true_a)
nopool_error <- abs(dsim$p_nopool - dsim$p_true)
partpool_error <- abs(dsim$p_partpool - dsim$p_true)
plot(1:60, nopool_error, xlab="pond", ylab = "absolute error", col=rangi2, pch=16) # filled blue points
points(1:60, partpool_error) # varying effects estimates
npool_avg <- aggregate(nopool_error, list(dsim$Ni), mean)
partpool_avg <- aggregate(partpool_error, list(dsim$Ni), mean)
```
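The aggregated errors computed above are never displayed. A small added chunk (just for illustration) puts them side by side; as in the book, the partial-pooling estimates tend to do better, especially for the smaller ponds.
```{r}
# average absolute error per initial tadpole count (Ni), no pooling vs. partial pooling
# aggregate() returns columns Group.1 (the grouping value) and x (the mean error)
data.frame(Ni=npool_avg$Group.1, nopool=npool_avg$x, partpool=partpool_avg$x)
```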
## Multilevel Bayesian Model
```{r init_sim}
N <- 350
set.seed(700716)
a_bar <- 1.5
sigma_a <- 1.5
b_bar <- .33
sigma_b <- 0.1
nfiles <- 4
# probability of a file not being touched (the zero-inflation probability)
# A is likely to be touched, but B is not; C and D also vary a bit...
prob_untouched <- c(0.2, 0.6, 0.1, 0.4)
# simulate with more files, just pick uniformly random numbers as probabilities
#nfiles <- 40
#prob_untouched <- runif(nfiles)
a_file <- rnorm(nfiles, mean=a_bar, sd=sigma_a) # value for each file, drawn from the general mean/sd
b_file <- rnorm(nfiles, mean=b_bar, sd=sigma_b) # ditto for the slope, unique values for each slope
dsim <- data.frame(file=1:nfiles, p_untouched=prob_untouched, true_a=a_file, true_b=b_file)
# on average, we add 10 rows per week
change_size <- rnorm(N, 10, 25)
SZ <- scale(change_size)
# The log of the rate is a linear model - rates are always positive
# Thus, the rate is the exponentiation of the linear model
# Replaced the earlier constant 2.4 with a_file[file], and the .33 with b_file[file]
# Because of the sapply, the same SZ is used for all files.
# This is handled later, where SZ is repeated once per file when fitting the model.
# It should also be possible to use different change sizes for each file, as long as the same data is used when recovering the parameters.
rate_introd <- sapply(1:nfiles, function(i) exp(dsim$true_a[i] + dsim$true_b[i] * SZ))
# generates a 4-column matrix with 350 rows (because SZ had 350 items)
untouched <- sapply(1:nfiles, function(i) rbinom(N, 1, prob_untouched[i]))
fileindex <- rep(1:nfiles, each=N)
# simulate issues introduced
introd <- sapply(1:nfiles, function (i) (1-untouched[,i]) * rpois(N, rate_introd[,i]))
# generates a 4-column matrix with 350 rows (because the i:th column of rate_introd has 350 items)
# turn the matrix into "long form" (putting the columns after each other, in one long row, matching the fileindex vector above)
introd_long <- as.vector(introd)
```
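A quick sanity check (added for illustration) that the long-form vector lines up with `fileindex`: `as.vector()` stacks the matrix column by column, so the entries where `fileindex == i` should be exactly column i of `introd`.
```{r}
# as.vector() stacks the columns of introd, matching rep(1:nfiles, each=N)
all(introd_long[fileindex == 2] == introd[, 2])
length(introd_long) == nfiles * N
```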
```{r}
simplehist(introd[,1], xlab="issues introduced in file 1", lwd=4)
```
```{r}
simplehist(introd[,2], xlab="issues introduced in file 2", lwd=4)
```
### Recovering the model parameters
```{r}
issue_model <- ulam(
alist(
y ~ dzipois(p, lambda),
logit(p) <- ap[file], # assume simplest possible model
log(lambda) <- al[file] + bsize[file] * size, #+ bscout * scout,
ap[file] ~ dnorm(0, 1.5), # could of course also pull this from the model, if we generate it from a common probability above
al[file] ~ dnorm(a_bar, sigma_a),
bsize[file] ~ dnorm(b_bar, sigma_b),
a_bar ~ dnorm(3, .5),
sigma_a ~ dexp(1),
b_bar ~ dnorm(0, 0.4),
sigma_b ~ dexp(1)
), data=list(y=introd_long, file=fileindex, size=rep(SZ, nfiles)), chains=4)
```
```{r}
precis(issue_model, depth=2)
```
It is extremely important to keep track of how the posterior matrices are sliced. Forgetting a comma here or there will silently give you the wrong quantities and ruin the comparison.
```{r}
post <- extract.samples(issue_model)
max(sapply(1:nfiles, function(i) abs(mean(inv_logit(post$ap[,i])) - dsim$p_untouched[i])))
```
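To make the slicing point concrete, here is a minimal illustration (not part of the analysis): `post$ap` is a samples-by-files matrix, so the comma decides whether you get all samples for one file or a single element.
```{r}
# post$ap has one row per posterior sample and one column per file
dim(post$ap)
mean(inv_logit(post$ap[, 2]))  # correct: posterior mean probability for file 2
inv_logit(post$ap[2])          # wrong: just the 2nd sample for file 1 (column-major indexing)
```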
Derived probabilities of a file not being touched (the zero-inflation probability), to be compared with the true values 0.2, 0.6, 0.1 and 0.4:
```{r}
sapply(1:nfiles, function(i) mean(inv_logit(post$ap[,i])))
```
Derived per-file intercepts, to be compared with the true values 1.71, 1.43, 3.28 and 0.75:
```{r}
sapply(1:nfiles, function(i) mean(post$al[,i]))
```
```{r}
max(sapply(1:nfiles, function(i) abs(mean(post$al[,i]) - dsim$true_a[i])))
```
Derived per-file slopes, to be compared with the true values 0.42, 0.09, 0.27 and 0.53:
```{r}
sapply(1:nfiles, function(i) mean(post$bsize[,i]))
```
```{r}
max(sapply(1:nfiles, function(i) abs(mean(post$bsize[,i]) - dsim$true_b[i])))
```
The model seems to recover both the probabilities and the per-file intercepts quite well.
However, it overestimates a_bar (which we set to 1.5), while doing a decent job at finding sigma_a. Adding more files would presumably help, and it does: with 40 files, a_bar becomes 1.69 and sigma_a stays at 1.40. Presumably the same goes for b_bar.
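The 40-file experiment only requires activating the two commented-out lines in the `init_sim` chunk and re-running the simulation and model chunks; sketched here (not evaluated):
```{r, eval=FALSE}
# in init_sim: use more files, with uniformly random untouched-probabilities,
# then re-run the simulation and refit issue_model
nfiles <- 40
prob_untouched <- runif(nfiles)
```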
Questions:
* In this simulation, I reused the same SZ across all files. Does that matter at all? In reality, files have different "change histories" - some change more than others. But as long as I simulate with the data for a particular file, and use the same data when recovering the betas, everything should work out, I presume.
A: Yes, it is OK to reuse SZ across files, even though in reality files will have different change histories.
* I guess standardization should work similarly - there is no point in keeping "separate scales" for the different files, since that would make it impossible to make sense of the common b_bar (the mean and stddev for a new file would be unknown).
A: Yes, standardization has to be done the same way for all files. As long as you keep to one single measurement scale, which unit it is in does not affect the simulation (except for numerical accuracy issues, of course). Just remember to use the same scale when interpreting the results, as sketched below.
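As a concrete illustration of keeping one scale (an added sketch, not part of the original analysis): `scale()` stores its centering and scaling constants as attributes on `SZ`, so a new change size, here a hypothetical 30-line change, can be standardized with the same constants before being pushed through the posterior.
```{r, eval=FALSE}
# constants used by scale() when SZ was created
sz_center <- attr(SZ, "scaled:center")
sz_scale  <- attr(SZ, "scaled:scale")
# standardize a new, hypothetical 30-line change with the same constants
new_size_std <- (30 - sz_center) / sz_scale
# Poisson rate for file 1 at that change size, per posterior sample
# (conditional on the file being touched; multiply by 1 - p for the unconditional mean)
lambda_new <- exp(post$al[, 1] + post$bsize[, 1] * new_size_std)
mean(lambda_new)
```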