simulated_zipoisson.Rmd

---
title: "ZeroInflatedPoisson"
author: "Anders Sundelin"
date: "2022-10-18"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rethinking)
```

## Simulating issue prediction with Bayesian Models
```{r init_sim}
N <- 350
set.seed(700716)

# probability of a event not happening
prob_untouched <- 0.2

# on average, we add 10 rows per week
change_size <- rnorm(N, 10, 25)
# tells how "scouty" our average developer are. High value -> more likely to remove issues, less likely to add new ones
scoutiness <- rbinom(N, 10, .66)

SZ <- (change_size - mean(change_size))/sd(change_size)
SC <- (scoutiness - mean(scoutiness))/sd(scoutiness)

# The log of the rate is a linear model - rates are always positive
# Thus, the rate is the exponentiation of the linear model
rate_introd <- exp(2.4 + .33 * SZ - SC)

untouched <- rbinom(N, 1, prob_untouched)

# simulate issues introduced
introd <- (1-untouched) * rpois(N, rate_introd)


```

Produce some data that looks reasonably well like the one we have.

```{r}
eadd <- rexp(N, .75)
```
logarithm could be modeled like this...
which means that add = exp(eadd)-1
```{r}
simplehist(eadd)
```
While the range of values are OK with the rate 0.7, there is too few values.
THe real distribution has more spread out values between to 600-6000 range.
But at least the number of zeros look OK. So some exponential family function should it be. But how do we find an alternative?

```{r}
min(exp(rexp(N, 0.75))-1)
```

seems reasonable to use this for , with 0.8 as parameter.
NB is 

```{r}
draws <- exp(rexp(N, 0.8)-1)
summary(draws)
```

```{r}
simplehist(exp(rexp(N, .75))-1)
```

All parameters shouw a similar extremely right-skewed distribution
```{r}
ADD <- exp(rexp(N, .75))-1
COMPLEX <- exp(rexp(N, .75))-1
DUP <- exp(rexp(N, .75))-1

```


```{r}
#ADD <- rlnorm(N, 1, 1)+1
#COMPLEX <- rlnorm(N, 1.2, 0.5)+1
#DUP <- rlnorm(N, -0.9, 1.1)+1

```

```{r}
simplehist(ADD)
```


```{r}
logADD <- log(ADD) #log(rlnorm(N, 1, 1)+1)
logCOMPLEX <- log(COMPLEX) #log(rlnorm(N, 1.2, 0.5)+1)
logDUP <- log(DUP) #log(rlnorm(N, -0.9, 1.1)+1)
```

```{r}
simplehist(logADD)
```


Check the distribution of each of the generated data
```{r}
summary(log(rlnorm(500, -0.5, 1.1)+1))
simplehist(rlnorm(500, -0.5, 1.1))
```
```{r}
summary(log(rlnorm(500, 0.7, 1)+1))
simplehist(log(rlnorm(500, 0.7, 1)+1))
```


```{r}
teams <- rep(1:5, each=70)
authors <- rbinom(N, teams*7, .4) + 1
```

```{r}
simplehist(authors)
```

Invent some coefficients tied to our data.

```{r}
logLambda <- -1.5 + 0.9 * logADD + 0.7 * logCOMPLEX + 0.2 * logDUP
data.frame(logLambda=logLambda) |> ggplot(aes(x=logLambda)) + geom_histogram(bins = 40)
#simplehist(0.5 + 0.5 * logADD + 1.3 * logCOMPLEX + 0.93 * logDUP)
```

Check that the generated data produces reasonably well behaved outcomes

```{r}
simplehist(exp(logLambda))# - teams/5 - authors/10))
```

And also for the negative binomial (with the given size and mu parameters)

```{r}
mu <- exp(logLambda)#-teams/5-authors/10)
size <- 5#*teams + authors
simplehist(rnbinom(500, mu=mu, size=size))
```

Only keep some events - here, we keep 15%

```{r}
prob_untouched <- 0.75
untouched <- rbinom(N, 1, prob_untouched)
introd <- (1-untouched) * rnbinom(N, mu=mu, size=size)
```

This gives the following distribution of data.

```{r}
simplehist(introd)
```

#
+ (0 + author | team),

Gives 


```{r}
library(brms)
library(tidybayes)
library(bayesplot)

```
```{r}

```


The 350 data points are not enough to recover the parameters stable enough. But 3500 data points work well. Can we find more intelligent priors?

```{r}
d <- data.frame(y=introd, logADD=logADD, logDUP=logDUP, logCOMPLEX=logCOMPLEX, team=as.factor(teams), author=as.factor(authors))
M_recover_params <- brm(data=d,
                        family=zero_inflated_negbinomial,
                        formula=bf(y ~ 0 + Intercept + logADD + logDUP + logCOMPLEX ,#+ (0 + author | team),
                                   zi ~ 1),
                        prior = c(prior(normal(-2, 1), coef = Intercept),
                                  prior(normal(0, 0.25), class = b),
                                  prior(normal(1, 0.25), class = Intercept, dpar=zi),
#                                  prior(lkj(2), class = cor),
#                                  prior(weibull(2, 1), class = sd),
                                  prior(gamma(0.1, 0.01), class = shape)),
                        warmup=1000,
                        iter=2000,
                        chains=2,
                        cores=2,
                        backend="cmdstanr",
                        save_pars = save_pars(all=T),
                        adapt_delta=0.95)
```

```{r}
get_prior(data=d,
                        family=zero_inflated_negbinomial,
                        formula=bf(y ~ 0 + Intercept + logADD + logDUP + logCOMPLEX ,#+ (0 + author | team),
                                   zi ~ 1))
```

```{r}
M_recover_params_autocentered <- brm(data=d,
                        family=zero_inflated_negbinomial,
                        formula=bf(y ~ 1 + logADD + logDUP + logCOMPLEX ,#+ (0 + author | team),
                                   zi ~ 1),
                        prior = c(prior(normal(-2, 1), class = Intercept),
                                  prior(normal(0, 0.25), class = b),
                                  prior(normal(1, 0.25), class = Intercept, dpar=zi),
#                                  prior(lkj(2), class = cor),
#                                  prior(weibull(2, 1), class = sd),
                                  prior(gamma(0.1, 0.01), class = shape)),
                        warmup=1000,
                        iter=2000,
                        chains=2,
                        cores=2,
                        backend="cmdstanr",
                        save_pars = save_pars(all=T),
                        adapt_delta=0.95)

```


```{r}
m <- M_recover_params
stopifnot(rhat(m) < 1.01)
stopifnot(neff_ratio(m) > 0.2)
mcmc_trace(m) # ok
loo(m) 
```

```{r}
rhat(m) |> mcmc_rhat()
neff_ratio(m) |> mcmc_neff() # + scale_x_continuous(limits=c(0, 0.5))
```

```{r}
loo(m)
```

```{r}
loo_recover <- loo(m, moment_match = T, reloo = T)

```

```{r}
plot(loo_recover)
```
```{r}
loo_recover
```

```{r}
yrep_zinb <- posterior_predict(m)

ppc_stat(y = d$y, yrep_zinb, stat = function(y) mean(y == 0))
ppc_stat(y = d$y, yrep_zinb, stat = "max")  + scale_x_continuous(limits=c(0,2000))

```


```{r}
max(yrep_zinb)
```

OK, so my priors were quite OK after all...


```{r}
pp_check(m, type = "bars") #+ ylim(0, 1400) + xlim(1, 16)
```

```{r}
plot(conditional_effects(m), ask = FALSE)

```
So, 350 observations were not enough to arrive at the correct conclusions.
Better priors warranted? Because it works with 3500 observations - indicates that with more data, the priors will be swamped.

```{r}
sum_1 <- summary(m)
```

```{r}
sum_1
```

```{r}
inv_logit(c(0.8, 1.11, 1.41))
```

```{r}
mcmc_areas(m, regex_pars = c("^b_", "^b_zi"))

```


### Prior selection

```{r}
get_prior(data=d,
                        family=zero_inflated_negbinomial,
                        formula=bf(y ~ 1 + logADD + logDUP + logCOMPLEX ,#+ (0 + author | team),
                                   zi ~ 1))
```

```{r}
M_horrible_prior_selection <- brm(data=d,
                        family=zero_inflated_negbinomial,
                        formula=bf(y ~ 1 + logADD + logDUP + logCOMPLEX ,#+ (0 + author | team),
                                   zi ~ 1),
                        prior = c(prior(normal(0, 0.1), class = Intercept),
                                  prior(normal(0, 0.01), class = b),
                                  prior(normal(0, 0.6), class = Intercept, dpar=zi),
#                                  prior(lkj(2), class = cor),
#                                  prior(weibull(2, 1), class = sd),
                                  prior(gamma(0.1, 1), class = shape)),
                        sample_prior = "only",
                        warmup=1000,
                        iter=2000,
                        chains=2,
                        cores=2,
                        backend="cmdstanr",
                        save_pars = save_pars(all=T),
                        adapt_delta=0.95)
```


With these priors, we can get the true parameters with only 350 samples! (gamma(0.1, 0.1))
With default shape priors (gamma(0.01, 0.01)), we get lot of 10% Pareto values > 0.7 and 10% > 1.0. This gamma(0.1,0.1) is better
```{r}
M_prior_selection <- brm(data=d,
                        family=zero_inflated_negbinomial,
                        formula=bf(y ~ 0 + Intercept + logADD + logDUP + logCOMPLEX ,#+ (0 + author | team),
                                   zi ~ 1),
                        prior = c(#prior(normal(-2, 1), class = Intercept),
                                  prior(normal(0, 0.25), class = b),
                                  prior(normal(1, 0.25), class = Intercept, dpar=zi),
#                                  prior(lkj(2), class = cor),
#                                  prior(weibull(2, 1), class = sd),
                                  prior(gamma(0.1, 0.1), class = shape)),
                        sample_prior = "only",
                        warmup=1000,
                        iter=2000,
                        chains=2,
                        cores=2,
                        backend="cmdstanr",
                        save_pars = save_pars(all=T),
                        adapt_delta=0.95)

```

Having the default prior on shape (0.01, 0.01) leads to very many divergent transitions.

```{r}
m <- M_prior_selection

loo(m)
```

All Pareto k values are bad! Can we get more informed priors?
```{r}
yrep_zinb <- posterior_predict(m)

ppc_stat(y = d$y, yrep_zinb, stat = function(y) mean(y == 0))
ppc_stat(y = d$y, yrep_zinb, stat = "max")

```

```{r}
inv_logit(1.5)
```

We have some wild priors. 305k issues...
And seem to be overestimating zero-inflation parts... But at least down towards 0.6-07 percentage is plausible... Probably realistic to say that half the commits should not introduce any issues

```{r}
max(yrep_zinb)
```

Plot in different scale

```{r}
ppc_stat(y = d$y, yrep_zinb, stat = "max") + scale_x_continuous(limits = c(0,400))

```

Seem to be overestimating the zero-inflation here too


### Retrofitting with new priors

Dangerous priors! All data will be put on the intercept, even with - but that may be because I had sd 0.01 on the betas...

```{r}
d <- data.frame(y=introd, logADD=logADD, logDUP=logDUP, logCOMPLEX=logCOMPLEX, team=as.factor(teams), author=as.factor(authors))
M_worse_priors <- brm(data=d,
                        family=zero_inflated_negbinomial,
                        formula=bf(y ~ 1 + logADD + logDUP + logCOMPLEX ,#+ (0 + author | team),
                                   zi ~ 1),
                        prior = c(prior(normal(0, 1), class = Intercept),
                                  prior(normal(0, 0.4), class = b),
                                  prior(normal(0, 0.4), class = Intercept, dpar=zi),
#                                  prior(lkj(2), class = cor),
#                                  prior(weibull(2, 1), class = sd),
                                  prior(gamma(0.1, 0.01), class = shape)),
                        warmup=1000,
                        iter=2000,
                        chains=2,
                        cores=2,
                        backend="cmdstanr",
                        save_pars = save_pars(all=T),
                        adapt_delta=0.95)

```

```{r}
m <- M_worse_priors

loo(m)
```

All Pareto k values are bad! Can we get more informed priors?
```{r}
yrep_zinb <- posterior_predict(m)

ppc_stat(y = d$y, yrep_zinb, stat = function(y) mean(y == 0))
ppc_stat(y = d$y, yrep_zinb, stat = "max")

```

We clearly have too strict prior on the zi-intercept


```{r}
max(yrep_zinb)
```


```{r}
pp_check(m, type = "bars") #+ ylim(0, 40) + xlim(1, 40)
```

As stated, we overestimate the zero-inflation

```{r}
plot(conditional_effects(m), ask = FALSE)

```

```{r}
summary(m)
```

This model, likely has too restrictive priors. The data is not allowed to tell its tale...


```{r}
simplehist(introd, xlab="issues introduced", lwd=4)
```


### brms model (zi NegBin)

```{r}
# given that var(y) != mean(y) a Poisson might not be the right thing to use.
library(brms) # rethinking does not have support for zi-negbin

introd <- as.numeric(introd)
size <- scale(change_size)
scout <- scale(scoutiness)

d <- data.frame(y = introd, size = size, scout = scout)

p <- get_prior(bf(
  y ~ 1 + size + scout,
  zi ~ 1),
  family = zero_inflated_negbinomial(),
  data = d)

p[1,1] <- "normal(0, 0.5)"
p[4,1] <- "normal(0, 1)"

m1 <- brm(bf(
  y ~ 1 + size + scout,
  zi ~ 1),
  family = zero_inflated_negbinomial(),
  data = d,
  prior = p
)

summary(m1)
```

### Recovering the model parameters (zi-poisson)

```{r}
issue_model <- ulam(
  alist(
    y ~ dzipois(p, lambda),
    logit(p) <- ap, # assume simplest possible model
    log(lambda) <- al + bsize * size + bscout * scout,
    ap ~ dnorm(0, 1.5),
    al ~ dnorm(3, .5),
    bsize ~ dnorm(0, 0.4),
    bscout ~ dnorm(0, 0.4)
  ), data=list(y=introd, size=SZ, scout=SC), chains=4)
```

```{r}
precis(issue_model)
```

With the corrected lambda (making log(lambda) into a linear model), we can easily recover the parameters of the model.

```{r}
post <- extract.samples(issue_model)
mean(inv_logit(post$ap))
```
```{r}
ns <- 100
SZ_seq <- seq(from=-3, to=3, length.out=ns)
SC_seq <- seq(from=-3, to=3, length.out=ns)
lambda <- link(issue_model, data=data.frame(size=SZ_seq, scout=SC_seq))
lmu <- apply(lambda$lambda, 2, mean)
lci <- apply(lambda$lambda, 2, PI)
plot(SZ, introd, xlim=range(SZ_seq), ylim=c(0,150), xlab="normalized size", ylab="issues introduced")
lines(SZ_seq, lmu, lty=1, lwd=1.5)
shade(lci, SZ_seq, xpd=TRUE)
```
```{r}
plot(SC, introd, xlim=range(SC_seq), ylim=c(0,150), xlab="normalized scoutiness", ylab="issues introduced")
lines(SC_seq, lmu, lty=1, lwd=1.5)
shade(lci, SC_seq, xpd=TRUE)
```


Question for my Bayesian buddies, including Prof. Torkar:

* In my analysis, I standardized the data, using the complete data set, then used that data to recover the parameters.

A: That is a first good step to see if your model can recover the basic (ground truth) parameters.

* My hunch is that this introduces a risk of overfitting - making the model trust the data too much.

A: Aah, yes, indeed. The second step could be to do what you propose next below.

* I guess we should have sampled some data, standardized using that dataset, and then run the model on the remainder - akin to ML train/test/validation sets?

A: But remember also that we rely on an epistemological reasoning here, i.e., information theory (LOO). What LOO does is approximating leave-one-out (LOO) CV. It does it remarkably well and the results are very reliable (including the fact that we also get diagnostics w/ LOO). I have over time fallen back on relying more and more on LOO and not focus on scaling the sample and use that before I use the test/validation set.


### Prior selection, incl. plots

Choosing a "flat prior" in log space can be challenging (and usually requires standardizing variables).
If log(lambda) has a normal distribution, then lambda has a log-normal distribution.
Plotting the priors
```{r}
curve(dlnorm(x, 30, 30), from=0, to=100, n=200)
```
That is a prior that is extremely biased towards 0. So should not be used in log space.

Seeking a more sensible prior (we expect about 20-40 issues to be able to be introduced):
```{r}
curve(dlnorm(x, 3, .5), from=0, to=100, n=200)
```

Prior for beta:

Choosing blindly, we chose a norm(20,40) prior for beta
```{r}
Nbeta <- 100
a <- rnorm(Nbeta, 3, .5)
b <- rnorm(Nbeta, 20, 40)
plot(NULL, xlim=c(-2,2), ylim=c(0,100))
for (i in 1:Nbeta) curve(exp(a[i] + b[i]*x), add=TRUE, col=grau())
```

Wild linear relationships - we need to tame the prior.


```{r }
Nbeta <- 100
a <- rnorm(Nbeta, 3, 0.5)
b <- rnorm(Nbeta, 0, 0.4)
plot(NULL, xlim=c(-2,2), ylim=c(0,100))
for (i in 1:Nbeta) curve(exp(a[i] + b[i]*x), add=TRUE, col=grau())
```

The Normal(0, 0.4) prior allows some strong relationships, while keeping most of the lines relatively straight (skeptical of the data).

### Viewing the priors on the natural outcome scale

```{r}
x_seq <- seq(from=-3, to=3, length.out=100)
lambda <- sapply(x_seq, function(x) exp(a+b*x))
plot(NULL, xlim=range(x_seq), ylim=c(0,500), xlab="size", ylab="issues introduced")
for (i in 1:Nbeta) lines(x_seq, lambda[i,], col=grau(), lwd=1.5)

```

Most of the priors are relatively flat, but a few show strong coupling. This would allow such trends to appear, if they were present in the data. But in general, the priors assume "no relation" - this is because we want the data to show these relationships - we do not want to bias our analysis via the prior.