
Jackknife estimate of the variance #112

Closed
scj-robin opened this issue Nov 14, 2023 · 13 comments · Fixed by #113
@scj-robin

scj-robin commented Nov 14, 2023

I do not get the same estimate of the sd using the jackknife option of PLN as when doing it by hand. Did I do something wrong?

# Test Jackknife variance estimate

library(PLNmodels)

# Make params
data("barents")
n <- nrow(barents$Abundance); p <- ncol(barents$Abundance)
fit <- PLN(barents$Abundance ~ barents$Latitude + barents$Longitude + barents$Depth + barents$Temperature)
X <- lm(barents$Abundance[, 1] ~ barents$Latitude + barents$Longitude + barents$Depth + barents$Temperature, x=TRUE)$x

# Simulate data
Y <- matrix(rpois(n*p, exp(X%*%fit$model_par$B + mvtnorm::rmvnorm(n, sigma=fit$model_par$Sigma))), n, p)
d <- ncol(X)

# Jackknife sd estimate
pln <- PLN(Y ~ -1 + X, control=PLN_param(config_post = list(jackknife=TRUE)))
betaSd1 <- as.vector(standard_error(pln, type='jackknife'))

# Jackknife sd estimate by hand
betaJK <- matrix(NA, n, p*d)
for(i in 1:n){
  plnTmp <- PLN(Y[-i, ] ~ -1 + X[-i, ])
  betaJK[i, ]  <- as.vector(plnTmp$model_par$B)
}
betaSd2 <- sqrt((n-1)*(colMeans(betaJK^2) - (colMeans(betaJK)^2)))

# Comparison
plot(betaSd1, betaSd2)
@mahendra-mariadassou
Collaborator

Hi @scj-robin, correct me if I'm wrong but the difference likely arises from the formula you used for the jackknife.

Forgetting matrix notations for a bit, and noting $\hat{\theta}_{(i)}$ the jackknife estimate of $\theta$ without the $i$-th sample and $\hat{\theta}_{(.)} = \frac{1}{n} \sum_{i} \hat{\theta}_{(i)}$ their mean, you use the expression

$$\text{var}_{jack}(\hat{\theta}) = \frac{1}{n} \sum_{i=1}^n ( \hat{\theta}_{(i)} - \hat{\theta}_{(.)})^2$$

whereas we use Tukey's expression, which is quite different (although slightly conservative, see for example DOI:10.1214/aos/1176345462)

$$\text{var}_{jack}(\hat{\theta}) = \frac{n-1}{n} \sum_{i=1}^n ( \hat{\theta}_{(i)} - \hat{\theta}_{(.)})^2$$

as shown here (where var_jack is $\sum_i ( \hat{\theta}_{(i)} - \hat{\theta}_{(.)})^2$):

attr(private$B, "variance_jackknife") <- (self$n - 1) / self$n * var_jack
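As a standalone sanity check (a toy sketch, not PLNmodels code, with a hypothetical `jackknife_var` helper), Tukey's formula can be verified on the simplest estimator: for the sample mean, the jackknife variance estimate equals var(x)/n exactly.

```r
# Hypothetical helper (not part of PLNmodels): Tukey's jackknife variance
# for an arbitrary scalar estimator.
jackknife_var <- function(x, estimator = mean) {
  n <- length(x)
  # Leave-one-out estimates theta_(i)
  theta_i <- vapply(seq_len(n), function(i) estimator(x[-i]), numeric(1))
  # Tukey's expression: (n-1)/n * sum of squared deviations from their mean
  (n - 1) / n * sum((theta_i - mean(theta_i))^2)
}

set.seed(42)
x <- rnorm(50)
# For the sample mean, this equals var(x) / n exactly (up to rounding)
all.equal(jackknife_var(x), var(x) / length(x))
```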

@scj-robin
Author

scj-robin commented Nov 15, 2023

I do not think I use the first expression you suggest. I think that line

betaSd2 <- sqrt((n-1)*(colMeans(betaJK^2) - (colMeans(betaJK)^2)))

does compute the second formula you mention as the difference of colMeans is precisely

$$\frac1n \sum_i (\hat{\theta}_{(i)} - \hat{\theta}_{(.)})^2.$$
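The algebraic identity invoked here, mean of squares minus squared mean, can be checked numerically (toy values, not data from the issue):

```r
# Numerical check of the identity
# (1/n) * sum((theta_i - theta_bar)^2) == mean(theta^2) - mean(theta)^2
set.seed(1)
theta <- rnorm(20)
lhs <- mean((theta - mean(theta))^2)
rhs <- mean(theta^2) - mean(theta)^2
all.equal(lhs, rhs)
```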

@mahendra-mariadassou
Collaborator

You're completely right and I should stop trying to understand code late at night. I'll look into it.

@mahendra-mariadassou
Collaborator

mahendra-mariadassou commented Nov 15, 2023

OK, I think I understand the source of the problem: our jackknife estimates are not variable enough. I made a branch to export the jackknife estimates (not just the jackknife variance estimator) and got this

# Test Jackknife variance estimate
library(PLNmodels)
#> This is packages 'PLNmodels' version 1.0.4-0200
#> Use future::plan(multicore/multisession) to speed up PLNPCA/PLNmixture/stability_selection.

# Make params
data("barents")
## To keep things fast, use only n = 20 samples
barents <- barents[1:20, ]
n <- nrow(barents$Abundance); p <- ncol(barents$Abundance)
fit <- PLN(Abundance ~ Latitude + Longitude + Depth + Temperature, data = barents)
#> 
#>  Initialization...
#>  Adjusting a full covariance PLN model with nlopt optimizer
#>  Post-treatments...
#>  DONE!
X <- model.matrix(Abundance ~ Latitude + Longitude + Depth + Temperature, data = barents)

# Simulate data
B0 <- fit$model_par$B; Sigma0 <- fit$model_par$Sigma
Y <- matrix(rpois(n*p, exp(X %*% B0 + mvtnorm::rmvnorm(n, sigma= Sigma0))), n, p)
d <- ncol(X)

# All jackknife estimates
pln <- PLN(Y ~ -1 + X, control=PLN_param(config_post = list(jackknife=TRUE)))
#> 
#>  Initialization...
#>  Adjusting a full covariance PLN model with nlopt optimizer
#>  Post-treatments...
#>   Computing jackknife variance estimator
#> 
#>  DONE!
jackknife_estimates <- attr(pln$model_par$B, "estimates_jackknife") |> purrr::map("B")
betaJK_pln <- matrix(NA, p*d, n)
## Compare jackknife estimates to full set estimate
par(mfrow = c(4, 5))
for(i in 1:n){
  betaJK_pln[, i] <- as.vector(jackknife_estimates[[i]])
  plot(pln$model_par$B, jackknife_estimates[[i]],
       xlab = "Coef. (full set)", ylab = "Coeff. (jackknife)",
       main = paste("Jackknife", i))
  abline(a = 0, b = 1, col = "red")
}

All jackknife estimators (computed within the PLN code) align perfectly with the full set estimators 😞
[figure: jackknife vs. full-set coefficients, one panel per left-out sample]

## Compute jackknife variance manually from jackknife estimates and compare with function output
## Use this form because the other leads to negative values due to numeric precision
betaSd1_manual <- sqrt((n-1)*(rowMeans( (betaJK_pln - rowMeans(betaJK_pln))^2)))
betaSd1_pln <- as.vector(standard_error(pln, type='jackknife'))
par(mfrow = c(1, 1))
plot(betaSd1_manual, betaSd1_pln,
     xlab = "Coef. (manual comp.)", ylab = "Coeff. (pln comp.)",
     main = "Jackknife variance estimators")
abline(a = 0, b = 1, col = "red")

No problem when computing the jackknife variance estimators from the jackknife estimators (manual computations are equal to computations done in the package ✅ )
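The numerical-precision caveat in the comment above (preferring the centered form) is easy to reproduce: for nearly constant values with a large offset, the "mean of squares minus squared mean" form can go negative due to catastrophic cancellation, whereas the centered form cannot. Toy values, not from the issue:

```r
set.seed(2)
x <- 1e8 + rnorm(10, sd = 1e-6)    # large offset, tiny spread
naive    <- mean(x^2) - mean(x)^2  # cancellation: may come out negative
centered <- mean((x - mean(x))^2)  # always >= 0 by construction
centered >= 0
```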

# Jackknife sd estimate by hand
betaJK <- matrix(NA, p*d, n)
par(mfrow = c(4, 5))
for(i in 1:n){
  # cat(i, sep = "\n")
  plnTmp <- PLN(Y[-i, ] ~ -1 + X[-i, ], control = PLN_param(trace = 0))
  plot(pln$model_par$B, plnTmp$model_par$B,
       xlab = "Coef. (full set)", ylab = "Coeff. (jackknife)",
       main = paste("Jackknife", i))
  abline(a = 0, b = 1, col = "red")
  betaJK[, i]  <- as.vector(plnTmp$model_par$B)
}

The jackknife estimators (computed manually) don't align perfectly with the full set estimators ✅
[figure: jackknife vs. full-set coefficients, one panel per left-out sample, now scattered around the y = x line]

betaSd2_manual <- sqrt((n-1)*(rowMeans( (betaJK - rowMeans(betaJK))^2)))

# Comparison
par(mfrow = c(1, 1))
plot(betaSd1_manual, betaSd2_manual,
     xlab = "Coef. (pln comp.)", ylab = "Coeff. (manual comp.)",
     main = "Jackknife variance estimators")

Hence the difference between the values from the package and the values computed by hand.

Created on 2023-11-16 with reprex v2.0.2

The problem likely arises from the initialization in the jackknife loop: we reuse the values of $B$ and $S$ (but not $M$) estimated from the full set model and the iterative algorithm doesn't appear to move away from them.

PLNmodels/R/PLNfit-class.R

Lines 208 to 218 in 2657e50

jacks <- future.apply::future_lapply(seq_len(self$n), function(i) {
  data <- list(Y = Y[-i, , drop = FALSE],
               X = X[-i, , drop = FALSE],
               O = O[-i, , drop = FALSE],
               w = w[-i])
  args <- list(data = data,
               params = list(B = private$B, M = matrix(0, self$n-1, self$p), S = private$S[-i, ]),
               config = config)
  optim_out <- do.call(private$optimizer$main, args)
  optim_out[c("B", "Omega")]
}, future.seed = TRUE)

@mahendra-mariadassou
Collaborator

If we perturb $B$ in the initialization of the jackknife estimators (here a multiplicative perturbation),

PLNmodels/R/PLNfit-class.R

Lines 215 to 219 in f89d7d0

args <- list(data = data,
             params = list(B = private$B * (1 + matrix(runif(min = -0.5, max = 0.5, n = self$d * self$p), self$d, self$p)),
                           M = matrix(0, self$n-1, self$p),
                           S = private$S[-i, , drop = FALSE]),
             config = config)

we get more credible results when comparing jackknife estimators to full set estimators
[figure: jackknife vs. full-set coefficients after perturbed initialization]

and in turn the standard deviations obtained "directly from the package" and "manually" are closer to each other but not yet perfectly aligned (beware of the log-scale), probably because of the difference in starting points:

[figure: package vs. manual jackknife standard deviations, log scale]

@scj-robin
Author

scj-robin commented Nov 16, 2023

The thing is that the manual estimates seem to match the empirical variability of the estimates (observed when simulating several datasets with the same true parameters) better than the package ones do. I copy below more or less the same code as the initial one, increasing the number of observations and adding the empirical sd.

# Test Jackknife variance estimate

library(PLNmodels); library(mvtnorm)

# Make params
data("barents")
n <- nrow(barents$Abundance); p <- ncol(barents$Abundance)
fit <- PLN(barents$Abundance ~ barents$Latitude + barents$Longitude + barents$Depth + barents$Temperature)
X <- lm(barents$Abundance[, 1] ~ barents$Latitude + barents$Longitude + barents$Depth + barents$Temperature, x=TRUE)$x

# Simulate data
n <- 5e2
X <- X[sample(1:nrow(X), n, replace=TRUE), ]
Y <- matrix(rpois(n*p, exp(X%*%fit$model_par$B + rmvnorm(n, sigma=fit$model_par$Sigma))), n, p)
d <- ncol(X)

# Jackknife sd estimate
pln <- PLN(Y ~ -1 + X, control=PLN_param(config_post = list(jackknife=TRUE)))
betaSdPackage <- as.vector(standard_error(pln, type='jackknife'))

# Jackknife sd estimate by hand
betaJK <- matrix(NA, n, p*d)
for(i in 1:n){
  if(i %% round(sqrt(n))==0){cat(i, '')}
  plnTmp <- PLN(Y[-i, ] ~ -1 + X[-i, ], control=PLN_param(trace=0))
  betaJK[i, ]  <- as.vector(plnTmp$model_par$B)
}
cat('\n')
betaSdByHand <- sqrt((n-1)*(colMeans(betaJK^2) - (colMeans(betaJK)^2)))

# Empirical sd estimate 
B <- n
betaEmp <- matrix(NA, B, p*d)
for(b in 1:B){
  if(b %% round(sqrt(B))==0){cat(b, '')}
  Ytmp <- matrix(rpois(n*p, exp(X%*%fit$model_par$B + rmvnorm(n, sigma=fit$model_par$Sigma))), n, p)
  plnTmp <- PLN(Ytmp ~ -1 + X, control=PLN_param(trace=0))
  betaEmp[b, ]  <- as.vector(plnTmp$model_par$B)
}
betaSdEmpirical <- apply(betaEmp, 2, sd)

# Comparison
plot(as.data.frame(cbind(betaSdPackage, betaSdByHand, betaSdEmpirical)))

[figure: pairs plot of betaSdPackage, betaSdByHand and betaSdEmpirical]

@mahendra-mariadassou
Collaborator

mahendra-mariadassou commented Nov 16, 2023

I think this is the same problem. Right now, the jackknife estimators we get "from the package" are underdispersed because the starting point we use makes each jackknife estimator $B_{(i)}$ almost identical to the full dataset one $\hat{B}$.

When you do it by hand, you use a different starting point for each jackknife dataset and get jackknife estimators $B_{(i)}$ that are much more dispersed and better mimic the empirical estimate. We need to change the initialization for the jackknife estimator (in the package) to avoid being stuck at $\hat{B}$. The challenge consists in striking the right balance between recycling previous computations (e.g. reusing $\hat{B}$ as such, as is currently done in the package, which leads to underdispersion) and starting from scratch each time (what you did by hand, which is computationally more intensive).

@jchiquet
Member

Thanks to @scj-robin for the report and to @mahendra-mariadassou for the diagnosis.

As is often the case, it comes down to a problem of local V-EM minima and/or difficult optimization in a flat region of the ELBO, with potentially large variations in parameter values. For prediction this hardly matters, but to our great misfortune we build statistical models where parameter interpretation is important.

This is particularly visible on Stéphane's favorite data, the Barents fish dataset, and not for the first time.

In any case, it's better to restart from scratch than to save computation time. This is easily parallelizable with future_lapply, which hasn't been done yet.
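A self-contained sketch of such a parallel leave-one-out loop, using base R's `parallel` package and `lm` as a stand-in for the PLN fit (`future.apply::future_lapply` would slot in the same way):

```r
library(parallel)

set.seed(3)
n <- 50
x <- rnorm(n); y <- 2 * x + rnorm(n)

# Refit from scratch on each leave-one-out subset (parallelizable loop;
# mc.cores = 1 here to keep the sketch portable)
beta_jk <- unlist(mclapply(seq_len(n), function(i) {
  unname(coef(lm(y[-i] ~ x[-i]))[2])  # slope estimate without sample i
}, mc.cores = 1))

# Tukey's jackknife sd of the slope estimate
sd_jk <- sqrt((n - 1) / n * sum((beta_jk - mean(beta_jk))^2))
```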

@mahendra-mariadassou
Collaborator

TODO: change initialization in variance_jackknife() and variance_bootstrap() to use same starting point as model inferred "from scratch"

@mahendra-mariadassou
Collaborator

I'm happy to report that we have a working fix:

library(PLNmodels); library(mvtnorm)

# Make params
data("barents")
n <- nrow(barents$Abundance); p <- ncol(barents$Abundance)
fit <- PLN(Abundance ~ Latitude + Longitude + Depth + Temperature, data = barents)
X <- model.matrix(Abundance ~ Latitude + Longitude + Depth + Temperature, data = barents)

# Simulate data (n = 100)
n <- 100
X <- X[sample(1:nrow(X), n, replace=TRUE), ]
B0 <- fit$model_par$B; Sigma0 <- fit$model_par$Sigma
Y <- matrix(rpois(n*p, exp(X %*% B0 + mvtnorm::rmvnorm(n, sigma= Sigma0))), n, p)
d <- ncol(X)

# Jackknife sd estimate (using the package)
tictoc::tic()
pln <- PLN(Y ~ -1 + X, control=PLN_param(config_post = list(jackknife=TRUE)))
#> 
#>  Initialization...
#>  Adjusting a full covariance PLN model with nlopt optimizer
#>  Post-treatments...
#>   Computing jackknife variance estimator
#> 
#>  DONE!
tictoc::toc()
#> 4.296 sec elapsed

betaSdPackage <- as.vector(standard_error(pln, type='jackknife'))
jackknife_estimates <- attr(pln$model_par$B, "estimates_jackknife") |> purrr::map("B")
betaJKPackage <- matrix(NA, n, p*d)
for(i in 1:n){
  betaJKPackage[i, ]  <- as.vector(jackknife_estimates[[i]])
}


# Jackknife sd estimate (by hand)
tictoc::tic()
betaJK <- matrix(NA, n, p*d)
for(i in 1:n){
  plnTmp <- PLN(Y[-i, ] ~ -1 + X[-i, ], control=PLN_param(trace=0))
  betaJK[i, ]  <- as.vector(plnTmp$model_par$B)
}
tictoc::toc()
#> 20.509 sec elapsed
betaSdByHand <- sqrt((n-1)*apply(betaJK, 2, function(x) { mean(x^2) - mean(x)^2 }))

We can check that the two options (by hand and through the package) return identical results with a figure

## Check that "automatic" and "manual" jackknife estimates are identical
plot(as.vector(betaJK), as.vector(betaJKPackage),
     ylab = "Computed from package", xlab = "Computed by hand",
     main = "Jackknife estimates (all coefficients, all samples)")
abline(0, 1, col = "red")

Or by looking at the difference between coefficients

max(abs(betaJK - betaJKPackage))
#> [1] 0

But using the package is faster than doing it by hand (4.296 sec versus 20.509 sec in this example, thanks to less overhead). And the jackknife estimator now correctly mimics the empirical sd.

# Empirical sd estimate
B <- n
betaEmp <- matrix(NA, B, p*d)
for(b in 1:B){
  if(b %% 10==0){cat(b, '')}
  Ytmp <- matrix(rpois(n*p, exp(X %*% B0 + rmvnorm(n, sigma= Sigma0))), n, p)
  plnTmp <- PLN(Ytmp ~ -1 + X, control=PLN_param(trace=0))
  betaEmp[b, ]  <- as.vector(plnTmp$model_par$B)
}
#> 10 20 30 40 50 60 70 80 90 100
cat("\n")
betaSdEmpirical <- apply(betaEmp, 2, sd)

# Comparison
plot(as.data.frame(cbind(betaSdPackage, betaSdByHand, betaSdEmpirical)))

Created on 2023-11-17 with reprex v2.0.2

@jchiquet, I'll clean the code and do the PR.

@jchiquet
Member

jchiquet commented Nov 17, 2023

Great news. Thanks a lot for the fix: how did you manage to do it?

mahendra-mariadassou added a commit that referenced this issue Nov 17, 2023
@mahendra-mariadassou
Collaborator

Not in a very smart way: I essentially extracted the computations for the starting point of $M$, $B$ and $S$ into their own function:

PLNmodels/R/utils.R

Lines 256 to 265 in c740f2d

#' barebone function to compute starting points for B and M
#' @importFrom stats lm.fit
starting_point <- function(Y, X, O, w) {
  # Y = responses, X = covariates, O = offsets, w = weights
  n <- nrow(Y); p <- ncol(Y); d <- ncol(X)
  fits <- lm.fit(w * X, w * log((1 + Y)/exp(O)))
  list(B = matrix(fits$coefficients, d, p),
       M = matrix(fits$residuals, n, p),
       S = matrix(.1, n, p))
}

and used it in variance_*() functions.

PLNmodels/R/PLNfit-class.R

Lines 218 to 223 in c740f2d

args <- list(data = data,
             # params = list(B = private$B,
             #               M = matrix(0, self$n-1, self$p),
             #               S = private$S[-i, , drop = FALSE]),
             params = do.call(starting_point, data),
             config = config)

I need to check that the tests pass and we should be good.
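For illustration, the `starting_point()` logic quoted above can be exercised on toy data (the function body is copied from the snippet; the toy dimensions are mine) to check the shapes of the returned initializers:

```r
# Copied from the quoted PLNmodels/R/utils.R snippet (roxygen header omitted)
starting_point <- function(Y, X, O, w) {
  # Y = responses, X = covariates, O = offsets, w = weights
  n <- nrow(Y); p <- ncol(Y); d <- ncol(X)
  fits <- lm.fit(w * X, w * log((1 + Y) / exp(O)))
  list(B = matrix(fits$coefficients, d, p),
       M = matrix(fits$residuals, n, p),
       S = matrix(.1, n, p))
}

set.seed(4)
Y <- matrix(rpois(60, lambda = 5), 20, 3)  # n = 20 samples, p = 3 responses
X <- cbind(1, rnorm(20))                   # d = 2 covariates (incl. intercept)
O <- matrix(0, 20, 3); w <- rep(1, 20)
sp <- starting_point(Y, X, O, w)
sapply(sp, dim)  # B is d x p, M and S are n x p
```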

@jchiquet
Member

Efficiency and accuracy sound smart to me.
