
Jackknife estimate of the variance #112

Closed
scj-robin opened this issue Nov 14, 2023 · 13 comments · Fixed by #113
@scj-robin

scj-robin commented Nov 14, 2023

I do not get the same estimate of the sd using the jackknife option of PLN as when doing it by hand. Did I do something wrong?

# Test Jackknife variance estimate

library(PLNmodels)

# Make params
data("barents")
n <- nrow(barents$Abundance); p <- ncol(barents$Abundance)
fit <- PLN(barents$Abundance ~ barents$Latitude + barents$Longitude + barents$Depth + barents$Temperature)
X <- lm(barents$Abundance[, 1] ~ barents$Latitude + barents$Longitude + barents$Depth + barents$Temperature, x=TRUE)$x

# Simulate data
Y <- matrix(rpois(n*p, exp(X%*%fit$model_par$B + mvtnorm::rmvnorm(n, sigma=fit$model_par$Sigma))), n, p)
d <- ncol(X)

# Jackknife sd estimate
pln <- PLN(Y ~ -1 + X, control=PLN_param(config_post = list(jackknife=TRUE)))
betaSd1 <- as.vector(standard_error(pln, type='jackknife'))

# Jackknife sd estimate by hand
betaJK <- matrix(NA, n, p*d)
for(i in 1:n){
  plnTmp <- PLN(Y[-i, ] ~ -1 + X[-i, ])
  betaJK[i, ]  <- as.vector(plnTmp$model_par$B)
}
betaSd2 <- sqrt((n-1)*(colMeans(betaJK^2) - (colMeans(betaJK)^2)))

# Comparison
plot(betaSd1, betaSd2)
@mahendra-mariadassou
Collaborator

Hi @scj-robin, correct me if I'm wrong but the difference likely arises from the formula you used for the jackknife.

Forgetting matrix notations for a bit, and noting $\hat{\theta}_{(i)}$ the jackknife estimate of $\theta$ without the $i$-th sample and $\hat{\theta}_{(.)} = \frac{1}{n} \sum_{i} \hat{\theta}_{(i)}$ their mean, you use the expression

$$\text{var}_{jack}(\hat{\theta}) = \frac{1}{n} \sum_{i=1}^n ( \hat{\theta}_{(i)} - \hat{\theta}_{(.)})^2$$

whereas we use Tukey's expression, which is quite different (although slightly conservative, see for example DOI:10.1214/aos/1176345462)

$$\text{var}_{jack}(\hat{\theta}) = \frac{n-1}{n} \sum_{i=1}^n ( \hat{\theta}_{(i)} - \hat{\theta}_{(.)})^2$$

as shown here (where var_jack is $\sum_i ( \hat{\theta}_{(i)} - \hat{\theta}_{(.)})^2$):

attr(private$B, "variance_jackknife") <- (self$n - 1) / self$n * var_jack
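As a standalone sanity check (a toy sketch, not PLNmodels code, with a hypothetical `jackknife_var` helper), Tukey's formula can be verified on the simplest estimator: for the sample mean, the jackknife variance estimate equals var(x)/n exactly.

```r
# Hypothetical helper (not part of PLNmodels): Tukey's jackknife variance
# for an arbitrary scalar estimator.
jackknife_var <- function(x, estimator = mean) {
  n <- length(x)
  # Leave-one-out estimates theta_(i)
  theta_i <- vapply(seq_len(n), function(i) estimator(x[-i]), numeric(1))
  # Tukey's expression: (n-1)/n * sum of squared deviations from their mean
  (n - 1) / n * sum((theta_i - mean(theta_i))^2)
}

set.seed(42)
x <- rnorm(50)
# For the sample mean, this equals var(x) / n exactly (up to rounding)
all.equal(jackknife_var(x), var(x) / length(x))
```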

@scj-robin
Author

scj-robin commented Nov 15, 2023

I do not think I use the first expression you suggest. I think that line

betaSd2 <- sqrt((n-1)*(colMeans(betaJK^2) - (colMeans(betaJK)^2)))

does compute the second formula you mention as the difference of colMeans is precisely

$$\frac1n \sum_i (\hat{\theta}_{(i)} - \hat{\theta}_{(.)})^2.$$
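The algebraic identity invoked here, mean of squares minus squared mean, can be checked numerically (toy values, not data from the issue):

```r
# Numerical check of the identity
# (1/n) * sum((theta_i - theta_bar)^2) == mean(theta^2) - mean(theta)^2
set.seed(1)
theta <- rnorm(20)
lhs <- mean((theta - mean(theta))^2)
rhs <- mean(theta^2) - mean(theta)^2
all.equal(lhs, rhs)
```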

@mahendra-mariadassou
Collaborator

You're completely right and I should stop trying to understand code late at night. I'll look into it.

@mahendra-mariadassou
Collaborator

mahendra-mariadassou commented Nov 15, 2023

OK, I think I understand the source of the problem: our jackknife estimates are not variable enough. I made a branch to export the jackknife estimates (not just the jackknife variance estimator) and got this

# Test Jackknife variance estimate
library(PLNmodels)
#> This is packages 'PLNmodels' version 1.0.4-0200
#> Use future::plan(multicore/multisession) to speed up PLNPCA/PLNmixture/stability_selection.

# Make params
data("barents")
## To keep things fast, use only n = 20 samples
barents <- barents[1:20, ]
n <- nrow(barents$Abundance); p <- ncol(barents$Abundance)
fit <- PLN(Abundance ~ Latitude + Longitude + Depth + Temperature, data = barents)
#> 
#>  Initialization...
#>  Adjusting a full covariance PLN model with nlopt optimizer
#>  Post-treatments...
#>  DONE!
X <- model.matrix(Abundance ~ Latitude + Longitude + Depth + Temperature, data = barents)

# Simulate data
B0 <- fit$model_par$B; Sigma0 <- fit$model_par$Sigma
Y <- matrix(rpois(n*p, exp(X %*% B0 + mvtnorm::rmvnorm(n, sigma= Sigma0))), n, p)
d <- ncol(X)

# All jackknife estimates
pln <- PLN(Y ~ -1 + X, control=PLN_param(config_post = list(jackknife=TRUE)))
#> 
#>  Initialization...
#>  Adjusting a full covariance PLN model with nlopt optimizer
#>  Post-treatments...
#>   Computing jackknife variance estimator
#> 
#>  DONE!
jackknife_estimates <- attr(pln$model_par$B, "estimates_jackknife") |> purrr::map("B")
betaJK_pln <- matrix(NA, p*d, n)
## Compare jackknife estimates to full set estimate
par(mfrow = c(4, 5))
for(i in 1:n){
  betaJK_pln[, i] <- as.vector(jackknife_estimates[[i]])
  plot(pln$model_par$B, jackknife_estimates[[i]],
       xlab = "Coef. (full set)", ylab = "Coeff. (jackknife)",
       main = paste("Jackknife", i))
  abline(a = 0, b = 1, col = "red")
}

All jackknife estimators (computed within the PLN code) align perfectly with the full set estimators 😞
[figure: jackknife vs. full-set coefficients, one panel per left-out sample]

## Compute jackknife variance manually from jackknife estimates and compare with function output
## Use this form because the other leads to negative values due to numeric precision
betaSd1_manual <- sqrt((n-1)*(rowMeans( (betaJK_pln - rowMeans(betaJK_pln))^2)))
betaSd1_pln <- as.vector(standard_error(pln, type='jackknife'))
par(mfrow = c(1, 1))
plot(betaSd1_manual, betaSd1_pln,
     xlab = "Coef. (manual comp.)", ylab = "Coeff. (pln comp.)",
     main = "Jackknife variance estimators")
abline(a = 0, b = 1, col = "red")

No problem when computing the jackknife variance estimators from the jackknife estimators (manual computations are equal to computations done in the package ✅ )
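The numerical-precision caveat in the comment above (preferring the centered form) is easy to reproduce: for nearly constant values with a large offset, the "mean of squares minus squared mean" form can go negative due to catastrophic cancellation, whereas the centered form cannot. Toy values, not from the issue:

```r
set.seed(2)
x <- 1e8 + rnorm(10, sd = 1e-6)    # large offset, tiny spread
naive    <- mean(x^2) - mean(x)^2  # cancellation: may come out negative
centered <- mean((x - mean(x))^2)  # always >= 0 by construction
centered >= 0
```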

# Jackknife sd estimate by hand
betaJK <- matrix(NA, p*d, n)
par(mfrow = c(4, 5))
for(i in 1:n){
  # cat(i, sep = "\n")
  plnTmp <- PLN(Y[-i, ] ~ -1 + X[-i, ], control = PLN_param(trace = 0))
  plot(pln$model_par$B, plnTmp$model_par$B,
       xlab = "Coef. (full set)", ylab = "Coeff. (jackknife)",
       main = paste("Jackknife", i))
  abline(a = 0, b = 1, col = "red")
  betaJK[, i]  <- as.vector(plnTmp$model_par$B)
}

The jackknife estimators (computed manually) don't align perfectly with the full set estimators ✅
[figure: jackknife vs. full-set coefficients, one panel per left-out sample, now scattered around the y = x line]

betaSd2_manual <- sqrt((n-1)*(rowMeans( (betaJK - rowMeans(betaJK))^2)))

# Comparison
par(mfrow = c(1, 1))
plot(betaSd1_manual, betaSd2_manual,
     xlab = "Coef. (pln comp.)", ylab = "Coeff. (manual comp.)",
     main = "Jackknife variance estimators")

Hence the difference between the values from the package and the values computed by hand.

Created on 2023-11-16 with reprex v2.0.2

The problem likely arises from the initialization in the jackknife loop: we reuse the values of $B$ and $S$ (but not $M$) estimated from the full set model and the iterative algorithm doesn't appear to move away from them.

PLNmodels/R/PLNfit-class.R

Lines 208 to 218 in 2657e50

jacks <- future.apply::future_lapply(seq_len(self$n), function(i) {
  data <- list(Y = Y[-i, , drop = FALSE],
               X = X[-i, , drop = FALSE],
               O = O[-i, , drop = FALSE],
               w = w[-i])
  args <- list(data = data,
               params = list(B = private$B, M = matrix(0, self$n-1, self$p), S = private$S[-i, ]),
               config = config)
  optim_out <- do.call(private$optimizer$main, args)
  optim_out[c("B", "Omega")]
}, future.seed = TRUE)

@mahendra-mariadassou
Collaborator

If we perturb $B$ in the initialization of the jackknife estimators (here a multiplicative perturbation),

PLNmodels/R/PLNfit-class.R

Lines 215 to 219 in f89d7d0

args <- list(data = data,
             params = list(B = private$B * (1 + matrix(runif(min = -0.5, max = 0.5, n = self$d * self$p), self$d, self$p)),
                           M = matrix(0, self$n-1, self$p),
                           S = private$S[-i, , drop = FALSE]),
             config = config)

we get more credible results when comparing jackknife estimators to full set estimators
[figure: jackknife vs. full-set coefficients after perturbed initialization]

and in turn the standard deviations obtained "directly from the package" and "manually" are closer to each other but not yet perfectly aligned (beware of the log-scale), probably because of the difference in starting points:

[figure: package vs. manual jackknife standard deviations, log scale]

@scj-robin
Author

scj-robin commented Nov 16, 2023

The thing is that the manual estimates seem to match the empirical variability of the estimates (observed when simulating several datasets with the same true parameters) better than the package ones do. I copy below more or less the same code as the initial one, increasing the number of observations and adding the empirical sd.

# Test Jackknife variance estimate

library(PLNmodels); library(mvtnorm)

# Make params
data("barents")
n <- nrow(barents$Abundance); p <- ncol(barents$Abundance)
fit <- PLN(barents$Abundance ~ barents$Latitude + barents$Longitude + barents$Depth + barents$Temperature)
X <- lm(barents$Abundance[, 1] ~ barents$Latitude + barents$Longitude + barents$Depth + barents$Temperature, x=TRUE)$x

# Simulate data
n <- 5e2
X <- X[sample(1:nrow(X), n, replace=TRUE), ]
Y <- matrix(rpois(n*p, exp(X%*%fit$model_par$B + rmvnorm(n, sigma=fit$model_par$Sigma))), n, p)
d <- ncol(X)

# Jackknife sd estimate
pln <- PLN(Y ~ -1 + X, control=PLN_param(config_post = list(jackknife=TRUE)))
betaSdPackage <- as.vector(standard_error(pln, type='jackknife'))

# Jackknife sd estimate by hand
betaJK <- matrix(NA, n, p*d)
for(i in 1:n){
  if(i %% round(sqrt(n))==0){cat(i, '')}
  plnTmp <- PLN(Y[-i, ] ~ -1 + X[-i, ], control=PLN_param(trace=0))
  betaJK[i, ]  <- as.vector(plnTmp$model_par$B)
}
cat('\n')
betaSdByHand <- sqrt((n-1)*(colMeans(betaJK^2) - (colMeans(betaJK)^2)))

# Empirical sd estimate 
B <- n
betaEmp <- matrix(NA, B, p*d)
for(b in 1:B){
  if(b %% round(sqrt(B))==0){cat(b, '')}
  Ytmp <- matrix(rpois(n*p, exp(X%*%fit$model_par$B + rmvnorm(n, sigma=fit$model_par$Sigma))), n, p)
  plnTmp <- PLN(Ytmp ~ -1 + X, control=PLN_param(trace=0))
  betaEmp[b, ]  <- as.vector(plnTmp$model_par$B)
}
betaSdEmpirical <- apply(betaEmp, 2, sd)

# Comparison
plot(as.data.frame(cbind(betaSdPackage, betaSdByHand, betaSdEmpirical)))

[figure: pairs plot of betaSdPackage, betaSdByHand and betaSdEmpirical]

@mahendra-mariadassou
Collaborator

mahendra-mariadassou commented Nov 16, 2023

I think this is the same problem. Right now, the jackknife estimators we get "from the package" are underdispersed because the starting point we use makes each jackknife estimator $B_{(i)}$ almost identical to the full dataset one $\hat{B}$.

When you do it by hand, you use a different starting point for each jackknife dataset and get jackknife estimators $B_{(i)}$ that are much more dispersed and better mimic the empirical estimate. We need to change the initialization for the jackknife estimator (in the package) to avoid being stuck at $\hat{B}$. The challenge consists in striking the right balance between recycling previous computations (e.g. reusing $\hat{B}$ as such, as is currently done in the package, which leads to underdispersion) and starting from scratch each time (what you did by hand, which is computationally more intensive).

@jchiquet
Member

Thanks to @scj-robin for the report and to @mahendra-mariadassou for the diagnosis.

As is often the case, it comes down to a problem of local V-EM minima and/or difficult optimization in a flat region of the ELBO, with potentially large variations in parameter values. For prediction this hardly matters, but to our great misfortune we build statistical models where parameter interpretation is important.

This is particularly visible on Stéphane's favorite data, the Barents fish dataset, and not for the first time.

In any case, it's better to restart from scratch than to save computation time. This is easily parallelizable with future_lapply, which hasn't been done yet.
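A self-contained sketch of such a parallel leave-one-out loop, using base R's `parallel` package and `lm` as a stand-in for the PLN fit (`future.apply::future_lapply` would slot in the same way):

```r
library(parallel)

set.seed(3)
n <- 50
x <- rnorm(n); y <- 2 * x + rnorm(n)

# Refit from scratch on each leave-one-out subset (parallelizable loop;
# mc.cores = 1 here to keep the sketch portable)
beta_jk <- unlist(mclapply(seq_len(n), function(i) {
  unname(coef(lm(y[-i] ~ x[-i]))[2])  # slope estimate without sample i
}, mc.cores = 1))

# Tukey's jackknife sd of the slope estimate
sd_jk <- sqrt((n - 1) / n * sum((beta_jk - mean(beta_jk))^2))
```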

@mahendra-mariadassou
Collaborator

TODO: change initialization in variance_jackknife() and variance_bootstrap() to use same starting point as model inferred "from scratch"

@mahendra-mariadassou
Collaborator

I'm happy to report that we have a working fix:

library(PLNmodels); library(mvtnorm)

# Make params
data("barents")
n <- nrow(barents$Abundance); p <- ncol(barents$Abundance)
fit <- PLN(Abundance ~ Latitude + Longitude + Depth + Temperature, data = barents)
X <- model.matrix(Abundance ~ Latitude + Longitude + Depth + Temperature, data = barents)

# Simulate data (n = 100)
n <- 100
X <- X[sample(1:nrow(X), n, replace=TRUE), ]
B0 <- fit$model_par$B; Sigma0 <- fit$model_par$Sigma
Y <- matrix(rpois(n*p, exp(X %*% B0 + mvtnorm::rmvnorm(n, sigma= Sigma0))), n, p)
d <- ncol(X)

# Jackknife sd estimate (using the package)
tictoc::tic()
pln <- PLN(Y ~ -1 + X, control=PLN_param(config_post = list(jackknife=TRUE)))
#> 
#>  Initialization...
#>  Adjusting a full covariance PLN model with nlopt optimizer
#>  Post-treatments...
#>   Computing jackknife variance estimator
#> 
#>  DONE!
tictoc::toc()
#> 4.296 sec elapsed

betaSdPackage <- as.vector(standard_error(pln, type='jackknife'))
jackknife_estimates <- attr(pln$model_par$B, "estimates_jackknife") |> purrr::map("B")
betaJKPackage <- matrix(NA, n, p*d)
for(i in 1:n){
  betaJKPackage[i, ]  <- as.vector(jackknife_estimates[[i]])
}


# Jackknife sd estimate (by hand)
tictoc::tic()
betaJK <- matrix(NA, n, p*d)
for(i in 1:n){
  plnTmp <- PLN(Y[-i, ] ~ -1 + X[-i, ], control=PLN_param(trace=0))
  betaJK[i, ]  <- as.vector(plnTmp$model_par$B)
}
tictoc::toc()
#> 20.509 sec elapsed
betaSdByHand <- sqrt((n-1)*apply(betaJK, 2, function(x) { mean(x^2) - mean(x)^2 }))

We can check that the two options (by hand and through the package) return identical results with a figure

## Check that "automatic" and "manual" jackknife estimates are identical
plot(as.vector(betaJK), as.vector(betaJKPackage),
     ylab = "Computed from package", xlab = "Computed by hand",
     main = "Jackknife estimates (all coefficients, all samples)")
abline(0, 1, col = "red")

Or by looking at the difference between coefficients

max(abs(betaJK - betaJKPackage))
#> [1] 0

But using the package is faster than doing it by hand (4.296 sec versus 20.509 sec in this example, thanks to less overhead). And the jackknife estimator now correctly mimics the empirical sd.

# Empirical sd estimate
B <- n
betaEmp <- matrix(NA, B, p*d)
for(b in 1:B){
  if(b %% 10==0){cat(b, '')}
  Ytmp <- matrix(rpois(n*p, exp(X %*% B0 + rmvnorm(n, sigma= Sigma0))), n, p)
  plnTmp <- PLN(Ytmp ~ -1 + X, control=PLN_param(trace=0))
  betaEmp[b, ]  <- as.vector(plnTmp$model_par$B)
}
#> 10 20 30 40 50 60 70 80 90 100
cat("\n")
betaSdEmpirical <- apply(betaEmp, 2, sd)

# Comparison
plot(as.data.frame(cbind(betaSdPackage, betaSdByHand, betaSdEmpirical)))

Created on 2023-11-17 with reprex v2.0.2

@jchiquet, I'll clean the code and do the PR.

@jchiquet
Member

jchiquet commented Nov 17, 2023

Great news. Thanks a lot for the fix: how did you manage to do it?

mahendra-mariadassou added a commit that referenced this issue Nov 17, 2023
@mahendra-mariadassou
Collaborator

Not in a very smart way: I essentially extracted the computations for the starting point of $M$, $B$ and $S$ into their own function:

PLNmodels/R/utils.R

Lines 256 to 265 in c740f2d

#' barebone function to compute starting points for B and M
#' @importFrom stats lm.fit
starting_point <- function(Y, X, O, w) {
  # Y = responses, X = covariates, O = offsets, w = weights
  n <- nrow(Y); p <- ncol(Y); d <- ncol(X)
  fits <- lm.fit(w * X, w * log((1 + Y)/exp(O)))
  list(B = matrix(fits$coefficients, d, p),
       M = matrix(fits$residuals, n, p),
       S = matrix(.1, n, p))
}

and used it in variance_*() functions.

PLNmodels/R/PLNfit-class.R

Lines 218 to 223 in c740f2d

args <- list(data = data,
             # params = list(B = private$B,
             #               M = matrix(0, self$n-1, self$p),
             #               S = private$S[-i, , drop = FALSE]),
             params = do.call(starting_point, data),
             config = config)

I need to check that the tests pass and we should be good.
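For illustration, the `starting_point()` logic quoted above can be exercised on toy data (the function body is copied from the snippet; the toy dimensions are mine) to check the shapes of the returned initializers:

```r
# Copied from the quoted PLNmodels/R/utils.R snippet (roxygen header omitted)
starting_point <- function(Y, X, O, w) {
  # Y = responses, X = covariates, O = offsets, w = weights
  n <- nrow(Y); p <- ncol(Y); d <- ncol(X)
  fits <- lm.fit(w * X, w * log((1 + Y) / exp(O)))
  list(B = matrix(fits$coefficients, d, p),
       M = matrix(fits$residuals, n, p),
       S = matrix(.1, n, p))
}

set.seed(4)
Y <- matrix(rpois(60, lambda = 5), 20, 3)  # n = 20 samples, p = 3 responses
X <- cbind(1, rnorm(20))                   # d = 2 covariates (incl. intercept)
O <- matrix(0, 20, 3); w <- rep(1, 20)
sp <- starting_point(Y, X, O, w)
sapply(sp, dim)  # B is d x p, M and S are n x p
```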

@jchiquet
Member

Efficiency and accuracy sound smart to me.
