-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Jackknife estimate of the variance #112
Comments
Hi @scj-robin, correct me if I'm wrong but the difference likely arises from the formula you used for the jackknife. Forgetting matrix notations for a bit, and noting whereas we use Tukey's expression, which is quite different (although slightly conservative, see for example DOI:10.1214/aos/1176345462) as shown here (where Line 225 in 2657e50
|
I do not think I use the first expression you suggest. I think that line betaSd2 <- sqrt((n-1)*(colMeans(betaJK^2) - (colMeans(betaJK)^2))) does compute the second formula you mention as the difference of colMeans is precisely |
You're completely right and I should stop trying to understand code late at night. I'll look into it. |
OK, I think I understand the source of the problem: our jackknife estimates are not variable enough. I made a branch to export the jackknife estimates (not just the jackknife variance estimator) and got this # Test Jackknife variance estimate
library(PLNmodels)
#> This is packages 'PLNmodels' version 1.0.4-0200
#> Use future::plan(multicore/multisession) to speed up PLNPCA/PLNmixture/stability_selection.
# Make params
data("barents")
## To keep things fast, use only n = 20 samples
barents <- barents[1:20, ]
n <- nrow(barents$Abundance); p <- ncol(barents$Abundance)
fit <- PLN(Abundance ~ Latitude + Longitude + Depth + Temperature, data = barents)
#>
#> Initialization...
#> Adjusting a full covariance PLN model with nlopt optimizer
#> Post-treatments...
#> DONE!
X <- model.matrix(Abundance ~ Latitude + Longitude + Depth + Temperature, data = barents)
# Simulate data
B0 <- fit$model_par$B; Sigma0 <- fit$model_par$Sigma
Y <- matrix(rpois(n*p, exp(X %*% B0 + mvtnorm::rmvnorm(n, sigma= Sigma0))), n, p)
d <- ncol(X)
# All jackknife estimates
pln <- PLN(Y ~ -1 + X, control=PLN_param(config_post = list(jackknife=TRUE)))
#>
#> Initialization...
#> Adjusting a full covariance PLN model with nlopt optimizer
#> Post-treatments...
#> Computing jackknife variance estimator
#>
#> DONE!
jackknife_estimates <- attr(pln$model_par$B, "estimates_jackknife") |> purrr::map("B")
betaJK_pln <- matrix(NA, p*d, n)
## Compare jackknife estimates to full set estimate
par(mfrow = c(4, 5))
for(i in 1:n){
betaJK_pln[, i] <- as.vector(jackknife_estimates[[i]])
plot(pln$model_par$B, jackknife_estimates[[i]],
xlab = "Coef. (full set)", ylab = "Coeff. (jackknife)",
main = paste("Jackknife", i))
abline(a = 0, b = 1, col = "red")
} All jackknife estimators (computed within the PLN code) align perfectly with the full set estimators 😞 ## Compute jackknife variance manually from jackknife estimates and compare with function output
## Use this form because the other leads to negative values due to numeric precision
betaSd1_manual <- sqrt((n-1)*(rowMeans( (betaJK_pln - rowMeans(betaJK_pln))^2)))
betaSd1_pln <- as.vector(standard_error(pln, type='jackknife'))
par(mfrow = c(1, 1))
plot(betaSd1_manual, betaSd1_pln,
xlab = "Coef. (manual comp.)", ylab = "Coeff. (pln comp.)",
main = "Jackknife variance estimators")
abline(a = 0, b = 1, col = "red") No problem when computing the jackknife variance estimators from the jackknife estimators (manual computations are equal to computations done in the package ✅ ) # Jacknife sd estimate by hand
betaJK <- matrix(NA, p*d, n)
par(mfrow = c(4, 5))
for(i in 1:n){
# cat(i, sep = "\n")
plnTmp <- PLN(Y[-i, ] ~ -1 + X[-i, ], control = PLN_param(trace = 0))
plot(pln$model_par$B, plnTmp$model_par$B,
xlab = "Coef. (full set)", ylab = "Coeff. (jackknife)",
main = paste("Jackknife", i))
abline(a = 0, b = 1, col = "red")
betaJK[, i] <- as.vector(plnTmp$model_par$B)
} The jackknife estimators (computed manually) don't align perfectly with the full set estimators ✅ betaSd2_manual <- sqrt((n-1)*(rowMeans( (betaJK - rowMeans(betaJK))^2)))
# Comparison
par(mfrow = c(1, 1))
plot(betaSd1_manual, betaSd2_manual,
xlab = "Coef. (pln comp.)", ylab = "Coeff. (manual comp.)",
main = "Jackknife variance estimators") Hence the difference between the values from the package and the values computed by hand. Created on 2023-11-16 with reprex v2.0.2 The problem likely arises from the initialization in the jackknife loop: we reuse the values of Lines 208 to 218 in 2657e50
|
If we perturb Lines 215 to 219 in f89d7d0
we get more credible results when comparing jackknife estimators to full set estimators and in turn the standard deviations obtained "directly from the package" and "manually" are closer to each other but not yet perfectly aligned (beware of the log-scale), probably because of the difference in starting points: |
The thing is that the manual estimates seem to better fit the package ones, when compared to empirical variability of the estimates observed when simulating several datasets with same true parameters. I copy below more or less the same code as the initial one increasing the number of observations et adding the empirical sd. # Test Jackknife variance estimate
library(PLNmodels); library(mvtnorm)
# Make parms
data("barents")
n <- nrow(barents$Abundance); p <- ncol(barents$Abundance)
fit <- PLN(barents$Abundance ~ barents$Latitude + barents$Longitude + barents$Depth + barents$Temperature)
X <- lm(barents$Abundance[, 1] ~ barents$Latitude + barents$Longitude + barents$Depth + barents$Temperature, x=TRUE)$x
# Simulate data
n <- 5e2
X <- X[sample(1:nrow(X), n, replace=TRUE), ]
Y <- matrix(rpois(n*p, exp(X%*%fit$model_par$B + rmvnorm(n, sigma=fit$model_par$Sigma))), n, p)
d <- ncol(X)
# Jacknife sd estimate
pln <- PLN(Y ~ -1 + X, control=PLN_param(config_post = list(jackknife=TRUE)))
betaSdPackage <- as.vector(standard_error(pln, type='jackknife'))
# Jacknife sd estimate by hand
betaJK <- matrix(NA, n, p*d)
for(i in 1:n){
if(i %% round(sqrt(n))==0){cat(i, '')}
plnTmp <- PLN(Y[-i, ] ~ -1 + X[-i, ], control=PLN_param(trace=0))
betaJK[i, ] <- as.vector(plnTmp$model_par$B)
}
cat('\n')
betaSdByHand <- sqrt((n-1)*(colMeans(betaJK^2) - (colMeans(betaJK)^2)))
# Empirical sd estimate
B <- n
betaEmp <- matrix(NA, B, p*d)
for(b in 1:B){
if(b %% round(sqrt(B))==0){cat(b, '')}
Ytmp <- matrix(rpois(n*p, exp(X%*%fit$model_par$B + rmvnorm(n, sigma=fit$model_par$Sigma))), n, p)
plnTmp <- PLN(Ytmp ~ -1 + X, control=PLN_param(trace=0))
betaEmp[b, ] <- as.vector(plnTmp$model_par$B)
}
betaSdEmpirical <- apply(betaEmp, 2, sd)
# Comparison
plot(as.data.frame(cbind(betaSdPackage, betaSdByHand, betaSdEmpirical))) |
I think this is the same problem. Right now, the jackknife estimators we get "from the package" are underdispersed because the starting point we use makes each jackknife estimator When you do by it by hand, you use a different starting point for each jackknife dataset and you get jackknife estimator |
Thank @scj-robin for the report and to @mahendra-mariadassou for the diagnostic. As is often the case, we are reduced to a problem of local V-EM minima and/or difficult optimization in a flat area of the ELBO, with potentially strong variations in parameter values. As far as prediction is concerned, we don't really care, but to our great misfortune we do statistical models where parameter interpretation is important. This is particularly visible on Stéphane's favorite data, Barent Fish, and not for the first time. In any case, it's better to start again from scratch than to save calculation time. Easily parallelizable with future_lapply, which hasn't been done yet. |
TODO: change initialization in |
I'm happy to report that we have a working fix: library(PLNmodels); library(mvtnorm)
# Make params
data("barents")
n <- nrow(barents$Abundance); p <- ncol(barents$Abundance)
fit <- PLN(Abundance ~ Latitude + Longitude + Depth + Temperature, data = barents)
X <- model.matrix(Abundance ~ Latitude + Longitude + Depth + Temperature, data = barents)
# Simulate data (n = 100)
n <- 100
X <- X[sample(1:nrow(X), n, replace=TRUE), ]
B0 <- fit$model_par$B; Sigma0 <- fit$model_par$Sigma
Y <- matrix(rpois(n*p, exp(X %*% B0 + mvtnorm::rmvnorm(n, sigma= Sigma0))), n, p)
d <- ncol(X)
# Jackknife sd estimate (using the package)
tictoc::tic()
pln <- PLN(Y ~ -1 + X, control=PLN_param(config_post = list(jackknife=TRUE)))
#>
#> Initialization...
#> Adjusting a full covariance PLN model with nlopt optimizer
#> Post-treatments...
#> Computing jackknife variance estimator
#>
#> DONE!
tictoc::toc()
#> 4.296 sec elapsed
betaSdPackage <- as.vector(standard_error(pln, type='jackknife'))
jackknife_estimates <- attr(pln$model_par$B, "estimates_jaccknife") |> purrr::map("B")
betaJKPackage <- matrix(NA, n, p*d)
for(i in 1:n){
betaJKPackage[i, ] <- as.vector(jackknife_estimates[[i]])
}
# Jackknife sd estimate (by hand)
tictoc::tic()
betaJK <- matrix(NA, n, p*d)
for(i in 1:n){
plnTmp <- PLN(Y[-i, ] ~ -1 + X[-i, ], control=PLN_param(trace=0))
betaJK[i, ] <- as.vector(plnTmp$model_par$B)
}
tictoc::toc()
#> 20.509 sec elapsed
betaSdByHand <- sqrt((n-1)*apply(betaJK, 2, function(x) { mean(x^2) - mean(x)^2 })) We can check that the two options (by hand and through the package) return identical results with a figure ## Check that "automatic" and "manual" jackknife estimates are identical
plot(as.vector(betaJK), as.vector(betaJKPackage),
ylab = "Computed from package", xlab = "Computed by hand",
main = "Jackknife estimates (all coefficients, all samples)")
abline(0, 1, col = "red") Or by looking at the difference between coefficients max(abs(betaJK - betaJKPackage))
#> [1] 0 But using the package is faster (4.296 sec versus 20.509 sec in this example) than doing it by hand (less overhead). And the jackknife estimator now mimics correctly the empirical sd. # Empirical sd estimate
B <- n
betaEmp <- matrix(NA, B, p*d)
for(b in 1:B){
if(b %% 10==0){cat(b, '')}
Ytmp <- matrix(rpois(n*p, exp(X %*% B0 + rmvnorm(n, sigma= Sigma0))), n, p)
plnTmp <- PLN(Ytmp ~ -1 + X, control=PLN_param(trace=0))
betaEmp[b, ] <- as.vector(plnTmp$model_par$B)
}
#> 10 20 30 40 50 60 70 80 90 100
cat("\n")
betaSdEmpirical <- apply(betaEmp, 2, sd)
# Comparison
plot(as.data.frame(cbind(betaSdPackage, betaSdByHand, betaSdEmpirical))) Created on 2023-11-17 with reprex v2.0.2 @jchiquet, I'll clean the code and do the PR. |
Great news. Thanks a lot for the fix: how did you manage to do it? |
…`PLN()`, based on problems reported in #112.
Not in a very smart way, I essentially extracted the computations for the starting point of Lines 256 to 265 in c740f2d
and used it in Lines 218 to 223 in c740f2d
I need to check that the tests pass and we should be good. |
Efficiency and accuracy sound smart to me. |
I do not get the same estimate of the sd using the jackknife option of PLN and doing it by hand. Did I do smthing wrong?
The text was updated successfully, but these errors were encountered: