Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extracting parameters from reported median and range #1

Open
adamkucharski opened this issue Jul 21, 2022 · 6 comments
Open

Extracting parameters from reported median and range #1

adamkucharski opened this issue Jul 21, 2022 · 6 comments

Comments

@adamkucharski
Copy link
Member

Descriptive studies often report summary values such as median, range, or percentiles (e.g. 95%) for estimated incubation periods. We already have functionality to extract parameters for an assumed distribution and its cdf from a reported percentiles using least squares in extract_param, e.g.
extract_param(type = "percentiles", values = c(5.9,21.4), distribution = "lnorm", percentiles = c(0.025,0.975))

However, it would also be useful to be able to extract parameters for an assumed distribution g(x) from a reported median. If we defined our observed median, range and number of samples as vals = c(median, min, max, n), then one option for a function to minimise over two parameters for a lognormal distribution, a and b would be:

fit_function_lnorm_range <- function(param, val) {
  
  # Median square residual
  median_sr <- (plnorm(val[1], meanlog = param[["a"]],sdlog = param[["b"]]) - 0.5)^2 

  # Probability of obtaining min, max and range:
  min_p <- dlnorm(val[2], meanlog = param[["a"]],sdlog = param[["b"]])
  max_p <- dlnorm(val[3], meanlog = param[["a"]],sdlog = param[["b"]])
  range_p <- (plnorm(val[3], meanlog = param[["a"]],sdlog = param[["b"]]) - plnorm(val[2], meanlog = param[["a"]],sdlog = param[["b"]]))^(val[["n"]]-2)
  
  # Range log likelihood
  range_sr <- -log(min_p*max_p*range_p)
  
  # Total value to be minimised
  range_sr + median_sr 
  
}

This seems to be able to recover the correct expected median and range for a given sample size in bootstrap simulations from the estimated distribution. But there may be a more elegant way of defining the function to be minimised.

@sbfnk
Copy link
Contributor

sbfnk commented Jul 22, 2022

I think it's a good idea to implement extraction of distributional parameters from a wide range summary statistics (e.g. mean, variance, CIs etc.). I'm not sure I follow what's going on here though: if f(x) is the distribution of the parameter with CDF F(x) this seems to define a loss function for given median, min, max and n as

(F(median) - 0.5)^2 - log(f(min) f(max) (F(max) - F(min))^(n-2))

Is that right?

@adamkucharski
Copy link
Member Author

There are two parts to the loss function above. One minimises the least square residual for the median, i.e. (F(median) - 0.5)^2 and the other is based on the probability of observing a specific min and max in n observations. The idea is we can define this as:

P(observe min) P(observe max) P(observe n-2 values between min and max) = f(min) f(max) (F(max) - F(min))^(n-2)

And above aims to minimise the log likelihood, by adding the negation of the above to the loss function for the median. Which is quite a rough approach, as it's implicitly weighting the two parts – and from some simulation recoveries, looks like can make a slight difference, e.g. whether or not to use log likelihood or just likelihood.

@adamkucharski
Copy link
Member Author

Reference for range_sr equation above is derived from equation 1 in Gumbel, 1947 – in this rough version, dropped the n multipliers as they wouldn't affect maximum likelihood. From what I recall, had some issues with recovering correct median as well if known, but then need to define some kind of weighing between the likelihood for min/max and likelihood/optimisation for median.

@prabasaj
Copy link

Two suggested updates to extract_param():

  1. Offer option to extract variance-covariance matrix of the extracted distribution parameters (through the Hessian option for optim...).
  2. For type = "range", a median is not necessary (although median adds precision) -- distribution parameters may be estimated from min and max through the log-likelihood function log(f(min) f(max) (F(max) - F(min))^(n-2)) [part of the function suggested by sbfnk].

@kellymccain28
Copy link
Contributor

Another suggested update to extract_param():

  1. For type = 'percentiles', it woudl be useful to be able to include the mean (e.g. in an example from the Ebola epireview database, there is a record for infectious period that has mean and 95% credible intervals, and as it is now, there is only a way to use the percentile information, not the mean, to estimate parameters). I'm not sure of the maths required for this, though

@adamkucharski
Copy link
Member Author

Posting example implementation in current {epiparameter} version for reference:

sample_out <- rlnorm(200,meanlog=2.5,sdlog=0.6)
hist(sample_out)

extract_param(
  type = "range",
  values = c(median(sample_out), round(min(sample_out)), round(max(sample_out))),
  distribution = "lnorm",
  samples = length(sample_out),
  control = list(max_iter = 100)
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants