streamline S3 classes #451

Open · Tracked by #423
sbfnk opened this issue Sep 8, 2023 · 7 comments

@sbfnk
Contributor

sbfnk commented Sep 8, 2023

Currently the three models included in the package return S3 classes with similar but slightly different contents. Whilst to some degree that's expected because of their different use cases, it might be nice to streamline this a bit to simplify the user experience and allow common access functions.

I'm thinking that perhaps, as a minimum, they should all return args (the result of create_stan_args), fit (the stanfit object) and observations, as everything else, I think, could be constructed from these three elements using appropriate summary or extract_... methods.

Doing this would be likely to break existing workflows, but as we're introducing breaking changes towards 2.0.0 anyway, this might be a good time to do it, if at all.

Current returns are:

library("EpiNow2")
library("data.table")

reported_cases <- example_confirmed[1:60]
est_inf <- estimate_infections(reported_cases)
#> Warning: There were 32 divergent transitions after warmup. See
#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
#> to find out why this is a problem and how to eliminate them.
#> Warning: Examine the pairs() plot to diagnose sampling problems

# simulate scaled and delayed secondary observations from the reported cases
# and fit estimate_secondary()
cases <- as.data.table(reported_cases)[, primary := confirm]
cases[, scaling := 0.4]
cases[, meanlog := 1.8][, sdlog := 0.5]
cases <- simulate_secondary(cases, type = "incidence")
est_sec <- estimate_secondary(cases,
  obs = obs_opts(scale = list(mean = 0.2, sd = 0.2), week_effect = FALSE)
)

# define an uncertain truncation distribution and construct increasingly
# truncated snapshots of the data for estimate_truncation()
trunc <- dist_spec(
  mean = convert_to_logmean(3, 2),
  mean_sd = 0.1,
  sd = convert_to_logsd(3, 2),
  sd_sd = 0.1,
  max = 10
)
construct_truncation <- function(index, cases, dist) {
  # drop the last `index` days and right-truncate the most recent
  # observations using a single draw from the truncation distribution
  set.seed(index)
  if (dist$dist == 0) {
    dfunc <- dlnorm
  } else {
    dfunc <- dgamma
  }
  cmf <- cumsum(
    dfunc(
      1:(dist$max + 1),
      rnorm(1, dist$mean_mean, dist$mean_sd),
      rnorm(1, dist$sd_mean, dist$sd_sd)
    )
  )
  cmf <- cmf / cmf[dist$max + 1]
  cmf <- rev(cmf)[-1]
  trunc_cases <- data.table::copy(cases)[1:(.N - index)]
  trunc_cases[
    (.N - length(cmf) + 1):.N, confirm := as.integer(confirm * cmf)
  ]
  return(trunc_cases)
}
example_data <- purrr::map(c(20, 15, 10, 0),
  construct_truncation,
  cases = reported_cases,
  dist = trunc
)
est_trunc <- estimate_truncation(example_data,
  verbose = interactive(),
  chains = 2, iter = 2000
)

names(est_inf)
#> [1] "samples"      "summarised"   "fit"          "args"         "observations"
names(est_sec)
#> [1] "predictions" "posterior"   "data"        "fit"
names(est_trunc)
#> [1] "dist"     "obs"      "last_obs" "cmf"      "data"     "fit"

Created on 2023-09-08 with reprex v2.0.2
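
For illustration, a rough sketch of the idea that the remaining elements could be derived from fit, args and observations rather than stored (this assumes the rstan backend; the summary/extract generics mentioned below are hypothetical, not existing EpiNow2 API):

# illustrative only: the raw material for `samples`/`summarised` already
# lives in the stanfit object ...
raw_draws <- rstan::extract(est_inf$fit)
head(names(raw_draws))
# ... so a hypothetical generic could rebuild them on demand, e.g.
# summary(est_inf, type = "parameters") or extract_samples(est_inf),
# leaving only `fit`, `args` and `observations` to be stored.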

@seabbs
Contributor

seabbs commented Sep 8, 2023

Yes, I agree, this is a nice idea. It also helps make the point that the models are really subsets/supersets of each other.

@jamesmbaazam
Contributor

jamesmbaazam commented Nov 27, 2023

I'm working on this and have a few questions:

> and observations as everything else, I think, could be constructed from these three elements using appropriate summary or extract_... methods.

Are all three functions going to return the unformatted stanfit object, which will then be formatted downstream using the extract() and summary() methods?

Is observations supposed to be only the raw data passed through the reported_cases/reports/obs arguments?

estimate_truncation() returns a dist object that's passed on to the other two estimate_*() functions. Should this function be returning that as a fourth element? Alternatively, the truncation distribution can be constructed via a construct_trunc_dist() helper using the returned fit object.

@jamesmbaazam jamesmbaazam self-assigned this Nov 27, 2023
@jamesmbaazam jamesmbaazam added this to the CRAN v2.0 release milestone Nov 27, 2023
@sbfnk sbfnk removed this from EpiNow2 v2.0.0 May 1, 2024
@sbfnk
Contributor Author

sbfnk commented Aug 1, 2024

> Are all three functions going to return the unformatted stanfit object, which will then be formatted downstream using the extract() and summary() methods?

To me that sounds like a sensible approach, i.e. S3 objects that contain the stanfit, observations and args.

> Is observations supposed to be only the raw data passed through the reported_cases/reports/obs arguments?

Yes

> estimate_truncation() returns a dist object that's passed on to the other two estimate_*() functions. Should this function be returning that as a fourth element? Alternatively, the truncation distribution can be constructed via a construct_trunc_dist() helper using the returned fit object.

I think we want an extract_dist() function that works on any of these objects and returns the chosen <dist_spec>. You'd then call something like

dist <- extract_dist(est, distribution = "truncation")
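
For illustration, a minimal sketch of how such a generic could be structured (extract_dist() does not exist in the package at this point; the mapping from stan parameter names to epidemiological delays via the stored args is left out):

extract_dist <- function(x, distribution, ...) {
  UseMethod("extract_dist")
}

extract_dist.default <- function(x, distribution = c("truncation", "delay"), ...) {
  distribution <- match.arg(distribution)
  # x$args would identify which stan parameters correspond to the requested
  # delay; x$fit holds their posterior, e.g. via rstan::extract(x$fit).
  # A <dist_spec> would then be built from the relevant posterior summaries.
  stop("sketch only: parameter mapping not implemented")
}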

@Bisaloo
Member

Bisaloo commented Aug 6, 2024

I think I'm missing some context here but if I may chime in: I'm never a big fan of including inputs (observations & args) in the output. Is there a specific reason this is necessary?

If the user ever loses track of what these are, then something went wrong in their process, and any potential fix here (storing the inputs with the output) would not entirely solve the problems that process issue introduces.

Additionally, using raw stanfit objects provides nice benefits: you don't necessarily have to write custom methods for everything since you can directly leverage the entire ecosystem developed around stanfit objects. (Some related thoughts are shared in this blog post: https://epiverse-trace.github.io/posts/parent-class/.)
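
For example, assuming fit is a raw rstan stanfit object, all of the following work without any package-specific methods:

print(fit)                                    # rstan's own print method
summary(fit)$summary                          # per-parameter summaries
draws <- posterior::as_draws_df(fit)          # tidy draws via {posterior}
bayesplot::mcmc_trace(as.array(fit), pars = "lp__")  # standard diagnostics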

We had a related discussion in epiverse-trace/serofoi#89.

@seabbs
Contributor

seabbs commented Aug 6, 2024

> I think it's the user responsibility. It's important to teach users to use reproducible scripts, and not carry their data from one session to the other, losing the object provenance.
> Code workarounds to try and handle this to the user add complexity to the code, thus reducing maintainability, and paradoxically makes the outputs more difficult to manipulate. And even with this, you can't completely protect the user from losing the object provenance. Hence why I believe that teaching good practices is the only way to solve this issue.

Could you provide some evidence for this? In particular I would be interested in hearing more user evidence as to the relative advantages of methods that "just work" vs the value of learning to map your data to your model fit.

In my experience the output users directly want is rarely the output provided by the stan methods. I agree there is a lot of value in having an output that is simple and close to what users may have experience with (though the vast majority of pushback I hear from the epiverse ecosystem is that users are not familiar with stan, so it seems odd to now present exposing this to them as an advantage). A good counterpoint to "you should just output stan models" is brms, which wraps the stan fit together with things like the model data to make for an easier user experience.

As a counterpoint to the above, in epidist we are directly outputting brms models for this reason, but there I think the question is a little different, as the output brms gives is much closer to what a user might want.
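
(For context, and as a sketch only: a brmsfit object is itself a named list that wraps the stanfit together with the inputs, which is roughly the pattern being discussed here.)

# fit <- brms::brm(count ~ zAge, data = brms::epilepsy, chains = 2)
# names(fit)   # includes "fit" (the stanfit), "data", "formula", ...
# fit$fit      # the underlying stanfit object
# fit$data     # the data used to fit the model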

@Bisaloo
Member

Bisaloo commented Aug 6, 2024

I think we are making related but slightly different points. Please let me know if I misunderstood your point:

  • I'm not necessarily arguing against having a wrapper class. There may indeed be value in ensuring the methods are more specifically tailored to the epidemiological question addressed by the package, while stan or brms are generic statistical tools. This is my understanding of the point you're making, and I'm not arguing against this, especially if you had specific user feedback in this direction.
  • However, when designing this wrapper class, I believe there is value in keeping it as close as possible to (or even directly inheriting from) the parent. This allows:
    • Simpler internal code to maintain, since fewer steps are required before potentially handing it over to the parent method
    • Users to use other parent methods for which you didn't need to write a dedicated method for the wrapper class. Similarly, the output can be passed directly to functions that expect an object looking like the parent class. In other words, a much larger ecosystem is unlocked "for free", without the need for data wrangling.

But my initial point, here and in the other issue, was specifically that:

  • It is generally not necessary to store the inputs in the output, with rare exceptions (EpiNow2 may be one of them, for plots containing both the observed and fitted data). It is absolutely necessary that the user keeps track of their inputs for reproducibility, and it is something that technically cannot be handed over to the packages they are using (since we would then have to store the package versions, the seed, the hardware details, etc.). As a result, when designing the custom class, I would recommend stripping it down to only the required information, and dropping the inputs if they are kept only because of reproducibility concerns. Note that this may be what you were discussing here, but it's not entirely clear to me, hence why I wanted to double-check.

These different considerations led to keeping the stanfit class in serofoi. The outcome may or may not be different here, but I think the general considerations still hold true and should help in designing the class.


As an aside:

> though the vast majority of push back I hear from the epiverse ecosystem is that users do not have familiarity with stan

There are a number of points on which there isn't a single viewpoint within the project, hence why I find this kind of discussion useful & informative ☺️

@sbfnk
Contributor Author

sbfnk commented Dec 5, 2024

I think the point about not saving input data makes sense, but in this case the rationale for storing inputs is that they're used for further processing, i.e. we need the observations to be able to plot them alongside the predictions (and to identify dates), and we need the stan arguments / passed data in order to know which delay variable corresponds to which epidemiological delay (and, once #871 has been merged, other parameters). In principle this information could be stored separately, but that would add more overhead, so I think it's easier from a conceptual and maintenance point of view to store the data as they're already passed to stan. As an aside, <stanfit> objects already contain the seed(s) used, so we do get those, too.

I really like the idea of inheriting from the stan objects. In this case this is tricky for two reasons:

  1. we would have to rewrite everything in S4 (perhaps doable, but as far as I'm concerned it would be a learning experience)
  2. we wouldn't be able to support both rstan and cmdstanr any more, as they return quite different (S4 vs. R6) objects.

So we might be better off with a list-based class that has a fit member (the stan object) as well as args and observations (i.e. similar to what we have now).
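
A minimal sketch of what that could look like (constructor and class names are illustrative only, and the backend check is just one possible approach):

new_epinow2_fit <- function(fit, args, observations) {
  stopifnot(
    # accept either an rstan (S4 "stanfit") or cmdstanr (R6 "CmdStanMCMC") fit
    inherits(fit, "stanfit") || inherits(fit, "CmdStanMCMC")
  )
  structure(
    list(fit = fit, args = args, observations = observations),
    class = "epinow2_fit"
  )
}
# downstream methods would then dispatch on class(x$fit) internally,
# e.g. calling rstan::extract() or the $draws() method as appropriate.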
