streamline S3 classes #451

sbfnk · 2023-09-08T08:36:14Z

Currently the three models included in the package return S3 class with similar but slightly different contents. Whilst to some degree that's expected because of their different use cases it might be nice to streamline this a bit to simplify the user experience and allow common access functions.

I'm thinking that perhaps as a minimum they should all return args (the result of create_stan_args), fit (the stanfit object) and observations as everything else, I think, could be constructed from these three elements using appropriate summary or extract_... methods.

If doing this changes would be likely to break existing workflows, but as we're introducing breaking changes towards 2.0.0 anyway this might be a good time to do this if at all.

Current returns are:

library("EpiNow2")
library("data.table")

reported_cases <- example_confirmed[1:60]
est_inf <- estimate_infections(reported_cases)
#> Warning: There were 32 divergent transitions after warmup. See
#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
#> to find out why this is a problem and how to eliminate them.
#> Warning: Examine the pairs() plot to diagnose sampling problems

cases <- as.data.table(reported_cases)[, primary := confirm]
cases[, scaling := 0.4]
cases[, meanlog := 1.8][, sdlog := 0.5]
cases <- simulate_secondary(cases, type = "incidence")
est_sec <- estimate_secondary(cases,
  obs = obs_opts(scale = list(mean = 0.2, sd = 0.2), week_effect = FALSE)
)

trunc <- dist_spec(
  mean = convert_to_logmean(3, 2),
  mean_sd = 0.1,
  sd = convert_to_logsd(3, 2),
  sd_sd = 0.1,
  max = 10
)
construct_truncation <- function(index, cases, dist) {
  set.seed(index)
  if (dist$dist == 0) {
    dfunc <- dlnorm
  } else {
    dfunc <- dgamma
  }
  cmf <- cumsum(
    dfunc(
      1:(dist$max + 1),
      rnorm(1, dist$mean_mean, dist$mean_sd),
      rnorm(1, dist$sd_mean, dist$sd_sd)
    )
  )
  cmf <- cmf / cmf[dist$max + 1]
  cmf <- rev(cmf)[-1]
  trunc_cases <- data.table::copy(cases)[1:(.N - index)]
  trunc_cases[
    (.N - length(cmf) + 1):.N, confirm := as.integer(confirm * cmf)
  ]
  return(trunc_cases)
}
example_data <- purrr::map(c(20, 15, 10, 0),
  construct_truncation,
  cases = reported_cases,
  dist = trunc
)
est_trunc <- estimate_truncation(example_data,
  verbose = interactive(),
  chains = 2, iter = 2000
)

names(est_inf)
#> [1] "samples"      "summarised"   "fit"          "args"         "observations"
names(est_sec)
#> [1] "predictions" "posterior"   "data"        "fit"
names(est_trunc)
#> [1] "dist"     "obs"      "last_obs" "cmf"      "data"     "fit"

^{Created on 2023-09-08 with reprex v2.0.2}

The text was updated successfully, but these errors were encountered:

seabbs · 2023-09-08T10:22:54Z

Yes, I agree. This is a nice idea. Helps make a point that the models are really subsets/supersets of each other as well.

jamesmbaazam · 2023-11-27T16:59:23Z

I'm working on this and have a few questions:

and observations as everything else, I think, could be constructed from these three elements using appropriate summary or extract_... methods.

Are all three functions going to return the unformatted stanfit object, which will then be formatted downstream using the extract() and summary() methods?

Is observations supposed to be only the raw data passed through the reported_cases/reports/obs arguments?

estimate_truncation() returns a dist object that's passed on to the other two estimate_*() functions. Should this function be returning that as a fourth element? Alternatively, the truncation distribution can be constructed via a construct_trunc_dist() helper using the returned fit object.

sbfnk · 2024-08-01T11:32:28Z

Are all three functions going to return the unformatted stanfit object, which will then be formatted downstream using the extract() and summary() methods?

To me that sounds like a sensible approach, i.e. S3 methods which contain the stanfit, observations and args.

Is observations supposed to be only the raw data passed through the reported_cases/reports/obs arguments?

Yes

estimate_truncation() returns a dist object that's passed on to the other two estimate_*() functions. Should this function be returning that as a fourth element? Alternatively, the truncation distribution can be constructed via a construct_trunc_dist() helper using the returned fit object.

I think we want an extract_dist() function that works on any of these elements and returns the chosen <dist_spec>. You'd then call something like

dist <- extract_dist(est, distribution = "truncation")

Bisaloo · 2024-08-06T10:24:25Z

I think I'm missing some context here but if I may chime in: I'm never a big fan of including inputs (observations & args) in the output. Is there a specific reason this is necessary?

If the user ever loses tracks of what these are, then something went wrong in their process and any potential fix here (by storing the inputs with the output) would not manage to solve entirely the problems this process issue introduces.

Additionally, using raw stanfit objects provides nice benefits: you don't necessarily have to write custom methods for everything since you can directly leverage the entire ecosystem developed around stanfit objects. (Some related thoughts are shared in this blog post: https://epiverse-trace.github.io/posts/parent-class/.)

We had a related discussion in epiverse-trace/serofoi#89.

seabbs · 2024-08-06T11:22:40Z

I think it's the user responsibility. It's important to teach users to use reproducible scripts, and not carry their data from one session to the other, losing the object provenance.
Code workarounds to try and handle this to the user add complexity to the code, thus reducing maintainability, and paradoxically makes the outputs more difficult to manipulate. And even with this, you can't completely protect the user from losing the object provenance. Hence why I believe that teaching good practices is the only way to solve this issue.

Could you provide some evidence for this? In particular I would be interested in hearing more user evidence as to the relative advantages of methods that "just work" vs the value of learning to map your data to your model fit.

In my experience the output users directly want is rarely the output provided directly by the stan methods. I agree there is a lot of value if having an output that is simple and close to what users may have experience with (though the vast majority of push back I hear from the epiverse ecosystem is that users do not have familiarity with stan and so it seems odd to now see exposing this to them very clearly as being an advantage). A good counter point to you should just output stan models is brms which overloads stan with things like the model fit data to make for an easier user experience.

As a counterpoint to above points in epidist we are directly outputting brms models for this reason but there I think the question is a little different as the output brms gives is much closer to what a user might want.

Bisaloo · 2024-08-06T12:42:01Z

I think we are making related but slightly different points. Please let me know if I misunderstood your point:

I'm not necessarily arguing against having a wrapper class. There may indeed be value in ensuring the methods are more specifically tailored to the epidemiological question addressed by the package, while stan or brms are generic statistic tools. This is my understanding of the point you're making, and I'm not arguing against this, especially if you had specific user feedback in this direction.
However, when designing this wrapper class, I believe there is value to keep it as close to (or even directly inherit) from the parent. This allows:
- Simpler internal code to maintain since fewer steps are required before potential handing it over to the parent method
- Users to use other parent methods for which you didn't have the need for a dedicated method for the wrapper class. Similarly, the output can be passed directly to functions that expect a object looking like the parent class. In other words, a much larger ecosystem is unlocked "for free", without the need for data wrangling.

But my initial point, here and in the other issue, was specifically that:

It is generally not necessary to store the inputs in the output, with rare exceptions (EpiNow2 may be one of them for plots containing both the observed and fitted data). It is absolutely necessary that the user keeps track on their inputs for reproducibility and it is something that technically cannot be handed over to the packages they are using (since we would then have to store the package versions, the seed, the hardware details, etc.). As a result, when designing the custom class, I would recommend stripping it down to only the required information, and dropping the inputs if they are kept only because of reproducibility concerns. Note that this may be what you were discussing here but it's not entirely clear to me, hence why I wanted to double check.

The result of these different considerations led to keeping the stanfit class in serofoi. The outcome may or may not be different here. I think the general considerations still hold true and should probably help designing the class.

As an aside:

though the vast majority of push back I hear from the epiverse ecosystem is that users do not have familiarity with stan

There are a number of points on which there isn't a single viewpoint within the project, hence why I find this kind of discussion useful & informative ☺️

sbfnk · 2024-12-05T16:27:09Z

I think the point about not saving input data makes sense but in this case the rationale to store inputs is that they're used for further processing, i.e. we need the observations to be able to plot them alongside the predictions (and to identify dates), and we need the stan arguments / passed data in order to know which delay variable corresponds to which epidemiological delay (and, once #871 has been merged, other parameters). In principle this information could be stored separately but that would add more overhead so that I think it's easier from a conceptual and maintenance point of view to store the data as they're already passed to stan. As an aside <stanfit> objects already contain the seed(s) used so we do get those, too.

I really like the idea of inheriting from the the stan objects. In this case this is tricky for two reasons:

we would have to rewrite everything into S4 (perhaps doable but as far as I'm concerned would be a learning experience)
we wouldn't be able to support both rstan and cmdstanr any more as they return quite different (S4 vs. R6) objects.

So we might be better off with a list-based class that has a fit member (the stan object) as well as args and observations (i.e. similar to what we have now).

sbfnk mentioned this issue Oct 14, 2023

EpiNow2 2.0.0 roadmap #423

Closed

18 tasks

jamesmbaazam mentioned this issue Oct 31, 2023

Reporting templates and the report package #495

Closed

jamesmbaazam added this to EpiNow2 v2.0.0 Nov 10, 2023

jamesmbaazam moved this to Todo in EpiNow2 v2.0.0 Nov 10, 2023

jamesmbaazam moved this from Todo to In Progress in EpiNow2 v2.0.0 Nov 27, 2023

jamesmbaazam self-assigned this Nov 27, 2023

jamesmbaazam added this to the CRAN v2.0 release milestone Nov 27, 2023

sbfnk removed this from EpiNow2 v2.0.0 May 1, 2024

sbfnk modified the milestones: CRAN v2.0 release, CRAN v1.6 release May 14, 2024

jamesmbaazam removed their assignment Aug 7, 2024

jamesmbaazam mentioned this issue Aug 7, 2024

Return forecasts for easy processing with scoringutils #618

Open

sbfnk modified the milestones: CRAN v1.6 release, CRAN v1.7 release Sep 20, 2024

sbfnk mentioned this issue Dec 9, 2024

break up create_clean_reported_cases() #884

Merged

7 tasks

sbfnk modified the milestones: CRAN v1.7 release, CRAN v1.8 release Dec 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

streamline S3 classes #451

streamline S3 classes #451

sbfnk commented Sep 8, 2023

seabbs commented Sep 8, 2023

jamesmbaazam commented Nov 27, 2023 •

edited

Loading

sbfnk commented Aug 1, 2024

Bisaloo commented Aug 6, 2024

seabbs commented Aug 6, 2024 •

edited

Loading

Bisaloo commented Aug 6, 2024

sbfnk commented Dec 5, 2024 •

edited

Loading

streamline S3 classes #451

streamline S3 classes #451

Comments

sbfnk commented Sep 8, 2023

seabbs commented Sep 8, 2023

jamesmbaazam commented Nov 27, 2023 • edited Loading

sbfnk commented Aug 1, 2024

Bisaloo commented Aug 6, 2024

seabbs commented Aug 6, 2024 • edited Loading

Bisaloo commented Aug 6, 2024

sbfnk commented Dec 5, 2024 • edited Loading

jamesmbaazam commented Nov 27, 2023 •

edited

Loading

seabbs commented Aug 6, 2024 •

edited

Loading

sbfnk commented Dec 5, 2024 •

edited

Loading