
int_pctl() should pick the _first_ eval time if eval_time = NULL #809

Closed
hfrick opened this issue Jan 16, 2024 · 2 comments
hfrick commented Jan 16, 2024

Part of #766

int_pctl() should pick the first eval time stored in the object when the user leaves eval_time = NULL.

Additionally, choosing an eval time in this case is currently broken (see the reprex below), so we need tests for this in extratests, now filed as tidymodels/extratests#163.
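A minimal sketch of the proposed fallback (a hypothetical helper for illustration, not tune's actual internals): when eval_time is NULL, take the first stored time and warn, mirroring the message that show_best() and friends emit in the reprex below.

# Hypothetical helper sketching the proposed fallback; the function name and
# warning wording are illustrative, not tune's actual internals.
choose_eval_time <- function(stored_eval_times, eval_time = NULL) {
  if (is.null(eval_time) && length(stored_eval_times) > 0) {
    eval_time <- stored_eval_times[1]
    cli::cli_warn(
      "{length(stored_eval_times)} evaluation times are available;
       the first ({eval_time}) will be used."
    )
  }
  eval_time
}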

library(tidymodels)
library(censored)
#> Loading required package: survival

set.seed(1)
sim_dat <- prodlim::SimSurv(500) %>%
  mutate(event_time = Surv(time, event)) %>%
  select(event_time, X1, X2) %>%
  as_tibble()

set.seed(2)
sim_rs <- vfold_cv(sim_dat)

time_points <- c(10, 1, 5, 15)

mod_spec <-
  proportional_hazards(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("censored regression")

grid <- tibble(penalty = 10^c(-4, -2, -1))

gctrl <- control_grid(save_pred = TRUE, save_workflow = TRUE)

mix_mtrc <- metric_set(brier_survival, brier_survival_integrated, concordance_survival)

set.seed(2193)
tune_res <-
  mod_spec %>%
  tune_grid(
    event_time ~ X1 + X2,
    resamples = sim_rs,
    grid = grid,
    metrics = mix_mtrc,
    eval_time = time_points,
    control = gctrl
  )


# `eval_time` across different functions --------------------------------

res <- show_best(tune_res)
#> Warning in show_best(tune_res): No value of `metric` was given;
#> "brier_survival" will be used.
#> Warning: 4 evaluation times are available; the first (10) will be used.

res <- select_best(tune_res)
#> Warning in select_best(tune_res): No value of `metric` was given; "brier_survival" will be used.
#> 4 evaluation times are available; the first (10) will be used.

res <- select_by_one_std_err(tune_res, penalty)
#> Warning in select_by_one_std_err(tune_res, penalty): No value of `metric` was given; "brier_survival" will be used.
#> 4 evaluation times are available; the first (10) will be used.

res <- select_by_pct_loss(tune_res, penalty)
#> Warning in select_by_pct_loss(tune_res, penalty): No value of `metric` was given; "brier_survival" will be used.
#> 4 evaluation times are available; the first (10) will be used.

res <- autoplot(tune_res)
#> Warning in filter_plot_eval_time(dat, eval_time): No evaluation time was set; a
#> value of 5 was used.

res <- int_pctl(tune_res)
#> Warning in int_pctl(tune_res): No evaluation time was set; a value of 5 was
#> used.
#> Error in `purrr::map2()`:
#> ℹ In index: 1.
#> Caused by error in `dplyr::rename()`:
#> ! Can't rename columns that don't exist.
#> ✖ Column `term` doesn't exist.

res <- augment(tune_res)
#> Warning: 4 evaluation times are available; the first (10) will be used.

res <- fit_best(tune_res)

Created on 2024-01-16 with reprex v2.0.2

@topepo
Copy link
Member

topepo commented Jan 24, 2024

I think that we should make the default (i.e., eval_time = NULL) compute the intervals for all evaluation times.

For the select/show APIs, we have to pick a single value for those operations. In autoplot(), the main reason to plot a single time’s results is so that we don’t have a ton of facets (perhaps making the visualization unreadable).
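For instance, the single-value selectors can be given an explicit choice to avoid the fallback warnings (a sketch assuming the metric and eval_time arguments referenced in the warnings above):

# Passing the metric and evaluation time explicitly avoids the fallback
# warnings shown in the reprex (sketch; values taken from the warnings above).
res <- show_best(tune_res, metric = "brier_survival", eval_time = 10)
res <- select_best(tune_res, metric = "brier_survival", eval_time = 10)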

I think the use case is different for int_pctl(). If the user specifies multiple times, they are probably interested in the results across times, and percentile intervals across time would be very informative:

library(tidymodels)
library(censored)
#> Loading required package: survival
library(doMC)
#> Loading required package: foreach
#> 
#> Attaching package: 'foreach'
#> The following objects are masked from 'package:purrr':
#> 
#>     accumulate, when
#> Loading required package: iterators
#> Loading required package: parallel
tidymodels_prefer()
theme_set(theme_bw())
options(pillar.advice = FALSE, pillar.min_title_chars = Inf)
registerDoMC(cores = parallel::detectCores())
set.seed(1)
sim_dat <- prodlim::SimSurv(500) %>%
  mutate(event_time = Surv(time, event)) %>%
  select(event_time, X1, X2)

set.seed(2)
split <- initial_split(sim_dat)
sim_tr <- training(split)
sim_te <- testing(split)
sim_rs <- vfold_cv(sim_tr)

time_points <- 1:20

mod_spec <-
  bag_tree() %>%
  set_mode("censored regression")

rsctrl <- control_resamples(save_pred = TRUE)

mix_mtrc  <- metric_set(brier_survival)

set.seed(2193)
rs__res <-
  mod_spec %>%
  fit_resamples(
    event_time ~ X1 + X2,
    resamples = sim_rs,
    metrics = mix_mtrc,
    eval_time = time_points,
    control = rsctrl
  )

set.seed(1)
int_pctl(rs__res, times = 2000, eval_time = time_points) %>% 
  ggplot(aes(.eval_time)) + 
  geom_line(aes(y = .estimate)) + 
  geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha = 1 / 10) +
  labs(x = "Time", y = "Brier Score")

If they don't want them all, they could opt in to excluding time points.
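For example, passing a subset of the stored times to the eval_time argument would compute intervals only for those (a sketch based on the int_pctl() call above; the specific times are illustrative):

# Opting in to a subset of the stored evaluation times (sketch; reuses the
# `eval_time` argument from the call above).
set.seed(1)
int_pctl(rs__res, times = 2000, eval_time = c(1, 5, 10))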

I think that multiple metrics is a good analogy; if they specify all of our event time metrics, they probably want intervals for all of them.

@topepo topepo closed this as completed Jan 25, 2024
github-actions bot commented Feb 9, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 9, 2024