
revision_summary() outputs high/nonrepresentative time-to-semistable estimates from initial report #584

Open
brookslogan opened this issue Dec 12, 2024 · 2 comments


@brookslogan
Contributor

E.g., on NSSP:

library(tidyverse)
library(epidatr)
#> ! epidatr cache is being used (set env var EPIDATR_USE_CACHE=FALSE if not
#>   intended).
#> ℹ The cache directory is ~/.cache/R/epidatr.
#> ℹ The cache will be cleared after 14 days and will be pruned if it exceeds 4096
#>   MB.
#> ℹ The log of cache transactions is stored at ~/.cache/R/epidatr/logfile.txt.
library(epiprocess)
#> Registered S3 method overwritten by 'tsibble':
#>   method               from 
#>   as_tibble.grouped_df dplyr
#> 
#> Attaching package: 'epiprocess'
#> 
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(epipredict)
#> Loading required package: epidatasets
#> Loading required package: parsnip
#> Registered S3 method overwritten by 'epipredict':
#>   method            from   
#>   print.step_naomit recipes
#> 
#> Attaching package: 'epipredict'
#> 
#> The following object is masked from 'package:ggplot2':
#> 
#>     layer

cce <- covidcast_epidata()

nssp_issues <- cce$signals$`nssp:pct_ed_visits_influenza`$call("state", "*", "*", issues = epirange(123401, 202448))
#> Warning: Using cached results with `as_of` within the past week (or the future!). This
#> will likely result in an invalid cache. Consider
#> ℹ disabling the cache for this session with `disable_cache` or permanently with
#>   environmental variable `EPIDATR_USE_CACHE=FALSE`
#> ℹ setting `EPIDATR_CACHE_MAX_AGE_DAYS=1` to e.g. `3/24` (3 hours).
#> This warning is displayed once every 8 hours.
#> Warning: Loading from the cache at /home/fullname/.cache/R/epidatr; see
#> ~/.cache/R/epidatr/logfile.txt for more details.
#> This warning is displayed once every 8 hours.

nssp_archive <- nssp_issues |>
  select(geo_value, time_value, version = issue, pct_ed_visits_influenza = value) |>
  as_epi_archive(compactify = TRUE)

revision_stats <- revision_summary(nssp_archive, pct_ed_visits_influenza)
#> Min lag (time to first version):
#>      min   median       mean      max
#>   7 days 210 days 227.4 days 770 days
#> Fraction of epi_key+time_values with
#> No revisions:
#> • 3,903 out of 5,145 (75.86%)
#> 
#> Quick revisions (last revision within 3 days of the `time_value`):
#> • 0 out of 5,145 (0%)
#> 
#> Few revisions (At most 3 revisions for that `time_value`):
#> • 5,085 out of 5,145 (98.83%)
#> 
#> 
#> Fraction of revised epi_key+time_values which have:
#> Less than 0.1 spread in relative value:
#> • 917 out of 1,242 (73.83%)
#> 
#> Spread of more than 0.701 in actual value (when revised):
#> • 0 out of 1,242 (0%)
#> 
#> days until within 20% of the latest value:
#>      min   median       mean      max
#>   7 days 210 days 227.8 days 770 days

revision_stats |>
  select(geo_value, time_value, time_near_latest) |>
  group_by(time_value) |>
  summarize(mtnl = mean(time_near_latest)) |>
  print(n = 100)
#> # A tibble: 105 × 2
#>     time_value mtnl          
#>     <date>     <drtn>        
#>   1 2022-09-25 573.85714 days
#>   2 2022-10-02 564.14286 days
#>   3 2022-10-09 557.14286 days
#>   4 2022-10-16 550.14286 days
#>   5 2022-10-23 543.14286 days
#>   6 2022-10-30 536.14286 days
#>   7 2022-11-06 529.14286 days
#>   8 2022-11-13 522.14286 days
#>   9 2022-11-20 515.14286 days
#>  10 2022-11-27 508.14286 days
#>  11 2022-12-04 501.14286 days
#>  12 2022-12-11 494.14286 days
#>  13 2022-12-18 487.14286 days
#>  14 2022-12-25 480.14286 days
#>  15 2023-01-01 473.14286 days
#>  16 2023-01-08 466.14286 days
#>  17 2023-01-15 459.14286 days
#>  18 2023-01-22 452.14286 days
#>  19 2023-01-29 445.14286 days
#>  20 2023-02-05 438.14286 days
#>  21 2023-02-12 431.14286 days
#>  22 2023-02-19 424.14286 days
#>  23 2023-02-26 417.14286 days
#>  24 2023-03-05 410.14286 days
#>  25 2023-03-12 403.14286 days
#>  26 2023-03-19 396.14286 days
#>  27 2023-03-26 389.14286 days
#>  28 2023-04-02 382.14286 days
#>  29 2023-04-09 375.14286 days
#>  30 2023-04-16 368.14286 days
#>  31 2023-04-23 361.14286 days
#>  32 2023-04-30 354.14286 days
#>  33 2023-05-07 347.14286 days
#>  34 2023-05-14 340.14286 days
#>  35 2023-05-21 333.14286 days
#>  36 2023-05-28 326.14286 days
#>  37 2023-06-04 319.14286 days
#>  38 2023-06-11 312.14286 days
#>  39 2023-06-18 305.14286 days
#>  40 2023-06-25 298.14286 days
#>  41 2023-07-02 291.14286 days
#>  42 2023-07-09 284.14286 days
#>  43 2023-07-16 277.14286 days
#>  44 2023-07-23 270.14286 days
#>  45 2023-07-30 263.14286 days
#>  46 2023-08-06 256.14286 days
#>  47 2023-08-13 249.14286 days
#>  48 2023-08-20 242.14286 days
#>  49 2023-08-27 235.14286 days
#>  50 2023-09-03 231.57143 days
#>  51 2023-09-10 221.14286 days
#>  52 2023-09-17 214.14286 days
#>  53 2023-09-24 207.14286 days
#>  54 2023-10-01 200.14286 days
#>  55 2023-10-08 193.14286 days
#>  56 2023-10-15 186.14286 days
#>  57 2023-10-22 179.14286 days
#>  58 2023-10-29 172.14286 days
#>  59 2023-11-05 165.14286 days
#>  60 2023-11-12 158.14286 days
#>  61 2023-11-19 151.14286 days
#>  62 2023-11-26 144.14286 days
#>  63 2023-12-03 137.14286 days
#>  64 2023-12-10 130.14286 days
#>  65 2023-12-17 123.14286 days
#>  66 2023-12-24 116.14286 days
#>  67 2023-12-31 109.14286 days
#>  68 2024-01-07 102.14286 days
#>  69 2024-01-14  95.14286 days
#>  70 2024-01-21  88.14286 days
#>  71 2024-01-28  81.14286 days
#>  72 2024-02-04  74.14286 days
#>  73 2024-02-11  67.14286 days
#>  74 2024-02-18  60.14286 days
#>  75 2024-02-25  53.14286 days
#>  76 2024-03-03  46.14286 days
#>  77 2024-03-10  39.14286 days
#>  78 2024-03-17  32.14286 days
#>  79 2024-03-24  25.14286 days
#>  80 2024-03-31  18.28571 days
#>  81 2024-04-07  12.57143 days
#>  82 2024-04-14  11.85714 days
#>  83 2024-04-21  11.28571 days
#>  84 2024-04-28  11.71429 days
#>  85 2024-05-05  11.42857 days
#>  86 2024-05-12  17.28571 days
#>  87 2024-05-19  11.57143 days
#>  88 2024-05-26  12.14286 days
#>  89 2024-06-02  10.57143 days
#>  90 2024-06-09  10.57143 days
#>  91 2024-06-16  10.57143 days
#>  92 2024-06-23  11.28571 days
#>  93 2024-06-30  11.71429 days
#>  94 2024-07-07  16.28571 days
#>  95 2024-07-14  12.28571 days
#>  96 2024-07-21  12.00000 days
#>  97 2024-07-28  13.28571 days
#>  98 2024-08-04  12.42857 days
#>  99 2024-08-11  10.85714 days
#> 100 2024-08-18  13.28571 days
#> # ℹ 5 more rows
## ggplot(aes(time_value, mtnl)) +
## geom_line()

nssp_archive$DT[, unique(time_value)] |> sort() |> diff()
#> Time differences in days
#>   [1] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
#>  [38] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
#>  [75] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

nssp_archive2 <- nssp_archive$DT[time_value >= as.Date("2024-05-19")] |>
  as_epi_archive()

revision_stats2 <- nssp_archive2 |>
  revision_summary(pct_ed_visits_influenza)
#> Min lag (time to first version):
#>      min median      mean      max
#>   7 days 7 days 12.4 days 168 days
#> Fraction of epi_key+time_values with
#> No revisions:
#> • 468 out of 882 (53.06%)
#> 
#> Quick revisions (last revision within 3 days of the `time_value`):
#> • 0 out of 882 (0%)
#> 
#> Few revisions (At most 3 revisions for that `time_value`):
#> • 877 out of 882 (99.43%)
#> 
#> 
#> Fraction of revised epi_key+time_values which have:
#> Less than 0.1 spread in relative value:
#> • 188 out of 414 (45.41%)
#> 
#> Spread of more than 0.0925 in actual value (when revised):
#> • 9 out of 414 (2.17%)
#> 
#> days until within 20% of the latest value:
#>      min median      mean      max
#>   7 days 7 days 13.9 days 168 days

revision_stats2 |>
  select(geo_value, time_value, time_near_latest) |>
  group_by(time_value) |>
  summarize(mtnl = mean(time_near_latest), mdtnl = median(time_near_latest)) |>
  print(n = 100)
#> # A tibble: 18 × 3
#>    time_value mtnl          mdtnl  
#>    <date>     <drtn>        <drtn> 
#>  1 2024-05-19 11.57143 days  7 days
#>  2 2024-05-26 12.14286 days  7 days
#>  3 2024-06-02 10.57143 days  7 days
#>  4 2024-06-09 10.57143 days  7 days
#>  5 2024-06-16 10.57143 days  7 days
#>  6 2024-06-23 11.28571 days  7 days
#>  7 2024-06-30 11.71429 days  7 days
#>  8 2024-07-07 16.28571 days 14 days
#>  9 2024-07-14 12.28571 days  7 days
#> 10 2024-07-21 12.00000 days  7 days
#> 11 2024-07-28 13.28571 days  7 days
#> 12 2024-08-04 12.42857 days  7 days
#> 13 2024-08-11 10.85714 days  7 days
#> 14 2024-08-18 13.28571 days  7 days
#> 15 2024-08-25 13.00000 days  7 days
#> 16 2024-09-01 29.42857 days 28 days
#> 17 2024-09-08 22.42857 days 21 days
#> 18 2024-09-15 16.85714 days 14 days
  ## ggplot(aes(time_value, mtnl)) +
  ## geom_line() +
  ## geom_line(aes(y = mdtnl), color = "blue")

Created on 2024-12-12 with reprex v2.1.1

@dsweber2
Contributor

I mean, shouldn't this be expected behavior? If you narrow down to time values starting at the first version, you get the "actual" revision behavior.

I suppose we could have a flag that triggers ignoring the time to the very first revision.
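A minimal sketch of what such a flag might look like. Note that the `drop_initial_reports` argument below is hypothetical and not part of the current `revision_summary()` signature; the name and default are just for illustration:

```r
# Hypothetical interface sketch; `drop_initial_reports` does not exist yet.
# It would exclude time_values already present in the archive's earliest
# version before computing lag / time-to-semistable statistics, so a bulk
# initial report wouldn't inflate them.
revision_stats <- revision_summary(
  nssp_archive, pct_ed_visits_influenza,
  drop_initial_reports = TRUE # hypothetical flag
)
```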

@brookslogan brookslogan changed the title revision_summary() outputs high/nonrepresentative time-to-semistable estimates when there's an (API) outage revision_summary() outputs high/nonrepresentative time-to-semistable estimates when there's a rare outage Dec 13, 2024
@brookslogan
Contributor Author

brookslogan commented Dec 13, 2024

Well, it surprised me, I guess. It's not a great experience having to filter the `DT` and reconvert; it might be better once we have a direct filter method.

And this, plus datetime versions, plus out-of-sync fetching, challenges the convenient underlying assumption that we have the entire version history on (-Inf, versions_end]. One alternative would be replacing versions_end with a vector versions_observed (one entry per snapshot recorded). I'm not sure whether this would make the flag above easier to implement.

Based on this experience, I'd hope we'd default the flag to TRUE. I'd guess the mechanism for ignoring the first version would involve filtering time values somehow. However, the simple way to filter would bias things a little toward lower reporting latency than was actually the case: if we just exclude time values <= archive$DT[version == min(version), max(time_value)], then at the starting versions the minimum lag would be 0. Debiasing there seems like it might be a pain, but it may be fine to go without.
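For concreteness, the "simple way" described above could be sketched as follows (this mirrors the manual `nssp_archive2` workaround in the reprex; `cutoff` and `first_version` are illustrative local names, and this inherits the min-lag bias just mentioned):

```r
# Drop time_values that were already present in the earliest snapshot,
# so the bulk initial report doesn't dominate the lag statistics.
first_version <- min(nssp_archive$DT$version)
cutoff <- nssp_archive$DT[version == first_version, max(time_value)]

nssp_archive$DT[time_value > cutoff] |>
  as_epi_archive() |>
  revision_summary(pct_ed_visits_influenza)
```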

@brookslogan brookslogan changed the title revision_summary() outputs high/nonrepresentative time-to-semistable estimates when there's a rare outage revision_summary() outputs high/nonrepresentative time-to-semistable estimates from initial report Dec 13, 2024