create csv files with scores to back evals page on variant nowcast hub dashboard #23

elray1 opened this issue Jan 24, 2025 · 2 comments

elray1 commented Jan 24, 2025

These CSV files should live on an orphan branch named predevals/data in the variant-nowcast-hub-dashboard repo. The folder and file structure should look like the following:

|-scores/
|  |-clade prop/
|  |  |-Full season/
|  |  |  |-location/
|  |  |  |  |-scores.csv
|  |  |  |-nowcast_date/
|  |  |  |  |-scores.csv
|  |  |  |-target_date/
|  |  |  |  |-scores.csv
|  |  |  |-horizon/
|  |  |  |  |-scores.csv
|  |  |  |-scores.csv

The scores in scores/clade prop/Full season/scores.csv will be overall average scores for each model, while the scores in subfolders will be disaggregated by the corresponding variable. For example, scores in scores/clade prop/Full season/location/scores.csv will be average scores for each model and location.
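To make the file layout concrete, here is a minimal sketch in Python. The write_scores helper and the summary tables passed to it are illustrative assumptions, not part of the proposal:

from pathlib import Path

import pandas as pd

# Root of the layout described above, on the predevals/data branch.
BASE = Path("scores") / "clade prop" / "Full season"
DISAGGREGATIONS = ["location", "nowcast_date", "target_date", "horizon"]

def write_scores(df: pd.DataFrame, path: Path) -> None:
    # Write one scores table, creating parent folders as needed.
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)

# Overall scores for each model go directly under "Full season/":
#   write_scores(overall_scores, BASE / "scores.csv")
# Scores disaggregated by each variable go one folder deeper:
#   for by in DISAGGREGATIONS:
#       write_scores(scores_by[by], BASE / by / "scores.csv")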

For scores/clade prop/Full season/scores.csv, we expect the following columns (a short aggregation sketch follows the list):

  • "model_id": the id of the model
  • "energy_score": the mean energy score for that model, obtained as an average of energy scores for each combination of location/nowcast_date/target_date that was predicted by that model over the course of the season.
  • "se_point": the mean squared error of predictive means for that model, obtained as an average of squared errors of predictive means for each combination of location/nowcast_date/target_date that was predicted by that model over the course of the season.
  • "interval_coverage_50": empirical coverage rate of marginal 50% prediction intervals for each clade. These should be prediction intervals for the observed data, i.e., obtained including a multinomial sampling step.
  • "interval_coverage_95": empirical coverage rate of marginal 95% prediction intervals for each clade. These should be prediction intervals for the observed data, i.e., obtained including a multinomial sampling step.
  • "n": the number of location/nowcast_date/target_date combinations that were averaged across for this model to compute the mean score

For scores/clade prop/Full season/location/scores.csv (taking disaggregation by location as the example), we expect the following columns:

  • "model_id": the id of the model
  • "location": the id of the location, a state code as used by the variant nowcast hub.
  • "energy_score": the average energy score for that model, obtained as a mean of energy scores for each combination of location/nowcast_date/target_date that was predicted by that model over the course of the season.
  • "interval_coverage_50": empirical coverage rate of marginal 50% prediction intervals for each clade. These should be prediction intervals for the observed data, i.e., obtained including a multinomial sampling step.
  • "interval_coverage_95": empirical coverage rate of marginal 95% prediction intervals for each clade. These should be prediction intervals for the observed data, i.e., obtained including a multinomial sampling step.
  • "n": the number of location/nowcast_date/target_date combinations that were averaged across for this model to compute the mean score

Note: A serious limitation of the above proposal is that the averages will be taken over the different sets of locations and dates predicted by each model, and as a result the scores will not be truly comparable across models. The way we've most often handled this in the past is to calculate relative scores using the procedure outlined here: https://epiforecasts.io/scoringutils/reference/get_pairwise_comparisons.html. Unfortunately, I think it would be challenging for us to use that function directly, since the scoringutils package uses quite a specific representation of scores data behind the scenes. But it would be good to eventually add some approach for comparing models that have submitted predictions for different subsets of locations and dates.
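As a rough illustration of the idea behind that procedure (not a reimplementation of scoringutils, which also scales against a baseline model and supports significance testing), pairwise relative skill could look something like the following sketch, where each pairwise ratio is computed only on the prediction tasks both models covered:

import itertools

import numpy as np
import pandas as pd

TASK_COLS = ["location", "nowcast_date", "target_date"]

def relative_skill(task_scores: pd.DataFrame, score_col="energy_score") -> pd.Series:
    # Geometric mean of pairwise mean-score ratios; each ratio uses only the
    # location/nowcast_date/target_date combinations shared by the two models.
    models = list(task_scores["model_id"].unique())
    ratios = {m: [] for m in models}
    for a, b in itertools.permutations(models, 2):
        a_scores = task_scores.loc[task_scores["model_id"] == a, TASK_COLS + [score_col]]
        b_scores = task_scores.loc[task_scores["model_id"] == b, TASK_COLS + [score_col]]
        shared = a_scores.merge(b_scores, on=TASK_COLS, suffixes=("_a", "_b"))
        if len(shared) == 0:
            continue
        ratios[a].append(shared[f"{score_col}_a"].mean() / shared[f"{score_col}_b"].mean())
    return pd.Series(
        {m: np.exp(np.mean(np.log(r))) if r else np.nan for m, r in ratios.items()},
        name="relative_skill",
    )

Since lower energy scores are better, a relative skill below 1 would indicate a model that tends to outperform the others on the prediction tasks it shares with them.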

It likely makes sense to tackle this issue in stages, e.g. with the following steps (or whatever other stepped approach makes sense to the person doing this work):

  • Add only the overall average energy scores, not including squared errors or interval coverage rates and not broken down by location, nowcast_date, target_date, or horizon.
  • Add scores disaggregated by location, nowcast_date, target_date, and horizon.
  • Add squared error of point predictions.
  • Add interval coverage rates.
  • Add relative skill scores.
nickreich (Member) commented:

This all seems like a good breakdown to me. One question, which might be a more general hub dashboard question: are we locked into the "Full season" name, or could we name it something else? I'm asking because the idea of a "season" may not be as relevant here, where we expect this work to continue in and out of the regular respiratory virus season.

elray1 (Author) commented Jan 24, 2025

Good point! We can name this whatever we want.
