create csv files with scores to back evals page on variant nowcast hub dashboard #23

elray1 opened this issue Jan 24, 2025 · 2 comments

elray1 commented Jan 24, 2025

These CSV files should live on an orphan branch named predevals/data in the variant-nowcast-hub-dashboard repo. The folder and file structure should look like the following:

|-scores/
|  |-clade prop/
|  |  |-Full season/
|  |  |  |-location/
|  |  |  |  |-scores.csv
|  |  |  |-nowcast_date/
|  |  |  |  |-scores.csv
|  |  |  |-target_date/
|  |  |  |  |-scores.csv
|  |  |  |-horizon/
|  |  |  |  |-scores.csv
|  |  |  |-scores.csv

The scores in scores/clade prop/Full season/scores.csv will be overall average scores for each model, while the scores in subfolders will be disaggregated by the corresponding variable. For example, scores in scores/clade prop/Full season/location/scores.csv will be average scores for each model and location.
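To make the file layout concrete, here is a minimal sketch in Python. The write_scores helper and the summary tables passed to it are illustrative assumptions, not part of the proposal:

from pathlib import Path

import pandas as pd

# Root of the layout described above, on the predevals/data branch.
BASE = Path("scores") / "clade prop" / "Full season"
DISAGGREGATIONS = ["location", "nowcast_date", "target_date", "horizon"]

def write_scores(df: pd.DataFrame, path: Path) -> None:
    # Write one scores table, creating parent folders as needed.
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)

# Overall scores for each model go directly under "Full season/":
#   write_scores(overall_scores, BASE / "scores.csv")
# Scores disaggregated by each variable go one folder deeper:
#   for by in DISAGGREGATIONS:
#       write_scores(scores_by[by], BASE / by / "scores.csv")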

For scores/clade prop/Full season/scores.csv, we expect the following columns (a short aggregation sketch follows the list):

  • "model_id": the id of the model
  • "energy_score": the mean energy score for that model, obtained as an average of energy scores for each combination of location/nowcast_date/target_date that was predicted by that model over the course of the season.
  • "se_point": the mean squared error of predictive means for that model, obtained as an average of squared errors of predictive means for each combination of location/nowcast_date/target_date that was predicted by that model over the course of the season.
  • "interval_coverage_50": empirical coverage rate of marginal 50% prediction intervals for each clade. These should be prediction intervals for the observed data, i.e., obtained including a multinomial sampling step.
  • "interval_coverage_95": empirical coverage rate of marginal 95% prediction intervals for each clade. These should be prediction intervals for the observed data, i.e., obtained including a multinomial sampling step.
  • "n": the number of location/nowcast_date/target_date combinations that were averaged across for this model to compute the mean score

For scores/clade prop/Full season/location/scores.csv (taking disaggregation by location as the example), we expect the following columns:

  • "model_id": the id of the model
  • "location": the id of the location, a state code as used by the variant nowcast hub.
  • "energy_score": the average energy score for that model, obtained as a mean of energy scores for each combination of location/nowcast_date/target_date that was predicted by that model over the course of the season.
  • "interval_coverage_50": empirical coverage rate of marginal 50% prediction intervals for each clade. These should be prediction intervals for the observed data, i.e., obtained including a multinomial sampling step.
  • "interval_coverage_95": empirical coverage rate of marginal 95% prediction intervals for each clade. These should be prediction intervals for the observed data, i.e., obtained including a multinomial sampling step.
  • "n": the number of location/nowcast_date/target_date combinations that were averaged across for this model to compute the mean score

Note: A serious limitation of the above proposal is that the averages will be taken over the different sets of locations and dates predicted by each model, and as a result the scores will not be truly comparable across models. The way we've most often handled this in the past is to calculate relative scores using the procedure outlined here: https://epiforecasts.io/scoringutils/reference/get_pairwise_comparisons.html. Unfortunately, I think it would be challenging for us to use that function directly, since the scoringutils package uses quite a specific representation of scores data behind the scenes. But it would be good to eventually add some approach for comparing models that have submitted predictions for different subsets of locations and dates.
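As a rough illustration of the idea behind that procedure (not a reimplementation of scoringutils, which also scales against a baseline model and supports significance testing), pairwise relative skill could look something like the following sketch, where each pairwise ratio is computed only on the prediction tasks both models covered:

import itertools

import numpy as np
import pandas as pd

TASK_COLS = ["location", "nowcast_date", "target_date"]

def relative_skill(task_scores: pd.DataFrame, score_col="energy_score") -> pd.Series:
    # Geometric mean of pairwise mean-score ratios; each ratio uses only the
    # location/nowcast_date/target_date combinations shared by the two models.
    models = list(task_scores["model_id"].unique())
    ratios = {m: [] for m in models}
    for a, b in itertools.permutations(models, 2):
        a_scores = task_scores.loc[task_scores["model_id"] == a, TASK_COLS + [score_col]]
        b_scores = task_scores.loc[task_scores["model_id"] == b, TASK_COLS + [score_col]]
        shared = a_scores.merge(b_scores, on=TASK_COLS, suffixes=("_a", "_b"))
        if len(shared) == 0:
            continue
        ratios[a].append(shared[f"{score_col}_a"].mean() / shared[f"{score_col}_b"].mean())
    return pd.Series(
        {m: np.exp(np.mean(np.log(r))) if r else np.nan for m, r in ratios.items()},
        name="relative_skill",
    )

Since lower energy scores are better, a relative skill below 1 would indicate a model that tends to outperform the others on the prediction tasks it shares with them.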

It likely makes sense to tackle this issue in stages, e.g. with the following steps (or whatever other stepped approach makes sense to the person doing this work):

  • Add only the overall average energy scores, not including squared errors or interval coverage rates and not broken down by location, nowcast_date, target_date, or horizon.
  • Add scores disaggregated by location, nowcast_date, target_date, and horizon.
  • Add squared error of point predictions.
  • Add interval coverage rates.
  • Add relative skill scores.
nickreich (Member) commented:

This all seems like a good breakdown to me. One question, which might be a more general hub dashboard question: are we locked into the "Full season" name, or could we name it something else? I'm asking because the idea of a "season" may not be as relevant here, where we expect this work to continue in and out of the regular respiratory virus season.

elray1 (Author) commented Jan 24, 2025

Good point! We can name this whatever we want.
