
Address methodologies for improving evaluation scoring #80

Open
donaldknoller opened this issue Oct 11, 2024 · 2 comments
donaldknoller commented Oct 11, 2024

The current dataset being used for evaluation has the following properties:

  1. Fetch 2048 samples from the historical synthetic dataset
  2. Fetch 2048 samples from the latest 48 hours of synthetic data
  3. Evaluate on the 4096 samples gathered (sketched below)
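
For illustration, a minimal sketch of this sampling scheme. The `fetch_synthetic_samples` helper and its parameters are hypothetical stand-ins, not the actual dataset API:

```python
import random
import time

# Hypothetical sketch of the current sampling scheme; `fetch_synthetic_samples`
# and its parameters are assumptions, not the real dataset API.
def build_eval_set(fetch_synthetic_samples, n_historical=2048, n_recent=2048,
                   recent_window_hours=48, seed=None):
    now = time.time()
    cutoff = now - recent_window_hours * 3600

    # 2048 samples drawn from the historical portion of the synthetic dataset
    historical = fetch_synthetic_samples(end_time=cutoff, limit=n_historical)
    # 2048 samples drawn from the most recent 48 hours of synthetic data
    recent = fetch_synthetic_samples(start_time=cutoff, end_time=now, limit=n_recent)

    samples = historical + recent          # 4096 samples total
    random.Random(seed).shuffle(samples)   # mix old and new data before evaluation
    return samples
```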

While this setup has been effective at preventing blatant overfitting, it has a few limitations. Chiefly, the constantly changing dataset introduces a higher degree of variance in scoring.

This issue aims to discuss potential solutions and the details of the implementation.
Some potential suggestions from miners:

  1. Utilize an epoch-based dataset that is fixed in nature. A fixed dataset will reduce variance but requires some safeguards against overfitting (see the sketch after this list)
  2. Re-evaluate existing models on some cycle. Implementation for this will also depend on interactions with emissions and on re-scoring models in the case of "fluke" scores
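
As a starting point for suggestion 1, a rough sketch of how an epoch-based fixed dataset could be derived deterministically, so every validator scores the same samples within an epoch and the selection rotates at epoch boundaries to limit overfitting. The epoch length, seeding scheme, and helper names are assumptions, not an agreed design:

```python
import hashlib
import random

# Illustrative sketch of an epoch-based fixed evaluation set (suggestion 1).
EPOCH_LENGTH_BLOCKS = 7200  # assumed epoch length, not a real parameter

def epoch_seed(block_height: int) -> int:
    """Derive a deterministic seed shared by all validators for the epoch."""
    epoch = block_height // EPOCH_LENGTH_BLOCKS
    digest = hashlib.sha256(f"eval-epoch-{epoch}".encode()).hexdigest()
    return int(digest, 16) % (2**32)

def fixed_epoch_samples(sample_pool: list, block_height: int, n: int = 4096) -> list:
    """Select the same n samples for everyone during an epoch; the selection
    rotates at the epoch boundary, which limits long-term overfitting."""
    rng = random.Random(epoch_seed(block_height))
    return rng.sample(sample_pool, k=min(n, len(sample_pool)))
```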

More recently, the sample size for evaluation has been adjusted (as seen here) to help reduce variance, but other implementations may be effective as well.


donaldknoller commented Oct 14, 2024

Additional suggestion from @torquedrop:

I recommend using the truncated mean scoring system. In this system, the highest and lowest scores are removed, and the average of the remaining scores is used to calculate the final result. To save time during score evaluation, consider reducing the sample size to 1024 or 512 for each scoring instance, and limiting the number of score calculations to 5.
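
For reference, a minimal sketch of the truncated-mean idea, assuming an existing `evaluate_model` routine (a placeholder name) and several smaller sample sets, e.g. 5 scoring passes of 512 or 1024 samples each:

```python
# Minimal sketch of the truncated-mean suggestion: run several independent
# scoring passes on smaller sample sets, drop the best and worst pass, and
# average the rest. `evaluate_model` is a placeholder for the existing
# per-sample-set scoring routine.
def truncated_mean_score(evaluate_model, model, sample_sets):
    # e.g. 5 scoring passes of 512 or 1024 samples each
    scores = sorted(evaluate_model(model, samples) for samples in sample_sets)
    if len(scores) <= 2:
        # not enough passes to trim; fall back to a plain mean
        return sum(scores) / len(scores)
    trimmed = scores[1:-1]  # drop the single highest and lowest score
    return sum(trimmed) / len(trimmed)
```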

donaldknoller changed the title from "Address methodologies for reducing evaluation scoring variance" to "Address methodologies for improving evaluation scoring" on Oct 14, 2024
donaldknoller added the "discussion" label on Oct 15, 2024
donaldknoller commented:

Implementations to add:

  1. When querying the dataset API, use query parameters to specify the time range
  2. Record which query range was used when evaluating (a sketch follows below)
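
A rough sketch of what this could look like; the endpoint, parameter names, and response shape below are assumptions about the dataset API, not its actual interface:

```python
import logging
import requests

logger = logging.getLogger(__name__)

# Sketch only: `/samples`, `start`, `end`, and `limit` are assumed names,
# not the dataset API's real interface.
def fetch_samples_in_range(base_url, start_ts, end_ts, limit=2048):
    params = {"start": start_ts, "end": end_ts, "limit": limit}
    resp = requests.get(f"{base_url}/samples", params=params, timeout=30)
    resp.raise_for_status()
    # record which query range was used so scores can be traced back to it
    logger.info("evaluation samples fetched for range %s..%s (limit=%d)",
                start_ts, end_ts, limit)
    return resp.json()
```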
