The current dataset being used for evaluation has the following properties:

- Fetch 2048 samples from the historical synthetic dataset
- Fetch 2048 samples from the latest 48 hours of synthetic data
- Evaluate on the 4096 samples gathered
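For illustration, here is a minimal sketch of how such a mixed evaluation set might be assembled, assuming in-memory sample pools (the pool names and function are hypothetical, not the subnet's actual code):

```python
import random

HISTORICAL_SAMPLES = 2048  # drawn from the historical synthetic dataset
RECENT_SAMPLES = 2048      # drawn from the latest 48 hours of synthetic data

def build_eval_set(historical_pool: list, recent_pool: list) -> list:
    """Combine random draws from both pools into one 4096-sample eval set."""
    historical = random.sample(historical_pool, HISTORICAL_SAMPLES)
    recent = random.sample(recent_pool, RECENT_SAMPLES)
    return historical + recent
```

Because the recent pool rolls forward every 48 hours, the evaluation set differs from run to run, which is the source of the scoring variance discussed below.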
While this setup has been effective at preventing blatant overfitting, it has a few limitations. Chiefly, because the dataset changes between evaluations, scores carry a higher degree of variance.
This issue aims to discuss potential solutions and the details of the implementation.
Some potential suggestions from miners:

- Utilize an epoch-based dataset that is fixed in nature. A fixed dataset will reduce variance but requires some safeguards against overfitting (see the sketch after this list).
- Re-evaluate existing models on some cycle. The implementation will also depend on interactions with emissions and on re-calculating model scores in the case of "fluke" results.
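As a rough illustration of the epoch-based idea, the sampler could be seeded with the epoch number, so every validator draws the same fixed set within an epoch while the set still rotates at epoch boundaries to limit overfitting. All names here are hypothetical, not a committed design:

```python
import random

def epoch_eval_set(pool: list, epoch: int, k: int = 4096) -> list:
    """Deterministic per-epoch sample: identical for all validators within
    an epoch, different across epochs, so no single fixed set persists
    long enough to be overfit."""
    rng = random.Random(epoch)  # seed the RNG with the epoch number
    return rng.sample(pool, k)
```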
More recently, the sample size for evaluation has been adjusted as seen here to help reduce variance, but other implementations may be effective as well.
I recommend using a truncated-mean scoring system: the highest and lowest scores are removed, and the average of the remaining scores gives the final result. To save time during evaluation, consider reducing the sample size to 1024 or 512 for each scoring run, and limiting the number of score calculations to 5.
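A minimal sketch of that truncated-mean idea, assuming five independent scoring runs of 512 or 1024 samples each (the scores below are made up for illustration):

```python
def truncated_mean(scores: list[float]) -> float:
    """Drop the single highest and lowest score, average the rest."""
    if len(scores) < 3:
        raise ValueError("need at least 3 scores to trim both extremes")
    trimmed = sorted(scores)[1:-1]  # discard min and max
    return sum(trimmed) / len(trimmed)

# Five hypothetical scoring runs; the 0.93 "fluke" high and the 0.78 low
# are both discarded before averaging.
runs = [0.81, 0.79, 0.93, 0.80, 0.78]
print(truncated_mean(runs))  # 0.8
```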
donaldknoller changed the title from "Address methodologies for reducing evaluation scoring variance" to "Address methodologies for improving evaluation scoring" on Oct 14, 2024