Sync eval changes in OLMo/ladder-1xC to here #1
Conversation
There's a timeout in the tests; maybe we need to increase it?
I added one concern about the richer metric output (consequences for logging etc), but otherwise looks good!
    "bpb": torch.tensor(sum(bpb) / len(bpb)),
    "soft": torch.tensor(sum(soft_score) / len(soft_score)),
    "soft_log": torch.tensor(sum(soft_log_score) / len(soft_log_score)),
}
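For context, the snippet above averages per-instance scores into a single metrics dict returned in one pass. Here is a minimal, hypothetical sketch of that aggregation pattern; the names `aggregate_metrics` and `mean` are illustrative (not from the PR), and plain floats stand in for the `torch.tensor` wrapping the real code does.

```python
def aggregate_metrics(bpb, soft_score, soft_log_score):
    """Average per-instance eval scores into one metrics dict.

    Illustrative sketch only: the actual code wraps each value
    in torch.tensor before returning it.
    """
    def mean(xs):
        # Simple arithmetic mean of a non-empty list of floats.
        return sum(xs) / len(xs)

    return {
        "bpb": mean(bpb),
        "soft": mean(soft_score),
        "soft_log": mean(soft_log_score),
    }

metrics = aggregate_metrics([1.0, 3.0], [0.6, 0.4], [-0.5, -0.7])
```

Computing all metrics in one pass like this avoids re-running inference once per metric, which is the change the review comments below discuss.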
This new, richer metric output (along with the correlated changes in olmo-core's evaluator_callback.py) will change the recorded metrics for "acc" and "len_norm" metrics. Do we know that this won't interfere with other parts of the setup? (logging etc)
In general, though, computing these in a single go is definitely the right thing to do, rather than running separate versions of the same task (in fact, "acc" and "len_norm" could also be combined); the only downside is that it's less clear which one is the "official" metric for each task.
Ah that makes a lot of sense! I have updated my edits in olmo-core so that it displays the mapped metric type text.
LGTM
This adds scaling law eval sets as in-loop.
Testing of metric is done in OLMo-core.
Testing tasks: