Sync eval changes in OLMo/ladder-1xC to here #122
Merged
Changes from 21 commits
23 commits
e06dac7 Change task metric interface to allow multiple metrics per eval (liujch1998)
e433a04 Add downstream callback (liujch1998)
c20b37c Install custom olmo-eval (liujch1998)
8623313 Install custom olmo-eval (liujch1998)
d37ff23 Install custom olmo-eval (liujch1998)
d49bd16 Install custom olmo-eval (liujch1998)
307b5a9 Install custom olmo-eval (liujch1998)
0ab9881 Fix zero rank_batch_size (liujch1998)
0d465f8 Add back comet (liujch1998)
6c39c7b Add more tasks (liujch1998)
be08609 Add more tasks (liujch1998)
0734767 Set drop_last=False (liujch1998)
cb0994a Fix (liujch1998)
9dc4474 Add more tasks (liujch1998)
e503a53 Cleanup (liujch1998)
0b01c06 Make the logged metric key consistent to previous behavior (liujch1998)
e62a5e1 Change eval_batch_size to count in tokens (liujch1998)
70aa809 Merge remote-tracking branch 'origin/main' into moreeval (liujch1998)
a43a377 Fix check (liujch1998)
768d488 Fix check (liujch1998)
eb84ac1 Install upgraded olmo-eval from pypi (liujch1998)
c2c9fd0 Rename downstream => donwstream_evaluator (liujch1998)
7aba244 Merge remote-tracking branch 'origin/main' into moreeval (liujch1998)
@@ -9,6 +9,9 @@
 from olmo_core.optim import AdamWConfig, OptimGroupOverride
 from olmo_core.train import TrainerConfig
 from olmo_core.train.callbacks import CheckpointerCallback, CometCallback, WandBCallback
+from olmo_core.train.callbacks.evaluator_callback import (
+    DownstreamEvaluatorCallbackConfig,
+)


 def build_model_config(common: CommonComponents) -> TransformerConfig:
@@ -73,6 +76,48 @@ def build_trainer_config(common: CommonComponents) -> TrainerConfig:
                 cancel_check_interval=10,
             ),
         )
+        .with_callback(
+            "downstream",
Review comment: Let's call this "downstream_evaluator" to be consistent with the naming convention for other callbacks.
Reply: Thanks, fixed.
+            DownstreamEvaluatorCallbackConfig(
+                tasks=[
+                    "arc_challenge_val_rc_5shot",
+                    "arc_challenge_val_mc_5shot",
+                    "arc_challenge_test_rc_5shot",
+                    "arc_challenge_test_mc_5shot",
+                    "arc_easy_val_rc_5shot",
+                    "arc_easy_val_mc_5shot",
+                    "arc_easy_test_rc_5shot",
+                    "arc_easy_test_mc_5shot",
+                    "boolq_val_rc_5shot",
+                    "boolq_val_mc_5shot",
+                    "csqa_val_rc_5shot",
+                    "csqa_val_mc_5shot",
+                    "hellaswag_val_rc_5shot",
+                    "hellaswag_val_mc_5shot",
+                    "openbookqa_val_rc_5shot",
+                    "openbookqa_val_mc_5shot",
+                    "openbookqa_test_rc_5shot",
+                    "openbookqa_test_mc_5shot",
+                    "piqa_val_rc_5shot",
+                    "piqa_val_mc_5shot",
+                    "socialiqa_val_rc_5shot",
+                    "socialiqa_val_mc_5shot",
+                    "winogrande_val_rc_5shot",
+                    "winogrande_val_mc_5shot",
+                    "mmlu_stem_val_rc_5shot",
+                    "mmlu_stem_val_mc_5shot",
+                    "mmlu_humanities_val_rc_5shot",
+                    "mmlu_humanities_val_mc_5shot",
+                    "mmlu_social_sciences_val_rc_5shot",
+                    "mmlu_social_sciences_val_mc_5shot",
+                    "mmlu_other_val_rc_5shot",
+                    "mmlu_other_val_mc_5shot",
+                ],
+                tokenizer=common.tokenizer,
+                eval_batch_size=16 * 4096,
+                eval_interval=1,
+            ),
+        )
     )
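For readers of the diff, the value `eval_batch_size=16 * 4096` can be read as a budget of 65,536 tokens per batch rather than a count of examples. The sketch below only illustrates that arithmetic; `instances_per_batch` and `padded_seq_len` are hypothetical names introduced here, not part of olmo-core, and the actual callback may compute its per-rank batch size differently.

```python
def instances_per_batch(token_budget: int, padded_seq_len: int) -> int:
    """Return how many sequences padded to `padded_seq_len` fit in `token_budget` tokens.

    Hypothetical helper for illustration only; not an olmo-core function.
    """
    if padded_seq_len <= 0:
        raise ValueError("padded_seq_len must be positive")
    n = token_budget // padded_seq_len
    if n == 0:
        # A budget smaller than one sequence would silently give an empty batch,
        # so fail loudly instead.
        raise ValueError(
            f"token budget {token_budget} is smaller than one sequence of {padded_seq_len} tokens"
        )
    return n


EVAL_BATCH_SIZE = 16 * 4096  # value from the diff, interpreted as tokens

for seq_len in (512, 2048, 4096):
    print(seq_len, instances_per_batch(EVAL_BATCH_SIZE, seq_len))
```

Under this reading, a value meant as an instance count (say 16) would floor-divide to zero against a 4096-token sequence, which is why the unit of `eval_batch_size` matters.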
This causes a bug. I don't see why we should divide batch size by seq len. Batch size was already number of examples.
Where was it set to number of examples instead of tokens? It should always be set in tokens. This change will cause bugs elsewhere.
It's derived from Line 230 of this file, `eval_batch_size`. Do you mean this variable should be in number of tokens? It seems to be a semantic change from the old repo, and each eval's max seq length is dependent on the task so we can't set a fixed batch size ...
It is a change from the old repo.
Well batch size is roughly fixed by number of tokens, not instances. This is more efficient because we can pack more instances from shorter tasks together.
Got it. I've reverted it so now `eval_batch_size` takes number of tokens.
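To make the efficiency argument in this thread concrete, here is a minimal sketch of batching under a token budget. It is not the olmo-core implementation: `batch_by_token_budget` is a hypothetical helper, and it assumes every instance in a batch is padded to the longest instance in that batch.

```python
from typing import Iterable, Iterator, List


def batch_by_token_budget(lengths: Iterable[int], token_budget: int) -> Iterator[List[int]]:
    """Greedily group instance indices so that (batch size * longest instance) <= token_budget.

    Hypothetical illustration; assumes each batch is padded to its longest instance.
    """
    batch: List[int] = []
    longest = 0
    for idx, n in enumerate(lengths):
        if n > token_budget:
            raise ValueError(f"instance {idx} has {n} tokens, over the budget of {token_budget}")
        new_longest = max(longest, n)
        # Close the current batch if adding this instance would exceed the budget.
        if batch and (len(batch) + 1) * new_longest > token_budget:
            yield batch
            batch, new_longest = [], n
        batch.append(idx)
        longest = new_longest
    if batch:
        yield batch


budget = 16 * 4096  # tokens, matching eval_batch_size in the diff

short_task = [256] * 1000   # made-up lengths for a task with short prompts
long_task = [4096] * 1000   # made-up lengths for a task with long prompts

print(len(list(batch_by_token_budget(short_task, budget))))
print(len(list(batch_by_token_budget(long_task, budget))))
```

With the same budget, the short task needs far fewer batches than the long one, whereas a fixed instance count per batch would use the same number of batches for both and leave most of the budget unused on short tasks.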