Sync eval changes in OLMo/ladder-1xC to here #1

Merged: 4 commits, Dec 18, 2024

Conversation

liujch1998
Contributor

This adds scaling-law eval sets as in-loop evals.

Testing of the metric is done in OLMo-core.

Testing the tasks:

>>> from olmo_eval.tasks import list_tasks, build_task

>>> task = build_task(label="arc_challenge_val_rc_5shot", tokenizer=tokenizer)
>>> task
<olmo_eval.tasks.OEEvalTask object at 0x7f5a56b10ed0>
>>> task.metric_type
'len_norm'
>>> len(task.samples)
1194
>>> task.samples[0]
{'doc_id': 0, 'cont_id': 0, 'ctx': [23433, 27, 6086, 5605, 281, 5890, 521, 3564, 4541, 407, 35770, 731, 15, 6758, 4808, 2553, 588, 4711, 253, 954, 4250, 32, 187, 32869, 27, 6079, 38101, 187, 187, 23433, 27, 6758, 273, 253, 1563, 7234, 1682, 11424, 2139, 43733, 3798, 7356, 281, 247, 29794, 3369, 32, 187, 32869, 27, 380, 29794, 3369, 4428, 6871, 15, 187, 187, 23433, 27, 329, 7975, 2540, 275, 8090, 273, 26556, 552, 5561, 954, 2779, 7369, 432, 253, 187, 32869, 27, 5975, 3390, 273, 23396, 267, 10855, 15, 187, 187, 23433, 27, 6758, 273, 841, 513, 10950, 3959, 347, 253, 954, 3332, 8813, 347, 281, 2139, 1142, 6244, 285, 5074, 4962, 562, 387, 253, 990, 273, 253, 19926, 36154, 280, 8685, 32, 187, 32869, 27, 3486, 273, 271, 46125, 3562, 8660, 326, 13230, 253, 23993, 187, 187, 23433, 27, 6758, 273, 253, 1563, 310, 247, 18177, 326, 247, 4370, 1057, 5803, 30686, 432, 697, 4651, 32, 187, 32869, 27, 253, 1979, 273, 697, 27690, 187, 187, 23433, 27, 21587, 285, 3905, 9499, 25059, 4533, 247, 1643, 5113, 1066, 247, 18556, 15, 1583, 971, 281, 923, 534, 1789, 21968, 253, 18337, 3957, 15, 1737, 943, 597, 513, 594, 597, 476, 10280, 616, 5839, 32, 187, 32869, 27], 'continuation': [9272, 253, 5113, 275, 2390, 15], 'ctx_len': 202, 'dc_len': 1, 'cont_len': 6, 'cont_str_len': 26, 'cont_byte_len': 26, 'query': [23433, 27, 6086, 5605, 281, 5890, 521, 3564, 4541, 407, 35770, 731, 15, 6758, 4808, 2553, 588, 4711, 253, 954, 4250, 32, 187, 32869, 27, 6079, 38101, 187, 187, 23433, 27, 6758, 273, 253, 1563, 7234, 1682, 11424, 2139, 43733, 3798, 7356, 281, 247, 29794, 3369, 32, 187, 32869, 27, 380, 29794, 3369, 4428, 6871, 15, 187, 187, 23433, 27, 329, 7975, 2540, 275, 8090, 273, 26556, 552, 5561, 954, 2779, 7369, 432, 253, 187, 32869, 27, 5975, 3390, 273, 23396, 267, 10855, 15, 187, 187, 23433, 27, 6758, 273, 841, 513, 10950, 3959, 347, 253, 954, 3332, 8813, 347, 281, 2139, 1142, 6244, 285, 5074, 4962, 562, 387, 253, 990, 273, 253, 19926, 36154, 280, 8685, 32, 187, 32869, 27, 3486, 273, 271, 46125, 
3562, 8660, 326, 13230, 253, 23993, 187, 187, 23433, 27, 6758, 273, 253, 1563, 310, 247, 18177, 326, 247, 4370, 1057, 5803, 30686, 432, 697, 4651, 32, 187, 32869, 27, 253, 1979, 273, 697, 27690, 187, 187, 23433, 27, 21587, 285, 3905, 9499, 25059, 4533, 247, 1643, 5113, 1066, 247, 18556, 15, 1583, 971, 281, 923, 534, 1789, 21968, 253, 18337, 3957, 15, 1737, 943, 597, 513, 594, 597, 476, 10280, 616, 5839, 32, 187, 32869, 27, 9272, 253, 5113, 275, 2390], 'dc_query': [209, 9272, 253, 5113, 275, 2390], 'label_id': 3}
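The `cont_len` and `cont_str_len` fields in the sample above are what a length-normalized metric such as `'len_norm'` keys off. As a minimal sketch, assuming the score divides the continuation's summed token log-likelihood by its character length (the scoring function and log-probability values here are illustrative, not the actual olmo_eval implementation):

```python
# Hedged sketch of a length-normalized ("len_norm") score, assuming it
# normalizes the continuation's summed token log-likelihood by the
# continuation's character length (cont_str_len). Illustrative only.

def len_norm_score(token_logprobs, cont_str_len):
    """Sum the continuation's per-token log-likelihoods and normalize
    by its character length."""
    return sum(token_logprobs) / cont_str_len

# A 6-token continuation whose string form is 26 characters, matching
# cont_len=6 and cont_str_len=26 in the sample above; log-probs are made up.
score = len_norm_score([-1.2, -0.4, -2.0, -0.7, -1.5, -0.2], 26)
```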

>>> for task in list_tasks():
...     print(task)
... 
piqa
hellaswag
winogrande
openbook_qa
boolq
sciq
arc_easy
arc_easy_ppl
arc_challenge
basic_arithmetic
copa
rte
commitment_bank
mrpc
sst2
commonsense_qa
social_iqa
trivia_qa_wiki_ppl
natural_qs_open_ppl
mmlu_stem_test
mmlu_humanities_test
mmlu_social_sciences_test
mmlu_other_test
mmlu_stem
mmlu_humanities
mmlu_social_sciences
mmlu_other
mmlu_stem_bpb
mmlu_humanities_bpb
mmlu_social_sciences_bpb
mmlu_other_bpb
mmlu_stem_var
mmlu_humanities_var
mmlu_social_sciences_var
mmlu_other_var
mmlu_stem_var_bpb
mmlu_humanities_var_bpb
mmlu_social_sciences_var_bpb
mmlu_other_var_bpb
mmlu_stem_mc_5shot
mmlu_humanities_mc_5shot
mmlu_social_sciences_mc_5shot
mmlu_other_mc_5shot
mmlu_stem_mc_5shot_test
mmlu_humanities_mc_5shot_test
mmlu_social_sciences_mc_5shot_test
mmlu_other_mc_5shot_test
arc_challenge_mc_5shot
arc_challenge_mc_5shot_bpb
arc_challenge_rc_0shot
arc_challenge_rc_0shot_bpb
arc_challenge_rc_5shot
arc_challenge_rc_5shot_bpb
arc_easy_mc_5shot
arc_easy_mc_5shot_bpb
arc_easy_rc_0shot
arc_easy_rc_0shot_bpb
arc_easy_rc_5shot
arc_easy_rc_5shot_bpb
boolq_mc_5shot
boolq_mc_5shot_bpb
boolq_rc_0shot
boolq_rc_0shot_bpb
boolq_rc_5shot
boolq_rc_5shot_bpb
copa_rc_0shot
copa_rc_0shot_bpb
copycolors_10way
copycolors_10way_bpb
copycolors_xl_10way
copycolors_xl_10way_bpb
csqa_mc_5shot
csqa_mc_5shot_bpb
csqa_rc_0shot
csqa_rc_0shot_bpb
csqa_rc_5shot
csqa_rc_5shot_bpb
hellaswag_mc_5shot
hellaswag_mc_5shot_bpb
hellaswag_rc_0shot
hellaswag_rc_0shot_bpb
hellaswag_rc_5shot
hellaswag_rc_5shot_bpb
openbookqa_mc_5shot
openbookqa_mc_5shot_bpb
openbookqa_rc_0shot
openbookqa_rc_0shot_bpb
openbookqa_rc_5shot
openbookqa_rc_5shot_bpb
piqa_mc_5shot
piqa_mc_5shot_bpb
piqa_rc_0shot
piqa_rc_0shot_bpb
piqa_rc_5shot
piqa_rc_5shot_bpb
sciq_rc_0shot
sciq_rc_0shot_bpb
socialiqa_mc_5shot
socialiqa_mc_5shot_bpb
socialiqa_rc_0shot
socialiqa_rc_0shot_bpb
socialiqa_rc_5shot
socialiqa_rc_5shot_bpb
winogrande_mc_5shot
winogrande_mc_5shot_bpb
winogrande_rc_0shot
winogrande_rc_0shot_bpb
winogrande_rc_5shot
winogrande_rc_5shot_bpb
arc_challenge_train_rc_5shot
arc_challenge_train_mc_5shot
arc_challenge_val_rc_5shot
arc_challenge_val_mc_5shot
arc_challenge_test_rc_5shot
arc_challenge_test_mc_5shot
arc_easy_train_rc_5shot
arc_easy_train_mc_5shot
arc_easy_val_rc_5shot
arc_easy_val_mc_5shot
arc_easy_test_rc_5shot
arc_easy_test_mc_5shot
boolq_train_rc_5shot
boolq_train_mc_5shot
boolq_val_rc_5shot
boolq_val_mc_5shot
csqa_train_rc_5shot
csqa_train_mc_5shot
csqa_val_rc_5shot
csqa_val_mc_5shot
hellaswag_train_rc_5shot
hellaswag_train_mc_5shot
hellaswag_val_rc_5shot
hellaswag_val_mc_5shot
openbookqa_train_rc_5shot
openbookqa_train_mc_5shot
openbookqa_val_rc_5shot
openbookqa_val_mc_5shot
openbookqa_test_rc_5shot
openbookqa_test_mc_5shot
piqa_train_rc_5shot
piqa_train_mc_5shot
piqa_val_rc_5shot
piqa_val_mc_5shot
socialiqa_train_rc_5shot
socialiqa_train_mc_5shot
socialiqa_val_rc_5shot
socialiqa_val_mc_5shot
winogrande_train_rc_5shot
winogrande_train_mc_5shot
winogrande_val_rc_5shot
winogrande_val_mc_5shot
mmlu_stem_val_rc_var
mmlu_stem_val_rc_5shot
mmlu_stem_val_mc_5shot
mmlu_stem_test_rc_var
mmlu_stem_test_rc_5shot
mmlu_stem_test_mc_5shot
mmlu_humanities_val_rc_var
mmlu_humanities_val_rc_5shot
mmlu_humanities_val_mc_5shot
mmlu_humanities_test_rc_var
mmlu_humanities_test_rc_5shot
mmlu_humanities_test_mc_5shot
mmlu_social_sciences_val_rc_var
mmlu_social_sciences_val_rc_5shot
mmlu_social_sciences_val_mc_5shot
mmlu_social_sciences_test_rc_var
mmlu_social_sciences_test_rc_5shot
mmlu_social_sciences_test_mc_5shot
mmlu_other_val_rc_var
mmlu_other_val_rc_5shot
mmlu_other_val_mc_5shot
mmlu_other_test_rc_var
mmlu_other_test_rc_5shot
mmlu_other_test_mc_5shot
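The new scaling-law sets at the end of the listing follow a `<task>_<split>_<rc|mc>_5shot` naming scheme. A small sketch of filtering them out of a task listing (the sample list and helper below are illustrative; in practice you would pass the output of `olmo_eval.tasks.list_tasks()`):

```python
import re

# Hypothetical helper: pick out the split-specific scaling-law eval sets
# ("<task>_<train|val|test>_<rc|mc>_5shot") from a task-name listing.
SCALING_LAW_RE = re.compile(r"_(train|val|test)_(rc|mc)_5shot$")

def scaling_law_tasks(names):
    return [n for n in names if SCALING_LAW_RE.search(n)]

# Small stand-in for list_tasks(); names copied from the listing above.
names = [
    "arc_challenge",                # original in-loop task
    "arc_challenge_rc_5shot_bpb",   # bpb variant, not split-specific
    "arc_challenge_val_rc_5shot",   # new scaling-law eval set
    "mmlu_stem_test_mc_5shot",      # new scaling-law eval set
]
selected = scaling_law_tasks(names)
```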

@liujch1998 liujch1998 marked this pull request as ready for review December 15, 2024 08:45
@liujch1998 liujch1998 requested a review from epwalsh December 15, 2024 08:45
@OyvindTafjord

There's a timeout in the tests; maybe we need to increase it?


@OyvindTafjord OyvindTafjord left a comment


I added one concern about the richer metric output (consequences for logging, etc.), but otherwise this looks good!

"bpb": torch.tensor(sum(bpb) / len(bpb)),
"soft": torch.tensor(sum(soft_score) / len(soft_score)),
"soft_log": torch.tensor(sum(soft_log_score) / len(soft_log_score)),
}


This new, richer metric output (along with the correlated changes in olmo-core's evaluator_callback.py) will change the recorded metrics for "acc" and "len_norm" metrics. Do we know that this won't interfere with other parts of the setup? (logging etc)

In general, though, computing these in a single go is definitely the right thing to do, rather than running separate versions of the same task (in fact, "acc" and "len_norm" could also be computed together); the only downside is that it becomes less clear which one is the "official" metric for each task.
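To illustrate the single-pass aggregation being discussed with a hedged sketch (the field names and values are hypothetical, not the actual evaluator_callback.py code):

```python
# Illustrative sketch: aggregate several metric variants ("acc",
# "len_norm", "bpb") from one pass over per-instance results, instead of
# running separate copies of the same task. Field names are hypothetical.

def aggregate_metrics(instances):
    """instances: list of dicts holding per-instance metric values."""
    n = len(instances)
    return {
        "acc": sum(i["acc"] for i in instances) / n,
        "len_norm": sum(i["len_norm"] for i in instances) / n,
        "bpb": sum(i["bpb"] for i in instances) / n,
    }

metrics = aggregate_metrics([
    {"acc": 1.0, "len_norm": 1.0, "bpb": 0.8},
    {"acc": 0.0, "len_norm": 1.0, "bpb": 1.2},
])
```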

Contributor Author


Ah, that makes a lot of sense! I've updated my edits in olmo-core so that it displays the mapped metric-type text.


@yulinggu-cs left a comment


LGTM

@liujch1998 liujch1998 merged commit 5f3db3c into main Dec 18, 2024
8 checks passed
@liujch1998 liujch1998 deleted the moreeval branch December 18, 2024 23:14
@liujch1998 liujch1998 restored the moreeval branch December 18, 2024 23:14
@liujch1998 liujch1998 deleted the moreeval branch December 19, 2024 19:40
3 participants