Sync eval changes in OLMo/ladder-1xC to here #1

Merged: 4 commits, Dec 18, 2024

Conversation

liujch1998
Contributor

This adds scaling-law eval sets as in-loop evals.

Testing of the metric is done in OLMo-core.

Testing the tasks:

>>> from olmo_eval.tasks import list_tasks, build_task

>>> task = build_task(label="arc_challenge_val_rc_5shot", tokenizer=tokenizer)
>>> task
<olmo_eval.tasks.OEEvalTask object at 0x7f5a56b10ed0>
>>> task.metric_type
'len_norm'
>>> len(task.samples)
1194
>>> task.samples[0]
{'doc_id': 0, 'cont_id': 0, 'ctx': [23433, 27, 6086, 5605, 281, 5890, 521, 3564, 4541, 407, 35770, 731, 15, 6758, 4808, 2553, 588, 4711, 253, 954, 4250, 32, 187, 32869, 27, 6079, 38101, 187, 187, 23433, 27, 6758, 273, 253, 1563, 7234, 1682, 11424, 2139, 43733, 3798, 7356, 281, 247, 29794, 3369, 32, 187, 32869, 27, 380, 29794, 3369, 4428, 6871, 15, 187, 187, 23433, 27, 329, 7975, 2540, 275, 8090, 273, 26556, 552, 5561, 954, 2779, 7369, 432, 253, 187, 32869, 27, 5975, 3390, 273, 23396, 267, 10855, 15, 187, 187, 23433, 27, 6758, 273, 841, 513, 10950, 3959, 347, 253, 954, 3332, 8813, 347, 281, 2139, 1142, 6244, 285, 5074, 4962, 562, 387, 253, 990, 273, 253, 19926, 36154, 280, 8685, 32, 187, 32869, 27, 3486, 273, 271, 46125, 3562, 8660, 326, 13230, 253, 23993, 187, 187, 23433, 27, 6758, 273, 253, 1563, 310, 247, 18177, 326, 247, 4370, 1057, 5803, 30686, 432, 697, 4651, 32, 187, 32869, 27, 253, 1979, 273, 697, 27690, 187, 187, 23433, 27, 21587, 285, 3905, 9499, 25059, 4533, 247, 1643, 5113, 1066, 247, 18556, 15, 1583, 971, 281, 923, 534, 1789, 21968, 253, 18337, 3957, 15, 1737, 943, 597, 513, 594, 597, 476, 10280, 616, 5839, 32, 187, 32869, 27], 'continuation': [9272, 253, 5113, 275, 2390, 15], 'ctx_len': 202, 'dc_len': 1, 'cont_len': 6, 'cont_str_len': 26, 'cont_byte_len': 26, 'query': [23433, 27, 6086, 5605, 281, 5890, 521, 3564, 4541, 407, 35770, 731, 15, 6758, 4808, 2553, 588, 4711, 253, 954, 4250, 32, 187, 32869, 27, 6079, 38101, 187, 187, 23433, 27, 6758, 273, 253, 1563, 7234, 1682, 11424, 2139, 43733, 3798, 7356, 281, 247, 29794, 3369, 32, 187, 32869, 27, 380, 29794, 3369, 4428, 6871, 15, 187, 187, 23433, 27, 329, 7975, 2540, 275, 8090, 273, 26556, 552, 5561, 954, 2779, 7369, 432, 253, 187, 32869, 27, 5975, 3390, 273, 23396, 267, 10855, 15, 187, 187, 23433, 27, 6758, 273, 841, 513, 10950, 3959, 347, 253, 954, 3332, 8813, 347, 281, 2139, 1142, 6244, 285, 5074, 4962, 562, 387, 253, 990, 273, 253, 19926, 36154, 280, 8685, 32, 187, 32869, 27, 3486, 273, 271, 46125, 
3562, 8660, 326, 13230, 253, 23993, 187, 187, 23433, 27, 6758, 273, 253, 1563, 310, 247, 18177, 326, 247, 4370, 1057, 5803, 30686, 432, 697, 4651, 32, 187, 32869, 27, 253, 1979, 273, 697, 27690, 187, 187, 23433, 27, 21587, 285, 3905, 9499, 25059, 4533, 247, 1643, 5113, 1066, 247, 18556, 15, 1583, 971, 281, 923, 534, 1789, 21968, 253, 18337, 3957, 15, 1737, 943, 597, 513, 594, 597, 476, 10280, 616, 5839, 32, 187, 32869, 27, 9272, 253, 5113, 275, 2390], 'dc_query': [209, 9272, 253, 5113, 275, 2390], 'label_id': 3}
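The `cont_len` and `cont_str_len` fields in the sample above are what a length-normalized metric such as `'len_norm'` keys off. As a minimal sketch, assuming the score divides the continuation's summed token log-likelihood by its character length (the scoring function and log-probability values here are illustrative, not the actual olmo_eval implementation):

```python
# Hedged sketch of a length-normalized ("len_norm") score, assuming it
# normalizes the continuation's summed token log-likelihood by the
# continuation's character length (cont_str_len). Illustrative only.

def len_norm_score(token_logprobs, cont_str_len):
    """Sum the continuation's per-token log-likelihoods and normalize
    by its character length."""
    return sum(token_logprobs) / cont_str_len

# A 6-token continuation whose string form is 26 characters, matching
# cont_len=6 and cont_str_len=26 in the sample above; log-probs are made up.
score = len_norm_score([-1.2, -0.4, -2.0, -0.7, -1.5, -0.2], 26)
```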

>>> for task in list_tasks():
...     print(task)
... 
piqa
hellaswag
winogrande
openbook_qa
boolq
sciq
arc_easy
arc_easy_ppl
arc_challenge
basic_arithmetic
copa
rte
commitment_bank
mrpc
sst2
commonsense_qa
social_iqa
trivia_qa_wiki_ppl
natural_qs_open_ppl
mmlu_stem_test
mmlu_humanities_test
mmlu_social_sciences_test
mmlu_other_test
mmlu_stem
mmlu_humanities
mmlu_social_sciences
mmlu_other
mmlu_stem_bpb
mmlu_humanities_bpb
mmlu_social_sciences_bpb
mmlu_other_bpb
mmlu_stem_var
mmlu_humanities_var
mmlu_social_sciences_var
mmlu_other_var
mmlu_stem_var_bpb
mmlu_humanities_var_bpb
mmlu_social_sciences_var_bpb
mmlu_other_var_bpb
mmlu_stem_mc_5shot
mmlu_humanities_mc_5shot
mmlu_social_sciences_mc_5shot
mmlu_other_mc_5shot
mmlu_stem_mc_5shot_test
mmlu_humanities_mc_5shot_test
mmlu_social_sciences_mc_5shot_test
mmlu_other_mc_5shot_test
arc_challenge_mc_5shot
arc_challenge_mc_5shot_bpb
arc_challenge_rc_0shot
arc_challenge_rc_0shot_bpb
arc_challenge_rc_5shot
arc_challenge_rc_5shot_bpb
arc_easy_mc_5shot
arc_easy_mc_5shot_bpb
arc_easy_rc_0shot
arc_easy_rc_0shot_bpb
arc_easy_rc_5shot
arc_easy_rc_5shot_bpb
boolq_mc_5shot
boolq_mc_5shot_bpb
boolq_rc_0shot
boolq_rc_0shot_bpb
boolq_rc_5shot
boolq_rc_5shot_bpb
copa_rc_0shot
copa_rc_0shot_bpb
copycolors_10way
copycolors_10way_bpb
copycolors_xl_10way
copycolors_xl_10way_bpb
csqa_mc_5shot
csqa_mc_5shot_bpb
csqa_rc_0shot
csqa_rc_0shot_bpb
csqa_rc_5shot
csqa_rc_5shot_bpb
hellaswag_mc_5shot
hellaswag_mc_5shot_bpb
hellaswag_rc_0shot
hellaswag_rc_0shot_bpb
hellaswag_rc_5shot
hellaswag_rc_5shot_bpb
openbookqa_mc_5shot
openbookqa_mc_5shot_bpb
openbookqa_rc_0shot
openbookqa_rc_0shot_bpb
openbookqa_rc_5shot
openbookqa_rc_5shot_bpb
piqa_mc_5shot
piqa_mc_5shot_bpb
piqa_rc_0shot
piqa_rc_0shot_bpb
piqa_rc_5shot
piqa_rc_5shot_bpb
sciq_rc_0shot
sciq_rc_0shot_bpb
socialiqa_mc_5shot
socialiqa_mc_5shot_bpb
socialiqa_rc_0shot
socialiqa_rc_0shot_bpb
socialiqa_rc_5shot
socialiqa_rc_5shot_bpb
winogrande_mc_5shot
winogrande_mc_5shot_bpb
winogrande_rc_0shot
winogrande_rc_0shot_bpb
winogrande_rc_5shot
winogrande_rc_5shot_bpb
arc_challenge_train_rc_5shot
arc_challenge_train_mc_5shot
arc_challenge_val_rc_5shot
arc_challenge_val_mc_5shot
arc_challenge_test_rc_5shot
arc_challenge_test_mc_5shot
arc_easy_train_rc_5shot
arc_easy_train_mc_5shot
arc_easy_val_rc_5shot
arc_easy_val_mc_5shot
arc_easy_test_rc_5shot
arc_easy_test_mc_5shot
boolq_train_rc_5shot
boolq_train_mc_5shot
boolq_val_rc_5shot
boolq_val_mc_5shot
csqa_train_rc_5shot
csqa_train_mc_5shot
csqa_val_rc_5shot
csqa_val_mc_5shot
hellaswag_train_rc_5shot
hellaswag_train_mc_5shot
hellaswag_val_rc_5shot
hellaswag_val_mc_5shot
openbookqa_train_rc_5shot
openbookqa_train_mc_5shot
openbookqa_val_rc_5shot
openbookqa_val_mc_5shot
openbookqa_test_rc_5shot
openbookqa_test_mc_5shot
piqa_train_rc_5shot
piqa_train_mc_5shot
piqa_val_rc_5shot
piqa_val_mc_5shot
socialiqa_train_rc_5shot
socialiqa_train_mc_5shot
socialiqa_val_rc_5shot
socialiqa_val_mc_5shot
winogrande_train_rc_5shot
winogrande_train_mc_5shot
winogrande_val_rc_5shot
winogrande_val_mc_5shot
mmlu_stem_val_rc_var
mmlu_stem_val_rc_5shot
mmlu_stem_val_mc_5shot
mmlu_stem_test_rc_var
mmlu_stem_test_rc_5shot
mmlu_stem_test_mc_5shot
mmlu_humanities_val_rc_var
mmlu_humanities_val_rc_5shot
mmlu_humanities_val_mc_5shot
mmlu_humanities_test_rc_var
mmlu_humanities_test_rc_5shot
mmlu_humanities_test_mc_5shot
mmlu_social_sciences_val_rc_var
mmlu_social_sciences_val_rc_5shot
mmlu_social_sciences_val_mc_5shot
mmlu_social_sciences_test_rc_var
mmlu_social_sciences_test_rc_5shot
mmlu_social_sciences_test_mc_5shot
mmlu_other_val_rc_var
mmlu_other_val_rc_5shot
mmlu_other_val_mc_5shot
mmlu_other_test_rc_var
mmlu_other_test_rc_5shot
mmlu_other_test_mc_5shot
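The new scaling-law sets at the end of the listing follow a `<task>_<split>_<rc|mc>_5shot` naming scheme. A small sketch of filtering them out of a task listing (the sample list and helper below are illustrative; in practice you would pass the output of `olmo_eval.tasks.list_tasks()`):

```python
import re

# Hypothetical helper: pick out the split-specific scaling-law eval sets
# ("<task>_<train|val|test>_<rc|mc>_5shot") from a task-name listing.
SCALING_LAW_RE = re.compile(r"_(train|val|test)_(rc|mc)_5shot$")

def scaling_law_tasks(names):
    return [n for n in names if SCALING_LAW_RE.search(n)]

# Small stand-in for list_tasks(); names copied from the listing above.
names = [
    "arc_challenge",                # original in-loop task
    "arc_challenge_rc_5shot_bpb",   # bpb variant, not split-specific
    "arc_challenge_val_rc_5shot",   # new scaling-law eval set
    "mmlu_stem_test_mc_5shot",      # new scaling-law eval set
]
selected = scaling_law_tasks(names)
```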

@liujch1998 liujch1998 marked this pull request as ready for review December 15, 2024 08:45
@liujch1998 liujch1998 requested a review from epwalsh December 15, 2024 08:45
@OyvindTafjord

There's a timeout in the tests; maybe we need to increase it?


@OyvindTafjord OyvindTafjord left a comment


I added one concern about the richer metric output (consequences for logging, etc.), but otherwise this looks good!

"bpb": torch.tensor(sum(bpb) / len(bpb)),
"soft": torch.tensor(sum(soft_score) / len(soft_score)),
"soft_log": torch.tensor(sum(soft_log_score) / len(soft_log_score)),
}


This new, richer metric output (along with the correlated changes in olmo-core's evaluator_callback.py) will change the recorded metrics for "acc" and "len_norm" metrics. Do we know that this won't interfere with other parts of the setup? (logging etc)

In general, though, computing these in a single go is definitely the right thing to do, rather than running separate versions of the same task (in fact, "acc" and "len_norm" could also be computed together); the only downside is that it becomes less clear which one is the "official" metric for each task.
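To illustrate the single-pass aggregation being discussed with a hedged sketch (the field names and values are hypothetical, not the actual evaluator_callback.py code):

```python
# Illustrative sketch: aggregate several metric variants ("acc",
# "len_norm", "bpb") from one pass over per-instance results, instead of
# running separate copies of the same task. Field names are hypothetical.

def aggregate_metrics(instances):
    """instances: list of dicts holding per-instance metric values."""
    n = len(instances)
    return {
        "acc": sum(i["acc"] for i in instances) / n,
        "len_norm": sum(i["len_norm"] for i in instances) / n,
        "bpb": sum(i["bpb"] for i in instances) / n,
    }

metrics = aggregate_metrics([
    {"acc": 1.0, "len_norm": 1.0, "bpb": 0.8},
    {"acc": 0.0, "len_norm": 1.0, "bpb": 1.2},
])
```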

Contributor Author


Ah, that makes a lot of sense! I've updated my edits in olmo-core so that it displays the mapped metric-type text.


@yulinggu-cs left a comment


LGTM

@liujch1998 liujch1998 merged commit 5f3db3c into main Dec 18, 2024
8 checks passed
@liujch1998 liujch1998 deleted the moreeval branch December 18, 2024 23:14
@liujch1998 liujch1998 restored the moreeval branch December 18, 2024 23:14
@liujch1998 liujch1998 deleted the moreeval branch December 19, 2024 19:40
3 participants