
Sync eval changes in OLMo/ladder-1xC to here #122

Merged
merged 23 commits into main from moreeval on Dec 19, 2024

Conversation

liujch1998
Contributor

@liujch1998 liujch1998 commented Dec 15, 2024

This adds the scaling-law eval sets as in-loop evals.

Testing of metric: https://legacy.beaker.org/ex/01JF4NNA49YJGC55P3Q5FPEAPA/tasks/01JF4NNA4HM9Q90BQNQ99XSJ9Y/job/01JF4P6XRZVTDXWC3J2559R0K5

2024-12-15T08:21:11.301073649Z 2024-12-15 08:21:11.300	d22e6d646321:0	olmo_core.train.callbacks.evaluator_callback:68	INFO	Running downstream evals...
2024-12-15T08:21:14.829675802Z 2024-12-15 08:21:14.829	d22e6d646321:0	olmo_core.train.callbacks.evaluator_callback:111	INFO	[eval=downstream,step=5/75]
2024-12-15T08:21:14.940428448Z 2024-12-15 08:21:14.940	d22e6d646321:0	olmo_core.train.callbacks.evaluator_callback:111	INFO	[eval=downstream,step=10/75]
2024-12-15T08:21:15.049435484Z 2024-12-15 08:21:15.049	d22e6d646321:0	olmo_core.train.callbacks.evaluator_callback:111	INFO	[eval=downstream,step=15/75]
2024-12-15T08:21:15.157967512Z 2024-12-15 08:21:15.157	d22e6d646321:0	olmo_core.train.callbacks.evaluator_callback:111	INFO	[eval=downstream,step=20/75]
2024-12-15T08:21:15.267427337Z 2024-12-15 08:21:15.267	d22e6d646321:0	olmo_core.train.callbacks.evaluator_callback:111	INFO	[eval=downstream,step=25/75]
2024-12-15T08:21:15.375047960Z 2024-12-15 08:21:15.374	d22e6d646321:0	olmo_core.train.callbacks.evaluator_callback:111	INFO	[eval=downstream,step=30/75]
2024-12-15T08:21:15.483513780Z 2024-12-15 08:21:15.483	d22e6d646321:0	olmo_core.train.callbacks.evaluator_callback:111	INFO	[eval=downstream,step=35/75]
2024-12-15T08:21:15.594538312Z 2024-12-15 08:21:15.594	d22e6d646321:0	olmo_core.train.callbacks.evaluator_callback:111	INFO	[eval=downstream,step=40/75]
2024-12-15T08:21:15.702422918Z 2024-12-15 08:21:15.702	d22e6d646321:0	olmo_core.train.callbacks.evaluator_callback:111	INFO	[eval=downstream,step=45/75]
2024-12-15T08:21:15.811504739Z 2024-12-15 08:21:15.811	d22e6d646321:0	olmo_core.train.callbacks.evaluator_callback:111	INFO	[eval=downstream,step=50/75]
2024-12-15T08:21:15.919817749Z 2024-12-15 08:21:15.919	d22e6d646321:0	olmo_core.train.callbacks.evaluator_callback:111	INFO	[eval=downstream,step=55/75]
2024-12-15T08:21:16.026753004Z 2024-12-15 08:21:16.026	d22e6d646321:0	olmo_core.train.callbacks.evaluator_callback:111	INFO	[eval=downstream,step=60/75]
2024-12-15T08:21:16.133501599Z 2024-12-15 08:21:16.133	d22e6d646321:0	olmo_core.train.callbacks.evaluator_callback:111	INFO	[eval=downstream,step=65/75]
2024-12-15T08:21:16.240990822Z 2024-12-15 08:21:16.240	d22e6d646321:0	olmo_core.train.callbacks.evaluator_callback:111	INFO	[eval=downstream,step=70/75]
2024-12-15T08:21:16.348730485Z 2024-12-15 08:21:16.348	d22e6d646321:0	olmo_core.train.callbacks.evaluator_callback:111	INFO	[eval=downstream,step=75/75]
2024-12-15T08:21:17.056109188Z 2024-12-15 08:21:17.055	d22e6d646321:0	olmo_core.train.callbacks.evaluator_callback:104	INFO	Eval metrics:
2024-12-15T08:21:17.056129669Z     arc_challenge_val_rc_5shot (len_norm)=0.2441
2024-12-15T08:21:17.056131828Z     arc_challenge_val_rc_5shot (ce_loss)=2.472
2024-12-15T08:21:17.056133529Z     arc_challenge_val_rc_5shot (bpb)=3.565
2024-12-15T08:21:17.056134965Z     arc_challenge_val_rc_5shot (soft)=0.2539
2024-12-15T08:21:17.056136416Z     arc_challenge_val_rc_5shot (soft_log)=-1.46E+00

To see things in Comet: https://www.comet.com/ai2/olmo-core-1b/7a3614872861484dbc7ad651ad5c9e35
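As a sanity check on the metrics above, the reported bpb is consistent with converting the nat-based ce_loss to bits; in this particular run the token and byte counts evidently coincide (in general, bpb is also scaled by the tokens-to-bytes ratio). A quick check:

```python
import math

ce_loss = 2.472       # mean cross-entropy in nats, from the log above
reported_bpb = 3.565  # bits-per-byte, from the log above

# bits-per-byte = (nats per token) / ln 2 * (tokens / bytes);
# here the ratio appears to be ~1, so bpb ~= ce_loss / ln 2.
bpb = ce_loss / math.log(2)
print(round(bpb, 3))  # → 3.566, matching the logged value up to rounding
assert abs(bpb - reported_bpb) < 0.005
```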

Comment on lines 104 to 106
# install ai2-olmo-eval from source git repo
"pip uninstall -y ai2-olmo-eval",
"pip install git+https://github.com/allenai/OLMo-in-loop-evals.git@moreeval",
Contributor Author

will clean this up when the PR in olmo-eval lands

shuffle=False,
num_replicas=get_world_size(dp_process_group),
rank=get_rank(dp_process_group),
)

rank_batch_size_instances = max(0, rank_batch_size // self.task.max_sequence_length)
Contributor Author

This causes a bug. I don't see why we should divide batch size by seq len. Batch size was already number of examples.

Member

Batch size was already number of examples.

Where was it set to the number of examples instead of tokens? It should always be set in tokens. This change will cause bugs elsewhere.

Contributor Author

It's derived from Line 230 of this file, eval_batch_size. Do you mean this variable should be in number of tokens? It seems to be a semantic change from the old repo, and each eval's max seq length is dependent on the task so we can't set a fixed batch size ...

Member

It is a change from the old repo.

each eval's max seq length is dependent on the task so we can't set a fixed batch size

Well batch size is roughly fixed by number of tokens, not instances. This is more efficient because we can pack more instances from shorter tasks together.
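The token-based sizing described above can be sketched as follows; the function and parameter names here are hypothetical illustrations, not the actual olmo-core API:

```python
def instances_per_rank(global_batch_size_tokens: int,
                       world_size: int,
                       task_max_sequence_length: int) -> int:
    """Derive the per-rank instance count from a token budget.

    All names are illustrative; olmo-core's real logic lives in its
    evaluator callback and data loaders.
    """
    rank_batch_size_tokens = global_batch_size_tokens // world_size
    # Shorter tasks pack more instances into the same token budget,
    # which is the efficiency argument made above.
    return max(0, rank_batch_size_tokens // task_max_sequence_length)

# A 256Ki-token global budget on 8 ranks gives 32Ki tokens per rank:
print(instances_per_rank(256 * 1024, 8, 512))   # → 64 instances of a 512-token task
print(instances_per_rank(256 * 1024, 8, 2048))  # → 16 instances of a 2048-token task
```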

Contributor Author

Got it. I've reverted it so now eval_batch_size takes number of tokens.

@liujch1998 liujch1998 marked this pull request as ready for review December 15, 2024 08:45
@liujch1998 liujch1998 requested a review from epwalsh December 19, 2024 17:58
Member

@epwalsh epwalsh left a comment

One minor comment, otherwise LGTM!

@@ -73,6 +76,48 @@ def build_trainer_config(common: CommonComponents) -> TrainerConfig:
cancel_check_interval=10,
),
)
.with_callback(
"downstream",
Member

Let's call this "downstream_evaluator" to be consistent with the naming convention for other callbacks.

Suggested change
"downstream",
"downstream_evaluator",

Contributor Author

thanks, fixed

@liujch1998 liujch1998 requested a review from epwalsh December 19, 2024 19:42
@liujch1998 liujch1998 merged commit ee27348 into main Dec 19, 2024
14 checks passed
@liujch1998 liujch1998 deleted the moreeval branch December 19, 2024 21:16