
Code clean-ups #171

Merged
merged 19 commits into from
Jan 3, 2025

Conversation

vince62s
Contributor

No description provided.

@vince62s
Contributor Author

The main performance difference comes from using the context manager
with sdpa_kernel([SDPBackend.EFFICIENT_ATTENTION]): around the scaled_dot_product_attention() call.
The counterpart is a longer startup time, due to some recompilation.
Not sure which is best.

For the rest, some code clean-ups.

Before this PR, EuroLLM-9B fine-tuning of the estimator:

[2024-12-27 09:50:51,765 INFO] Step 10/ 4000; acc: 54.8; ppl: 24.84; xent: 3.21; aux: 0.599; lr: 3.33e-06; sents:    2560; bsz:  208/ 208/ 2; 1694/1694 tok/s;    157 sec;
[2024-12-27 09:51:39,617 INFO] Step 20/ 4000; acc: 55.0; ppl: 24.50; xent: 3.20; aux: 0.548; lr: 6.67e-06; sents:    2560; bsz:  210/ 210/ 2; 5625/5625 tok/s;    205 sec;
[2024-12-27 09:52:28,284 INFO] Step 30/ 4000; acc: 55.3; ppl: 23.62; xent: 3.16; aux: 0.451; lr: 1.00e-05; sents:    2560; bsz:  213/ 213/ 2; 5602/5602 tok/s;    253 sec;
[2024-12-27 09:53:17,060 INFO] Step 40/ 4000; acc: 55.2; ppl: 24.26; xent: 3.19; aux: 0.248; lr: 1.33e-05; sents:    2560; bsz:  211/ 211/ 2; 5545/5545 tok/s;    302 sec;
[2024-12-27 09:54:06,210 INFO] Step 50/ 4000; acc: 55.4; ppl: 23.51; xent: 3.16; aux: 0.110; lr: 1.67e-05; sents:    2560; bsz:  215/ 215/ 2; 5592/5592 tok/s;    351 sec;
[2024-12-27 09:54:54,319 INFO] Step 60/ 4000; acc: 54.8; ppl: 25.14; xent: 3.22; aux: 0.061; lr: 2.00e-05; sents:    2560; bsz:  206/ 206/ 2; 5481/5481 tok/s;    399 sec;

This PR:

[2024-12-27 10:20:55,554 INFO] Step 10/ 4000; acc: 54.8; ppl: 24.81; xent: 3.21; aux: 0.599; lr: 3.33e-06; sents:    2560; bsz:  208/ 208/ 2; 830/830 tok/s;    320 sec;
[2024-12-27 10:21:41,045 INFO] Step 20/ 4000; acc: 55.0; ppl: 24.47; xent: 3.20; aux: 0.548; lr: 6.67e-06; sents:    2560; bsz:  210/ 210/ 2; 5917/5917 tok/s;    366 sec;
[2024-12-27 10:22:27,125 INFO] Step 30/ 4000; acc: 55.3; ppl: 23.59; xent: 3.16; aux: 0.451; lr: 1.00e-05; sents:    2560; bsz:  213/ 213/ 2; 5916/5916 tok/s;    412 sec;
[2024-12-27 10:23:13,582 INFO] Step 40/ 4000; acc: 55.2; ppl: 24.22; xent: 3.19; aux: 0.248; lr: 1.33e-05; sents:    2560; bsz:  211/ 211/ 2; 5821/5821 tok/s;    458 sec;
[2024-12-27 10:24:00,311 INFO] Step 50/ 4000; acc: 55.3; ppl: 23.48; xent: 3.16; aux: 0.110; lr: 1.67e-05; sents:    2560; bsz:  215/ 215/ 2; 5881/5881 tok/s;    505 sec;
[2024-12-27 10:24:45,582 INFO] Step 60/ 4000; acc: 54.8; ppl: 25.10; xent: 3.22; aux: 0.061; lr: 2.00e-05; sents:    2560; bsz:  206/ 206/ 2; 5825/5825 tok/s;    550 sec;

Before this PR, Encoder-Decoder training:

[2024-12-27 09:59:18,407 INFO] Step 100/200000; acc: 13.6; ppl: 10780.07; xent: 9.29; aux: 0.000; lr: 6.72e-06; sents:  118220; bsz: 8506/10554/197; 26799/33251 tok/s;    190 sec;
[2024-12-27 10:01:04,435 INFO] Step 200/200000; acc: 17.8; ppl: 3006.98; xent: 8.01; aux: 0.000; lr: 1.34e-05; sents:  107813; bsz: 8515/10553/180; 48188/59718 tok/s;    296 sec;
[2024-12-27 10:02:51,181 INFO] Step 300/200000; acc: 19.5; ppl: 1181.01; xent: 7.07; aux: 0.000; lr: 2.02e-05; sents:  100017; bsz: 8583/10569/167; 48244/59408 tok/s;    403 sec;

This PR:

[2024-12-27 10:11:14,616 INFO] Step 100/200000; acc: 13.6; ppl: 10779.73; xent: 9.29; aux: 0.000; lr: 6.72e-06; sents:  118220; bsz: 8506/10554/197; 12455/15453 tok/s;    410 sec;
[2024-12-27 10:12:59,324 INFO] Step 200/200000; acc: 17.8; ppl: 3007.03; xent: 8.01; aux: 0.000; lr: 1.34e-05; sents:  107813; bsz: 8515/10553/180; 48796/60471 tok/s;    514 sec;
[2024-12-27 10:14:44,983 INFO] Step 300/200000; acc: 19.5; ppl: 1184.54; xent: 7.08; aux: 0.000; lr: 2.02e-05; sents:  100017; bsz: 8583/10569/167; 48740/60020 tok/s;    620 sec;

@vince62s changed the title from "misc optimization" to "Code clean-ups" on Dec 27, 2024
Member

@francoishernandez left a comment


Nice clean-up for the new year!
Quite a few comments, not necessarily all relevant.

Also, do we know where the main diffs in the test outputs come from? Is it just the attention-backend changes, or maybe a few numerical differences resulting from the modified operations here and there?

Review comments were left on:

.github/workflows/push.yml
eole/decoders/ensemble.py
eole/decoders/transformer_decoder.py
eole/decoders/transformer_decoder.py (outdated)
eole/decoders/transformer_lm_decoder.py (outdated)
eole/predict/inference.py
eole/predict/inference.py
eole/predict/translator.py
eole/tests/test_model_lm/config.json
eole/train_single.py
@vince62s vince62s merged commit 8a8987f into eole-nlp:main Jan 3, 2025
2 checks passed