Fix training integration test #517

j316chuck · 2023-08-10T22:59:59Z

This PR fixes the test_training integration test for both CPU and GPU training runs.

Previously, this test was marked as an expected failure (xfail) because the my-copy-c4 dataset did not exist. This PR adds logic to create a mocked my-copy-c4 dataset, kicks off a simple training run, and checks that the codebase reproduces a golden loss value.
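For readers unfamiliar with the pattern, here is a minimal sketch of a golden-loss check against a mocked dataset. It is not the code from this PR: `create_mock_dataset`, `train`, `run_check`, the `train.jsonl` layout, and the `GOLDEN_LOSS` value are all hypothetical, and a toy scalar model stands in for the real training run so that the example runs with the standard library alone.

```python
import json
import math
import tempfile
from pathlib import Path

# Hypothetical golden value for this toy setup (not the loss from the PR).
GOLDEN_LOSS = 1.5625


def create_mock_dataset(root: Path) -> Path:
    """Write a tiny stand-in dataset so the test needs no download."""
    path = root / "my-copy-c4" / "train.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as f:
        for target in (2.0, 4.0):
            f.write(json.dumps({"target": target}) + "\n")
    return path


def train(dataset_path: Path, steps: int = 2, lr: float = 0.25) -> float:
    """Fit a single scalar with plain gradient descent; return the final MSE."""
    lines = dataset_path.read_text().splitlines()
    targets = [json.loads(line)["target"] for line in lines]
    w = 0.0  # fixed initial value keeps the run deterministic
    for _ in range(steps):
        grad = sum(2 * (w - t) for t in targets) / len(targets)
        w -= lr * grad
    return sum((w - t) ** 2 for t in targets) / len(targets)


def run_check() -> float:
    """Create the mock data, train, and compare against the golden loss."""
    with tempfile.TemporaryDirectory() as tmp:
        loss = train(create_mock_dataset(Path(tmp)))
    assert math.isclose(loss, GOLDEN_LOSS, rel_tol=1e-6), loss
    return loss


if __name__ == "__main__":
    print(run_check())
```

The comparison uses a relative tolerance rather than exact equality, which is the usual choice when the same check has to pass on different hardware.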

Before:

https://github.com/mosaicml/llm-foundry/actions/runs/5800250754/job/15721956423?pr=510

tests/test_training.py::test_train[None-cpu] XFAIL (c4 dataset not s...) [ 97%]
tests/test_training.py::test_train[None-cuda] SKIPPED (testing with ...) [ 97%]
tests/test_training.py::test_train[0.036-cpu] XFAIL (c4 dataset not ...) [ 98%]
tests/test_training.py::test_train[0.036-cuda] SKIPPED (testing with...) [ 98%]
tests/test_training.py::test_train[inv_sqrt_d_model-cpu] XFAIL (c4 d...) [ 99%]
tests/test_training.py::test_train[inv_sqrt_d_model-cuda] SKIPPED (t...) [100%]

After:

https://github.com/mosaicml/llm-foundry/actions/runs/5826923270/job/15801908817?pr=517

2023-08-10 23:12:52 [    INFO] Setting seed to 17 (reproducibility.py:159)
2023-08-10 23:12:52 [    INFO] Added key: store_based_barrier_key:1 to store for rank: 0 (distributed_c10d.py:319)
2023-08-10 23:12:52 [    INFO] Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes. (distributed_c10d.py:353)
2023-08-10 23:12:52 [    INFO] The number of tokens in the tokenizer is less than the number of tokens in the model. You may want to resize the model embeddings to 50277 from 50368 by calling `model.resize_token_embeddings(len(tokenizer))` before calling the `HuggingFaceModel` constructor. The vocab size is sometimes intentionally set to a multiple of 32 or 64 to improve performance. (huggingface.py:111)
2023-08-10 23:12:52 [    INFO] Setting seed to 17 (reproducibility.py:159)
2023-08-10 23:12:52 [    INFO] Run name: llm (trainer.py:1000)
2023-08-10 23:12:52 [    INFO] Stepping schedulers every batch. To step schedulers every epoch, set `step_schedulers_every_batch=False`. (trainer.py:99)
2023-08-10 23:12:52 [    INFO] Setting seed to 17 (trainer.py:1380)
2023-08-10 23:12:52 [    INFO] Setting seed to 17 (reproducibility.py:159)
2023-08-10 23:12:52 [    INFO] Using precision Precision.FP32 (trainer.py:1913)
PASSED                                                                   [ 99%]

working test fix on cpu

llmfoundry/data/text_data.py

scripts/train/train.py

tests/test_training.py

add test

98d4dd3

j316chuck requested review from vchiley, mvpatel2000 and dakinggg August 11, 2023 00:35

j316chuck mentioned this pull request Aug 11, 2023

Add runtime error in train.py if yaml config is improperly formatted with extraneous or missing values #506

Merged

dakinggg approved these changes Aug 11, 2023

llmfoundry/data/text_data.py (outdated, resolved)

tests/test_training.py (outdated, resolved)

j316chuck changed the title ~~Fix training test~~ Fix training integration test Aug 11, 2023

vchiley requested changes Aug 11, 2023

j316chuck added 3 commits August 10, 2023 20:24

address comments

3caba2e

fix streaming stuff

8b0689d

pyright

de9b66b

j316chuck requested a review from vchiley August 11, 2023 04:26

j316chuck added 2 commits August 10, 2023 21:39

restore train

8b2833e

remove cmt

724f7af

vchiley reviewed Aug 11, 2023

tests/test_training.py (resolved)

tests/test_training.py (outdated, resolved)

j316chuck added 2 commits August 11, 2023 10:33

address comment

d01bd9b

fmt

a287356

j316chuck requested a review from vchiley August 11, 2023 17:36

vchiley approved these changes Aug 11, 2023

j316chuck merged commit 6d58287 into main Aug 11, 2023
9 checks passed

j316chuck deleted the chuck/fix_test branch October 5, 2023 21:23
