Fix training integration test #517

Merged
merged 8 commits into from
Aug 11, 2023

@j316chuck (Contributor) commented Aug 10, 2023

This PR fixes the test_training integration test for both cpu and gpu training runs.

Previously, this test was marked as an expected failure (xfail) because the my-copy-c4 dataset did not exist. This PR adds the logic to create a mocked my-copy-c4 dataset, kicks off a simple training run, and checks that the codebase reproduces a golden loss value.
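
For readers unfamiliar with the pattern, here is a minimal sketch of a golden-loss regression test. It is not the code in this PR: the helper names (`make_mock_dataset`, `train`, `check_golden_loss`), the toy linear model, the JSON-lines format, and the tolerance are all assumptions made for illustration, and it uses only the Python standard library rather than llm-foundry or Composer.

```python
import json
import math
import random
import tempfile
from pathlib import Path


def make_mock_dataset(root: Path, n_samples: int = 32, seed: int = 17) -> Path:
    """Write a tiny synthetic dataset (y = 2x + 1 plus small noise) as JSON lines.

    Hypothetical stand-in for creating a mocked dataset on disk.
    """
    rng = random.Random(seed)  # fixed seed so the data is identical on every run
    path = root / "train.jsonl"
    with path.open("w") as f:
        for _ in range(n_samples):
            x = rng.uniform(-1.0, 1.0)
            y = 2.0 * x + 1.0 + rng.gauss(0.0, 0.01)
            f.write(json.dumps({"x": x, "y": y}) + "\n")
    return path


def train(path: Path, steps: int = 200, lr: float = 0.1) -> float:
    """Run full-batch gradient descent on a 1-D linear model; return the final MSE."""
    data = [json.loads(line) for line in path.read_text().splitlines()]
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for d in data:
            err = w * d["x"] + b - d["y"]
            grad_w += 2.0 * err * d["x"]
            grad_b += 2.0 * err
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return sum((w * d["x"] + b - d["y"]) ** 2 for d in data) / len(data)


def check_golden_loss(loss: float, golden: float, rel_tol: float = 1e-6) -> None:
    """Fail if the observed loss drifts from the recorded golden value."""
    assert math.isclose(loss, golden, rel_tol=rel_tol), (loss, golden)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dataset = make_mock_dataset(Path(tmp))
        # In a real test the golden value is recorded once from a trusted run
        # and hard-coded; here it is taken from a first run for illustration.
        golden = train(dataset)
        check_golden_loss(train(dataset), golden)
```

The check only works if the run is deterministic, which is why both the data and the training loop above are seeded or free of randomness. A real training run would also need fixed seeds and a tolerance that absorbs small numerical differences between devices.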

Before:

tests/test_training.py::test_train[None-cpu] XFAIL (c4 dataset not s...) [ 97%]
tests/test_training.py::test_train[None-cuda] SKIPPED (testing with ...) [ 97%]
tests/test_training.py::test_train[0.036-cpu] XFAIL (c4 dataset not ...) [ 98%]
tests/test_training.py::test_train[0.036-cuda] SKIPPED (testing with...) [ 98%]
tests/test_training.py::test_train[inv_sqrt_d_model-cpu] XFAIL (c4 d...) [ 99%]
tests/test_training.py::test_train[inv_sqrt_d_model-cuda] SKIPPED (t...) [100%]

After:

2023-08-10 23:12:52 [    INFO] Setting seed to 17 (reproducibility.py:159)
2023-08-10 23:12:52 [    INFO] Added key: store_based_barrier_key:1 to store for rank: 0 (distributed_c10d.py:319)
2023-08-10 23:12:52 [    INFO] Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes. (distributed_c10d.py:353)
2023-08-10 23:12:52 [    INFO] The number of tokens in the tokenizer is less than the number of tokens in the model. You may want to resize the model embeddings to 50277 from 50368 by calling `model.resize_token_embeddings(len(tokenizer))` before calling the `HuggingFaceModel` constructor. The vocab size is sometimes intentionally set to a multiple of 32 or 64 to improve performance. (huggingface.py:111)
2023-08-10 23:12:52 [    INFO] Setting seed to 17 (reproducibility.py:159)
2023-08-10 23:12:52 [    INFO] Run name: llm (trainer.py:1000)
2023-08-10 23:12:52 [    INFO] Stepping schedulers every batch. To step schedulers every epoch, set `step_schedulers_every_batch=False`. (trainer.py:99)
2023-08-10 23:12:52 [    INFO] Setting seed to 17 (trainer.py:1380)
2023-08-10 23:12:52 [    INFO] Setting seed to 17 (reproducibility.py:159)
2023-08-10 23:12:52 [    INFO] Using precision Precision.FP32 (trainer.py:1913)
PASSED                                                                   [ 99%]

working test

fix on cpu
@j316chuck j316chuck changed the title Fix training test Fix training integration test Aug 11, 2023
@j316chuck j316chuck requested a review from vchiley August 11, 2023 04:26
@j316chuck j316chuck requested a review from vchiley August 11, 2023 17:36
@j316chuck j316chuck merged commit 6d58287 into main Aug 11, 2023
9 checks passed
@j316chuck j316chuck deleted the chuck/fix_test branch October 5, 2023 21:23
4 participants