[2/n][torchtune integration] implement job management and return training artifacts #593

Merged: 21 commits, Dec 13, 2024

@SLR722 SLR722 commented Dec 10, 2024

Context

In this PR, we:

  • Implement the post-training job management and training artifact retrieval APIs.
  • Refactor the post-training and training type definitions to make them more intuitive.
  • Rewrite the checkpointer so that it is compatible with the llama-stack file system and its output can be recognized during inference.
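
For readers unfamiliar with the job management pattern described above, here is a minimal sketch of the idea: a registry that tracks each training job's status and the checkpoints it produced, so a client can poll for status and later fetch the artifacts. Every name here (`JobRegistry`, `Job`, `Checkpoint`, `JobStatus`, and all methods and fields) is a hypothetical illustration, not the llama-stack or torchtune API, and the in-memory dict stands in for whatever storage the real provider uses.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class JobStatus(Enum):
    # Hypothetical lifecycle states for a post-training job.
    SCHEDULED = "scheduled"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class Checkpoint:
    # Hypothetical record for one saved checkpoint (a training artifact).
    identifier: str
    epoch: int
    path: str


@dataclass
class Job:
    job_uuid: str
    status: JobStatus = JobStatus.SCHEDULED
    checkpoints: List[Checkpoint] = field(default_factory=list)


class JobRegistry:
    """In-memory job store; illustrative only."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def schedule(self, job_uuid: str) -> Job:
        # Reject duplicate ids so status lookups stay unambiguous.
        if job_uuid in self._jobs:
            raise ValueError(f"job {job_uuid} already exists")
        job = Job(job_uuid)
        self._jobs[job_uuid] = job
        return job

    def set_status(self, job_uuid: str, status: JobStatus) -> None:
        self._jobs[job_uuid].status = status

    def add_checkpoint(self, job_uuid: str, checkpoint: Checkpoint) -> None:
        self._jobs[job_uuid].checkpoints.append(checkpoint)

    def get_status(self, job_uuid: str) -> Optional[JobStatus]:
        # Unknown jobs return None rather than raising.
        job = self._jobs.get(job_uuid)
        return job.status if job is not None else None

    def get_artifacts(self, job_uuid: str) -> List[Checkpoint]:
        # Return a copy so callers cannot mutate the registry's state.
        job = self._jobs.get(job_uuid)
        return list(job.checkpoints) if job is not None else []
```

In this sketch a training loop would call `add_checkpoint` after each save and `set_status` when it finishes or fails, while the client only ever calls `get_status` and `get_artifacts`.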

Test

Unit test
pytest llama_stack/providers/tests/post_training/test_post_training.py -m "torchtune_post_training_huggingface_datasetio" -v -s --tb=short --disable-warnings

Screenshot 2024-12-10 at 4 06 17 PM

End-to-end test with a client-side call

Screenshot 2024-12-10 at 4 09 44 PM

@facebook-github-bot facebook-github-bot added the CLA Signed label Dec 10, 2024
@SLR722 SLR722 changed the title [WIP][2/n][torchtune integration] Implementation job management and return training artifacts [WIP][2/n][torchtune integration] implement job management and return training artifacts Dec 10, 2024
@SLR722 SLR722 changed the base branch from main to post_training_v2 December 10, 2024 19:10
@SLR722 SLR722 changed the title [WIP][2/n][torchtune integration] implement job management and return training artifacts [2/n][torchtune integration] implement job management and return training artifacts Dec 11, 2024
@SLR722 SLR722 marked this pull request as ready for review December 11, 2024 00:10
Base automatically changed from post_training_v2 to main December 13, 2024 19:05
@SLR722 SLR722 merged commit c294a01 into main Dec 13, 2024
2 checks passed
@SLR722 SLR722 deleted the post_training_v3 branch December 13, 2024 23:00
3 participants