Support torch async dist checkpoint #2612

Draft
wants to merge 9 commits into master

novahow
Contributor

@novahow novahow commented Jul 25, 2024

Tracking issue

Closes flyteorg/flyte#5488

Why are the changes needed?

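Currently, I believe we save checkpoints with torch.save and then upload them to S3. As models get larger, saving synchronously becomes time-inefficient because training is blocked until the save and upload finish.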

What changes were proposed in this pull request?

We use futures to run the save in another thread so that the user can continue training. If the user saves again, we wait until the previous save and upload have finished, then submit the next save-and-upload request.
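
The behavior described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code in this PR: `AsyncCheckpointer`, `save_and_upload`, and `fake_save_and_upload` are hypothetical names, and the fake function stands in for the real serialization and S3 upload.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


class AsyncCheckpointer:
    """Hypothetical helper: runs save + upload on a background thread."""

    def __init__(self, save_and_upload: Callable[[dict, str], None]):
        # One worker thread, so at most one checkpoint is written at a time.
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._save_and_upload = save_and_upload
        self._pending: Optional[Future] = None

    def wait(self) -> None:
        # Block until the in-flight save + upload (if any) has finished.
        # result() re-raises any exception raised in the worker thread.
        if self._pending is not None:
            try:
                self._pending.result()
            finally:
                self._pending = None

    def save(self, state: dict, dest: str) -> Future:
        # Wait for the previous request before submitting the next one.
        self.wait()
        # Shallow copy only; a real implementation would need to copy the
        # underlying tensors before handing control back to training.
        snapshot = dict(state)
        self._pending = self._executor.submit(self._save_and_upload, snapshot, dest)
        return self._pending

    def close(self) -> None:
        # Flush the last checkpoint and stop the worker thread.
        self.wait()
        self._executor.shutdown(wait=True)


written = []


def fake_save_and_upload(state: dict, dest: str) -> None:
    # Assumption: stands in for serializing the state and uploading it.
    time.sleep(0.05)
    written.append((dest, state["step"]))


ckpt = AsyncCheckpointer(fake_save_and_upload)
for step in range(3):
    ckpt.save({"step": step}, f"ckpt-{step}")
    # Training would continue here while the save runs in the background.
ckpt.close()
```

Using a single worker plus an explicit wait keeps checkpoints ordered and surfaces a failed upload on the next `save` call, at the cost of blocking training if a new save is requested before the previous one completes.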

How was this patch tested?

N/A. I tried to run it on my local computer, but the machine was too low-end and crashed.

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Signed-off-by: novahow <[email protected]>
Signed-off-by: novahow <[email protected]>
Signed-off-by: novahow <[email protected]>

Revert "exp with union imagespec"

This reverts commit c75529e.

exp with union imagespec test version

Signed-off-by: novahow <[email protected]>

exp with union imagespec test version 1.5.0

Signed-off-by: novahow <[email protected]>

exp with union imagespec test version 1.15.0

Signed-off-by: novahow <[email protected]>
Signed-off-by: novahow <[email protected]>
@novahow novahow force-pushed the plugins/torch_async branch from 8fc342d to 340365e Compare December 13, 2024 16:24

codecov bot commented Dec 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.15%. Comparing base (f99d50e) to head (340365e).

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2612       +/-   ##
===========================================
+ Coverage   51.08%   92.15%   +41.07%     
===========================================
  Files         201       33      -168     
  Lines       21231     1734    -19497     
  Branches     2731        0     -2731     
===========================================
- Hits        10846     1598     -9248     
+ Misses       9787      136     -9651     
+ Partials      598        0      -598     


Development

Successfully merging this pull request may close these issues.

Flytekit checkpoint improvement- pytorch