To improve checkpointing performance in Flytekit for PyTorch, leveraging asynchronous checkpointing as described in the PyTorch blog is a viable approach. It reduces training downtime caused by checkpointing by moving the final checkpoint write off the critical path onto CPU threads, allowing GPU training to continue.
Would you like more details on how to integrate this into Flytekit?
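For example, here is a minimal sketch of what this could look like inside a Flyte elastic task. It is an illustration under stated assumptions, not a definitive implementation: it assumes flytekitplugins-kfpytorch for the Elastic task config and a recent PyTorch (2.4+) where torch.distributed.checkpoint.async_save is available; the toy model, step counts, and checkpoint path are placeholders.

```python
# Minimal sketch, not a definitive implementation: assumes flytekitplugins-kfpytorch
# (for Elastic) and PyTorch >= 2.4 (for torch.distributed.checkpoint.async_save).
# The model, step counts, and checkpoint path below are placeholders.
import torch
import torch.distributed.checkpoint as dcp
from flytekit import task
from flytekitplugins.kfpytorch import Elastic


@task(task_config=Elastic(nnodes=1, nproc_per_node=2))
def train() -> None:
    # Note: if a process group is initialized for real distributed training, it must
    # include a CPU backend for async_save, e.g. init_process_group("cpu:gloo,cuda:nccl").
    model = torch.nn.Linear(128, 128)  # stand-in for the real model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    pending = None  # Future for the checkpoint currently being written, if any

    for step in range(100):
        loss = model(torch.randn(32, 128)).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if step % 10 == 0:
            # Wait for the previous async save so only one write is in flight.
            if pending is not None:
                pending.result()
            # async_save stages tensors to CPU and persists them on a background
            # thread; GPU training continues while the checkpoint is written.
            pending = dcp.async_save(
                {"model": model.state_dict(), "optim": optimizer.state_dict()},
                checkpoint_id=f"/tmp/ckpts/step_{step}",
            )

    if pending is not None:
        pending.result()  # ensure the final checkpoint is fully persisted
```

How (or whether) the written checkpoint directory should then be handed to Flyte's own intra-task checkpointing (flytekit.current_context().checkpoint) is left open here and is part of what this issue is asking about.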
Motivation: Why do you think this is important?
When using elastic, we can greatly improve checkpointing performance by adopting the asynchronous checkpointing approach described in https://pytorch.org/blog/reducing-checkpointing-times/
Goal: What should the final outcome look like, ideally?
Checkpoints are saved faster, so training spends less time blocked on checkpoint writes.
Describe alternatives you've considered
N/A
Propose: Link/Inline OR Additional context
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?