Flytekit checkpoint improvement- pytorch #5488

Open
2 tasks done
kumare3 opened this issue Jun 19, 2024 · 1 comment · May be fixed by flyteorg/flytekit#2612
Labels
backlogged For internal use. Reserved for contributor team workflow.

Comments

@kumare3
Contributor

kumare3 commented Jun 19, 2024

Motivation: Why do you think this is important?

When using Elastic tasks, we can greatly improve checkpointing performance with the asynchronous checkpointing approach described in https://pytorch.org/blog/reducing-checkpointing-times/

Goal: What should the final outcome look like, ideally?

Checkpoints are faster

Describe alternatives you've considered

N/A

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes

runllm bot commented Jun 19, 2024

To improve checkpointing performance in Flytekit for PyTorch, asynchronous checkpointing as described in the PyTorch blog is a viable approach. It shortens the time training is stalled by a checkpoint: the final write is moved off the critical path onto CPU threads, so GPU training can continue while the checkpoint is persisted.
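The idea can be sketched without any framework. The block below is a minimal, stdlib-only illustration, not Flytekit's or PyTorch's implementation: `AsyncCheckpointer` and `train` are hypothetical names, and `copy.deepcopy` stands in for the GPU-to-CPU copy of the model state. Only the snapshot blocks the training loop; the write to disk runs on a background thread. Recent PyTorch releases appear to expose the real mechanism as `torch.distributed.checkpoint.async_save`, but that mapping is an assumption and is not confirmed in this thread.

```python
import copy
import pickle
import threading
from pathlib import Path


class AsyncCheckpointer:
    """Hypothetical helper: snapshot state synchronously, persist it in the background."""

    def __init__(self):
        self._pending = None  # the in-flight writer thread, if any

    def save(self, state, path):
        # Wait for the previous write so at most one checkpoint is in flight.
        self.wait()
        # Blocking step: copy the state so training can keep mutating the original.
        # (Assumption: with PyTorch this would be the GPU-to-CPU tensor copy.)
        snapshot = copy.deepcopy(state)
        self._pending = threading.Thread(
            target=self._write, args=(snapshot, Path(path))
        )
        self._pending.start()

    @staticmethod
    def _write(snapshot, path):
        # Write to a temp file, then rename, so readers never see a partial file.
        tmp = path.with_suffix(path.suffix + ".tmp")
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        tmp.replace(path)

    def wait(self):
        # Block until the in-flight write (if any) has finished.
        if self._pending is not None:
            self._pending.join()
            self._pending = None


def train(steps, ckpt_path, every=2):
    state = {"step": 0, "weights": [0.0, 0.0]}
    ckpt = AsyncCheckpointer()
    for step in range(1, steps + 1):
        state["step"] = step
        # Stand-in for a real training step.
        state["weights"] = [w + 0.5 for w in state["weights"]]
        if step % every == 0:
            ckpt.save(state, ckpt_path)  # returns as soon as the snapshot is taken
    ckpt.wait()  # make sure the last checkpoint is on disk before returning
    return state
```

Calling `train(5, "ckpt.pkl")` checkpoints at steps 2 and 4 while the loop keeps running. Waiting for the previous write before starting a new one bounds memory to a single extra copy of the state, which is the main cost of this approach.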

Would you like more details on how to integrate this into Flytekit?


This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and 👎 if it is not.

Mention @RunLLM to ask follow-up questions. See runllm.com for more!

@eapolinario eapolinario added the backlogged For internal use. Reserved for contributor team workflow. label Jun 20, 2024
@novahow novahow linked a pull request Jul 25, 2024 that will close this issue
3 tasks