Spot instances: Runner must be able to restart workflow #174
Comments
https://aws.amazon.com/blogs/compute/best-practices-for-handling-ec2-spot-instance-interruptions/
|
https://cloud.google.com/compute/docs/shutdownscript
This would imply adding that ability to our Docker container fork |
Hi @DavidGOrtega - was there any progress made on these? Would love to be able to use CML with spot instances |
Hi @btjones-me, we are preparing a release that allows you to deploy spot instances using our Terraform provider. But restarting the workflow to continue training is something that we are still developing. |
Hi @DavidGOrtega ! Can you share the progress on this please? I'm looking into using CML with DVC for a product and being able to use spot instances to train and evaluate models is pretty crucial to keep costs reasonable. Thanks! |
👋 @SebastianCallh you can use spot instances with CML. The feature that we are solving here is the ability to transparently move to another spot instance if the current spot instance is depleted. |
Thank you for the rapid response! I see. Sorry to say that's probably a deal breaker for my team. It would be impossible to babysit all training/evaluation jobs. Can you share some rough estimate on when this might be solved? |
Sure, let me check with the team what the estimates for this are |
That's great! Thank you so much for your assistance and your work on this project! |
@SebastianCallh In the meantime, may I ask what solution your team uses right now to renew the spot instances? spot.io maybe? |
Sure! Currently we are using SageMaker to provision all cloud compute |
@DavidGOrtega any news? |
@SebastianCallh We have spent two days discussing this and we made a small prototype. I can't tell you an exact day but it's close. The trick resides in our runner. |
Ideally a third job could help. A workflow for GH and GL would be:
However, this approach has two issues:
This implies that we have to provide the cleanup scripts when deploying the spot instances. These scripts only need to run the runner cleanup and restart the workflow.
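
For illustration only, here is a minimal Python sketch of what such a cleanup hook could look like on AWS, following the approach in the AWS blog post linked above: poll the instance metadata service for a spot interruption notice and, once one appears, call a cleanup callback. This is not the CML implementation. The function names (`fetch_spot_action`, `parse_action`, `watch`) and the `on_interrupt` callback are hypothetical, and the callback is only a placeholder for "run the runner cleanup and restart the workflow". IMDSv2 token handling is left out to keep the sketch short.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Optional, Tuple

# AWS instance metadata path; it answers 404 until an interruption is scheduled.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def fetch_spot_action(url: str = SPOT_ACTION_URL, timeout: float = 2.0) -> Tuple[int, str]:
    """Return (HTTP status, body). Assumption: IMDSv1 access, no session token."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:
        return err.code, ""
    except OSError:
        return 0, ""  # metadata service unreachable, e.g. not running on EC2


def parse_action(status: int, body: str) -> Optional[str]:
    """Return the scheduled action (such as "terminate" or "stop"), or None."""
    if status != 200:
        return None
    try:
        return json.loads(body).get("action")
    except (ValueError, AttributeError):
        return None


def watch(
    fetch: Callable[[], Tuple[int, str]],
    on_interrupt: Callable[[str], None],
    poll_seconds: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until an interruption notice appears, then call on_interrupt once.

    on_interrupt is a hypothetical hook: this is where the runner cleanup
    and the workflow restart would be triggered.
    Returns True if a notice was seen, False if max_polls ran out first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        action = parse_action(*fetch())
        if action is not None:
            on_interrupt(action)
            return True
        polls += 1
        sleep(poll_seconds)
    return False
```

On an instance this could be started as `watch(fetch_spot_action, my_cleanup)`, where `my_cleanup` is whatever script unregisters the runner and re-triggers the workflow. On GCP the equivalent hook would be the shutdown script mechanism linked above rather than polling.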