
Spot instances- Runner must be able to restart workflow #174

Closed
DavidGOrtega opened this issue Jul 23, 2020 · 14 comments · Fixed by #583
Assignees
Labels
cml-runner Subcommand p1-important High priority

Comments

@DavidGOrtega
Contributor

Ideally a third job could help. A workflow for GH and GL would be:

stages:
  - ml
  - check

train:
  stage: ml
  tags:
    - gpu

  cache:
    paths:
    - ./models
    
  script:
    - echo "setup a pipeline here"

check:
  stage: check
  when: on_failure
  needs:
    - train

  script:
    - echo "Restarting..."

name: cml

on: [push]

jobs:
  train:
    # needs: deploy
    runs-on: [self-hosted,gpu]

    steps:
      - uses: actions/checkout@v2

      - name: Cache multiple paths
        uses: actions/cache@v2
        with:
          path: |
            ./models
          key: models

      - name: cml_run
        shell: bash
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }} 
        run: |
          echo "setup a pipeline here"

  check:
    if: failure()
    needs: train
    runs-on: [ubuntu-latest]
    steps:
      - name: cml_check
        run: |
          echo "Restarting...."

However, this approach has two issues:

  • While in GH the loss of the runner can be recovered, ending with a failed job, in GL a job without a valid runner can run forever. I opened a ticket here
  • The biggest drawback would be restarting the workflow in a loop. Giving the runner the ability to listen for the spot instance eviction would be a better guarantee of acting properly

This implies that we have to provide cleanup scripts when deploying the spot instances; these scripts only need to run the runner cleanup and restart the workflow.
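One hypothetical shape for such a cleanup hook, sketched as a shell function: the GitHub re-run endpoint (`POST /repos/{owner}/{repo}/actions/runs/{run_id}/rerun`) is real, but the `DRY_RUN` switch and the way the repository, run id, and `repo_token` reach the script are assumptions, not actual CML behavior.

```shell
#!/bin/sh
# Hypothetical eviction cleanup hook: after the runner cleanup, ask the
# CI API to re-run the interrupted workflow. DRY_RUN and the argument
# wiring are illustrative; only the GitHub re-run endpoint itself is real.
restart_workflow() {
  repo="$1"
  run_id="$2"
  url="https://api.github.com/repos/${repo}/actions/runs/${run_id}/rerun"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "POST ${url}"  # show the request instead of sending it
  else
    curl -s -X POST -H "Authorization: token ${repo_token}" "${url}"
  fi
}
```

With `DRY_RUN=1 restart_workflow iterative/cml 123` the function only prints the request it would issue, which makes the sketch easy to try without credentials.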

@DavidGOrtega
Contributor Author

https://aws.amazon.com/blogs/compute/best-practices-for-handling-ec2-spot-instance-interruptions/

When an ECS Container Instance is interrupted, and the instance is marked as DRAINING, running tasks are stopped on the instance. When these tasks are stopped, a SIGTERM signal is sent to the running task, and ECS waits up to 2 minutes before forcefully stopping the task, resulting in a SIGKILL signal sent to the running container. This 2 minute window is configurable through a stopTimeout container timeout option, or through ECS Agent Configuration, as shown in the prior code, giving you flexibility within your container to handle the interruption. If you set this value to be greater than 120 seconds, it will not prevent your instance from being interrupted after the 2 minute warning. So, I recommend setting to be less than or equal to 120 seconds.
You can capture the SIGTERM signal within your containerized applications.
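That signal can be trapped from a plain shell wrapper around the training process. A minimal sketch, in which the checkpoint path and the log message are illustrative, not anything CML ships:

```shell
#!/bin/sh
# Sketch of a training wrapper reacting to the ~2 minute spot warning:
# trap SIGTERM, checkpoint, and exit cleanly. Paths are illustrative.
CKPT="${CKPT:-./models/last.ckpt}"

on_term() {
  # a real job would flush the current model weights to $CKPT here
  # (e.g. copy them into the cached ./models directory, or dvc push)
  echo "SIGTERM received, saving checkpoint to ${CKPT}"
  exit 0
}
trap on_term TERM
# ... long-running training command goes here ...
```

Because the checkpoint lands in the cached `./models` path, the restarted workflow can resume from it instead of training from scratch.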

@DavidGOrtega
Contributor Author

DavidGOrtega commented Jul 23, 2020

https://cloud.google.com/compute/docs/shutdownscript

gcloud compute instances create example-instance \
  --metadata-from-file shutdown-script=examples/scripts/install.sh

This would imply adding that ability in our Docker container fork
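For GitLab, such a shutdown script could drain the runner so the coordinator stops routing jobs to the dying VM. A sketch, assuming the stock `gitlab-runner` CLI (its `stop` and `unregister --all-runners` commands are real; the `RUNNER` indirection is only there to make the script easy to dry-run):

```shell
#!/bin/sh
# Hypothetical GCE shutdown script (registered with
# --metadata-from-file shutdown-script=...). GCE invokes it when the
# preemptible VM is reclaimed, shortly before power-off.
RUNNER="${RUNNER:-gitlab-runner}"

drain() {
  "$RUNNER" stop                      # stop picking up new jobs
  "$RUNNER" unregister --all-runners  # remove this VM from GitLab
}
```

The real script would end by calling `drain`; running it with `RUNNER=echo` prints the commands instead of executing them.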

@btjones-me

Hi @DavidGOrtega - was there any progress made on these? Would love to be able to use CML with spot instances.

@DavidGOrtega
Contributor Author

Hi @btjones-me we are preparing a release that allows you to deploy spot instances using our Terraform provider. But restarting the workflow to continue training is something we are still developing.

@DavidGOrtega DavidGOrtega added p1-important High priority cml-runner Subcommand labels Feb 23, 2021
@SebastianCallh

Hi @DavidGOrtega ! Can you share the progress on this please? I'm looking into using CML with DVC for a product and being able to use spot instances to train and evaluate models is pretty crucial to keep costs reasonable. Thanks!

@DavidGOrtega
Contributor Author

👋 @SebastianCallh you can use spot instances with CML; the feature we are solving here is the ability to transparently move to another spot instance if the current one is reclaimed.
That said, you could use DVC to cache the intermediate weights and restart the workflow manually.

@SebastianCallh

Thank you for the rapid response! I see. Sorry to say that's probably a deal breaker for my team. It would be impossible to babysit all training/evaluation jobs. Can you share some rough estimate on when this might be solved?

@DavidGOrtega
Contributor Author

Sure, let me check with the team what the estimate for this is.

@SebastianCallh

That's great! Thank you so much for your assistance and your work on this project!

@DavidGOrtega
Contributor Author

DavidGOrtega commented Apr 30, 2021

@SebastianCallh In the meantime, may I ask what solution your team uses right now to renew the spot instances? spot.io maybe?

@SebastianCallh

SebastianCallh commented Apr 30, 2021

Sure! Currently we are using SageMaker to provision all cloud compute.

@SebastianCallh

@DavidGOrtega any news?

@DavidGOrtega
Contributor Author

@SebastianCallh We have spent two days discussing this and made a small prototype. I can't tell you an exact day, but it's close. The trick resides in our runner.
