
Spot instances- Runner must be able to restart workflow #174

Closed
DavidGOrtega opened this issue Jul 23, 2020 · 14 comments · Fixed by #583
Assignees
Labels
cml-runner Subcommand p1-important High priority

Comments

@DavidGOrtega
Contributor

Ideally a third job could help. A workflow for GH and GL would be:

stages:
  - ml
  - check

train:
  stage: ml
  tags:
    - gpu

  cache:
    paths:
    - ./models
    
  script:
    - echo "setup a pipeline here"

check:
  stage: check
  when: on_failure
  needs:
    - train

  script:
    - echo "Restarting..."

name: cml

on: [push]

jobs:
  train:
    # needs: deploy
    runs-on: [self-hosted,gpu]

    steps:
      - uses: actions/checkout@v2

      - name: Cache multiple paths
        uses: actions/cache@v2
        with:
          path: |
            ./models
          key: models

      - name: cml_run
        shell: bash
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }} 
        run: |
          echo "setup a pipeline here"

  check:
    if: failure()
    needs: train
    runs-on: [ubuntu-latest]
    steps:
      - name: cml_check
        run: |
          echo "Restarting...."

However, this approach has two issues:

  • While in GH the loss of the runner can be recovered, ending with a failed job, in GL a job without a valid runner can run forever. I opened a ticket here
  • The biggest drawback would be restarting the workflow in a loop. Giving the runner the ability to listen for the spot instance eviction would be a better guarantee of acting properly

This implies that we have to provide cleanup scripts when deploying the spot instances; these scripts only need to run the runner cleanup and restart the workflow.
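One hypothetical shape for such a cleanup hook, sketched as a shell function: the GitHub re-run endpoint (`POST /repos/{owner}/{repo}/actions/runs/{run_id}/rerun`) is real, but the `DRY_RUN` switch and the way the repository, run id, and `repo_token` reach the script are assumptions, not actual CML behavior.

```shell
#!/bin/sh
# Hypothetical eviction cleanup hook: after the runner cleanup, ask the
# CI API to re-run the interrupted workflow. DRY_RUN and the argument
# wiring are illustrative; only the GitHub re-run endpoint itself is real.
restart_workflow() {
  repo="$1"
  run_id="$2"
  url="https://api.github.com/repos/${repo}/actions/runs/${run_id}/rerun"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "POST ${url}"  # show the request instead of sending it
  else
    curl -s -X POST -H "Authorization: token ${repo_token}" "${url}"
  fi
}
```

With `DRY_RUN=1 restart_workflow iterative/cml 123` the function only prints the request it would issue, which makes the sketch easy to try without credentials.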

@DavidGOrtega
Contributor Author

https://aws.amazon.com/blogs/compute/best-practices-for-handling-ec2-spot-instance-interruptions/

When an ECS Container Instance is interrupted, and the instance is marked as DRAINING, running tasks are stopped on the instance. When these tasks are stopped, a SIGTERM signal is sent to the running task, and ECS waits up to 2 minutes before forcefully stopping the task, resulting in a SIGKILL signal sent to the running container. This 2 minute window is configurable through a stopTimeout container timeout option, or through ECS Agent Configuration, as shown in the prior code, giving you flexibility within your container to handle the interruption. If you set this value to be greater than 120 seconds, it will not prevent your instance from being interrupted after the 2 minute warning. So, I recommend setting to be less than or equal to 120 seconds.
You can capture the SIGTERM signal within your containerized applications.
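That signal can be trapped from a plain shell wrapper around the training process. A minimal sketch, in which the checkpoint path and the log message are illustrative, not anything CML ships:

```shell
#!/bin/sh
# Sketch of a training wrapper reacting to the ~2 minute spot warning:
# trap SIGTERM, checkpoint, and exit cleanly. Paths are illustrative.
CKPT="${CKPT:-./models/last.ckpt}"

on_term() {
  # a real job would flush the current model weights to $CKPT here
  # (e.g. copy them into the cached ./models directory, or dvc push)
  echo "SIGTERM received, saving checkpoint to ${CKPT}"
  exit 0
}
trap on_term TERM
# ... long-running training command goes here ...
```

Because the checkpoint lands in the cached `./models` path, the restarted workflow can resume from it instead of training from scratch.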

@DavidGOrtega
Contributor Author

DavidGOrtega commented Jul 23, 2020

https://cloud.google.com/compute/docs/shutdownscript

gcloud compute instances create example-instance \
  --metadata-from-file shutdown-script=examples/scripts/install.sh

This would imply adding that ability in our Docker container fork
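For GitLab, such a shutdown script could drain the runner so the coordinator stops routing jobs to the dying VM. A sketch, assuming the stock `gitlab-runner` CLI (its `stop` and `unregister --all-runners` commands are real; the `RUNNER` indirection is only there to make the script easy to dry-run):

```shell
#!/bin/sh
# Hypothetical GCE shutdown script (registered with
# --metadata-from-file shutdown-script=...). GCE invokes it when the
# preemptible VM is reclaimed, shortly before power-off.
RUNNER="${RUNNER:-gitlab-runner}"

drain() {
  "$RUNNER" stop                      # stop picking up new jobs
  "$RUNNER" unregister --all-runners  # remove this VM from GitLab
}
```

The real script would end by calling `drain`; running it with `RUNNER=echo` prints the commands instead of executing them.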

@btjones-me

Hi @DavidGOrtega - was there any progress made on these? Would love to be able to use CML with spot instances.

@DavidGOrtega
Contributor Author

Hi @btjones-me we are preparing a release that allows you to deploy spot instances using our Terraform provider. But restarting the workflow to continue training is something we are still developing.

@DavidGOrtega DavidGOrtega added p1-important High priority cml-runner Subcommand labels Feb 23, 2021
@SebastianCallh

Hi @DavidGOrtega ! Can you share the progress on this please? I'm looking into using CML with DVC for a product and being able to use spot instances to train and evaluate models is pretty crucial to keep costs reasonable. Thanks!

@DavidGOrtega
Contributor Author

👋 @SebastianCallh you can use spot instances with CML; the feature we are solving here is the ability to transparently move to another spot instance if the current one is reclaimed.
That said, you could use DVC to cache the intermediate weights and restart the workflow manually.

@SebastianCallh

Thank you for the rapid response! I see. Sorry to say that's probably a deal breaker for my team. It would be impossible to babysit all training/evaluation jobs. Can you share some rough estimate on when this might be solved?

@DavidGOrtega
Contributor Author

Sure, let me check with the team what the estimate for this is.

@SebastianCallh

That's great! Thank you so much for your assistance and your work on this project!

@DavidGOrtega
Contributor Author

DavidGOrtega commented Apr 30, 2021

@SebastianCallh In the meantime, may I ask what solution your team uses right now to renew the spot instances? spot.io maybe?

@SebastianCallh

SebastianCallh commented Apr 30, 2021

Sure! Currently we are using SageMaker to provision all cloud compute.

@SebastianCallh

@DavidGOrtega any news?

@DavidGOrtega
Contributor Author

@SebastianCallh We have spent two days discussing this and made a small prototype. I can't tell you an exact day, but it's close. The trick resides in our runner.
