
cml-runner fails to deploy runners on ec2 #741

Closed
thatGreekGuy96 opened this issue Sep 23, 2021 · 6 comments · Fixed by #739
Assignees
Labels
bug Something isn't working cloud-aws Amazon Web Services cml-runner Subcommand duplicate Déjà lu p0-critical Max priority (ASAP)

Comments

@thatGreekGuy96

Hey everyone,
A random issue started appearing yesterday and cml-runner now fails to deploy runners. It seems to coincide with the release of version 0.7.0, but switching back to 0.6.3 doesn't fix the problem, and neither does updating to 0.7.1!

The command I am running is:

name: Run-Engine-Tests

      - name: "Deploy runner on EC2"
        shell: bash
        env:
          repo_token: ${{ secrets.ACCESS_TOKEN_CML_TESTING }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_TESTING }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_TESTING }}
          CASE_NAME: ${{ matrix.case_name }}
          N_RUNNERS: ${{ fromJson(needs.setup_config.outputs.json_string).n_runners }}
          NEPTUNE_CUSTOM_RUN_ID: ${{ needs.setup_neptune_custom_run_id.outputs.neptune_custom_run_id }}

        run: |
          for (( i=1; i<=N_RUNNERS; i++ ))
          do
            echo "Deploying runner ${i}"
            cml-runner \
            --cloud aws \
            --cloud-region eu-west-2 \
            --cloud-type=m \
            --cloud-hdd-size 100 \
            --cloud-spot \
            --labels=cml-runner-${NEPTUNE_CUSTOM_RUN_ID} || exit 1 &
          done
          wait
          echo "Deployed ${N_RUNNERS} runners."
      - run: >-
          cat "$TF_LOG_PATH"

I've cut it a bit short so that you only see the relevant part. I'm also attaching the terraform logs; hopefully they help!

Looking at the EC2 console on the AWS side, I can see that the EC2 instances spin up properly, but then get shut down after about 30 seconds. On the spot requests tab, the status is displayed as terminated-by-user, so it's not AWS shutting them down.

Finally, I also noticed that the name of the runners on EC2 is now Hosted Agent, which wasn't the case before. It used to be something like iterative-<random_string>. Not sure if it's relevant, but putting it out there just in case!

1_Set up job.txt
2_Run [email protected]
3_Run [email protected]
4_Deploy runner on EC2.txt
5_Run cat $TF_LOG_PATH.txt
10_Post Run [email protected]
11_Complete job.txt

@0x2b3bfa0 0x2b3bfa0 self-assigned this Sep 23, 2021
@0x2b3bfa0 0x2b3bfa0 added bug Something isn't working duplicate Déjà lu p0-critical Max priority (ASAP) labels Sep 23, 2021
@0x2b3bfa0
Member

Duplicate of #738; thanks for the detailed report!

@thatGreekGuy96
Author

thatGreekGuy96 commented Sep 23, 2021

Ah cool, I did spot that issue but I wasn't sure if it was relevant or not! Is there anything I can do to fix it until it's sorted on your side? @0x2b3bfa0

@0x2b3bfa0
Member

@thatGreekGuy96, please run the following to confirm the issue:

      - name: "Deploy runner on EC2"
        shell: bash
        env:
          repo_token: ${{ secrets.ACCESS_TOKEN_CML_TESTING }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_TESTING }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_TESTING }}
          CASE_NAME: ${{ matrix.case_name }}
          N_RUNNERS: ${{ fromJson(needs.setup_config.outputs.json_string).n_runners }}
          NEPTUNE_CUSTOM_RUN_ID: ${{ needs.setup_neptune_custom_run_id.outputs.neptune_custom_run_id }}
        run: |
          for (( i=1; i<=N_RUNNERS; i++ ))
          do
            echo "Deploying runner ${i}"
            RUNNER_NAME= cml-runner \
            --cloud aws \
            --cloud-region eu-west-2 \
            --cloud-type=m \
            --cloud-hdd-size 100 \
            --cloud-spot \
            --labels=cml-runner-${NEPTUNE_CUSTOM_RUN_ID} || exit 1 &
          done
          wait
          echo "Deployed ${N_RUNNERS} runners."
      - run: >-
          cat "$TF_LOG_PATH"
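For reference, the `RUNNER_NAME= ` prefix in the step above is a plain POSIX per-command environment override: it sets the variable to an empty string in that one command's environment without touching the surrounding shell. A minimal sketch of the behaviour (`Hosted Agent` here is just a sample value):

```shell
# `VAR= cmd` overrides VAR only for cmd's environment;
# the surrounding shell keeps its own value.
export RUNNER_NAME="Hosted Agent"
RUNNER_NAME= sh -c 'echo "inside: <$RUNNER_NAME>"'   # prints: inside: <>
echo "outside: <$RUNNER_NAME>"                       # prints: outside: <Hosted Agent>
```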

@0x2b3bfa0
Member

Is there anything I can do to fix it until it's sorted on your side?

Yes, please try the workaround from #741 (comment) in the meantime.

@thatGreekGuy96
Author

Yup, I can confirm this fixes it! Do you mind explaining why 😅 I'm just curious!

@0x2b3bfa0 0x2b3bfa0 added cloud-aws Amazon Web Services cml-runner Subcommand labels Sep 23, 2021
@0x2b3bfa0
Copy link
Member

Yes, people at GitHub have started to use the RUNNER_NAME environment variable for their own purposes, overriding the default runner name with some creative strings that include spaces, like Hosted Agent or GitHub Actions 3 😑
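If you'd rather keep a recognizable name than clear the variable, a hypothetical sketch along these lines should also work (the `iterative-` prefix and random suffix just mimic the old default naming; this is not CML's actual naming code, and presumably any space-free name would do):

```shell
# Illustrative only: build a space-free iterative-<random_string> style
# name and export it so the runner doesn't inherit GitHub's "Hosted Agent".
suffix=$(od -An -N4 -tx4 /dev/urandom | tr -dc '0-9a-f')  # 8 random hex chars
export RUNNER_NAME="iterative-${suffix}"
echo "$RUNNER_NAME"
```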
