
cml-runner fails to deploy runners on ec2 #741

Closed
thatGreekGuy96 opened this issue Sep 23, 2021 · 6 comments · Fixed by #739
Assignees
Labels
bug Something isn't working cloud-aws Amazon Web Services cml-runner Subcommand duplicate Déjà lu p0-critical Max priority (ASAP)

Comments

@thatGreekGuy96

Hey everyone,
A random issue started appearing yesterday and cml-runner now fails to deploy runners. It seems to coincide with the release of version 0.7.0, but switching back to 0.6.3 doesn't fix the problem, and neither does updating to 0.7.1!

The command I am running is:

name: Run-Engine-Tests

      - name: "Deploy runner on EC2"
        shell: bash
        env:
          repo_token: ${{ secrets.ACCESS_TOKEN_CML_TESTING }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_TESTING }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_TESTING }}
          CASE_NAME: ${{ matrix.case_name }}
          N_RUNNERS: ${{ fromJson(needs.setup_config.outputs.json_string).n_runners }}
          NEPTUNE_CUSTOM_RUN_ID: ${{ needs.setup_neptune_custom_run_id.outputs.neptune_custom_run_id }}

        run: |
          for (( i=1; i<=N_RUNNERS; i++ ))
          do
            echo "Deploying runner ${i}"
            cml-runner \
            --cloud aws \
            --cloud-region eu-west-2 \
            --cloud-type=m \
            --cloud-hdd-size 100 \
            --cloud-spot \
            --labels=cml-runner-${NEPTUNE_CUSTOM_RUN_ID} || exit 1 &
          done
          wait
          echo "Deployed ${N_RUNNERS} runners."
      - run: >-
          cat "$TF_LOG_PATH"

I've cut it a bit short so that you only see the relevant part. I'm also attaching the terraform logs; hopefully they help!

Looking at the EC2 console on the AWS side, I can see that the EC2 instances spin up properly, but then get shut down after about 30 seconds. On the spot requests tab, the status is displayed as terminated-by-user, so it's not AWS shutting them down.

Finally, I also noticed that the name of the runners on EC2 is now Hosted Agent, which wasn't the case before. It used to be something like iterative-<random_string>. Not sure if it's relevant, but putting it out there just in case!

1_Set up job.txt
2_Run [email protected]
3_Run [email protected]
4_Deploy runner on EC2.txt
5_Run cat $TF_LOG_PATH.txt
10_Post Run [email protected]
11_Complete job.txt

@0x2b3bfa0 0x2b3bfa0 self-assigned this Sep 23, 2021
@0x2b3bfa0 0x2b3bfa0 added bug Something isn't working duplicate Déjà lu p0-critical Max priority (ASAP) labels Sep 23, 2021
@0x2b3bfa0
Member

Duplicate of #738; thanks for the detailed report!

@thatGreekGuy96
Author

thatGreekGuy96 commented Sep 23, 2021

Ah cool, I did spot that issue but I wasn't sure if it was relevant or not! Is there anything I can do to fix it until it's sorted on your side? @0x2b3bfa0

@0x2b3bfa0
Member

@thatGreekGuy96, please run the following to confirm the issue:

      - name: "Deploy runner on EC2"
        shell: bash
        env:
          repo_token: ${{ secrets.ACCESS_TOKEN_CML_TESTING }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_TESTING }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_TESTING }}
          CASE_NAME: ${{ matrix.case_name }}
          N_RUNNERS: ${{ fromJson(needs.setup_config.outputs.json_string).n_runners }}
          NEPTUNE_CUSTOM_RUN_ID: ${{ needs.setup_neptune_custom_run_id.outputs.neptune_custom_run_id }}
        run: |
          for (( i=1; i<=N_RUNNERS; i++ ))
          do
            echo "Deploying runner ${i}"
            RUNNER_NAME= cml-runner \
            --cloud aws \
            --cloud-region eu-west-2 \
            --cloud-type=m \
            --cloud-hdd-size 100 \
            --cloud-spot \
            --labels=cml-runner-${NEPTUNE_CUSTOM_RUN_ID} || exit 1 &
          done
          wait
          echo "Deployed ${N_RUNNERS} runners."
      - run: >-
          cat "$TF_LOG_PATH"
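For reference, the `RUNNER_NAME= ` prefix in the step above is a plain POSIX per-command environment override: it sets the variable to an empty string in that one command's environment without touching the surrounding shell. A minimal sketch of the behaviour (`Hosted Agent` here is just a sample value):

```shell
# `VAR= cmd` overrides VAR only for cmd's environment;
# the surrounding shell keeps its own value.
export RUNNER_NAME="Hosted Agent"
RUNNER_NAME= sh -c 'echo "inside: <$RUNNER_NAME>"'   # prints: inside: <>
echo "outside: <$RUNNER_NAME>"                       # prints: outside: <Hosted Agent>
```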

@0x2b3bfa0
Member

Is there anything I can do to fix it until it's sorted on your side?

Yes, please try the workaround from #741 (comment) in the meantime.

@thatGreekGuy96
Author

Yup, I can confirm this fixes it! Do you mind explaining why 😅 I'm just curious!

@0x2b3bfa0 0x2b3bfa0 added cloud-aws Amazon Web Services cml-runner Subcommand labels Sep 23, 2021
@0x2b3bfa0
Copy link
Member

Yes, people at GitHub have started to use the RUNNER_NAME environment variable for their own purposes, overriding the default runner name with some creative strings that include spaces, like Hosted Agent or GitHub Actions 3 😑
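If you'd rather keep a recognizable name than clear the variable, a hypothetical sketch along these lines should also work (the `iterative-` prefix and random suffix just mimic the old default naming; this is not CML's actual naming code, and presumably any space-free name would do):

```shell
# Illustrative only: build a space-free iterative-<random_string> style
# name and export it so the runner doesn't inherit GitHub's "Hosted Agent".
suffix=$(od -An -N4 -tx4 /dev/urandom | tr -dc '0-9a-f')  # 8 random hex chars
export RUNNER_NAME="iterative-${suffix}"
echo "$RUNNER_NAME"
```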
