Matrix jobs not being picked up #3607

Closed
PerGon opened this issue Nov 13, 2023 · 9 comments

Comments


PerGon commented Nov 13, 2023

We believe this issue is actually caused by GitHub rather than by how this module works, but I wonder if someone else has faced this problem and whether there is any good workaround.

Quick specs of our setup:

  • Running version v5.3.0
  • Using ephemeral runners
  • We use GHES 3.8.11

(Let me know if additional info is necessary)

Replicate the issue

The issue can be replicated with a GHA like this:

name: test

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: [ self-hosted, test-ubuntu ]
    strategy:
      max-parallel: 1
      matrix:
        index: [0,1,2,3,4,5,6,7,8,9]
    steps:
      - run: |
          echo "Index ${{ matrix.index }}"
          echo "Sleeping 10 minutes"
          date
          sleep 600
          date

Observations

What we have observed is that GitHub sends all the queue events at once (see screenshot below), but max-parallel won't allow all jobs to run at once, so the jobs run in sequence.
But because all queue events are sent in one go, the EC2 instances are booted up immediately (even though they can't pick up the jobs yet). After a few minutes with all 10 instances booted but idle, they are terminated for being stale. This means that when the jobs are eventually ready to be executed, there are no EC2 runners available.
[screenshot: all queue events sent at once]

In the end, the matrix looks something like this:

[screenshot of the resulting matrix]

Timeline representation

[timeline diagram]

Similar action that doesn't have the same problem

This GHA is similar but doesn't have the same issue, as GitHub only sends each queue event when the corresponding job is actually ready to be executed (not all at once).

name: test

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test01:
    runs-on: [ self-hosted, test-ubuntu ]
    steps:
      - run: |
          sleep 600
  test02:
    runs-on: [ self-hosted, test-ubuntu ]
    needs: [test01]
    steps:
      - run: |
          sleep 600
  test03:
    runs-on: [ self-hosted, test-ubuntu ]
    needs: [test02]
    steps:
      - run: |
          sleep 600
  test04:
    runs-on: [ self-hosted, test-ubuntu ]
    needs: [test03]
    steps:
      - run: |
          sleep 600
  test05:
    runs-on: [ self-hosted, test-ubuntu ]
    needs: [test04]
    steps:
      - run: |
          sleep 600

We have opened a ticket with GitHub asking for clarification on whether this webhook behaviour is intended. But I'm wondering if someone else has faced this issue and whether there is any strategy in this module to handle this sort of problem.

Thanks!


toindev commented Dec 12, 2023

Hey, just to let you know we have just run into the exact same problem. A "random" occurrence: our matrix job had only 2 cases, and the first one took slightly longer than usual, so the second runner got scaled down.

This is quite an edge case. We have thought about a few possible solutions:

  • keep a runner idle all the time: not great for us yet, but we are scaling up our usage of runners on AWS, so this could become relevant
  • rework the workflow as jobs instead of a matrix
  • allow re-use of the runner

As we are about to deploy multi-runners, so as to offer configurations matching the needs of our teams, we will probably allow one of them to have re-usable runners, which should pick up the waiting matrix job immediately. In your case, if there is some set-up involved in the runner that is identical between jobs, it could even speed things up.
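
A rough sketch of what the re-usable runner option could look like in this module's configuration (assuming the enable_ephemeral_runners and minimum_running_time_in_minutes variables; check the docs of the module version you run):

module "runners" {
  source = "philips-labs/github-runner/aws"  # or your existing module source

  # ... GitHub App, VPC and the rest of the existing configuration ...

  # Non-ephemeral runners stay registered after a job finishes, so the
  # instance booted for the first matrix job can also pick up the next one.
  enable_ephemeral_runners = false

  # Keep idle instances around long enough to bridge the gap between jobs.
  minimum_running_time_in_minutes = 15
}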

In the end, our reason for having a single matrix job running at a time is an AWS rate limit on an API, but that could be circumvented now that the job is running within AWS, so that is another option we are looking at. Probably not relevant to your case though.


jizi commented Jan 5, 2024

Hi, I guess a workaround for this could be setting minimum_running_time_in_minutes to a number higher than the duration of the workflow. This should prevent the runner VMs from being scaled down.
It is of course not optimal if the individual jobs last a significant amount of time, but without changes on the GitHub side I am not sure we can do better.
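
Something along these lines (a rough sketch; the value is illustrative and assumes the example workflow above, roughly 10 jobs x 10 minutes, so adjust it to your own run time):

module "runners" {
  source = "philips-labs/github-runner/aws"  # or your existing module source

  # ... rest of the existing configuration ...

  # Longer than the whole matrix needs to drain, so the instances booted by
  # the early queued events are still alive when the later jobs become
  # eligible to run.
  minimum_running_time_in_minutes = 120
}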


rasmus commented Jan 5, 2024

FYI, we created an issue with GitHub support a while back on the inconsistency of the webhook events, i.e., that queued events are sent even though the jobs aren't really "in queue" and ready to be processed.

Unfortunately, due to the way matrixing is done, I do not see another workaround for the process you are using. If immediacy is not a hard requirement, you could put in a set delay timer, as the jobs in the queue will wait for a runner.
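
One possible reading of the delay timer idea, sketched against the module's delay_webhook_event variable (an assumption on my part; a delay inside the workflow itself would be another reading, and SQS caps message delays at 900 seconds, so this only bridges short gaps):

module "runners" {
  source = "philips-labs/github-runner/aws"  # or your existing module source

  # ... rest of the existing configuration ...

  # Delay processing of each queued webhook event before an instance is
  # created, so runners come up closer to when their job can actually start.
  delay_webhook_event = 600
}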

Another option would be to break the matrix out into individual jobs, and then set needs on each job to keep them running in order and only one at a time.


rasmus commented Jan 15, 2024

Any suggestions on how to proceed with this? We have 200+ organizations on our GitHub instance, so merely setting up persistent runner groups would increase our costs significantly.

Would it be possible to mimic the behavior of GitHub ARC (https://github.com/actions/actions-runner-controller)? There, the webhooks are used as indicators for adjusting the horizontal pod autoscaling for an organization. I know it would be a considerable feature, but it could potentially make scaling more reliable.


jizi commented Jan 15, 2024

Hi, I guess a workaround for this could be setting minimum_running_time_in_minutes to a number higher than the duration of the workflow.

Have you tried this? It seems to be working for us with ephemeral runners. We also had to configure enable_job_queued_check = false and:

      redrive_build_queue = {
        enabled         = true
        maxReceiveCount = 100
      }
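
Put together, the relevant settings look roughly like this (a sketch with illustrative values; check the variable names against the docs of the module version you run):

module "runners" {
  source = "philips-labs/github-runner/aws"  # or your existing module source

  # ... rest of the existing configuration ...

  # Keep instances alive longer than the whole workflow runs.
  minimum_running_time_in_minutes = 120

  # Create a runner for every queued event without first checking the job's
  # current status against the GitHub API.
  enable_job_queued_check = false

  # Allow events to be received up to 100 times before they are moved to the
  # dead-letter queue, so retries keep happening while jobs wait their turn.
  redrive_build_queue = {
    enabled         = true
    maxReceiveCount = 100
  }
}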


rasmus commented Jan 19, 2024

We'll give that a go. Thanks ❤️

github-actions bot commented

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the Stale label on Feb 19, 2024
github-actions bot closed this as not planned on Mar 1, 2024
mariusfilipowski commented

We'll give that a go. Thanks ❤️

Did you have any positive feedback about this change? We seem to have a similar problem in GHES 3.11

Why do we need to set enable_job_queued_check = false?


jizi commented Sep 11, 2024

In the end we set enable_job_queued_check back to true because it had some other undesired consequences. We now use ephemeral runners in combination with the pool, and this combination resolves the issue pretty well.
Also, this feature could be interesting if you want to avoid using the pool.
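
For reference, the pool we use is configured through the module's pool variables, roughly like the sketch below (assuming the pool_runner_owner and pool_config variables, with illustrative values and a hypothetical org name):

module "runners" {
  source = "philips-labs/github-runner/aws"  # or your existing module source

  # ... rest of the existing configuration ...

  enable_ephemeral_runners = true

  # Keep a small pool of pre-registered ephemeral runners for the org, so a
  # runner is already available when a held-back matrix job becomes eligible.
  pool_runner_owner = "my-org"  # hypothetical org name
  pool_config = [
    {
      schedule_expression = "cron(0/15 * * * ? *)"  # top the pool up every 15 minutes
      size                = 2
    }
  ]
}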
