Matrix jobs not being picked up #3607

Closed
PerGon opened this issue Nov 13, 2023 · 9 comments

Comments


PerGon commented Nov 13, 2023

We believe this issue is actually caused by GitHub rather than by how this module works, but I wonder if someone else has faced this problem and whether there is any good workaround.

Quick specs of our setup:

  • Running version v5.3.0
  • Using ephemeral runners
  • We use GHES 3.8.11

(Let me know if additional info is necessary)

Replicate the issue

The issue can be replicated with a GHA like this:

name: test

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: [ self-hosted, test-ubuntu ]
    strategy:
      max-parallel: 1
      matrix:
        index: [0,1,2,3,4,5,6,7,8,9]
    steps:
      - run: |
          echo "Index ${{ matrix.index }}"
          echo "Sleeping 10 minutes"
          date
          sleep 600
          date

Observations

What we have observed is that GitHub sends all the queue events at once (see screenshot below), but max-parallel won't allow all jobs to run at once, so the jobs run in sequence.
But because all queue events are sent in one go, the EC2 instances are booted up immediately (even though they can't pick up the jobs yet). After a few minutes with all 10 instances booted but idle, they are terminated for being stale. This means that when the jobs are eventually ready to be executed, there are no EC2 runners available.
[screenshot: all queue events sent at once]

In the end, the matrix looks something like this:

[screenshot of the resulting matrix]

Timeline representation

[timeline diagram]

Similar action that doesn't have the same problem

This GHA is similar but doesn't have the same issue, as GitHub only sends each queue event when the corresponding job is actually ready to be executed (not all at once).

name: test

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test01:
    runs-on: [ self-hosted, test-ubuntu ]
    steps:
      - run: |
          sleep 600
  test02:
    runs-on: [ self-hosted, test-ubuntu ]
    needs: [test01]
    steps:
      - run: |
          sleep 600
  test03:
    runs-on: [ self-hosted, test-ubuntu ]
    needs: [test02]
    steps:
      - run: |
          sleep 600
  test04:
    runs-on: [ self-hosted, test-ubuntu ]
    needs: [test03]
    steps:
      - run: |
          sleep 600
  test05:
    runs-on: [ self-hosted, test-ubuntu ]
    needs: [test04]
    steps:
      - run: |
          sleep 600

We have opened a ticket with GitHub asking for clarification on whether this webhook behaviour is intended. But I'm wondering if someone else has faced this issue and whether there is any strategy in this module to handle this sort of problem.

Thanks!


toindev commented Dec 12, 2023

Hey, just to let you know we have just run into the exact same problem. A "random" occurrence: our matrix job had only 2 cases, and the first one took slightly longer than usual, so the second runner got scaled down.

This is quite an edge case. We have thought about a few possible solutions:

  • keep a runner idle all the time: not great for us yet, but we are scaling up our usage of runners on AWS, so this could become relevant
  • rework the workflow as jobs instead of a matrix
  • allow re-use of the runner

As we are about to deploy multi-runners, so as to offer configurations matching the needs of our teams, we will probably allow one of them to have re-usable runners, which should pick up the waiting matrix job immediately. In your case, if there is some set-up involved in the runner that is identical between jobs, it could even speed things up.
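
A rough sketch of what the re-usable runner option could look like in this module's configuration (assuming the enable_ephemeral_runners and minimum_running_time_in_minutes variables; check the docs of the module version you run):

module "runners" {
  source = "philips-labs/github-runner/aws"  # or your existing module source

  # ... GitHub App, VPC and the rest of the existing configuration ...

  # Non-ephemeral runners stay registered after a job finishes, so the
  # instance booted for the first matrix job can also pick up the next one.
  enable_ephemeral_runners = false

  # Keep idle instances around long enough to bridge the gap between jobs.
  minimum_running_time_in_minutes = 15
}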

In the end, our reason for having a single matrix job running at a time is an AWS rate limit on an API, but that could be circumvented now that the job is running within AWS, so that is another option we are looking at. Probably not relevant to your case though.


jizi commented Jan 5, 2024

Hi, I guess a workaround for this could be setting minimum_running_time_in_minutes to a number higher than the duration of the workflow. This should prevent the runner VMs from being scaled down.
It is of course not optimal if the individual jobs last a significant amount of time, but without changes on the GitHub side I am not sure we can do better.
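
Something along these lines (a rough sketch; the value is illustrative and assumes the example workflow above, roughly 10 jobs x 10 minutes, so adjust it to your own run time):

module "runners" {
  source = "philips-labs/github-runner/aws"  # or your existing module source

  # ... rest of the existing configuration ...

  # Longer than the whole matrix needs to drain, so the instances booted by
  # the early queued events are still alive when the later jobs become
  # eligible to run.
  minimum_running_time_in_minutes = 120
}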


rasmus commented Jan 5, 2024

FYI, we created an issue with GitHub support a while back on the inconsistency of the webhook events, i.e., that queued events are sent even though the jobs aren't really "in queue" and ready to be processed.

Unfortunately, due to the way matrixing is done, I do not see another workaround for the process you are using. If immediacy is not a hard requirement, you could put in a set delay timer, as the jobs in the queue will wait for a runner.
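
One possible reading of the delay timer idea, sketched against the module's delay_webhook_event variable (an assumption on my part; a delay inside the workflow itself would be another reading, and SQS caps message delays at 900 seconds, so this only bridges short gaps):

module "runners" {
  source = "philips-labs/github-runner/aws"  # or your existing module source

  # ... rest of the existing configuration ...

  # Delay processing of each queued webhook event before an instance is
  # created, so runners come up closer to when their job can actually start.
  delay_webhook_event = 600
}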

Another option would be to break the matrix out into individual jobs, and then set needs on each job to keep them running in order and only one at a time.


rasmus commented Jan 15, 2024

Any suggestions on how to proceed with this? We have 200+ organizations on our GitHub instance, so merely setting up persistent runner groups would increase our costs significantly.

Would it be possible to mimic the behavior of GitHub ARC (https://github.com/actions/actions-runner-controller)? There, the webhooks are used as indicators for adjusting the horizontal pod autoscaling for an organization. I know it would be a considerable feature, but it could potentially make scaling more reliable.


jizi commented Jan 15, 2024

Hi, I guess a workaround for this could be setting minimum_running_time_in_minutes to a number higher than the duration of the workflow.

Have you tried this? It seems to be working for us with ephemeral runners. We also had to configure enable_job_queued_check = false and:

      redrive_build_queue = {
        enabled         = true
        maxReceiveCount = 100
      }
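
Put together, the relevant settings look roughly like this (a sketch with illustrative values; check the variable names against the docs of the module version you run):

module "runners" {
  source = "philips-labs/github-runner/aws"  # or your existing module source

  # ... rest of the existing configuration ...

  # Keep instances alive longer than the whole workflow runs.
  minimum_running_time_in_minutes = 120

  # Create a runner for every queued event without first checking the job's
  # current status against the GitHub API.
  enable_job_queued_check = false

  # Allow events to be received up to 100 times before they are moved to the
  # dead-letter queue, so retries keep happening while jobs wait their turn.
  redrive_build_queue = {
    enabled         = true
    maxReceiveCount = 100
  }
}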


rasmus commented Jan 19, 2024

We'll give that a go. Thanks ❤️

github-actions bot commented

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the Stale label on Feb 19, 2024
github-actions bot closed this as not planned on Mar 1, 2024
mariusfilipowski commented

We'll give that a go. Thanks ❤️

Did you have any positive feedback about this change? We seem to have a similar problem in GHES 3.11

Why do we need to set enable_job_queued_check = false?


jizi commented Sep 11, 2024

In the end we set enable_job_queued_check back to true because it had some other undesired consequences. We now use ephemeral runners in combination with the pool, and this combination resolves the issue pretty well.
Also, this feature could be interesting if you want to avoid using the pool.
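
For reference, the pool we use is configured through the module's pool variables, roughly like the sketch below (assuming the pool_runner_owner and pool_config variables, with illustrative values and a hypothetical org name):

module "runners" {
  source = "philips-labs/github-runner/aws"  # or your existing module source

  # ... rest of the existing configuration ...

  enable_ephemeral_runners = true

  # Keep a small pool of pre-registered ephemeral runners for the org, so a
  # runner is already available when a held-back matrix job becomes eligible.
  pool_runner_owner = "my-org"  # hypothetical org name
  pool_config = [
    {
      schedule_expression = "cron(0/15 * * * ? *)"  # top the pool up every 15 minutes
      size                = 2
    }
  ]
}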
