Matrix jobs not being picked up #3607
Comments
Hey, just to let you know we have just run into the exact same problem. A "random" occurrence, because our matrix job had only 2 cases and the first one took slightly longer than usual, so the second runner got scaled down. This is quite an edge case. We have thought about a few possible solutions:
We are about to deploy using multi-runners, so as to offer configurations matching the needs of our teams, and we will probably allow one of those configurations to have re-usable runners, which should pick up the waiting matrix job immediately. In your case, if there is some set-up involved in the runner that is identical between jobs, it could even speed things up. In the end, our reason for only letting a single matrix job run is an AWS rate limit on an API, but that could be circumvented now that the job is running within AWS, so that is another option we are looking at. Probably not relevant to your case though.
Hi, I guess a workaround for this could be setting of
FYI, we created an issue with GitHub support a while back on the inconsistency of the webhook events, i.e., that
Any suggestions on how to proceed with this? We have 200+ organizations on our GitHub instance, so merely setting up persistent runner groups would increase our costs significantly. Would it be possible to mimic the behavior of GitHub ARC (https://github.com/actions/actions-runner-controller)? There the webhooks are used as indicators for adjusting the horizontal pod autoscaling for an organization. I know it would be a considerable feature, but it could potentially make scaling more reliable.
Have you tried this? It seems to be working for us with ephemeral runners. Also, we had to configure
We'll give that a go. Thanks ❤️
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.
Did you have any positive feedback about this change? We seem to have a similar problem on GHES 3.11. Why do we need to set `enable_job_queued_check = false`?
In the end we set
We believe this issue is actually caused by GitHub rather than by how this module works. But I wonder if someone else has faced this problem and if there's any good workaround.
Quick specs of our setup:
- v5.3.0
- ephemeral runners
- GHES 3.8.11

(Let me know if additional info is necessary)
Replicate the issue
The issue can be replicated with a GHA like this:
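The original workflow isn't reproduced here, so as a rough sketch of what we mean: a matrix of ~10 jobs throttled to one at a time with `max-parallel`. The job names, sleep duration, and runner label are placeholders, not our exact configuration.

```yaml
name: matrix-repro
on: workflow_dispatch

jobs:
  build:
    runs-on: [self-hosted]            # label of the ephemeral runners (placeholder)
    strategy:
      max-parallel: 1                 # only one matrix job may run at a time
      matrix:
        index: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    steps:
      # Keep each job busy long enough that the later jobs stay queued
      # while the already-booted instances sit idle.
      - name: Simulate work
        run: sleep 300
```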
Observations
What we have observed is that GitHub sends the `queue` events all at once (see screenshot below), but `max-parallel` won't allow all jobs to run at once, so the jobs run in sequence. Because all `queue` events are sent in one go, the EC2 instances are booted up immediately (even though they can't pick up the jobs). After a few minutes with all 10 instances booted, they are terminated for being stale. This means that by the time the jobs are eventually ready to be executed, there are no EC2 runners available.

In the end, the matrix looks something like this:
Timeline representation
Similar action that doesn't have the same problem
This GHA is similar, but doesn't have the same issue, as GH only sends the `queue` event when each job is actually meant to be executed (not all at once); a sketch of what we mean is included below.

We have opened a ticket with GitHub asking for clarification on whether the way the webhook calls are handled is intended. But I'm wondering if someone else has faced this issue and if there is any strategy in this module to handle this sort of problem.
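The workflow itself isn't shown above, so this is only a guess at its shape: assuming the jobs are chained with `needs` instead of being throttled by `max-parallel`, each job is queued (and its webhook sent) only once the previous job has finished. Job names, sleep duration, and the runner label are placeholders.

```yaml
name: chained-jobs
on: workflow_dispatch

jobs:
  step-1:
    runs-on: [self-hosted]            # placeholder runner label
    steps:
      - run: sleep 300
  step-2:
    needs: step-1                     # queued only after step-1 completes,
    runs-on: [self-hosted]            # so its queue event arrives when it can actually run
    steps:
      - run: sleep 300
  step-3:
    needs: step-2
    runs-on: [self-hosted]
    steps:
      - run: sleep 300
```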
Thanks!