
Option to limit runner to a particular run or job #620

Open
simonbyrne opened this issue Jul 29, 2020 · 8 comments
Labels
enhancement New feature or request

Comments


simonbyrne commented Jul 29, 2020

Describe the enhancement

In addition to supporting the --once option (#510), it would be useful if a runner could be limited to a specific run or job, exiting once that job completes (or immediately, if the job has been cancelled or is otherwise no longer queued).

A typical use case would be autoscaling: either via a webhook or by polling the workflow runs API, I would start up the necessary number of runners for the queued jobs/runs.

Code Snippet

./run.sh --job=<job_id> # run job matching job_id, or exit if complete/cancelled

Additional information
Similar to the Buildkite --acquire-job option.
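A rough sketch of how an autoscaler could drive the proposed flag (note the `--job` flag does not exist yet; `DRY_RUN` and `spawn_runner_for_job` are illustrative names, and the hard-coded job IDs stand in for IDs taken from webhooks or the workflow runs API):

```shell
#!/bin/sh
# Hypothetical sketch: launch one runner per queued job using the
# proposed --job flag. DRY_RUN=1 only prints the command.
DRY_RUN=${DRY_RUN:-1}

spawn_runner_for_job() {
    job_id=$1
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: ./run.sh --job=$job_id"
    else
        # Runner would exit on its own once the job completes,
        # or immediately if the job was cancelled.
        ./run.sh --job="$job_id" &
    fi
}

# Job IDs would come from workflow_job "queued" webhooks or from
# GET /repos/{owner}/{repo}/actions/runs?status=queued
for id in 1001 1002; do
    spawn_runner_for_job "$id"
done
```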

@simonbyrne simonbyrne added the enhancement New feature or request label Jul 29, 2020
@nikola-jokic nikola-jokic added the Runner Feature Feature scope to the runner label Mar 4, 2022

jgoux commented Mar 8, 2022

It would be invaluable for us as well. Right now we have to maintain a "garbage collector" lambda which kills runners older than X hours when there are zombie runners.

@kmaehashi

Cross-linking the discussion: community/community#19784

@alexellis

I didn't find this issue when searching, probably due to the language used in this issue.

Please see also: Tie a specific workflow job to a specific ephemeral runner via labels for a more detailed explanation of the problem this solves.


arianvp commented Dec 9, 2023

Without this it's basically impossible to use jit_config without getting a lot of zombie instances when workflows get cancelled. I don't see how you can reliably use just-in-time runners without this feature. Please, this would be great to have!

Edit: another solution would be to unconditionally terminate the instance when you receive a completed workflow_job webhook. That would clean up cancelled runs as well.
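The fallback above could be sketched like this (`handle_webhook` and `terminate_instance` are hypothetical names; the webhook `action` value "completed" is real and is sent for success, failure, and cancellation alike):

```shell
#!/bin/sh
# Sketch: on every workflow_job webhook with action "completed",
# terminate the runner's instance unconditionally.

terminate_instance() {
    # Placeholder: a real autoscaler would call its cloud API here,
    # e.g. aws ec2 terminate-instances --instance-ids ...
    echo "terminating instance for runner $1"
}

handle_webhook() {
    action=$1
    runner_name=$2
    # "completed" covers cancelled runs too, so zombies get cleaned up.
    if [ "$action" = completed ]; then
        terminate_instance "$runner_name"
    fi
}

handle_webhook completed runner-42
```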


bduffany commented Feb 21, 2024

Without this it's basically impossible to use jit_config without getting a lot of zombie instances when workflows get cancelled

I agree with this, and would like to mention the experience that I've had so far trying to work with the JIT runner feature.

My first approach was:

  • Listen for workflow_job queued events
  • On each event, create a jit config and spawn a runner with the config.

In this approach, instead of matching job IDs to runner IDs (which is currently not supported), we spawn N runners for N jobs, and any runner can take any job. So the total count of queued events we receive should match the total count of runners we spawn.
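The two steps above could be sketched roughly like this (the `generate-jitconfig` endpoint and `./run.sh --jitconfig` are real; `on_job_queued`, `DRY_RUN`, `OWNER`, and `REPO` are illustrative, and the snippet defaults to a dry run so it is self-contained):

```shell
#!/bin/sh
# Sketch: on each workflow_job "queued" event, mint a JIT config and
# start a runner. Any runner can then take any queued job.
DRY_RUN=${DRY_RUN:-1}

on_job_queued() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would spawn one JIT runner"
        return
    fi
    # Real endpoint: POST /repos/{owner}/{repo}/actions/runners/generate-jitconfig
    cfg=$(gh api -X POST "repos/$OWNER/$REPO/actions/runners/generate-jitconfig" \
        -f name="runner-$$" -F runner_group_id=1 -f 'labels[]=self-hosted')
    ./run.sh --jitconfig "$(echo "$cfg" | jq -r .encoded_jit_config)" &
}

on_job_queued
```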

But this approach is unreliable because:

  • If a runner crashes, we are now permanently short one runner and there will always be a job stuck in "queued" state until the next job comes in.
  • If a job is cancelled before a runner claims it, we will always have an extra runner that sits around for a very long time, consuming valuable compute resources.
  • Both of these above issues compound over time, either resulting in many stuck jobs or many "zombie" runners as another commenter calls them :)

I thought of a solution for these issues, but it is prohibitively complicated:

  • Have a worker that periodically monitors the number of runners running vs the number of jobs queued. If the number of runners is less than the number of available jobs, spawn a new runner.
    • This monitoring is not straightforward to implement; we need to do it in such a way to avoid consuming excessive GitHub API quota, while also being immediately responsive to webhook events.
  • When starting the runner, also spawn a separate process which monitors the runner and does the following: if the runner does not claim a job after a certain (short) timeout, e.g. 5 minutes, then kill it.
    • This comment suggested that we could do this by always sending SIGINT to the runner process after a certain timeframe (edit: this is wrong, see Option to limit runner to a particular run or job #620 (comment)).
    • This approach means the runner does not get automatically unregistered (until after 30 days or something?), which partly defeats the purpose of the "JIT" runner approach.
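A rough sketch of that watchdog, under stated assumptions (a `sleep` stub stands in for `./run.sh --jitconfig …` so the snippet runs standalone; checking whether the runner has actually claimed a job is left out; and, as noted above, SIGINT leaves the runner registered, so this is not a complete solution):

```shell
#!/bin/sh
# Sketch: start the runner, and SIGINT it if it is still running
# after CLAIM_TIMEOUT seconds (e.g. 300 for the 5 minutes above).
# Real command would be: ./run.sh --jitconfig "$JIT_CONFIG"
RUNNER_CMD=${RUNNER_CMD:-"sleep 1"}   # stub so the sketch is runnable
CLAIM_TIMEOUT=${CLAIM_TIMEOUT:-2}     # short value for the demo

$RUNNER_CMD &
runner_pid=$!

(
    sleep "$CLAIM_TIMEOUT"
    # A real watchdog would first check whether the runner has
    # claimed a job (e.g. by scraping its log); skipped here.
    kill -INT "$runner_pid" 2>/dev/null
) &
watchdog_pid=$!

wait "$runner_pid"
kill "$watchdog_pid" 2>/dev/null   # runner exited first; cancel watchdog
```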

I wonder if the GitHub team had a different use case in mind when designing jit runners? The current jit design has led me to a pretty complicated design in order to make it work reliably for my use case, and I wonder if I am missing something obvious in how jit runners are supposed to be used.

I think the --job flag would solve most (all?) of these issues:

  • If a runner has crashed or did not get started for some reason, we can simply restart the runner with a new jitconfig and the same --job flag. (My mental model might be off here: I'm not exactly sure what happens if a runner crashes mid-run. Does that permanently fail the job? But at least this would handle the case of the runner crashing before it accepts the job.)
    • We can implement this with a simple table of "job" entries, where we insert an entry when we get a queued event. When a runner completes, we mark the associated job complete in the table as well. No need to poll the GitHub API and consume API quota.
  • If the job does not exist by the time the runner starts (e.g. if it's cancelled) then the runner can immediately exit, resulting in minimal wasted compute resources.
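A minimal sketch of that job table (the flat-file format and all function names are assumptions): insert a row on each queued webhook, mark it complete when the runner for that job exits, and restart runners for anything still pending.

```shell
#!/bin/sh
# Sketch: one line per job, "<job_id> <state>".
JOB_TABLE=${JOB_TABLE:-$(mktemp)}

record_queued() { echo "$1 queued" >> "$JOB_TABLE"; }

mark_complete() {
    # rewrite the matching line in place
    tmp=$(mktemp)
    sed "s/^$1 queued$/$1 complete/" "$JOB_TABLE" > "$tmp" && mv "$tmp" "$JOB_TABLE"
}

# Jobs still needing a runner; each could be retried with a fresh
# jitconfig and the proposed --job flag.
pending_jobs() { awk '$2 == "queued" { print $1 }' "$JOB_TABLE"; }

record_queued 111
record_queued 222
mark_complete 111
pending_jobs   # prints: 222
```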

@ChristopherHX
Contributor

I also agree we need service support; I've never built a full autoscaler myself.

ChristopherHX/github-act-runner#60 (comment) suggests that we can do this by just always sending SIGINT to the runner process after a certain timeframe, but it has the downside that the runner does not get automatically unregistered, which largely defeats the purpose of the "JIT" runner approach.

This doesn't apply to actions/runner, that's an implementation detail of the runner I have written from scratch.

There are (internal) Actions APIs that are not covered by any rate limit I'm aware of.

@alexellis

I landed back here again. Does the GitHub runner team have an update? Would this be something that you would consider?

@hanwen-flow

Is someone from GitHub still reading this?

I am also interested in this feature request. In addition to the reasons mentioned by the previous posters, even within a set of uniform runners there can be per-runner cache affinity: in build systems, the snapshot of a successful post-submit build provides a great cache seed for an incremental pre-submit test run targeting the same branch. Exploiting this requires knowing something about the job (e.g. the branch it targets) so we can set up the right snapshot before kicking off the agent.
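The cache-affinity idea above could be sketched like this (all paths and names are assumptions; the target branch would come from the workflow_job webhook payload):

```shell
#!/bin/sh
# Sketch: pick a cache seed snapshot matching the job's target branch,
# falling back to the default-branch post-submit snapshot.

snapshot_for_branch() {
    branch=$1
    candidate="/snapshots/$branch.img"
    if [ -e "$candidate" ]; then
        echo "$candidate"
    else
        echo "/snapshots/main.img"   # seed from the default branch
    fi
}

# The chosen snapshot would be attached before starting the runner.
snapshot_for_branch feature-x
```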

9 participants