
Option to limit runner to a particular run or job #620

Open
simonbyrne opened this issue Jul 29, 2020 · 8 comments
Labels
enhancement New feature or request

Comments


simonbyrne commented Jul 29, 2020

Describe the enhancement

In addition to supporting the --once option (#510), it would be useful if a runner could be limited to a specific run or job, exiting once that job completes (or immediately, if the job has been cancelled or is otherwise no longer queued).

A typical use case would be autoscaling: either via a webhook or by polling the workflow runs API, I would start up the necessary number of runners for the queued jobs/runs.

Code Snippet

./run.sh --job=<job_id> # run job matching job_id, or exit if complete/cancelled

Additional information
Similar to the Buildkite --acquire-job option.
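A rough sketch of how an autoscaler could drive the proposed flag (note the `--job` flag does not exist yet; `DRY_RUN` and `spawn_runner_for_job` are illustrative names, and the hard-coded job IDs stand in for IDs taken from webhooks or the workflow runs API):

```shell
#!/bin/sh
# Hypothetical sketch: launch one runner per queued job using the
# proposed --job flag. DRY_RUN=1 only prints the command.
DRY_RUN=${DRY_RUN:-1}

spawn_runner_for_job() {
    job_id=$1
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: ./run.sh --job=$job_id"
    else
        # Runner would exit on its own once the job completes,
        # or immediately if the job was cancelled.
        ./run.sh --job="$job_id" &
    fi
}

# Job IDs would come from workflow_job "queued" webhooks or from
# GET /repos/{owner}/{repo}/actions/runs?status=queued
for id in 1001 1002; do
    spawn_runner_for_job "$id"
done
```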

@simonbyrne simonbyrne added the enhancement New feature or request label Jul 29, 2020
@nikola-jokic nikola-jokic added the Runner Feature Feature scope to the runner label Mar 4, 2022

jgoux commented Mar 8, 2022

It would be invaluable for us as well. Right now we have to maintain a "garbage collector" lambda which kills runners older than X hours when there are zombie runners.

@kmaehashi

Cross-linking the discussion: community/community#19784

@alexellis

I didn't find this issue when searching, probably due to the language used in this issue.

Please see also: Tie a specific workflow job to a specific ephemeral runner via labels for a more detailed explanation of the problem this solves.


arianvp commented Dec 9, 2023

Without this it's basically impossible to use jit_config without getting a lot of zombie instances when workflows get cancelled. I don't see how you can reliably use just-in-time runners without this feature. Please, this would be great to have!

Edit: another solution would be to unconditionally terminate the instance when you receive a completed workflow_job webhook. That would clean up cancelled runs as well.
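The fallback above could be sketched like this (`handle_webhook` and `terminate_instance` are hypothetical names; the webhook `action` value "completed" is real and is sent for success, failure, and cancellation alike):

```shell
#!/bin/sh
# Sketch: on every workflow_job webhook with action "completed",
# terminate the runner's instance unconditionally.

terminate_instance() {
    # Placeholder: a real autoscaler would call its cloud API here,
    # e.g. aws ec2 terminate-instances --instance-ids ...
    echo "terminating instance for runner $1"
}

handle_webhook() {
    action=$1
    runner_name=$2
    # "completed" covers cancelled runs too, so zombies get cleaned up.
    if [ "$action" = completed ]; then
        terminate_instance "$runner_name"
    fi
}

handle_webhook completed runner-42
```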


bduffany commented Feb 21, 2024

Without this it's basically impossible to use jit_config without getting a lot of zombie instances when workflows get cancelled

I agree with this, and would like to mention the experience that I've had so far trying to work with the JIT runner feature.

My first approach was:

  • Listen for workflow_job queued events
  • On each event, create a jit config and spawn a runner with the config.

In this approach, instead of matching job IDs to runner IDs (which is currently not supported), we spawn N runners for N jobs, and any runner can take any job. So the total count of queued events we receive should match the total count of runners we spawn.
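The two steps above could be sketched roughly like this (the `generate-jitconfig` endpoint and `./run.sh --jitconfig` are real; `on_job_queued`, `DRY_RUN`, `OWNER`, and `REPO` are illustrative, and the snippet defaults to a dry run so it is self-contained):

```shell
#!/bin/sh
# Sketch: on each workflow_job "queued" event, mint a JIT config and
# start a runner. Any runner can then take any queued job.
DRY_RUN=${DRY_RUN:-1}

on_job_queued() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would spawn one JIT runner"
        return
    fi
    # Real endpoint: POST /repos/{owner}/{repo}/actions/runners/generate-jitconfig
    cfg=$(gh api -X POST "repos/$OWNER/$REPO/actions/runners/generate-jitconfig" \
        -f name="runner-$$" -F runner_group_id=1 -f 'labels[]=self-hosted')
    ./run.sh --jitconfig "$(echo "$cfg" | jq -r .encoded_jit_config)" &
}

on_job_queued
```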

But this approach is unreliable because:

  • If a runner crashes, we are now permanently short one runner and there will always be a job stuck in "queued" state until the next job comes in.
  • If a job is cancelled before a runner claims it, we will always have an extra runner that sits around for a very long time, consuming valuable compute resources.
  • Both of these above issues compound over time, either resulting in many stuck jobs or many "zombie" runners as another commenter calls them :)

I thought of a solution for these issues, but it is prohibitively complicated:

  • Have a worker that periodically monitors the number of runners running vs the number of jobs queued. If the number of runners is less than the number of available jobs, spawn a new runner.
    • This monitoring is not straightforward to implement; we need to do it in such a way to avoid consuming excessive GitHub API quota, while also being immediately responsive to webhook events.
  • When starting the runner, also spawn a separate process which monitors the runner and does the following: if the runner does not claim a job after a certain (short) timeout, e.g. 5 minutes, then kill it.
    • This comment suggested that we could do this by always sending SIGINT to the runner process after a certain timeframe (edit: this is wrong, see Option to limit runner to a particular run or job #620 (comment)).
    • This approach means the runner does not get automatically unregistered (until after 30 days or something?), which partly defeats the purpose of the "JIT" runner approach.
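A rough sketch of that watchdog, under stated assumptions (a `sleep` stub stands in for `./run.sh --jitconfig …` so the snippet runs standalone; checking whether the runner has actually claimed a job is left out; and, as noted above, SIGINT leaves the runner registered, so this is not a complete solution):

```shell
#!/bin/sh
# Sketch: start the runner, and SIGINT it if it is still running
# after CLAIM_TIMEOUT seconds (e.g. 300 for the 5 minutes above).
# Real command would be: ./run.sh --jitconfig "$JIT_CONFIG"
RUNNER_CMD=${RUNNER_CMD:-"sleep 1"}   # stub so the sketch is runnable
CLAIM_TIMEOUT=${CLAIM_TIMEOUT:-2}     # short value for the demo

$RUNNER_CMD &
runner_pid=$!

(
    sleep "$CLAIM_TIMEOUT"
    # A real watchdog would first check whether the runner has
    # claimed a job (e.g. by scraping its log); skipped here.
    kill -INT "$runner_pid" 2>/dev/null
) &
watchdog_pid=$!

wait "$runner_pid"
kill "$watchdog_pid" 2>/dev/null   # runner exited first; cancel watchdog
```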

I wonder if the GitHub team had a different use case in mind when designing jit runners? The current jit design has led me to a pretty complicated design in order to make it work reliably for my use case, and I wonder if I am missing something obvious in how jit runners are supposed to be used.

I think the --job flag would solve most (all?) of these issues:

  • If a runner has crashed or did not get started for some reason, we can simply restart the runner with a new jitconfig and the same --job flag. (My mental model might be off here: I'm not exactly sure what happens if a runner crashes mid-run. Does that permanently fail the job? But at least this would handle the case of the runner crashing before it accepts the job.)
    • We can implement this with a simple table of "job" entries, where we insert an entry when we get a queued event. When a runner completes, we mark the associated job complete in the table as well. No need to poll the GitHub API and consume API quota.
  • If the job does not exist by the time the runner starts (e.g. if it's cancelled) then the runner can immediately exit, resulting in minimal wasted compute resources.
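A minimal sketch of that job table (the flat-file format and all function names are assumptions): insert a row on each queued webhook, mark it complete when the runner for that job exits, and restart runners for anything still pending.

```shell
#!/bin/sh
# Sketch: one line per job, "<job_id> <state>".
JOB_TABLE=${JOB_TABLE:-$(mktemp)}

record_queued() { echo "$1 queued" >> "$JOB_TABLE"; }

mark_complete() {
    # rewrite the matching line in place
    tmp=$(mktemp)
    sed "s/^$1 queued$/$1 complete/" "$JOB_TABLE" > "$tmp" && mv "$tmp" "$JOB_TABLE"
}

# Jobs still needing a runner; each could be retried with a fresh
# jitconfig and the proposed --job flag.
pending_jobs() { awk '$2 == "queued" { print $1 }' "$JOB_TABLE"; }

record_queued 111
record_queued 222
mark_complete 111
pending_jobs   # prints: 222
```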

@ChristopherHX
Contributor

I also agree we need service support; I've never built a full autoscaler myself.

ChristopherHX/github-act-runner#60 (comment) suggests that we can do this by just always sending SIGINT to the runner process after a certain timeframe, but it has the downside that the runner does not get automatically unregistered, which largely defeats the purpose of the "JIT" runner approach.

This doesn't apply to actions/runner, that's an implementation detail of the runner I have written from scratch.

There are (internal) Actions APIs that are not covered by any rate limit I'm aware of.

@alexellis

I landed back here again. Does the GitHub runner team have an update? Would this be something that you would consider?

@hanwen-flow

Is someone from GitHub still reading this?

I am also interested in this feature request. In addition to the reasons mentioned by the previous posters, even within a set of uniform runners there can be per-runner cache affinity: in build systems, the snapshot of a successful post-submit build provides a great cache seed for an incremental pre-submit test run targeting the same branch. Exploiting this requires knowing something about the job (e.g. the branch it targets) so we can set up the right snapshot before kicking off the agent.
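The cache-affinity idea above could be sketched like this (all paths and names are assumptions; the target branch would come from the workflow_job webhook payload):

```shell
#!/bin/sh
# Sketch: pick a cache seed snapshot matching the job's target branch,
# falling back to the default-branch post-submit snapshot.

snapshot_for_branch() {
    branch=$1
    candidate="/snapshots/$branch.img"
    if [ -e "$candidate" ]; then
        echo "$candidate"
    else
        echo "/snapshots/main.img"   # seed from the default branch
    fi
}

# The chosen snapshot would be attached before starting the runner.
snapshot_for_branch feature-x
```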

9 participants