feat: scale up for long waiting jobs (job retry) (#4064)
## Description

This feature adds the capability to retry scaling a runner when a job is still queued after a defined delay. It is meant to avoid the need for a pool when using ephemeral runners.

## Implementation

The module is extended with configuration to optionally enable one or more retries. Once enabled, the scale-up lambda publishes the same message it receives, extended with a counter, to a retry job queue with a delay. A new lambda picks the message up from this queue and checks whether the job is still queued (via the GitHub API). If it is still queued, the message is published again to the job queue, which is the incoming queue of the scale-up lambda.

## Consequences

- This feature is meant for small fleets with ephemeral runners. Each retry check causes a GitHub API call, which can trigger a rate limit for the app.
- This feature should make ephemeral runners more responsive without having a pool to pick up missed jobs.
- The module allows you to force a job check before scaling; this check should be disabled.
- The delay should be set to a time that is higher than the normal boot time of a runner.

## Testing

Testing can be done as follows:

- Trigger a workflow
- Terminate the created instance before the job starts
- Wait; after the delay, the retry job should publish the message again, which triggers the creation of a new instance.

- [x] Multi runners.
- [x] Default runners; not enabled by default, requires a configuration update

## Tasks

- [x] Update docs
- [x] Update multi-runner
- [x] Check CMK keys for SQS
- [x] Limit delay to max delay of a queue.
- [x] Add optional metric for retry
- [x] Update issue with more details

---------

Co-authored-by: forest-pr[bot] <forest-pr[bot]@users.noreply.github.com>
Co-authored-by: Brend Smits <[email protected]>
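
The retry flow described under Implementation can be sketched roughly as below. This is a minimal illustration, not the module's actual code: the `JobMessage` and `RetryConfig` shapes, the field names, and both function names are hypothetical, and the real lambdas would wrap this logic in SQS and GitHub API calls. The 900-second cap reflects the maximum delivery delay of an SQS queue, and `queued` is the job status reported by the GitHub API for a job that has not been picked up.

```typescript
// Hypothetical message shape; field names are illustrative only.
interface JobMessage {
  id: number;
  repositoryOwner: string;
  repositoryName: string;
  retryCounter?: number; // absent on the first pass through scale-up
}

// Hypothetical retry settings; the module's real variable names may differ.
interface RetryConfig {
  enabled: boolean;
  maxAttempts: number;
  delaySeconds: number;
}

interface DelayedMessage {
  body: JobMessage;
  delaySeconds: number;
}

// SQS does not allow a delivery delay above 15 minutes.
const SQS_MAX_DELAY_SECONDS = 900;

// Scale-up side: build the message for the retry queue, or return null
// when retries are disabled or the attempts are used up.
function buildRetryMessage(message: JobMessage, config: RetryConfig): DelayedMessage | null {
  const attempts = message.retryCounter ?? 0;
  if (!config.enabled || attempts >= config.maxAttempts) {
    return null;
  }
  return {
    body: { ...message, retryCounter: attempts + 1 },
    delaySeconds: Math.min(Math.max(config.delaySeconds, 0), SQS_MAX_DELAY_SECONDS),
  };
}

// Retry side: only a job that is still queued goes back to the scale-up queue.
function shouldRequeue(jobStatus: string): boolean {
  return jobStatus === 'queued';
}
```

Keeping the decision logic free of I/O makes it easy to unit test; the counter travels inside the message, so no state needs to be stored between retries.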