
feat: scale up for long waiting jobs (job retry) #4064

Merged — 26 commits merged into main from npalm/feature/job-retry on Aug 16, 2024
Conversation

@npalm (Collaborator) commented Aug 13, 2024

Description

This feature adds the capability to retry scaling a runner when a job is still queued after a defined delay. The feature is added to avoid the need for a pool of ephemeral runners.

Implementation

The module is extended with configuration to optionally enable one or more retries. Once enabled, the scale-up lambda will publish the same message as it receives, extended with a retry counter, on a retry-job queue with a delay. A new lambda picks the message from this queue and checks via the GitHub API whether the job is still queued. If it is, the message is published again on the job queue, the incoming queue of the scale-up lambda.
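The retry decision described above can be sketched as follows. This is only an illustration, not the actual lambda code (which lives under `lambdas/` in the repo); the message shape, function names, and the max-attempts parameter are hypothetical, while the 900-second cap is the real SQS `DelaySeconds` limit.

```typescript
// Hypothetical shape of the message the retry lambda receives;
// `retryCounter` tracks how often this job has already been retried.
interface RetryMessage {
  id: number; // workflow job id
  retryCounter: number;
}

// Decide whether the job should be re-published to the scale-up queue.
// `jobStatus` is the status reported by the GitHub API for the job
// (e.g. "queued", "in_progress", "completed"); only still-queued jobs
// below the configured attempt limit are retried.
function shouldRetry(
  jobStatus: string,
  msg: RetryMessage,
  maxAttempts: number,
): boolean {
  return jobStatus === "queued" && msg.retryCounter < maxAttempts;
}

// SQS message delays are capped at 900 seconds (15 minutes),
// hence the configured delay must be clamped to that maximum.
function retryDelaySeconds(configuredDelay: number): number {
  return Math.min(configuredDelay, 900);
}
```

In the real flow, the scale-up lambda would publish the original message with an incremented counter and the clamped delay, and the retry lambda would re-publish it to the scale-up queue only when `shouldRetry` holds.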

Consequences

  • This feature is meant for small fleets with ephemeral runners. Each retry check causes a GitHub API call, which can trigger a rate limit for the app.
  • This feature should make ephemeral runners more responsive without requiring a pool to pick up missed jobs.
  • The module allows you to force a job check before scaling; this check should be disabled.
  • The delay should be set to a time higher than the normal boot time of a runner.

Testing

Testing can be done as follows:

  • Trigger a workflow.

  • Terminate the created instance before the job starts.

  • Wait; after the delay, the retry lambda should publish the message again, which triggers the creation of a new instance.

  • Multi-runner setup.

  • Default runners: not enabled by default, requires a configuration update.

Tasks

  • Update docs
  • Update multi-runner
  • Check CMK keys for SQS
  • Limit delay to max delay of a queue.
  • Add optional metric for retry
  • Update issue with more details

@npalm npalm marked this pull request as draft August 13, 2024 15:38
@npalm npalm marked this pull request as ready for review August 15, 2024 07:36
@npalm npalm requested review from Brend-Smits and stuartp44 August 15, 2024 07:36
@npalm npalm changed the title feat: job retry feat: scale up for long waiting jobs (job retry) Aug 15, 2024
@Brend-Smits (Contributor) left a comment

Nice! I have yet to test these changes, will do it after pressing this request changes button.

Review threads:
  • docs/configuration.md (outdated, resolved)
  • docs/index.md (resolved, 4 threads)
  • modules/runners/job-retry.tf (outdated, resolved)
  • modules/runners/job-retry/README.md (outdated, resolved)
  • modules/runners/job-retry/main.tf (outdated, resolved)
  • modules/runners/scale-up.tf (outdated, resolved)
@npalm npalm requested a review from Brend-Smits August 15, 2024 14:53
@npalm npalm force-pushed the npalm/feature/job-retry branch from 3726de8 to 161e50e Compare August 16, 2024 08:14
@Brend-Smits (Contributor) left a comment

Looks good. I tested it and it works :)
Good job 👍🏼

@npalm npalm merged commit 6120571 into main Aug 16, 2024
44 checks passed
@npalm npalm deleted the npalm/feature/job-retry branch August 16, 2024 10:47
npalm pushed a commit that referenced this pull request Aug 16, 2024
🤖 I have created a release *beep* *boop*
---

## [5.15.0](philips-labs/terraform-aws-github-runner@v5.14.1...v5.15.0) (2024-08-16)


### Features

* add time zone support for pool schedules
([#4063](https://github.com/philips-labs/terraform-aws-github-runner/issues/4063))
([b8f9eb4](philips-labs/terraform-aws-github-runner@b8f9eb4))
@janslow
* scale up for long waiting jobs (job retry)
([#4064](https://github.com/philips-labs/terraform-aws-github-runner/issues/4064))
([6120571](philips-labs/terraform-aws-github-runner@6120571))


### Bug Fixes

* **lambda:** bump axios from 1.7.2 to 1.7.4 in /lambdas
([#4071](https://github.com/philips-labs/terraform-aws-github-runner/issues/4071))
([2f32195](philips-labs/terraform-aws-github-runner@2f32195))
* **lambda:** bump the aws group in /lambdas with 5 updates
([#4057](https://github.com/philips-labs/terraform-aws-github-runner/issues/4057))
([5ecdbad](philips-labs/terraform-aws-github-runner@5ecdbad))
* **lambda:** bump the aws-powertools group in /lambdas with 3 updates
([#4058](https://github.com/philips-labs/terraform-aws-github-runner/issues/4058))
([f9533f3](philips-labs/terraform-aws-github-runner@f9533f3))
* **lambda:** Prevent scale-up lambda from starting runner for user repo
if org level runners is enabled
([#3909](https://github.com/philips-labs/terraform-aws-github-runner/issues/3909))
([98b1560](philips-labs/terraform-aws-github-runner@98b1560))
@PerGon

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: forest-releaser[bot] <80285352+forest-releaser[bot]@users.noreply.github.com>
@mariusfilipowski (Contributor) commented Sep 11, 2024

Thanks for this interesting option.

  1. What do you mean by small fleets?
  2. I assume these rate limits won't apply to GHES, will they?
  3. Can this also be used with idle/pooled runners, or do they have to be turned off? We have certain problems at night, when we set the pool to 0, where we would expect this approach could help us get more stability.
