feat: scale up for long waiting jobs (job retry) (#4064)
## Description

This feature adds the capability to retry scaling a runner when a job is
still queued after a defined delay. It is added to avoid the need for a
pool for ephemeral runners.

## Implementation

The module is extended with configuration to optionally enable one or
more retries. Once enabled, the scale-up lambda publishes the same
message it receives, extended with a counter, to a job retry queue with
a delay. A new lambda picks the message up from this queue and checks
whether the job is still queued (via the GitHub API). If it is still
queued, the message is published again on the job queue, the incoming
queue of the scale-up lambda.
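
The flow above can be sketched as two small decision functions. This is a minimal illustration, not the module's actual code: the `JobMessage` shape, the function names, and the `retryCounter` field name are assumptions, and the SQS publish and GitHub API lookup are left out so that only the decision logic remains.

```typescript
// Hypothetical message shape: only the retry counter is described in the text;
// the other fields are placeholders for whatever the scale-up lambda receives.
interface JobMessage {
  id: number;
  repository: string;
  retryCounter?: number;
}

// Scale-up side: copy the incoming message and increment the counter, or return
// undefined when retries are disabled or all attempts have been used.
function toRetryMessage(message: JobMessage, enabled: boolean, maxAttempts: number): JobMessage | undefined {
  const attempts = message.retryCounter ?? 0;
  if (!enabled || attempts >= maxAttempts) {
    return undefined;
  }
  return { ...message, retryCounter: attempts + 1 };
}

// Retry-check side: republish to the job queue only when the job is still queued.
// `jobStatus` stands in for the job status returned by the GitHub API.
function messageToRepublish(message: JobMessage, jobStatus: string): JobMessage | undefined {
  return jobStatus === 'queued' ? message : undefined;
}
```

Carrying the counter inside the message keeps the retry state out of the lambdas, so the number of attempts is bounded without any extra storage.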

## Consequences

- This feature is meant for small fleets with ephemeral runners. Each
retry check causes a GitHub API call, which can trigger a rate limit for
the app.
- This feature should make ephemeral runners more responsive without
having a pool to pick up missed jobs.
- The module allows you to force a job check before scaling
(`enable_job_queued_check`); this check should be disabled.
- The delay should be set to a time that is higher than the normal
boot time of a runner.
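
To get a feel for how `delay_in_seconds` and `delay_backoff` interact with the queue's delay limit, the following sketch computes the delay per attempt. The exponential formula and the function name are assumptions made for illustration; the only fixed fact used here is that Amazon SQS accepts a message delay of at most 900 seconds.

```typescript
// Amazon SQS accepts a per-message delay of at most 900 seconds (15 minutes).
const SQS_MAX_DELAY_SECONDS = 900;

// Assumed formula: base delay multiplied by backoff^attempt, capped at the SQS maximum.
// `attempt` is zero-based, so the first retry uses the base delay unchanged.
function retryDelaySeconds(delayInSeconds: number, delayBackoff: number, attempt: number): number {
  const delay = delayInSeconds * Math.pow(delayBackoff, attempt);
  return Math.min(SQS_MAX_DELAY_SECONDS, Math.floor(delay));
}
```

Under these assumptions, a base delay of 300 seconds with a backoff of 2 reaches the queue limit on the third attempt, so larger `max_attempts` values would not lengthen the wait any further.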

## Testing

Testing can be done as follows:
- Trigger a workflow.
- Terminate the created instance before the job starts.
- Wait; after the delay, the retry job should publish the message again,
which triggers a new instance creation.

- [x] Multi runners.
- [x] Default runners (not enabled by default; requires a configuration update)

## Tasks

- [x] Update docs
- [x] Update multi-runner
- [x] Check CMK keys for SQS
- [x] Limit delay to max delay of a queue.
- [x] Add optional metric for retry
- [x] Update issue with more details

---------

Co-authored-by: forest-pr|bot <forest-pr[bot]@users.noreply.github.com>
Co-authored-by: Brend Smits <[email protected]>
3 people authored Aug 16, 2024
1 parent 9086a29 commit 6120571
Showing 53 changed files with 5,742 additions and 2,055 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -147,6 +147,7 @@ Talk to the forestkeepers in the `runners-channel` on Slack.
| <a name="input_enable_jit_config"></a> [enable\_jit\_config](#input\_enable\_jit\_config) | Overwrite the default behavior for JIT configuration. By default JIT configuration is enabled for ephemeral runners and disabled for non-ephemeral runners. In case of GHES, check first if the JIT config API is available. In case you are upgrading from 3.x to 4.x you can set `enable_jit_config` to `false` to avoid a breaking change when having your own AMI. | `bool` | `null` | no |
| <a name="input_enable_job_queued_check"></a> [enable\_job\_queued\_check](#input\_enable\_job\_queued\_check) | Only scale if the job event received by the scale up lambda is in the queued state. By default enabled for non ephemeral runners and disabled for ephemeral. Set this variable to overwrite the default behavior. | `bool` | `null` | no |
| <a name="input_enable_managed_runner_security_group"></a> [enable\_managed\_runner\_security\_group](#input\_enable\_managed\_runner\_security\_group) | Enables creation of the default managed security group. Unmanaged security groups can be specified via `runner_additional_security_group_ids`. | `bool` | `true` | no |
| <a name="input_enable_metrics_control_plane"></a> [enable\_metrics\_control\_plane](#input\_enable\_metrics\_control\_plane) | (Experimental) Enable or disable the metrics for the module. Feature can change or be renamed without a major release. | `bool` | `false` | no |
| <a name="input_enable_organization_runners"></a> [enable\_organization\_runners](#input\_enable\_organization\_runners) | Register runners to organization, instead of repo level | `bool` | `false` | no |
| <a name="input_enable_runner_binaries_syncer"></a> [enable\_runner\_binaries\_syncer](#input\_enable\_runner\_binaries\_syncer) | Option to disable the lambda to sync GitHub runner distribution, useful when using a pre-build AMI. | `bool` | `true` | no |
| <a name="input_enable_runner_detailed_monitoring"></a> [enable\_runner\_detailed\_monitoring](#input\_enable\_runner\_detailed\_monitoring) | Should detailed monitoring be enabled for the runner. Set this to true if you want to use detailed monitoring. See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-cloudwatch-new.html for details. | `bool` | `false` | no |
@@ -167,6 +168,7 @@ Talk to the forestkeepers in the `runners-channel` on Slack.
| <a name="input_instance_termination_watcher"></a> [instance\_termination\_watcher](#input\_instance\_termination\_watcher) | Configuration for the instance termination watcher. This feature is Beta, changes will not trigger a major release as long as it is in beta.<br><br>`enable`: Enable or disable the spot termination watcher.<br>`enable_metrics`: Enable or disable the metrics for the spot termination watcher.<br>`memory_size`: Memory size limit in MB of the lambda.<br>`s3_key`: S3 key for syncer lambda function. Required if using S3 bucket to specify lambdas.<br>`s3_object_version`: S3 object version for syncer lambda function. Useful if S3 versioning is enabled on source bucket.<br>`timeout`: Time out of the lambda in seconds.<br>`zip`: File location of the lambda zip file. | <pre>object({<br> enable = optional(bool, false)<br> enable_metric = optional(object({<br> spot_warning = optional(bool, false)<br> }))<br> memory_size = optional(number, null)<br> s3_key = optional(string, null)<br> s3_object_version = optional(string, null)<br> timeout = optional(number, null)<br> zip = optional(string, null)<br> })</pre> | `{}` | no |
| <a name="input_instance_types"></a> [instance\_types](#input\_instance\_types) | List of instance types for the action runner. Defaults are based on runner\_os (al2023 for linux and Windows Server Core for win). | `list(string)` | <pre>[<br> "m5.large",<br> "c5.large"<br>]</pre> | no |
| <a name="input_job_queue_retention_in_seconds"></a> [job\_queue\_retention\_in\_seconds](#input\_job\_queue\_retention\_in\_seconds) | The number of seconds the job is held in the queue before it is purged. | `number` | `86400` | no |
| <a name="input_job_retry"></a> [job\_retry](#input\_job\_retry) | Experimental! Can be removed / changed without triggering a major release. Configure job retries. The configuration enables job retries (for ephemeral runners). After creating the instances a message will be published to a job retry queue. The job retry check lambda checks after a delay whether the job is still queued. If so, the message will be published again on the scale-up (build) queue. Using this feature can impact the rate limit of the GitHub app.<br><br>`enable`: Enable or disable the job retry feature.<br>`delay_in_seconds`: The delay in seconds before the job retry check lambda will check the job status.<br>`delay_backoff`: The backoff factor for the delay.<br>`lambda_memory_size`: Memory size limit in MB for the job retry check lambda.<br>`lambda_timeout`: Time out of the job retry check lambda in seconds.<br>`max_attempts`: The maximum number of attempts to retry the job. | <pre>object({<br> enable = optional(bool, false)<br> delay_in_seconds = optional(number, 300)<br> delay_backoff = optional(number, 2)<br> lambda_memory_size = optional(number, 256)<br> lambda_timeout = optional(number, 30)<br> max_attempts = optional(number, 1)<br> })</pre> | `{}` | no |
| <a name="input_key_name"></a> [key\_name](#input\_key\_name) | Key pair name | `string` | `null` | no |
| <a name="input_kms_key_arn"></a> [kms\_key\_arn](#input\_kms\_key\_arn) | Optional CMK Key ARN to be used for Parameter Store. This key must be in the current account. | `string` | `null` | no |
| <a name="input_lambda_architecture"></a> [lambda\_architecture](#input\_lambda\_architecture) | AWS Lambda architecture. Lambda functions using Graviton processors ('arm64') tend to have better price/performance than 'x86\_64' functions. | `string` | `"arm64"` | no |
1 change: 0 additions & 1 deletion docs/architecture.drawio

This file was deleted.

3 changes: 0 additions & 3 deletions docs/architecture.svg

This file was deleted.

Binary file modified docs/assets/aws-architecture.dark.png
Binary file modified docs/assets/aws-architecture.light.png