Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: mark orphan runners before removing them #4001

Merged
merged 9 commits into from
Aug 1, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -46,7 +46,7 @@ The "Scale Up Runner" Lambda actively monitors the SQS queue, processing incomin

The Lambda first requests a JIT configuration or registration token from GitHub, which is needed later by the runner to register itself. This avoids the case that the EC2 instance, which later in the process will install the agent, needs administration permissions to register the runner. Next, the EC2 spot instance is created via the launch template. The launch template defines the specifications of the required instance and contains a [`user_data`](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html) script. This script will install the required software and configure it. The configuration for the runner is shared via EC2 tags and the parameter store (SSM), from which the user data script will fetch it and delete it once it has been retrieved. Once the user data script is finished, the action runner should be online, and the workflow will start in seconds.

The current method for scaling down runners employs a straightforward approach: at predefined intervals, the Lambda conducts a thorough examination of each runner (instance) to assess its activity. If a runner is found to be idle, it is deregistered from GitHub, and the associated AWS instance is terminated. For ephemeral runners the the instance is terminated immediately after the workflow is finished. To avoid orphaned runners the scale down lambda is active in this cae as well.
The current method for scaling down runners employs a straightforward approach: at predefined intervals, the Lambda conducts a thorough examination of each runner (instance) to assess its activity. If a runner is found to be idle, it is deregistered from GitHub, and the associated AWS instance is terminated. For ephemeral runners the the instance is terminated immediately after the workflow is finished. Instances not registered in GitHub as a runner after a minimal boot time will be marked orphan and removed in a next cycle. To avoid orphaned runners the scale down lambda is active in this cae as well.

### Pool

Expand All @@ -68,7 +68,7 @@ The AMI cleaner is a lambda that will clean up AMIs that are older than a config

> This feature is Beta, changes will not trigger a major release as long in beta.

The Instance Termination Watcher is creating log and optional metrics for termination of instances. Currently only spot termination warnings are watched. See [configuration](configuration/) for more details.
The Instance Termination Watcher is creating log and optional metrics for termination of instances. Currently only spot termination warnings are watched. See [configuration](configuration/) for more details.

### Security

Expand Down
8 changes: 4 additions & 4 deletions lambdas/functions/control-plane/jest.config.ts
Original file line number Diff line number Diff line change
Expand Up @@ -6,10 +6,10 @@ const config: Config = {
...defaultConfig,
coverageThreshold: {
global: {
statements: 97.79,
branches: 96.13,
functions: 95.4,
lines: 98.06,
statements: 98.01,
branches: 97.28,
functions: 95.6,
lines: 97.94,
},
},
};
Expand Down
2 changes: 1 addition & 1 deletion lambdas/functions/control-plane/package.json
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@
"test": "NODE_ENV=test nx test",
"test:watch": "NODE_ENV=test nx test --watch",
"lint": "yarn eslint src",
"watch": "ts-node-dev --respawn --exit-child --files src/local.ts",
"watch": "ts-node-dev --respawn --exit-child --files src/local-down.ts",
"build": "ncc build src/lambda.ts -o dist",
"dist": "yarn build && cd dist && zip ../runners.zip index.js",
"format": "prettier --write \"**/*.ts\"",
Expand Down
2 changes: 2 additions & 0 deletions lambdas/functions/control-plane/src/aws/runners.d.ts
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ export interface RunnerList {
type?: string;
repo?: string;
org?: string;
orphan?: boolean;
}

export interface RunnerInfo {
Expand All @@ -22,6 +23,7 @@ export interface ListRunnerFilters {
runnerType?: RunnerType;
runnerOwner?: string;
environment?: string;
orphan?: boolean;
statuses?: string[];
}

Expand Down
57 changes: 56 additions & 1 deletion lambdas/functions/control-plane/src/aws/runners.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@ import {
CreateFleetCommandInput,
CreateFleetInstance,
CreateFleetResult,
CreateTagsCommand,
DefaultTargetCapacityType,
DescribeInstancesCommand,
DescribeInstancesResult,
Expand All @@ -16,7 +17,7 @@ import { mockClient } from 'aws-sdk-client-mock';
import 'aws-sdk-client-mock-jest';

import ScaleError from './../scale-runners/ScaleError';
import { createRunner, listEC2Runners, terminateRunner } from './runners';
import { createRunner, listEC2Runners, tag, terminateRunner } from './runners';
import { RunnerInfo, RunnerInputParameters, RunnerType } from './runners.d';

process.env.AWS_REGION = 'eu-east-1';
Expand Down Expand Up @@ -67,6 +68,23 @@ describe('list instances', () => {
launchTime: new Date('2020-10-10T14:48:00.000+09:00'),
type: 'Org',
owner: 'CoderToCat',
orphan: false,
});
});

it('check orphan tag.', async () => {
const instances: DescribeInstancesResult = mockRunningInstances;
instances.Reservations![0].Instances![0].Tags!.push({ Key: 'ghr:orphan', Value: 'true' });
mockEC2Client.on(DescribeInstancesCommand).resolves(instances);

const resp = await listEC2Runners();
expect(resp.length).toBe(1);
expect(resp).toContainEqual({
instanceId: instances.Reservations![0].Instances![0].InstanceId!,
launchTime: instances.Reservations![0].Instances![0].LaunchTime!,
type: 'Org',
owner: 'CoderToCat',
orphan: true,
});
});

Expand Down Expand Up @@ -114,6 +132,23 @@ describe('list instances', () => {
});
});

it('filters instances on environment and orphan', async () => {
mockRunningInstances.Reservations![0].Instances![0].Tags!.push({
Key: 'ghr:orphan',
Value: 'true',
});
mockEC2Client.on(DescribeInstancesCommand).resolves(mockRunningInstances);
await listEC2Runners({ environment: ENVIRONMENT, orphan: true });
expect(mockEC2Client).toHaveReceivedCommandWith(DescribeInstancesCommand, {
Filters: [
{ Name: 'instance-state-name', Values: ['running', 'pending'] },
{ Name: 'tag:ghr:environment', Values: [ENVIRONMENT] },
{ Name: 'tag:ghr:orphan', Values: ['true'] },
{ Name: 'tag:ghr:Application', Values: ['github-action-runner'] },
],
});
});

it('No instances, undefined reservations list.', async () => {
const noInstances: DescribeInstancesResult = {
Reservations: undefined,
Expand Down Expand Up @@ -182,6 +217,26 @@ describe('terminate runner', () => {
});
});

describe('tag runner', () => {
beforeEach(() => {
jest.clearAllMocks();
});
it('adding extra tag', async () => {
mockEC2Client.on(CreateTagsCommand).resolves({});
const runner: RunnerInfo = {
instanceId: 'instance-2',
owner: 'owner-2',
type: 'Repo',
};
await tag(runner.instanceId, [{ Key: 'ghr:orphan', Value: 'truer' }]);

expect(mockEC2Client).toHaveReceivedCommandWith(CreateTagsCommand, {
Resources: [runner.instanceId],
Tags: [{ Key: 'ghr:orphan', Value: 'truer' }],
});
});
});

describe('create runner', () => {
const defaultRunnerConfig: RunnerConfig = {
allocationStrategy: SpotAllocationStrategy.CAPACITY_OPTIMIZED,
Expand Down
16 changes: 14 additions & 2 deletions lambdas/functions/control-plane/src/aws/runners.ts
Original file line number Diff line number Diff line change
@@ -1,10 +1,12 @@
import {

Check notice on line 1 in lambdas/functions/control-plane/src/aws/runners.ts

View check run for this annotation

CodeScene Delta Analysis / CodeScene Cloud Delta Analysis (main)

✅ Getting better: Overall Code Complexity

The mean cyclomatic complexity decreases from 4.91 to 4.83, threshold = 4. This file has many conditional statements (e.g. if, for, while) across its implementation, leading to lower code health. Avoid adding more conditionals.
CreateFleetCommand,
CreateFleetResult,
CreateTagsCommand,
DescribeInstancesCommand,
DescribeInstancesResult,
EC2Client,
FleetLaunchTemplateOverridesRequest,
Tag,
TerminateInstancesCommand,
_InstanceType,
} from '@aws-sdk/client-ec2';
Expand Down Expand Up @@ -46,6 +48,9 @@
ec2FiltersBase.push({ Name: `tag:ghr:Type`, Values: [filters.runnerType] });
ec2FiltersBase.push({ Name: `tag:ghr:Owner`, Values: [filters.runnerOwner] });
}
if (filters.orphan) {
ec2FiltersBase.push({ Name: 'tag:ghr:orphan', Values: ['true'] });
npalm marked this conversation as resolved.
Show resolved Hide resolved
}

Check warning on line 53 in lambdas/functions/control-plane/src/aws/runners.ts

View check run for this annotation

CodeScene Delta Analysis / CodeScene Cloud Delta Analysis (main)

❌ New issue: Complex Method

constructFilters has a cyclomatic complexity of 9, threshold = 9. This function has many conditional statements (e.g. if, for, while), leading to lower code health. Avoid adding more conditionals and code to it without refactoring.
}

for (const key of ['tag:ghr:Application']) {
Expand Down Expand Up @@ -85,6 +90,7 @@
type: i.Tags?.find((e) => e.Key === 'ghr:Type')?.Value as string,
repo: i.Tags?.find((e) => e.Key === 'ghr:Repo')?.Value as string,
org: i.Tags?.find((e) => e.Key === 'ghr:Org')?.Value as string,
orphan: i.Tags?.find((e) => e.Key === 'ghr:orphan')?.Value === 'true',

Check warning on line 93 in lambdas/functions/control-plane/src/aws/runners.ts

View check run for this annotation

CodeScene Delta Analysis / CodeScene Cloud Delta Analysis (main)

❌ Getting worse: Complex Method

getRunnerInfo increases in cyclomatic complexity from 13 to 15, threshold = 9. This function has many conditional statements (e.g. if, for, while), leading to lower code health. Avoid adding more conditionals and code to it without refactoring.
npalm marked this conversation as resolved.
Show resolved Hide resolved
});
}
}
Expand All @@ -94,10 +100,16 @@
}

export async function terminateRunner(instanceId: string): Promise<void> {
logger.info(`Runner '${instanceId}' will be terminated.`);
logger.debug(`Runner '${instanceId}' will be terminated.`);
const ec2 = getTracedAWSV3Client(new EC2Client({ region: process.env.AWS_REGION }));
await ec2.send(new TerminateInstancesCommand({ InstanceIds: [instanceId] }));
logger.info(`Runner ${instanceId} has been terminated.`);
logger.debug(`Runner ${instanceId} has been terminated.`);
}

export async function tag(instanceId: string, tags: Tag[]): Promise<void> {
logger.debug(`Tagging '${instanceId}'`, { tags });
const ec2 = getTracedAWSV3Client(new EC2Client({ region: process.env.AWS_REGION }));
await ec2.send(new CreateTagsCommand({ Resources: [instanceId], Tags: tags }));
}

function generateFleetOverrides(
Expand Down
Loading