feat(batch): jobStateTimeLimitActions property added (#30158)
### Issue # (if applicable)

Closes #30142.

### Reason for this change

The `jobStateTimeLimitActions` property is missing from the L2 `JobQueue` construct.


### Description of changes
Add the `jobStateTimeLimitActions` property to the `JobQueue` construct.
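
For readers who want the mapping in isolation, here is a minimal, CDK-free sketch of what the new rendering step appears to do, based on the `job-queue.ts` diff in this commit: fill in the `CANCEL` and `RUNNABLE` defaults, check the 600 to 86400 second range, and omit the property when the list is empty. The type and function names below are hypothetical, and plain seconds stand in for the CDK `Duration` type; this is not the library source.

```typescript
// Hypothetical, CDK-free model of the mapping. Names are illustrative only.
type Action = 'CANCEL';
type State = 'RUNNABLE';
type Reason =
  | 'CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY'
  | 'MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE'
  | 'MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT';

interface TimeLimitActionInput {
  action?: Action;
  maxTimeSeconds: number; // assumption: plain seconds instead of a CDK Duration
  reason: Reason;
  state?: State;
}

interface RenderedTimeLimitAction {
  action: Action;
  maxTimeSeconds: number;
  reason: Reason;
  state: State;
}

function renderTimeLimitActions(
  actions?: TimeLimitActionInput[],
): RenderedTimeLimitAction[] | undefined {
  // An absent or empty list renders as undefined, so the property is omitted.
  if (!actions || actions.length === 0) {
    return undefined;
  }

  return actions.map((a, index): RenderedTimeLimitAction => {
    // Same bounds as the diff: 10 minutes to 24 hours, inclusive.
    if (a.maxTimeSeconds < 600 || a.maxTimeSeconds > 86400) {
      throw new Error(
        `maxTime must be between 600 and 86400 seconds, got ${a.maxTimeSeconds} seconds at jobStateTimeLimitActions[${index}]`,
      );
    }
    return {
      action: a.action ?? 'CANCEL',
      maxTimeSeconds: a.maxTimeSeconds,
      reason: a.reason,
      state: a.state ?? 'RUNNABLE',
    };
  });
}

// Only the required fields are given; action and state fall back to defaults.
const rendered = renderTimeLimitActions([
  { maxTimeSeconds: 600, reason: 'MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT' },
]);
console.log(JSON.stringify(rendered));
```

Since `CANCEL` and `RUNNABLE` are the only enum members in this commit, callers can leave `action` and `state` unset and supply just `maxTime` and `reason`.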


### Description of how you validated changes
Added unit tests and an integ test.
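
The unit test shown in this commit exercises the upper bound with a 90000 second value. As a rough illustration of the boundary cases a reviewer might also want to see covered, here is a small standalone sketch. `isValidMaxTime` is a hypothetical stand-in for the range check in `job-queue.ts`, not a function from the library.

```typescript
// Hypothetical stand-in for the range check; the bounds are taken from the diff.
const MIN_SECONDS = 600; // 10 minutes
const MAX_SECONDS = 86400; // 24 hours

function isValidMaxTime(seconds: number): boolean {
  return seconds >= MIN_SECONDS && seconds <= MAX_SECONDS;
}

// Values on both sides of each bound, plus the value used by the unit test.
const cases: Array<[number, boolean]> = [
  [599, false],
  [600, true],
  [86400, true],
  [86401, false],
  [90000, false],
];

for (const [seconds, expected] of cases) {
  const actual = isValidMaxTime(seconds);
  console.log(`${seconds}s -> ${actual ? 'accepted' : 'rejected'}`);
  if (actual !== expected) {
    throw new Error(`unexpected result for ${seconds}s`);
  }
}
```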



### Checklist
- [x] My code adheres to the [CONTRIBUTING GUIDE](https://github.com/aws/aws-cdk/blob/main/CONTRIBUTING.md) and [DESIGN GUIDELINES](https://github.com/aws/aws-cdk/blob/main/docs/DESIGN_GUIDELINES.md)

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
mazyu36 authored May 17, 2024

1 parent 4549cdf commit 411a58c
Showing 14 changed files with 2,525 additions and 5 deletions.

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Large diffs are not rendered by default.

@@ -0,0 +1,42 @@
import { Vpc } from 'aws-cdk-lib/aws-ec2';
import { App, Stack, Duration } from 'aws-cdk-lib/core';
import * as integ from '@aws-cdk/integ-tests-alpha';
import * as batch from 'aws-cdk-lib/aws-batch';

const app = new App();
const stack = new Stack(app, 'batch-stack-job-queue');
const vpc = new Vpc(stack, 'vpc');

// WHEN
new batch.JobQueue(stack, 'joBBQ', {
  computeEnvironments: [{
    computeEnvironment: new batch.ManagedEc2EcsComputeEnvironment(stack, 'CE', {
      vpc,
    }),
    order: 1,
  }],
  jobStateTimeLimitActions: [
    {
      action: batch.JobStateTimeLimitActionsAction.CANCEL,
      maxTime: Duration.minutes(10),
      reason: batch.JobStateTimeLimitActionsReason.INSUFFICIENT_INSTANCE_CAPACITY,
      state: batch.JobStateTimeLimitActionsState.RUNNABLE,
    },
    {
      action: batch.JobStateTimeLimitActionsAction.CANCEL,
      maxTime: Duration.minutes(10),
      reason: batch.JobStateTimeLimitActionsReason.COMPUTE_ENVIRONMENT_MAX_RESOURCE,
      state: batch.JobStateTimeLimitActionsState.RUNNABLE,
    },
    {
      maxTime: Duration.minutes(10),
      reason: batch.JobStateTimeLimitActionsReason.JOB_RESOURCE_REQUIREMENT,
    },
  ],
});

new integ.IntegTest(app, 'BatchEcsJobDefinitionTest', {
  testCases: [stack],
});

app.synth();
21 changes: 20 additions & 1 deletion packages/aws-cdk-lib/aws-batch/README.md
@@ -178,7 +178,7 @@ const computeEnv = new batch.ManagedEc2EcsComputeEnvironment(this, 'myEc2Compute
You can specify the maximum and minimum vCPUs a managed `ComputeEnvironment` can have at any given time.
Batch will *always* maintain `minvCpus` worth of instances in your ComputeEnvironment, even if it is not executing any jobs,
and even if it is disabled. Batch will scale the instances up to `maxvCpus` worth of instances as
-jobs exit the JobQueue and enter the ComputeEnvironment. If you use `AllocationStrategy.BEST_FIT_PROGRESSIVE`,
+jobs exit the JobQueue and enter the ComputeEnvironment. If you use `AllocationStrategy.BEST_FIT_PROGRESSIVE`,
`AllocationStrategy.SPOT_PRICE_CAPACITY_OPTIMIZED`, or `AllocationStrategy.SPOT_CAPACITY_OPTIMIZED`,
batch may exceed `maxvCpus`; it will never exceed `maxvCpus` by more than a single instance type. This example configures a
`minvCpus` of 10 and a `maxvCpus` of 100:
@@ -234,6 +234,25 @@ lowPriorityQueue.addComputeEnvironment(sharedComputeEnv, 1);
highPriorityQueue.addComputeEnvironment(sharedComputeEnv, 1);
```

### React to jobs stuck in RUNNABLE state

You can react to jobs stuck in the RUNNABLE state by setting `jobStateTimeLimitActions` on a `JobQueue`.
This property specifies the actions that AWS Batch will take after a job has remained at the head of the queue in the
specified state for longer than the specified time.

```ts
new batch.JobQueue(this, 'JobQueue', {
  jobStateTimeLimitActions: [
    {
      action: batch.JobStateTimeLimitActionsAction.CANCEL,
      maxTime: cdk.Duration.minutes(10),
      reason: batch.JobStateTimeLimitActionsReason.INSUFFICIENT_INSTANCE_CAPACITY,
      state: batch.JobStateTimeLimitActionsState.RUNNABLE,
    },
  ]
});
```

### Fairshare Scheduling

Batch `JobQueue`s execute Jobs submitted to them in FIFO order unless you specify a `SchedulingPolicy`.
118 changes: 117 additions & 1 deletion packages/aws-cdk-lib/aws-batch/lib/job-queue.ts
@@ -2,7 +2,7 @@ import { Construct } from 'constructs';
import { CfnJobQueue } from './batch.generated';
import { IComputeEnvironment } from './compute-environment-base';
import { ISchedulingPolicy } from './scheduling-policy';
-import { ArnFormat, IResource, Lazy, Resource, Stack } from '../../core';
+import { ArnFormat, Duration, IResource, Lazy, Resource, Stack } from '../../core';

/**
 * Represents a JobQueue
@@ -117,6 +117,14 @@ export interface JobQueueProps {
   * @default - no scheduling policy
   */
  readonly schedulingPolicy?: ISchedulingPolicy;

  /**
   * The set of actions that AWS Batch performs on jobs that remain at the head of the job queue in
   * the specified state for longer than the specified time.
   *
   * @default - no actions
   */
  readonly jobStateTimeLimitActions?: JobStateTimeLimitAction[];
}

/**
@@ -135,6 +143,85 @@ export interface OrderedComputeEnvironment {
  readonly order: number;
}

/**
 * Specifies an action that AWS Batch will take after the job has remained at
 * the head of the queue in the specified state for longer than the specified time.
 */
export interface JobStateTimeLimitAction {
  /**
   * The action to take when a job is at the head of the job queue in the specified state
   * for the specified period of time.
   *
   * @default JobStateTimeLimitActionsAction.CANCEL
   */
  readonly action?: JobStateTimeLimitActionsAction;

  /**
   * The approximate amount of time that must pass with the job in the specified
   * state before the action is taken.
   *
   * The minimum value is 10 minutes and the maximum value is 24 hours.
   */
  readonly maxTime: Duration;

  /**
   * The reason to log for the action being taken.
   *
   * @see https://docs.aws.amazon.com/batch/latest/userguide/troubleshooting.html#job_stuck_in_runnable
   */
  readonly reason: JobStateTimeLimitActionsReason;

  /**
   * The state of the job needed to trigger the action.
   *
   * @default JobStateTimeLimitActionsState.RUNNABLE
   */
  readonly state?: JobStateTimeLimitActionsState;
}

/**
 * The action to take when a job is at the head of the job queue in the specified state
 * for the specified period of time.
 */
export enum JobStateTimeLimitActionsAction {
  /**
   * Cancel the job.
   */
  CANCEL = 'CANCEL',
}

/**
 * The reason to log for the action being taken.
 *
 * @see https://docs.aws.amazon.com/batch/latest/userguide/troubleshooting.html#job_stuck_in_runnable
 */
export enum JobStateTimeLimitActionsReason {
  /**
   * All connected compute environments have insufficient capacity errors.
   */
  INSUFFICIENT_INSTANCE_CAPACITY = 'CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY',

  /**
   * All compute environments have a maxvCpus parameter that is smaller than the job requirements.
   */
  COMPUTE_ENVIRONMENT_MAX_RESOURCE = 'MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE',

  /**
   * None of the compute environments have instances that meet the job requirements.
   */
  JOB_RESOURCE_REQUIREMENT = 'MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT',
}

/**
 * The state of the job needed to trigger the action.
 */
export enum JobStateTimeLimitActionsState {
  /**
   * RUNNABLE state triggers the action.
   */
  RUNNABLE = 'RUNNABLE',
}

/**
 * JobQueues can receive Jobs, which are removed from the queue when
 * sent to the linked ComputeEnvironment(s) to be executed.
@@ -191,6 +278,7 @@ export class JobQueue extends Resource implements IJobQueue {
      jobQueueName: props?.jobQueueName,
      state: (this.enabled ?? true) ? 'ENABLED' : 'DISABLED',
      schedulingPolicyArn: this.schedulingPolicy?.schedulingPolicyArn,
      jobStateTimeLimitActions: this.renderJobStateTimeLimitActions(props?.jobStateTimeLimitActions),
    });

    this.jobQueueArn = this.getResourceArnAttribute(resource.attrJobQueueArn, {
@@ -209,6 +297,34 @@ export class JobQueue extends Resource implements IJobQueue {
      order,
    });
  }

  private renderJobStateTimeLimitActions(
    jobStateTimeLimitActions?: JobStateTimeLimitAction[],
  ): CfnJobQueue.JobStateTimeLimitActionProperty[] | undefined {
    if (!jobStateTimeLimitActions || jobStateTimeLimitActions.length === 0) {
      return;
    }

    return jobStateTimeLimitActions.map((action, index) => renderJobStateTimeLimitAction(action, index));

    function renderJobStateTimeLimitAction(
      jobStateTimeLimitAction: JobStateTimeLimitAction,
      index: number,
    ): CfnJobQueue.JobStateTimeLimitActionProperty {
      const maxTimeSeconds = jobStateTimeLimitAction.maxTime.toSeconds();

      if (maxTimeSeconds < 600 || maxTimeSeconds > 86400) {
        throw new Error(`maxTime must be between 600 and 86400 seconds, got ${maxTimeSeconds} seconds at jobStateTimeLimitActions[${index}]`);
      }

      return {
        action: jobStateTimeLimitAction.action ?? JobStateTimeLimitActionsAction.CANCEL,
        maxTimeSeconds,
        reason: jobStateTimeLimitAction.reason,
        state: jobStateTimeLimitAction.state ?? JobStateTimeLimitActionsState.RUNNABLE,
      };
    }
  }
}

function validateOrderedComputeEnvironments(computeEnvironments: OrderedComputeEnvironment[]): string[] {
101 changes: 98 additions & 3 deletions packages/aws-cdk-lib/aws-batch/test/job-queue.test.ts
@@ -1,7 +1,7 @@
-import { Template } from '../../assertions';
+import { Match, Template } from '../../assertions';
import * as ec2 from '../../aws-ec2';
-import { DefaultTokenResolver, Stack, StringConcat, Tokenization } from '../../core';
-import { FairshareSchedulingPolicy, JobQueue, ManagedEc2EcsComputeEnvironment } from '../lib';
+import { DefaultTokenResolver, Duration, Stack, StringConcat, Tokenization } from '../../core';
+import { FairshareSchedulingPolicy, JobQueue, ManagedEc2EcsComputeEnvironment, JobStateTimeLimitActionsAction, JobStateTimeLimitActionsReason, JobStateTimeLimitActionsState } from '../lib';

test('JobQueue respects computeEnvironments', () => {
  // GIVEN
@@ -259,3 +259,98 @@ test('JobQueue throws when there are no linked ComputeEnvironments', () => {
    Template.fromStack(stack);
  }).toThrow(/This JobQueue does not link any ComputeEnvironments/);
});

test('JobQueue with JobStateTimeLimitActions', () => {
  // GIVEN
  const stack = new Stack();
  const vpc = new ec2.Vpc(stack, 'vpc');

  // WHEN
  new JobQueue(stack, 'joBBQ', {
    computeEnvironments: [{
      computeEnvironment: new ManagedEc2EcsComputeEnvironment(stack, 'CE', {
        vpc,
      }),
      order: 1,
    }],
    jobStateTimeLimitActions: [
      {
        action: JobStateTimeLimitActionsAction.CANCEL,
        maxTime: Duration.minutes(10),
        reason: JobStateTimeLimitActionsReason.INSUFFICIENT_INSTANCE_CAPACITY,
        state: JobStateTimeLimitActionsState.RUNNABLE,
      },
      {
        action: JobStateTimeLimitActionsAction.CANCEL,
        maxTime: Duration.minutes(10),
        reason: JobStateTimeLimitActionsReason.COMPUTE_ENVIRONMENT_MAX_RESOURCE,
        state: JobStateTimeLimitActionsState.RUNNABLE,
      },
      {
        maxTime: Duration.minutes(10),
        reason: JobStateTimeLimitActionsReason.JOB_RESOURCE_REQUIREMENT,
      },
    ],
  });

  // THEN
  Template.fromStack(stack).hasResourceProperties('AWS::Batch::JobQueue', {
    JobStateTimeLimitActions: [
      {
        Action: 'CANCEL',
        MaxTimeSeconds: 600,
        Reason: 'CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY',
        State: 'RUNNABLE',
      },
      {
        Action: 'CANCEL',
        MaxTimeSeconds: 600,
        Reason: 'MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE',
        State: 'RUNNABLE',
      },
      {
        Action: 'CANCEL',
        MaxTimeSeconds: 600,
        Reason: 'MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT',
        State: 'RUNNABLE',
      },
    ],
  });
});

test('JobQueue with JobStateTimeLimitActions throws when maxTime has an illegal value', () => {
  const stack = new Stack();

  expect(() => new JobQueue(stack, 'joBBQ', {
    jobStateTimeLimitActions: [
      {
        action: JobStateTimeLimitActionsAction.CANCEL,
        maxTime: Duration.seconds(90000),
        reason: JobStateTimeLimitActionsReason.COMPUTE_ENVIRONMENT_MAX_RESOURCE,
        state: JobStateTimeLimitActionsState.RUNNABLE,
      },
    ],
  })).toThrow('maxTime must be between 600 and 86400 seconds, got 90000 seconds at jobStateTimeLimitActions[0]');
});

test('JobQueue with an empty array of JobStateTimeLimitActions', () => {
  // GIVEN
  const stack = new Stack();
  const vpc = new ec2.Vpc(stack, 'vpc');

  // WHEN
  new JobQueue(stack, 'joBBQ', {
    computeEnvironments: [{
      computeEnvironment: new ManagedEc2EcsComputeEnvironment(stack, 'CE', {
        vpc,
      }),
      order: 1,
    }],
    jobStateTimeLimitActions: [],
  });

  // THEN
  Template.fromStack(stack).hasResourceProperties('AWS::Batch::JobQueue', {
    JobStateTimeLimitActions: Match.absent(),
  });
});
