Per-job Instance Types and AMIs #2390

Merged on Aug 29, 2024 (42 commits)
Changes from 40 commits
38dca68
sketching out per-job compute environments
AndrewPlayer3 Aug 23, 2024
fcad9dc
notes in important locations for per-instance types and amis
AndrewPlayer3 Aug 23, 2024
8396463
rendering per-job compute environments
AndrewPlayer3 Aug 23, 2024
e4611e9
make compute env field with name
AndrewPlayer3 Aug 23, 2024
18bf683
create a shared compute environment so jobs dont NEED to specify anyt…
AndrewPlayer3 Aug 23, 2024
46f6554
refactor for single step function
AndrewPlayer3 Aug 26, 2024
f6a7117
SharedComputeEnvironment to Shared
AndrewPlayer3 Aug 26, 2024
0491dd6
changelog update
AndrewPlayer3 Aug 26, 2024
457e40d
removed Batched from names
AndrewPlayer3 Aug 26, 2024
ef0a8e5
ImageId to AmiId
AndrewPlayer3 Aug 26, 2024
cec5376
SharedStepFunctionArn to StepFunctionArn
AndrewPlayer3 Aug 26, 2024
8024452
added correct ami id
AndrewPlayer3 Aug 26, 2024
df2f077
correct image id pt 2
AndrewPlayer3 Aug 26, 2024
dc22adb
fixed tabs
AndrewPlayer3 Aug 26, 2024
6b519c5
SharedStepFunction to StepFunction
AndrewPlayer3 Aug 26, 2024
ee3869f
arn to Arn
AndrewPlayer3 Aug 26, 2024
54ec15a
removed extra stepfunctionpolicies
AndrewPlayer3 Aug 26, 2024
a28ba15
add custom launch template user data commands to job_spec
AndrewPlayer3 Aug 26, 2024
2fe012a
removed todos
AndrewPlayer3 Aug 26, 2024
3b0b0cf
removed todos
AndrewPlayer3 Aug 26, 2024
68b2d3b
defaults and variables
AndrewPlayer3 Aug 27, 2024
4ce4abe
shorter comment
AndrewPlayer3 Aug 27, 2024
8ff4a7d
shorter jinja variable names
AndrewPlayer3 Aug 27, 2024
7654d3e
removed added newline
AndrewPlayer3 Aug 27, 2024
b9f8d56
remove todo
AndrewPlayer3 Aug 27, 2024
7a385ce
variables
AndrewPlayer3 Aug 27, 2024
28a66fb
gpu tag support
AndrewPlayer3 Aug 27, 2024
0f6e778
eof newlines
AndrewPlayer3 Aug 27, 2024
e694d61
match gpu_support
AndrewPlayer3 Aug 27, 2024
cc675fd
removed shared in names + 'Shared' to 'Default'
AndrewPlayer3 Aug 27, 2024
fb1acad
not in to !=
AndrewPlayer3 Aug 27, 2024
5a7cea4
Merge branch 'develop' into multi_instance_type
jtherrmann Aug 27, 2024
7df649b
move compute env before tasks
AndrewPlayer3 Aug 28, 2024
6d5fdca
refactoring
AndrewPlayer3 Aug 28, 2024
aafeba1
Log Group correction
AndrewPlayer3 Aug 28, 2024
f9afc1a
removed spaces
AndrewPlayer3 Aug 28, 2024
8c39fd0
removed dupe
AndrewPlayer3 Aug 28, 2024
a446d03
ability to specify allocation type/strat
AndrewPlayer3 Aug 28, 2024
ddb2797
Add missing whitespace
jtherrmann Aug 28, 2024
cb8fbb8
Update CHANGELOG.md
jtherrmann Aug 28, 2024
a141a47
Update CHANGELOG.md
jtherrmann Aug 28, 2024
c943c0c
minor fixes based on local rendering
jtherrmann Aug 28, 2024
7 changes: 7 additions & 0 deletions CHANGELOG.md
Expand Up @@ -5,6 +5,13 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).


## [7.8.0]

### Added
- Allow overriding certain AWS Batch compute environment parameters (including instance types and AMI) within a job spec.
- Allow job spec tasks to specify GPU resource requirements.


## [7.7.2]

### Change
Expand Down
57 changes: 55 additions & 2 deletions apps/compute-cf.yml.j2
Expand Up @@ -29,6 +29,16 @@ Outputs:
JobQueueArn:
Value: !Ref BatchJobQueue

{% for job_type, job_spec in job_types.items() if 'Default' != job_spec['compute_environment']['name'] %}
{% set name = job_spec['compute_environment']['name'] %}
{{ name }}ComputeEnvironmentArn:
Value: !Ref {{ name }}ComputeEnvironment

{{ name }}JobQueueArn:
Value: !Ref {{ name }}JobQueue

{% endfor %}

TaskRoleArn:
Value: !GetAtt TaskRole.Arn

Expand Down Expand Up @@ -85,17 +95,60 @@ Resources:
Tags:
Name: !Ref AWS::StackName

BatchJobQueue:
Type: AWS::Batch::JobQueue
Properties:
Priority: 1
ComputeEnvironmentOrder:
- ComputeEnvironment: !Ref ComputeEnvironment
Order: 1
SchedulingPolicyArn: !Ref SchedulingPolicy

SchedulingPolicy:
Type: AWS::Batch::SchedulingPolicy

BatchJobQueue:
{% for job_type, job_spec in job_types.items() if 'Default' != job_spec['compute_environment']['name'] %}
{% set env = job_spec['compute_environment'] %}
{% set name = env['name'] %}
{% set instance_types = env['instance_types'] if 'instance_types' in env else ['!Ref InstanceTypes'] %}
{% set ami_id = env['ami_id'] if 'ami_id' in env else '!Ref AmiId' %}
{% set type = env['allocation_type'] if 'allocation_type' in env else 'SPOT' %}
{% set strategy = env['allocation_strategy'] if 'allocation_strategy' in env else 'SPOT_PRICE_CAPACITY_OPTIMIZED' %}
{{ name }}ComputeEnvironment:
Type: AWS::Batch::ComputeEnvironment
Properties:
ServiceRole: !GetAtt BatchServiceRole.Arn
Type: MANAGED
ComputeResources:
Type: {{ type }}
AllocationStrategy: {{ strategy }}
MinvCpus: 0
MaxvCpus: !Ref MaxvCpus
InstanceTypes:
{% for instance_type in instance_types %}
- {{ instance_type }}
{% endfor %}
ImageId: {{ ami_id }}
Subnets: !Ref SubnetIds
InstanceRole: !Ref InstanceProfile
SecurityGroupIds:
- !Ref SecurityGroup
LaunchTemplate:
LaunchTemplateId: !Ref LaunchTemplate
Version: !GetAtt LaunchTemplate.LatestVersionNumber
Tags:
Name: !Ref AWS::StackName

{{ name }}JobQueue:
Type: AWS::Batch::JobQueue
Properties:
Priority: 1
ComputeEnvironmentOrder:
- ComputeEnvironment: !Ref ComputeEnvironment
- ComputeEnvironment: !Ref {{ name }}ComputeEnvironment
Order: 1
SchedulingPolicyArn: !Ref SchedulingPolicy

{% endfor %}

TaskRole:
Type: {{ 'Custom::JplRole' if security_environment in ('JPL', 'JPL-public') else 'AWS::IAM::Role' }}
Expand Down
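The `{% set %}` lines in the per-job compute environment loop above fall back to stack-level values whenever a job spec omits a key. A minimal Python sketch of that fallback logic follows. The helper name `resolve_compute_environment` and the example override are hypothetical and not part of this PR; the keys and default values are copied from the template.

```python
def resolve_compute_environment(env: dict) -> dict:
    """Mirror the template's fallbacks for optional compute environment keys.

    Hypothetical helper for illustration only; the keys and defaults are the
    ones that appear in apps/compute-cf.yml.j2.
    """
    return {
        # 'name' is required: the template indexes it without a fallback.
        'name': env['name'],
        'instance_types': env.get('instance_types', ['!Ref InstanceTypes']),
        'ami_id': env.get('ami_id', '!Ref AmiId'),
        'allocation_type': env.get('allocation_type', 'SPOT'),
        'allocation_strategy': env.get('allocation_strategy', 'SPOT_PRICE_CAPACITY_OPTIMIZED'),
    }


# Hypothetical override: only the instance types are set, so every other
# value falls back to its default.
resolved = resolve_compute_environment({'name': 'Gpu', 'instance_types': ['g4dn.xlarge']})
```

Because the default allocation strategy is a Spot-specific one, a job spec that overrides `allocation_type` would probably want to set `allocation_strategy` as well.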
8 changes: 8 additions & 0 deletions apps/handle-batch-event/handle-batch-event-cf.yml.j2
Expand Up @@ -5,6 +5,11 @@ Parameters:
JobQueueArn:
Type: String

{% for job_type, job_spec in job_types.items() if 'Default' != job_spec['compute_environment']['name'] %}
{{ job_spec['compute_environment']['name'] }}JobQueueArn:
Type: String
{% endfor %}

JobsTable:
Type: String

Expand Down Expand Up @@ -95,6 +100,9 @@ Resources:
detail:
jobQueue:
- !Ref JobQueueArn
{% for job_type, job_spec in job_types.items() if 'Default' != job_spec['compute_environment']['name'] %}
- !Ref {{ job_spec['compute_environment']['name'] }}JobQueueArn
{% endfor %}
status:
- RUNNING
Targets:
Expand Down
12 changes: 12 additions & 0 deletions apps/main-cf.yml.j2
Expand Up @@ -154,6 +154,10 @@ Resources:
Properties:
Parameters:
ComputeEnvironmentArn: !GetAtt Cluster.Outputs.ComputeEnvironmentArn
{% for job_type, job_spec in job_types.items() if 'Default' != job_spec['compute_environment']['name'] %}
{% set name = job_spec['compute_environment']['name'] %}
{{ name }}ComputeEnvironmentArn: !GetAtt Cluster.Outputs.{{ name }}ComputeEnvironmentArn
{% endfor %}
DefaultMaxvCpus: !Ref DefaultMaxvCpus
ExpandedMaxvCpus: !Ref ExpandedMaxvCpus
MonthlyBudget: !Ref MonthlyBudget
Expand All @@ -169,6 +173,10 @@ Resources:
Properties:
Parameters:
JobQueueArn: !GetAtt Cluster.Outputs.JobQueueArn
{% for job_type, job_spec in job_types.items() if 'Default' != job_spec['compute_environment']['name'] %}
{% set name = job_spec['compute_environment']['name'] %}
{{ name }}JobQueueArn: !GetAtt Cluster.Outputs.{{ name }}JobQueueArn
{% endfor %}
JobsTable: !Ref JobsTable
{% if security_environment == 'EDC' %}
SecurityGroupId: !GetAtt Cluster.Outputs.SecurityGroupId
Expand All @@ -181,6 +189,10 @@ Resources:
Properties:
Parameters:
JobQueueArn: !GetAtt Cluster.Outputs.JobQueueArn
{% for job_type, job_spec in job_types.items() if 'Default' != job_spec['compute_environment']['name'] %}
{% set name = job_spec['compute_environment']['name'] %}
{{ name }}JobQueueArn: !GetAtt Cluster.Outputs.{{ name }}JobQueueArn
{% endfor %}
TaskRoleArn: !GetAtt Cluster.Outputs.TaskRoleArn
JobsTable: !Ref JobsTable
Bucket: !Ref ContentBucket
Expand Down
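The same filtered loop (`for job_type, job_spec in job_types.items() if 'Default' != ...`) drives every extra parameter and output in these templates. The sketch below reproduces it in plain Python to show which parameter names it would generate. The function names and the sample `job_types` data are hypothetical. One consequence worth checking, which is an observation from reading the loop rather than something stated in the PR: the loop runs once per job type, not once per unique environment name, so two job types sharing a non-default name would appear to render the same key twice.

```python
from collections import Counter


def non_default_environment_names(job_types: dict) -> list:
    """Return one compute environment name per job type whose name is not
    'Default', in iteration order (the same filter as the template loop)."""
    return [
        job_spec['compute_environment']['name']
        for job_spec in job_types.values()
        if 'Default' != job_spec['compute_environment']['name']
    ]


def duplicated_names(names: list) -> list:
    """Names that occur more than once and would therefore be rendered twice."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


# Hypothetical job types; only the compute_environment block matters here.
job_types = {
    'JOB_A': {'compute_environment': {'name': 'Default'}},
    'JOB_B': {'compute_environment': {'name': 'Gpu'}},
    'JOB_C': {'compute_environment': {'name': 'Gpu'}},
}
names = non_default_environment_names(job_types)
parameters = [f'{name}JobQueueArn' for name in names]
repeated = duplicated_names(names)
```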
64 changes: 63 additions & 1 deletion apps/scale-cluster/scale-cluster-cf.yml.j2
Expand Up @@ -5,6 +5,12 @@ Parameters:
ComputeEnvironmentArn:
Type: String

{% for job_type, job_spec in job_types.items() if 'Default' != job_spec['compute_environment']['name'] %}
{% set name = job_spec['compute_environment']['name'] %}
{{ name }}ComputeEnvironmentArn:
Type: String
{% endfor %}

DefaultMaxvCpus:
Type: Number
MinValue: 0
Expand Down Expand Up @@ -79,7 +85,11 @@ Resources:
Resource: "*"
- Effect: Allow
Action: batch:UpdateComputeEnvironment
Resource: !Ref ComputeEnvironmentArn
Resource:
- !Ref ComputeEnvironmentArn
{% for job_type, job_spec in job_types.items() if 'Default' != job_spec['compute_environment']['name'] %}
- !Ref {{ job_spec['compute_environment']['name'] }}ComputeEnvironmentArn
{% endfor %}

Lambda:
Type: AWS::Lambda::Function
Expand Down Expand Up @@ -118,6 +128,11 @@ Resources:
Targets:
- Arn: !GetAtt Lambda.Arn
Id: lambda
{% for job_type, job_spec in job_types.items() if 'Default' != job_spec['compute_environment']['name'] %}
{% set name = job_spec['compute_environment']['name'] %}
- Arn: !GetAtt {{ name }}Lambda.Arn
Id: {{ name }}lambda
{% endfor %}

EventPermission:
Type: AWS::Lambda::Permission
Expand All @@ -126,3 +141,50 @@ Resources:
Action: lambda:InvokeFunction
Principal: events.amazonaws.com
SourceArn: !GetAtt Schedule.Arn

{% for job_type, job_spec in job_types.items() if 'Default' != job_spec['compute_environment']['name'] %}
{% set name = job_spec['compute_environment']['name'] %}
{{ name }}LogGroup:
Type: AWS::Logs::LogGroup
Properties:
LogGroupName: !Sub {{ "/aws/lambda/${" + name + "Lambda}" }}
RetentionInDays: 90

{{ name }}Lambda:
Type: AWS::Lambda::Function
Properties:
Environment:
Variables:
COMPUTE_ENVIRONMENT_ARN: !Ref {{ name }}ComputeEnvironmentArn
MONTHLY_BUDGET: !Ref MonthlyBudget
DEFAULT_MAX_VCPUS: !Ref DefaultMaxvCpus
EXPANDED_MAX_VCPUS: !Ref ExpandedMaxvCpus
REQUIRED_SURPLUS: !Ref RequiredSurplus
Code: src/
Handler: scale_cluster.lambda_handler
MemorySize: 128
Role: !GetAtt Role.Arn
Runtime: python3.9
Timeout: 30
{% if security_environment == 'EDC' %}
VpcConfig:
SecurityGroupIds:
- !Ref SecurityGroupId
SubnetIds: !Ref SubnetIds
{% endif %}

{{ name }}EventInvokeConfig:
Type: AWS::Lambda::EventInvokeConfig
Properties:
FunctionName: !Ref {{ name }}Lambda
Qualifier: $LATEST
MaximumRetryAttempts: 0

{{ name }}EventPermission:
Type: AWS::Lambda::Permission
Properties:
FunctionName: !GetAtt {{ name }}Lambda.Arn
Action: lambda:InvokeFunction
Principal: events.amazonaws.com
SourceArn: !GetAtt Schedule.Arn
{% endfor %}
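The `LogGroupName` expression above builds its `!Sub` placeholder by string concatenation inside a single Jinja expression, presumably to keep the literal `${` and `}` from colliding with Jinja's own braces. A minimal sketch of the string it produces, using a hypothetical helper name and a hypothetical environment name:

```python
def log_group_name_template(name: str) -> str:
    """Build the string that the Jinja expression renders for a given
    compute environment name. Hypothetical helper, for illustration only."""
    return '/aws/lambda/${' + name + 'Lambda}'


# For a hypothetical environment named 'Gpu', the rendered template contains
# a placeholder that !Sub later resolves against the GpuLambda resource.
rendered = log_group_name_template('Gpu')
```

After rendering, CloudFormation's `!Sub` substitutes `${GpuLambda}` with the value that `!Ref GpuLambda` would return.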
4 changes: 3 additions & 1 deletion apps/step-function.json.j2
Expand Up @@ -207,7 +207,9 @@
"Parameters": {
"JobDefinition": "{{ '${'+ snake_to_pascal_case(task['name']) + '}' }}",
"JobName.$": "$.job_id",
"JobQueue": "${JobQueueArn}",
{% set name = job_spec['compute_environment']['name'] %}
{% set job_queue = name + 'JobQueueArn' if 'Default' != name else 'JobQueueArn' %}
"JobQueue": "{{ '${' + job_queue + '}' }}",
"ShareIdentifier": "default",
"SchedulingPriorityOverride.$": "$.priority",
"Parameters.$": "$.job_parameters",
Expand Down
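The two `{% set %}` lines in this hunk pick which definition substitution the Batch task uses for its job queue. A small Python sketch of the same selection, with a hypothetical function name, makes the two cases explicit:

```python
def job_queue_placeholder(name: str) -> str:
    """Return the substitution placeholder for a compute environment name.

    Mirrors the template: 'Default' keeps the original JobQueueArn key, and
    any other name gets a '<name>JobQueueArn' key. Hypothetical helper.
    """
    # The conditional expression binds looser than '+', in Jinja as in Python.
    job_queue = name + 'JobQueueArn' if 'Default' != name else 'JobQueueArn'
    return '${' + job_queue + '}'


default_placeholder = job_queue_placeholder('Default')
custom_placeholder = job_queue_placeholder('Gpu')  # 'Gpu' is a hypothetical name
```

Each placeholder needs a matching key under `DefinitionSubstitutions`, which is what the loop added to `apps/workflow-cf.yml.j2` provides.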
17 changes: 17 additions & 0 deletions apps/workflow-cf.yml.j2
Expand Up @@ -5,6 +5,11 @@ Parameters:
JobQueueArn:
Type: String

{% for job_type, job_spec in job_types.items() if 'Default' != job_spec['compute_environment']['name'] %}
{{ job_spec['compute_environment']['name'] }}JobQueueArn:
Type: String
{% endfor %}

JobsTable:
Type: String

Expand Down Expand Up @@ -36,6 +41,7 @@ Outputs:
StepFunctionArn:
Value: !Ref StepFunction


Resources:
{% for job_type, job_spec in job_types.items() %}
{% for task in job_spec['tasks'] %}
Expand All @@ -60,6 +66,10 @@ Resources:
Value: "{{ task['vcpu'] }}"
- Type: MEMORY
Value: "{{ task['memory'] }}"
{% if 'gpu' in task %}
- Type: GPU
Value: "{{ task['gpu'] }}"
{% endif %}
Command:
{% for command in task['command'] %}
- {{ command }}
Expand All @@ -83,6 +93,10 @@ Resources:
DefinitionS3Location: step-function.json
DefinitionSubstitutions:
JobQueueArn: !Ref JobQueueArn
{% for job_type, job_spec in job_types.items() if 'Default' != job_spec['compute_environment']['name'] %}
{% set name = job_spec['compute_environment']['name'] %}
{{ name }}JobQueueArn: !Ref {{ name }}JobQueueArn
{% endfor %}
{% for job_type, job_spec in job_types.items() %}
{% for task in job_spec['tasks'] %}
{{ snake_to_pascal_case(task['name']) }}: !Ref {{ snake_to_pascal_case(task['name']) }}
Expand Down Expand Up @@ -124,6 +138,9 @@ Resources:
Action: batch:SubmitJob
Resource:
- !Ref JobQueueArn
{% for job_type, job_spec in job_types.items() if 'Default' != job_spec['compute_environment']['name'] %}
- !Ref {{ job_spec['compute_environment']['name'] }}JobQueueArn
{% endfor %}
{% for job_type, job_spec in job_types.items() %}
{% for task in job_spec['tasks'] %}
- !Ref {{ snake_to_pascal_case(task['name']) }}
Expand Down
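The `{% if 'gpu' in task %}` block adds a third resource requirement only when a task declares one. The sketch below builds the equivalent list in Python. The function name is hypothetical, and the `VCPU` type for the first entry is an assumption, since that line sits outside the visible hunk.

```python
def resource_requirements(task: dict) -> list:
    """Build the resource requirement entries for a task.

    Hypothetical helper mirroring the template: vCPU and memory are always
    present, and GPU is appended only when the task has a 'gpu' key.
    """
    requirements = [
        # Assumed type name; the template line declaring it is not shown.
        {'Type': 'VCPU', 'Value': str(task['vcpu'])},
        {'Type': 'MEMORY', 'Value': str(task['memory'])},
    ]
    if 'gpu' in task:
        # Values are rendered as quoted strings in the template.
        requirements.append({'Type': 'GPU', 'Value': str(task['gpu'])})
    return requirements


# Hypothetical tasks, one without and one with a GPU requirement.
cpu_only = resource_requirements({'vcpu': 1, 'memory': 7500})
with_gpu = resource_requirements({'vcpu': 4, 'memory': 16000, 'gpu': 1})
```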
2 changes: 2 additions & 0 deletions job_spec/ARIA_AUTORIFT.yml
Expand Up @@ -42,6 +42,8 @@ AUTORIFT:
DEFAULT:
cost: 1.0
validators: []
compute_environment:
name: 'Default'
tasks:
- name: ''
image: ghcr.io/asfhyp3/hyp3-autorift
Expand Down
2 changes: 2 additions & 0 deletions job_spec/ARIA_RAIDER.yml
Expand Up @@ -25,6 +25,8 @@ ARIA_RAIDER:
DEFAULT:
cost: 1.0
validators: []
compute_environment:
name: 'Default'
tasks:
- name: ''
image: ghcr.io/dbekaert/raider
Expand Down
2 changes: 2 additions & 0 deletions job_spec/AUTORIFT.yml
Expand Up @@ -39,6 +39,8 @@ AUTORIFT:
DEFAULT:
cost: 1.0
validators: []
compute_environment:
name: 'Default'
tasks:
- name: ''
image: ghcr.io/asfhyp3/hyp3-autorift
Expand Down
2 changes: 2 additions & 0 deletions job_spec/AUTORIFT_ITS_LIVE.yml
Expand Up @@ -51,6 +51,8 @@ AUTORIFT:
DEFAULT:
cost: 1.0
validators: []
compute_environment:
name: 'Default'
tasks:
- name: ''
image: ghcr.io/asfhyp3/hyp3-autorift
Expand Down
2 changes: 2 additions & 0 deletions job_spec/INSAR_GAMMA.yml
Expand Up @@ -82,6 +82,8 @@ INSAR_GAMMA:
cost: 1.0
validators:
- check_dem_coverage
compute_environment:
name: 'Default'
tasks:
- name: ''
image: 845172464411.dkr.ecr.us-west-2.amazonaws.com/hyp3-gamma
Expand Down
2 changes: 2 additions & 0 deletions job_spec/INSAR_ISCE.yml
Expand Up @@ -92,6 +92,8 @@ INSAR_ISCE:
DEFAULT:
cost: 1.0
validators: []
compute_environment:
name: 'Default'
tasks:
- name: ''
image: ghcr.io/access-cloud-based-insar/dockerizedtopsapp
Expand Down
2 changes: 2 additions & 0 deletions job_spec/INSAR_ISCE_BURST.yml
Expand Up @@ -44,6 +44,8 @@ INSAR_ISCE_BURST:
- check_valid_polarizations
- check_same_burst_ids
- check_not_antimeridian
compute_environment:
name: 'Default'
tasks:
- name: ''
image: ghcr.io/asfhyp3/hyp3-isce2
Expand Down
2 changes: 2 additions & 0 deletions job_spec/RTC_GAMMA.yml
Expand Up @@ -103,6 +103,8 @@ RTC_GAMMA:
cost: 1.0
validators:
- check_dem_coverage
compute_environment:
name: 'Default'
tasks:
- name: ''
image: 845172464411.dkr.ecr.us-west-2.amazonaws.com/hyp3-gamma
Expand Down
2 changes: 2 additions & 0 deletions job_spec/S1_CORRECTION_ITS_LIVE.yml
Expand Up @@ -29,6 +29,8 @@ S1_CORRECTION_TEST:
DEFAULT:
cost: 1.0
validators: []
compute_environment:
name: 'Default'
tasks:
- name: ''
image: ghcr.io/asfhyp3/hyp3-autorift
Expand Down
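Every job spec touched here gains a `compute_environment` block with `name: 'Default'`. That fits the templates, which index `job_spec['compute_environment']['name']` without a fallback, so a spec lacking the block would presumably fail at render time. A pre-render check could be sketched as follows; the function name and the sample data are hypothetical and not part of this PR.

```python
def missing_compute_environment(job_types: dict) -> list:
    """Return the job types whose spec has no compute_environment name.

    Hypothetical check for illustration; it treats a missing or empty
    compute_environment block the same as a block without a 'name' key.
    """
    return [
        job_type
        for job_type, job_spec in job_types.items()
        if 'name' not in (job_spec.get('compute_environment') or {})
    ]


# Hypothetical job types: one complete, one without the block, one empty.
job_types = {
    'JOB_A': {'compute_environment': {'name': 'Default'}, 'tasks': []},
    'JOB_B': {'tasks': []},
    'JOB_C': {'compute_environment': None, 'tasks': []},
}
problems = missing_compute_environment(job_types)
```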