Enabling variables to control job batch limits #103
This parameter is specific to Blocked Job Queue events. I would update the variables to follow the Batch API. E.g. instead of
Implementation note: Since
Besides adding the variables to
@delagoya agree
  description = "The time limit in seconds for the job to run before the action is taken"
}

variable "job_state_time_limit_reason" {
Point to Batch docs for valid reasons
modules/computation/batch.tf
Outdated
job_state_time_limit_action {
  action           = "CANCEL"
  max_time_seconds = var.job_state_time_limit_timeout
  reason           = var.job_state_time_limit_reason
This block is only valid if both the reason and the time limit are defined. The Batch API will return an error if any attribute is missing, so a check is not strictly necessary, but it would be nice to know before starting a deploy that it will fail when only one of the two is set.
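One possible way to surface this at plan time, as a rough sketch rather than what this PR implements: make both variables nullable, emit the block only when both are set, and fail early with a precondition when only one is set. The resource label `this`, the `null` defaults, the locals, and the `state = "RUNNABLE"` line are assumptions that do not appear in the diff above, and `precondition` assumes Terraform 1.2 or later.

```hcl
# Assumption: null means "no job state time limit configured".
variable "job_state_time_limit_timeout" {
  type        = number
  default     = null
  description = "The time limit in seconds for the job to run before the action is taken"
}

variable "job_state_time_limit_reason" {
  type        = string
  default     = null
  description = "The reason to log when the action is taken"
}

locals {
  # Hypothetical helper names, used only in this sketch.
  time_limit_timeout_set = var.job_state_time_limit_timeout != null
  time_limit_reason_set  = var.job_state_time_limit_reason != null
}

# Hypothetical resource label; the queue's other arguments stay as they are in the module.
resource "aws_batch_job_queue" "this" {
  # ... existing arguments unchanged ...

  # Emit the block only when both values are provided.
  dynamic "job_state_time_limit_action" {
    for_each = local.time_limit_timeout_set && local.time_limit_reason_set ? [1] : []
    content {
      action           = "CANCEL"
      state            = "RUNNABLE" # assumption: not shown in the diff above
      max_time_seconds = var.job_state_time_limit_timeout
      reason           = var.job_state_time_limit_reason
    }
  }

  lifecycle {
    # Fail during plan if exactly one of the two variables is set.
    precondition {
      condition     = local.time_limit_timeout_set == local.time_limit_reason_set
      error_message = "Set both job_state_time_limit_timeout and job_state_time_limit_reason, or neither."
    }
  }
}
```

With this shape, leaving both variables unset keeps the queue as it is today, and setting only one of them stops the plan with the message above instead of failing during apply.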
We've been encountering recurring issues with jobs in some AWS Batch queues getting stuck in the RUNNABLE state. This seems to be caused by resource availability constraints or potential misconfigurations in our workflows. To address this, I'm implementing a job state time limit for these jobs in the metaflow-computation module. The AWS documentation allows us to extend the capabilities of this resource.
Here’s the approach:
The benefits of this change are clear: