Enabling variables to control job batch limits #103

Open
wants to merge 3 commits into
base: master
from

Conversation

mramireztgtg

We've been encountering recurring issues with some AWS Batch queues getting stuck with jobs in the RUNNABLE state. This seems to be caused by resource availability constraints or potential misconfigurations in our workflows. To address this, I'm implementing a Job State Limit for these jobs in the metaflow-computation module. According to the AWS documentation, the job queue resource can be extended with this capability.

Here's the approach:

  • Introduce an optional variable that allows us to configure the timeout and define the action to take when a job exceeds the allowed state duration.

The benefits of this change are clear:

  • It would prevent jobs from running excessively long (in our case, one run lasted 40+ hours over a weekend), which has been blocking new executions and impacting production workloads.
  • It addresses an issue several teams have mentioned in Slack threads, making it a valuable improvement for our broader community.

@delagoya

delagoya commented Jan 24, 2025

This parameter is specific to Blocked Job Queue events

I would update the variables to follow the Batch API.

E.g. instead of job_state_limit_action_timeout it should be job_state_limit_action_max_time_seconds. The full set of variables should be:

  • job_state_limit_action - type should be either string CANCEL or null
  • job_state_limit_action_max_time_seconds - type number, 600 or greater
  • job_state_limit_action_reason - type enum string, one of the valid reasons in the Batch documentation
  • job_state_limit_action_state - type string, either RUNNABLE or null

Implementation note:

Since action and state each allow only a single value, it would be possible to omit them for now and expose just max_time_seconds and reason. However, the API may later add more possible values for these parameters.
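
A minimal sketch of how the four variables could be declared, assuming Terraform 1.x variable validation. The names follow the list above; the defaults, descriptions, and the list of reason strings are assumptions and should be checked against the Batch documentation before use.

```hcl
# Sketch only: names follow the list above; defaults, descriptions, and the
# reason strings are assumptions to verify against the Batch documentation.

variable "job_state_limit_action" {
  type        = string
  default     = null
  description = "Action taken when a job stays in the given state too long."

  validation {
    # The ternary avoids evaluating the comparison against a null value.
    condition     = var.job_state_limit_action == null ? true : var.job_state_limit_action == "CANCEL"
    error_message = "job_state_limit_action must be \"CANCEL\" or null."
  }
}

variable "job_state_limit_action_max_time_seconds" {
  type        = number
  default     = null
  description = "Seconds a job may stay in the given state before the action is taken."

  validation {
    condition     = var.job_state_limit_action_max_time_seconds == null ? true : var.job_state_limit_action_max_time_seconds >= 600
    error_message = "job_state_limit_action_max_time_seconds must be at least 600."
  }
}

variable "job_state_limit_action_reason" {
  type        = string
  default     = null
  description = "Reason logged for the action. See the Batch documentation for valid values."

  validation {
    condition = var.job_state_limit_action_reason == null ? true : contains([
      "CAPACITY:INSUFFICIENT_INSTANCE_TYPE",
      "MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE",
      "MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT",
    ], var.job_state_limit_action_reason)
    error_message = "job_state_limit_action_reason must be one of the reasons listed in the Batch documentation."
  }
}

variable "job_state_limit_action_state" {
  type        = string
  default     = null
  description = "Job state the limit applies to."

  validation {
    condition     = var.job_state_limit_action_state == null ? true : var.job_state_limit_action_state == "RUNNABLE"
    error_message = "job_state_limit_action_state must be \"RUNNABLE\" or null."
  }
}
```

Keeping action and state as variables costs little now and leaves room for new values if the API adds them later.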

@delagoya

Besides adding the variables to main.tf you will need to update the /modules/computation/batch.tf aws_batch_job_queue resource.
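
As a rough illustration of that resource update, the block could be made optional with a `dynamic` block so that existing deployments are unaffected when the variables are unset. This is a sketch, not the module's actual code: the resource label, queue name, priority, and `compute_environment_arn` variable are placeholders, and it assumes an AWS provider version that supports `job_state_time_limit_action` on `aws_batch_job_queue`.

```hcl
# Sketch only: resource label, queue name, priority, and the
# compute_environment_arn variable are hypothetical placeholders.

variable "compute_environment_arn" {
  type        = string
  description = "ARN of an existing Batch compute environment (placeholder input)."
}

variable "job_state_time_limit_timeout" {
  type    = number
  default = null
}

variable "job_state_time_limit_reason" {
  type    = string
  default = null
}

resource "aws_batch_job_queue" "example" {
  name     = "example-queue"
  state    = "ENABLED"
  priority = 1

  compute_environment_order {
    order               = 1
    compute_environment = var.compute_environment_arn
  }

  # Emit the block only when both values are set; otherwise the queue
  # is created without a job state time limit action.
  dynamic "job_state_time_limit_action" {
    for_each = var.job_state_time_limit_timeout != null && var.job_state_time_limit_reason != null ? [1] : []

    content {
      action           = "CANCEL"
      state            = "RUNNABLE"
      max_time_seconds = var.job_state_time_limit_timeout
      reason           = var.job_state_time_limit_reason
    }
  }
}
```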

@mramireztgtg
Author

@delagoya agree
I've pushed the suggested changes

description = "The time limit in seconds for the job to run before the action is taken"
}

variable "job_state_time_limit_reason" {

Point to Batch docs for valid reasons

job_state_time_limit_action {
  action           = "CANCEL"
  max_time_seconds = var.job_state_time_limit_timeout
  reason           = var.job_state_time_limit_reason

This attribute set is only valid if both reason and time limit are defined. The Batch API will throw an error if any attribute is missing, so a check is not strictly necessary, but it would be nice to know before starting a deploy that it will fail when only one of the two is defined.
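
One way such an early check could look is a `precondition` that fails at plan time when only one of the two values is set. This is a sketch under assumptions: it uses the variable names from the diff above, requires Terraform 1.4 or later for `terraform_data`, and the resource label is hypothetical. The same `precondition` could instead sit in the `lifecycle` block of the `aws_batch_job_queue` resource.

```hcl
# Sketch only: variable names are taken from the diff above; the
# terraform_data resource label is hypothetical.

variable "job_state_time_limit_timeout" {
  type    = number
  default = null
}

variable "job_state_time_limit_reason" {
  type    = string
  default = null
}

resource "terraform_data" "job_state_time_limit_check" {
  lifecycle {
    precondition {
      # True when both are null or both are set; false when only one is set.
      condition     = (var.job_state_time_limit_timeout == null) == (var.job_state_time_limit_reason == null)
      error_message = "Set both job_state_time_limit_timeout and job_state_time_limit_reason, or leave both unset."
    }
  }
}
```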

2 participants