Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SUP-3258 - Implement .spec.job.activeDeadlineSeconds #497

Merged
merged 2 commits into from
Feb 13, 2025

Conversation

petetomasik
Copy link
Contributor

@petetomasik petetomasik commented Feb 6, 2025

Implements .spec.job.activeDeadlineSeconds as a means of setting runtime boundaries for Kubernetes jobs that are unable to complete within a "reasonable" amount of time.

This is configured with a default value of 21600 seconds (6 hours) in the controller configuration. This value can be overridden by the user via the command line --job-active-deadline-seconds 43200, directly in the controller configuration job-active-deadline-seconds: 43200, or on a per Buildkite job basis via the kubernetes plugin:

steps:
  - label: "12 hour nap, override default 6 hour job-active-deadline-seconds config"
    command: sleep 43200
    plugins:
      - kubernetes:
          jobActiveDeadlineSeconds: 46800

At this time .spec.job.activeDeadlineSeconds will not be able to be disabled, as we agree that a job runtime boundary should be defined at all times.

Fixes #480
Fixes #479
Fixes #463

@petetomasik petetomasik requested a review from a team as a code owner February 6, 2025 14:59
Copy link
Contributor

@DrJosh9000 DrJosh9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Other than one comment below, ideally this would also have a test. The easiest way currently is probably an integration test that uses a pipeline YAML fixture with a small activeDeadlineSeconds and a command that sleeps longer than it. But it may be tricky to take image pull time into account, so the deadline and sleep would probably be at least 30 seconds to avoid flakiness.

@@ -75,6 +76,7 @@ func New(logger *zap.Logger, client kubernetes.Interface, cfg Config) *worker {
}

type KubernetesPlugin struct {
Job *batchv1.Job `json:"job,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using *batchv1.Job here could get confusing - someone might try to write:

plugins:
- kubernetes:
    job:
      template:
        spec:
          containers:
            ...

which wouldn't be rejected for any failure to parse, and the user might expect that defining containers that way would work but instead it would be silently ignored.

(Maybe one day the whole KubernetesPlugin should be changed to revolve around a *batchv1.Job instead of defining pieces of it. Needs thought.)

In the meantime I would suggest either just adding JobActiveDeadlineSeconds as a KubernetesPlugin field directly, or maybe wrapping it in a new Job struct that can be grown over time to include other job parameters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is step 1 of a long journey to fully transition to *batchv1.Job. I went back and forth on whether to implement it this way and possibly cause confusion for users that try to leverage the entire Job spec. There are additional tickets we have that will continue to build off of the Job spec (backoffLimit, podFailurePolicy, etc.) bit by bit. We could implement this (and the future additional fields) as fields in the plugin and eventually bring them back into the full Job spec when the time comes to more precisely control the feature set.

@petetomasik petetomasik marked this pull request as draft February 12, 2025 15:05
@petetomasik petetomasik force-pushed the SUP-3258-job-active-deadline-seconds branch from fd05068 to 7a13241 Compare February 12, 2025 22:00
@petetomasik petetomasik marked this pull request as ready for review February 12, 2025 22:21
Copy link
Contributor

@DrJosh9000 DrJosh9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ✅

@petetomasik petetomasik merged commit 835cc76 into main Feb 13, 2025
1 check passed
@petetomasik petetomasik deleted the SUP-3258-job-active-deadline-seconds branch February 13, 2025 14:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants