Follow-up for #479
I strongly believe that for such critical dev-infrastructure software as a CI/CD system controller, all measures should be taken to prevent customer resource overuse and overspend in any of the following error scenarios:

- The controller gets into `CrashLoopBackOff` or otherwise loses the ability to communicate with the K8S API.
- `copy-agent`, `imagecheck-0`, `checkout`, `agent`, or `container-0` gets into an unrecoverable error state, such as "Agent crashes with SIGSEGV at checkout stage" (agent#3149).

K8S itself should be able to eventually kill stuck CI Jobs from the cluster. Otherwise, as happened to us in #479, we had 96 jobs stuck in a partial-error state for over a week, and we only found out due to billing overspend detection.
K8S Jobs have a fairly robust and simple mechanism: `Job.spec.activeDeadlineSeconds`, which, if set, lets K8S itself terminate Jobs that exceed the active deadline. This would have helped us avoid the overspend.

Currently, https://github.com/buildkite/agent-stack-k8s only sets `activeDeadlineSeconds` as part of `cleanupSidecars()`, which, as far as I understand, happens after the controller detects the job being done and, as described here, is an unreliable mechanism:
agent-stack-k8s/internal/controller/scheduler/completions.go, line 97 (commit 734dfdd)
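To illustrate, here is a minimal sketch (not the controller's actual code) of what applying the deadline at Job-creation time could look like. The 6-hour default and the `buildJob` helper are illustrative assumptions; the point is that the deadline is part of the submitted Job spec, so Kubernetes enforces it even if the controller later crashes or loses API access:

```go
// Sketch only: a hypothetical Job constructor that always sets
// activeDeadlineSeconds before the Job is submitted, so the Kubernetes Job
// controller enforces the deadline independently of agent-stack-k8s health.
package scheduler

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// defaultActiveDeadlineSeconds is a hypothetical default (6 hours).
const defaultActiveDeadlineSeconds int64 = 6 * 60 * 60

// buildJob is a hypothetical constructor, not the real scheduler code.
func buildJob(name string, podSpec corev1.PodSpec, deadline *int64) *batchv1.Job {
	if deadline == nil {
		d := defaultActiveDeadlineSeconds
		deadline = &d
	}
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: batchv1.JobSpec{
			// Enforced by the Kubernetes Job controller once exceeded.
			ActiveDeadlineSeconds: deadline,
			Template: corev1.PodTemplateSpec{
				Spec: podSpec,
			},
		},
	}
}
```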
I think https://github.com/buildkite/agent-stack-k8s must always set a good default `Job.spec.activeDeadlineSeconds` (say 3-6 hours) and allow users to override it on a per-Job basis, for those customers who have long-running jobs or otherwise need to customize this behavior.
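As a rough sketch of the intended precedence (both knobs are hypothetical, not existing agent-stack-k8s configuration), a per-Job value wins when present, otherwise the controller-wide default applies, and the deadline is never left unset:

```go
// resolveActiveDeadline continues the sketch above: a per-job override (e.g.
// 12h for a long-running build) takes precedence over the controller default
// (e.g. 3-6 hours), and the result is never nil/unlimited.
func resolveActiveDeadline(controllerDefault int64, perJobOverride *int64) *int64 {
	if perJobOverride != nil && *perJobOverride > 0 {
		return perJobOverride
	}
	d := controllerDefault
	return &d
}
```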