Follow-up for #479
I strongly believe that for such critical dev-infrastructure software as a CI/CD system controller, all measures should be taken to prevent customer resource overuse and overspend in any of the following error scenarios:

- The controller gets into `CrashLoopBackOff` or otherwise loses the ability to communicate with the K8S API.
- `copy-agent`, `imagecheck-0`, `checkout`, `agent`, or `container-0` gets into an unrecoverable error state, such as "Agent crashes with SIGSEGV at checkout stage" (agent#3149).

K8S itself should be able to eventually kill stuck CI Jobs from the cluster. Otherwise, as happened to us in #479, we had 96 jobs stuck in a partial-error state for over a week, and we only found out due to billing overspend detection.
K8S Jobs have a fairly robust and simple mechanism: `Job.spec.activeDeadlineSeconds`, which, if set, lets K8S itself terminate Jobs that exceed the active deadline. This would have helped us avoid the overspend.

Currently, https://github.com/buildkite/agent-stack-k8s only sets `activeDeadlineSeconds` as part of `cleanupSidecars()`, which, as far as I understand, happens after the controller detects the job being done and, as described here, is an unreliable mechanism:
agent-stack-k8s/internal/controller/scheduler/completions.go, line 97 (commit 734dfdd)
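To illustrate, here is a minimal sketch (not the controller's actual code) of what applying the deadline at Job-creation time could look like. The 6-hour default and the `buildJob` helper are illustrative assumptions; the point is that the deadline is part of the submitted Job spec, so Kubernetes enforces it even if the controller later crashes or loses API access:

```go
// Sketch only: a hypothetical Job constructor that always sets
// activeDeadlineSeconds before the Job is submitted, so the Kubernetes Job
// controller enforces the deadline independently of agent-stack-k8s health.
package scheduler

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// defaultActiveDeadlineSeconds is a hypothetical default (6 hours).
const defaultActiveDeadlineSeconds int64 = 6 * 60 * 60

// buildJob is a hypothetical constructor, not the real scheduler code.
func buildJob(name string, podSpec corev1.PodSpec, deadline *int64) *batchv1.Job {
	if deadline == nil {
		d := defaultActiveDeadlineSeconds
		deadline = &d
	}
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: batchv1.JobSpec{
			// Enforced by the Kubernetes Job controller once exceeded.
			ActiveDeadlineSeconds: deadline,
			Template: corev1.PodTemplateSpec{
				Spec: podSpec,
			},
		},
	}
}
```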
I think https://github.com/buildkite/agent-stack-k8s must always set a good default `Job.spec.activeDeadlineSeconds` (say 3-6 hours) and allow users to override it on a per-Job basis, for those customers who have long-running jobs or otherwise need to customize this behavior.
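As a rough sketch of the intended precedence (both knobs are hypothetical, not existing agent-stack-k8s configuration), a per-Job value wins when present, otherwise the controller-wide default applies, and the deadline is never left unset:

```go
// resolveActiveDeadline continues the sketch above: a per-job override (e.g.
// 12h for a long-running build) takes precedence over the controller default
// (e.g. 3-6 hours), and the result is never nil/unlimited.
func resolveActiveDeadline(controllerDefault int64, perJobOverride *int64) *int64 {
	if perJobOverride != nil && *perJobOverride > 0 {
		return perJobOverride
	}
	d := controllerDefault
	return &d
}
```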