Controller should always set Job.spec.activeDeadlineSeconds #480

Open
artem-zinnatullin opened this issue Jan 23, 2025 · 0 comments · May be fixed by #497

@artem-zinnatullin
Contributor

Follow-up for #479

I strongly believe that for software as critical to development infrastructure as a CI/CD system controller, every measure should be taken to prevent customer resource overuse and overspend in any error scenario.

K8S itself should be able to eventually kill stuck CI Jobs in the cluster. Otherwise, things can go wrong the way they did for us in #479: we had 96 Jobs stuck in a partial-error state for over a week, and we only found out because our billing overspend detection flagged it.

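K8S Jobs have a simple and fairly robust mechanism for this: Job.spec.activeDeadlineSeconds. When it is set, K8S itself terminates any Job that runs longer than the deadline (its running Pods are terminated and the Job is marked as failed with reason DeadlineExceeded), which would have helped us avoid the overspend.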

Currently, https://github.com/buildkite/agent-stack-k8s only sets activeDeadlineSeconds as part of cleanupSidecars(). As far as I understand, that runs only after the controller detects that the job is done, which, as described here, is an unreliable mechanism:

```go
// In cleanupSidecars(): the deadline is only set once the job is considered done
job.Spec.ActiveDeadlineSeconds = ptr.To[int64](defaultTermGracePeriodSeconds)
```

I think https://github.com/buildkite/agent-stack-k8s must always set a sensible default Job.spec.activeDeadlineSeconds (say, 3 to 6 hours) and allow users to override it on a per-Job basis, for customers who have long-running jobs or otherwise need to customize this behavior.
