Do not evict running jobs #701
Comments
Can you describe your use-case more? Would it not be possible to pass the annotation using the […]?
If you use Kyverno, there's an existing Kyverno policy to add the do-not-evict annotation to Job/CronJob pods automatically: https://kyverno.io/policies/karpenter/add-karpenter-donot-evict/add-karpenter-donot-evict/
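For reference, a minimal sketch of what such a Kyverno mutation rule can look like. It is not copied verbatim from the linked policy; the policy name and exact field placement here are illustrative. It adds the annotation to the pod template of every Job:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-karpenter-do-not-evict        # illustrative name
spec:
  rules:
    - name: add-do-not-evict-to-job-pods
      match:
        any:
          - resources:
              kinds:
                - Job
      mutate:
        patchStrategicMerge:
          spec:
            template:
              metadata:
                annotations:
                  # annotation read by Karpenter; pods carrying it block voluntary node disruption
                  karpenter.sh/do-not-evict: "true"
```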
We're considering this request as part of a batch of eviction-based features. @njtran to follow up.
Just ran into a similar issue, but with self-hosted CircleCI pods. They basically act the same as jobs (should only be run once). When Karpenter kills the node the pods are killed and don't come back. This issue took a long time to figure out because I saw exit code 137 events from containerd, which normally indicates OOM, but in this case the culprit was actually Karpenter. This is worth a warning in the docs, IMO.
We create a PodDisruptionBudget which allows for zero disruptions and has all such jobs in its selector.
@johngmyers, can you post an example of the PDB you're using? I'm a big fan of driving eviction behavior through the PDB API, if at all possible.
I don't have access to it at the moment. It has a large […].
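As a stand-in for the missing manifest, here is a minimal sketch of the pattern described above; the PDB name and label selector are assumptions, since the original reportedly uses a larger selector:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: block-job-eviction          # illustrative name
spec:
  maxUnavailable: 0                 # allow zero voluntary disruptions
  selector:
    matchLabels:
      disruption: do-not-evict      # assumed label applied to all such job pods
```

With maxUnavailable: 0 the Eviction API refuses voluntary evictions for the selected pods. As noted further down in the thread, that means Karpenter will skip (rather than gracefully drain) nodes running those pods.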
We have the same problem with Tekton Pipeline TaskRun pods. They get evicted after Karpenter has cordoned a node. The Pipeline has the annotation "karpenter.sh/do-not-evict: true" and passes it to the TaskRuns, but some TaskRun pods were still evicted when Karpenter consolidated a node.
We have the same problem with CronJobs. We're running jobs on some dedicated nodes; if we use "karpenter.sh/do-not-evict: true" or a PDB, Karpenter just skips those nodes and does nothing, so the nodes are never deprovisioned. The deprovisioning flow for CronJobs should be something like: cordon the node -> wait for the CronJob pods to finish -> drain the node.
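For anyone trying the annotation route with CronJobs, the key detail is that the annotation has to end up on the Pod template inside the jobTemplate, not on the CronJob object itself. A minimal sketch (names, schedule, and image are illustrative; newer Karpenter releases use karpenter.sh/do-not-disrupt instead of karpenter.sh/do-not-evict):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report              # illustrative
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          annotations:
            # must be on the Pod template so the created pods carry it
            karpenter.sh/do-not-evict: "true"
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: registry.example.com/report:latest   # illustrative
```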
100% agree with this.
My understanding is that a voluntary eviction will succeed on a pod in […]. Perhaps […].
For comments in this issue regarding pods with […]. For others who are experiencing the continual behavior of job pods terminating and rescheduling with […].
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the project's standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
We could try to set a label on Jobs (and their Pods) to declare whether or not the Pods are trivially interruptible. That allows people to opt in to the behavior where Job-owned Pods don't block a node replacement.

Eventually, supplant this with a MutatingAdmissionPolicy (these are very new) or other mechanism, so that you get per-namespace control to select which Jobs do or don't get marked as interruptible. That feels like it'd fit Karpenter / node autoscaling well.

With that architecture, Karpenter doesn't have to know about the Job API; it just has to look at the Pods bound to a node and check their labels.
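A sketch of what that opt-in could look like on a Job, assuming a hypothetical label key (nothing like it is defined by Kubernetes or Karpenter today):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: cache-warmer                                   # illustrative
  labels:
    batch.example.io/interruptible: "true"             # hypothetical label key
spec:
  template:
    metadata:
      labels:
        # mirrored on the Pod template so Karpenter only has to inspect the
        # Pods bound to a node, never the Job API
        batch.example.io/interruptible: "true"
    spec:
      restartPolicy: OnFailure
      containers:
        - name: worker
          image: registry.example.com/worker:latest    # illustrative
```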
If we want that label key defined, Kubernetes can help.
Curious if a "disruption cost" annotation on the pod could help make progress towards this class of problem. I could see this helping when consolidation makes an instance replacement for a marginal cost improvement, as well.
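There is already a related annotation in core Kubernetes, controller.kubernetes.io/pod-deletion-cost, which the ReplicaSet controller uses to prefer removing cheaper pods first; whether Karpenter should weigh it (or a new annotation) during consolidation is the open question here. A small sketch of how it is set (pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker-0                                 # illustrative
  annotations:
    # existing Kubernetes annotation: higher values mean "more expensive to delete"
    controller.kubernetes.io/pod-deletion-cost: "10000"
spec:
  containers:
    - name: worker
      image: registry.example.com/worker:latest        # illustrative
```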
/remove-lifecycle stale
We have […].
One way Karpenter could support that: have a NodePool-level setting that maps Pod deletion cost values to actual money. Either a scale factor (complex to teach), or a CEL expression (really complex to teach, if I'm honest). It needs to be NodePool-level in case you have on-prem and cloud NodePools in the same cluster. Related to this, maybe we'd like to add a node deletion cost annotation (which, like Pod deletion cost, would be an abstract integer value).
What units is the cost in for its current use? Isn't the node deletion cost the sum of its pods' costs? I could potentially see a config value in NodePool like defaultDeletionCost, potentially with a pod selector (including a field selector to grab owners like Job).
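To make the idea concrete, a purely hypothetical NodePool snippet along the lines of that suggestion; defaultDeletionCost and deletionCostSelector do not exist in Karpenter's API today, this is only what the proposal could look like:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    # hypothetical fields sketching the comment above; not part of Karpenter today
    defaultDeletionCost: 100
    deletionCostSelector:
      matchLabels:
        workload-type: job          # illustrative label
  template:
    spec:
      nodeClassRef:                 # assumes the AWS provider
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
```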
Maybe I'm getting lost in the conversation, but don't we already leverage the pod disruption cost for ordering nodes when deciding which ones should be considered first for any disruption operation? See https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/helpers.go#L125.

I'm also skeptical that this solves the problem that was originally raised in this issue. Isn't the proper fix here to ensure that Karpenter taints nodes before it validates which pods have scheduled to them, so that we reduce our chance of race conditions around this checkpoint? Our biggest problem today is that we grab the nodes and the pods and then determine if there are any […]. If we taint ahead of checking this, it seems like we circumvent our leaky behavior with pods today.
@jonathan-innis I am also lost here; is my original issue fixed by your ideas? I just cannot add the do-not-evict stuff to all jobs without introducing additional complexity like kubemod.
@runningman84 if you're asking for Karpenter to solve this so you don't need to install a second tool, that game's not worth the candle (it's a lot of effort for Kubernetes contributors that isn't justified by the benefits to end users).
You can use a mutating admission webhook to achieve this; doing that is a very well-known approach.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the project's standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
We're seeing cases that look like the following with the Actions Runner Controller project, which we use to run GitHub Actions. It's reducing our confidence in our testing, and generally the actions are done within about 10 minutes or so. TerminationGracePeriodSeconds on the pods running the actions doesn't seem to help. One odd thing is that we see the following pattern, which I think is intended to be consolidation: […]
Note that the pod was […]. Did the suggestion of a PodDisruptionBudget work / help others? It would be nice to have a common documented FAQ topic for this, as I imagine that using Karpenter to handle bursty "job-type" workloads is fairly common.
/remove-lifecycle stale
@evankanderson This sounds like the intended use case for the […].
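The referenced feature is cut off in the comment above. Assuming it is the NodePool-level terminationGracePeriod that newer Karpenter releases expose, a minimal sketch looks roughly like this (the NodePool name and node class are illustrative, and the exact field layout may differ by version):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ci-runners                  # illustrative
spec:
  template:
    spec:
      # upper bound on how long Karpenter waits for pods (including
      # do-not-disrupt ones) to finish before force-terminating the node
      terminationGracePeriod: 30m
      nodeClassRef:                 # assumes the AWS provider
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```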
@jmdeal We have been facing a similar race condition problem wherein our k8s job pods get scheduled right before the tainting of nodes happens, or right after the consolidation decision is made. "An inflight change will block this node terminations on the pods" - what does that mean? How long will it take to have this fix implemented?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the project's standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
What is the current recommendation regarding CronJobs on Karpenter-managed nodes? We have some long-running CronJobs (30+ minutes) that are often being disrupted. My understanding is that the […].
@Makeshift GitHub is not the right place for support requests. If you're looking for help, try Server Fault. You can also post your question on the Kubernetes Slack or the Discuss Kubernetes forum. For Slack, the channels to use are […].
Tell us about your request
Right now consolidation replaces nodes with running jobs. This breaks a lot of deployments (doing DB upgrades or other one-off work). One way to handle this issue is putting do-not-evict annotations on the corresponding pods, but it is difficult to ensure that happens if you have dozens of teams using a given cluster.
Could this be handled in a more generic way?
All these jobs look like this:
Controlled By: Job/xxx-yyy-zzz-28031590
Can Karpenter optionally just apply the do-not-evict behavior to running jobs?
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
It is difficult to ensure all pods have the do-not-evict annotation.
Are you currently working around this issue?
Manually setting do-not-evict on all corresponding pods.
Additional Context
No response
Attachments
No response
Community Note