Do not evict running jobs #701
Comments
Can you describe your use-case more? Would it not be possible to pass the annotation using the […]?
If you use Kyverno, there's an existing Kyverno policy to add the do-not-evict annotation to Job/CronJob pods automatically: https://kyverno.io/policies/karpenter/add-karpenter-donot-evict/add-karpenter-donot-evict/
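For reference, a minimal sketch of what such a Kyverno mutation rule can look like. It is not copied verbatim from the linked policy; the policy name and exact field placement here are illustrative. It adds the annotation to the pod template of every Job:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-karpenter-do-not-evict        # illustrative name
spec:
  rules:
    - name: add-do-not-evict-to-job-pods
      match:
        any:
          - resources:
              kinds:
                - Job
      mutate:
        patchStrategicMerge:
          spec:
            template:
              metadata:
                annotations:
                  # annotation read by Karpenter; pods carrying it block voluntary node disruption
                  karpenter.sh/do-not-evict: "true"
```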
We're considering this request as part of a batch of eviction-based features. @njtran to follow up.
Just ran into a similar issue, but with self-hosted CircleCI pods. They basically act the same as jobs (should only be run once). When Karpenter kills the node the pods are killed and don't come back. This issue took a long time to figure out because I saw exit code 137 events from containerd, which normally indicates OOM, but in this case the culprit was actually Karpenter. This is worth a warning in the docs, IMO.
We create a PodDisruptionBudget which allows for zero disruptions and has all such jobs in its selector.
@johngmyers, can you post an example of the PDB you're using? I'm a big fan of driving eviction behavior through the PDB API, if at all possible.
I don't have access to it at the moment. It has a large […].
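As a stand-in for the missing manifest, here is a minimal sketch of the pattern described above; the PDB name and label selector are assumptions, since the original reportedly uses a larger selector:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: block-job-eviction          # illustrative name
spec:
  maxUnavailable: 0                 # allow zero voluntary disruptions
  selector:
    matchLabels:
      disruption: do-not-evict      # assumed label applied to all such job pods
```

With maxUnavailable: 0 the Eviction API refuses voluntary evictions for the selected pods. As noted further down in the thread, that means Karpenter will skip (rather than gracefully drain) nodes running those pods.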
We have the same problem with Tekton Pipeline TaskRun pods. They get evicted after Karpenter has cordoned a node. The Pipeline has the annotation "karpenter.sh/do-not-evict: true" and passes it to the TaskRuns, but some TaskRun pods were still evicted when Karpenter consolidated a node.
We have the same problem with CronJobs. We're running jobs on some dedicated nodes; if we use "karpenter.sh/do-not-evict: true" or a PDB, Karpenter just skips those nodes and does nothing, so the nodes are never deprovisioned. The deprovisioning flow for CronJobs should be something like: cordon the node -> wait for the CronJob pods to finish -> drain the node.
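For anyone trying the annotation route with CronJobs, the key detail is that the annotation has to end up on the Pod template inside the jobTemplate, not on the CronJob object itself. A minimal sketch (names, schedule, and image are illustrative; newer Karpenter releases use karpenter.sh/do-not-disrupt instead of karpenter.sh/do-not-evict):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report              # illustrative
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          annotations:
            # must be on the Pod template so the created pods carry it
            karpenter.sh/do-not-evict: "true"
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: registry.example.com/report:latest   # illustrative
```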
100% agree with this.
My understanding is that a voluntary eviction will succeed on a pod in […]. Perhaps […].
For comments in this issue regarding pods with […]. For others who are experiencing the continual behavior of job pods terminating and rescheduling with […].
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the project's standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
We could try to set a label on Jobs (and their Pods) to declare whether or not the Pods are trivially interruptible. That allows people to opt in to the behavior where Job-owned Pods don't block a node replacement.

Eventually, supplant this with a MutatingAdmissionPolicy (these are very new) or other mechanism, so that you get per-namespace control to select which Jobs do or don't get marked as interruptible. That feels like it'd fit Karpenter / node autoscaling well.

With that architecture, Karpenter doesn't have to know about the Job API; it just has to look at the Pods bound to a node and check their labels.
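A sketch of what that opt-in could look like on a Job, assuming a hypothetical label key (nothing like it is defined by Kubernetes or Karpenter today):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: cache-warmer                                   # illustrative
  labels:
    batch.example.io/interruptible: "true"             # hypothetical label key
spec:
  template:
    metadata:
      labels:
        # mirrored on the Pod template so Karpenter only has to inspect the
        # Pods bound to a node, never the Job API
        batch.example.io/interruptible: "true"
    spec:
      restartPolicy: OnFailure
      containers:
        - name: worker
          image: registry.example.com/worker:latest    # illustrative
```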
If we want that label key defined, Kubernetes can help.
Curious if a "disruption cost" annotation on the pod could help make progress towards this class of problem. I could see this helping when consolidation makes an instance replacement for a marginal cost improvement, as well.
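There is already a related annotation in core Kubernetes, controller.kubernetes.io/pod-deletion-cost, which the ReplicaSet controller uses to prefer removing cheaper pods first; whether Karpenter should weigh it (or a new annotation) during consolidation is the open question here. A small sketch of how it is set (pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker-0                                 # illustrative
  annotations:
    # existing Kubernetes annotation: higher values mean "more expensive to delete"
    controller.kubernetes.io/pod-deletion-cost: "10000"
spec:
  containers:
    - name: worker
      image: registry.example.com/worker:latest        # illustrative
```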
/remove-lifecycle stale
We have […].
One way Karpenter could support that: have a NodePool-level setting that maps Pod deletion cost values to actual money. Either a scale factor (complex to teach), or a CEL expression (really complex to teach, if I'm honest). It needs to be NodePool-level in case you have on-prem and cloud NodePools in the same cluster. Related to this, maybe we'd like to add a node deletion cost annotation (which, like Pod deletion cost, would be an abstract integer value).
What units is the cost in for its current use? Isn't the node deletion cost the sum of its pods' costs? I could potentially see a config value in NodePool like defaultDeletionCost, potentially with a pod selector (including a field selector to grab owners like Job).
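To make the idea concrete, a purely hypothetical NodePool snippet along the lines of that suggestion; defaultDeletionCost and deletionCostSelector do not exist in Karpenter's API today, this is only what the proposal could look like:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    # hypothetical fields sketching the comment above; not part of Karpenter today
    defaultDeletionCost: 100
    deletionCostSelector:
      matchLabels:
        workload-type: job          # illustrative label
  template:
    spec:
      nodeClassRef:                 # assumes the AWS provider
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
```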
Maybe I'm getting lost in the conversation, but don't we already leverage the pod disruption cost for ordering nodes when deciding which ones should be considered first for any disruption operation? See https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/helpers.go#L125.

I'm also skeptical that this solves the problem that was originally raised in this issue. Isn't the proper fix here to ensure that Karpenter taints nodes before it validates which pods have scheduled to them, so that we reduce our chance of race conditions around this checkpoint? Our biggest problem today is that we grab the nodes and the pods and then determine if there are any […]. If we taint ahead of checking this, it seems like we circumvent our leaky behavior with pods today.
@jonathan-innis I am also lost here; is my original issue fixed by your ideas? I just cannot add the do-not-evict stuff to all jobs without introducing additional complexity like kubemod.
@runningman84 if you're asking for Karpenter to solve this so you don't need to install a second tool, that game's not worth the candle (it's a lot of effort for Kubernetes contributors that isn't justified by the benefits to end users).
You can use a mutating admission webhook to achieve this; doing that is a very well-known approach.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the project's standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
We're seeing cases that look like the following with the Actions Runner Controller project, which we use to run GitHub Actions. It's reducing our confidence in our testing, and generally the actions are done within about 10 minutes or so. TerminationGracePeriodSeconds on the pods running the actions doesn't seem to help. One odd thing is that we see the following pattern, which I think is intended to be consolidation: […]
Note that the pod was […]. Did the suggestion of a PodDisruptionBudget work / help others? It would be nice to have a common documented FAQ topic for this, as I imagine that using Karpenter to handle bursty "job-type" workloads is fairly common.
/remove-lifecycle stale
@evankanderson This sounds like the intended use case for the […].
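The referenced feature is cut off in the comment above. Assuming it is the NodePool-level terminationGracePeriod that newer Karpenter releases expose, a minimal sketch looks roughly like this (the NodePool name and node class are illustrative, and the exact field layout may differ by version):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ci-runners                  # illustrative
spec:
  template:
    spec:
      # upper bound on how long Karpenter waits for pods (including
      # do-not-disrupt ones) to finish before force-terminating the node
      terminationGracePeriod: 30m
      nodeClassRef:                 # assumes the AWS provider
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```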
@jmdeal We have been facing a similar race condition problem wherein our k8s job pods get scheduled right before the tainting of nodes happens, or right after the consolidation decision is made. "An inflight change will block this node terminations on the pods" - what does that mean? How long will it take to have this fix implemented?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the project's standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
What is the current recommendation regarding CronJobs on Karpenter-managed nodes? We have some long-running CronJobs (30+ minutes) that are often being disrupted. My understanding is that the […].
@Makeshift GitHub is not the right place for support requests. If you're looking for help, try Server Fault. You can also post your question on the Kubernetes Slack or the Discuss Kubernetes forum. For Slack, the channels to use are […].
Tell us about your request
Right now consolidation replaces nodes with running jobs. This breaks a lot of deployments (doing DB upgrades or other one-off work). One way to handle this issue is putting do-not-evict annotations on the corresponding pods, but it is difficult to ensure that happens if you have dozens of teams using a given cluster.
Could this be handled in a more generic way?
All these jobs look like this:
Controlled By: Job/xxx-yyy-zzz-28031590
Can Karpenter optionally just apply the do-not-evict behavior to running jobs?
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
It is difficult to ensure all pods have the do-not-evict annotation.
Are you currently working around this issue?
Manually setting do-not-evict on all corresponding pods.
Additional Context
No response
Attachments
No response
Community Note