Add a gracePeriod for the do-not-disrupt pod annotation #752

Open
stevehipwell opened this issue Mar 16, 2022 · 37 comments
Labels
  • deprovisioning: Issues related to node deprovisioning
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@stevehipwell

Tell us about your request
I'd like the ability to taint a node after a given period, via a new ttlSecondsUntilTainted setting, so that running pods can finish before the node is terminated. This should still respect ttlSecondsUntilExpired and ttlSecondsAfterEmpty.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
When running jobs such as CI/CD pipelines in Kubernetes, there are long-running jobs which shouldn't be terminated because of the high cost of re-running them. By adding ttlSecondsUntilTainted we can have nodes that expire and are replaced without the cost of killing potentially long-running jobs.
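
For context, the existing TTLs sit at the top level of the v1alpha5 Provisioner spec; a sketch of where the proposed (hypothetical) field could live:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 2592000   # 30 days
  # Proposed (hypothetical) field: taint the node this long after creation so
  # no new pods schedule, while letting running pods finish before expiry.
  ttlSecondsUntilTainted: 2160000   # 25 days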

Are you currently working around this issue?
Not using Karpenter.

Additional context
n/a

Attachments
n/a

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
stevehipwell added the kind/feature label Mar 16, 2022
@ellistarn
Contributor

ellistarn commented Mar 16, 2022

Does annotating the pods of your long-running jobs with karpenter.sh/do-not-evict work for you?

@stevehipwell
Author

@ellistarn wouldn't this allow more jobs to schedule onto the node and basically make it unterminatable?

@ellistarn
Contributor

The node will still expire and cordon, but it won't start draining until the pods with that annotation are done. @njtran can you confirm?

@njtran
Contributor

njtran commented Mar 16, 2022

Yep! If you have a pod with a do-not-evict annotation, Karpenter will not drain any of the pods on a node, but it will still cordon the node.

A node is not drained if it has a do-not-evict pod, and we will not terminate a node until that node is considered drained.
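
For reference, a minimal sketch of a pod opting out of eviction with the v1alpha5 annotation (name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: long-running-job   # placeholder name
  annotations:
    # Karpenter will cordon, but not drain, the node while this pod is running
    karpenter.sh/do-not-evict: "true"
spec:
  containers:
    - name: worker
      image: busybox   # placeholder image
      command: ["sleep", "3600"]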

Does this fit your use case?

@stevehipwell
Author

@njtran this fits my use case's happy path perfectly, but it opens up a separate concern: the do-not-evict annotation could keep a node running indefinitely. If there isn't already a way to handle that, I'm happy to close this issue as a "didn't read the docs correctly" and open a new issue about a do-not-evict grace period.

@njtran
Contributor

njtran commented Apr 15, 2022

Sorry for the delay! You're correct in that if a do-not-evict pod exists, then Karpenter would never try to scale down the node.

open a new issue regarding a do-not-evict grace period?

Could you expand on what you mean here? What's the use case for having an additional grace period for a do-not-evict pod? You could programmatically delete these pods once they've completed; the node could then be deleted, provided there were no other do-not-evict pods.

@stevehipwell
Author

@njtran fundamentally it comes down to reducing the blast radius. If anyone who can start a container can keep a node running, that starts to look like a potential problem. I'd like to know that after a given time, no matter what a user has done, a node will terminate and be replaced.

@ellistarn
Contributor

additional grace period for a do-not-evict

Is there a difference between this and terminationGracePeriodSeconds?
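
For context, terminationGracePeriodSeconds is the standard pod-spec field bounding how long the kubelet waits between SIGTERM and SIGKILL; a minimal sketch (name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: graceful-worker   # placeholder name
spec:
  # kubelet sends SIGTERM, then waits up to 10 minutes before SIGKILL
  terminationGracePeriodSeconds: 600
  containers:
    - name: worker
      image: busybox   # placeholder image
      command: ["sleep", "3600"]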

@tzneal
Contributor

tzneal commented Apr 15, 2022

I'd like to know that after a given time, no matter what a user has done, a node will terminate and be replaced.

Even if it would violate a pod disruption budget?

@stevehipwell
Author

Is there a difference between this and terminationGracePeriodSeconds?

@ellistarn this would terminate the node even if a pod with the karpenter.sh/do-not-evict annotation was still running. Effectively, after the TTL expires the annotation would be ignored.

Even if it would violate a pod disruption budget?

@tzneal I think that's a separate concern, but a pretty important one. A TTL which guarantees the node is terminated would make sense; having pods stuck due to user error and blocking node maintenance is a common problem when running a platform.

njtran changed the title from "Add ttlSecondsUntilTainted for better node lifecycle control" to "Add a gracePeriod for the do-not-evict pod annotation" Nov 7, 2022
@njtran
Contributor

njtran commented Nov 7, 2022

Renamed this issue to more accurately reflect the discussion above.

@alekhrycaiko

Yep! If you have a pod with a do-not-evict annotation, Karpenter will not drain any of the pods on a node, but it will still cordon the node.

@njtran Curious if this is still the expected behaviour. When enabling consolidation, I experienced Nodes being cordoned but not drained; I'm wondering if this annotation on a Pod is what's causing it.

A TTL which guarantees the node is terminated would make sense; having pods stuck due to user error and blocking node maintenance is a common problem when running a platform.

+1 a ttl would be helpful.

@njtran
Contributor

njtran commented May 10, 2023

@alekhrycaiko sorry for the late response here. You should be able to see the kube-events that show when a node couldn't be deprovisioned due to a blocking pod.

And @stevehipwell sorry for the late response! This makes sense to me, as it essentially guarantees (involuntary disruptions aside) that a workload will be up for at least its intended duration. This would require some design, and would be great for anyone to take up if they want more experience in the code.

@njtran
Contributor

njtran commented May 10, 2023

Thinking about it more, is there a reason you can't rely on activeDeadlineSeconds for this? https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup

If a pod is terminal, we ignore its do-not-evict annotation as well. So is there something other than Jobs that you want a do-not-evict TTL for?
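
For illustration, a Job bounded with activeDeadlineSeconds (name, image, and values are placeholders); once the deadline passes the pods become terminal and, per the above, the do-not-evict annotation no longer blocks the node:

apiVersion: batch/v1
kind: Job
metadata:
  name: ci-e2e-tests   # placeholder name
spec:
  # the Job's pods are terminated once this deadline passes
  activeDeadlineSeconds: 3600
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-evict: "true"
    spec:
      restartPolicy: Never
      containers:
        - name: tests
          image: busybox   # placeholder image
          command: ["sleep", "1800"]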

@stevehipwell
Author

@njtran this feature request is to taint a node so that no further pods are scheduled on it, allowing either the running pods to finish or the expired TTL to kill the pod(s). A use case would be to have a grace period between the taint TTL and the expired TTL, to support clean node terminations when pod age stays within a range lower than that grace period; for example a CI/CD system.

@njtran
Contributor

njtran commented May 12, 2023

Ah, I think I fully understand now. So this TTLSecondsUntilTainted (or maybe better named TTLSecondsUntilCordoned) would act as a programmatic scheduling window for all nodes. If you combine this with ttlSecondsUntilExpired, you get a grace period during which you can guarantee your workloads won't be expired (although you can't guarantee it for other disruptions). Am I right?

Since the only disruption mechanism that is driven by time is expiration, this TTLUntilCordoned seems more likely to make it harder for Consolidation to reduce cost, and without any enabled disruption mechanisms it could result in a lot of unused capacity. I think this would be better driven at the application level, using activeDeadlineSeconds for jobs with do-not-evict, or adding a do-not-evict-ignore-ttl if there's a reason for pods that aren't jobs.

@stevehipwell
Author

@njtran for the use case I'm referring to, the pods are explicitly not jobs and are expensive to re-run, for example a CI/CD system running E2E tests where terminating the pod would have a high cost. The desired behaviour would be for Karpenter to cordon a node to stop any new pods being scheduled, while allowing existing pods to either complete successfully or time out before the node is replaced. It might be best to drive this as a pod label such as cordon-timeout-seconds: "120", as sketched below. Consolidation could then de-prioritise nodes with pods labelled for this behaviour.
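
A sketch of what that might look like on a pod; the cordon-timeout-seconds key is the hypothetical proposal above, not an existing Karpenter label:

apiVersion: v1
kind: Pod
metadata:
  name: e2e-test-runner   # placeholder name
  labels:
    # hypothetical: once the node is cordoned, give this pod up to 120s to finish
    cordon-timeout-seconds: "120"
spec:
  containers:
    - name: runner
      image: busybox   # placeholder image
      command: ["sleep", "100"]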

@njtran
Contributor

njtran commented May 23, 2023

My apologies on the length of time on this discussion, I've lost track of this issue many times 😞.

If the pods are explicitly not jobs, where terminating the pod would have a high cost, how do you coordinate when the pod is okay to be evicted? Is there some special semantic for this CI/CD pod that signifies completion to another controller in your cluster that isn't represented by a kubernetes-native construct?

It seems odd to me to create a node-level, time-based restriction on scheduling that is really motivated by a pod-level, time-based restriction. While you can get into a situation where a node is not optimally utilized through bad luck with how these pods get rescheduled, if you're eventually expecting the node to be expired, the node should still be utilized in the meantime, since Karpenter doesn't cordon it. Once no pods have scheduled to it, Karpenter could execute expiration. But you're right that if do-not-evict pods unluckily keep opting the node out of deprovisioning, Karpenter may never deprovision this node.

In my eyes, given the tight correlation with Consolidation and Expiration here, I'm not convinced that this is something that makes sense to include in the API. I think this would better be modeled as #622. Thoughts @stevehipwell ?

@stevehipwell
Author

I think that #622 would probably solve enough of this use case to be viable.

Architecturally speaking, though, I'm not 100% convinced that a pod (and therefore potentially a low-permissioned user) should be able to impact and effectively own a node lifecycle, as is the case currently and would remain even if #622 is implemented; this looks like a privilege escalation to me. Providing a mechanism for the cluster admin to set a max node TTL, so that a node is terminated once its lifecycle should have ended, would address this issue.

@jonathan-innis
Member

should be able to impact and effectively own a node lifecycle

Can you try controlling this with something like Kyverno or some other admission webhook mechanism that would block "normal" users from setting this annotation value on a pod to avoid the privilege escalation?
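
A rough sketch of such a policy, assuming Kyverno (policy name and message are illustrative; swap in karpenter.sh/do-not-evict for pre-v1beta1 versions):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-do-not-disrupt   # placeholder name
spec:
  validationFailureAction: Enforce
  rules:
    - name: block-do-not-disrupt-annotation
      match:
        any:
          - resources:
              kinds:
                - Pod
      # an exclude block could exempt admin groups or service accounts
      validate:
        message: "Setting the karpenter.sh/do-not-disrupt annotation is restricted."
        pattern:
          metadata:
            # negation anchor: reject pods that carry the annotation at all
            =(annotations):
              X(karpenter.sh/do-not-disrupt): "null"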

@stevehipwell
Author

Can you try controlling this with something like Kyverno or some other admission webhook mechanism that would block "normal" users from setting this annotation value on a pod to avoid the privilege escalation?

I can and likely will run OPA Gatekeeper to control this when we get Karpenter into production; but not everyone who runs Karpenter will have any policy capability.

njtran transferred this issue from aws/karpenter-provider-aws Nov 2, 2023
@ssoriche

ssoriche commented Nov 2, 2023

I've been reading through the design document referenced above (#516) and read that to mean the controls for when a pod is going to be considered for eviction are placed on the NodePool. We have pods that take 20 minutes to start up and checkpoint every 5 minutes after that. The gracePeriod mentioned in this issue, set on the pod, would let us block eviction of the pod long enough for the task to get traction before being considered for consolidation.
Having it at the NodePool level would mean that each workload with a long start-up time and checkpoint interval would need its own NodePool. We experimented with separating our workloads into node pools and saw a more than 50% increase in our cluster costs immediately.

Please forgive me if this is not the right place for the above feedback; I'm willing to move the conversation to the appropriate venue.

@njtran
Contributor

njtran commented Nov 13, 2023

read that to mean the controls for when a pod is going to be considered for eviction are placed on the NodePool

Not necessarily. In the context of this issue, I could imagine the solution changing the value of the annotation to a time.Duration-formatted string, which Karpenter could use with CreationTimestamp to know when it can be deleted. For NodePool-level controls, though, I could see something like a minNodeLifetime being included, which would opt out nodes that aren't old enough for disruption.
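
For example, a pod-level sketch of that shape; the duration-valued annotation is the hypothetical form discussed here, not a value the annotation is documented to accept:

apiVersion: v1
kind: Pod
metadata:
  name: checkpointing-worker   # placeholder name
  annotations:
    # hypothetical value: stop blocking disruption 30m after the pod's CreationTimestamp
    karpenter.sh/do-not-disrupt: "30m"
spec:
  containers:
    - name: worker
      image: busybox   # placeholder image
      command: ["sleep", "1800"]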

We experimented with separating our workloads into node pools and saw a more than 50% increase in our cluster costs immediately.

This seems wrong, or at least unexpected, and it would be good to figure out what's going on. Can you open an issue for this, or let's sync up on Slack? I think another issue would be a better place to track this.

@jonathan-innis
Member

jonathan-innis commented Jan 8, 2024

Copying a conversation over from #926 (comment). Playing around with passing this same grace period down to a node from the NodeClaim, since it's possible to use the karpenter.sh/do-not-disrupt annotation across all NodeClaims/Nodes in v1beta1 🤔

I wonder if there is room in the disruption controls for something like "do not disrupt within first 5m of life"

We have this as an annotation that can get propagated onto the Node from the NodeClaim today. This seems like a really nice place to do #752: push a karpenter.sh/do-not-disrupt annotation that can be configured either at the Pod spec level or at the NodeClaim level (which could be passed down from NodePool spec.template.metadata.annotations), so you could do something like:

spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "10m"

@ellistarn
Contributor

Love this. Applied at both pod and node level? Could make a new annotation alongside the existing one, called disruptAfter

@jonathan-innis
Member

jonathan-innis commented Jan 8, 2024

Could make a new annotation alongside the existing one, called disruptAfter

Any reason you think that it should be a new annotation vs. a value for the existing one?

@ellistarn
Contributor

new annotation

Just conceptual alignment, at the cost of a new, indefinitely deprecated annotation.

A few things to consider:

  1. Should this be configured at the node pool level (in addition to, or exclusively)?
  2. Should we consider making other of our controls (expiry TTL) have node-level annotation overrides? I could see an argument that a user should be able to manually say "shut this down tonight" with an annotation.

It may not make sense to try to align all of these concepts, but worth exploring.

@Bryce-Soghigian
Member

Bryce-Soghigian commented Jan 8, 2024

I like the idea that we can just set do-not-disrupt=5m on the NodePool under annotations and have the do-not-disrupt annotation propagated to all nodes in the NodePool.

If the case is for more temporary actions one can set it on the nodes themselves manually. Disruption control at the node level is a very useful tool to have at your disposal, so forcing it into the nodepool api itself removes a bit of that freedom.

It gives more control on the annotation level vs just sticking somewhere in the api where that control is tied to the nodepool.

One could also argue that having something that controls disruption outside of the disruption controls leads to a fragmented experience: I can't just go to the NodePool disruption controls to understand the full behaviour of disruption (there are other things like PDBs that may block consolidation, but I'm referring to Karpenter settings).

@jonathan-innis
Member

It gives more control on the annotation level vs just sticking somewhere in the api where that control is tied to the nodepool

There's an argument that you could make this field "part of the API" by making it part of the NodeClaim, since that's basically the place where Karpenter does API configuration for the Node. Annotations are (in a way) an extension of the API when you don't have full control over the schema yourself. This makes sense in the context of pods, since we don't control the Pod API, but makes less sense with Nodes, where we already have a representation of the thing (NodeClaim).

To play devil's advocate, the counter-argument to this is that using karpenter.sh/do-not-disrupt in both places means there's a single concept that users can reason about everywhere rather than having fragmented concepts. I don't think there's necessarily a huge burden here if we decided not to go the same route with both API concepts, but food for thought.

@ellistarn
Contributor

We reconcile anything in the template as drift, though I believe we ignore any mutations of the underlying nodeclaim / node. Does it make sense to implement it at the NodePool level as a first class API, and then also as an annotation that can be manually applied to pods and nodes?

@Bryce-Soghigian
Member

Bryce-Soghigian commented Jan 9, 2024

When discussing the implementation at the node pool level, it's important to note that we already have some similar concepts in disruption controls. We have consolidateAfter and expireAfter, so disruptAfter seems to fit right into the group.

spec: # This is not a complete NodePool Spec.
  disruption:
    consolidationPolicy: WhenUnderutilized || WhenEmpty
    consolidateAfter: 10m || Never # metav1.Duration
    # add disruptAfter: 10m || Never
    expireAfter: 10m || Never # Equivalent to v1alpha5 TTLSecondsUntilExpired

To play devil's advocate, the counter-argument to this is that using karpenter.sh/do-not-disrupt in both places means there's a single concept that users can reason about everywhere rather than having fragmented concepts.

I like the idea of both; if not for drift, we could potentially control disruption through the whole NodePool like so.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      # Annotations are arbitrary key-values that are applied to all nodes
      annotations:
        karpenter.sh/consolidateAfter: "10m"
        karpenter.sh/disruptAfter: "10m"

ConsolidateAfter, expireAfter, and potentially disruptAfter should all be annotations. It still makes sense to have this concept of disruption controls under the CRD for controlling disruption, alongside the annotations. Maybe it's a bit redundant, but that's OK; let users pick what they want to use if it's not too high of a burden.

@jonathan-innis
Member

ConsolidateAfter, expireAfter, and potentially disruptAfter should all be annotations

I'm not sure I love the idea of spreading out the API like this. I definitely think that there's benefit to having karpenter.sh/do-not-disrupt for nodes for ad-hoc debugging; that was the reason the feature was originally added in the first place. If we added support for the duration value on pods, we may just be stuck implementing it on the Node as well for parity, and you'd get the feature for free if you pass the annotations down like above.

From the perspective of defining minNodeLifetime at the NodePool level, I could see the argument for adding this into the NodePool so that it doesn't drift everything. We're actually having a similar discussion in #834 where there's a bit of a back-and-forth around whether we should avoid drifting on certain properties of the NodePool.

Similar to the terminationGracePeriod feature that we are talking about, this is another field that I could see living on the NodeClaim and being templated down from the NodePool. minLifetime really has a meaning in the context of stand-alone NodeClaims (really maxLifetime does as well, but we currently have this as expiry in the NodePool disruption block). We may be stuck with expiry in the disruption block at this point, most importantly because expiration has to be reasoned about for the disruption budgets.

It does raise a question though: are there values on NodeClaims that can also be "control" mechanisms that don't cause drift on Nodes but are updated in place, similar to the disruption block in the NodePool?

jonathan-innis changed the title from "Add a gracePeriod for the do-not-evict pod annotation" to "Add a gracePeriod for the do-not-disrupt pod annotation" Feb 19, 2024
@jcmcken

jcmcken commented May 13, 2024

I'm relatively new to Karpenter so I may be mistaken here, but from my experience so far it feels like there are at least two kinds of disruptions that should be supported:

  1. Disruptions by cluster administrators, e.g. scheduled maintenance events. Nodes NEED to be terminated and re-provisioned to accommodate security or other cluster updates. In this "mode", Karpenter should simply ignore do-not-disrupt annotations entirely -- workloads should not be able to block disruption. Cluster administrators are not frequently executing disruptions in this mode (e.g. once a month, once a quarter -- whatever the update policy is in the organization). But in any case, the disruption should be immediate (as if do-not-disrupt were not present) and not require any grace period (although I suppose it doesn't hurt to have one). Without this, cluster maintenance is frankly just annoying -- the cluster admin has to find all the annotated workloads and remove the annotation or else kill nodes by hand.
  2. Disruptions by Karpenter itself, e.g. consolidation / cost-saving replacements, and so on. Karpenter should respect do-not-disrupt in all cases in this mode. In these cases, the workload owner has chosen that they don't care about saving money (or other considerations) and want their workload to run to completion on the existing infrastructure (e.g. a long-running ML job).

Right now, it seems like mode 1 is not a possibility. A workload owner can trivially and indefinitely block a node from de-provisioning, which is a strange power for a workload owner to have. Adding OPA rules to prevent labeling workloads is also not a solution since it's completely valid for workload owners to exercise this power under normal circumstances. Just not when a cluster admin is trying to perform maintenance.

Additionally, having a single TTL / ignore-disruptions setting for both of these cases doesn't really seem to make sense. If I have a disruption TTL for cluster maintenance events, I strictly do not want it to apply to consolidation events. These are completely separate concerns. Consolidation events can have their own TTL, but I don't think it makes sense to have a single, global setting.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Aug 11, 2024
@njtran
Contributor

njtran commented Aug 12, 2024

/lifecycle frozen

k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label Aug 12, 2024
@njtran
Contributor

njtran commented Aug 12, 2024

@jcmcken you should look to terminationGracePeriod, which should solve your use-case for case 1: https://github.com/kubernetes-sigs/karpenter/pull/916/files
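
A partial sketch of how that could look, assuming the NodeClaim-level terminationGracePeriod from that PR is templated through the NodePool's spec.template.spec (the 48h value is just an example):

spec:                 # NodePool (partial)
  template:
    spec:
      # upper bound on how long Karpenter waits for pods (including
      # do-not-disrupt pods) to drain before the node is forcibly terminated
      terminationGracePeriod: 48h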
