Add a gracePeriod for the do-not-disrupt pod annotation #752

Open
stevehipwell opened this issue Mar 16, 2022 · 37 comments
Labels
  • deprovisioning: Issues related to node deprovisioning
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@stevehipwell

Tell us about your request
I'd like the ability to taint a node after a given period, via a new ttlSecondsUntilTainted setting, so that running pods can finish before the node is terminated. This should still respect ttlSecondsUntilExpired and ttlSecondsAfterEmpty.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
When running jobs such as CI/CD pipelines in Kubernetes, there are long-running jobs which shouldn't be terminated because of the high cost of re-running them. By adding ttlSecondsUntilTainted we can have nodes that expire and are replaced without the cost of killing potentially long-running jobs.
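
For context, the existing TTLs sit at the top level of the v1alpha5 Provisioner spec; a sketch of where the proposed (hypothetical) field could live:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 2592000   # 30 days
  # Proposed (hypothetical) field: taint the node this long after creation so
  # no new pods schedule, while letting running pods finish before expiry.
  ttlSecondsUntilTainted: 2160000   # 25 days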

Are you currently working around this issue?
Not using Karpenter.

Additional context
n/a

Attachments
n/a

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
stevehipwell added the kind/feature label Mar 16, 2022
@ellistarn
Contributor

ellistarn commented Mar 16, 2022

Does annotating the pods of your long-running jobs with karpenter.sh/do-not-evict work for you?

@stevehipwell
Author

@ellistarn wouldn't this allow more jobs to schedule onto the node and basically make it unterminatable?

@ellistarn
Contributor

The node will still expire and cordon, but it won't start draining until the pods with that annotation are done. @njtran can you confirm?

@njtran
Contributor

njtran commented Mar 16, 2022

Yep! If you have a pod with a do-not-evict annotation, Karpenter will not drain any of the pods on a node, but it will still cordon the node.

A node is not drained if it has a do-not-evict pod, and we will not terminate a node until that node is considered drained.
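
For reference, a minimal sketch of a pod opting out of eviction with the v1alpha5 annotation (name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: long-running-job   # placeholder name
  annotations:
    # Karpenter will cordon, but not drain, the node while this pod is running
    karpenter.sh/do-not-evict: "true"
spec:
  containers:
    - name: worker
      image: busybox   # placeholder image
      command: ["sleep", "3600"]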

Does this fit your use case?

@stevehipwell
Author

@njtran this fits my use case's happy path perfectly, but it opens up a separate concern: the do-not-evict annotation could keep a node running indefinitely. If there isn't already a way to handle that, I'm happy to close this issue as a "didn't read the docs correctly" and open a new issue about a do-not-evict grace period.

@njtran
Contributor

njtran commented Apr 15, 2022

Sorry for the delay! You're correct in that if a do-not-evict pod exists, then Karpenter would never try to scale down the node.

open a new issue regarding a do-not-evict grace period?

Could you expand on what you mean here? What's the use case for having an additional grace period for a do-not-evict pod? You could programmatically delete these pods once they've completed; the node could then be deleted, provided there were no other do-not-evict pods.

@stevehipwell
Author

@njtran fundamentally it comes down to reducing the blast radius. If anyone who can start a container can keep a node running, that starts to look like a potential problem. I'd like to know that after a given time, no matter what a user has done, a node will terminate and be replaced.

@ellistarn
Contributor

additional grace period for a do-not-evict

Is there a difference between this and terminationGracePeriodSeconds?
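
For context, terminationGracePeriodSeconds is the standard pod-spec field bounding how long the kubelet waits between SIGTERM and SIGKILL; a minimal sketch (name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: graceful-worker   # placeholder name
spec:
  # kubelet sends SIGTERM, then waits up to 10 minutes before SIGKILL
  terminationGracePeriodSeconds: 600
  containers:
    - name: worker
      image: busybox   # placeholder image
      command: ["sleep", "3600"]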

@tzneal
Contributor

tzneal commented Apr 15, 2022

I'd like to know that after a given time, no matter what a user has done, a node will terminate and be replaced.

Even if it would violate a pod disruption budget?

@stevehipwell
Author

Is there a difference between this and terminationGracePeriodSeconds?

@ellistarn this would terminate the node even if a pod with the karpenter.sh/do-not-evict annotation was still running. Effectively, after the TTL expires the annotation would be ignored.

Even if it would violate a pod disruption budget?

@tzneal I think that's a separate concern, but a pretty important one. A TTL which guarantees the node is terminated would make sense; having pods stuck due to user error and blocking node maintenance is a common problem when running a platform.

njtran changed the title from "Add ttlSecondsUntilTainted for better node lifecycle control" to "Add a gracePeriod for the do-not-evict pod annotation" Nov 7, 2022
@njtran
Contributor

njtran commented Nov 7, 2022

Renamed this issue to more accurately reflect the discussion above.

@alekhrycaiko

Yep! If you have a pod with a do-not-evict annotation, Karpenter will not drain any of the pods on a node, but it will still cordon the node.

@njtran Curious if this is still the expected behaviour. When enabling consolidation, I experienced Nodes being cordoned but not drained; I'm wondering if this annotation on a Pod is what's causing it.

A TTL which guarantees the node is terminated would make sense; having pods stuck due to user error and blocking node maintenance is a common problem when running a platform.

+1 a ttl would be helpful.

@njtran
Contributor

njtran commented May 10, 2023

@alekhrycaiko sorry for the late response here. You should be able to see the kube-events that show when a node couldn't be deprovisioned due to a blocking pod.

And @stevehipwell sorry for the late response! This makes sense to me, as it essentially guarantees (involuntary disruptions aside) that a workload will be up for at least its intended duration. This would require some design, and would be great for anyone to take up if they want more experience in the code.

@njtran
Contributor

njtran commented May 10, 2023

Thinking about it more, is there a reason you can't rely on activeDeadlineSeconds for this? https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup

If a pod is terminal, we ignore its do-not-evict annotation as well. So is there something other than Jobs that you want a do-not-evict TTL for?
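
For illustration, a Job bounded with activeDeadlineSeconds (name, image, and values are placeholders); once the deadline passes the pods become terminal and, per the above, the do-not-evict annotation no longer blocks the node:

apiVersion: batch/v1
kind: Job
metadata:
  name: ci-e2e-tests   # placeholder name
spec:
  # the Job's pods are terminated once this deadline passes
  activeDeadlineSeconds: 3600
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-evict: "true"
    spec:
      restartPolicy: Never
      containers:
        - name: tests
          image: busybox   # placeholder image
          command: ["sleep", "1800"]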

@stevehipwell
Author

@njtran this feature request is to taint a node so that no further pods are scheduled on it, allowing either the running pods to finish or the expired TTL to kill the pod(s). A use case would be to have a grace period between the taint TTL and the expired TTL, to support clean node terminations when pod age stays within a range lower than that grace period; for example a CI/CD system.

@njtran
Contributor

njtran commented May 12, 2023

Ah, I think I fully understand now. So this TTLSecondsUntilTainted (or maybe better named TTLSecondsUntilCordoned) would act as a programmatic scheduling window for all nodes. If you combine this with ttlSecondsUntilExpired, you get a grace period during which you can guarantee your workloads won't be expired (although you can't guarantee it for other disruptions). Am I right?

Since the only disruption mechanism that is driven by time is expiration, this TTLUntilCordoned seems more likely to make it harder for Consolidation to reduce cost, and without any enabled disruption mechanisms it could result in a lot of unused capacity. I think this would be better driven at the application level, using activeDeadlineSeconds for jobs with do-not-evict, or adding a do-not-evict-ignore-ttl if there's a reason for pods that aren't jobs.

@stevehipwell
Author

@njtran for the use case I'm referring to, the pods are explicitly not jobs and are expensive to re-run, for example a CI/CD system running E2E tests where terminating the pod would have a high cost. The desired behaviour would be for Karpenter to cordon a node to stop any new pods being scheduled, while allowing existing pods to either complete successfully or time out before the node is replaced. It might be best to drive this as a pod label such as cordon-timeout-seconds: "120", as sketched below. Consolidation could then de-prioritise nodes with pods labelled for this behaviour.
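
A sketch of what that might look like on a pod; the cordon-timeout-seconds key is the hypothetical proposal above, not an existing Karpenter label:

apiVersion: v1
kind: Pod
metadata:
  name: e2e-test-runner   # placeholder name
  labels:
    # hypothetical: once the node is cordoned, give this pod up to 120s to finish
    cordon-timeout-seconds: "120"
spec:
  containers:
    - name: runner
      image: busybox   # placeholder image
      command: ["sleep", "100"]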

@njtran
Contributor

njtran commented May 23, 2023

My apologies on the length of time on this discussion, I've lost track of this issue many times 😞.

If the pods are explicitly not jobs, where terminating the pod would have a high cost, how do you coordinate when the pod is okay to be evicted? Is there some special semantic for this CI/CD pod that signifies completion to another controller in your cluster that isn't represented by a kubernetes-native construct?

It seems odd to me to create a node-level, time-based restriction on scheduling that is really motivated by a pod-level, time-based restriction. While you can get into a situation where a node is not optimally utilized through bad luck with how these pods get rescheduled, if you're eventually expecting the node to be expired, the node should still be utilized in the meantime, since Karpenter doesn't cordon it. Once no pods have scheduled to it, Karpenter could execute expiration. But you're right that if do-not-evict pods unluckily keep opting the node out of deprovisioning, Karpenter may never deprovision this node.

In my eyes, given the tight correlation with Consolidation and Expiration here, I'm not convinced that this is something that makes sense to include in the API. I think this would better be modeled as #622. Thoughts @stevehipwell ?

@stevehipwell
Author

I think that #622 would probably solve enough of this use case to be viable.

Architecturally speaking, though, I'm not 100% convinced that a pod (and therefore potentially a low-permissioned user) should be able to impact and effectively own a node lifecycle, as is the case currently and would remain even if #622 is implemented; this looks like a privilege escalation to me. Providing a mechanism for the cluster admin to set a max node TTL, so that a node is terminated once its lifecycle should have ended, would address this issue.

@jonathan-innis
Member

should be able to impact and effectively own a node lifecycle

Can you try controlling this with something like Kyverno or some other admission webhook mechanism that would block "normal" users from setting this annotation value on a pod to avoid the privilege escalation?
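
A rough sketch of such a policy, assuming Kyverno (policy name and message are illustrative; swap in karpenter.sh/do-not-evict for pre-v1beta1 versions):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-do-not-disrupt   # placeholder name
spec:
  validationFailureAction: Enforce
  rules:
    - name: block-do-not-disrupt-annotation
      match:
        any:
          - resources:
              kinds:
                - Pod
      # an exclude block could exempt admin groups or service accounts
      validate:
        message: "Setting the karpenter.sh/do-not-disrupt annotation is restricted."
        pattern:
          metadata:
            # negation anchor: reject pods that carry the annotation at all
            =(annotations):
              X(karpenter.sh/do-not-disrupt): "null"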

@stevehipwell
Author

Can you try controlling this with something like Kyverno or some other admission webhook mechanism that would block "normal" users from setting this annotation value on a pod to avoid the privilege escalation?

I can and likely will run OPA Gatekeeper to control this when we get Karpenter into production; but not everyone who runs Karpenter will have any policy capability.

njtran transferred this issue from aws/karpenter-provider-aws Nov 2, 2023
@ssoriche

ssoriche commented Nov 2, 2023

I've been reading through the design document referenced above (#516) and read that to mean the controls for when a pod is going to be considered for eviction are placed on the NodePool. We have pods that take 20 minutes to start up and checkpoint every 5 minutes after that. The gracePeriod mentioned in this issue, set on the pod, would let us block eviction of the pod long enough for the task to get traction before being considered for consolidation.
Having it at the NodePool level would mean that each workload with a long start-up time and checkpoint interval would need its own NodePool. We experimented with separating our workloads into node pools and saw a more than 50% increase in our cluster costs immediately.

Please forgive me if this is not the right place for the above feedback; I'm willing to move the conversation to the appropriate venue.

@njtran
Contributor

njtran commented Nov 13, 2023

read that to mean the controls for when a pod is going to be considered for eviction are placed on the NodePool

Not necessarily. In the context of this issue, I could imagine the solution changing the value of the annotation to a time.Duration-formatted string, which Karpenter could use with CreationTimestamp to know when it can be deleted. For NodePool-level controls, though, I could see something like a minNodeLifetime being included, which would opt out nodes that aren't old enough for disruption.
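
For example, a pod-level sketch of that shape; the duration-valued annotation is the hypothetical form discussed here, not a value the annotation is documented to accept:

apiVersion: v1
kind: Pod
metadata:
  name: checkpointing-worker   # placeholder name
  annotations:
    # hypothetical value: stop blocking disruption 30m after the pod's CreationTimestamp
    karpenter.sh/do-not-disrupt: "30m"
spec:
  containers:
    - name: worker
      image: busybox   # placeholder image
      command: ["sleep", "1800"]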

We experimented with separating our workloads into node pools and saw a more than 50% increase in our cluster costs immediately.

This seems wrong, or at least unexpected, and it would be good to figure out what's going on. Can you open an issue for this, or let's sync up on Slack? I think another issue would be a better place to track this.

@jonathan-innis
Member

jonathan-innis commented Jan 8, 2024

Copying a conversation over from #926 (comment). Playing around with passing this same grace period down to a node from the NodeClaim, since it's possible to use the karpenter.sh/do-not-disrupt annotation across all NodeClaims/Nodes in v1beta1 🤔

I wonder if there is room in the disruption controls for something like "do not disrupt within first 5m of life"

We have this as an annotation that can get propagated onto the Node from the NodeClaim today. This seems like a really nice place to do #752: push a karpenter.sh/do-not-disrupt annotation that can be configured either at the Pod spec level or at the NodeClaim level (which could be passed down from NodePool spec.template.metadata.annotations), so you could do something like:

spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "10m"

@ellistarn
Contributor

Love this. Applied at both pod and node level? Could make a new annotation alongside the existing one, called disruptAfter

@jonathan-innis
Member

jonathan-innis commented Jan 8, 2024

Could make a new annotation alongside the existing one, called disruptAfter

Any reason you think that it should be a new annotation vs. a value for the existing one?

@ellistarn
Contributor

new annotation

Just conceptual alignment, at the cost of a new, indefinitely deprecated annotation.

A few things to consider:

  1. Should this be configured at the node pool level (in addition to, or exclusively)?
  2. Should we consider making other of our controls (expiry TTL) have node-level annotation overrides? I could see an argument that a user should be able to manually say "shut this down tonight" with an annotation.

It may not make sense to try to align all of these concepts, but worth exploring.

@Bryce-Soghigian
Member

Bryce-Soghigian commented Jan 8, 2024

I like the idea that we can just set do-not-disrupt=5m on the NodePool under annotations and have the do-not-disrupt annotation propagated to all nodes in the NodePool.

If the case is for more temporary actions one can set it on the nodes themselves manually. Disruption control at the node level is a very useful tool to have at your disposal, so forcing it into the nodepool api itself removes a bit of that freedom.

It gives more control on the annotation level vs just sticking somewhere in the api where that control is tied to the nodepool.

One could also argue that having something that controls disruption outside of the disruption controls leads to a fragmented experience: I can't just go to the NodePool disruption controls to understand the full behaviour of disruption (there are other things like PDBs that may block consolidation, but I'm referring to Karpenter settings).

@jonathan-innis
Member

It gives more control on the annotation level vs just sticking somewhere in the api where that control is tied to the nodepool

There's an argument that you could make this field "part of the API" by making it part of the NodeClaim, since that's basically the place where Karpenter does API configuration for the Node. Annotations are (in a way) an extension of the API when you don't have full control over the schema yourself. This makes sense in the context of pods, since we don't control the Pod API, but makes less sense with Nodes, where we already have a representation of the thing (NodeClaim).

To play devil's advocate, the counter-argument to this is that using karpenter.sh/do-not-disrupt in both places means there's a single concept that users can reason about everywhere rather than having fragmented concepts. I don't think there's necessarily a huge burden here if we decided not to go the same route with both API concepts, but food for thought.

@ellistarn
Contributor

We reconcile anything in the template as drift, though I believe we ignore any mutations of the underlying nodeclaim / node. Does it make sense to implement it at the NodePool level as a first class API, and then also as an annotation that can be manually applied to pods and nodes?

@Bryce-Soghigian
Member

Bryce-Soghigian commented Jan 9, 2024

When discussing the implementation at the node pool level, it's important to note that we already have some similar concepts in disruption controls. We have consolidateAfter and expireAfter, so disruptAfter seems to fit right into the group.

spec: # This is not a complete NodePool Spec.
  disruption:
    consolidationPolicy: WhenUnderutilized || WhenEmpty
    consolidateAfter: 10m || Never # metav1.Duration
    # add disruptAfter: 10m || Never
    expireAfter: 10m || Never # Equivalent to v1alpha5 TTLSecondsUntilExpired

To play devil's advocate, the counter-argument to this is that using karpenter.sh/do-not-disrupt in both places means there's a single concept that users can reason about everywhere rather than having fragmented concepts.

I like the idea of both; if not for drift, we could potentially control disruption through the whole NodePool like so.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      # Annotations are arbitrary key-values that are applied to all nodes
      annotations:
        karpenter.sh/consolidateAfter: "10m"
        karpenter.sh/disruptAfter: "10m"

ConsolidateAfter, expireAfter, and potentially disruptAfter should all be annotations. It still makes sense to have this concept of disruption controls under the CRD for controlling disruption, alongside the annotations. Maybe it's a bit redundant, but that's OK; let users pick what they want to use if it's not too high of a burden.

@jonathan-innis
Member

ConsolidateAfter, expireAfter, and potentially disruptAfter should all be annotations

I'm not sure I love the idea of spreading out the API like this. I definitely think that there's benefit to having karpenter.sh/do-not-disrupt for nodes for ad-hoc debugging; that was the reason the feature was originally added in the first place. If we added support for the duration value on pods, we may just be stuck implementing it on the Node as well for parity, and you'd get the feature for free if you pass the annotations down like above.

From the perspective of defining minNodeLifetime at the NodePool level, I could see the argument for adding this into the NodePool so that it doesn't drift everything. We're actually having a similar discussion in #834 where there's a bit of a back-and-forth around whether we should avoid drifting on certain properties of the NodePool.

Similar to the terminationGracePeriod feature that we are talking about, this is another field that I could see living on the NodeClaim and being templated down from the NodePool. minLifetime really has a meaning in the context of stand-alone NodeClaims (really maxLifetime does as well, but we currently have this as expiry in the NodePool disruption block). We may be stuck with expiry in the disruption block at this point, most importantly because expiration has to be reasoned about for the disruption budgets.

It does raise a question though: are there values on NodeClaims that can also be "control" mechanisms that don't cause drift on Nodes but are updated in place, similar to the disruption block in the NodePool?

jonathan-innis changed the title from "Add a gracePeriod for the do-not-evict pod annotation" to "Add a gracePeriod for the do-not-disrupt pod annotation" Feb 19, 2024
@jcmcken

jcmcken commented May 13, 2024

I'm relatively new to Karpenter so I may be mistaken here, but from my experience so far it feels like there are at least two kinds of disruptions that should be supported:

  1. Disruptions by cluster administrators, e.g. scheduled maintenance events. Nodes NEED to be terminated and re-provisioned to accommodate security or other cluster updates. In this "mode", Karpenter should simply ignore do-not-disrupt annotations entirely -- workloads should not be able to block disruption. Cluster administrators are not frequently executing disruptions in this mode (e.g. once a month, once a quarter -- whatever the update policy is in the organization). But in any case, the disruption should be immediate (as if do-not-disrupt were not present) and not require any grace period (although I suppose it doesn't hurt to have one). Without this, cluster maintenance is frankly just annoying -- the cluster admin has to find all the annotated workloads and remove the annotation or else kill nodes by hand.
  2. Disruptions by Karpenter itself, e.g. consolidation / cost-saving replacements, and so on. Karpenter should respect do-not-disrupt in all cases in this mode. In these cases, the workload owner has chosen that they don't care about saving money (or other considerations) and want their workload to run to completion on the existing infrastructure (e.g. a long-running ML job).

Right now, it seems like mode 1 is not a possibility. A workload owner can trivially and indefinitely block a node from de-provisioning, which is a strange power for a workload owner to have. Adding OPA rules to prevent labeling workloads is also not a solution since it's completely valid for workload owners to exercise this power under normal circumstances. Just not when a cluster admin is trying to perform maintenance.

Additionally, having a single TTL / ignore-disruptions setting for both of these cases doesn't really seem to make sense. If I have a disruption TTL for cluster maintenance events, I strictly do not want it to apply to consolidation events. These are completely separate concerns. Consolidation events can have their own TTL, but I don't think it makes sense to have a single, global setting.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Aug 11, 2024
@njtran
Contributor

njtran commented Aug 12, 2024

/lifecycle frozen

k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label Aug 12, 2024
@njtran
Contributor

njtran commented Aug 12, 2024

@jcmcken you should look to terminationGracePeriod, which should solve your use-case for case 1: https://github.com/kubernetes-sigs/karpenter/pull/916/files
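
A partial sketch of how that could look, assuming the NodeClaim-level terminationGracePeriod from that PR is templated through the NodePool's spec.template.spec (the 48h value is just an example):

spec:                 # NodePool (partial)
  template:
    spec:
      # upper bound on how long Karpenter waits for pods (including
      # do-not-disrupt pods) to drain before the node is forcibly terminated
      terminationGracePeriod: 48h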
