Add a gracePeriod for the do-not-disrupt pod annotation #752
Comments
Does annotating the pods of your long-running jobs with do-not-evict work for this?
@ellistarn wouldn't this allow more jobs to schedule onto the node and basically make it unterminatable?
The node will still expire and cordon, but it won't start draining until the pods with those annotations are done. @njtran can you confirm?
Yep! If you have a pod with a do-not-evict annotation, the node won't be drained while that pod is still running. Does this fit your use case?
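For concreteness, a minimal sketch of the annotation being described, assuming the v1alpha5-era karpenter.sh/do-not-evict annotation; the pod name, image, and command are placeholders:

```yaml
# Sketch: a pod annotated so Karpenter will not drain the node it runs on
# while this pod is still running. Name, image, and command are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: long-running-ci-job
  annotations:
    karpenter.sh/do-not-evict: "true"
spec:
  restartPolicy: Never
  containers:
    - name: worker
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
```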
@njtran this fits my use case happy path perfectly, but it opens up a separate concern that the do-not-evict annotation could block node termination indefinitely.
Sorry for the delay! You're correct in that if a do-not-evict pod never finishes, the node will never be drained.
Could you expand on what you mean here? What's the use-case for having an additional grace period for a do-not-evict pod?
@njtran fundamentally it comes down to reducing the blast radius. If anyone who can start a container can keep a node running, that starts to look like a potential problem. I'd like to know that after a given time, no matter what a user has done, a node will terminate and be replaced.
Is there a difference between this and ttlSecondsUntilExpired?
Even if it would violate a pod disruption budget?
@ellistarn this would terminate the node even if there was a pod with a do-not-evict annotation on it.
@tzneal I think that's a separate concern, but pretty important. A TTL which guarantees the node is terminated would make sense; having pods stuck due to user error and blocking node maintenance is a common problem when running a platform.
Renamed this issue to more accurately reflect the discussion above.
@njtran Curious if this is still expected behaviour. When enabling consolidation, I experienced Nodes being cordoned but not drained. I'm wondering if this annotation on a Pod is possibly what's happening.
+1, a TTL would be helpful.
@alekhrycaiko sorry for the late response here; you should be able to see the kube-events that show if a node was unable to be deprovisioned due to a blocking pod. And @stevehipwell sorry for the late response! This makes sense to me, as it essentially guarantees (involuntary disruptions aside) that a workload will be up for at least its intended duration. This would require some design, and it would be a great item for anyone to take up if they want more experience in the code.
Thinking about it more, is there a reason you can't rely on the job completing? If a pod is terminal, we ignore the do-not-evict annotation as well. So is there something other than jobs that you want a grace period for?
@njtran this functionality request is to taint a node so no further pods are scheduled on it, allowing either the running pods to finish or the expired TTL to trigger the pod(s) to be killed. So a use case would be to have a grace period, the difference between the taint TTL and the expired TTL, to support clean node terminations when pod age is within a given range lower than the grace period; for example, a CI/CD system.
Ah, I think I fully understand now. So this TTLSecondsUntilTainted (or maybe better named TTLSecondsUntilCordoned) would act as a programmatic schedulable window for all nodes. If you combine this with ttlSecondsUntilExpired, the node would stop accepting new pods some time before it expires. Since the only disruption mechanism that is driven by time is expiration, this TTLUntilCordoned seems more likely to make it harder for Consolidation to help cost, and without any enabled disruption mechanisms this could result in a lot of unused capacity. I think this would be better driven at the application level.
@njtran for the use case I'm referring to, the pods are explicitly not jobs and are expensive to re-run, for example a CI/CD system running E2E tests where terminating the pod would have a high cost. The desired behaviour would be for Karpenter to be able to cordon a node to stop any new pods being scheduled, while allowing existing pods to either complete successfully or time out before the node is replaced. It might be best to drive this as a pod-level label rather than a node-level setting.
My apologies on the length of time on this discussion, I've lost track of this issue many times 😞. If the pods are explicitly not jobs, where terminating the pod would have a high cost, how do you coordinate when the pod is okay to be evicted? Is there some special semantic for this CI/CD pod that signifies completion to another controller in your cluster that isn't represented by a kubernetes-native construct? It seems odd to me to create a node-level time-based restriction on scheduling that is actually motivated by a pod-level time-based restriction. While you can get into a situation where a node is not optimally utilized from unluckiness with how these pods get rescheduled, if you're eventually expecting the node to be expired, the node should still be utilized since Karpenter doesn't cordon it. Once no pods have scheduled to it, Karpenter could execute expiration. But you're right that if these pods keep landing on the node, expiration could be blocked indefinitely. In my eyes, given the tight correlation with Consolidation and Expiration here, I'm not convinced that this is something that makes sense to include in the API. I think this would better be modeled as #622. Thoughts @stevehipwell ?
I think that #622 would probably solve enough of this use case to be viable. Architecturally speaking though, I'm not 100% convinced that a pod (and therefore potentially a low-permissioned user) should be able to impact and effectively own a node's lifecycle, as is the case currently and would remain even if #622 is implemented; this looks like a privilege escalation to me. By providing a mechanism for the cluster admin to set a max node TTL after which its lifecycle is ended regardless, this issue is addressed.
Can you try controlling this with something like Kyverno or some other admission webhook mechanism that would block "normal" users from setting this annotation value on a pod to avoid the privilege escalation?
I can and likely will run OPA Gatekeeper to control this when we get Karpenter into production, but not everyone who runs Karpenter will have any policy capability.
I've been reading through the design document referenced above (#516) and read it to mean that the controls for when a pod is going to be considered for eviction are placed on the NodePool. We have pods that take 20 minutes to start up and checkpoint every 5 minutes after that. The gracePeriod mentioned in this issue, on the pod, would allow us to block eviction of the pod for a significant time so the task can get traction before being considered for consolidation. Please forgive me if this is not the right place for the above feedback; I'm willing to move the conversation to the appropriate venue.
Not necessarily; in the context of this issue, I could imagine the solution changing the value of the annotation to a duration.
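To illustrate the shape being floated here (a sketch only; today the annotation takes a boolean-style value, so the duration form below is hypothetical):

```yaml
# Hypothetical: the do-not-disrupt annotation carrying a duration instead of
# "true", meaning "don't voluntarily disrupt this pod for up to 30 minutes".
apiVersion: v1
kind: Pod
metadata:
  name: example-workload                 # placeholder
  annotations:
    karpenter.sh/do-not-disrupt: "30m"   # hypothetical duration value
spec:
  containers:
    - name: app
      image: busybox                     # placeholder
```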
This seems wrong, or at least unexpected. It would be good to figure out if there's something wrong here. Can you open an issue for this? Or let's sync up on Slack. I think another issue would be a better place to track this.
Copying a conversation over from #926 (comment). Playing around with passing this same grace period down to a node from the NodeClaim, since it's possible to use the do-not-disrupt annotation on nodes as well.
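A sketch of the node-level half of that idea (the node name is a placeholder, and the duration value is again hypothetical):

```yaml
# Sketch: the same annotation applied to a Node (e.g. propagated down from the
# NodeClaim), protecting the whole node from voluntary disruption for a window.
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-0-1.example.internal     # placeholder
  annotations:
    karpenter.sh/do-not-disrupt: "2h"    # hypothetical duration value
```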
Love this. Applied at both pod and node level? Could make a new annotation alongside the existing one, called disruptAfter.
Any reason you think that it should be a new annotation vs. a value for the existing one?
Just conceptual alignment, at the cost of a new indefinitely deprecated annotation. A few things to consider:
It may not make sense to try to align all of these concepts, but worth exploring.
I like the idea that we can just set this via the annotation. If the case is for more temporary actions, one can set it on the nodes themselves manually. Disruption control at the node level is a very useful tool to have at your disposal, so forcing it into the nodepool API itself removes a bit of that freedom. It gives more control at the annotation level vs just sticking it somewhere in the API where that control is tied to the nodepool. One could also argue that having something that controls disruption outside of the disruption controls leads to a fragmented experience: I can't just go to the nodepool disruption controls to understand the full behavior of disruption (there are other things like PDBs etc. that may block consolidation, but I am referring to Karpenter settings).
There's an argument that you could make this field "part of the API" by making it part of the NodeClaim, since that's basically the place where Karpenter is doing API configuration for the Node. Annotations (in a way) are an extension of the API when you don't have full control over the schema yourself. This makes sense in the context of pods, since we don't control the Pod API, but makes less sense with Nodes where we do have a representation of the thing already (NodeClaim). To play devil's advocate, the counter-argument to this is that using the annotation keeps parity with the pod-level experience and can be applied to individual nodes ad hoc.
We reconcile anything in the template as drift, though I believe we ignore any mutations of the underlying nodeclaim / node. Does it make sense to implement it at the NodePool level as a first-class API, and then also as an annotation that can be manually applied to pods and nodes?
When discussing the implementation at the nodepool level, it's important to note that we already have some similar concepts in the disruption controls. We already have consolidateAfter and expireAfter, so disruptAfter seems to fit right into the group.
I like the idea of both; if not for drift, we could potentially control disruption through the whole nodepool, like the sketch below.
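A rough sketch of that shape (consolidateAfter and expireAfter exist in the v1beta1 disruption block today; disruptAfter is the hypothetical addition being discussed):

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s   # existing: wait 30s after a node goes empty before consolidating
    expireAfter: 720h       # existing: replace nodes after 30 days
    disruptAfter: 1h        # hypothetical: block voluntary disruption for the first hour of a node's life
  # template (requirements, nodeClassRef, ...) omitted for brevity
```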
ConsolidateAfter, expireAfter, and potentially disruptAfter should all be annotations. It still makes sense to have this concept of disruption controls under the CRD for controlling disruption alongside the annotations. Maybe it's a bit redundant, but that's OK; let users pick what they want to use if it's not too high a burden.
I'm not sure I love the idea of spreading out the API like this; I definitely think there's benefit to having a single place to configure it. From the perspective of defining minNodeLifetime at the NodePool level, I could see the argument for adding this into the NodePool so that it doesn't drift everything. We're actually having a similar discussion in #834, where there's a bit of a back-and-forth around whether we should avoid drifting on certain properties of the NodePool. Similar to the discussion there, it does raise a question though: are there values on NodeClaims that can also be "control" mechanisms that don't cause drift on Nodes but are updated in-place, similar to the disruption block in the NodePool?
Changed the title: Add a gracePeriod for the do-not-evict pod annotation → Add a gracePeriod for the do-not-disrupt pod annotation
I'm relatively new to Karpenter so I may be mistaken here, but from my experience so far it feels like there are at least two kinds of disruptions that should be supported:
1. Disruptions initiated by cluster admins (e.g. node expiry or maintenance), which workload owners should not be able to block indefinitely.
2. Cost-driven disruptions such as consolidation, which workload owners may reasonably want to block.
Right now, it seems like mode 1 is not a possibility. A workload owner can trivially and indefinitely block a node from de-provisioning, which is a strange power for a workload owner to have. Adding OPA rules to prevent annotating workloads is also not a solution, since it's completely valid for workload owners to exercise this power under normal circumstances, just not when a cluster admin is trying to perform maintenance. Additionally, having a single TTL / ignore-disruptions setting for both of these cases doesn't really seem to make sense. If I have a disruption TTL for cluster maintenance events, I strictly do not want this to apply to consolidation events. These are completely separate concerns. Consolidation events can have their own TTL, but I don't think it makes sense to have a single, global setting.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/lifecycle frozen
@jcmcken you should look at terminationGracePeriod, which should solve your use-case for case 1: https://github.com/kubernetes-sigs/karpenter/pull/916/files
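For readers landing here later, a sketch of where that field sits, assuming the NodeClaim-level terminationGracePeriod from the linked PR (other required NodePool fields are omitted):

```yaml
# Sketch: terminationGracePeriod on the NodeClaim template gives an upper bound
# after which a draining node is terminated, even if do-not-disrupt pods or
# PDBs are still blocking eviction.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      terminationGracePeriod: 48h
  # other template fields (requirements, nodeClassRef) and disruption settings
  # omitted for brevity
```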
Tell us about your request
I'd like the ability to taint a node after a given period via ttlSecondsUntilTainted, to allow running pods to finish before the node is terminated. This should still respect the ttlSecondsUntilExpired and ttlSecondsAfterEmpty.
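A sketch of how this might sit on a v1alpha5 Provisioner (ttlSecondsAfterEmpty and ttlSecondsUntilExpired are existing fields; ttlSecondsUntilTainted is the field being requested, and the values are placeholders):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: ci-workers                # placeholder
spec:
  ttlSecondsAfterEmpty: 60        # existing: remove empty nodes after 60s
  ttlSecondsUntilExpired: 86400   # existing: replace nodes after 24h
  ttlSecondsUntilTainted: 82800   # requested: stop scheduling new pods ~1h before expiry
```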
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
When running jobs such as CI/CD pipelines in K8s there are long-running jobs which shouldn't be terminated due to the high cost of re-running them. By adding ttlSecondsUntilTainted we can have nodes that expire and are replaced without the cost of killing potentially long-running jobs.
Are you currently working around this issue?
Not using Karpenter.
Additional context
n/a
Attachments
n/a