Support Startup Taints #628
Comments
Hey @rustrial, this is a very interesting use case. Thanks for writing it up! We've recently been rethinking this behavior. For your use case, I think the problem is already solved, regardless of our behavior w.r.t. taints. This behavior currently works with the AWS VPC CNI and Calico, though I haven't tested Cilium. Let me walk you through the provisioning process:
It's a bit complicated, but the key bit is the NoExecute taint created by the NodeLifecycleController. It's designed to be removed by some daemon pod (typically a CNI), which then allows other pods to start executing. Any pods that don't rely on the CNI can always tolerate this taint. It's not clear to me why Cilium doesn't use this taint and instead relies on its own.
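As a rough illustration of tolerating that taint (a sketch; the pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: cni-independent-pod           # placeholder name
spec:
  tolerations:
    # Tolerate the not-ready taint set by the node lifecycle controller so the
    # pod can start before the CNI reports the node as ready.
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
  containers:
    - name: app
      image: busybox                  # placeholder image
      command: ["sleep", "3600"]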
Karpenter will apply this taint to all nodes for this provisioner, but it won't be used in scheduling calculations. I think this is roughly the semantic you want. Thoughts? |
Checking back in. I played with this a bit.
It looks like the pod lifecycle controller will evict pods immediately if a NoExecute taint exists on the node (unless they tolerate this taint, which by default they don't), so this path isn't going to work unless we fundamentally change our binding behavior, which I'm not excited about. I'm still curious if Cilium supports the node readiness workflow mentioned above. If not, we may need to consider an alternative. |
@ellistarn Thanks for your feedback.
Can you please provide some pointers to the documentation of the node readiness workflow you mentioned?
I think I have an idea for a minimal change to the current Binding behaviour and already started to draft a PR, just to figure out whether this might work (at all).
I like the term |
Here's an example node. Until its Ready condition is marked ready, these taints will be applied by the node lifecycle controller.
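Roughly like this (a sketch; the node name, reason, and message are placeholders):

apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-0-1.us-west-2.compute.internal   # placeholder node name
spec:
  taints:
    # Applied while the node is not yet ready (e.g. CNI not up).
    - key: node.kubernetes.io/not-ready
      effect: NoSchedule
    - key: node.kubernetes.io/not-ready
      effect: NoExecute
status:
  conditions:
    - type: Ready
      status: "False"
      reason: KubeletNotReady
      message: "container runtime network not ready: NetworkReady=false"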
|
Do you mind sharing the idea? I might be able to help. |
Did a little digging. The kubelet updates the runtime state for the network, and this then prevents the node status conditions from reporting Ready until the network is up. |
@ellistarn I see, you are talking about https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions and specifically about the following two taints, which are set by the "node controller" based on the value of the node's Ready condition.
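For reference, the two taints in question, as described in the linked documentation:

# Set by the node controller based on the node's Ready condition:
taints:
  - key: node.kubernetes.io/not-ready      # Ready condition is "False"
    effect: NoExecute
  - key: node.kubernetes.io/unreachable    # Ready condition is "Unknown"
    effect: NoExecute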
I also understand that the node's Ready condition is driven by that network runtime state you described above.
Sure, see #631 |
Hey @rustrial, I put some comments on the PR. What happens if you simply don't use this taint w/ Cilium? IIUC, it's to prevent a race condition when two CNIs are installed. |
@ellistarn Actually, when not using the taint, we observe that there are occasionally some Pods which are not properly managed by Cilium, but this could be caused by another race condition in Cilium and is not necessarily related to that taint. I am also curious what conclusion we get from the Slack thread you started on the Cilium #kubernetes channel; hopefully it will shed some light on this. |
Do you have any links to this? |
@ellistarn No, I have no link. I just assumed/hoped that this problem would be addressed by Cilium's node.cilium.io/agent-not-ready taint. That said, I am convinced that even if it turns out that the Cilium taint is not strictly needed, sooner or later this request will come up again. There will always be another cluster add-on that needs more elaborate control over node readiness and thus will not work (well) with Karpenter as it is today. |
I just discovered a new type of Taint Effect
If this is what I think it is, it means that if we bind the pods to the kubelet, the kubelet would not start the pod until the taints are removed. I think this is the exact behavior that we want. |
Looks promising (in theory), but I did not find any KEP or other detailed description of that taint effect. All I found are some comments on issues, the Go types and this design document. None of these sources make it clear what exactly the behaviour of this effect will be, and it also looks like the effort is abandoned (it might not be implemented at all). Need to do some more research, but I guess we have to ask around in the k8s community and find someone who can shed some light on this.
|
I might be interested in picking up this KEP and driving it. It provides exactly the semantic we need without polluting Karpenter's API or requiring changes to the optimistic scheduling logic. There are a ton of great benefits of pre-binding pods, such as preloading configmaps and images. I want to make sure we can keep those benefits while also supporting the ordered execution. |
I think the |
I'd need to dig into this |
@ellistarn the docs say that tolerations on workloads will cause matching taints to be added to provisioned nodes? Such tolerations are quite widespread - take a look at the EBS CSI driver chart, for example. |
Sorry if I'm being a bit dense, but what's the problem? If there are pending pods that need capacity (the EBS CSI driver, in this case), Karpenter will generate nodes with taints for those pods. Pods that do not have these tolerations will not be scheduled to these nodes. This ensures that the EBS CSI driver will only run alongside other workloads that have the same tolerations. Note: the CSI driver includes both a daemonset (which Karpenter won't schedule) and controller pods (which Karpenter will). |
@ellistarn the point is that this toleration is to allow scheduling not to drive isolation. It would be undesirable to create a node with this taint for this workload. It's also quite common to tolerate certain taints on workloads that aren't directly targeted at them, these wouldn't want to create the taint either. |
I hear ya.
Can you dig into this a little bit? Why don't you want this pod scheduled?
If there is existing capacity without taints, the kube scheduler will schedule them without any taint generation. I definitely agree that it's not necessary for Karpenter to add the taints for the pod to be scheduled. However, by generating taints, you can ensure isolation via the pod spec, which wouldn't be possible otherwise without creating multiple provisioners. This creates a tradeoff where nodes are packed less tightly (since pods without those tolerations could have been scheduled together), but it enables the pod isolation feature. I could potentially see a pod annotation to control this behavior. |
Thanks for the writeup -- I'll think deeply on this. Do you mind porting it to a new issue, since it seems like we're outside the scope of the original ticket? |
Thank you for moving this for me @ellistarn I had a busy weekend and didn't get a chance. |
Just wanted to check on the status of this. Another Cilium user here, same problem as the original issue. Currently we work around this by simply restarting all pods on a node when the CNI pod finishes starting up, but this is obviously a hacky workaround we'd like to avoid keeping. I noticed you specifically mentioned you don't want to complicate the API with new concepts and was just wondering whether, in the couple of months since then, any better ideas had come to mind? |
Is it possible that this could be fixed in the cilium CNI? Karpenter expresses the intent that Pod A should run on node B. Other controllers should reconcile this intent. |
I don't think so. It looks like Cilium uses this taint specifically to block DaemonSets from getting scheduled, since they tolerate specific taints - https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#taints-and-tolerations
Do provisioners in Karpenter use different launch templates in AWS? If so, could we expose some way for the user to add custom args to the kubelet? If there's a concern about it being too specific to AWS, the functionality could be exposed purely as a way to pass args to the provisioned node's kubelet, and the implementation would be left up to the cloud provider. |
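Purely as a sketch of that idea (this is not an existing Karpenter option, and how the config would be wired into node userdata is an assumption): the kubelet can already register a node with taints at startup via its configuration file (or the equivalent --register-with-taints flag):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# The node registers itself with this taint already in place, before any
# CNI pods have started; something would later need to remove it.
registerWithTaints:
  - key: node.cilium.io/agent-not-ready
    value: "true"
    effect: NoSchedule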
Users can already add custom taints via the provisioner's taints field. |
Isn't the problem with this that Karpenter uses these taints for scheduling purposes? As in, the pod would have to tolerate these taints for a node to be provisioned. So we can't add the Cilium taint that way. |
This might be true, I honestly haven't tested this |
D'oh, yes. I lost the thread 🤦 . I remember from some months ago that a user was running Cilium without these taints and things were working for them 🤔 it was a while ago though, so I might be misremembering. |
I also stumbled upon this problem when using Cilium (in aws-cni chaining mode) yesterday. As a workaround, I ended up utilizing Kyverno to set the node taint with a mutation policy. I chose to set the taint for all nodes, but in theory you could also write a selector that only mutates nodes with a specific label (a sketch of that variant follows after this comment).

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-cilium-readiness-taint
spec:
  rules:
    - name: add-cilium-readiness-taint
      match:
        resources:
          kinds:
            - Node
      preconditions:
        any:
          - key: '{{ request.operation }}'
            operator: Equals
            value: CREATE
      mutate:
        patchStrategicMerge:
          spec:
            taints:
              - key: 'node.cilium.io/agent-not-ready'
                value: 'true'
                effect: NoSchedule

Maybe this helps someone else, it's working out well for me. :) |
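For example (a sketch; the label key and value are assumptions about how your Karpenter-provisioned nodes are labeled), the policy's match block could be narrowed like this:

# Restrict the mutation to nodes carrying a Karpenter provisioner label.
match:
  resources:
    kinds:
      - Node
    selector:
      matchLabels:
        karpenter.sh/provisioner-name: default   # assumed label and value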
This looks like an awesome hack. Let me try doing this with OPA Gatekeeper |
I think there is a more generalized problem with daemonsets that need to be running but that do not add or remove taints - the AWS controllers in particular: aws-vpc-cni, aws-ebs-csi-driver and aws-secretstore-csi-provider. None of them add/remove startup taints, but if you have a pod that uses their features and is bound to a Karpenter node, there is a race condition on startup that may leave the pod stuck. |
Commenting again on this to add that when we performed an EKS upgrade from 1.21 to 1.22, this became particularly noticeable: when we started draining/cordoning nodes en masse to upgrade the Karpenter AMIs to 1.22-compatible ones, a lot of pods got stuck. |
Why not just provide a list of startup taints that Karpenter ignores for scheduling? If a would-be-created node has a taint in that list, Karpenter would still apply it but skip it when matching pods to the node. |
@ellistarn curious if you've found anything here, specifically whether an approach along those lines would be acceptable. I'm stuck on a project that will require me to solve this, or swap Karpenter out for CAS. My strong preference is to stick with a "fixed" Karpenter -- and I have some bandwidth to implement this -- if the proposed solution makes sense. |
would love feedback on this ^^^ proposed solution. it seems very simple (maybe I'm missing something?). |
We were chatting about this in standup this morning. We're playing with an idea that would enable us to not bind pods and instead allow the kube scheduler to do so. These startup taints could be implemented either as custom userdata or through a similar mechanism. |
I suppose I see "the problem" as Karpenter not adding nodes to the cluster. Having said that, there's absolutely a combination of taints which will cause lots of node churn -- that's probably not what we want either (#1671 (comment)). I'd like to hear more about ideas on how to delay or skip binding (although, by skipping binding, we lose some of the sugar Karpenter provides). |
Tell us about your request
As of today, if taints are defined in spec.Constraints.taints, the provisioner will not provision nodes for pods that do not have matching tolerations. However, some cluster add-ons (usually CNI plugins) use temporary taints on new nodes to make sure no Pods are scheduled onto those nodes unless the CNI DaemonSet is ready on that node. Usually, those temporary node startup taints are then removed by the DaemonSet or by a corresponding controller. A prominent example is Cilium, which uses the node.cilium.io/agent-not-ready taint.
Adding a toleration to all Pods for such temporary node startup taints would of course void their purpose and might lead to dysfunctional Pods, as they would be scheduled and started before the CNI was properly set up on that node. With its current capabilities, Karpenter is not powerful enough to support such a setup.
What do you want us to build?
Add something like spec.Constraints.provisioningTaints, which contains taints that must be applied to newly created nodes, but which might later be removed by workloads and are ignored by the provisioner for scheduling purposes. Specifically, Karpenter should support the following behaviour:
- Karpenter adds provisioningTaints to newly created nodes and then forgets about them (it does not re-add them if they are later removed by a CNI controller).
- Karpenter waits until all provisioningTaints are removed from a node before it schedules any Pods onto that node.
- Karpenter ignores provisioningTaints when making scheduling decisions (i.e. Pods do not need a matching toleration).
A rough sketch of what this could look like is shown below. Of course, the proposed solution is just a (viable) placeholder, and maybe somebody has a better idea to resolve this problem, so feel free to propose other solutions.
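Purely illustrative sketch (provisioningTaints is the proposed field, not an existing Karpenter API; the apiVersion and exact nesting are approximations):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Proposed: applied to new nodes at creation time, ignored for scheduling
  # purposes, and expected to be removed later by the CNI (or its controller).
  provisioningTaints:
    - key: node.cilium.io/agent-not-ready
      value: 'true'
      effect: NoSchedule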
Are you currently working around this issue?
Not using karpenter at all :-(