Support Startup Taints #628

Closed
rustrial opened this issue Aug 20, 2021 · 38 comments · Fixed by #1727
Labels
feature (New feature or request) · kep (this feature requires a KEP)

Comments

@rustrial

rustrial commented Aug 20, 2021

Tell us about your request

As of today, if taints are defined in spec.Constraints.taints, the provisioner will not provision nodes for pods that do not have matching tolerations.

Taints will be applied to every node launched by the Provisioner. If specified, the provisioner will not provision nodes for pods that do not have matching tolerations.

However, some cluster add-ons (usually CNI plugins) use temporary taints on new nodes to make sure no Pods are scheduled onto those nodes until the CNI DaemonSet is ready on that node. Usually, those temporary node startup taints are then removed by the DaemonSet or by a corresponding controller. A prominent example is Cilium, which uses the node.cilium.io/agent-not-ready taint.

Adding a toleration to all Pods for such temporary node startup taints would of course defeat their purpose and might lead to dysfunctional Pods, as they would be scheduled and started before the CNI is properly set up on that node. With its current capabilities, karpenter is not powerful enough to support such a setup.

What do you want us to build?

Add something like spec.Constraints.provisioningTaints, which contains taints that must be applied to newly created nodes, but might later be removed by workloads and are ignored by the provisioner for scheduling purposes (see the sketch after the list below). Specifically, karpenter should support the following behaviour:

  • provisioner adds all provisioningTaints to newly created nodes and then forgets about them (does not re-add them if they are later removed by a CNI controller).
  • provisioner waits until all provisioningTaints are removed from a node before it schedules any Pods onto that node.
  • but otherwise provisioner ignores provisioningTaints when making scheduling decisions (i.e. Pods do not need a matching toleration).
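
For illustration, the field could mirror the shape of the existing spec.taints, along the lines of the sketch below (provisioningTaints is only a placeholder name, not an existing field):

spec:
  provisioningTaints:
    # applied at node creation, ignored for scheduling, expected to be removed later by the CNI
    - key: node.cilium.io/agent-not-ready
      value: "true"
      effect: NoSchedule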

Of course, the proposed solution is just a (viable) placeholder and maybe somebody has a better idea to resolve this problem, so feel free to propose other solutions.

Are you currently working around this issue?

Not using karpenter at all :-(

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@rustrial added the feature (New feature or request) label Aug 20, 2021
@ellistarn
Contributor

ellistarn commented Aug 20, 2021

Hey @rustrial, this is a very interesting use case. Thanks for writing it up! We've recently been rethinking the provisioner.spec.taints functionality. Specifically, we're thinking about generating taints in response to tolerations, instead of doing nothing. The primary reason for this is to enable use cases where pods wish to run in isolation and would define custom taints to do so. Presently, this would require creating additional provisioners and putting the taints there, which is a bit awkward and high touch.

For your use case, I think the problem is already solved, regardless of our behavior w.r.t taints. This behavior currently works with the AWS VPC CNI and Calico, though I haven't tested Cilium. Let me walk you through the provisioning process:

  • User creates a typical workload pod PodA (e.g. pause) from a deployment scale out, etc
  • Karpenter sees the pod watch event and begins the provisioning process
  • Karpenter computes the resource requirements of daemonsets + PodA and launches a node
  • Karpenter immediately creates a node object and binds the pod to the node, before the instance is online
  • KCM Node Lifecycle Controller taints the node with NotReady: NoExecute/NoSchedule
  • KCM Daemonset Lifecycle Controller creates a CNI Pod (cilium/awsvpc/calico/etc) and binds it to the node
  • The node connects to the API Server, sees PodA and the CNI Pod, and pulls the images
  • PodA does not execute because of the NotReady: NoExecute taint on the node
  • CNI Pod does execute because it tolerates NotReady: NoExecute
  • CNI Pod does node initialization logic and removes the NotReady taint from the node
  • PodA executes because the node no longer has the NoExecute taint

It's a bit complicated, but the key bit is the NoExecute taint created by the NodeLifecycleController. It's designed to be removed by some daemon pod (typically a CNI), which will cause other pods to execute. For any pods that don't rely on the CNI, you can always tolerate this taint.
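
For example, a pod that doesn't depend on the CNI could carry something like the following tolerations (illustrative pod spec fragment, not something Karpenter adds for you):

tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute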

It's not clear to me why cilium doesn't use this taint, and instead relies on its own "node.cilium.io/agent-not-ready". I'm curious if it already does this. There may be a really good reason for this, in which case, I like your idea about applying taints at provisioning time. I wonder if we could simplify how this is modeled and use the NoExecute semantic.

spec:
  taints:
    - key: node.cilium.io/agent-not-ready
      effect: NoExecute

Karpenter will apply this taint to all nodes for this provisioner, but it won't be used in scheduling calculations. I think this is roughly the semantic you want.

Thoughts?

@ellistarn
Contributor

ellistarn commented Aug 20, 2021

Checking back in. I played with

spec:
  taints:
    - key: node.cilium.io/agent-not-ready
      effect: NoExecute

It looks like the podlifecyclecontroller will evict pods immediately if a NoExecute taint exists on the node (unless they tolerate this taint, which by default they don't), so this path isn't going to work unless we fundamentally change up our binding behavior, which I'm not excited about. I'm still curious if Cilium supports the node readiness workflow mentioned above.

If not, we may need to consider something like readinessTaints as you're suggesting.

@rustrial
Author

@ellistarn Thanks for your feedback.

KCM Node Lifecycle Controller taints the node with NotReady: NoExecute/NoSchedule

Can you please provide some pointers to the documentation of the NotReady taint you mentioned? I am not aware of such a taint and would like to read up on this to figure out whether we can get this working for Cilium. I am only aware of the common taints mentioned in https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-nodes-by-condition and those taints are not treated specially by EKS.


It looks like the podlifecyclecontroller will evict pods immediately if a NoExecute taint exists on the node (unless they tolerate this taint, which by default they don't), so this path isn't going to work unless we fundamentally change up our binding behavior, which I'm not excited about.

I think I have an idea for a minimal change to the current Binding behaviour and already started to draft a PR, just to figure out whether this might work (at all).


If not, we may need to consider something like readinessTaints as you're suggesting.

I like the term readinessTaints, it's more descriptive than my placeholder name provisioningTaints.

@ellistarn
Contributor

ellistarn commented Aug 22, 2021

Here's an example node:

apiVersion: v1
kind: Node
metadata:
  creationTimestamp: "2021-08-22T16:45:01Z"
  finalizers:
  - karpenter.sh/termination
  labels:
    karpenter.sh/provisioner-name: default
  name: ip-192-168-58-236.us-west-2.compute.internal
spec:
  providerID: aws:///us-west-2d/i-08429eea8403fbe47
  taints:
  - effect: NoSchedule
    key: karpenter.sh/not-ready
  - effect: NoSchedule
    key: node.kubernetes.io/unreachable
    timeAdded: "2021-08-22T16:45:01Z"
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    timeAdded: "2021-08-22T16:45:01Z"
status:
  allocatable:
    cpu: "8"
    memory: 30310Mi
    pods: "58"
  conditions:
  - lastHeartbeatTime: "2021-08-22T16:45:01Z"
    lastTransitionTime: "2021-08-22T16:45:01Z"
    message: Kubelet never posted node status.
    reason: NodeStatusNeverUpdated
    status: Unknown
    type: Ready
  - lastHeartbeatTime: "2021-08-22T16:45:01Z"
    lastTransitionTime: "2021-08-22T16:45:01Z"
    message: Kubelet never posted node status.
    reason: NodeStatusNeverUpdated
    status: Unknown
    type: MemoryPressure
  - lastHeartbeatTime: "2021-08-22T16:45:01Z"
    lastTransitionTime: "2021-08-22T16:45:01Z"
    message: Kubelet never posted node status.
    reason: NodeStatusNeverUpdated
    status: Unknown
    type: DiskPressure
  - lastHeartbeatTime: "2021-08-22T16:45:01Z"
    lastTransitionTime: "2021-08-22T16:45:01Z"
    message: Kubelet never posted node status.
    reason: NodeStatusNeverUpdated
    status: Unknown
    type: PIDPressure

Until this condition is marked ready,

  - lastHeartbeatTime: "2021-08-22T16:45:01Z"
    lastTransitionTime: "2021-08-22T16:45:01Z"
    message: Kubelet never posted node status.
    reason: NodeStatusNeverUpdated
    status: Unknown
    type: Ready

these taints will be applied by the node lifecycle controller

  - effect: NoSchedule
    key: node.kubernetes.io/unreachable
    timeAdded: "2021-08-22T16:45:01Z"
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    timeAdded: "2021-08-22T16:45:01Z"

See: https://github.com/kubernetes/kubernetes/blob/release-1.17/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L87.

@ellistarn
Contributor

ellistarn commented Aug 22, 2021

I think I have an idea for a minimal change to the current Binding behaviour and already started to draft a PR, just to figure out whether this might work (at all).

Do you mind sharing the idea? I might be able to help.

@ellistarn
Contributor

Did a little digging. The kubelet updates the runtime state for the network here:
https://github.com/kubernetes/kubernetes/blob/3a26b864f4e17a6b7e0e673f583a551820a25b36/pkg/kubelet/kubelet.go#L2334.

This then prevents the node status condition type: Ready from becoming true. So the CNI is required to make this condition true, which will in turn cause the taints to be removed.
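
For reference, until the CNI brings networking up, the kubelet typically reports the node's Ready condition roughly like this (exact wording varies by runtime and kubelet version):

conditions:
  - type: Ready
    status: "False"
    reason: KubeletNotReady
    message: 'container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:cni plugin not initialized'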

@rustrial
Author

rustrial commented Aug 23, 2021

@ellistarn I see, you are talking about https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions and specifically about the following taints, which are set by the "node controller" based on the value of the node's Ready condition:

  • node.kubernetes.io/not-ready (if Ready == False )
  • node.kubernetes.io/unreachable (if Ready == Unknown)
  • no taint (if Ready == True)

I also understand that the node's Ready condition depends on the runtime's NetworkReady condition. However, I still don't see how a CNI plugin like Cilium could influence (set) the node's NetworkReady condition. A short search on the Cilium code base reveals no reference to "NetworkReady" and the same is true for AWS CNI plugin. Are we confusing node and pod networking status (conditions) here? A node might have well working and ready networking (e.g. AWS VPC ENI attached ...) while the CNI plugin, which is responsible for pod networking on that node (and not for the node's own networking needs), is not yet ready.


I think I have an idea for a minimal change to the current Binding behaviour and already started to draft a PR, just to figure out whether this might work (at all).

Do you mind sharing the idea? I might be able to help.

Sure, see #631

@ellistarn
Contributor

Hey @rustrial, I put some comments on the PR. What happens if you simply don't use this taint w/ Cilium? IIUC, it's to prevent a race condition when two CNIs are installed.

@bwagner5 added this to the v0.4.0 milestone Aug 23, 2021
@bwagner5 changed the title from "Support extra taints not considered when scheduling Pods" to "Support Cilium" Aug 23, 2021
@rustrial
Author

@ellistarn Actually, when not using the taint, we observe that occasionally some Pods are not properly managed by Cilium, but this could be caused by another race condition in Cilium and is not necessarily related to that taint. I am also curious what conclusion we get from the Slack thread you started on the Cilium #kubernetes channel; hopefully it will shed some light on this.

@ellistarn
Contributor

Actually, when not using the taint, we observe that occasionally some Pods are not properly managed by Cilium, but this could be caused by another race condition in Cilium and is not necessarily related to that taint.

Do you have any links to this?

@rustrial
Author

Do you have any links to this?

@ellistarn No, I have no link. I just assumed/hoped that this problem would be addressed by Cilium's node.cilium.io/agent-not-ready taint. I propose we focus on the Slack thread for now, trying to figure out when & why that taint is needed and whether we can live without it.

That said, I am convinced that even if it turns out that the Cilium taint is not strictly needed, sooner or later this request will come up again. There will always be another cluster add-on that needs more elaborate control over node readiness and thus will not work (well) with karpenter (as it is today).

@ellistarn
Contributor

I just discovered a new type of Taint Effect

	// NOT YET IMPLEMENTED. TODO: Uncomment field once it is implemented.
	// Like TaintEffectNoSchedule, but additionally do not allow pods submitted to
	// Kubelet without going through the scheduler to start.
	// Enforced by Kubelet and the scheduler.
	// TaintEffectNoScheduleNoAdmit TaintEffect = "NoScheduleNoAdmit"

If this is what I think it is, it means that if we bind the pods to the kubelet, the kubelet would not start the pod until the taints are removed. I think this is the exact behavior that we want.

@rustrial
Author

Looks promising (in theory), but I did not find any KEP or other detailed description of that Taint Effect. All I found are some comments on issues, the golang types, and this design document. None of these sources make it clear what exactly the behaviour of this effect will be, and it also looks like the effort is abandoned (it might not be implemented at all).

Need to do some more research, but I guess we have to ask around in the k8s community and find someone who can shed some light on this.

  • is this still on the roadmap and what is the timeline?
  • what is the exact behaviour? (i.e. can we bind a Pod to a node, and it will not be rejected but only gets started once the taint is removed?)
  • ...

@ellistarn
Contributor

ellistarn commented Aug 27, 2021

I might be interested in picking up this KEP and driving it. It provides exactly the semantic we need without polluting Karpenter's API or requiring changes to the optimistic scheduling logic. There are a ton of great benefits of pre-binding pods, such as preloading configmaps and images. I want to make sure we can keep those benefits while also supporting the ordered execution.

@bwagner5 changed the title from "Support Cilium" to "Support Startup Taints" Sep 23, 2021
@bwagner5 modified the milestones: v0.4.0, v0.5.0 Sep 23, 2021
@stevehipwell
Contributor

I think the CriticalAddonsOnly taint that gets added to EKS nodes (I have no idea by what or where it's documented) and is present as a toleration in certain AWS charts/manifests might be another edge case that needs considering. For this I think Karpenter just needs the ability to ignore such tolerations so it doesn't add them as taints to instances it provisions?

@ellistarn
Contributor

I'd need to dig into this CriticalAddonsOnly detail and where it's coming from. What's the problem that you're trying to solve?

@stevehipwell
Contributor

@ellistarn the docs say that tolerations on workloads will cause taints to be added? This toleration is quite widespread; take a look at the EBS CSI driver chart, for example.
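
For context, the toleration those charts ship looks roughly like this (exact shape may differ per chart):

tolerations:
  - key: CriticalAddonsOnly
    operator: Exists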

@ellistarn
Contributor

ellistarn commented Nov 4, 2021

Sorry if I'm being a bit dense, but what's the problem? If there are pending pods that need capacity (ebs csi driver, in this case), Karpenter will generate nodes with taints for those pods. Pods that do not have these tolerations will not be scheduled to these nodes. This ensures that ebs csi driver will only run alongside other workloads that have CriticalAddonsOnly.

Note: The csi driver includes both a daemonset (which karpenter won't schedule) and controller pods (which karpenter will).

@stevehipwell
Contributor

@ellistarn the point is that this toleration is there to allow scheduling, not to drive isolation. It would be undesirable to create a node with this taint for this workload. It's also quite common to tolerate certain taints on workloads that aren't directly targeted at them; these shouldn't create the taint either.

@ellistarn
Contributor

ellistarn commented Nov 4, 2021

I hear ya.

It would be undesirable to create a node with this taint for this workload.

Can you dig into this a little bit? Why don't you want this pod scheduled?

It's also quite common to tolerate certain taints on workloads that aren't directly targeted at them; these shouldn't create the taint either.

If there is existing capacity without taints, the kube scheduler will schedule them without any taint generation. I definitely agree that it's not necessary for karpenter to add the taints for the pod to be scheduled. However, by generating taints, you can ensure isolation via the pod spec, which wouldn't be possible otherwise without creating multiple provisioners.

This creates a tradeoff where nodes are packed less tightly (since pods without those tolerations could have been scheduled together), but enables the pod isolation feature.

I could potentially see a pod annotation like karpenter.sh/generate-taints=true. What do you think? Do you have any preference for polarity, i.e. whether it's enabled or disabled by default?
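
To make the idea concrete, an opted-in pod might look roughly like this (purely hypothetical: neither the annotation nor the example taint key exists today):

metadata:
  annotations:
    karpenter.sh/generate-taints: "true"
spec:
  tolerations:
    - key: example.com/dedicated # custom taint the pod wants generated on its node
      operator: Exists
      effect: NoSchedule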

@ellistarn
Contributor

ellistarn commented Nov 5, 2021

Thanks for the writeup -- I'll think deeply on this. Do you mind porting it to a new issue, since it seems like we're outside of the scope of the original ticket?

@stevehipwell
Contributor

Thank you for moving this for me @ellistarn I had a busy weekend and didn't get a chance.

@mattsre
Contributor

mattsre commented Feb 4, 2022

Just wanted to check on the status of this. Another Cilium user here, same problem as the original issue. Currently we work around this by simply restarting all pods on a node when the CNI pod finishes starting up, but this is obviously a hacky workaround we'd like to avoid keeping.

I noticed you specifically mentioned you don't want to complicate the API with new concepts, and was just wondering whether, in the couple of months since then, any better ideas had come to mind?

@ellistarn
Contributor

ellistarn commented Feb 4, 2022

Is it possible that this could be fixed in the cilium CNI? Karpenter expresses the intent that Pod A should run on node B. Other controllers should reconcile this intent.

@mattsre
Contributor

mattsre commented Feb 4, 2022

I don't think so. It looks like Cilium uses this taint specifically to block DaemonSets from getting scheduled since they tolerate specific taints - https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#taints-and-tolerations
Relevant issue/proposal - cilium/cilium#16602

Do provisioners in Karpenter use different launch templates in AWS? If so, could we expose some way for the user to add custom args to bootstrap.sh ... --kubelet-extra-args? Karpenter internally could merge the args provided by the user with those that Karpenter already sets. This would allow adding taints or custom labels users might want.

If there's a concern about it being too specific to AWS, the functionality could be exposed purely as a way to pass args to the provisioned node's kubelet, and the implementation would be left up to the cloud provider.

@ellistarn
Contributor

ellistarn commented Feb 4, 2022

Users can already add custom taints via provisioner.taints which get merged in through kubelet args. This doesn't solve the problem of cilium failing to reconcile pod networking if it's installed after the pods are. IIUC, the kubelet doesn't even attempt to run the pods until the node is ready, and cilium is responsible for triggering node ready, so I want to dig a bit deeper into what the problem is 🤔

@mattsre
Contributor

mattsre commented Feb 4, 2022

Users can already add custom taints via provisioner.taints which get merged in through kubelet args.

Isn't the problem with this that Karpenter uses these taints for scheduling purposes? As in, the pod would have to tolerate these taints for a node to be provisioned. So we can't add the Cilium not-ready taint here, as pods would then have to tolerate it, essentially making the taint a no-op.

IIUC, the kubelet doesn't even attempt to run the pods until the node is ready, and cilium is responsible for triggering node ready

This might be true, I honestly haven't tested this

@ellistarn
Contributor

ellistarn commented Feb 4, 2022

Isn't the problem with this that Karpenter uses these taints for scheduling purposes? As in, the pod would have to tolerate these taints for a node to be provisioned. So we can't add the Cilium not-ready taint here, as pods would then have to tolerate it, essentially making the taint a no-op.

D'oh, yes. I lost the thread 🤦. I remember from some months ago that a user was running Cilium without these taints and things were working for them 🤔 It was a while ago though, so I might be misremembering.

@mKeRix
Contributor

mKeRix commented Feb 22, 2022

I also stumbled upon this problem when using Cilium (in aws-cni chaining mode) yesterday. As a workaround, I ended up utilizing Kyverno to set the node taint with a mutation policy. I chose to set the taint for all nodes, but in theory you could also write a selector that will only mutate nodes with the karpenter.sh/provisioner-name label. Note that for this to work you will need to remove [Node,*,*] from the resourceFilters in the default Kyverno config. My policy looks like this:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-cilium-readiness-taint
spec:
  rules:
  - name: add-cilium-readiness-taint
    match:
      resources:
        kinds:
        - Node
    preconditions:
      any:
      - key: '{{ request.operation }}'
        operator: Equals
        value: CREATE
    mutate:
      patchStrategicMerge:
        spec:
          taints:
          - key: 'node.cilium.io/agent-not-ready'
            value: 'true'
            effect: NoSchedule

Maybe this helps someone else, it's working out well for me. :)
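
In case it helps, the resourceFilters change mentioned above lives in Kyverno's ConfigMap (name and namespace depend on how Kyverno was installed); the only requirement is that the [Node,*,*] triple is absent from the filter string, e.g. (default value trimmed for brevity):

data:
  resourceFilters: '[Event,*,*][*,kube-system,*][*,kube-public,*][*,kube-node-lease,*]'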

@praveen-livspace
Contributor

I also stumbled upon this problem when using Cilium (in aws-cni chaining mode) yesterday. As a workaround, I ended up utilizing Kyverno to set the node taint with a mutation policy. [...] Maybe this helps someone else, it's working out well for me. :)

This looks like an awesome hack. Let me try doing this with OPA Gatekeeper.

@ellistarn added the kep (this feature requires a KEP) label Mar 22, 2022
@zswanson

zswanson commented Mar 31, 2022

I think there is a more generalized problem with daemonsets that need to be running, but that do not add or remove taints: the AWS controllers in particular (aws-vpc-cni, aws-ebs-csi-driver and aws-secretstore-csi-provider). None of them add/remove startup taints, but if a pod that uses their features is bound to a karpenter node, there is a race condition on startup that may leave the pod Pending for a long time. We've caught pods in this state multiple times, stuck for 15-20 minutes on a new node, complaining that resources/capabilities the daemonsets provide to the node are unavailable.

@zswanson

zswanson commented Apr 7, 2022

Commenting again on this to add that when we performed an EKS upgrade from 1.21 to 1.22, this became particularly noticeable once we started draining/cordoning nodes en masse to move to 1.22-compatible karpenter AMIs: a lot of pods were stuck in Pending, waiting for vpc-cni IP assignment or for Secret volumes they couldn't mount, because they started on new nodes before the relevant daemonsets were available.

@BryanStenson-okta
Contributor

Why not just provide provisioner.spec.taintsToIgnore which, optionally, contains a list of taints the provisioner ignores when provisioning new nodes?

If a would-be-created node has a taint in the provisioner.spec.taintsToIgnore list, this taint is excluded when considering if the provisioner should create another node. Being on the ignore list, karpenter could ignore this taint when bin-packing the pod to the node too, right?

@BryanStenson-okta
Contributor

@ellistarn curious if you've found anything here. Specifically, if adding a provisioner.spec.taintsToIgnore will make life difficult in other ways.

I'm stuck on a project that will require me to solve this, or swap Karpenter out for CAS. My strong preference is to stick with a "fixed" Karpenter -- and I have some bandwidth to implement this -- if the proposed solution makes sense.

@BryanStenson-okta
Contributor

Would love feedback on this ^^^ proposed solution. It seems very simple (maybe I'm missing something?).

@ellistarn
Contributor

We were chatting about this in standup this morning. We're playing with an idea that would enable us to not bind pods and allow the kube scheduler to do so. These startup taints could either be implemented as custom userdata or through something like taintsToIgnore (I'd probably name this startupTaints).

With just taintsToIgnore and without something where we don't bind, I don't see how this solves the problem.

@BryanStenson-okta
Contributor

I suppose I see "the problem" as Karpenter not adding nodes to the cluster. I believe taintsToIgnore itself solves this, as this will allow a provisioner to add more nodes to the cluster.

Having said that, there's absolutely a combination of taints which will cause lots of node churn -- that's probably not what we want either (#1671 (comment))

I'd like to hear more about ideas on how to delay or skip binding (although, by skipping binding, we lose some of the sugar Karpenter provides).
