
Karpenter adds hardcoded Node Name to Pod Spec for scheduling. #1000

Closed
eminalemdar opened this issue Dec 15, 2021 · 22 comments
Labels
question (Further information is requested), stale

Comments

@eminalemdar

Version

Karpenter: v0.5.2

Kubernetes: v1.21

Expected Behavior

Karpenter scales nodes and schedules pods onto them without depending on the kube-scheduler. This reduces the latency of the pod scheduling process. At the same time, there should be reasonable logic behind the scenes.

Actual Behavior

After scaling the nodes, Karpenter hardcodes the node's name into the pod's spec. When you delete the node manually, the pod enters the Pending state, and when a new node comes in, Karpenter adds the new node's name to the pending pod's spec and schedules the pod.
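
For illustration, the observable effect is a pod whose spec.nodeName is already filled in, roughly like the sketch below (the pod and node names are made-up examples):

  apiVersion: v1
  kind: Pod
  metadata:
    name: my-app
    namespace: default
  spec:
    # Normally left empty so the kube-scheduler picks a node; here it
    # already points at the node Karpenter launched.
    nodeName: ip-192-168-12-34.eu-west-1.compute.internal
    containers:
      - name: app
        image: nginx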

Question

This approach doesn't seem like a best practice. Could there be another way to schedule the pods onto newly launched nodes?

eminalemdar added the bug (Something isn't working) label on Dec 15, 2021
@seh

seh commented Dec 15, 2021

I raised a similar concern in the "eks" channel of the "Kubernetes" Slack team. Specifically: Creating a v1/Binding object during the scheduler's PreBind phase sounds like it violates that extension point's contract.

@eminalemdar
Author

@seh I believe that is what triggers adding the node name to the pod's spec.

@seh

seh commented Dec 15, 2021

That's one thing that the API server does in response to creating a v1/Binding, but that's not the problem here. The binding occurs earlier than intended in the scheduler's interaction with plugins, while many facts about the node remain unknown. Admission plugins that hook v1/Bindings will find that the referenced Node object is underpopulated at this point.

@bwagner5
Contributor

Karpenter actually only performs scheduling if the kube-scheduler fails to find a node to place the pod (meaning the cluster needs to scale up nodes to accommodate the pods). As far as the scheduling portion of Karpenter goes, it's not that different from how the kube-scheduler works in terms of adding the node name to the pod. One optimization Karpenter makes is the optimistic binding of the pod to the node. When Karpenter launches new EC2 capacity, it will get the DNS name from EC2, create a node resource in k8s, and bind the pod to that node even though it is not Ready yet. But the vast majority of the time, the node will become ready.
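
As a rough sketch of what that optimistic bind amounts to (node and pod names below are made-up examples), Karpenter registers a Node object for the just-launched instance and then creates a v1/Binding, which the API server turns into spec.nodeName on the pod:

  # Node object registered for the not-yet-Ready instance
  apiVersion: v1
  kind: Node
  metadata:
    name: ip-192-168-12-34.eu-west-1.compute.internal
    labels:
      topology.kubernetes.io/zone: eu-west-1a
      node.kubernetes.io/instance-type: m5.large
  ---
  # Binding that assigns the pending pod to that node
  apiVersion: v1
  kind: Binding
  metadata:
    name: my-app            # name of the pod being bound
    namespace: default
  target:
    apiVersion: v1
    kind: Node
    name: ip-192-168-12-34.eu-west-1.compute.internal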

Do you have an example of an admission plugin that should engage but doesn't when Karpenter binds a pod to a node? I may not fully understand the issue here, but it seems like an admission plugin shouldn't deny a pod's binding to a node, since that would confuse the scheduler; taints should be used instead.

@olemarkus
Contributor

Taint toleration is actually one of the plugins that is not entirely correctly followed. The issues around affinity, spread constraints, etc. also come from this early binding.

Even DaemonSets rely on the kube-scheduler now and only provide restrictive node affinities to ensure pods end up in the right place. Unless Karpenter plans on emulating all scheduler plugins (which is sort of what cluster autoscaler does), there will be a lot of unexpected behavior.

@seh

seh commented Dec 15, 2021

Do you have an example of an admission plugin that should engage but doesn't when Karpenter binds a pod to a node?

The admission plugin is not blocking this binding; it is augmenting it.

We have such a Webhook in production use that reacts to creation of v1/Bindings. It inspects the target Node object, collects details like any external IP addresses that the node has, and adds annotations to the v1/Binding object—which the API server then copies onto the Pod object being bound. In this way, we can annotate our pods with details about the bound node, and allow containers to consume those details via the Downward API.

This annotating must occur at this time so that annotations are present when the containers first start. Trying to add the annotations when the pod is first created would only work for pods that are bound to a chosen node initially, bypassing scheduling, though it is possible then to bind them to a nonexistent node.
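
To make the mechanism concrete, here is a minimal sketch (the annotation key and addresses are hypothetical): the webhook mutates the incoming v1/Binding, the API server copies the annotations onto the Pod, and a container then reads them through the Downward API.

  # v1/Binding after the mutating webhook has run
  apiVersion: v1
  kind: Binding
  metadata:
    name: my-app
    namespace: default
    annotations:
      example.com/node-external-ip: "203.0.113.10"   # hypothetical key and value
  target:
    apiVersion: v1
    kind: Node
    name: ip-192-168-12-34.eu-west-1.compute.internal
  ---
  # Container env consuming the annotation once it has been copied onto the Pod
  env:
    - name: NODE_EXTERNAL_IP
      valueFrom:
        fieldRef:
          fieldPath: metadata.annotations['example.com/node-external-ip']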

@ellistarn
Contributor

By design, karpenter owns the nodes it creates. It knows the properties (e.g. zone/arch) and the scheduling constraints (labels/taints) of the nodes it launches. It has to know this in order to be accurate about the scheduling decision it makes. If we didn't know these things, it's possible to get into situations where the node launched by karpenter isn't a viable node for the pod.

The fact that karpenter creates the node object with the properties it knows and binds immediately is an optimization that follows from the previous assumption. It allows us to avoid tracking state and causes the decision to be reflected in the API server immediately. It also allows the kubelet to learn the binding decision earlier and get the container ready in parallel with node initialization.

Of course, as raised in this thread, systems that have taken dependencies on the implementation details of the CAS + Kubescheduler interaction may need to be adapted to work with Karpenter. I'm confident that we can find solutions, including, if necessary, not binding the pod. However, I have a strong preference for alternatives that allow us to keep the benefits mentioned above.

@ellistarn
Contributor

For example:

We have such a Webhook in production use that reacts to creation of v1/Bindings. It inspects the target Node object, collects details like any external IP addresses that the node has, and adds annotations to the v1/Binding object—which the API server then copies onto the Pod object being bound. In this way, we can annotate our pods with details about the bound node, and allow containers to consume those details via the Downward API.

Karpenter knows the external IP addresses of the node; perhaps we can expand the amount of information that is decorated at creation time. Do you have a list of properties that would be useful to see?

@ellistarn
Contributor

Taint toleration is actually one of the plugins that is not entirely correctly followed. The issues around affinity, spread constraints, etc. also come from this early binding. Even DaemonSets rely on the kube-scheduler now and only provide restrictive node affinities to ensure pods end up in the right place. Unless Karpenter plans on emulating all scheduler plugins (which is sort of what cluster autoscaler does), there will be a lot of unexpected behavior.

Karpenter currently implements affinity, spread, and taints, and factors these in for both unschedulable pods and daemonsets. We chose an approach different from the cluster autoscaler's and decided not to use the kube-scheduler's implementation. There are a number of benefits to this approach, but it does mean that we are responsible for implementing the scheduling spec.
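
For reference, these are the kinds of pod-level constraints that scheduling spec covers; the values below are illustrative only and not taken from this issue:

  spec:
    nodeSelector:
      kubernetes.io/arch: amd64
    tolerations:
      - key: example.com/dedicated        # hypothetical taint key
        operator: Exists
        effect: NoSchedule
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-app
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: topology.kubernetes.io/zone
                  operator: In
                  values: ["eu-west-1a", "eu-west-1b"]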

@eminalemdar
Author

By design, karpenter owns the nodes it creates. It knows the properties (e.g. zone/arch) and the scheduling constraints (labels/taints) of the nodes it launches. It has to know this in order to be accurate about the scheduling decision it makes. If we didn't know these things, it's possible to get into situations where the node launched by karpenter isn't a viable node for the pod.

The fact that karpenter creates the node object with the properties it knows and binds immediately is an optimization that follows from the previous assumption. It allows us to avoid tracking state and causes the decision to be reflected in the API server immediately. It also allows the kubelet to learn the binding decision earlier and get the container ready in parallel with node initialization.

Thank you for this information, @ellistarn. This actually explains a lot!

Karpenter actually only performs scheduling if the kube-scheduler fails to find a node to place the pod (meaning the cluster needs to scale up nodes to accommodate the pods). As far as the scheduling portion of Karpenter goes, it's not that different from how the kube-scheduler works in terms of adding the node name to the pod. One optimization Karpenter makes is the optimistic binding of the pod to the node. When Karpenter launches new EC2 capacity, it will get the DNS name from EC2, create a node resource in k8s, and bind the pod to that node even though it is not Ready yet. But the vast majority of the time, the node will become ready.

That means Karpenter doesn't inject this information into pods any earlier than it needs to. That's great news actually! Thank you for this explanation, @bwagner5.

@seh

seh commented Dec 15, 2021

Karpenter knows the external IP addresses of the node; perhaps we can expand the amount of information that is decorated at creation time. Do you have a list of properties that would be useful to see?

Per this Slack discussion—in which I pointed at the portion of code where you populate the nascent Node object—I thought that the EC2 instance doesn't have an external IP address bound yet at that point.

If that's possible, we'd expect to find these addresses in the "status.addresses" field where the "type" field's value is "ExternalIP." You set the AZ and instance type labels today, but not the one for the region.
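
For reference, these are the Node fields in question; the addresses and label values below are illustrative only:

  metadata:
    labels:
      topology.kubernetes.io/zone: eu-west-1a
      node.kubernetes.io/instance-type: m5.large
      topology.kubernetes.io/region: eu-west-1    # the region label noted as missing today
  status:
    addresses:
      - type: InternalIP
        address: 192.168.12.34
      - type: ExternalIP
        address: 203.0.113.10
      - type: Hostname
        address: ip-192-168-12-34.eu-west-1.compute.internal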

@olemarkus
Contributor

olemarkus commented Dec 15, 2021

The status.addresses field of the Node object is handled by the CCM. That's one of the reasons for #972. topology.kubernetes.io/region is also reconciled by the CCM.

The CCM acts on this pretty quickly, I think. It doesn't need the kubelet to be running as long as it can match the node to an instance.

I have CCM doing this:

  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:beta.kubernetes.io/instance-type: {}
          f:failure-domain.beta.kubernetes.io/region: {}
          f:failure-domain.beta.kubernetes.io/zone: {}
          f:node.kubernetes.io/instance-type: {}
          f:topology.kubernetes.io/region: {}
          f:topology.kubernetes.io/zone: {}
      f:spec:
        f:providerID: {}
    manager: aws-cloud-controller-manager
    operation: Update
    time: "2021-12-15T08:48:06Z"

and the Node has

  creationTimestamp: "2021-12-15T08:48:05Z"

So about a second.

I am wondering if Karpenter can hold back binding for a few seconds for the initialization taints to go away (if they were to be set).

@seh

seh commented Dec 15, 2021

Yes, I've seen the CCM code in the past that handles those, and considered mentioning that, but my main point is that Karpenter is exposing Webhooks to Nodes in this partially populated state well before any other Kubernetes component normally would. Waiting on the CCM to have its turn at enriching the Node would help.

@ellistarn
Contributor

ellistarn commented Dec 23, 2021

Waiting on the CCM to have its turn at enriching the Node would help.

I don't think there's any contract that the node object is fully populated for any given webhook invocation. This becomes challenging to reason about, since Karpenter would need to be aware of every use case that ever labels nodes. Instead, can't the webhook only act once the information is there?

ellistarn added the question (Further information is requested) label and removed the bug (Something isn't working) label on Dec 23, 2021
@olemarkus
Contributor

There is no contract for a fully populated node. But for the CCM, the contract is clearly documented here: https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller/#running-cloud-controller-manager
New nodes are tainted upon creation with node.cloudprovider.kubernetes.io/uninitialized. Thus no Pods are scheduled on new nodes until the CCM has removed that taint. That is, unless Pods explicitly tolerate it, which most Pods should not, just like most Pods should not tolerate the NotReady taint.
Since Karpenter replaces the kubelet here, it is only reasonable that Karpenter fulfills the kubelet's part of the contract. And if it does that, it also cannot bind a pod to a Node with a taint the Pod does not tolerate.
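
For reference, that part of the contract shows up on a newly registered node as the following taint, which the CCM removes once it has initialized the node:

  spec:
    taints:
      - key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"
        effect: NoSchedule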

@seh

seh commented Dec 23, 2021

Instead, can't the webhook only act once the information is there?

No, it can't. It has to augment the v1/Binding at time of creation, and that happens only once.

Also, the Webhook in question needs the node's external IP addresses, not particular labels on the node.

@ellistarn
Contributor

ellistarn commented Dec 23, 2021

No, it can't. It has to augment the v1/Binding at time of creation, and that happens only once.

What modification is happening to the binding object?

Since Karpenter replaces the kubelet here, it is only reasonable that Karpenter fulfills the kubelet's part of the contract. And if it does that, it also cannot bind a pod to a Node with a taint the Pod does not tolerate.

I understand this contract; however, it will cause Karpenter to race with the kube-scheduler, which is something I want to avoid unless absolutely necessary. There are a number of architectural advantages to binding early (see above), and I'd hate to lose them unless we had no other option. Fundamentally, Karpenter is a scheduler that coexists alongside the existing kube-scheduler and operates in the default-scheduler scheduling namespace.

I think there's another path to enable the initialization taint use case, which is the KEPed but unimplemented NoAdmit taint. This taint is designed for the exact use case that Karpenter is trying to achieve, and tells the kubelet to not execute a bound pod until the taint is removed. See more here: #628 (comment)

@seh

seh commented Dec 23, 2021

What modification is happening to the binding object?

Per #1000 (comment), we add annotations, which the API server transcribes onto the Pod being bound.

@ellistarn
Contributor

@seh, looking at some shorter-term options:

  • Is it possible to watch the node object and annotate the pods?
  • Could the aws provider for karpenter write the information into the instance? What is the set of data you need?

@seh

seh commented Dec 24, 2021

Is it possible to watch the node object and annotate the pods?

No. There are only two happens-before edges in the API to supply Downward API content to all containers in a pod: pod creation and (for unbound pods) node binding. After the latter, the kubelet is free to start running init containers. We hook the latter during admission so that by then both the pod and the node exist, but no containers could have started yet.

Could the aws provider for karpenter write the information into the instance? What is the set of data you need?

Yes, I hope so. As mentioned earlier, we need the external IP addresses of the EC2 instance hosting the node.

@github-actions
Contributor

This issue is stale because it has been open 25 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Feb 23, 2022
@github-actions
Contributor

github-actions bot commented Mar 1, 2022

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as completed Mar 1, 2022