
Karpenter adds hardcoded Node Name to Pod Spec for scheduling. #1000

Closed
eminalemdar opened this issue Dec 15, 2021 · 22 comments
Labels
question (Further information is requested), stale

Comments

@eminalemdar

Version

Karpenter: v0.5.2

Kubernetes: v1.21

Expected Behavior

Karpenter scales nodes and schedules pods onto them without depending on the kube-scheduler. This reduces the latency of the pod scheduling process. At the same time, there should be reasonable logic behind the scenes.

Actual Behavior

After scaling the nodes, Karpenter hardcodes the node's name into the pod's spec. When you delete the node manually, the pod enters the Pending state, and when a new node comes in, Karpenter adds the new node's name to the pending pod's spec and schedules the pod.
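
For illustration, the observable effect is a pod whose spec.nodeName is already filled in, roughly like the sketch below (the pod and node names are made-up examples):

  apiVersion: v1
  kind: Pod
  metadata:
    name: my-app
    namespace: default
  spec:
    # Normally left empty so the kube-scheduler picks a node; here it
    # already points at the node Karpenter launched.
    nodeName: ip-192-168-12-34.eu-west-1.compute.internal
    containers:
      - name: app
        image: nginx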

Question

This approach doesn't seem like a best practice. Could there be another way to schedule the pods onto newly launched nodes?

eminalemdar added the bug (Something isn't working) label on Dec 15, 2021
@seh

seh commented Dec 15, 2021

I raised a similar concern in the "eks" channel of the "Kubernetes" Slack team. Specifically: Creating a v1/Binding object during the scheduler's PreBind phase sounds like it violates that extension point's contract.

@eminalemdar
Author

@seh I believe that is what triggers adding the node name to the pod's spec.

@seh

seh commented Dec 15, 2021

That's one thing that the API server does in response to creating a v1/Binding, but that's not the problem here. The binding occurs earlier than intended in the scheduler's interaction with plugins, while many facts about the node remain unknown. Admission plugins that hook v1/Bindings will find that the referenced Node object is underpopulated at this point.

@bwagner5
Contributor

Karpenter actually only performs scheduling if the kube-scheduler fails to find a node to place the pod (meaning the cluster needs to scale up nodes to accommodate the pods). As far as the scheduling portion of Karpenter goes, it's not that different from how the kube-scheduler works in terms of adding the node name to the pod. One optimization Karpenter makes is the optimistic binding of the pod to the node. When Karpenter launches new EC2 capacity, it will get the DNS name from EC2, create a node resource in k8s, and bind the pod to that node even though it is not Ready yet. But the vast majority of the time, the node will become ready.
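
As a rough sketch of what that optimistic bind amounts to (node and pod names below are made-up examples), Karpenter registers a Node object for the just-launched instance and then creates a v1/Binding, which the API server turns into spec.nodeName on the pod:

  # Node object registered for the not-yet-Ready instance
  apiVersion: v1
  kind: Node
  metadata:
    name: ip-192-168-12-34.eu-west-1.compute.internal
    labels:
      topology.kubernetes.io/zone: eu-west-1a
      node.kubernetes.io/instance-type: m5.large
  ---
  # Binding that assigns the pending pod to that node
  apiVersion: v1
  kind: Binding
  metadata:
    name: my-app            # name of the pod being bound
    namespace: default
  target:
    apiVersion: v1
    kind: Node
    name: ip-192-168-12-34.eu-west-1.compute.internal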

Do you have an example of an admission plugin that should engage but doesn't when Karpenter binds a pod to a node? I may not fully understand the issue here, but it seems like an admission plugin shouldn't deny a pod's binding to a node, since that would confuse the scheduler; taints should be used instead.

@olemarkus
Contributor

Taint toleration is actually one of the plugins that is not entirely correctly followed. The issues around affinity, spread constraints, etc. also come from this early binding.

Even DaemonSets rely on the kube-scheduler now and only provide restrictive node affinities to ensure pods end up in the right place. Unless Karpenter plans on emulating all scheduler plugins (which is sort of what cluster autoscaler does), there will be a lot of unexpected behavior.

@seh

seh commented Dec 15, 2021

Do you have an example of an admission plugin that should engage but doesn't when Karpenter binds a pod to a node?

The admission plugin is not blocking this binding; it is augmenting it.

We have such a Webhook in production use that reacts to creation of v1/Bindings. It inspects the target Node object, collects details like any external IP addresses that the node has, and adds annotations to the v1/Binding object—which the API server then copies onto the Pod object being bound. In this way, we can annotate our pods with details about the bound node, and allow containers to consume those details via the Downward API.

This annotating must occur at this time so that annotations are present when the containers first start. Trying to add the annotations when the pod is first created would only work for pods that are bound to a chosen node initially, bypassing scheduling, though it is possible then to bind them to a nonexistent node.
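
To make the mechanism concrete, here is a minimal sketch (the annotation key and addresses are hypothetical): the webhook mutates the incoming v1/Binding, the API server copies the annotations onto the Pod, and a container then reads them through the Downward API.

  # v1/Binding after the mutating webhook has run
  apiVersion: v1
  kind: Binding
  metadata:
    name: my-app
    namespace: default
    annotations:
      example.com/node-external-ip: "203.0.113.10"   # hypothetical key and value
  target:
    apiVersion: v1
    kind: Node
    name: ip-192-168-12-34.eu-west-1.compute.internal
  ---
  # Container env consuming the annotation once it has been copied onto the Pod
  env:
    - name: NODE_EXTERNAL_IP
      valueFrom:
        fieldRef:
          fieldPath: metadata.annotations['example.com/node-external-ip']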

@ellistarn
Contributor

By design, karpenter owns the nodes it creates. It knows the properties (e.g. zone/arch) and the scheduling constraints (labels/taints) of the nodes it launches. It has to know this in order to be accurate about the scheduling decision it makes. If we didn't know these things, it's possible to get into situations where the node launched by karpenter isn't a viable node for the pod.

The fact that karpenter creates the node object with the properties it knows and binds immediately is an optimization that follows from the previous assumption. It allows us to avoid tracking state and causes the decision to be reflected in the API server immediately. It also allows the kubelet to learn the binding decision earlier and get the container ready in parallel with node initialization.

Of course, as raised in this thread, systems that have taken dependencies on the implementation details of the CAS + Kubescheduler interaction may need to be adapted to work with Karpenter. I'm confident that we can find solutions, including, if necessary, not binding the pod. However, I have a strong preference for alternatives that allow us to keep the benefits mentioned above.

@ellistarn
Contributor

For example:

We have such a Webhook in production use that reacts to creation of v1/Bindings. It inspects the target Node object, collects details like any external IP addresses that the node has, and adds annotations to the v1/Binding object—which the API server then copies onto the Pod object being bound. In this way, we can annotate our pods with details about the bound node, and allow containers to consume those details via the Downward API.

Karpenter knows the external IP addresses of the node; perhaps we can expand the amount of information that is decorated at creation time. Do you have a list of properties that would be useful to see?

@ellistarn
Contributor

Taint toleration is actually one of the plugins that is not entirely correctly followed. The issues around affinity, spread constraints, etc. also come from this early binding. Even DaemonSets rely on the kube-scheduler now and only provide restrictive node affinities to ensure pods end up in the right place. Unless Karpenter plans on emulating all scheduler plugins (which is sort of what cluster autoscaler does), there will be a lot of unexpected behavior.

Karpenter currently implements affinity, spread, and taints, and factors these in for both unschedulable pods and daemonsets. We chose an approach different from the cluster autoscaler's and decided not to use the kube-scheduler's implementation. There are a number of benefits to this approach, but it does mean that we are responsible for implementing the scheduling spec.
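
For reference, these are the kinds of pod-level constraints that scheduling spec covers; the values below are illustrative only and not taken from this issue:

  spec:
    nodeSelector:
      kubernetes.io/arch: amd64
    tolerations:
      - key: example.com/dedicated        # hypothetical taint key
        operator: Exists
        effect: NoSchedule
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-app
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: topology.kubernetes.io/zone
                  operator: In
                  values: ["eu-west-1a", "eu-west-1b"]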

@eminalemdar
Author

By design, karpenter owns the nodes it creates. It knows the properties (e.g. zone/arch) and the scheduling constraints (labels/taints) of the nodes it launches. It has to know this in order to be accurate about the scheduling decision it makes. If we didn't know these things, it's possible to get into situations where the node launched by karpenter isn't a viable node for the pod.

The fact that karpenter creates the node object with the properties it knows and binds immediately is an optimization that follows from the previous assumption. It allows us to avoid tracking state and causes the decision to be reflected in the API server immediately. It also allows the kubelet to learn the binding decision earlier and get the container ready in parallel with node initialization.

Thank you for this information, @ellistarn. This actually explains a lot!

Karpenter actually only performs scheduling if the kube-scheduler fails to find a node to place the pod (meaning the cluster needs to scale up nodes to accommodate the pods). As far as the scheduling portion of Karpenter goes, it's not that different from how the kube-scheduler works in terms of adding the node name to the pod. One optimization Karpenter makes is the optimistic binding of the pod to the node. When Karpenter launches new EC2 capacity, it will get the DNS name from EC2, create a node resource in k8s, and bind the pod to that node even though it is not Ready yet. But the vast majority of the time, the node will become ready.

That means Karpenter doesn't inject this information into pods any earlier than it needs to. That's great news actually! Thank you for this explanation, @bwagner5.

@seh

seh commented Dec 15, 2021

Karpenter knows the external IP addresses of the node; perhaps we can expand the amount of information that is decorated at creation time. Do you have a list of properties that would be useful to see?

Per this Slack discussion—in which I pointed at the portion of code where you populate the nascent Node object—I thought that the EC2 instance doesn't have an external IP address bound yet at that point.

If that's possible, we'd expect to find these addresses in the "status.addresses" field where the "type" field's value is "ExternalIP." You set the AZ and instance type labels today, but not the one for the region.
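
For reference, these are the Node fields in question; the addresses and label values below are illustrative only:

  metadata:
    labels:
      topology.kubernetes.io/zone: eu-west-1a
      node.kubernetes.io/instance-type: m5.large
      topology.kubernetes.io/region: eu-west-1    # the region label noted as missing today
  status:
    addresses:
      - type: InternalIP
        address: 192.168.12.34
      - type: ExternalIP
        address: 203.0.113.10
      - type: Hostname
        address: ip-192-168-12-34.eu-west-1.compute.internal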

@olemarkus
Contributor

olemarkus commented Dec 15, 2021

The status.addresses field of the Node object is handled by the CCM. That's one of the reasons for #972. topology.kubernetes.io/region is also reconciled by the CCM.

The CCM acts on this pretty quickly, I think. It doesn't need the kubelet to be running as long as it can match the node to an instance.

I have CCM doing this:

  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:beta.kubernetes.io/instance-type: {}
          f:failure-domain.beta.kubernetes.io/region: {}
          f:failure-domain.beta.kubernetes.io/zone: {}
          f:node.kubernetes.io/instance-type: {}
          f:topology.kubernetes.io/region: {}
          f:topology.kubernetes.io/zone: {}
      f:spec:
        f:providerID: {}
    manager: aws-cloud-controller-manager
    operation: Update
    time: "2021-12-15T08:48:06Z"

and the Node has

  creationTimestamp: "2021-12-15T08:48:05Z"

So about a second.

I am wondering if Karpenter can hold back binding for a few seconds for the initialization taints to go away (if they were to be set).

@seh

seh commented Dec 15, 2021

Yes, I've seen the CCM code in the past that handles those, and considered mentioning that, but my main point is that Karpenter is exposing Webhooks to Nodes in this partially populated state well before any other Kubernetes component normally would. Waiting on the CCM to have its turn at enriching the Node would help.

@ellistarn
Contributor

ellistarn commented Dec 23, 2021

Waiting on the CCM to have its turn at enriching the Node would help.

I don't think there's any contract that the node object is fully populated for any given webhook invocation. This becomes challenging to reason about, since Karpenter would need to be aware of every use case that ever labels nodes. Instead, can't the webhook only act once the information is there?

ellistarn added the question (Further information is requested) label and removed the bug (Something isn't working) label on Dec 23, 2021
@olemarkus
Contributor

There is no contract for a fully populated node. But for the CCM, the contract is clearly documented here: https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller/#running-cloud-controller-manager
New nodes are tainted upon creation with node.cloudprovider.kubernetes.io/uninitialized. Thus no Pods are scheduled on new nodes until the CCM has removed that taint. That is, unless Pods explicitly tolerate it, which most Pods should not, just like most Pods should not tolerate the NotReady taint.
Since Karpenter replaces the kubelet here, it is only reasonable that Karpenter fulfills the kubelet's part of the contract. And if it does that, it also cannot bind a pod to a Node with a taint the Pod does not tolerate.
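
For reference, that part of the contract shows up on a newly registered node as the following taint, which the CCM removes once it has initialized the node:

  spec:
    taints:
      - key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"
        effect: NoSchedule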

@seh

seh commented Dec 23, 2021

Instead, can't the webhook only act once the information is there?

No, it can't. It has to augment the v1/Binding at time of creation, and that happens only once.

Also, the Webhook in question needs the node's external IP addresses, not particular labels on the node.

@ellistarn
Contributor

ellistarn commented Dec 23, 2021

No, it can't. It has to augment the v1/Binding at time of creation, and that happens only once.

What modification is happening to the binding object?

Since Karpenter replaces the kubelet here, it is only reasonable that Karpenter fulfills the kubelet's part of the contract. And if it does that, it also cannot bind a pod to a Node with a taint the Pod does not tolerate.

I understand this contract; however, it will cause Karpenter to race with the kube-scheduler, which is something I want to avoid unless absolutely necessary. There are a number of architectural advantages to binding early (see above), and I'd hate to lose them unless we had no other option. Fundamentally, Karpenter is a scheduler that coexists alongside the existing kube-scheduler and operates in the default-scheduler scheduling namespace.

I think there's another path to enable the initialization taint use case, which is the KEPed but unimplemented NoAdmit taint. This taint is designed for the exact use case that Karpenter is trying to achieve, and tells the kubelet to not execute a bound pod until the taint is removed. See more here: #628 (comment)

@seh

seh commented Dec 23, 2021

What modification is happening to the binding object?

Per #1000 (comment), we add annotations, which the API server transcribes onto the Pod being bound.

@ellistarn
Contributor

@seh, looking at some shorter-term options:

  • Is it possible to watch the node object and annotate the pods?
  • Could the aws provider for karpenter write the information into the instance? What is the set of data you need?

@seh

seh commented Dec 24, 2021

Is it possible to watch the node object and annotate the pods?

No. There are only two happens-before edges in the API to supply Downward API content to all containers in a pod: pod creation and (for unbound pods) node binding. After the latter, the kubelet is free to start running init containers. We hook the latter during admission so that by then both the pod and the node exist, but no containers could have started yet.

Could the aws provider for karpenter write the information into the instance? What is the set of data you need?

Yes, I hope so. As mentioned earlier, we need the external IP addresses of the EC2 instance hosting the node.

@github-actions
Contributor

This issue is stale because it has been open 25 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Feb 23, 2022
@github-actions
Contributor

github-actions bot commented Mar 1, 2022

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as completed Mar 1, 2022