Karpenter adds hardcoded Node Name to Pod Spec for scheduling. #1000
I raised a similar concern in the "eks" channel of the "Kubernetes" Slack team. Specifically: Creating a v1/Binding object during the scheduler's PreBind phase sounds like it violates that extension point's contract.
@seh I believe that triggers adding the Node Name to the pod's spec.
That's one thing that the API server does in response to creating a v1/Binding, but that's not the problem here. The binding is occurring earlier than intended in the scheduler's interaction with plugins, and the binding is occurring while many facts about the node still remain unknown. For admission plugins that hook v1/Bindings, they will find that the referenced Node object is underpopulated at this point.
Karpenter actually only performs scheduling if the kube-scheduler fails to find a node to place the pod (meaning the cluster needs to scale up nodes to accommodate the pods). As far as the scheduling portion of Karpenter goes, it's not that different from how the kube-scheduler works in terms of adding the node name to the pod. One optimization Karpenter makes is the optimistic binding of the pod to the node. When Karpenter launches new EC2 capacity, it gets the DNS name from EC2, creates a node resource in Kubernetes, and binds the pod to that node even though it is not Ready yet. But the vast majority of the time, the node will become ready. Do you have an example of an admission plugin that should engage but doesn't when Karpenter binds a pod to a node? I may not fully understand the issue here, but it seems like an admission plugin shouldn't deny a pod's binding to a node, since that would confuse the scheduler; taints should be used instead.
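For readers following along, here is a minimal sketch (not Karpenter's actual code; the node name, labels, and client wiring are assumptions) of what the optimistic create-and-bind flow described above looks like with client-go:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// bindPodToNewNode sketches the optimistic flow described above: register a
// Node object for freshly launched capacity (not yet Ready), then bind the
// pending pod to it so the kubelet can prepare containers in parallel with
// node initialization. Names and labels here are illustrative only.
func bindPodToNewNode(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod, nodeName string) error {
	node := &corev1.Node{
		ObjectMeta: metav1.ObjectMeta{
			Name: nodeName, // e.g. the instance's DNS name returned by EC2
			Labels: map[string]string{
				"topology.kubernetes.io/zone":      "us-west-2a", // known at launch time
				"node.kubernetes.io/instance-type": "m5.large",
			},
		},
	}
	if _, err := client.CoreV1().Nodes().Create(ctx, node, metav1.CreateOptions{}); err != nil {
		return err
	}

	// Creating the Binding has the same effect as the scheduler's bind step:
	// the API server sets spec.nodeName on the pod.
	binding := &corev1.Binding{
		ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		Target:     corev1.ObjectReference{Kind: "Node", Name: nodeName},
	}
	return client.CoreV1().Pods(pod.Namespace).Bind(ctx, binding, metav1.CreateOptions{})
}
```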
Taint toleration is actually one of the plugins that is not entirely correctly followed. The issues around affinity, spread constraints, etc. also come from this early binding. Even DaemonSets rely on the kube-scheduler now and only provide restrictive node affinities to ensure pods end up in the right place. Unless Karpenter plans on emulating all scheduler plugins (which is sort of what the cluster autoscaler does), there will be a lot of unexpected behavior.
The admission plugin is not blocking this binding; it is augmenting it. We have such a Webhook in production use that reacts to creation of v1/Bindings. It inspects the target Node object, collects details like any external IP addresses that the node has, and adds annotations to the v1/Binding object—which the API server then copies onto the Pod object being bound. In this way, we can annotate our pods with details about the bound node, and allow containers to consume those details via the Downward API. This annotating must occur at this time so that annotations are present when the containers first start. Trying to add the annotations when the pod is first created would only work for pods that are bound to a chosen node initially, bypassing scheduling, though it is possible then to bind them to a nonexistent node.
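To make the webhook's role concrete, here is a rough sketch of the mutation it performs, assuming a handler that has already decoded the incoming v1/Binding and fetched the target Node; the annotation key is hypothetical:

```go
package main

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// annotateBinding illustrates the webhook described above: copy the target
// Node's external IP addresses onto the v1/Binding's annotations at admission
// time. Per the comment above, the API server then transcribes these
// annotations onto the Pod being bound, where containers can read them via
// the Downward API. The annotation key is made up for this example.
func annotateBinding(binding *corev1.Binding, node *corev1.Node) {
	var externalIPs []string
	for _, addr := range node.Status.Addresses {
		if addr.Type == corev1.NodeExternalIP {
			externalIPs = append(externalIPs, addr.Address)
		}
	}
	if len(externalIPs) == 0 {
		// On a node Karpenter has just created, status.addresses may still be
		// empty — which is exactly the problem being discussed in this thread.
		return
	}
	if binding.Annotations == nil {
		binding.Annotations = map[string]string{}
	}
	binding.Annotations["example.com/node-external-ips"] = strings.Join(externalIPs, ",")
}
```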
By design, Karpenter owns the nodes it creates. It knows the properties (e.g. zone/arch) and the scheduling constraints (labels/taints) of the nodes it launches. It has to know this in order to be accurate about the scheduling decisions it makes. If we didn't know these things, it's possible to get into situations where the node launched by Karpenter isn't a viable node for the pod. The fact that Karpenter creates the node object with the properties it knows and binds immediately is an optimization that follows from the previous assumption. It allows us to avoid tracking state and causes the decision to be reflected in the API server immediately. It also allows the kubelet to know the binding decision earlier and get the containers ready in parallel with node initialization. Of course, as raised in this thread, systems that have taken dependencies on the implementation details of the CAS + kube-scheduler interaction may need to be adapted to work with Karpenter. I'm confident that we can find solutions, including, if necessary, not binding the pod. However, I have a strong preference for alternatives that allow us to keep the benefits mentioned above.
For example:
Karpenter knows external IP addresses of the node; perhaps we can expand the amount of information that is decorated at creation time. Do you have a list of properties that would be useful to see?
Karpenter currently implements affinity, spread, and taints, and factors these in for both unschedulable pods and daemonsets. We chose an approach different from the cluster autoscaler's and decided not to use the kube-scheduler's implementation. There are a number of benefits to this approach, but it does mean that we are responsible for implementing the scheduling spec.
Thank you for this information, @ellistarn. This actually explains a lot!
That means Karpenter doesn't inject the necessary information into pods earlier. That's great news, actually! Thank you for this explanation, @bwagner5.
Per this Slack discussion—in which I pointed at the portion of code where you populate the nascent Node object—I thought that the EC2 instance doesn't have an external IP address bound yet at that point. If that's possible, we'd expect to find these addresses in the "status.addresses" field where the "type" field's value is "ExternalIP." You set the AZ and instance type labels today, but not the one for the region.
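For reference, a small sketch of where each piece of information mentioned above lives on a fully populated Node; the label keys are the standard well-known ones, and the helper function itself is hypothetical:

```go
package main

import corev1 "k8s.io/api/core/v1"

// nodeTopology shows where the fields discussed above live on a Node object.
// Whether each of them is populated at the moment Karpenter creates the Node
// is exactly the question in this thread.
func nodeTopology(node *corev1.Node) (zone, region, instanceType string, externalIPs []string) {
	zone = node.Labels["topology.kubernetes.io/zone"]              // set by Karpenter at creation
	region = node.Labels["topology.kubernetes.io/region"]          // not set today, per the comment above
	instanceType = node.Labels["node.kubernetes.io/instance-type"] // set by Karpenter at creation
	for _, addr := range node.Status.Addresses {
		if addr.Type == corev1.NodeExternalIP { // filled in later, typically by the cloud controller manager
			externalIPs = append(externalIPs, addr.Address)
		}
	}
	return zone, region, instanceType, externalIPs
}
```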
CCM acts on this pretty quickly, I think. It doesn't need the kubelet to be running as long as it can match the node to an instance. I have CCM doing this:
and the Node has
So about a second. I am wondering if Karpenter can hold back binding for a few seconds for the initialization taints to go away (if they were to be set).
Yes, I've seen the CCM code in the past that handles those, and considered mentioning that, but my main point is that Karpenter is exposing Webhooks to Nodes in this partially populated state well before any other Kubernetes component normally would. Waiting on the CCM to have its turn at enriching the Node would help.
I don't think there's any contract that the node object is fully populated for any given webhook invocation. This becomes challenging to reason about, since Karpenter would need to be aware of every use case that ever labels nodes. Instead, can't the webhook only act once the information is there?
There is no contract for a fully populated node. But for CCM, the contract is clearly documented here: https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller/#running-cloud-controller-manager
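To illustrate the "hold back binding until initialization is done" idea from above, here is a rough sketch (not a proposed implementation; the interval, timeout, and wiring are assumptions) that polls for the CCM's documented initialization taint to be removed before binding:

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// The taint the kubelet adds when started with --cloud-provider=external and
// that the cloud controller manager removes once it has initialized the node
// (see the CCM documentation linked above).
const uninitializedTaint = "node.cloudprovider.kubernetes.io/uninitialized"

// waitForCloudInit polls the Node until the initialization taint is gone, so
// that a subsequent bind would see a node already enriched by the CCM.
// The one-second interval and two-minute timeout are arbitrary.
func waitForCloudInit(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	return wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
		node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return false, nil // the node may not be registered yet; keep polling
		}
		for _, taint := range node.Spec.Taints {
			if taint.Key == uninitializedTaint {
				return false, nil // CCM has not finished initializing this node
			}
		}
		return true, nil
	})
}
```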
No, it can't. It has to augment the v1/Binding at time of creation, and that happens only once. Also, for this Webhook in question, it needs the node's external IP addresses, not particular labels on the node.
What modification is happening to the binding object?
I understand this contract; however, this will cause Karpenter to race with the kube-scheduler, which is something I want to avoid unless absolutely necessary. There are a number of architectural advantages to binding early (see above), and I'd hate to lose them unless we didn't have another option. Fundamentally, Karpenter is a scheduler that coexists alongside the existing kube-scheduler and operates in the I think there's another path to enable the initialization taint use case, which is the KEPed but unimplemented
Per #1000 (comment), we add annotations, which the API server transcribes onto the Pod being bound.
@seh, looking at some shorter-term options:
No. There are only two happens-before edges in the API to supply Downward API content to all containers in a pod: pod creation and (for unbound pods) node binding. After the latter, the kubelet is free to start running init containers. We hook the latter during admission so that by then both the pod and the node exist, but no containers could have started yet.
Yes, I hope so. As mentioned earlier, we need the external IP addresses of the EC2 instance hosting the node.
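For completeness, here is a small sketch of the consumption side mentioned above: a container reading the node-derived annotation through the Downward API. The environment variable name and annotation key are made up; a downwardAPI volume would work similarly.

```go
package main

import corev1 "k8s.io/api/core/v1"

// nodeIPsEnv sketches how a container would consume the annotation that the
// binding-time webhook adds (see the earlier comments). The fieldRef syntax
// for a single annotation key is standard Downward API; the variable name and
// annotation key are hypothetical.
var nodeIPsEnv = corev1.EnvVar{
	Name: "NODE_EXTERNAL_IPS",
	ValueFrom: &corev1.EnvVarSource{
		FieldRef: &corev1.ObjectFieldSelector{
			FieldPath: "metadata.annotations['example.com/node-external-ips']",
		},
	},
}
```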
This issue is stale because it has been open 25 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity. |
Version
Karpenter: v0.5.2
Kubernetes: v1.21
Expected Behavior
Karpenter scales nodes and schedules pods onto them without depending on the kube-scheduler, which reduces pod scheduling latency. At the same time, the logic it uses behind the scenes should be reasonable.
Actual Behavior
After scaling up the nodes, Karpenter hardcodes the node name into the pod's spec. When you delete the node manually, the pod enters the Pending state, and when a new node comes up, Karpenter adds the new node's name to the pending pod's spec and schedules the pod.
Question
This approach doesn't seem like a best practice. Is there another way to schedule the pods onto newly launched nodes?