Karpenter on EKS fails to remove 'uninitialized' taint from nodes #7426
Comments
Hello, I just created an issue this morning that I think is related to this problem: #7424. For example, we currently have a NodeClaim that has been in a non-ready state (Ready False) for 40 minutes. As you described, the node starts correctly (it has the daemonsets running, etc.). A get nodeclaim returns: general-purpose-rdvzw m6i.xlarge on-demand eu-west-3b ip-redacted.eu-west-3.compute.internal False 46m Conditions:
As said in my issue, we also don't understand why it happens because it seems random; for some nodes it works as expected. The node also has the NoSchedule taint:
We rolled back to 0.37.6 (we were on 1.0.8) but the issue is still present. We didn't have this issue on 0.37.0 (the version we were running before upgrading), but rolling back further is now difficult due to the CRD changes, various webhook issues, etc.
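For reference, a minimal sketch of the commands one can use to inspect a stuck NodeClaim and the node's taints (the NodeClaim and redacted node names are taken from the example above; substitute your own):

```sh
# List NodeClaims with their readiness and age
kubectl get nodeclaims

# Inspect the conditions of the stuck NodeClaim
kubectl describe nodeclaim general-purpose-rdvzw

# Show the taints still present on the corresponding node
kubectl get node ip-redacted.eu-west-3.compute.internal -o jsonpath='{.spec.taints}'
```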
To give another example of the randomness of the behavior: I cleaned up all Pending NodeClaims, and Karpenter then started new nodes:
Fifteen minutes later, some nodes have successfully joined the cluster and some are still Ready False:
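For context, a sketch of one way to clean up the stuck NodeClaims and watch the replacements (assuming the READY column is the sixth field, as in the `get nodeclaim` output above):

```sh
# Delete NodeClaims that never became ready; Karpenter provisions replacements
kubectl get nodeclaims --no-headers | awk '$6 == "False" {print $1}' \
  | xargs -r kubectl delete nodeclaim

# Watch the replacements until they report Ready
kubectl get nodeclaims -w
```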
I am facing the same problem. This issue is greatly complicating my situation. Version
Hi @silvadavitor
NodePool
EC2NodeClass
Hello @mcorbin,
Hi @konigpeter
Hello,
Hello @mcorbin,
We'll test this. At the moment I haven't seen any pending NodeClaims since this morning, but our clusters have only just started to autoscale with the week beginning.
@konigpeter Do you still experience the issue? On our side we don't see it anymore, but we don't know why.
@mcorbin We have also not experienced this issue anymore. To ensure reliability, we conducted several tests by forcing 10 to 100 NodeClaims, and everything worked without errors. That said, we are also unsure about the root cause of the problem or why it stopped occurring.
Same, we did a rollout of probably hundreds of nodes and we can't reproduce :/
Description
Observed Behavior:
Karpenter provisions a new node, and it successfully joins the EKS cluster. However, the taint "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule" is not removed, preventing pods from deployments from being scheduled on the node. As a result, the node only runs the default add-on pods (e.g., kube-proxy, VPC CNI) but does not receive workload pods.
If the taint is manually removed from the node, everything works as expected, and the pods are scheduled without issues.
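As a temporary workaround, the taint can be dropped by hand (a sketch; replace the placeholder with the affected node's name):

```sh
# Remove the uninitialized taint so pending workload pods can schedule
kubectl taint node <node-name> node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule-
```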
This behavior was observed in version 0.37 and persists after updating to version 1.0. It leads to availability issues, as pods remain in a Pending state despite the node being available.
Karpenter does not generate any logs about this issue, providing no insight into why the taint is not being removed.
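For completeness, this is one way to look for related messages in the controller logs (a sketch assuming the standard Helm chart labels and a `karpenter` namespace; adjust to your installation):

```sh
# Search recent Karpenter controller logs for taint/initialization activity
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -c controller --since=1h \
  | grep -iE 'taint|initial' || true
```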
K8s Events:
Node Claim Condition Message:
Node Taint:
Expected Behavior:
When Karpenter provisions a new node and it successfully joins the EKS cluster, the taint "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule" should be automatically removed once the node is fully initialized. This would allow pods from deployments to be scheduled on the node alongside the default add-on pods (e.g., kube-proxy, VPC CNI).
Nodes should not require manual intervention to remove the taint, and workload pods should be scheduled as soon as the node becomes available, ensuring no pods remain in a Pending state due to this issue.
Reproduction Steps (Please include YAML):
This issue is intermittent and does not occur with every node provisioned by Karpenter, making it difficult to consistently reproduce. However, the following steps outline a typical setup where the issue has been observed:
Deploy Karpenter on an EKS cluster.
Create a provisioner with support for Spot and On-Demand Instances (a NodePool sketch follows these steps).
Trigger workloads that scale the cluster and provision new nodes.
Occasionally, newly provisioned nodes will join the cluster but retain the taint "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule", preventing workload pods from being scheduled.
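A minimal sketch of such a setup (the NodePool name, the `default` EC2NodeClass reference, and the pause image are illustrative assumptions, not the exact configuration from this report):

```sh
# Sketch: NodePool (Karpenter v1 API) allowing both Spot and On-Demand capacity
kubectl apply -f - <<'EOF'
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
EOF

# Trigger scale-up with a dummy workload, then check which new nodes keep the taint
kubectl create deployment inflate --image=public.ecr.aws/eks-distro/kubernetes/pause:3.7 --replicas=50
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
```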
Versions:
EKS: 1.30