Karpenter does not terminate instances in Pending state #5706
Comments
I just realised that the TTL for a node to bootstrap is set to 15 minutes: https://github.com/kubernetes-sigs/karpenter/blob/46d3d646ea3784a885336b9c40fd22f406601441/pkg/controllers/nodeclaim/lifecycle/liveness.go#L40 This becomes interesting if we were not using Cilium: would EKS attempt to schedule pods to this non-functioning node?
You're right: if the kubelet was reporting the node as healthy, then even if Karpenter didn't consider it ready, the kube-scheduler would, and it would begin to schedule pods to it. Karpenter could do periodic health checks on nodes, though you could run into the issue of non-disruptible pods scheduling to the unhealthy nodes. If they are in fact able to run in some cases, I don't think Karpenter could safely disrupt the node. Thinking aloud, I'm wondering if it could make sense for Karpenter to apply a startup taint to nodes that it doesn't remove until Karpenter thinks the nodes are ready, with one of those conditions being that the node is in a running state (a rough sketch of that idea follows below).
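For illustration only, here is a minimal client-go sketch of that startup-taint idea; the taint key and helper name are hypothetical, and this is not existing Karpenter behaviour. The controller would add the taint at registration and only strip it once it has decided the node is genuinely ready:

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Hypothetical taint key used only for this illustration.
const startupTaintKey = "example.karpenter.sh/initializing"

// removeStartupTaint drops the startup taint once the controller has decided the
// node is genuinely ready (e.g. kubelet Ready AND the backing instance is running).
func removeStartupTaint(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Keep every taint except the startup taint.
	kept := make([]corev1.Taint, 0, len(node.Spec.Taints))
	for _, t := range node.Spec.Taints {
		if t.Key != startupTaintKey {
			kept = append(kept, t)
		}
	}
	node.Spec.Taints = kept
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```

Until the taint is removed, pods without a matching toleration would not be scheduled to the node, which sidesteps the problem of non-disruptible pods landing on an unhealthy instance.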
The TTL is only set to 15 minutes for nodes that never actually join the cluster; however, it sounds like you are seeing that nodes do join the cluster, but the node never goes into a ready state due to other hardware issues. It seems like there is a lot of overlap here with kubernetes-sigs/karpenter#750, which we are tracking more directly. I'm going to close this issue in favor of that one. I'd encourage you to go check out that issue, +1 it, and see if there is any additional content you think would be relevant to the discussion there as we're thinking about how to solve this problem. We're currently prioritizing that issue in the
That's correctly understood. The EC2 instance does indeed start, the kubelet joins the cluster and reports healthy, and Karpenter does in fact consider it as working. But the instance itself never leaves the EC2 state Pending. When this occurred, we did not have remote shell access enabled to see at what level the instance was actually experiencing hardware issues. That is now in place, so if we see this again I'll re-open the issue if there is anything relevant to share.
I'm actually not sure if #750 is covering the use-case in my reported issue, as it seems that the use-cases there are for nodes that do not report Ready. My nodes did indeed report Ready, but the node had a startup taint that was not removed, because the EC2 instance had hardware issues that made it unable to attach another ENI.
Well, hardware issues do occur, so I don't think support can provide any details. The company is an AWS Enterprise customer, and the TAM has been informed about this issue.
@jonathan-innis We had another occurrence of a hardware issue on an EC2 instance. We managed to establish an SSM session to the instance, so from a networking perspective this node seems to be working. The kubelet init script runs and all seems fine. Notable errors I've found:
That is the IP for the EKS control plane. It seems the kubelet is unable to obtain the certificate? That's strange, why would the kubelet report as working when it is ... not?
I manually approved the certificate, and the kubelet proceeds:
I don't get why this happens randomly. I'm assuming the hardware error/issue is real, but the fact that the kubelet reports OK when it is clearly not seems like an EKS issue. Karpenter is still unaware that this node is not working at all. The instance state is still Pending, and the API call ec2:AttachNetworkInterface continues to fail since the state is not Running|Stopped.
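For context on the "manually approved the certificate" step above: kubelet serving certificates are issued through the CertificateSigningRequest API, and from the CLI the approval is `kubectl certificate approve <csr-name>`. A programmatic equivalent looks roughly like the sketch below; the function name and the reason/message strings are hypothetical:

```go
package example

import (
	"context"

	certv1 "k8s.io/api/certificates/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// approveCSR marks a pending CertificateSigningRequest as approved so that the
// controller manager signs it and the kubelet can serve with a valid certificate.
func approveCSR(ctx context.Context, client kubernetes.Interface, csrName string) error {
	csr, err := client.CertificatesV1().CertificateSigningRequests().Get(ctx, csrName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	csr.Status.Conditions = append(csr.Status.Conditions, certv1.CertificateSigningRequestCondition{
		Type:    certv1.CertificateApproved,
		Status:  corev1.ConditionTrue,
		Reason:  "ManualApproval",
		Message: "Approved manually while debugging a stuck node",
	})
	_, err = client.CertificatesV1().CertificateSigningRequests().UpdateApproval(ctx, csrName, csr, metav1.UpdateOptions{})
	return err
}
```

An unapproved serving certificate would also explain the `remote error: tls: internal error` seen when fetching pod logs from the node, since the kubelet cannot serve on port 10250 with a signed certificate until the CSR is approved.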
I've created an AWS Enterprise support case on the matter: 170841858601906
Hi @toredash, any news here?
Yes, AWS Support said this is working correctly as is, and that this issue should be filed against https://github.com/kubernetes/cloud-provider-aws if I believe this is an issue.
@jonathan-innis Is there a chance to revisit this issue? I would argue that Karpenter should only consider a node fully joined if the kubelet is reporting Ready and the EC2 instance is in the Running state. As it stands, Karpenter is not aware that an EC2 instance could have underlying issues, which would be identified by the instance not transitioning from the pending to the running state (see the sketch below). I'm not that familiar with Golang, so I'm not sure where the logic for this should be placed. Could this be an enhancement of
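A minimal sketch of that proposed rule, assuming a hypothetical helper in a Karpenter-style controller (this is not existing Karpenter code): only treat a node as fully joined when the kubelet reports Ready and the backing EC2 instance has reached the running state.

```go
package example

import (
	ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types"
	corev1 "k8s.io/api/core/v1"
)

// fullyJoined combines the kubelet-level and the EC2-level readiness signals:
// the node must report Ready AND the backing instance must be in "running".
func fullyJoined(node *corev1.Node, instance ec2types.Instance) bool {
	kubeletReady := false
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
			kubeletReady = true
		}
	}
	instanceRunning := instance.State != nil &&
		instance.State.Name == ec2types.InstanceStateNameRunning
	return kubeletReady && instanceRunning
}
```

Such a check would presumably sit next to the existing registration/initialization logic, since that is where the cloud provider already knows the instance ID.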
Description
Observed Behavior:
High-level: EC2 instances in the Pending state are not removed by Karpenter.
We are currently experiencing a higher-than-normal number of EC2 instances which have hardware issues and are not functional. The instances stay in the Pending state forever after they have been provisioned by Karpenter. As the state of the EC2 instance never transitions from Pending, we assumed that Karpenter would, after a while, mark the instance as not healthy and replace it.
Some background information:
When describing the instance, status fields are either pending or attaching. AWS support confirmed that the physical server had issues. Note the `State.Name`, `BlockDeviceMappings[].EBS.Status`, and `NetworkInterfaces[].Attachment.Status` fields from `aws ec2 describe-instances` (some data removed):
The nodeclaim:
Relevant logs for nodeclaim standard-instance-store-x6wxs:
The EC2 node in question in Kubernetes:
Note that we are using Cilium as the CNI. In normal operation, Cilium removes the taint `node.cilium.io/agent-not-ready` from the node once the cilium-agent is running on it. The Cilium operator attempts to attach an additional ENI to the host via ec2:AttachNetworkInterface. AWS audit log entry below, notice the `errorMessage`:

The strange thing is that the Pending instance seems to be working, kinda. Pods that use `hostNetwork: true` are able to run on this instance, and they seem to work. The kubelet is reporting that it is ready. Fetching logs from a pod running on the node fails though:
Error from server: Get "https://10.209.146.79:10250/containerLogs/kube-system/cilium-operator-5695bfbb6b-gm9ch/cilium-operator": remote error: tls: internal error
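Going back to the ENI attachment failure: below is a rough illustration (not Cilium's actual code; the function name and device index are placeholders) of the ec2:AttachNetworkInterface call the operator keeps retrying. Since EC2 only allows attaching an ENI to an instance in the running or stopped state, the call keeps failing while the instance is stuck in pending, and the agent-not-ready taint is never removed.

```go
package example

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

// attachSecondaryENI attaches an additional network interface to an instance.
// For an instance stuck in the pending state this returns an error, because the
// instance is neither running nor stopped.
func attachSecondaryENI(ctx context.Context, client *ec2.Client, instanceID, eniID string) error {
	_, err := client.AttachNetworkInterface(ctx, &ec2.AttachNetworkInterfaceInput{
		InstanceId:         aws.String(instanceID),
		NetworkInterfaceId: aws.String(eniID),
		DeviceIndex:        aws.Int32(1), // placeholder device index
	})
	return err
}
```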
Expected Behavior:
I'm not really sure, to be honest. The NodeClaim is stuck in Ready: false because Cilium is not removing the taints, since the operator is not able to attach an ENI to the instance. As the EC2 API reports the instance as Pending, I would expect Karpenter to mark the node as failed/not working and remove it.
So what I think should happen is that Karpenter would mark EC2 nodes that have been in the Pending state for more than 15 minutes as not ready and decommission them (a rough sketch of such a check follows below).
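As a rough sketch of that expectation (not existing Karpenter behaviour; the helper name and the 15-minute constant are illustrative), a cloud-provider-aware check could flag instances that have sat in the pending state longer than the TTL:

```go
package example

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/service/ec2"
	ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// pendingTTL mirrors the 15-minute bootstrap TTL discussed above.
const pendingTTL = 15 * time.Minute

// stuckInPending reports whether an instance has been in the pending state for
// longer than pendingTTL since launch, i.e. a candidate for decommissioning.
func stuckInPending(ctx context.Context, client *ec2.Client, instanceID string) (bool, error) {
	out, err := client.DescribeInstances(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []string{instanceID},
	})
	if err != nil {
		return false, err
	}
	for _, res := range out.Reservations {
		for _, inst := range res.Instances {
			if inst.State != nil &&
				inst.State.Name == ec2types.InstanceStateNamePending &&
				inst.LaunchTime != nil &&
				time.Since(*inst.LaunchTime) > pendingTTL {
				return true, nil
			}
		}
	}
	return false, nil
}
```

Instances flagged this way could then be handled like nodes that never registered, i.e. the NodeClaim is deleted and a replacement is launched.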
Reproduction Steps (Please include YAML):
Versions:
- Chart Version: v0.34.0
- Kubernetes Version (`kubectl version`): Server Version: v1.27.9-eks-5e0fdde