Nodes in Unknown State and Clarification on High CPU Usage and CPU Limit Settings #1585
Comments
This is the current node condition: MemoryPressure is Unknown, with the message "Kubelet stopped posting node status."
What configuration should I add to resolve or remove nodes in the "NotReady" state, and what are the common reasons for nodes entering the "NotReady" state?
Is there any configuration we need to add to handle nodes in this state?
We are also facing this same issue, with nodes going into the NotReady state quite often and remaining there.
I'm not exactly sure what the issue is or why the kubelet keeps stopping communication with the nodes. Restarting the stopped instances temporarily fixes the problem, but I have too many nodes in an unknown state to restart them all manually, and the issue happens too frequently to keep doing that every time. I need urgent help with this. @ellistarn @jigisha620
Can anyone help with this? My nodes frequently go into an unknown state, and the condition status shows "Kubelet stopped posting node status."
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
We're also seeing this happen to nodes that came up healthy but then have the kubelet stop reporting node status and never recover. Karpenter tries to terminate the pods, but they hang in a Terminating status indefinitely until the node is manually deleted, at which point the pods are rescheduled. There could be some interaction with a Bottlerocket bug (bottlerocket-os/bottlerocket#4075); in our case it seems to affect smaller nodes running Kafka that have intermittent high network spikes. Something about the sudden network surge seems to break kubelet reporting, and Karpenter has no mechanism to recover from it. So it might not be a Karpenter issue per se, but rather missing handling for a condition that may be caused by the underlying AMI.
If the kubelet has stopped posting its status, can you look into why that was happening? Was the kubelet on those nodes OOM-killed? I'm not sure this is a Karpenter issue, and it sounds like limits may be misconfigured, but the Node Auto Repair feature in Karpenter may be of interest to you: https://karpenter.sh/v1.2/concepts/disruption/#node-auto-repair
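For reference, here is a rough sketch of enabling Node Auto Repair through the Karpenter Helm chart values, assuming Karpenter v1.1 or later where it ships as an alpha feature gate; the exact key may differ between chart versions, so verify it against your chart's values schema before applying:

```yaml
# values.yaml (sketch, assumes the nodeRepair feature-gate key in recent charts):
# enable the NodeRepair alpha feature gate so Karpenter can replace nodes whose
# conditions (e.g. Ready=Unknown, "Kubelet stopped posting node status") remain
# unhealthy past the repair toleration window instead of lingering forever.
settings:
  featureGates:
    nodeRepair: true
```

With the gate enabled, repair is driven by node conditions rather than any additional NodePool configuration, so it should cover the "stuck in Unknown until manually deleted" behaviour described above as long as the condition is one Karpenter treats as unhealthy.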
/triage needs-information
Description
What problem are you trying to solve?
I'm currently managing a large-scale deployment using Karpenter, and I am encountering two specific issues:
Nodes in Unknown State: Some of my nodes have intermittently transitioned to an unknown state. This is causing disruptions in workload scheduling and cluster stability. I need to understand potential reasons for this behavior and how to troubleshoot and resolve it.
Clarification on High CPU Usage and CPU Limit Settings: I have several high-CPU node pools, and I'm trying to understand how to optimize CPU utilization and properly set CPU limits. Specifically, I would like guidance on the following:
How to prevent CPU overutilization when scaling large node pools (e.g., 14,000+ CPUs).
Best practices for setting CPU limits in node pools to avoid resource exhaustion without underutilizing available capacity (see the NodePool sketch after this list).
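To make the CPU-limit part concrete, here is a minimal sketch of where limits live on a Karpenter NodePool. The pool name, memory figure, and the `default` EC2NodeClass are placeholders, and the numbers are illustrative rather than a recommendation:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose        # placeholder name
spec:
  # Karpenter stops launching new nodes for this pool once the total
  # provisioned resources would exceed these limits; existing nodes
  # and running pods are not affected.
  limits:
    cpu: "14000"               # cap on total vCPUs provisioned by this pool
    memory: 56000Gi            # illustrative memory cap
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:            # assumes the AWS provider
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default          # placeholder EC2NodeClass
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
```

Note that `spec.limits` only caps what Karpenter will provision for the pool; it does not reserve capacity or throttle workloads, so pod-level CPU requests and limits still need to be sized separately.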
How important is this feature to you?
These issues are critical to the stability and efficiency of our cluster. The unknown node state issue directly impacts the reliability of our services, while high CPU usage and misconfigured CPU limits can lead to performance bottlenecks. Any guidance or fixes would significantly enhance the operation of our workloads at scale.