
Nodes in Unknown State and Clarification on High CPU Usage and CPU Limit Settings #1585

Open
mahaveer08 opened this issue Aug 20, 2024 · 11 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@mahaveer08

Description

What problem are you trying to solve?
I'm currently managing a large-scale deployment using Karpenter, and I am encountering two specific issues:

Nodes in Unknown State: Some of my nodes have intermittently transitioned to an unknown state. This is causing disruptions in workload scheduling and cluster stability. I need to understand potential reasons for this behavior and how to troubleshoot and resolve it.

Clarification on High CPU Usage and CPU Limit Settings: I have several high-CPU node pools, and I'm trying to understand how to optimize CPU utilization and properly set CPU limits. Specifically, I would like guidance on the following:

  • How to prevent CPU overutilization when scaling large node pools (e.g., 14,000+ CPUs).
  • Best practices for setting CPU limits on node pools to avoid resource exhaustion without underutilizing available capacity (see the sketch below).
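
For context, a minimal sketch of how a Karpenter NodePool expresses a CPU ceiling via spec.limits; the pool name, node class, and the 14,000 figure below are placeholders for illustration rather than a recommendation, and exact fields depend on your Karpenter version:

# Sketch only: spec.limits caps the aggregate capacity this NodePool may provision.
# Once the limit is reached Karpenter stops launching new nodes; it does not
# terminate existing nodes to get back under the limit.
cat <<'EOF' | kubectl apply -f -
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: high-cpu                  # placeholder name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default             # placeholder EC2NodeClass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  limits:
    cpu: "14000"                  # aggregate vCPU ceiling for the whole pool
    memory: 50000Gi               # optional aggregate memory ceiling (placeholder)
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
EOF
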
How important is this feature to you?
These issues are critical to the stability and efficiency of our cluster. The unknown node state issue directly impacts the reliability of our services, while high CPU usage and misconfigured CPU limits can lead to performance bottlenecks. Any guidance or fixes would significantly enhance the operation of our workloads at scale.

@mahaveer08 mahaveer08 added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 20, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Aug 20, 2024
@mahaveer08
Author

mahaveer08 commented Aug 20, 2024

These are the current node conditions:

Type            Status    Message
MemoryPressure  Unknown   Kubelet stopped posting node status.
DiskPressure    Unknown   Kubelet stopped posting node status.
PIDPressure     Unknown   Kubelet stopped posting node status.
Ready           Unknown   Kubelet stopped posting node status.
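
A quick way to surface every node stuck in this state (generic kubectl/jq, nothing Karpenter-specific; the node name in the second command is a placeholder):

# Nodes whose Ready condition has gone Unknown, i.e. the kubelet stopped reporting:
kubectl get nodes -o json | jq -r '
  .items[]
  | select(any(.status.conditions[]; .type == "Ready" and .status == "Unknown"))
  | .metadata.name'

# Full condition detail for one affected node:
kubectl describe node ip-10-0-0-1.example.internal | grep -A 12 'Conditions:'
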

@mahaveer08
Author

mahaveer08 commented Aug 21, 2024

What configuration should I add to resolve or remove nodes stuck in the "NotReady" state, and what are the common reasons nodes enter that state?

@mahaveer08
Author

Is there any configuration we need to add to handle nodes in the NotReady state so they can be automatically restarted or drained? Currently, these nodes get stuck and remain in that state.

@suraj2410

We are also facing the same issue, with nodes going into the NotReady state quite often and remaining there.

@mahaveer08
Author

mahaveer08 commented Aug 26, 2024

I'm not exactly sure why the kubelet keeps stopping communication on these nodes. Restarting the affected instances temporarily fixes the problem, but I have too many nodes in an unknown state to restart them all manually, and the issue happens frequently enough that I can't keep doing it by hand. I need urgent help with this @ellistarn @jigisha620

@mahaveer08
Author

Can anyone help with this? My nodes frequently go into an unknown state, with the condition status showing 'Kubelet stopped posting node status.'

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 8, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 7, 2025
@petewilcock

/remove-lifecycle rotten

We're also seeing this happen: nodes that came up healthily have the kubelet stop reporting node status and never recover. Karpenter tries to terminate pods, but they just hang in a 'Terminating' status indefinitely until the node is manually deleted, at which point the pods are rescheduled.

There could be some interaction with a Bottlerocket bug (bottlerocket-os/bottlerocket#4075); in our case it seems to hit smaller nodes running Kafka that see intermittent high network spikes. Something about the sudden network surge appears to break kubelet reporting, and Karpenter has no mechanism to recover from it.

So it might not be a Karpenter issue per se, but rather a matter of handling this condition, which may be caused by the underlying AMI.
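
For reference, the manual recovery described above is roughly the following sketch (the node name is a placeholder; for Karpenter-managed nodes, deleting the Node object should let Karpenter's finalizer terminate and replace the instance, and the stuck pods get rescheduled):

NODE=ip-10-0-0-1.example.internal   # placeholder for the stuck node

# Evict what can still be evicted; pods already stuck terminating need --force.
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --force --grace-period=0

# Remove the Node object so workloads reschedule and the instance gets replaced.
kubectl delete node "$NODE"
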

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 14, 2025
@rschalo
Contributor

rschalo commented Feb 3, 2025

If the kubelet has stopped posting its status, can you look into why that was happening? Was the kubelet on those nodes OOM-killed? I'm not sure this is a Karpenter issue, and it sounds like limits may be misconfigured, but the Node Auto Repair feature in Karpenter may be of interest to you:

https://karpenter.sh/v1.2/concepts/disruption/#node-auto-repair
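
If you can get a shell on one of the affected instances (SSM/SSH, or the admin container's root shell on Bottlerocket, depending on your setup), a rough sketch of what to check for a kubelet that stopped reporting:

# Is the kubelet still running, and when/why did it last restart?
systemctl status kubelet
journalctl -u kubelet --since "2 hours ago" --no-pager | tail -n 200

# Was anything OOM-killed (the kubelet itself, or something starving the node)?
dmesg -T | grep -iE 'out of memory|oom-killer|killed process'

# Resource pressure at the time of investigation
free -m
uptime
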

@rschalo
Contributor

rschalo commented Feb 3, 2025

/triage needs-information

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 3, 2025