
Nodes in Unknown State and Clarification on High CPU Usage and CPU Limit Settings #1585

Open
mahaveer08 opened this issue Aug 20, 2024 · 11 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@mahaveer08

Description

What problem are you trying to solve?
I'm currently managing a large-scale deployment using Karpenter, and I am encountering two specific issues:

Nodes in Unknown State: Some of my nodes have intermittently transitioned to an unknown state. This is causing disruptions in workload scheduling and cluster stability. I need to understand potential reasons for this behavior and how to troubleshoot and resolve it.

Clarification on High CPU Usage and CPU Limit Settings: I have several high-CPU node pools, and I'm trying to understand how to optimize CPU utilization and properly set CPU limits. Specifically, I would like guidance on the following:

  • How to prevent CPU overutilization when scaling large node pools (e.g., 14,000+ CPUs).
  • Best practices for setting CPU limits on node pools to avoid resource exhaustion without underutilizing available capacity (see the sketch below).
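
For context, a minimal sketch of how a Karpenter NodePool expresses a CPU ceiling via spec.limits; the pool name, node class, and the 14,000 figure below are placeholders for illustration rather than a recommendation, and exact fields depend on your Karpenter version:

# Sketch only: spec.limits caps the aggregate capacity this NodePool may provision.
# Once the limit is reached Karpenter stops launching new nodes; it does not
# terminate existing nodes to get back under the limit.
cat <<'EOF' | kubectl apply -f -
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: high-cpu                  # placeholder name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default             # placeholder EC2NodeClass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  limits:
    cpu: "14000"                  # aggregate vCPU ceiling for the whole pool
    memory: 50000Gi               # optional aggregate memory ceiling (placeholder)
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
EOF
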
How important is this feature to you?
These issues are critical to the stability and efficiency of our cluster. The unknown node state issue directly impacts the reliability of our services, while high CPU usage and misconfigured CPU limits can lead to performance bottlenecks. Any guidance or fixes would significantly enhance the operation of our workloads at scale.

@mahaveer08 mahaveer08 added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 20, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Aug 20, 2024
@mahaveer08
Author

mahaveer08 commented Aug 20, 2024

These are the current node conditions:

Type            Status    Message
MemoryPressure  Unknown   Kubelet stopped posting node status.
DiskPressure    Unknown   Kubelet stopped posting node status.
PIDPressure     Unknown   Kubelet stopped posting node status.
Ready           Unknown   Kubelet stopped posting node status.
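
A quick way to surface every node stuck in this state (generic kubectl/jq, nothing Karpenter-specific; the node name in the second command is a placeholder):

# Nodes whose Ready condition has gone Unknown, i.e. the kubelet stopped reporting:
kubectl get nodes -o json | jq -r '
  .items[]
  | select(any(.status.conditions[]; .type == "Ready" and .status == "Unknown"))
  | .metadata.name'

# Full condition detail for one affected node:
kubectl describe node ip-10-0-0-1.example.internal | grep -A 12 'Conditions:'
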

@mahaveer08
Author

mahaveer08 commented Aug 21, 2024

What configuration should I add to resolve or remove nodes stuck in the "NotReady" state, and what are the common reasons nodes enter that state?

@mahaveer08
Author

Is there any configuration we need to add to handle nodes in the NotReady state so they can be automatically restarted or drained? Currently, these nodes get stuck and remain in that state.

@suraj2410

We are also facing the same issue, with nodes going into the NotReady state quite often and remaining there.

@mahaveer08
Author

mahaveer08 commented Aug 26, 2024

I'm not exactly sure why the kubelet keeps stopping communication on these nodes. Restarting the affected instances temporarily fixes the problem, but I have too many nodes in an unknown state to restart them all manually, and the issue happens frequently enough that I can't keep doing it by hand. I need urgent help with this @ellistarn @jigisha620

@mahaveer08
Author

Can anyone help with this? My nodes frequently go into an unknown state, with the condition status showing 'Kubelet stopped posting node status.'

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 8, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 7, 2025
@petewilcock

/remove-lifecycle rotten

We're also seeing this happen: nodes that came up healthily have the kubelet stop reporting node status and never recover. Karpenter tries to terminate pods, but they just hang in a 'Terminating' status indefinitely until the node is manually deleted, at which point the pods are rescheduled.

There could be some interaction with a Bottlerocket bug (bottlerocket-os/bottlerocket#4075); in our case it seems to hit smaller nodes running Kafka that see intermittent high network spikes. Something about the sudden network surge appears to break kubelet reporting, and Karpenter has no mechanism to recover from it.

So it might not be a Karpenter issue per se, but rather a matter of handling this condition, which may be caused by the underlying AMI.
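
For reference, the manual recovery described above is roughly the following sketch (the node name is a placeholder; for Karpenter-managed nodes, deleting the Node object should let Karpenter's finalizer terminate and replace the instance, and the stuck pods get rescheduled):

NODE=ip-10-0-0-1.example.internal   # placeholder for the stuck node

# Evict what can still be evicted; pods already stuck terminating need --force.
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --force --grace-period=0

# Remove the Node object so workloads reschedule and the instance gets replaced.
kubectl delete node "$NODE"
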

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 14, 2025
@rschalo
Contributor

rschalo commented Feb 3, 2025

If the kubelet has stopped posting its status, can you look into why that was happening? Was the kubelet on those nodes OOM-killed? I'm not sure this is a Karpenter issue, and it sounds like limits may be misconfigured, but the Node Auto Repair feature in Karpenter may be of interest to you:

https://karpenter.sh/v1.2/concepts/disruption/#node-auto-repair
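
If you can get a shell on one of the affected instances (SSM/SSH, or the admin container's root shell on Bottlerocket, depending on your setup), a rough sketch of what to check for a kubelet that stopped reporting:

# Is the kubelet still running, and when/why did it last restart?
systemctl status kubelet
journalctl -u kubelet --since "2 hours ago" --no-pager | tail -n 200

# Was anything OOM-killed (the kubelet itself, or something starving the node)?
dmesg -T | grep -iE 'out of memory|oom-killer|killed process'

# Resource pressure at the time of investigation
free -m
uptime
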

@rschalo
Contributor

rschalo commented Feb 3, 2025

/triage needs-information

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 3, 2025