
Keep getting CPUThrottlingHigh alert on the gc pod #1724

Closed
budimanjojo opened this issue May 28, 2024 · 5 comments · Fixed by #1728
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@budimanjojo
Contributor

What happened:
After updating to v0.16.0, I keep getting CPUThrottlingHigh alert on the garbage collection pod like this:

CPUThrottlingHigh (Info)
Description: 35.71% throttling of CPU in namespace kube-system for container gc in pod node-feature-discovery-gc-696b644f9-2rwql.

What you expected to happen: Everything should keep running as it did before. I have fairly default Helm values:

master:
  extraLabelNs:
    - gpu.intel.com

How to reproduce it (as minimally and precisely as possible): Use the latest v0.16.0 with the values above.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.30.0
  • Cloud provider or hardware configuration: baremetal
  • OS (e.g. cat /etc/os-release): Talos Linux
  • Kernel (e.g. uname -a): 6.6.29-talos
  • Install tools: Helm
  • Network plugin and version (if this is a network-related bug):
  • Others:
@budimanjojo added the kind/bug label on May 28, 2024
@budimanjojo
Contributor Author

Maybe the CPU limit set on this line is too low.

Or maybe there's a bug in the garbage collection logic that makes it consume too many resources.
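
For context, the gc resources block in the chart's values.yaml looks roughly like the sketch below; the numbers here are illustrative, not the chart's actual defaults. A CPU limit in the low tens of millicores is easy to exceed during a garbage collection pass, which is exactly the condition that fires CPUThrottlingHigh:

gc:
  resources:
    limits:
      cpu: 20m        # illustrative value: a limit this tight is easily hit during a GC cycle
      memory: 1Gi     # illustrative value
    requests:
      cpu: 10m        # illustrative value
      memory: 128Mi   # illustrative value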

@marquiz
Contributor

marquiz commented May 28, 2024

Thanks @budimanjojo for reporting this. How big is your cluster (ca. how many nodes)?

In retrospect, setting the CPU limits might not have been such a good idea. We might want to remove them (and cut a patch release) 🧐

The most immediate fix for you would probably be to remove the CPU limit, i.e. do the Helm install with --set gc.resources.limits.cpu=null
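
For anyone managing the release through a values file instead of --set, an equivalent override would look something like this (a sketch assuming the chart's gc.resources layout; Helm treats a null value in an override as removing that key, so only the CPU limit is dropped and the other resource settings keep their chart defaults):

gc:
  resources:
    limits:
      cpu: null   # drops the CPU limit; memory limit and requests keep their chart defaults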

@budimanjojo
Contributor Author

Hi @marquiz!
I have a 3-node cluster, so it's a pretty small one.

Yeah, I agree with having no CPU limits set by default, at least for the gc pod. Should I open a PR, or should I just wait?

@marquiz
Contributor

marquiz commented May 29, 2024

I have a 3-node cluster, so it's a pretty small one.

OK, not a huge one, then. 😅 Looks like we need to investigate that a bit further 🤔

Yeah, I agree with having no CPU limits set by default, at least for the gc pod. Should I open a PR, or should I just wait?

Please do, more contributors -> better 😊 Let's remove the CPU limits for all daemons. Also, we need to update the tables of parameters in docs/deployment/helm.md accordingly (for the new defaults).
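
Until a release with the new defaults is out, the same null-override approach can be applied to every daemon from a single values file. A sketch, assuming each daemon (master, worker, topologyUpdater, gc) exposes the same resources layout in the chart:

master:
  resources:
    limits:
      cpu: null   # remove the CPU limit for each daemon; other settings keep chart defaults
worker:
  resources:
    limits:
      cpu: null
topologyUpdater:
  resources:
    limits:
      cpu: null
gc:
  resources:
    limits:
      cpu: null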

@budimanjojo
Contributor Author

@marquiz I just created the PR, please take a look. Per your recommendation, I removed the CPU limits for all daemons instead of just the garbage collection pod.
