Scrapes hang when deadlocked #2413
Comments
👋 We're also seeing this on some very small clusters that we just upgraded to k8s […]. For us....
Maybe we've just been getting lucky, until now?
@Pokom - thanks for both the write up and addressing it in a PR! I'm still puzzled about why we only started seeing this on clusters after upgrading them from […]. Anyway, just recording it here in case somebody else happens across it. Looking forward to your PR making it in!

Update: While our symptoms match up exactly with the issue described here, the frequency was much higher. We found that the cause of our hangs turned out to be different (a bit related, but different). During our […]
@tkent Appreciate you confirming that you're experiencing the same symptoms I am! I'm running dozens of […]
/triage accepted
/assign @Pokom
What happened:
Occasionally in larger clusters, `kube-state-metrics` will fail to be scraped by both Prometheus and Grafana's Alloy. When attempting to curl the `/metrics` endpoint, we'll get to a certain point (usually pod disruption budgets) and then it just hangs. The only way to recover from this scenario is to restart the `kube-state-metric` pods impacted. This is also similar to #995 and #1028, but the difference is we're not seeing high churn on pods. Our ksm deployment is sharded, and not all of the shards fail at the same time.

What you expected to happen:

I would expect the server to eventually time out the connections, and not leave future scrapes blocked.
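For context on that expectation: Go's net/http server only enforces such deadlines when they are configured explicitly. Below is a minimal sketch, not kube-state-metrics' actual serving code, showing how `WriteTimeout` bounds a handler whose client has stopped reading; the address, handler, and durations are illustrative assumptions.

```go
package main

import (
	"net/http"
	"strings"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		// Stand-in for a large metrics payload being written to the client.
		payload := strings.Repeat("some_metric{shard=\"0\"} 1\n", 1_000_000)
		w.Write([]byte(payload))
	})

	srv := &http.Server{
		Addr:    ":8080",
		Handler: mux,
		// Without a WriteTimeout, a client that stops reading can leave
		// this Write blocked forever, along with anything waiting on
		// locks the handler holds.
		ReadTimeout:  30 * time.Second,
		WriteTimeout: 60 * time.Second,
		IdleTimeout:  120 * time.Second,
	}
	srv.ListenAndServe()
}
```

With a write deadline set, the blocked `Write` eventually fails and the handler can return instead of holding its locks indefinitely.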
How to reproduce it (as minimally and precisely as possible):
Run the tip of `kube-state-metrics` and have it access a decently large cluster. With `kube-state-metrics` up and running, launch the following Go program, which is meant to emulate a client failing to fully fetch metrics:
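The program attached to the original report isn't reproduced in this capture; a minimal sketch of such a client, assuming kube-state-metrics is serving metrics on localhost:8080, might look like this (it reads only a small prefix of the response and never closes the body):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Assumed kube-state-metrics metrics endpoint; adjust to your deployment.
	const target = "http://localhost:8080/metrics"

	// Start several scrapes that stall partway through and never close
	// their response bodies, so the server's writes eventually block.
	for i := 0; i < 10; i++ {
		go func(id int) {
			resp, err := http.Get(target)
			if err != nil {
				fmt.Printf("client %d: %v\n", id, err)
				return
			}
			// Read only a small prefix of the body, then hold the
			// connection open forever. Deliberately no resp.Body.Close().
			buf := make([]byte, 512)
			n, _ := resp.Body.Read(buf)
			fmt.Printf("client %d: read %d bytes, now idling\n", id, n)
			select {}
		}(i)
	}

	// Keep the process (and its half-read connections) alive.
	time.Sleep(time.Hour)
}
```

After it has been running for a bit, a plain curl of `/metrics` against the same pod should stall partway through, as described below.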
After a few clients fail to close their body, accessing the `/metrics` endpoint will stall and you'll get partial results from curl. If you look at `/debug/pprof/goroutines` you'll notice that the number of goroutines continues increasing. You'll find a single goroutine that's blocking all the write goroutines; its stack appears in the goroutine dumps attached below.

Anything else we need to know?:
Here are goroutine dumps that show that all goroutines are stuck and blocked on reading.
ksm-groutine-dump-2024-06-04-regular.txt
ksm-groutine-dump-2024-06-04-debug-2.txt
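For anyone collecting similar evidence: dumps like the ones above can be pulled from the standard net/http/pprof endpoint while the hang is in progress. A small sketch, assuming the debug endpoint is reachable on localhost:8080 (the port and path depend on how your deployment exposes pprof):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// debug=2 asks the standard net/http/pprof handler for full stack
	// traces of every goroutine, which is what the attached dumps contain.
	resp, err := http.Get("http://localhost:8080/debug/pprof/goroutine?debug=2")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	out, err := os.Create("ksm-goroutine-dump.txt")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```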
Environment:
- kube-state-metrics version: v2.12.0
- Kubernetes version (use `kubectl version`): v1.28.7-gke.1026000