Repeated OOM'ing (perhaps due to a large number of namespaces) #493
This might be an instance of #461. I upped the CPU limit.
Could you try removing the addon-resizer and removing all resource limits and requests? I have a feeling that our current resource recommendations are off; they come from scalability tests from around a year ago.
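For reference, a minimal sketch of what that could look like in the kube-state-metrics Deployment, with the addon-resizer sidecar dropped and no `resources:` stanza. The names, namespace, and image tag are illustrative and will differ from the kube-prometheus manifests:

```yaml
# Sketch only: kube-state-metrics without the addon-resizer sidecar and
# without resource requests/limits, so the pod is bounded only by node
# capacity while you observe its real usage. Adjust names to your setup.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics   # assumed existing ServiceAccount
      containers:
      - name: kube-state-metrics
        image: quay.io/coreos/kube-state-metrics:v1.3.1   # illustrative tag
        ports:
        - name: http-metrics
          containerPort: 8080
        - name: telemetry
          containerPort: 8081
        # no addon-resizer sidecar and no resources: block here
```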
I went the other way and increased the limits until kube-state-metrics stopped hitting them. As alluded to in the graphs above, once I give it a bit more CPU, memory usage stops growing and resource usage stabilizes. But now Prometheus reports scrape timeouts. Some additional counts:
You can try increasing your scrape interval up to 2 minutes. The scrape interval configures the timeout, which is what you are seeing. Up to 2 minutes is generally an acceptable upper bound for scrape intervals.
Sorry, that was incorrect. I meant that it's generally safe to bump the timeout up to the scrape interval, and the scrape interval itself is safe up to 2 minutes.
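For kube-prometheus users, that advice translates roughly to the ServiceMonitor for kube-state-metrics. This is a sketch against the Prometheus Operator CRD; the selector labels and port names are assumptions that may differ from your manifests:

```yaml
# Sketch: longer scrape interval with the timeout bumped up to match.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-state-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: kube-state-metrics      # assumed label
  endpoints:
  - port: http-metrics             # assumed port name
    interval: 2m                   # ~2 minutes as a rough upper bound
    scrapeTimeout: 2m              # timeout raised up to the scrape interval
  - port: telemetry
    interval: 2m
    scrapeTimeout: 2m
```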
I upped the resource allocations and the scrape timeouts, and let it sit for a couple of days. I think something is up: scrape time is usually 10-20s, but it has roughly hourly spikes to over 100s. CPU usage looks pretty constant. Memory usage shows a step function over time. I don't know much about the Go pprof stuff, so I don't know if it's useful, but here's a collection of heap dumps:
alloc_objects.txt
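For anyone wanting to reproduce this kind of data, here is a rough sketch of how heap profiles can be collected and read with Go's pprof tooling. The port-forward target, the telemetry port, and the presence of the standard `/debug/pprof` handlers on kube-state-metrics are assumptions to verify against your deployment and version:

```sh
# Sketch: forward the assumed telemetry port of kube-state-metrics locally.
kubectl -n monitoring port-forward deploy/kube-state-metrics 8081:8081 &

# Grab a heap profile (assumes net/http/pprof handlers are registered).
curl -s http://localhost:8081/debug/pprof/heap > heap.pprof

# Top consumers by in-use memory and by allocated objects.
go tool pprof -top -inuse_space heap.pprof
go tool pprof -top -alloc_objects heap.pprof

# Or browse the profile interactively in a web UI.
go tool pprof -http=:8080 heap.pprof
```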
Thanks a lot for those numbers and profiles! From the profiles it looks like the majority comes from just producing the Prometheus metric output, which is arguably not as efficient as it should be. We should investigate more performant solutions, especially since kube-state-metrics only needs a small subset of the library's features. cc @mxinden The second largest usage seems to be JSON parsing, which will go away in v1.4.0, as protobuf is used for communication by default there.
I had the same issue with OOM crash loops, and removing the addon-resizer fixed it. We have around 100+ namespaces.
@directionless how many namespaces are we talking about?
@mrsiano 200-1000 per cluster. Some conversation elsewhere suggested it might be more related to my pod/node ratio: I was running at close to 100 pods/node, and folks commented that a lot of the tuning expects more like 30 pods/node. I've since shut down this bit of Prometheus, so I can't easily test it.
For anyone interested, I am currently working on a performance optimization. The current effort and container images can be found here: #534. Feedback is very welcome.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@directionless we released some pretty significant improvements with v1.5.0, have you had a chance to try out that release yet? 🙂
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
For anyone else who lands here investigating a similar issue: it seems like a large aggregate number of any and all resources tracked by this exporter can cause it to use a fair bit of memory. The simplest way to check is to query Prometheus for counts of the per-resource metrics (see the example queries below). In my case I learned that Helm doesn't necessarily clean up old release revisions, and I had 3600+ ConfigMaps cluttering up the cluster. Once you get your house in order, you can restart the exporter to check that its memory usage is within reason, and then bump the limit back down.
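A few illustrative PromQL queries along those lines. The metric names come from kube-state-metrics' standard metric families, but verify the exact names exposed by your version:

```promql
# Rough object counts per resource type, as seen by kube-state-metrics.
count(kube_pod_info)
count(kube_configmap_info)
count(kube_secret_info)

# Which namespaces contribute the most objects of a given kind.
sort_desc(count by (namespace) (kube_configmap_info))

# Total number of samples Prometheus pulls from the kube-state-metrics target
# per scrape (assumes the job label is "kube-state-metrics").
scrape_samples_scraped{job="kube-state-metrics"}
```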
/kind bug
What happened:
I'm running kube-state-metrics as part of kube-prometheus, but it's repeatedly being OOMKilled. I suspect this is because of the large number of namespaces we have. Some bits of information:
The resource request and limits are { "cpu": "188m", "memory": "5290Mi" }. (Unfortunately, I'm having trouble getting resource utilization before the OOM.)
What you expected to happen:
Not OOM
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
Kubernetes version (use kubectl version):