Repeated OOM'ing (perhaps due to a large number of namespaces) #493

Closed
directionless opened this issue Jul 13, 2018 · 17 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@directionless

directionless commented Jul 13, 2018

/kind bug

What happened:

I'm running kube-state-metrics as part of kube-prometheus, but it is repeatedly being OOMKilled.

I suspect this is because of the large number of namespaces we have. Some bits of information:

$ kubectl get ns | wc -l
     238

$ kubectl get nodes | wc -l
      47

$ kubectl get pods --all-namespaces | wc -l
    4008

$ kubectl get secrets --all-namespaces | wc -l
    8313

The resource requests and limits are: { "cpu": "188m", "memory": "5290Mi" }. (Unfortunately, I'm having trouble capturing resource utilization before the OOM.)
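
(For reference, a rough way to capture utilization before the next kill, assuming metrics-server is available and the pod carries an app=kube-state-metrics label in the monitoring namespace; both names are guesses for a kube-prometheus install:)

kubectl -n monitoring top pod -l app=kube-state-metrics
# after a kill, the previous container's termination reason is recorded on the pod
kubectl -n monitoring get pod -l app=kube-state-metrics \
  -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'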

What you expected to happen:

Not OOM

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.7", GitCommit:"dd5e1a2978fd0b97d9b78e1564398aeea7e7fe92", GitTreeState:"clean", BuildDate:"2018-04-19T00:05:56Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.4-gke.2", GitCommit:"eb2e43842aaa21d6f0bb65d6adf5a84bbdc62eaf", GitTreeState:"clean", BuildDate:"2018-06-15T21:48:39Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}
  • Kube-state-metrics image version
"quay.io/coreos/kube-state-metrics:v1.3.1"
@directionless
Author

This might be an instance of #461. I upped the CPU limit.

[screenshot: resource usage graphs, 2018-07-13 21:38]

@brancz
Member

brancz commented Jul 16, 2018

Could you try removing the addon-resizer and dropping all resource limits and requests? I have a feeling that the resource recommendations we currently have are off; they come from the scalability tests of around a year ago.
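
(Roughly, that would look something like the following, assuming the deployment is named kube-state-metrics in the monitoring namespace and the main container is first in the pod spec; adjust names to your install:)

# drop the requests/limits from the kube-state-metrics container entirely
kubectl -n monitoring patch deployment kube-state-metrics --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources"}]'
# the addon-resizer sidecar itself would need to be removed from the deployment spec,
# e.g. via kubectl -n monitoring edit deployment kube-state-metrics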

@directionless
Author

directionless commented Jul 16, 2018

I went the other way and increased the limits until kube-state-metrics stopped hitting them. As the graph above suggests, once I push the CPU allocation a bit higher, memory usage stops growing and resource usage stabilizes.

But now Prometheus reports context deadline exceeded when trying to scrape it.

Some additional counts:

$ kubectl get pods --all-namespaces -a | wc -l
    4732
$ kubectl get jobs --all-namespaces -a | wc -l
    4651
$ kubectl get cronjobs --all-namespaces -a | wc -l
No resources found.
       0

@brancz
Member

brancz commented Jul 17, 2018

You can try increasing your scrape interval to up to 2 minutes. The scrape interval configures the timeout, which is what you are seeing. Up to 2 minutes is generally an acceptable upper bound for scrape intervals.

@brancz
Member

brancz commented Jul 17, 2018

Sorry, that was incorrect. I meant that it's generally safe to bump the timeout up to the scrape interval, and the scrape interval itself is safe up to 2 minutes.
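
(Before bumping either value, it can help to measure how long a full scrape actually takes; a rough sketch, assuming the usual kube-state-metrics port 8080 and substituting a real pod name:)

kubectl -n monitoring port-forward kube-state-metrics-xxxxx 8080:8080 &   # hypothetical pod name
time curl -s http://localhost:8080/metrics > /dev/null   # wall-clock time vs. the configured timeout

The knobs themselves are scrape_interval/scrape_timeout in a plain Prometheus config, or interval/scrapeTimeout on a ServiceMonitor endpoint when using the Prometheus Operator.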

@directionless
Author

I upped the resource allocations and the scrape timeouts, and let it sit a couple of days. I think something is up.

Scrape time is usually 10-20s, but it has roughly hourly spikes to over 100s. CPU usage looks pretty constant. Memory usage shows a step function over time.

I don't know much about Go's pprof tooling, but if I look at /debug/pprof, the heap number grows over time. I don't know where that shows up in the graphed debug metrics, and I'm somewhat guessing with pprof.

I don't know if it's useful, but here's a collection of heap dumps, collected with:

mkdir -p /tmp/pprofs
for arg in inuse_space inuse_objects alloc_space alloc_objects; do
  echo "$arg"   # progress marker on stdout
  go tool pprof -"$arg" -top http://localhost:6060/debug/pprof/heap > /tmp/pprofs/"$arg".txt
done

alloc_objects.txt
alloc_space.txt
inuse_objects.txt
inuse_space.txt
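
(If it helps to compare snapshots over time, keeping the raw profiles alongside the -top output works too; a sketch against the same pprof endpoint:)

snap=/tmp/pprofs/heap-$(date +%s).pb.gz
curl -s http://localhost:6060/debug/pprof/heap > "$snap"
go tool pprof -top "$snap"                                  # same -top view, from the saved file
# later: go tool pprof -top -base older.pb.gz newer.pb.gz   # shows only the growth between two snapshots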

@brancz
Member

brancz commented Jul 20, 2018

Thanks a lot for those numbers and profiles! From the profiles it looks like the majority comes from just producing the Prometheus metric output, which is arguably not as efficient as it should be. We should investigate higher-performance solutions, especially since kube-state-metrics only requires a small subset of the library's features. cc @mxinden

The second-largest usage seems to be JSON parsing, which will go away in v1.4.0, as protobuf will be used for communication by default.

@jakewarr8

I had the same issue with OOM crash loops, and removing the addon-resizer fixed it. We have 100+ namespaces.

@mrsiano

mrsiano commented Oct 3, 2018

@directionless how many namespaces are we talking about?
/cc @mrsiano

@directionless
Author

@mrsiano 200-1000 per cluster.

Some conversation elsewhere suggested it might be more related to my pod/node ratio. I was running at close to 100 pods/node, and folks commented that a lot of the tuning expects more like 30 pods/node.
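
(For reference, a quick shell one-off to compute that ratio from counts like the ones above:)

pods=$(kubectl get pods --all-namespaces --no-headers | wc -l)
nodes=$(kubectl get nodes --no-headers | wc -l)
echo "$(( pods / nodes )) pods per node"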

I've since shut down this bit of Prometheus, so I can't easily test it.

@mxinden
Contributor

mxinden commented Oct 5, 2018

For anyone interested, I am currently working on a performance optimization. The current effort and container images can be found in #534. Feedback is very welcome.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 9, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 13, 2019
@brancz
Member

brancz commented Feb 18, 2019

@directionless we released some pretty significant improvements with v1.5.0; have you had a chance to try out that release yet? 🙂

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wrossmann

For anyone else who lands here investigating a similar issue: a large aggregate number of any/all resources tracked by this exporter can cause it to use a fair bit of memory. The simplest way to check is to query for counts of metrics like kube_*; if that's fallen out of your history, you can also bump up the memory limit on the exporter and then query it directly.
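
(A rough sketch of the direct query, assuming a port-forward to the exporter on its usual port 8080; inside Prometheus, something like topk(20, count by (__name__) ({__name__=~"kube_.*"})) gives a similar breakdown:)

# count time series per metric family straight from the exporter's output
curl -s http://localhost:8080/metrics | grep -v '^#' | cut -d'{' -f1 | sort | uniq -c | sort -rn | head -20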

In my case I learned that Helm doesn't necessarily clean up old release revisions, and I had 3600+ ConfigMaps cluttering up the cluster.

Once you get your house in order you can restart the exporter to check that its memory usage is within reason, and then bump the limit back down.
