Memory leak? #3044

Closed
bradleyd opened this issue Apr 13, 2020 · 36 comments
Labels
area/cluster-autoscaler lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@bradleyd

bradleyd commented Apr 13, 2020

We are running CA 1.15.5 with k8s 1.15.7 and are seeing memory gradually grow over time. We have the limit set to 1Gi, but CA reaches it in about a day and then gets OOM-killed.

[screenshot: memory usage graph, 2020-04-13]

Here is our config in the deployment

      - command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --balance-similar-node-groups
        - --expander=random
        - --nodes=5:20:nodes-us-west-2b.cluster.foo.com
        - --nodes=5:20:nodes-us-west-2c.cluster.foo.com
        - --nodes=5:20:nodes-us-west-2d.cluster.foo.com
        - --nodes=1:18:pgpool-nodes.cluster.foo.com
        - --nodes=2:16:postgres-nodes.cluster.foo.com
        - --nodes=1:4:api-nodes-us-west-2b.cluster.foo.com
        - --nodes=1:4:api-nodes-us-west-2c.cluster.foo.com
        - --nodes=1:4:api-nodes-us-west-2d.cluster.foo.com
        - --nodes=0:5:cicd-nodes-us-west-2b.cluster.foo.com
        - --nodes=0:5:cicd-nodes-us-west-2c.cluster.foo.com
        - --nodes=0:5:cicd-nodes-us-west-2d.cluster.foo.com
        - --nodes=0:5:haproxy-nodes-us-west-2b.cluster.foo.com
        - --nodes=0:5:haproxy-nodes-us-west-2c.cluster.foo.com
        - --nodes=0:5:haproxy-nodes-us-west-2d.cluster.foo.com
        env:
        - name: AWS_REGION
          value: us-west-2
        image: k8s.gcr.io/cluster-autoscaler:v1.15.5
        imagePullPolicy: Always
        name: cluster-autoscaler
        resources:
          limits:
            cpu: 100m
            memory: 1Gi
          requests:
            cpu: 100m
            memory: 500Mi

Any thoughts?

@marwanad
Member

Curious about the workload you're running: how many pods are running, and how many pending pods do you see during those spikes?

@bradleyd
Author

@marwanad thanks for responding. Those spikes take about a day to manifest. We have a pretty steady workload: about 700+ pods, up to ~1900 if you count cron jobs.

@marwanad
Member

marwanad commented Apr 14, 2020

@bradleyd interesting, I've hit weird memory behaviour as the number of unschedulable pods (as logged by CA) increases. In most cases, around 3k pods or so would land anywhere between 500-700Mi. I haven't gotten around to running pprof yet.

From what @MaciekPytel mentioned on Slack, it seems like this is not unexpected given that watches cache all nodes and pods in memory.

@MaciekPytel
Contributor

I would definitely expect CA memory to grow with cluster size (#pods, #nodes and especially #ASGs), but the number of pods mentioned is not that large; this is more memory than I'd expect. Note that I only run clusters on GCP/GKE, so my intuition may be way off for the memory use of the AWS provider.

@bradleyd
Author

@MaciekPytel those were my thoughts too. I get that memory will grow in proportion to the cluster size, but this seems like a lot ™

@marwanad
Member

@bradleyd do you have pods using affinity/anti-affinity rules, by any chance? I ran a quick pprof and it seems the MatchInterPodAffinity predicate has a heavy footprint.
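
In case anyone wants to grab a profile themselves, something like the following should work (a rough sketch: it assumes CA was started with --profiling so the /debug/pprof handlers are served on the metrics port, and that the Deployment is named cluster-autoscaler in kube-system; adjust names to your setup):

    # forward CA's metrics/debug port (default :8085) to localhost
    kubectl -n kube-system port-forward deploy/cluster-autoscaler 8085:8085 &

    # pull a heap profile and open the interactive pprof viewer;
    # "top" lists the biggest allocators, "web" renders a call graph
    go tool pprof http://localhost:8085/debug/pprof/heap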

@bradleyd
Author

[screenshot: memory usage graph]

Here is the change (after an OOM): over two days it went from 400MB to over 900MB. This looks a lot like bloat or a leak to me.

@srspnda

srspnda commented Apr 16, 2020

I'm seeing the same issue on a 6-node cluster, Amazon EKS v1.15, k8s.gcr.io/cluster-autoscaler:v1.15.5. CA runs into the memory limit, gets OOM-killed, and is then restarted. This behavior seems to repeat on a ~7-day cycle.

@stefansedich
Contributor

We just upgraded to 1.15.6 from 1.14.x, and CA was OOMing on startup with the requests and limits we had previously set; we had to increase them significantly to get CA to start up.

Has anything changed to greatly increase the memory footprint?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 16, 2020
@nitrag
Contributor

nitrag commented Jul 29, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 29, 2020
@nitrag
Contributor

nitrag commented Jul 29, 2020

30-40 pods, 10-15 nodes (plus another ~15 terminated nodes), on AWS. cluster-autoscaler is using 800MB+ of memory.

My other two clusters, which have 6-8 nodes, use only ~200MB.

@nitrag
Contributor

nitrag commented Jul 29, 2020

I have another cluster with 100+ pods and 20+ nodes, running 1.14.6, where cluster-autoscaler uses <200MB.

@infa-ddeore

infa-ddeore commented Oct 14, 2020

Any updates on this? The 1.15.7 CA is getting killed by k8s due to high memory usage:
CA reached 1.3G of memory utilisation on a cluster of ~80 nodes and was killed by k8s.

The option we tried is --aws-use-static-instance-list=true ("Should CA fetch instance types in runtime or use a static list. AWS only"), and memory utilisation dropped to ~250MB.

There is no issue with the 1.14 CA.
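
For anyone else who wants to try the same mitigation, it is just an extra argument on the cluster-autoscaler container, e.g. against a manifest like the one at the top of this issue (a sketch; everything else stays the same):

      - command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --aws-use-static-instance-list=true
        # ...remaining --nodes and other flags as before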

@MaciekPytel
Contributor

@Jeffwan @jaypipes Based on comments above this seems like it may be a memory leak in AWS provider.

@nitrag
Contributor

nitrag commented Oct 15, 2020

so the option we tried is --aws-use-static-instance-list=true (Should CA fetch instance types in runtime or use a static list. AWS only) and now the memory utilisation dropped down to ~250MB

I tried @infa-ddeore's suggestion of the static instance list. On my two clusters with minimal load I average around 250MB before and after. On my production cluster, memory had been ~650-750MB over the past few days, but after enabling the static instance list it jumped to 950MB.

@nitrag
Contributor

nitrag commented Oct 15, 2020

Well, now after reverting, it starts at 500MB and within 15 seconds climbs to 950MB before k8s kills it for exceeding its memory request. HELP!

[logs attached]

@jaypipes
Contributor

any updates on this? 1.15.7 CA is getting killed by k8 due to high memory usage.
CA reached to 1.3G memory utilisation with cluster of ~80 nodes but was killed by k8

so the option we tried is --aws-use-static-instance-list=true (Should CA fetch instance types in runtime or use a static list. AWS only) and now the memory utilisation dropped down to ~250MB

there is no issue with 1.14 CA

I'm going to have to bisect to see what changed between the 1.14 and 1.15 CA releases for AWS. At this point I'm just not sure; I didn't think there were actually many changes to the AWS-specific code between 1.14 and 1.15.

@varunpalekar

We are also seeing a similar issue: at startup, memory goes up to around 1 GB, but after 5 minutes it is down to only about 100 MB.

cluster autoscaler image: tried both 1.15.6 and 1.15.7
k8s version: v1.15.10
cloud-provider: aws
Cluster Size: approx 40 nodes

@nitrag
Contributor

nitrag commented Nov 18, 2020

--aws-use-static-instance-list=true
Didn't help us.

Holy crap. Container aws-cluster-autoscaler was using 1149444Ki, which exceeds its request of 900Mi.
@jaypipes I'm going to reach out to you via email so we can troubleshoot this 1:1.

@nitrag
Contributor

nitrag commented Nov 18, 2020

I upgraded cluster-autoscaler to 1.16.7 (from 1.15.7) and memory one minute after boot is 375MiB. I will continue to monitor.

Simultaneously, I was removing 55k completed batch Jobs (cron jobs) that were just lingering on the API server as complete. Does anyone else with this issue have similar latent storage (kubectl get jobs --all-namespaces)? I don't see how this could be related, but you never know.
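
For reference, one rough way to purge completed Jobs, namespace by namespace (a sketch; <namespace> is a placeholder and the filter assumes single-completion Jobs):

    # select Jobs whose status.succeeded is 1 and delete them
    kubectl -n <namespace> get jobs \
      -o jsonpath='{.items[?(@.status.succeeded==1)].metadata.name}' \
      | xargs -r kubectl -n <namespace> delete job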

System Info:
 Boot ID:                    872ea78f-3818-48bc-a30b-a9c402c1bc04
 Kernel Version:             4.14.198-152.320.amzn2.x86_64
 OS Image:                   Amazon Linux 2
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://19.3.6
 Kubelet Version:            v1.16.13-eks-ec92d4
 Kube-Proxy Version:         v1.16.13-eks-ec92d4

Update 3/31/21:

We experienced this issue again with EKS 1.15 and amazon-eks-node-1.15-v20201126 AMI. Our EKS 1.16 clusters are doing fine as mentioned previously.

@seamusabshere

--aws-use-static-instance-list=true

This does not fix it.

    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 09 Dec 2020 09:41:39 -0500
      Finished:     Wed, 09 Dec 2020 09:43:50 -0500
    Ready:          True
    Restart Count:  1
    Limits:
      memory:  6Gi

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 9, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 30, 2021
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 4, 2021
@mshivaramie22

/remove-lifecycle rotten

@nitrag
Contributor

nitrag commented May 18, 2021

I'd like to point out an observation. We just upgraded the last of our production clusters last Friday (4 days ago) and I noticed something:

EKS Upgrade to 1.18.
Cluster-Autoscaler to 1.17.4.

Memory usage was high for the entire weekend, across pod restarts, different nodes, the latest AMI, etc. Within about 30 seconds of starting, aws-cluster-autoscaler would climb to 1.6GB MEM USAGE per docker stats.

Looking at our logs, we noticed entries we had not seen before:

I0515 13:23:27.351494       1 trace.go:116] Trace[1286906120]: "Reflector ListAndWatch" name:k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:328 (started: 2021-05-15 13:23:13.91257067 +0000 UTC m=+37587.395469158) (total time: 13.43885801s):
Trace[1286906120]: [13.283242498s] [13.283242498s] Objects listed
Trace[1286906120]: [13.283243098s] [600ns] Resource version extracted
Trace[1286906120]: [13.330326445s] [47.083347ms] Objects extracted
Trace[1286906120]: [13.43885412s] [108.527675ms] SyncWith done
Trace[1286906120]: [13.43885526s] [1.14µs] Resource version updated
Trace[1286906120]: [13.43885801s] [2.75µs] END

This led me to investigate how large our Job history was (because EKS does not clean up Jobs once they complete! You should enable this, AWS team; I think it's the TTL settings, see the sketch at the end of this comment).

kubectl get jobs --all-namespaces | wc -l

answer: 83339

Eighty thousand!

It took me 2+ days to clean these up. I had to write a Python program to do it via the Kube API.

It's now down to a reasonable 4500 (I programmed it to keep a 15-day history).

Fast forward to today: I checked the aws-cluster-autoscaler pod and it was still at 1.6GB of memory usage (docker stats)... I deleted the pod. The replacement pod was scheduled on the same node and has been running for 30+ minutes at 330MB. WOW!
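
About the TTL settings mentioned above: depending on the cluster version (this was behind the TTLAfterFinished feature gate before it became stable), finished Jobs can be garbage-collected automatically by setting ttlSecondsAfterFinished on the Job spec, or on a CronJob's jobTemplate, roughly like this (a sketch; the name, schedule and TTL value are just examples):

    apiVersion: batch/v1beta1            # batch/v1 on 1.21+
    kind: CronJob
    metadata:
      name: example-cron                 # placeholder
    spec:
      schedule: "*/30 * * * *"
      jobTemplate:
        spec:
          ttlSecondsAfterFinished: 86400   # clean up finished Jobs after one day
          template:
            spec:
              containers:
              - name: job
                image: busybox
                command: ["sh", "-c", "echo hello"]
              restartPolicy: OnFailure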

@jaypipes

@edgan

edgan commented Jul 22, 2021

I just saw this with Cluster Autoscaler 1.20.0. It is a Jenkins/kube master running in AWS with an ASG. Two pods (Jenkins slave agents) run Jenkins jobs every half hour, so 2 jobs * 24 hours * 2 (runs per hour) = 96 jobs daily. The instance in the ASG is fairly static given the frequency of the Jenkins jobs.

The cluster-autoscaler is installed via helm, and had been running for about 15 days.

The pod had no resource limits, which I am fixing now, but that will just cause it to be auto-restarted. It did OOM the host.

kernel: Out of memory: Killed process 2056769 (cluster-autosca) total-vm:1340256kB, anon-rss:383832kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:992kB oom_score_adj:1000

1340256 kB / 1024 / 1024 ≈ 1.28 GB

The instance is an m5.large with 7749MB of memory. The region is us-east-2. The cluster-autoscaler process was using 16.9% of memory.

Update: Interestingly, before I added the resource limits, the restart counter was at 68, and the cluster-autoscaler process had OOMed the box 65 times over the span of 4.5 hours all on July 22. So it had been mostly fine, then something changed in AWS, and memory usage blew up not just once, but repeatedly.

The total-vm counts across the 65 OOMs.

 12 1340512kB
 53 1340256kB

@xavipanda

There is a leak for sure; we got OOM-killed at a 1G limit...
Our older version of the autoscaler was fine, averaging around 250M.

@fblgit

fblgit commented Sep 25, 2021

Same here, using AWS. There is no way this could justify 1G of consumption...

There is a leak, and it's related to the AWS provider. It can be reproduced with multiple ASGs, multiple nodes in each, and adding/deleting ASGs: nodes coming in and out contribute to the effect.

Unfortunately, to reproduce and eventually fix this you have to spend a reasonable amount of money in AWS.

@mdshoaib707

I am also facing the OOM issue.
Cluster version => v1.19.12
Cluster-autoscaler version => v1.20.0
Pod memory limits => 2046Mi

@jaredhancock31

Any updates on this? Or any established workarounds for large clusters?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 20, 2022
@dalvarezquiroga

I think there is an issue describing the same problem here:

#3506

It looks like it is fixed in a newer version of Cluster Autoscaler.

One of the users there reports:
"I've deployed v1.22.1 into a cluster which was previously seeing an OOM kill with a memory limit of 300Mi. It's fixed the problem for us."
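
For anyone who wants to try that, the upgrade is usually just a matter of bumping the image tag in the CA Deployment; newer releases are published under the autoscaling path, so double-check the registry and tag for your version (sketch):

        image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.22.1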

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 23, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
