Memory leak? #3044

Closed
bradleyd opened this issue Apr 13, 2020 · 36 comments
Labels
area/cluster-autoscaler lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@bradleyd

bradleyd commented Apr 13, 2020

We are running CA 1.15.5 with k8s 1.15.7 and are seeing memory gradually grow over time. We have the limit set to 1Gi, but CA reaches it in about a day and then gets OOM-killed.

[screenshot: memory usage graph, 2020-04-13]

Here is our config in the deployment

      - command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --balance-similar-node-groups
        - --expander=random
        - --nodes=5:20:nodes-us-west-2b.cluster.foo.com
        - --nodes=5:20:nodes-us-west-2c.cluster.foo.com
        - --nodes=5:20:nodes-us-west-2d.cluster.foo.com
        - --nodes=1:18:pgpool-nodes.cluster.foo.com
        - --nodes=2:16:postgres-nodes.cluster.foo.com
        - --nodes=1:4:api-nodes-us-west-2b.cluster.foo.com
        - --nodes=1:4:api-nodes-us-west-2c.cluster.foo.com
        - --nodes=1:4:api-nodes-us-west-2d.cluster.foo.com
        - --nodes=0:5:cicd-nodes-us-west-2b.cluster.foo.com
        - --nodes=0:5:cicd-nodes-us-west-2c.cluster.foo.com
        - --nodes=0:5:cicd-nodes-us-west-2d.cluster.foo.com
        - --nodes=0:5:haproxy-nodes-us-west-2b.cluster.foo.com
        - --nodes=0:5:haproxy-nodes-us-west-2c.cluster.foo.com
        - --nodes=0:5:haproxy-nodes-us-west-2d.cluster.foo.com
        env:
        - name: AWS_REGION
          value: us-west-2
        image: k8s.gcr.io/cluster-autoscaler:v1.15.5
        imagePullPolicy: Always
        name: cluster-autoscaler
        resources:
          limits:
            cpu: 100m
            memory: 1Gi
          requests:
            cpu: 100m
            memory: 500Mi

Any thoughts?

@marwanad
Member

Curious about the workload you're running: how many pods are running, and how many pending pods do you see during those spikes?

@bradleyd
Author

@marwanad thanks for responding. Those spikes take about a day to manifest. We have a pretty steady workload: about 700+ pods, up to ~1900 if you count cron jobs.

@marwanad
Member

marwanad commented Apr 14, 2020

@bradleyd interesting, I've hit weird memory behaviour as the number of unschedulable pods (as logged by CA) increases. In most cases, around 3k pods or so would land anywhere between 500-700Mi. I haven't gotten around to running pprof yet.

From what @MaciekPytel mentioned on Slack, it seems like this is not unexpected given that watches cache all nodes and pods in memory.

@MaciekPytel
Contributor

I would definitely expect CA memory to grow with cluster size (#pods, #nodes and especially #ASGs), but the number of pods mentioned is not that large; this is more memory than I'd expect. Note that I only run clusters on GCP/GKE, so my intuition may be way off for the memory use of the AWS provider.

@bradleyd
Author

@MaciekPytel those were my thoughts too. I get that memory will grow in proportion to the cluster size, but this seems like a lot ™

@marwanad
Member

@bradleyd do you have pods using affinity/anti-affinity rules, by any chance? I ran a quick pprof and it seems the MatchInterPodAffinity predicate has a heavy footprint.
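
In case anyone wants to grab a profile themselves, something like the following should work (a rough sketch: it assumes CA was started with --profiling so the /debug/pprof handlers are served on the metrics port, and that the Deployment is named cluster-autoscaler in kube-system; adjust names to your setup):

    # forward CA's metrics/debug port (default :8085) to localhost
    kubectl -n kube-system port-forward deploy/cluster-autoscaler 8085:8085 &

    # pull a heap profile and open the interactive pprof viewer;
    # "top" lists the biggest allocators, "web" renders a call graph
    go tool pprof http://localhost:8085/debug/pprof/heap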

@bradleyd
Author

[screenshot: memory usage graph]

Here is the change (after an OOM): over two days it went from 400MB to over 900MB. This looks a lot like bloat or a leak to me.

@srspnda

srspnda commented Apr 16, 2020

I'm seeing the same issue on a 6-node cluster, Amazon EKS v1.15, k8s.gcr.io/cluster-autoscaler:v1.15.5. CA runs into the memory limit, gets OOM-killed, and is then restarted. This behavior seems to repeat on a ~7-day cycle.

@stefansedich
Contributor

We just upgraded to 1.15.6 from 1.14.x, and CA was OOMing on startup with the requests and limits we had previously set; we had to increase them significantly to get CA to start up.

Has anything changed to greatly increase the memory footprint?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 16, 2020
@nitrag
Contributor

nitrag commented Jul 29, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 29, 2020
@nitrag
Contributor

nitrag commented Jul 29, 2020

30-40 pods, 10-15 nodes (plus another ~15 terminated nodes), on AWS. cluster-autoscaler is using 800MB+ of memory.

My other two clusters, which have 6-8 nodes, use only ~200MB.

@nitrag
Contributor

nitrag commented Jul 29, 2020

I have another cluster with 100+ pods and 20+ nodes, running 1.14.6, where cluster-autoscaler uses <200MB.

@infa-ddeore

infa-ddeore commented Oct 14, 2020

Any updates on this? The 1.15.7 CA is getting killed by k8s due to high memory usage:
CA reached 1.3G of memory utilisation on a cluster of ~80 nodes and was killed by k8s.

The option we tried is --aws-use-static-instance-list=true ("Should CA fetch instance types in runtime or use a static list. AWS only"), and memory utilisation dropped to ~250MB.

There is no issue with the 1.14 CA.
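
For anyone else who wants to try the same mitigation, it is just an extra argument on the cluster-autoscaler container, e.g. against a manifest like the one at the top of this issue (a sketch; everything else stays the same):

      - command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --aws-use-static-instance-list=true
        # ...remaining --nodes and other flags as before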

@MaciekPytel
Contributor

@Jeffwan @jaypipes Based on comments above this seems like it may be a memory leak in AWS provider.

@nitrag
Contributor

nitrag commented Oct 15, 2020

so the option we tried is --aws-use-static-instance-list=true (Should CA fetch instance types in runtime or use a static list. AWS only) and now the memory utilisation dropped down to ~250MB

I tried @infa-ddeore's suggestion of the static instance list. On my two clusters with minimal load I average around 250MB before and after. On my production cluster, memory had been ~650-750MB over the past few days, but after enabling the static instance list it jumped to 950MB.

@nitrag
Contributor

nitrag commented Oct 15, 2020

Well, now after reverting, it starts at 500MB and within 15 seconds climbs to 950MB before k8s kills it for exceeding its memory request. HELP!

[logs attached]

@jaypipes
Contributor

any updates on this? 1.15.7 CA is getting killed by k8 due to high memory usage.
CA reached to 1.3G memory utilisation with cluster of ~80 nodes but was killed by k8

so the option we tried is --aws-use-static-instance-list=true (Should CA fetch instance types in runtime or use a static list. AWS only) and now the memory utilisation dropped down to ~250MB

there is no issue with 1.14 CA

I'm going to have to bisect to see what changed between the 1.14 and 1.15 CA releases for AWS. At this point I'm just not sure; I didn't think there were actually many changes to the AWS-specific code between 1.14 and 1.15.

@varunpalekar

We are also seeing a similar issue: at startup, memory goes up to around 1 GB, but after 5 minutes it is down to only about 100 MB.

cluster autoscaler image: tried both 1.15.6 and 1.15.7
k8s version: v1.15.10
cloud-provider: aws
Cluster Size: approx 40 nodes

@nitrag
Contributor

nitrag commented Nov 18, 2020

--aws-use-static-instance-list=true
Didn't help us.

Holy crap. Container aws-cluster-autoscaler was using 1149444Ki, which exceeds its request of 900Mi.
@jaypipes I'm going to reach out to you via email so we can troubleshoot this 1:1.

@nitrag
Contributor

nitrag commented Nov 18, 2020

I upgraded cluster-autoscaler to 1.16.7 (from 1.15.7) and memory one minute after boot is 375MiB. I will continue to monitor.

Simultaneously, I was removing 55k completed batch Jobs (cron jobs) that were just lingering on the API server as complete. Does anyone else with this issue have similar latent storage (kubectl get jobs --all-namespaces)? I don't see how this could be related, but you never know.
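
For reference, one rough way to purge completed Jobs, namespace by namespace (a sketch; <namespace> is a placeholder and the filter assumes single-completion Jobs):

    # select Jobs whose status.succeeded is 1 and delete them
    kubectl -n <namespace> get jobs \
      -o jsonpath='{.items[?(@.status.succeeded==1)].metadata.name}' \
      | xargs -r kubectl -n <namespace> delete job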

System Info:
 Boot ID:                    872ea78f-3818-48bc-a30b-a9c402c1bc04
 Kernel Version:             4.14.198-152.320.amzn2.x86_64
 OS Image:                   Amazon Linux 2
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://19.3.6
 Kubelet Version:            v1.16.13-eks-ec92d4
 Kube-Proxy Version:         v1.16.13-eks-ec92d4

Update 3/31/21:

We experienced this issue again with EKS 1.15 and amazon-eks-node-1.15-v20201126 AMI. Our EKS 1.16 clusters are doing fine as mentioned previously.

@seamusabshere

--aws-use-static-instance-list=true

This does not fix it.

    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 09 Dec 2020 09:41:39 -0500
      Finished:     Wed, 09 Dec 2020 09:43:50 -0500
    Ready:          True
    Restart Count:  1
    Limits:
      memory:  6Gi

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 9, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 30, 2021
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 4, 2021
@mshivaramie22

/remove-lifecycle rotten

@nitrag
Contributor

nitrag commented May 18, 2021

I'd like to point out an observation. We just upgraded the last of our production clusters last Friday (4 days ago) and I noticed something:

EKS Upgrade to 1.18.
Cluster-Autoscaler to 1.17.4.

Memory usage was high for the entire weekend, across pod restarts, different nodes, the latest AMI, etc. Within about 30 seconds of starting, aws-cluster-autoscaler would climb to 1.6GB MEM USAGE per docker stats.

Looking at our logs, we noticed entries we had not seen before:

I0515 13:23:27.351494       1 trace.go:116] Trace[1286906120]: "Reflector ListAndWatch" name:k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:328 (started: 2021-05-15 13:23:13.91257067 +0000 UTC m=+37587.395469158) (total time: 13.43885801s):
Trace[1286906120]: [13.283242498s] [13.283242498s] Objects listed
Trace[1286906120]: [13.283243098s] [600ns] Resource version extracted
Trace[1286906120]: [13.330326445s] [47.083347ms] Objects extracted
Trace[1286906120]: [13.43885412s] [108.527675ms] SyncWith done
Trace[1286906120]: [13.43885526s] [1.14µs] Resource version updated
Trace[1286906120]: [13.43885801s] [2.75µs] END

This led me to investigate how large our Job history was (because EKS does not clean up Jobs once they complete! You should enable this, AWS team; I think it's the TTL settings, see the sketch at the end of this comment).

kubectl get jobs --all-namespaces | wc -l

answer: 83339

Eighty thousand!

It took me 2+ days to clean these up. I had to write a Python program to do it via the Kube API.

It's now down to a reasonable 4500 (I programmed it to keep a 15-day history).

Fast forward to today: I checked the aws-cluster-autoscaler pod and it was still at 1.6GB of memory usage (docker stats)... I deleted the pod. The replacement pod was scheduled on the same node and has been running for 30+ minutes at 330MB. WOW!
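
About the TTL settings mentioned above: depending on the cluster version (this was behind the TTLAfterFinished feature gate before it became stable), finished Jobs can be garbage-collected automatically by setting ttlSecondsAfterFinished on the Job spec, or on a CronJob's jobTemplate, roughly like this (a sketch; the name, schedule and TTL value are just examples):

    apiVersion: batch/v1beta1            # batch/v1 on 1.21+
    kind: CronJob
    metadata:
      name: example-cron                 # placeholder
    spec:
      schedule: "*/30 * * * *"
      jobTemplate:
        spec:
          ttlSecondsAfterFinished: 86400   # clean up finished Jobs after one day
          template:
            spec:
              containers:
              - name: job
                image: busybox
                command: ["sh", "-c", "echo hello"]
              restartPolicy: OnFailure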

@jaypipes

@edgan

edgan commented Jul 22, 2021

I just saw this with Cluster Autoscaler 1.20.0. It is a Jenkins/kube master running in AWS with an ASG. Two pods (Jenkins slave agents) run Jenkins jobs every half hour, so 2 jobs * 24 hours * 2 (runs per hour) = 96 jobs daily. The instance in the ASG is fairly static given the frequency of the Jenkins jobs.

The cluster-autoscaler is installed via helm, and had been running for about 15 days.

The pod had no resource limits, which I am fixing now, but that will just cause it to be auto-restarted. It did OOM the host.

kernel: Out of memory: Killed process 2056769 (cluster-autosca) total-vm:1340256kB, anon-rss:383832kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:992kB oom_score_adj:1000

1340256 kB / 1024 / 1024 ≈ 1.28 GB

The instance is an m5.large with 7749MB of memory. The region is us-east-2. The cluster-autoscaler process was using 16.9% of memory.

Update: Interestingly, before I added the resource limits, the restart counter was at 68, and the cluster-autoscaler process had OOMed the box 65 times over the span of 4.5 hours all on July 22. So it had been mostly fine, then something changed in AWS, and memory usage blew up not just once, but repeatedly.

The total-vm counts across the 65 OOMs.

 12 1340512kB
 53 1340256kB

@xavipanda

There is a leak for sure; we got OOM-killed at a 1G limit...
Our older version of the autoscaler was fine, averaging around 250M.

@fblgit

fblgit commented Sep 25, 2021

Same here, using AWS. There is no way this could justify 1G of consumption...

There is a leak, and it's related to the AWS provider. It can be reproduced with multiple ASGs, multiple nodes in each, and adding/deleting ASGs: nodes coming in and out contribute to the effect.

Unfortunately, to reproduce and eventually fix this you have to spend a reasonable amount of money in AWS.

@mdshoaib707

I am also facing the OOM issue.
Cluster version => v1.19.12
Cluster-autoscaler version => v1.20.0
Pod memory limits => 2046Mi

@jaredhancock31

Any updates on this? Or any established workarounds for large clusters?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 20, 2022
@dalvarezquiroga

I think there is an issue describing the same problem here:

#3506

It looks like it is fixed in a newer version of Cluster Autoscaler.

One of the users there reports:
"I've deployed v1.22.1 into a cluster which was previously seeing an OOM kill with a memory limit of 300Mi. It's fixed the problem for us."
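
For anyone who wants to try that, the upgrade is usually just a matter of bumping the image tag in the CA Deployment; newer releases are published under the autoscaling path, so double-check the registry and tag for your version (sketch):

        image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.22.1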

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 23, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
