Memory leak? #3044
Comments
Curious about the workload you're running: how many pods are running, and how many pending pods do you see during those spikes?
@marwanad thanks for responding. Those spikes take about a day to manifest. We have a pretty steady workflow, so about 700+ pods, up to 1900 if you count cron jobs.
@bradleyd interesting, I've hit weird memory behaviour with an increasing number of unschedulable pods (as logged in CA). In most cases, around 3k pods or so would hit anywhere between 500-700Mi. Haven't got to running pprof yet. From what @MaciekPytel mentioned on Slack, it seems like this is not unexpected, given that watches cache all nodes and pods in memory.
I would definitely expect CA memory to grow with cluster size (#pods, #nodes and especially #ASGs), but the number of pods mentioned is not that large; this seems like more memory than I'd expect. Note that I only run clusters in GCP/GKE, so my intuition may be way off for memory use of the AWS provider.
@MaciekPytel those were my thoughts too. I get that memory will grow in proportion to the cluster size, but this seems like a lot™.
@bradleyd do you have pods using affinity/anti-affinity rules by any chance? I ran a quick pprof and it seems that
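For anyone who wants to repeat that pprof check: assuming your cluster-autoscaler build serves Go's standard /debug/pprof endpoints on its metrics port (8085 by default) when profiling is enabled, you can capture a heap profile while memory is climbing. A minimal sketch, with the pod's port already forwarded to localhost (the port and path are assumptions; check your version's flags):

```python
# Sketch: save cluster-autoscaler's heap profile for offline analysis.
# Assumes `kubectl port-forward <ca-pod> 8085:8085` is running and that the
# binary serves Go's standard /debug/pprof endpoints on that port.
import urllib.request

PPROF_URL = "http://localhost:8085/debug/pprof/heap"  # assumed port/path

with urllib.request.urlopen(PPROF_URL, timeout=30) as resp:
    profile = resp.read()

with open("ca-heap.pb.gz", "wb") as f:
    f.write(profile)

print(f"wrote {len(profile)} bytes; inspect with: go tool pprof ca-heap.pb.gz")
```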
I'm seeing the same issue on a 6-node cluster, Amazon EKS v1.15, k8s.gcr.io/cluster-autoscaler:v1.15.5. CA runs into the memory limit, runs out of memory, and is then restarted. Looks like this behavior is on a ~7d schedule.
We just upgraded to 1.15.6 from 1.14.x and CA was OOMing on startup with the requests and limits we had previously set; we had to significantly increase them to get CA to start up. Has anything changed to greatly increase the memory footprint?
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
30-40 pods, 10-15 nodes (another 15 terminated nodes). AWS. My other two clusters, which have 6-8 nodes, use only 200MB.
I have another cluster with 100+ pods, 20+ nodes where
Any updates on this? The option we tried: with the 1.14 CA there is no issue.
I tried @infa-ddeore's suggestion of the static instance list. On my two clusters with minimal load, memory averages around 250MB before and after. On my production cluster memory was ~650-750MB over the past few days, but after enabling the static instance list it jumped to 950MB.
Well, now after reverting, it starts at 500MB and within 15 seconds it climbs to 950MB before k8s kills it for exceeding memory requests. HELP!
I'm going to have to bisect to see what changed between the 1.14 and 1.15 CA w/AWS releases. At this point, I'm just not sure; I didn't think there were many changes actually to the AWS-specific code between 1.14 and 1.15.
We are also seeing a similar issue: at startup memory goes up to around 1 GB, but after 5 min it's only 100 MB. Cluster-autoscaler image: tried both
Holy crap.
I upgraded cluster-autoscaler to 1.16.7 (from 1.15.7) and the memory 1min after boot is 375MiB. Will continue to monitor. Simultaneously I was removing 55k completed Batch Jobs (cronjobs) that were just lingering on the API server as complete. Anyone else with this issue have similar latent storage? I don't see how this could be related but you never know.
Update 3/31/21: We experienced this issue again with EKS
This does not fix it.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle rotten
I'd like to point out an observation. We just upgraded the last of our production clusters last Friday (4 days ago): an EKS upgrade to 1.18. Memory usage stayed elevated for the entire weekend, across pod restarts, different nodes, the latest AMI, etc. Looking at our logs we noticed log lines we had not seen before:
This led me to investigate how large our job history was (because EKS does not clean up Jobs once they complete! You should enable this, AWS team; I think it's the TTL-after-finished setting, ttlSecondsAfterFinished).
The answer: 83339. Eighty thousand! It took me 2+ days to clean these up. I had to write a Python program to do it via the Kube API. It's now down to a reasonable 4500 (I programmed it to keep a 15-day history). Fast forward to today: I checked
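The Python program itself isn't shown in the thread; as a rough illustration of the approach (list Jobs and delete the ones that completed more than N days ago via the official kubernetes Python client), a sketch might look like the following. The namespace and retention window are assumptions, not the commenter's actual values:

```python
# Rough sketch: delete Jobs that completed more than KEEP_DAYS ago, similar in
# spirit to the cleanup script described above.
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

KEEP_DAYS = 15
NAMESPACE = "default"  # hypothetical namespace; adjust as needed

config.load_kube_config()  # use config.load_incluster_config() inside a pod
batch = client.BatchV1Api()
cutoff = datetime.now(timezone.utc) - timedelta(days=KEEP_DAYS)

for job in batch.list_namespaced_job(NAMESPACE).items:
    completed = job.status.completion_time  # None for jobs still running
    if completed and completed < cutoff:
        batch.delete_namespaced_job(
            name=job.metadata.name,
            namespace=NAMESPACE,
            propagation_policy="Background",  # also garbage-collect the Job's pods
        )
        print(f"deleted {job.metadata.name} (completed {completed})")
```

Setting ttlSecondsAfterFinished on the Jobs themselves avoids the buildup in the first place, which is presumably the TTL setting the comment above refers to.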
I just saw this as well. The cluster-autoscaler is installed via helm and had been running for about 15 days. The pod had no resource limits, which I am fixing now, but that will just cause it to be auto-restarted. It did OOM the host.
1340256 kB / 1024 / 1024 ≈ 1.278 GB. The instance is an m5.large with 7749 MB. The region is us-east-2. The cluster-autoscaler process was using 16.9% of memory. Update: interestingly, before I added the resource limits, the restart counter was at 68, and the cluster-autoscaler process had OOMed the box 65 times over the span of 4.5 hours, all on July 22. So it had been mostly fine, then something changed in AWS, and memory usage blew up not just once but repeatedly. The total-vm counts across the 65 OOMs.
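For anyone repeating that conversion on their own OOM-killer output: total-vm is reported in kB, so it is just the two divisions by 1024 shown above. A throwaway helper, seeded with the value quoted in this comment:

```python
# Convert OOM-killer "total-vm" figures (reported in kB) to GB, matching the
# 1340256 kB ≈ 1.278 GB calculation above.
def total_vm_kb_to_gb(kb: int) -> float:
    return kb / 1024 / 1024

for kb in (1340256,):  # append further total-vm samples from the kernel log here
    print(f"{kb} kB = {total_vm_kb_to_gb(kb):.3f} GB")
```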
There is a leak for sure. We got OOMKilled at a 1G limit...
Same here, using AWS. There is no way this could justify 1G of consumption... there is a leak, and it's related to the AWS provider. It can be reproduced with multiple ASGs, multiple nodes in each, and by adding/deleting ASGs: nodes coming in and going out contribute to the effect. Unfortunately, to reproduce and eventually fix this you have to spend a reasonable amount of money in AWS.
I am also facing an OOM issue.
Any updates on this? Or any established workarounds for large clusters?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
I think there is an issue with the same problem described here: It looks like that is fixed in a new version of Cluster Autoscaler. Some messages from users:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
We are running CA 1.15.5 with k8s 1.15.7. We are seeing memory gradually grow over time. We have the limit set to 1G, but it eventually reaches that in about a day and then gets OOM'd.
Here is our config in the deployment
Any thoughts?