Steady memory leak in VPA recommender #6368

Closed
DLakin01 opened this issue Dec 11, 2023 · 9 comments
Labels
area/vertical-pod-autoscaler, kind/bug, lifecycle/rotten

Comments

@DLakin01

Which component are you using?:

vertical-pod-autoscaler, recommender only

What version of the component are you using?

0.14.0

What k8s version are you using (kubectl version)?:

1.26

What environment is this in?:

AWS EKS, multiple clusters and accounts, multiple types of applications running on the cluster

What did you expect to happen?:

VPA recommender should run at more or less the same memory level throughout the lifetime of a particular pod.

What happened instead?:

There is a steady memory leak that is especially visible over a period of days, as seen here in a screen capture of our DataDog:
image

The upper lines with the steeper slope are from our large multi-tenant clusters, but the smaller clusters also experience the leak, albeit more slowly. If left alone, the memory will reach 200% of requests before the pod gets kicked. The recommender in the largest cluster is tracking 3161 PodStates at the time of creating this issue.

How to reproduce it (as minimally and precisely as possible):

Not sure how reproducible the issue is outside of running VPA in a large cluster with > 3000 pods and waiting several days to see if the memory creeps up.

Anything else we need to know?:

We haven't yet created any VPA CRDs to generate recommendations; we are waiting until a future sprint to begin rolling those out.

@DLakin01 DLakin01 added the kind/bug label Dec 11, 2023
@vkhacharia

We also face the same issue. Our version is 0.11 with k8s version 1.24. Below is a Grafana snippet from the last restart.
image

@voelzmo
Contributor

voelzmo commented Feb 26, 2024

Hey @vkhacharia @DLakin01 thanks for bringing this up!

To some extent, this behavior is expected, and given only these graphs it is hard to tell whether the behavior is normal or not.
The recommender keeps metrics for each container, regardless of whether that container is under VPA control or not. I guess the reasoning is that you get accurate recommendations immediately if you decide to enable VPA for this container at a later point in time. You can switch off this default behavior by enabling memory saver mode.
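
To make the difference concrete, here is a toy model of that trade-off. It is only a sketch under stated assumptions: `ToyRecommender`, `vpa_targets`, and `tracked_after` are hypothetical names made up for illustration and do not reflect the recommender's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRecommender:
    """Toy model only; not the VPA recommender's real implementation."""
    memory_saver: bool = False
    # Hypothetical: names of workloads that have a VPA object.
    vpa_targets: set = field(default_factory=set)
    # (workload, container) -> list of observed usage samples.
    tracked: dict = field(default_factory=dict)

    def observe(self, workload: str, container: str, usage_bytes: int) -> None:
        # With memory saver on, containers without a VPA are not tracked at all.
        if self.memory_saver and workload not in self.vpa_targets:
            return
        self.tracked.setdefault((workload, container), []).append(usage_bytes)


def tracked_after(memory_saver: bool) -> int:
    """Feed three workloads (only 'api' has a VPA) and count tracked containers."""
    rec = ToyRecommender(memory_saver=memory_saver, vpa_targets={"api"})
    for workload in ("api", "worker", "cron"):
        rec.observe(workload, "main", 100_000_000)
    return len(rec.tracked)
```

In this model, state grows with every container seen when memory saver is off, and only with VPA-targeted containers when it is on.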

Even with memory saver mode enabled, some growth in memory is expected:

So if you're rolling approximately the same number of times per week, your memory is expected to grow for ~2 weeks. If you're adding Containers and don't have memory saver mode enabled, memory will grow with every Container.

If all of those parameters are controlled and you still see memory growth, I guess this really is a memory leak that shouldn't happen.
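
One rough way to check this against exported metrics is to fit a line through the samples taken after the expected warm-up and see whether the slope is still clearly positive. The sketch below assumes samples are `(day, bytes)` pairs; the 14-day warm-up mirrors the "~2 weeks" above, while `min_slope` is an arbitrary illustrative threshold, not a value from VPA.

```python
def slope_per_day(samples):
    """Least-squares slope of (day, bytes) samples, in bytes per day."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    den = sum((x - mean_x) ** 2 for x, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den


def still_growing(samples, warmup_days=14, min_slope=1_000_000):
    """True if memory keeps climbing after the assumed warm-up period."""
    # Only look at samples taken once the expected growth phase is over.
    late = [(x, y) for x, y in samples if x >= warmup_days]
    return len(late) >= 2 and slope_per_day(late) > min_slope
```

A series that flattens out after two weeks returns `False` here, while one that keeps climbing at the same rate returns `True`. This only flags a suspicious trend; it does not prove a leak.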

@vkhacharia

@voelzmo Thanks for the quick response. I wanted to try it now but noticed that I am on k8s version 1.24, which is compatible with version 0.11 of the VPA recommender. I don't see the memory-saver parameter in the code in the branch for version 0.11.

@voelzmo
Contributor

voelzmo commented Mar 5, 2024

Hey @vkhacharia, thanks for your efforts! VPA 0.11.0 also has memory-saver mode, but the parameter is in a different place and was moved to the above section in the code with a refactoring that happened later.

So you can still turn on --memory-saver=true and see what this does for you. Hope that helps!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Jun 3, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Jul 3, 2024
@adrianmoisey
Member

/area vertical-pod-autoscaler

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned Aug 7, 2024
6 participants