High Memory Usage after helm-controller v0.12.0 upgrade #345
Comments
We have had one other report mentioning this, but they were upgrading from a much older version, making it harder to trace the issue back to a specific commit. What was the version you were using before the upgrade?
There is one difference however, and that is the change in strategy. Can you try increasing the interval at which the charts are reconciled, and in addition, maybe increase the memory limit? Chart reconciliation from certain sources can be a more memory-expensive task, as the chart needs to be loaded into memory for certain actions. We are working to reduce this consumption, but e.g. storing a chart using the Helm SDK is done by loading the chart in full, which can't be avoided for charts from unpackaged sources.
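For illustration, here is a minimal sketch of what increasing the chart reconciliation interval looks like on a HelmRelease; the names and values are placeholders, not taken from this thread:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: my-app            # placeholder name
  namespace: flux-system
spec:
  interval: 10m           # how often the release itself is reconciled
  chart:
    spec:
      chart: ./charts/my-app
      interval: 1h        # raise this to rebuild the chart from source less often
      sourceRef:
        kind: GitRepository
        name: flux-system
```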
I updated from v0.11.2.
Can you tell me more about the size of the repositories they originate from, and what chart-related features you are making use of (valuesFrom, etc.)?
All charts that use ReconcileStrategy Revision are from the flux-system git repository, which is 1.8 MB in size. The charts have local and remote (transitive) dependencies, which are committed into the git repository in the
I now had to change all ReconcileStrategy to ChartVersion, because the high memory utilization made our controller nodes unresponsive and the autoscaler had to replace them.
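For readers applying the same workaround, the strategy is set per chart template; a sketch with placeholder names:

```yaml
spec:
  chart:
    spec:
      chart: ./charts/my-app          # placeholder path
      reconcileStrategy: ChartVersion # was: Revision
      sourceRef:
        kind: GitRepository
        name: flux-system
```

With ChartVersion, the chart is only rebuilt when the version in Chart.yaml changes, rather than on every source revision.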
@tehlers320 can you provide more information about the version you upgraded from?
@hiddeco we jumped from 0.10.2 and do not use the new Revision reconcile strategy. It's been running for 5 days (when it can; OOMKiller). Sorry, I don't mean to walk over this issue, but I think perhaps it's not related to the ReconcileStrategy, since we don't use Revision. Should I make a new issue, or do you think the issue is both?
No, I think your observations are correct based on other reports on Slack. I did a quick dive into it with the limited time I had available, but the helm-controller didn't really change much besides Helm, kustomize, and controller-runtime updates. It would be useful if someone could pinpoint the resource behavior change to an exact helm-controller version, which would help identify the issue. I am at present working on Helm improvements for the source-controller in the area of Helm repository index, dependency, and chart build memory consumption. Once that's done, I have time (and am planning) to look in much greater detail at the current shape of the helm-controller (as part of https://github.com/fluxcd/helm-controller/milestone/1).
We are facing the same issue, where the memory of the helm-controller keeps on growing. We upgraded from 0.16.2 (no issue with the helm-controller in this version) > 0.19.1 (started seeing issues) > 0.20.1. We have around 258 HelmReleases and 8 HelmRepositories in our cluster, with the interval set to 24h. We have set 4 GB as the memory limit for our helm-controller. Here is the output of
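As an aside for anyone setting a similar limit: a memory limit like the 4 GB mentioned above is typically applied through a kustomize patch in the flux-system Kustomization, roughly like this (a sketch, assuming the stock Flux manifests, which already define a memory limit on the controller):

```yaml
# kustomization.yaml in the flux-system directory
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: helm-controller
    patch: |
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 4Gi
```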
I've spent the day digging around to find the root cause of the sudden increase in memory usage. Here is what I've found:
We can't do much in Flux: we have to wait for that PR to get merged, then wait for a Kubernetes release, then wait for a Helm release that uses the latest Kubernetes release, and finally update Helm in Flux to fix the OOM issues. I propose we revert Helm to v3.6.3 for a couple of months until the
We've pushed a release candidate for #352; here is the image: Please take it for a spin and let us know if it fixes the issue.
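For anyone wanting to test a candidate image like this, one way is a kustomize image override in the flux-system Kustomization (the tag below is a placeholder for the RC tag given above):

```yaml
# kustomization.yaml: point helm-controller at the RC image for testing
images:
  - name: ghcr.io/fluxcd/helm-controller
    newTag: rc-tag-from-the-comment-above  # placeholder
```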
We have been using the helm-controller with After a couple of hours in our staging cluster, We will monitor the controller for a few more days
@glen-uc could you help test another variant? I would like to see if we could keep Helm at the latest version and only replace
@stefanprodan Sure, if you give me the image I can test it out in our staging cluster.
Please give Thank you!
@stefanprodan I have deployed the helm-controller with the image provided; I will monitor it for a couple of hours and look for restarts due to OOM kills.
@glen-uc thank you very much for helping us 🤗
@stefanprodan I'm glad I could help. Our staging cluster is using the helm-controller with Memory footprint is very minimal (< 100 MB) and we haven't observed any restarts due to OOM kills. We will monitor the controller for a few more days.
Awesome, and thanks a lot for helping out. This seems to indicate that we can at least temporarily work around the upstream problems by forcing the replacement of that specific Helm dependency, without having to stop receiving new Helm updates.
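For reference, the forced replacement described here is done with a replace directive in the controller's go.mod; a sketch, where the module path and versions are placeholders rather than the actual dependency that was swapped (its name is not preserved in this thread):

```
// go.mod (sketch): keep Helm at the latest release while pinning one
// transitive dependency to an older, known-good version.
module github.com/fluxcd/helm-controller

go 1.17

require helm.sh/helm/v3 v3.7.1 // Helm itself stays current

// Placeholder module path: pin the suspect dependency to a known-good version.
replace example.com/leaky/module => example.com/leaky/module v0.9.0
```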
I also tested
@Legion2 would you be able to grab a pprof profile, as done for the graph in one of the comments above? You can send it to me via DM on CNCF Slack (
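For anyone else asked for a profile: the Flux controllers expose Go pprof endpoints on their healthz port (9440 by default in recent versions; adjust if yours differs), so a heap profile can be captured roughly like this:

```sh
# Forward the helm-controller healthz/pprof port to localhost
kubectl -n flux-system port-forward deploy/helm-controller 9440:9440 &

# Capture a heap profile
curl -s http://localhost:9440/debug/pprof/heap -o heap.out

# Inspect it locally (requires the Go toolchain)
go tool pprof -http=:8080 heap.out
```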
Received a heap profile from @Legion2, which generates the following map: What version was this for, Leon? This information was not included in your mail.
@Legion2 can you please run
Here is the output of
@Legion2 can you please try
Yes.
There is an update from my end: like @Legion2, with helm Unfortunately, I don't have the memory profile of the helm-controller when it was running on I will monitor
@stefanprodan I think we should deem
Thanks all for testing. This should now also be solved by updating the helm-controller Deployment image to CLI release for
We're seeing a 75% drop in
Helm released a patch yesterday which likely addresses this issue. Due to the holiday period that is arriving pretty soon, however, I am hesitant to release this, as I will be on leave for 3 weeks. Unless someone has specific needs for the v3.7.x release range, in which case I can provide an RC.
This commit updates Helm to 3.7.2, in an attempt to get to a v3.7.x release range _without_ any memory issues (see #345), which should have been addressed in this release. The change in replacements has been cross-checked with the dependencies of Helm (and more specifically, the Oras project), and confirmed to not trigger any warnings using `trivy`. Signed-off-by: Hidde Beydals <[email protected]>
Based on the PR above, an RC with Helm 3.7.2 is available as
We are going ahead with Helm v3.7.2, as v3.6.3 blocks us from fixing the containerd CVEs due to ORAS breaking changes. In case v3.7.2 brings back the memory leak, please pin helm-controller to
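Pinning can again be done with a kustomize image override in the flux-system Kustomization; the tag below matches the v0.14.1 pin mentioned later in the thread:

```yaml
# kustomization.yaml: pin helm-controller until the leak is confirmed fixed
images:
  - name: ghcr.io/fluxcd/helm-controller
    newTag: v0.14.1
```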
I updated Flux to the latest version and can confirm that the memory leak is still present; I will pin helm-controller to v0.14.1.
Is it still sensible for us to try to pin the helm-controller version to avoid this? We're still using Flux v0.24.1 / helm-controller v0.14.1, but don't want to get too far behind, now that 0.26.x is released.
This issue has been confirmed to be solved in the latest release (https://github.com/fluxcd/helm-controller/releases/tag/v0.16.0) via #409.
I updated to helm-controller v0.12.1 and started using ReconcileStrategy Revision for all my local Helm charts. Now the helm-controller is restarted each time I push a commit to the GitRepository source, because it uses too much memory and is killed by Kubernetes (OOMKilled). As a result of the controller being killed, some Helm releases are stuck in the upgrade process and must be manually rolled back (helm/helm#8987).