High Memory Usage after helm-controller v0.12.0 upgrade #345

Legion2 · 2021-10-24T16:41:20Z

I updated to helm-controller v0.12.1 and started using ReconcileStrategy Revision for all my local helm charts. Now helm-controller is restarted each time I push a commit to the GitRepository source, because helm-controller uses too much memory and is killed by kubernetes (OOMKilled). As a result of the controller being killed by kubernetes, some helm release are stuck in the upgrade process which must be manually rolled back (helm/helm#8987).

The text was updated successfully, but these errors were encountered:

hiddeco · 2021-10-24T16:44:48Z

We have had one other report mentioning this, but they were upgrading from a much older version, making it harder to trace the issue back to a specific commit.

What was the version you were using before the upgrade?

hiddeco · 2021-10-24T16:49:06Z

There is one difference however, and that is the change in strategy.

Can you try increasing the interval at which the charts are reconciled, and in addition, maybe increase the limit. Chart reconciliation from certain sources can be a more memory expensive task, as the chart needs to be loaded into memory for certain actions.

We are working to reduce this consumption, but e.g. storing a chart using the Helm SDK is done by loading a chart in full, which can't be avoided for charts from unpackaged sources.

Legion2 · 2021-10-24T16:49:30Z

I updated from v0.11.2

Legion2 · 2021-10-24T17:08:39Z

The current interval is 10m and limit is 4Gi. If I push a commit the memory usage increases by 1.5Gi. I have 26 Helm releases and 11 of them use ReconcileStrategy Revision.

hiddeco · 2021-10-24T17:26:16Z

Can you tell me more about the size of the repositories they originate from, and what chart related features you are making use of (valuesFrom, etc.)?

Legion2 · 2021-10-24T17:43:06Z

All charts that use ReconcileStrategy Revision are from the flux-system git repository which is 1.8Mb in size. The charts have local and remote (transitive) dependencies, which are committed into the git repository in the charts/ subdirectory as tgz files (I don't know if this is still needed with the current helm-controller version). The HelmReleases only use interval, chart, releaseName, install, upgrade and values in the spec. All intervals are set to 10m.

Legion2 · 2021-10-27T12:29:05Z

I now had to change all ReconcileStrategy to ChartVersion because the high memory utilization made our controller nodes unresponsive and autoscaler had do replace them.

tehlers320 · 2021-11-02T15:53:57Z

I'm not sure this is just the revision strategy. We do not use this anywhere and have seen a huge jump in memory use. I grabbed a pprof heap map.

v0.11.2...v0.12.1

hiddeco · 2021-11-02T16:08:24Z

@tehlers320 can you provide more information about the version you upgraded from?

tehlers320 · 2021-11-02T16:17:16Z

@hiddeco we jumped from 0.10.2 and do not use the new reconcile revision. Its been running for 5 days (when it can, OOMkiller). Sorry i don't mean to walk over this issue, but i think perhaps its not related to the reconcilestrategy since we dont use Revision.

Should i make a new one or do you think the issue is both?

hiddeco · 2021-11-02T16:41:25Z

No, I think your observations are correct based on other reports on Slack.

Did a quick dive into it with the limited time I had available, but the helm-controller didn't really change much besides Helm, kustomize and controller-runtime updates. It would be useful if someone could pinpoint the resource behavior change to an exact helm-controller version, which would help identifying the issue.

I am at present working on Helm improvements for the source-controller in the area of Helm repository index, dependency, and chart build memory consumption. Once that's done, I have time (and am planning) to look in much greater detail at the current shape of the helm-controller (as part of https://github.com/fluxcd/helm-controller/milestone/1).

glen-uc · 2021-11-03T11:47:04Z

We are facing the same issue where the memory of helm controller keeps on growing

We upgraded from 0.16.2 (no issue with helm controller in this version) > 0.19.1 (started seeing issues ) > 0.20.1

We have around 258 helm releases and 8 helm repo's in our cluster with the interval set to 24h

We have set 4GB as the memory limit for our helm controller

Here is the output of curl -Sk -v http://<helm_controller>:8080/debug/pprof/heap

heap.zip

stefanprodan · 2021-11-03T13:40:29Z

I've spend the day digging around to find the root cause of the sudden increase in memory usage. Here is what I've found:

in this commit kubectl switched to k8s.io/kube-openapi (released in k8s.io/kubectl v0.22.1)
in this commit Helm switched to k8s.io/kubectl v0.22.1 (released in helm.sh/helm/v3 v3.7.0)
the k8s.io/kube-openapi pkg has a bug where it consumes large amounts of memory kube-apiserver memory consumption during CRD creation kubernetes/kubernetes#101755
there is a fix underway Lazy marshaling for OpenAPI v2 spec kubernetes/kube-openapi#251

We can't do much in Flux, we have to wait for that PR to get merged, then wait for a Kubernetes release, then wait for a Helm release that uses the latest Kubernetes release and finally update Helm in Flux to fix the OOM issues.

I propose we revert Helm to v3.6.3 for a couple of months until the kube-openapi fixes end up in Helm.

stefanprodan · 2021-11-03T16:09:31Z

We've pushed a release candidate for #352, here is the image: ghcr.io/fluxcd/helm-controller:rc-4fe7a7c8

Please take it for a spin and let us know if it fixes the issue.

glen-uc · 2021-11-04T11:35:39Z

We have been using helm controller with ghcr.io/fluxcd/helm-controller:rc-4fe7a7c8 image for a couple of hours and already seeing a large reduction in memory footprint when compared to ghcr.io/fluxcd/helm-controller:v0.12.1

After a couple of hours in our staging cluster, ghcr.io/fluxcd/helm-controller:v0.12.1 used to consume around 4GB+ and crash as we have set the limit as 4GB, ghcr.io/fluxcd/helm-controller:rc-4fe7a7c8 is barely using 200MB

We will monitor the controller for few more days

flux: v0.21.0
helm-controller: rc-4fe7a7c8
image-automation-controller: v0.16.0
image-reflector-controller: v0.13.0
kustomize-controller: v0.16.0
notification-controller: v0.18.1
source-controller: v0.17.1

stefanprodan · 2021-11-04T12:08:16Z

@glen-uc could you help test another variant? I would like to see if we could keep Helm at the latest version and only replace kube-openapi with go-openapi.

glen-uc · 2021-11-04T13:34:48Z

@stefanprodan Sure, if you give the image I can test it out in our staging cluster

stefanprodan · 2021-11-04T14:06:57Z

Please give ghcr.io/fluxcd/helm-controller:rc-725fd784 a spin, this is built from #354

Thank you!

glen-uc · 2021-11-04T14:20:46Z

@stefanprodan I have deployed helm controller with the image provided, will monitor it for a couple of hours and look for restarts due to OOM Kill

flux: v0.21.0
helm-controller: rc-725fd784
image-automation-controller: v0.16.0
image-reflector-controller: v0.13.0
kustomize-controller: v0.16.0
notification-controller: v0.18.1
source-controller: v0.17.1

stefanprodan · 2021-11-04T14:22:14Z

@glen-uc thank you very much for helping us 🤗

glen-uc · 2021-11-05T11:31:16Z

@stefanprodan I'm glad I could help

Our staging cluster is using helm controller with ghcr.io/fluxcd/helm-controller:rc-725fd784 for a couple of hours now

Memory footprint is very minimal (< 100 MB) and we haven't observed any restarts due to OOM Kill

We will monitor the controller for few more days

flux: v0.21.0
helm-controller: rc-725fd784
image-automation-controller: v0.16.0
image-reflector-controller: v0.13.0
kustomize-controller: v0.16.0
notification-controller: v0.18.1
source-controller: v0.17.1

hiddeco · 2021-11-05T11:33:37Z

Awesome, and thanks a lot for helping out. This seems to indicate that we can at least temporary work around the upstream problems by forcing the replacement of that specific Helm dependency, without having to stop receiving new Helm updates.

Legion2 · 2021-11-07T13:24:46Z

I also tested ghcr.io/fluxcd/helm-controller:rc-725fd784 and experienced an OOM after 2 days at 2GiB usage. It looks like the momory usage of ghcr.io/fluxcd/helm-controller:rc-725fd784 is similar to the current release version.

hiddeco · 2021-11-07T18:30:14Z

@Legion2 would you be able to grab a pprof profile, as done for the graph in one of the comments above?

You can send it to me via DM on CNCF Slack (@hidde) or mail (hidde <at> weave.works).

hiddeco · 2021-11-08T09:23:54Z

Received a HEAP profile from @Legion2 which generates the following map:

What version was this for Leon? This information was not included in your mail.

stefanprodan · 2021-11-08T09:47:34Z

@Legion2 can you please run flux version against the cluster where the above profile was taken and post it here?

Legion2 · 2021-11-08T09:50:01Z

Here is the output of flux version:

flux: v0.21.0
helm-controller: rc-725fd784
image-automation-controller: v0.16.0
image-reflector-controller: v0.13.0
kustomize-controller: v0.16.0
notification-controller: v0.18.1
source-controller: v0.17.1

stefanprodan · 2021-11-08T09:53:55Z

@Legion2 can you please try rc-4fe7a7c8?

Legion2 · 2021-11-08T09:56:10Z

@Legion2 can you please try rc-4fe7a7c8?

yes

glen-uc · 2021-11-11T04:46:56Z

There is an update from my end, like @Legion2 with helm rc-725fd784 i started seeing restarts due to OOM Kill after 1 day (with memory limit set to 1GB) hence I promptly reverted the helm controller to rc-4fe7a7c8 it has been running for 3 days without any restarts

Unfortunately, I don't have the memory profile of the helm controller when it was running on rc-725fd784

I Will monitor rc-4fe7a7c8 for a few more days

flux: v0.21.0
helm-controller: rc-4fe7a7c8
image-automation-controller: v0.16.0
image-reflector-controller: v0.13.0
kustomize-controller: v0.16.0
notification-controller: v0.18.1
source-controller: v0.17.1

hiddeco · 2021-11-11T11:39:41Z

@stefanprodan I think we should deem rc-4fe7a7c8 the successor for now.

hiddeco · 2021-11-11T14:12:17Z

Thanks all for testing. This should now also be solved by updating the helm-controller Deployment image to v0.12.2.

CLI release for flux bootstrap, etc. will arrive later today.

poblish · 2021-11-12T10:58:36Z

We're seeing a 75% drop in helm-controller memory (peaks and base level) since picking up this version 👍

hiddeco · 2021-12-09T13:40:36Z

Helm released a patch yesterday which likely addresses this issue

Due to the holiday period that is arriving pretty soon however, I am hesitant in releasing this as I will be on leave for 3 weeks. Unless someone has specific needs for the v3.7.x release range, in which case I can provide a RC.

This commit updates Helm to 3.7.2, in an attempt to get to a v3.7.x release range _without_ any memory issues (see #345), which should have been addressed in this release. The change in replacements has been cross-checked with the dependencies of Helm (and more specifically, the Oras project), and confirmed to not trigger any warnings using `trivy`. Signed-off-by: Hidde Beydals <[email protected]>

hiddeco · 2021-12-10T09:53:06Z

Based on the PR above, a RC with Helm 3.7.2 is available as docker.io/fluxcd/helm-controller:rc-helm-3.7.2--a12eae8e

stefanprodan · 2022-01-07T16:04:28Z

We are going ahead with the Helm v3.7.2 as v3.6.3 blocks us from fixing the containerd CVEs due to ORAS breaking changes. In case v3.7.2 brings back the memory leak please pin helm-controller to v0.12.2.

This commit updates Helm to 3.7.2, in an attempt to get to a v3.7.x release range _without_ any memory issues (see #345), which should have been addressed in this release. The change in replacements has been cross-checked with the dependencies of Helm (and more specifically, the Oras project), and confirmed to not trigger any warnings using `trivy`. Signed-off-by: Hidde Beydals <[email protected]>

Legion2 · 2022-01-18T18:12:12Z

I updated flux to the latest version and can confirm that the memory leak is still present, will pin helm-controller to v0.14.1.

poblish · 2022-02-04T12:37:13Z

Is it still sensible for us to try to pin the helm-controller version to avoid this? We're still using Flux v0.24.1 / helm-controller v0.14.1, but don't want to get too far behind, now that 0.26.x is released.

hiddeco · 2022-02-04T12:57:48Z

This issue has been confirmed to be solved in the latest release (https://github.com/fluxcd/helm-controller/releases/tag/v0.16.0) via #409.

rossf7 mentioned this issue Nov 2, 2021

Set max allowed memory limit for helm-controller VPA giantswarm/flux-app#58

Merged

hiddeco mentioned this issue Nov 3, 2021

helm-controller Pod gets OOM-killed even with 1GB of RAM #349

Closed

1 task

stefanprodan changed the title ~~High Memory Usage with ReconcileStrategy Revision~~ High Memory Usage after helm-controller v0.12.1 upgrade Nov 3, 2021

stefanprodan mentioned this issue Nov 3, 2021

Downgrade Helm to v3.6.3 due to OOM issues #352

Merged

hiddeco mentioned this issue Nov 3, 2021

One failing HelmRelease seems to block all other releases from being installed/upgraded #351

Closed

hiddeco pinned this issue Nov 8, 2021

hiddeco changed the title ~~High Memory Usage after helm-controller v0.12.1 upgrade~~ High Memory Usage after helm-controller v0.12.0 upgrade Nov 8, 2021

hiddeco added area/helm Helm related issues and pull requests bug Something isn't working labels Nov 8, 2021

hiddeco unpinned this issue Nov 11, 2021

hiddeco mentioned this issue Dec 10, 2021

Update Helm to v3.7.2 #380

Merged

a-mcf mentioned this issue Jan 9, 2022

Nextcloud Helm Release Incompatiblity a-mcf/k3s-gitops#116

Closed

cbuto mentioned this issue Jan 27, 2022

Update helm to 3.7.2 #407

Merged

hiddeco closed this as completed Feb 4, 2022

ruanxin mentioned this issue Oct 18, 2022

High memory consumption of module-manager kyma-project/module-manager#141

Closed

High Memory Usage after helm-controller v0.12.0 upgrade #345

High Memory Usage after helm-controller v0.12.0 upgrade #345

Comments

Legion2 commented Oct 24, 2021

hiddeco commented Oct 24, 2021

hiddeco commented Oct 24, 2021

Legion2 commented Oct 24, 2021

Legion2 commented Oct 24, 2021

hiddeco commented Oct 24, 2021

Legion2 commented Oct 24, 2021

Legion2 commented Oct 27, 2021

tehlers320 commented Nov 2, 2021 • edited Loading

hiddeco commented Nov 2, 2021

tehlers320 commented Nov 2, 2021

hiddeco commented Nov 2, 2021 • edited Loading

glen-uc commented Nov 3, 2021 • edited Loading

stefanprodan commented Nov 3, 2021 • edited Loading

stefanprodan commented Nov 3, 2021 • edited Loading

glen-uc commented Nov 4, 2021 • edited Loading

stefanprodan commented Nov 4, 2021

glen-uc commented Nov 4, 2021

stefanprodan commented Nov 4, 2021

glen-uc commented Nov 4, 2021 • edited Loading

stefanprodan commented Nov 4, 2021

glen-uc commented Nov 5, 2021

hiddeco commented Nov 5, 2021

Legion2 commented Nov 7, 2021

hiddeco commented Nov 7, 2021

hiddeco commented Nov 8, 2021

stefanprodan commented Nov 8, 2021

Legion2 commented Nov 8, 2021

stefanprodan commented Nov 8, 2021

Legion2 commented Nov 8, 2021

glen-uc commented Nov 11, 2021 • edited Loading

hiddeco commented Nov 11, 2021

hiddeco commented Nov 11, 2021

poblish commented Nov 12, 2021 • edited Loading

hiddeco commented Dec 9, 2021

hiddeco commented Dec 10, 2021

stefanprodan commented Jan 7, 2022

Legion2 commented Jan 18, 2022 • edited Loading

poblish commented Feb 4, 2022

hiddeco commented Feb 4, 2022

tehlers320 commented Nov 2, 2021 •

edited

Loading

hiddeco commented Nov 2, 2021 •

edited

Loading

glen-uc commented Nov 3, 2021 •

edited

Loading

stefanprodan commented Nov 3, 2021 •

edited

Loading

stefanprodan commented Nov 3, 2021 •

edited

Loading

glen-uc commented Nov 4, 2021 •

edited

Loading

glen-uc commented Nov 4, 2021 •

edited

Loading

glen-uc commented Nov 11, 2021 •

edited

Loading

poblish commented Nov 12, 2021 •

edited

Loading

Legion2 commented Jan 18, 2022 •

edited

Loading