
csi-resizer getting OOMKilled #1248

Closed
bwmetcalf opened this issue May 23, 2022 · 26 comments · Fixed by #1280
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@bwmetcalf

bwmetcalf commented May 23, 2022

/kind bug

What happened?
We are using version v1.6.0-eksbuild.1 of the aws-ebs-csi-driver addon. We have one environment where the csi-resizer container is using substantially more memory and getting OOMKilled. Since the memory limit settings cannot be controlled with the addon, we have no way of increasing them. We can edit the deployment live, but it gets reverted by EKS. However, the real issue is that this container is using so much memory:

ebs-csi-controller-748586d5cc-9fbr5            csi-resizer                    13m          148Mi

Our other environments look like

ebs-csi-controller-5458b96c8d-bmgmf             csi-resizer                    1m           21Mi

The memory settings for csi-resizer are

    Limits:
      cpu:     100m
      memory:  128Mi
    Requests:
      cpu:     10m
      memory:  40Mi

which is baked into the chart.
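
For a Helm-based install these values could be overridden through the chart; for the managed add-on, a live patch is possible but, as described above, EKS reverts it. A minimal sketch of such a patch, using the deployment and container names shown in this thread:

# Raise the csi-resizer memory limit; note EKS reconciles the managed add-on and will revert this
$ kubectl set resources deployment/ebs-csi-controller -n kube-system \
    --containers=csi-resizer --limits=memory=256Mi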

What you expected to happen?
csi-resizer should not be using so much memory.

How to reproduce it (as minimally and precisely as possible)?
This is unclear, but we would welcome pointers on how to troubleshoot (see the sketch after the environment details below).

Anything else we need to know?:
This is occurring even though we aren't resizing any volumes.

Environment

  • Kubernetes version (use kubectl version): 1.22
  • Driver version: v1.6.0-eksbuild.1
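
A minimal troubleshooting sketch for confirming the OOM kills and watching per-container memory (the pod name is a placeholder):

# Look for "Last State: Terminated" with "Reason: OOMKilled" on the csi-resizer container
$ kubectl describe pod <ebs-csi-controller-pod> -n kube-system
# Watch per-container memory over time
$ kubectl top pod -n kube-system --containers | grep csi-resizer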
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 23, 2022
@gtxu
Contributor

gtxu commented May 23, 2022

Hi @bwmetcalf, usage exceeding 128Mi is abnormal. Can you describe the "other environments" and the environment where the issue occurs? Also, could you provide more details on the affected container by retrieving its logs (kubectl logs ebs-csi-controller-748586d5cc-9fbr5 -c csi-resizer -n kube-system), if convenient? Thanks!

@bwmetcalf
Author

bwmetcalf commented May 23, 2022

Thanks @gtxu. Here are the logs from one of the killed pods

$ kubectl logs ebs-csi-controller-6588dc7455-hzgdw -c csi-resizer -n kube-system
I0523 17:14:44.388430       1 main.go:83] Version : v
I0523 17:14:44.487283       1 connection.go:153] Connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock
I0523 17:14:44.487729       1 common.go:111] Probing CSI driver for readiness
I0523 17:14:44.489037       1 main.go:128] CSI driver name: "ebs.csi.aws.com"
I0523 17:14:44.489904       1 controller.go:117] Register Pod informer for resizer ebs.csi.aws.com
I0523 17:14:44.489936       1 controller.go:241] Starting external resizer ebs.csi.aws.com
I0523 17:14:44.490161       1 reflector.go:219] Starting reflector *v1.Pod (10m0s) from k8s.io/client-go/informers/factory.go:134
I0523 17:14:44.490158       1 reflector.go:219] Starting reflector *v1.PersistentVolume (10m0s) from k8s.io/client-go/informers/factory.go:134
I0523 17:14:44.586501       1 reflector.go:219] Starting reflector *v1.PersistentVolumeClaim (10m0s) from k8s.io/client-go/informers/factory.go:134

There is not much activity before it gets killed.

In terms of environment differences, this particular environment is one of three we have in GovCloud, and we have three in commercial. Other than running different services that we develop, the environments are exactly the same. It's not clear what additional information would help troubleshoot this.

@ghost

ghost commented Jun 1, 2022

@bwmetcalf or @gtxu I see that this is labeled as a bug, but was a workaround or temporary resolution found? We are experiencing the exact same issue, only in one environment, after deploying the EBS CSI driver add-on to that cluster for the first time. Manually increasing memory to 512Mi on the deployment allowed it to start successfully. Our logs look identical, there are no errors or useful messages relating to memory.

@gtxu
Contributor

gtxu commented Jun 1, 2022

Hi @ccs-warrior, can you specify the add-on version and k8s version?

@ghost

ghost commented Jun 1, 2022

@gtxu k8s 1.21 and add-on v1.6.0-eksbuild.1

@bwmetcalf
Author

bwmetcalf commented Jun 1, 2022

This just happened again for us. We are using Argo Workflows, and this issue somehow seems to be related to the number of Completed workflow pods that accumulate over time. Once the number of these pods grows past some threshold, csi-resizer starts this behavior. I just deleted all of the Completed workflow pods and csi-resizer came back healthy and is now using less memory.

I have no idea how this is correlated, but we've observed the above twice now; we are not resizing volumes.
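
For reference, a minimal sketch of the cleanup described above, assuming the completed workflow pods live in a single namespace (the namespace is a placeholder):

# List, then delete, pods that have run to completion
$ kubectl get pods -n <argo-namespace> --field-selector=status.phase=Succeeded
$ kubectl delete pods -n <argo-namespace> --field-selector=status.phase=Succeeded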

@bwmetcalf
Author

Per my comment above, the environment where we are having this issue with csi-resizer is the only environment where we are running argo workflows. I'm fairly confident this is the cause.

@bwmetcalf
Author

bwmetcalf commented Jun 2, 2022

@ccs-warrior The workaround for us is to delete the completed argo workflow pods that accumulate over time. We are looking at using https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/ once we upgrade to 1.23.
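
The linked TTL mechanism applies to Job objects (Argo Workflows also has its own ttlStrategy for workflow-level cleanup). A minimal sketch of a Job that is garbage-collected shortly after finishing; the name and image are placeholders:

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job                  # placeholder name
spec:
  ttlSecondsAfterFinished: 300       # Job and its pods are deleted 5 minutes after completion
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          command: ["sh", "-c", "echo done"]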

@gnufied
Contributor

gnufied commented Jun 2, 2022

Hmm. The resizer only watches pods if a certain flag is enabled on it. If someone can grab a heap dump while this happens, I can take a look.
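
A rough sketch of grabbing one, assuming the sidecar exposes a Go pprof endpoint over HTTP (whether profiling is enabled, and on which port, depends on the external-resizer flags in use, so treat the port and path below as placeholders):

# Forward the sidecar's HTTP port locally, then pull a heap profile
$ kubectl port-forward -n kube-system <ebs-csi-controller-pod> 8080:8080
$ go tool pprof http://localhost:8080/debug/pprof/heap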

@ghost

ghost commented Jun 3, 2022

@gnufied Unfortunately we don't have a way to replicate it any more. The cluster where we saw the issue was a pre-prod environment and so the addition of the EKS add-on was backed out as quickly as we could after seeing the failure. We are unable to replicate the same issue in any lower environments. It may be worth noting that, similar to @bwmetcalf, we also use Argo workflows in this cluster. We have other clusters where we also use Argo and are not seeing the error, but this cluster is much larger and uses Argo much more heavily.

@gtxu
Contributor

gtxu commented Jun 3, 2022

Hi @ccs-warrior, how many volumes are attached to pods in that cluster? I found a benchmark on sidecars here showing that ~3K volumes reach roughly 100Mi of memory for the resizer. While we are investigating this, we are raising the memory limits for the EKS Add-on in v1.6.2.

@woz5999

woz5999 commented Jun 3, 2022

bwmetcalf is off today, but for our use case, our clusters have very few volumes, around 10 or fewer each.

@ghost

ghost commented Jun 6, 2022

@gtxu there are only around 125 volumes in that cluster

@bhushanchirmade

We are seeing the same issue for another customer who is also using GovCloud: the EBS add-on resizer runs out of memory with 200+ PVCs.

@pascal-knapen

We see the same issue with the latest Helm-installed version as well. We raised the limit to 300Mi and it still gets OOMKilled.

@sunghospark-calm

Having the same issue here: EKS v1.22, csi-resizer v1.1.0.

@torredil
Member

torredil commented Jun 15, 2022

Hi all, can anyone confirm if they are experiencing this issue with the latest version of the driver? The latest add-on release bumps up the csi-resizer to v1.4.0 and also increases the memory limit.

@sunghospark-calm

@torredil It seems like the latest EKS add-on version (v1.6.2) is still pointing to csi-resizer v1.1.0: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/v1.6.2/charts/aws-ebs-csi-driver/values.yaml#L52

@torredil
Member

@sunghospark-calm Sorry, let me clarify: the EKS add-on version of the CSI driver released today is now using csi-resizer v1.4.0. These changes are available in our latest Helm chart as well.

@sunghospark-calm

@torredil no worries. Is there a way to test this as an EKS add-on? I still see v1.6.2 (v1.6.2-eksbuild.0) as the latest available version when I run aws eks describe-addon-versions.

@torredil
Member

@sunghospark-calm Yes! v1.6.2-eksbuild.0 is the latest version of the EKS addon and it is using csi-resizer v1.4.0. Let me know if you upgrade and still run into this problem, thanks.
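
One way to confirm which csi-resizer image a cluster is actually running, using the deployment and container names from earlier in this thread:

$ kubectl get deployment ebs-csi-controller -n kube-system \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="csi-resizer")].image}'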

@stijndehaes
Contributor

We have been running csi-resizer v1.4.0 for about a week and are still seeing this container go out of memory. We have one container with 62 restarts.

@stijndehaes
Contributor

I have opened a ticket on csi-resizer. I think the pod informer might be the culprit behind the memory blowing up: kubernetes-csi/external-resizer#211

@stijndehaes
Contributor

As we are not using the resizer, I would love to be able to turn off the handle-volume-inuse-error flag. Would you be open to a PR to the Helm charts that allows people to set extra flags on all the different components?

@bwmetcalf
Author

As we are not using the resizer, I would love to be able to turn off the handle-volume-inuse-error flag. Would you be open to a PR to the Helm charts that allows people to set extra flags on all the different components?

Unfortunately, this won't help when using the EKS add-on as the charts are not exposed.
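
For Helm users, a sketch of what such an override might look like if the chart exposed extra sidecar arguments (the additionalArgs key here is hypothetical, i.e. what the proposed PR would add; --handle-volume-inuse-error is an existing external-resizer flag):

# values.yaml override (hypothetical key)
sidecars:
  resizer:
    additionalArgs:
      - --handle-volume-inuse-error=false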

@bwmetcalf
Author

I'm not seeing anything in https://github.com/kubernetes-csi/external-resizer/releases/tag/v1.4.0 that would address the issue at hand. Have code changes been made to address this particular problem or did the latest EKS add-on release simply increase the memory limits?
