
csi-resizer getting OOMKilled #1248

Closed
bwmetcalf opened this issue May 23, 2022 · 26 comments · Fixed by #1280
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@bwmetcalf

bwmetcalf commented May 23, 2022

/kind bug

What happened?
We are using version v1.6.0-eksbuild.1 of the aws-ebs-csi-driver addon. We have one environment where the csi-resizer container is using substantially more memory and getting OOMKilled. Since the memory limit settings cannot be controlled with the addon, we have no way of increasing them. We can edit the deployment live, but it gets reverted by EKS. However, the real issue is that this container is using so much memory:

ebs-csi-controller-748586d5cc-9fbr5            csi-resizer                    13m          148Mi

Our other environments look like

ebs-csi-controller-5458b96c8d-bmgmf             csi-resizer                    1m           21Mi

The memory settings for csi-resizer are

    Limits:
      cpu:     100m
      memory:  128Mi
    Requests:
      cpu:     10m
      memory:  40Mi

which is baked into the chart.
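
For a Helm-based install these values could be overridden through the chart; for the managed add-on, a live patch is possible but, as described above, EKS reverts it. A minimal sketch of such a patch, using the deployment and container names shown in this thread:

# Raise the csi-resizer memory limit; note EKS reconciles the managed add-on and will revert this
$ kubectl set resources deployment/ebs-csi-controller -n kube-system \
    --containers=csi-resizer --limits=memory=256Mi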

What you expected to happen?
csi-resizer should not be using so much memory.

How to reproduce it (as minimally and precisely as possible)?
This is unclear, but we would welcome pointers on how to troubleshoot (see the sketch after the environment details below).

Anything else we need to know?:
This is occurring even though we aren't resizing any volumes.

Environment

  • Kubernetes version (use kubectl version): 1.22
  • Driver version: v1.6.0-eksbuild.1
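
A minimal troubleshooting sketch for confirming the OOM kills and watching per-container memory (the pod name is a placeholder):

# Look for "Last State: Terminated" with "Reason: OOMKilled" on the csi-resizer container
$ kubectl describe pod <ebs-csi-controller-pod> -n kube-system
# Watch per-container memory over time
$ kubectl top pod -n kube-system --containers | grep csi-resizer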
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 23, 2022
@gtxu
Contributor

gtxu commented May 23, 2022

Hi @bwmetcalf, usage exceeding 128Mi is abnormal. Can you describe the "other environments" and the environment where the issue occurs? Also, could you provide more details on the affected container by retrieving its logs (kubectl logs ebs-csi-controller-748586d5cc-9fbr5 -c csi-resizer -n kube-system), if convenient? Thanks!

@bwmetcalf
Author

bwmetcalf commented May 23, 2022

Thanks @gtxu. Here are the logs from one of the killed pods

$ kubectl logs ebs-csi-controller-6588dc7455-hzgdw -c csi-resizer -n kube-system
I0523 17:14:44.388430       1 main.go:83] Version : v
I0523 17:14:44.487283       1 connection.go:153] Connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock
I0523 17:14:44.487729       1 common.go:111] Probing CSI driver for readiness
I0523 17:14:44.489037       1 main.go:128] CSI driver name: "ebs.csi.aws.com"
I0523 17:14:44.489904       1 controller.go:117] Register Pod informer for resizer ebs.csi.aws.com
I0523 17:14:44.489936       1 controller.go:241] Starting external resizer ebs.csi.aws.com
I0523 17:14:44.490161       1 reflector.go:219] Starting reflector *v1.Pod (10m0s) from k8s.io/client-go/informers/factory.go:134
I0523 17:14:44.490158       1 reflector.go:219] Starting reflector *v1.PersistentVolume (10m0s) from k8s.io/client-go/informers/factory.go:134
I0523 17:14:44.586501       1 reflector.go:219] Starting reflector *v1.PersistentVolumeClaim (10m0s) from k8s.io/client-go/informers/factory.go:134

There is not much activity before it gets killed.

In terms of environment differences, this particular environment is one of three we have in GovCloud, and we have three in commercial. Other than running different services that we develop, the environments are exactly the same. It's not clear what additional information would help troubleshoot this.

@ghost

ghost commented Jun 1, 2022

@bwmetcalf or @gtxu I see that this is labeled as a bug, but was a workaround or temporary resolution found? We are experiencing the exact same issue, only in one environment, after deploying the EBS CSI driver add-on to that cluster for the first time. Manually increasing memory to 512Mi on the deployment allowed it to start successfully. Our logs look identical, there are no errors or useful messages relating to memory.

@gtxu
Contributor

gtxu commented Jun 1, 2022

Hi @ccs-warrior, can you specify the add-on version and k8s version?

@ghost

ghost commented Jun 1, 2022

@gtxu k8s 1.21 and add-on v1.6.0-eksbuild.1

@bwmetcalf
Author

bwmetcalf commented Jun 1, 2022

This just happened again for us. We are using Argo Workflows, and this issue somehow seems to be related to the number of Completed workflow pods that accumulate over time. Once the number of these pods grows past some threshold, csi-resizer starts this behavior. I just deleted all of the Completed workflow pods and csi-resizer came back healthy and is now using less memory.

I have no idea how this is correlated, but we've observed the above twice now; we are not resizing volumes.
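
For reference, a minimal sketch of the cleanup described above, assuming the completed workflow pods live in a single namespace (the namespace is a placeholder):

# List, then delete, pods that have run to completion
$ kubectl get pods -n <argo-namespace> --field-selector=status.phase=Succeeded
$ kubectl delete pods -n <argo-namespace> --field-selector=status.phase=Succeeded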

@bwmetcalf
Author

Per my comment above, the environment where we are having this issue with csi-resizer is the only environment where we are running argo workflows. I'm fairly confident this is the cause.

@bwmetcalf
Author

bwmetcalf commented Jun 2, 2022

@ccs-warrior The workaround for us is to delete the completed argo workflow pods that accumulate over time. We are looking at using https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/ once we upgrade to 1.23.
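
The linked TTL mechanism applies to Job objects (Argo Workflows also has its own ttlStrategy for workflow-level cleanup). A minimal sketch of a Job that is garbage-collected shortly after finishing; the name and image are placeholders:

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job                  # placeholder name
spec:
  ttlSecondsAfterFinished: 300       # Job and its pods are deleted 5 minutes after completion
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          command: ["sh", "-c", "echo done"]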

@gnufied
Contributor

gnufied commented Jun 2, 2022

Hmm. The resizer only watches pods if a certain flag is enabled on it. If someone can grab a heap dump while this happens, I can take a look.
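
A rough sketch of grabbing one, assuming the sidecar exposes a Go pprof endpoint over HTTP (whether profiling is enabled, and on which port, depends on the external-resizer flags in use, so treat the port and path below as placeholders):

# Forward the sidecar's HTTP port locally, then pull a heap profile
$ kubectl port-forward -n kube-system <ebs-csi-controller-pod> 8080:8080
$ go tool pprof http://localhost:8080/debug/pprof/heap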

@ghost

ghost commented Jun 3, 2022

@gnufied Unfortunately we don't have a way to replicate it any more. The cluster where we saw the issue was a pre-prod environment and so the addition of the EKS add-on was backed out as quickly as we could after seeing the failure. We are unable to replicate the same issue in any lower environments. It may be worth noting that, similar to @bwmetcalf, we also use Argo workflows in this cluster. We have other clusters where we also use Argo and are not seeing the error, but this cluster is much larger and uses Argo much more heavily.

@gtxu
Contributor

gtxu commented Jun 3, 2022

Hi @ccs-warrior, how many volumes are attached to pods in that cluster? I found a benchmark on sidecars here showing that ~3K volumes reach roughly 100Mi of memory for the resizer. While we are investigating this, we are raising the memory limits for the EKS Add-on in v1.6.2.

@woz5999

woz5999 commented Jun 3, 2022

bwmetcalf is off today, but for our use case, our clusters have very few volumes, around 10 or fewer each.

@ghost

ghost commented Jun 6, 2022

@gtxu there are only around 125 volumes in that cluster

@bhushanchirmade

We are seeing the same issue for another customer who is also using GovCloud: the EBS add-on resizer runs out of memory with 200+ PVCs.

@pascal-knapen

We see the same issue with the latest Helm-installed version as well. We raised the limit to 300Mi and it still gets OOMKilled.

@sunghospark-calm

Having the same issue here: EKS v1.22, csi-resizer v1.1.0.

@torredil
Member

torredil commented Jun 15, 2022

Hi all, can anyone confirm if they are experiencing this issue with the latest version of the driver? The latest add-on release bumps up the csi-resizer to v1.4.0 and also increases the memory limit.

@sunghospark-calm

@torredil It seems like the latest EKS add-on version (v1.6.2) is still pointing to csi-resizer v1.1.0: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/v1.6.2/charts/aws-ebs-csi-driver/values.yaml#L52

@torredil
Member

@sunghospark-calm Sorry, let me clarify: the EKS add-on version of the CSI driver released today is now using csi-resizer v1.4.0. These changes are available in our latest Helm chart as well.

@sunghospark-calm

@torredil no worries. Is there a way to test this as an EKS add-on? I still see v1.6.2 (v1.6.2-eksbuild.0) as the latest available version when I run aws eks describe-addon-versions.

@torredil
Member

@sunghospark-calm Yes! v1.6.2-eksbuild.0 is the latest version of the EKS addon and it is using csi-resizer v1.4.0. Let me know if you upgrade and still run into this problem, thanks.
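
One way to confirm which csi-resizer image a cluster is actually running, using the deployment and container names from earlier in this thread:

$ kubectl get deployment ebs-csi-controller -n kube-system \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="csi-resizer")].image}'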

@stijndehaes
Contributor

We have been running csi-resizer v1.4.0 for about a week and are still seeing this container go out of memory. We have one container with 62 restarts.

@stijndehaes
Contributor

I have opened a ticket on csi-resizer. I think the pod informer might be the culprit behind the memory blowing up: kubernetes-csi/external-resizer#211

@stijndehaes
Contributor

As we are not using the resizer, I would love to be able to turn off the handle-volume-inuse-error flag. Would you be open to a PR to the Helm charts that allows people to set extra flags on all the different components?

@bwmetcalf
Author

As we are not using the resizer, I would love to be able to turn off the handle-volume-inuse-error flag. Would you be open to a PR to the Helm charts that allows people to set extra flags on all the different components?

Unfortunately, this won't help when using the EKS add-on as the charts are not exposed.
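
For Helm users, a sketch of what such an override might look like if the chart exposed extra sidecar arguments (the additionalArgs key here is hypothetical, i.e. what the proposed PR would add; --handle-volume-inuse-error is an existing external-resizer flag):

# values.yaml override (hypothetical key)
sidecars:
  resizer:
    additionalArgs:
      - --handle-volume-inuse-error=false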

@bwmetcalf
Author

I'm not seeing anything in https://github.com/kubernetes-csi/external-resizer/releases/tag/v1.4.0 that would address the issue at hand. Have code changes been made to address this particular problem or did the latest EKS add-on release simply increase the memory limits?
