csi-resizer getting OOMKilled #1248
Comments
Hi @bwmetcalf, usage exceeding 128Mi is abnormal. Can you specify the "other environments" and the environment where the issue occurs? Also, can you provide more details on the affected container by retrieving its log?
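For example, the sidecar log can typically be pulled with something like the sketch below; the `kube-system` namespace, `ebs-csi-controller` deployment name, and `csi-resizer` container name are the usual add-on defaults and may differ in your cluster:

```sh
# Fetch logs from the csi-resizer sidecar of the EBS CSI controller.
# Use --previous to see the log of the container instance that was OOMKilled.
kubectl logs -n kube-system deployment/ebs-csi-controller -c csi-resizer --previous
```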
Thanks @gtxu. Here are the logs from one of the killed pods:
There is not much activity before it gets killed. In terms of environment differences, this particular environment is one of three we have in GovCloud, and we have three in commercial. Other than running different services that we develop, the environments are exactly the same. It's not clear what information would be helpful here for troubleshooting.
@bwmetcalf or @gtxu I see that this is labeled as a bug, but was a workaround or temporary resolution found? We are experiencing the exact same issue, only in one environment, after deploying the EBS CSI driver add-on to that cluster for the first time. Manually increasing memory to 512Mi on the deployment allowed it to start successfully. Our logs look identical; there are no errors or useful messages relating to memory.
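For anyone looking for the equivalent command, a minimal sketch of that manual bump is below; the namespace, deployment, and container names assume the add-on defaults, and the EKS add-on's reconciliation may revert the change (as noted in the issue description):

```sh
# Raise the memory limit on the csi-resizer sidecar in place.
# Names assume the add-on defaults; EKS may revert this on reconciliation.
kubectl -n kube-system set resources deployment/ebs-csi-controller \
  -c csi-resizer --limits=memory=512Mi
```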
Hi @ccs-warrior, can you specify the add-on version and the k8s version?
@gtxu k8s 1.21 and add-on v1.6.0-eksbuild.1
This just happened again for us. We are using Argo workflows and this issue somehow seems to be related to the number of completed workflow pods that accumulate in the cluster. I have no idea how this is correlated, but we've observed this twice now; we are not resizing volumes.
Per my comment above, the environment where we are having this issue …
@ccs-warrior The workaround for us is to delete the completed Argo workflow pods that accumulate over time. We are looking at using https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/ once we upgrade to 1.23.
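As an illustration of that cleanup (a sketch, not necessarily the exact commands used here; the `argo` namespace is a placeholder), completed workflow pods can be removed with a field selector:

```sh
# Delete pods left behind by finished workflows.
# Replace "argo" with the namespace your workflows run in.
kubectl -n argo delete pods --field-selector=status.phase==Succeeded

# Failed workflow pods can be cleaned up the same way if desired.
kubectl -n argo delete pods --field-selector=status.phase==Failed
```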
Hmm, the resizer only watches pods if a certain flag is enabled on it. If someone can grab a heap dump while this happens, I can take a look.
@gnufied Unfortunately we don't have a way to replicate it any more. The cluster where we saw the issue was a pre-prod environment, so the addition of the EKS add-on was backed out as quickly as we could after seeing the failure. We are unable to replicate the same issue in any lower environments. It may be worth noting that, similar to @bwmetcalf, we also use Argo workflows in this cluster. We have other clusters where we also use Argo and are not seeing the error, but this cluster is much larger and uses Argo much more heavily.
Hi @ccs-warrior, how many volumes are attached to pods in that cluster? I found a benchmark on sidecars here showing that around 3K volumes reach roughly 100Mi of memory for the resizer. While we are investigating this, we are raising the limits for the EKS add-on in v1.6.2.
bwmetcalf is off today, but for our use case, our clusters have very few volumes; roughly 10 or fewer each.
@gtxu there are only around 125 volumes in that cluster.
We are seeing the same issue for another customer who is also using GovCloud: the EBS add-on's resizer runs out of memory with 200+ PVCs.
We see the same issue with the latest Helm-installed version as well. We set the limit to 300Mi and it still gets OOMKilled.
Having the same issue here: EKS v1.22, csi-resizer v1.1.0.
Hi all, can anyone confirm whether they are still experiencing this issue with the latest version of the driver? The latest add-on release bumps csi-resizer up to v1.4.0.
@torredil It seems like the latest with EKS (v1.6.2) is still pointing to csi-resizer v1.1.0: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/v1.6.2/charts/aws-ebs-csi-driver/values.yaml#L52
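For Helm installs, one way to try a newer resizer ahead of a chart bump is to override the sidecar image. This is a sketch that assumes the chart exposes the image under `sidecars.resizer.image` and that the release and repo are both named `aws-ebs-csi-driver`; verify against the values.yaml linked above for your chart version:

```sh
# Sketch: override the csi-resizer sidecar image via Helm values.
# Assumes the chart exposes sidecars.resizer.image; check your chart's values.yaml.
helm upgrade aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system --reuse-values \
  --set sidecars.resizer.image.repository=k8s.gcr.io/sig-storage/csi-resizer \
  --set sidecars.resizer.image.tag=v1.4.0
```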
@sunghospark-calm Sorry, let me clarify: the EKS add-on version of the CSI driver that was released today is now using csi-resizer v1.4.0.
@torredil No worries. Is there a way to test this as an EKS add-on? I still see v1.6.2 (v1.6.2-eksbuild.0) as the latest available version when I run …
@sunghospark-calm Yes!
We have been running csi-resizer …
I have opened a ticket on csi-resizer; I think the pod informer might be the culprit of the memory blow-up: kubernetes-csi/external-resizer#211
As we are not using the resizer, I would love to be able to turn off the flag …
Unfortunately, this won't help when using the EKS add-on, as the chart values are not exposed.
I'm not seeing anything in https://github.com/kubernetes-csi/external-resizer/releases/tag/v1.4.0 that would address the issue at hand. Have code changes been made to address this particular problem, or did the latest EKS add-on release simply increase the memory limits?
/kind bug
What happened?
We are using version v1.6.0-eksbuild.1 of the aws-ebs-csi-driver add-on. We have one environment where the csi-resizer container is using substantially more memory than elsewhere and getting OOMKilled. Since the memory limit setting cannot be controlled through the add-on, we have no way of increasing it. We can edit the deployment live, but the change gets reverted by EKS. The real issue, however, is that this container is using so much memory:
Our other environments look like:
The memory settings for csi-resizer are:
These settings are baked into the chart.
What you expected to happen?
csi-resizer should not be using so much memory.
How to reproduce it (as minimally and precisely as possible)?
This is unclear, but we would welcome pointers on how to troubleshoot.
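One basic check (a sketch; the namespace and label assume the add-on defaults, and metrics-server must be installed) is to watch per-container memory usage of the controller pods over time:

```sh
# Per-container memory usage for the EBS CSI controller pods.
# Requires metrics-server; label/namespace assume the add-on defaults.
kubectl top pod -n kube-system -l app=ebs-csi-controller --containers
```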
Anything else we need to know?:
This is occurring even though we aren't resizing any volumes.
Environment
- Kubernetes version (use kubectl version): 1.22