
Taint removal race condition #1491

Closed
ricardov1 opened this issue Oct 28, 2024 · 11 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ricardov1

/kind bug

What happened?
When spinning up an instance in a cluster where both the ebs-csi-driver and the efs-csi-driver are deployed, occasionally (about 10% of the time) one of the two taints is not removed, and the node is unable to accept any pods because of the lingering taint.

What you expected to happen?
After the daemonset starts up successfully, the taint should always be removed.

How to reproduce it (as minimally and precisely as possible)?
This appears to be caused by a race condition between the efs and ebs daemonsets (described below), so it is difficult to reproduce. Rather than repeatedly launching and relaunching a single instance, the best way to increase the likelihood of hitting the issue is to spin up many (50+) instances on which both daemonsets run simultaneously at instance startup. If you encounter the race condition on one of those nodes, the node will still have either the efs or ebs taint on it, and the pod for the corresponding lingering taint will be missing the Removed taint(s) from local node log. (A small client-go sketch for spotting such nodes follows below.)
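To make that sweep easier, here is a hypothetical standalone helper, not part of either driver; it uses client-go, and the structure and names are my own. It lists nodes that still carry either agent-not-ready startup taint:

```go
// lingering-taints: a small diagnostic sketch (not part of either driver) that
// lists nodes still carrying an agent-not-ready startup taint after boot.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Resolve the default kubeconfig (KUBECONFIG env var or ~/.kube/config).
	config, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
		clientcmd.NewDefaultClientConfigLoadingRules(),
		&clientcmd.ConfigOverrides{},
	).ClientConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Flag any node that still carries one of the CSI startup taints.
	for _, node := range nodes.Items {
		for _, taint := range node.Spec.Taints {
			if taint.Key == "ebs.csi.aws.com/agent-not-ready" || taint.Key == "efs.csi.aws.com/agent-not-ready" {
				fmt.Printf("node %s still has taint %s\n", node.Name, taint.Key)
			}
		}
	}
}
```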

Anything else we need to know?:
This issue appears to be caused by a race condition where the daemonset incorrectly determines that the efs.csi.aws.com/agent-not-ready taint has already been removed. This block fetches the taints on the node and compares the length of that list with the length of taintsToRemove to decide whether the taint-removal step is still needed. When we hit the issue, we suspect the ebs-csi-driver "beats" the efs-csi-driver to its taint removal, so when the slice lengths are compared they are equal even though it was the ebs.csi.aws.com/agent-not-ready taint that was removed, not the efs.csi.aws.com/agent-not-ready taint.

related issue on ebs-csi-driver repo: kubernetes-sigs/aws-ebs-csi-driver#2199
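For illustration only, this is a simplified sketch of the suspected failure mode described above, not the driver's actual code; the function names and the taintCountBefore parameter are invented. It contrasts a length-only check with a key-based check:

```go
// Package taintcheck sketches why a length-only comparison is race-prone:
// if another agent (e.g. the ebs-csi-driver) removes its own taint concurrently,
// the node's taint count drops by the expected amount and the efs taint removal
// is skipped even though the efs taint is still present.
package taintcheck

import corev1 "k8s.io/api/core/v1"

const efsAgentNotReadyKey = "efs.csi.aws.com/agent-not-ready" // the taint key this driver owns

// removalLooksDoneByLength mirrors the suspected check: it declares the removal
// done once the node has shed as many taints as we planned to remove, without
// verifying which keys actually disappeared.
func removalLooksDoneByLength(currentTaints, taintsToRemove []corev1.Taint, taintCountBefore int) bool {
	return len(currentTaints) <= taintCountBefore-len(taintsToRemove)
}

// efsTaintStillPresent is the key-based check a fix would rely on: only skip the
// removal step once this driver's own taint key is genuinely gone from the node.
func efsTaintStillPresent(currentTaints []corev1.Taint) bool {
	for _, t := range currentTaints {
		if t.Key == efsAgentNotReadyKey {
			return true
		}
	}
	return false
}
```

A fix along these lines would re-check the node's current taints by key right before patching, instead of relying on the length comparison alone.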

Environment

  • Kubernetes version (use kubectl version): 1.29.8
  • Driver version: 1.7.5

Please also attach debug logs to help us better diagnose

  • Instructions to gather debug logs can be found here
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 28, 2024
@mskanth972
Contributor

Hello @ricardov1, it appears you're using version 1.7.5, which contains a known issue where taints aren't properly removed. This problem was addressed in a PR that was merged into version 2.0.6. Would you mind upgrading to the most recent version and seeing if that resolves your issue?

@ricardov1
Author

@mskanth972 thanks! I'll try upgrading. Has a similar fix been implemented for the ebs-csi-driver? I've also seen this behavior there.

@mskanth972
Contributor

mskanth972 commented Nov 5, 2024

It should be; they have shipped many taint-related fixes in the latest versions.
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/CHANGELOG.md

@mskanth972
Contributor

Closing the issue; please feel free to reopen if you are still seeing it.

@mamoit

mamoit commented Nov 7, 2024

This morning we upgraded both the EBS and EFS helm charts in all our clusters to the latest versions.
A couple of hours later there are still nodes that come up and are stuck with either the ebs or the efs CSI taint.

@mamoit

mamoit commented Nov 11, 2024

/reopen

@k8s-ci-robot
Contributor

@mamoit: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@treyhyde

We still see the same issue with 2.1.1.

@mamoit

mamoit commented Dec 18, 2024

@treyhyde please check my comment here.

We were running into this issue because karpenter was running out of memory when scaling aggressively.
Try throwing some more memory at it to see if the issue persists.
Even if it doesn't really fix the issue, it avoids it quite well.

@treyhyde

> @treyhyde please check my comment here.
>
> We were running into this issue because karpenter was running out of memory when scaling aggressively. Try throwing some more memory at it to see if the issue persists. Even if it doesn't really fix the issue, it avoids it quite well.

Thanks, I did check; we have complete stability with karpenter and no OOM issues. I don't see how karpenter could even get into the loop between the node and the efs driver when clearing the taint, except perhaps via slow finalizers/webhooks? Our nodes get 3 startup taints (ebs, efs, and istio), and only the efs taint fails to clear.

@mamoit

mamoit commented Dec 18, 2024

We noticed that karpenter was the issue when we analyzed the cluster audit logs.
The CSI drivers (ebs and efs) were removing their respective taints once they finished initializing, but karpenter then came along and set the taints back. So dig through your audit logs just in case.
I've also read that karpenter can easily be bottlenecked on CPU; I haven't experienced that myself, but as long as you're not running it on a CPU-credit (burstable) instance you should be good.
