Taint removal race condition #1491
Comments
Hello @ricardov1, it appears you're using version 1.7.5, which contains a known issue where taints aren't properly removed. This problem was addressed in a PR that was merged into version 2.0.6. Would you mind upgrading to the most recent version and seeing if that resolves your issue?
@mskanth972 thanks! I'll try upgrading. Has a similar fix been implemented for the ebs-csi-driver? I've also seen this behavior there.
It should be; they have several taint-related fixes in the latest versions.
Closing the issue; please feel free to reopen if you are still seeing it.
This morning we upgraded both the EBS and EFS helm charts in all our clusters to the latest versions.
/reopen |
@mamoit: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this: /reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
We still see this same issue with 2.1.1.
@treyhyde please check my comment here. We were running into this issue because karpenter was running out of memory when scaling aggressively.
Thanks, I did check; we have complete stability with karpenter and no OOM issues. I don't see how karpenter could even get into the loop between the node and the efs driver when clearing the taint, except perhaps via slow finalizers/webhooks? Our nodes get 3 startup taints (ebs, efs and istio), and only the efs taint fails to clear.
We noticed that karpenter was the issue when we analyzed the cluster audit logs.
/kind bug
What happened?
When spinning up an instance in a cluster where both the ebs-csi-driver and the efs-csi-driver are deployed, occasionally (about 10% of the time) one of the two taints is not removed, and the node is unable to accept any pods due to the lingering taint.
What you expected to happen?
After successful daemonset startup, the taint should always be removed.
How to reproduce it (as minimally and precisely as possible)?
This appears to be caused by a race condition between the efs and ebs daemonsets (described below), so it is difficult to reproduce. The best way to increase the likelihood of reproducing it, rather than launching and relaunching a single instance, is to spin up many (50+) instances where both daemonsets run simultaneously on instance startup. If you hit the race condition on one of those nodes, the node will be left with either the efs or ebs taint, and the pod for the corresponding lingering taint will be missing the `Removed taint(s) from local node` log line.
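To spot affected nodes, a quick check like the following can help (a hedged sketch, not an official diagnostic; swap in `ebs.csi.aws.com/agent-not-ready` if the ebs taint is the one lingering):

```sh
# List each node together with the taint keys it still carries; a node that is
# long past startup but still shows efs.csi.aws.com/agent-not-ready is affected.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'
```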
Anything else we need to know?:
This issue appears to be caused by a race condition in which the daemonset incorrectly determines that the `efs.csi.aws.com/agent-not-ready` taint has been removed. This block fetches the taints on the node and compares the length of the node's taint list with the length of `taintsToRemove` to decide whether the taint-removal step is still needed. When we run into this issue, we suspect that the ebs-csi-driver "beats" the efs-csi-driver to the taint removal, so when the lengths of the slices are compared they are equal, even though it was the `ebs.csi.aws.com/agent-not-ready` taint that was removed, not the `efs.csi.aws.com/agent-not-ready` taint.

Related issue on the ebs-csi-driver repo: kubernetes-sigs/aws-ebs-csi-driver#2199
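For illustration only, here is a minimal Go sketch of the suspected failure mode and of a key-based check that avoids it. This is not the driver's actual code; the function and variable names are assumptions, and only the taint keys come from the issue above. The point is that looking a taint up by key stays correct even if another CSI driver removes its own taint in between, whereas comparing slice lengths cannot tell which taint disappeared.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// Taint keys as described in this issue.
const (
	efsAgentNotReady = "efs.csi.aws.com/agent-not-ready"
	ebsAgentNotReady = "ebs.csi.aws.com/agent-not-ready"
)

// hasTaint reports whether the node still carries a taint with the given key.
// Checking by key is robust against the race described above: even if the
// ebs-csi-driver removes its own taint between the node read and this check,
// the function only returns false once the efs taint itself is gone. A length
// comparison (node taint count vs. len(taintsToRemove)) cannot tell *which*
// taint disappeared, which is the suspected bug.
func hasTaint(node *corev1.Node, key string) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == key {
			return true
		}
	}
	return false
}

func main() {
	// Scenario from the issue: the ebs taint has already been removed by the
	// ebs-csi-driver, but the efs taint is still on the node.
	node := &corev1.Node{
		Spec: corev1.NodeSpec{
			Taints: []corev1.Taint{
				{Key: efsAgentNotReady, Effect: corev1.TaintEffectNoExecute},
			},
		},
	}

	// A key-based check still sees that removal work remains.
	fmt.Println(hasTaint(node, efsAgentNotReady)) // true
	fmt.Println(hasTaint(node, ebsAgentNotReady)) // false
}
```

This is only meant to make the suspected race easier to see; the actual fix belongs in the driver's taint-removal path.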
Environment
Kubernetes version (use `kubectl version`): 1.29.8