
Taint removal race condition #1491

Closed
ricardov1 opened this issue Oct 28, 2024 · 11 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ricardov1

/kind bug

What happened?
When spinning up an instance in a cluster where both the ebs-csi-driver and the efs-csi-driver are deployed, occasionally (about 10% of the time) one of the two taints is not removed, and the node is unable to accept any pods because of the lingering taint.

What you expected to happen?
After the daemonset starts up successfully, the taint should always be removed.

How to reproduce it (as minimally and precisely as possible)?
This appears to be caused by a race condition between the efs and ebs daemonsets (described below), so it is difficult to reproduce. Rather than repeatedly launching and relaunching a single instance, the best way to increase the likelihood of hitting the issue is to spin up many (50+) instances on which both daemonsets run simultaneously at instance startup. If you encounter the race condition on one of those nodes, the node will still have either the efs or ebs taint on it, and the pod for the corresponding lingering taint will be missing the Removed taint(s) from local node log. (A small client-go sketch for spotting such nodes follows below.)
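To make that sweep easier, here is a hypothetical standalone helper, not part of either driver; it uses client-go, and the structure and names are my own. It lists nodes that still carry either agent-not-ready startup taint:

```go
// lingering-taints: a small diagnostic sketch (not part of either driver) that
// lists nodes still carrying an agent-not-ready startup taint after boot.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Resolve the default kubeconfig (KUBECONFIG env var or ~/.kube/config).
	config, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
		clientcmd.NewDefaultClientConfigLoadingRules(),
		&clientcmd.ConfigOverrides{},
	).ClientConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Flag any node that still carries one of the CSI startup taints.
	for _, node := range nodes.Items {
		for _, taint := range node.Spec.Taints {
			if taint.Key == "ebs.csi.aws.com/agent-not-ready" || taint.Key == "efs.csi.aws.com/agent-not-ready" {
				fmt.Printf("node %s still has taint %s\n", node.Name, taint.Key)
			}
		}
	}
}
```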

Anything else we need to know?:
This issue appears to be caused by a race condition where the daemonset incorrectly determines that the efs.csi.aws.com/agent-not-ready taint has already been removed. This block fetches the taints on the node and compares the length of that list with the length of taintsToRemove to decide whether the taint-removal step is still needed. When we hit the issue, we suspect the ebs-csi-driver "beats" the efs-csi-driver to its taint removal, so when the slice lengths are compared they are equal even though it was the ebs.csi.aws.com/agent-not-ready taint that was removed, not the efs.csi.aws.com/agent-not-ready taint.

related issue on ebs-csi-driver repo: kubernetes-sigs/aws-ebs-csi-driver#2199
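For illustration only, this is a simplified sketch of the suspected failure mode described above, not the driver's actual code; the function names and the taintCountBefore parameter are invented. It contrasts a length-only check with a key-based check:

```go
// Package taintcheck sketches why a length-only comparison is race-prone:
// if another agent (e.g. the ebs-csi-driver) removes its own taint concurrently,
// the node's taint count drops by the expected amount and the efs taint removal
// is skipped even though the efs taint is still present.
package taintcheck

import corev1 "k8s.io/api/core/v1"

const efsAgentNotReadyKey = "efs.csi.aws.com/agent-not-ready" // the taint key this driver owns

// removalLooksDoneByLength mirrors the suspected check: it declares the removal
// done once the node has shed as many taints as we planned to remove, without
// verifying which keys actually disappeared.
func removalLooksDoneByLength(currentTaints, taintsToRemove []corev1.Taint, taintCountBefore int) bool {
	return len(currentTaints) <= taintCountBefore-len(taintsToRemove)
}

// efsTaintStillPresent is the key-based check a fix would rely on: only skip the
// removal step once this driver's own taint key is genuinely gone from the node.
func efsTaintStillPresent(currentTaints []corev1.Taint) bool {
	for _, t := range currentTaints {
		if t.Key == efsAgentNotReadyKey {
			return true
		}
	}
	return false
}
```

A fix along these lines would re-check the node's current taints by key right before patching, instead of relying on the length comparison alone.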

Environment

  • Kubernetes version (use kubectl version): 1.29.8
  • Driver version: 1.7.5

Please also attach debug logs to help us better diagnose

  • Instructions to gather debug logs can be found here
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 28, 2024
@mskanth972
Contributor

Hello @ricardov1, it appears you're using version 1.7.5, which contains a known issue where taints aren't properly removed. This problem was addressed in a PR that was merged into version 2.0.6. Would you mind upgrading to the most recent version and seeing if that resolves your issue?

@ricardov1
Author

@mskanth972 thanks! I'll try upgrading. Has a similar fix been implemented for the ebs-csi-driver? I've also seen this behavior there.

@mskanth972
Contributor

mskanth972 commented Nov 5, 2024

It should be; they have shipped many taint-related fixes in the latest versions.
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/CHANGELOG.md

@mskanth972
Contributor

Closing the issue; please feel free to reopen if you are still seeing it.

@mamoit

mamoit commented Nov 7, 2024

This morning we upgraded both the EBS and EFS helm charts in all our clusters to the latest versions.
A couple of hours later there are still nodes that come up and are stuck with either the ebs or the efs CSI taint.

@mamoit

mamoit commented Nov 11, 2024

/reopen

@k8s-ci-robot
Contributor

@mamoit: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@treyhyde

We still see the same issue with 2.1.1.

@mamoit

mamoit commented Dec 18, 2024

@treyhyde please check my comment here.

We were running into this issue because karpenter was running out of memory when scaling aggressively.
Try throwing some more memory at it to see if the issue persists.
Even if it doesn't really fix the issue, it avoids it quite well.

@treyhyde

> @treyhyde please check my comment here.
>
> We were running into this issue because karpenter was running out of memory when scaling aggressively. Try throwing some more memory at it to see if the issue persists. Even if it doesn't really fix the issue, it avoids it quite well.

Thanks, I did check; we have complete stability with karpenter and no OOM issues. I don't see how karpenter could even get into the loop between the node and the efs driver when clearing the taint, except perhaps via slow finalizers/webhooks? Our nodes get 3 startup taints (ebs, efs, and istio), and only the efs taint fails to clear.

@mamoit

mamoit commented Dec 18, 2024

We noticed that karpenter was the issue when we analyzed the cluster audit logs.
The CSI drivers (ebs and efs) were removing their respective taints once they finished initializing, but karpenter then came along and set the taints back. So dig through your audit logs just in case.
I've also read that karpenter can easily be bottlenecked on CPU; I haven't experienced that myself, but as long as you're not running it on a CPU-credit (burstable) instance you should be good.
