Taint removal race condition #2199
Comments
Hi @ricardov1, thank you for the detailed bug report 🙂 I've added a task to our backlog to reproduce this behavior using a supported version of the EBS CSI driver, as there have been several enhancements in this code path since then. As I understand the issue, it should be reproducible even in the latest version of the driver. If you are able to upgrade and have the means to easily repro it in your environment, please provide the logs for the relevant CSI node pod.
Hey @torredil, thanks for the quick response! I can't easily upgrade and reproduce, but I do have some saved debug logs from an occasion where we ran into this issue. In that instance it was the efs-csi-driver that was unable to remove the taint, not the ebs-csi-driver, so these are efs-csi-driver logs. After taking another look at those logs, it looks like the GET node API call made right before the taints are removed fails to return all of the node attributes, among them the node's taints. I haven't seen this behavior before, and it's strange that the response still returns a 200 status. A retry of the API call when the expected attributes are missing might help here.
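For illustration only, here is a minimal sketch of what such a retry around the node GET could look like, assuming client-go; the helper names (`hasStartupTaint`, `getNodeWithRetry`) are hypothetical and not part of either driver:

```go
// Hypothetical sketch: re-fetch the node a few times if the response does not
// contain the startup taint we intend to remove, instead of trusting a single
// GET that returned 200 but incomplete attributes.
package example

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

const agentNotReadyTaintKey = "ebs.csi.aws.com/agent-not-ready"

// hasStartupTaint reports whether the node still carries the startup taint.
func hasStartupTaint(node *corev1.Node) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == agentNotReadyTaintKey {
			return true
		}
	}
	return false
}

// getNodeWithRetry re-fetches the node until the startup taint shows up in the
// response or the timeout elapses (at which point the taint may genuinely be gone).
func getNodeWithRetry(ctx context.Context, client kubernetes.Interface, nodeName string) (*corev1.Node, error) {
	var node *corev1.Node
	err := wait.PollUntilContextTimeout(ctx, 2*time.Second, 30*time.Second, true,
		func(ctx context.Context) (bool, error) {
			n, getErr := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
			if getErr != nil {
				return false, nil // transient API error, retry
			}
			node = n
			return hasStartupTaint(n), nil
		})
	if err != nil {
		// Timeout is not treated as fatal: the taint may already have been removed.
		fmt.Printf("did not observe %s on node %s before timeout\n", agentNotReadyTaintKey, nodeName)
	}
	return node, nil
}
```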
It looks like this issue was fixed in the efs-csi-driver; has there been a similar fix in this driver?
@ricardov1 Yes, as I mentioned, there have been several enhancements in this code path since then.
This morning we upgraded both the EBS and EFS helm charts in all our clusters to the latest versions.
Hi, based on our testing, the issue described here is caused by a bug where Karpenter re-applies the taint from the EBS CSI Driver after the driver has already removed it. This bug was already fixed on the Karpenter side (fix: kubernetes-sigs/karpenter#1658, cherry-pick: kubernetes-sigs/karpenter#1705), but the fix was not included until a later Karpenter release. Please upgrade to a Karpenter version that includes the fix.
After checking the audit logs of a random affected node in one of our clusters, I can see that Karpenter re-applied the taint after the driver had removed it. So this is not an ebs nor efs csi-driver issue, but instead a karpenter issue.
I am also facing the same issue. If I delete the ebs-csi-driver pod, the new ebs-csi pod shows an error in its logs on startup. I am using both Karpenter-managed nodes and regular worker nodes; the ebs-csi-driver pods running on both kinds of nodes show the same error.
Just for the record, in case it helps someone (@abhiverma001, maybe it can help you): our issue was that karpenter was running out of memory when the cluster scaled aggressively. For reference, we're currently running karpenter with 4GB of RAM; it uses at most 2.5GB when the cluster scales up, but we want a lot of headroom (the RAM is cheaper than the stuck instances).
/close
@mamoit thank you for chiming in with your RAM-related suggestions. Closing this issue to consolidate the discussion to kubernetes-sigs/karpenter's issue: kubernetes-sigs/karpenter#1772. Alternatively, the EBS/EFS CSI drivers wouldn't need this startup taint at all if kubernetes/kubernetes#95911 is fixed. Will discuss with my team whether we can spend some time tackling that issue.
@AndrewSirenko: Closing this issue.
/kind bug
What happened?
When spinning up an instance in a cluster where both the ebs-csi-driver and the efs-csi-driver are deployed, occasionally (about 10% of the time) one of the two taints is not removed, and the node is unable to accept any pods due to the lingering taint.
What you expected to happen?
After successful daemonset startup, the taint should always be removed.
How to reproduce it (as minimally and precisely as possible)?
This appears to be caused by a race condition between the efs and ebs daemonsets (described below), so it is difficult to reproduce. The best way to increase the likelihood of reproducing the issue, rather than launching and relaunching a single instance, is to spin up many (50+) instances where both daemonsets run simultaneously on instance startup. If you encounter the race condition on one of those nodes, the node will still have either the efs or ebs taint on it, and the pod for the corresponding lingering taint will be missing the `Removed taint(s) from local node` log.
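For anyone wanting to spot affected nodes after a large scale-up, a rough client-go sketch follows; the helper name and structure are illustrative and not part of either driver:

```go
// Hypothetical helper: list nodes that still carry either CSI startup taint,
// which is the symptom of the race described in this issue.
package example

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Startup taints applied by the two drivers, as referenced in this issue.
var startupTaintKeys = map[string]bool{
	"ebs.csi.aws.com/agent-not-ready": true,
	"efs.csi.aws.com/agent-not-ready": true,
}

// printNodesWithLingeringTaints prints each node that still has one of the
// startup taints after the drivers should have removed them.
func printNodesWithLingeringTaints(ctx context.Context, client kubernetes.Interface) error {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, node := range nodes.Items {
		for _, taint := range node.Spec.Taints {
			if startupTaintKeys[taint.Key] {
				fmt.Printf("node %s still has taint %s\n", node.Name, taint.Key)
			}
		}
	}
	return nil
}
```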
Anything else we need to know?:
This issue appears to be caused by a race condition where the daemonset incorrectly determines that the `ebs.csi.aws.com/agent-not-ready` taint has been removed. This block fetches the taints on the node and uses the length of the list of taints on the node and the length of `taintsToRemove` to determine whether the step of removing the taint is needed. When we run into this issue, we suspect that the efs-csi-driver "beats" the ebs-csi-driver to the taint removal, and when the lengths of the slices are compared they are equal, even though it was the `efs.csi.aws.com/agent-not-ready` taint that was removed, not the `ebs.csi.aws.com/agent-not-ready` taint.

Related issue on the efs-csi-driver repo: kubernetes-sigs/aws-efs-csi-driver#1491
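To illustrate the suspected failure mode, here is a simplified sketch (not the driver's actual code; function names are made up for this example) of why a length-only comparison is ambiguous while a key-based check is not:

```go
// Simplified sketch of the suspected race; not the driver's actual code.
package example

import corev1 "k8s.io/api/core/v1"

const ebsTaintKey = "ebs.csi.aws.com/agent-not-ready"

// removalLooksDoneByLength mimics a length-only comparison: if the node now
// reports as many taints as expected after removal, assume the removal already
// happened. If the EFS driver removed efs.csi.aws.com/agent-not-ready in the
// meantime, the counts can match even though the EBS taint is still present.
func removalLooksDoneByLength(nodeTaints []corev1.Taint, expectedCount int) bool {
	return len(nodeTaints) == expectedCount
}

// removalDoneByKey checks for the specific taint key instead, which is not
// fooled by another driver removing its own, different taint.
func removalDoneByKey(nodeTaints []corev1.Taint) bool {
	for _, t := range nodeTaints {
		if t.Key == ebsTaintKey {
			return false // EBS startup taint still present; removal still needed
		}
	}
	return true
}
```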
Environment
Kubernetes version (use `kubectl version`): 1.29.8