What happened?
We use Karpenter as the cluster node autoscaler. We recently implemented the startup taint for the EFS CSI driver. Since then we have occasionally seen nodes stuck with unschedulable pods. Upon investigation, this is caused by the efs.csi.aws.com/agent-not-ready:NoExecute taint still being present on the node despite the efs-csi-node pod running correctly.
The efs-plugin container has this log line:
E0419 06:03:32.149207 1 driver.go:134] "Unexpected failure when attempting to remove node taint(s)" err="the server rejected our request due to an error in our request"
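For context, a common pattern for CSI node plugins is to read the node, filter out their startup taint, and send a JSON patch that carries a "test" precondition on the full taint list. I have not verified that this driver does exactly that, but if it does, a concurrent taint change (such as node-controller adding node.kubernetes.io/not-ready:NoExecute) between the Get and the Patch would make the test op fail, and the API server would reject the patch with an error much like the one above. A minimal sketch of that pattern, with hypothetical names, not the driver's actual code:

```go
// Hypothetical sketch (not the driver's actual code) of removing a startup
// taint with a JSON patch that has a "test" precondition on the taint list.
package taintsketch

import (
	"context"
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

const agentNotReadyTaintKey = "efs.csi.aws.com/agent-not-ready"

type jsonPatchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

func removeAgentNotReadyTaint(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// Keep every taint except the driver's startup taint.
	keep := []corev1.Taint{}
	for _, t := range node.Spec.Taints {
		if t.Key != agentNotReadyTaintKey {
			keep = append(keep, t)
		}
	}

	patch, err := json.Marshal([]jsonPatchOp{
		// Precondition: the taints must still be exactly what we just read.
		// If node-controller adds node.kubernetes.io/not-ready in between,
		// this op fails and the API server rejects the whole patch.
		{Op: "test", Path: "/spec/taints", Value: node.Spec.Taints},
		{Op: "replace", Path: "/spec/taints", Value: keep},
	})
	if err != nil {
		return err
	}

	_, err = cs.CoreV1().Nodes().Patch(ctx, nodeName, types.JSONPatchType, patch, metav1.PatchOptions{})
	if err != nil {
		return fmt.Errorf("removing %s taint: %w", agentNotReadyTaintKey, err)
	}
	return nil
}
```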
What you expected to happen? The efs-csi-node pod should successfully remove the node taint so that other pods can be scheduled.
How to reproduce it (as minimally and precisely as possible)?
We don't currently know. It's a rare and intermittent problem.
Anything else we need to know?:
I managed to find the failed request in the audit logs. At the same time, a request from node-controller to add the taint node.kubernetes.io/not-ready:NoExecute was being processed. This looks like a classic race condition with other parts of the system, although this is a sample size of 1. I attach the two audit-log entries: efs-csi.json and node-controller.json.
This is the opposite of #1273. That issue complains that the taint is removed too early, before the service is actually ready; this issue is about the startup taint sometimes not being removed at all because of a race condition.
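If this really is a lost race, one possible mitigation would be to retry the removal whenever the API server rejects the patch because the taints changed underneath it. A minimal sketch, continuing the hypothetical removeAgentNotReadyTaint function above (again, not the driver's current behavior):

```go
// Hypothetical retry wrapper: re-run the taint removal when the rejection
// looks like a concurrent modification (422 Invalid from a failed "test" op,
// or 409 Conflict), rebuilding the precondition from a fresh Get each time.
package taintsketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

func removeAgentNotReadyTaintWithRetry(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	return retry.OnError(retry.DefaultBackoff,
		func(err error) bool {
			// Only retry errors that suggest someone else changed the taints.
			return apierrors.IsInvalid(err) || apierrors.IsConflict(err)
		},
		func() error {
			// removeAgentNotReadyTaint re-reads the node on every attempt,
			// so the patch precondition is rebuilt against the latest taints.
			return removeAgentNotReadyTaint(ctx, cs, nodeName)
		})
}
```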
/kind bug
Environment
Kubernetes version (use kubectl version): v1.29.1-eks-b9c9ed7
Driver version: v1.7.6 (EKS add-on v1.7.6-eksbuild.2)
Please also attach debug logs to help us better diagnose:
results.tgz