What happened?
We use Karpenter as the cluster node autoscaler. We recently implemented the startup taint for the EFS CSI driver. Since then we have occasionally seen nodes stuck with unschedulable pods. Upon investigation, this is caused by the efs.csi.aws.com/agent-not-ready:NoExecute taint still being present on the node despite the efs-csi-node pod running correctly.
The efs-plugin container has this log line:
E0419 06:03:32.149207 1 driver.go:134] "Unexpected failure when attempting to remove node taint(s)" err="the server rejected our request due to an error in our request"
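For context, a common pattern for CSI node plugins is to read the node, filter out their startup taint, and send a JSON patch that carries a "test" precondition on the full taint list. I have not verified that this driver does exactly that, but if it does, a concurrent taint change (such as node-controller adding node.kubernetes.io/not-ready:NoExecute) between the Get and the Patch would make the test op fail, and the API server would reject the patch with an error much like the one above. A minimal sketch of that pattern, with hypothetical names, not the driver's actual code:

```go
// Hypothetical sketch (not the driver's actual code) of removing a startup
// taint with a JSON patch that has a "test" precondition on the taint list.
package taintsketch

import (
	"context"
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

const agentNotReadyTaintKey = "efs.csi.aws.com/agent-not-ready"

type jsonPatchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

func removeAgentNotReadyTaint(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// Keep every taint except the driver's startup taint.
	keep := []corev1.Taint{}
	for _, t := range node.Spec.Taints {
		if t.Key != agentNotReadyTaintKey {
			keep = append(keep, t)
		}
	}

	patch, err := json.Marshal([]jsonPatchOp{
		// Precondition: the taints must still be exactly what we just read.
		// If node-controller adds node.kubernetes.io/not-ready in between,
		// this op fails and the API server rejects the whole patch.
		{Op: "test", Path: "/spec/taints", Value: node.Spec.Taints},
		{Op: "replace", Path: "/spec/taints", Value: keep},
	})
	if err != nil {
		return err
	}

	_, err = cs.CoreV1().Nodes().Patch(ctx, nodeName, types.JSONPatchType, patch, metav1.PatchOptions{})
	if err != nil {
		return fmt.Errorf("removing %s taint: %w", agentNotReadyTaintKey, err)
	}
	return nil
}
```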
What you expected to happen? The efs-csi-node pod should successfully remove the node taint so that other pods can be scheduled.
How to reproduce it (as minimally and precisely as possible)?
We don't currently know. It's a rare and intermittent problem.
Anything else we need to know?:
I managed to find the failed request in the audit logs. At the same time, a request from node-controller to add the taint node.kubernetes.io/not-ready:NoExecute was being processed. This looks like a classic race condition with other parts of the system, although this is a sample size of 1. I attach the two audit-log entries: efs-csi.json and node-controller.json.
This is the opposite of #1273. That issue complains that the taint is removed too early, before the service is actually ready; this issue is about the startup taint sometimes not being removed at all because of a race condition.
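If this really is a lost race, one possible mitigation would be to retry the removal whenever the API server rejects the patch because the taints changed underneath it. A minimal sketch, continuing the hypothetical removeAgentNotReadyTaint function above (again, not the driver's current behavior):

```go
// Hypothetical retry wrapper: re-run the taint removal when the rejection
// looks like a concurrent modification (422 Invalid from a failed "test" op,
// or 409 Conflict), rebuilding the precondition from a fresh Get each time.
package taintsketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

func removeAgentNotReadyTaintWithRetry(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	return retry.OnError(retry.DefaultBackoff,
		func(err error) bool {
			// Only retry errors that suggest someone else changed the taints.
			return apierrors.IsInvalid(err) || apierrors.IsConflict(err)
		},
		func() error {
			// removeAgentNotReadyTaint re-reads the node on every attempt,
			// so the patch precondition is rebuilt against the latest taints.
			return removeAgentNotReadyTaint(ctx, cs, nodeName)
		})
}
```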
/kind bug
Environment
Kubernetes version (use kubectl version): v1.29.1-eks-b9c9ed7
Driver version: v1.7.6 (EKS add-on v1.7.6-eksbuild.2)
Please also attach debug logs to help us better diagnose:
results.tgz