
Taint removal race condition #2199

Closed
ricardov1 opened this issue Oct 28, 2024 · 11 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ricardov1

ricardov1 commented Oct 28, 2024

/kind bug

What happened?
When spinning up an instance in a cluster where both the ebs-csi-driver and the efs-csi-driver are deployed, occasionally (about 10% of the time) one of the two startup taints is not removed, and the node is unable to accept any pods due to the lingering taint.

What you expected to happen?
After the daemonset starts up successfully, the taint should always be removed.

How to reproduce it (as minimally and precisely as possible)?
This appears to be caused by a race condition between the efs and ebs daemonsets (described below), so it is difficult to reproduce. The best way to increase the likelihood of reproducing the issue, rather than launching and relaunching a single instance, is to spin up many (50+) instances where both daemonsets run simultaneously on instance startup. If you encounter the race condition on one of those nodes, the node will still have either the efs or the ebs taint on it, and the pod for the corresponding lingering taint will be missing the Removed taint(s) from local node log. A small helper for spotting affected nodes is sketched below.
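
For anyone attempting the reproduction above, here is a minimal client-go sketch for spotting affected nodes. It is not part of either driver; the package and function names are made up for illustration, and it assumes a clientset has been built elsewhere (from kubeconfig or in-cluster config).

```go
package findstuck

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// The two startup taints involved in this issue.
var agentNotReadyKeys = map[string]bool{
	"ebs.csi.aws.com/agent-not-ready": true,
	"efs.csi.aws.com/agent-not-ready": true,
}

// PrintStuckNodes prints every node that still carries one of the startup
// taints; once all CSI node pods have started cleanly, this should print nothing.
func PrintStuckNodes(ctx context.Context, client kubernetes.Interface) error {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, node := range nodes.Items {
		for _, taint := range node.Spec.Taints {
			if agentNotReadyKeys[taint.Key] {
				fmt.Printf("node %s still has taint %s\n", node.Name, taint.Key)
			}
		}
	}
	return nil
}
```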

Anything else we need to know?:
This issue appears to be caused by a race condition where the daemonset incorrectly determines that the ebs.csi.aws.com/agent-not-ready taint has been removed. This block fetches the taints on the node and compares the length of the node's taint list with the length of taintsToRemove to decide whether the taint-removal step is needed. When we run into this issue, we suspect that the efs-csi-driver "beats" the ebs-csi-driver to the taint removal, so when the lengths of the slices are compared they are equal, even though it was the efs.csi.aws.com/agent-not-ready taint that was removed, not the ebs.csi.aws.com/agent-not-ready taint. A simplified sketch of this failure mode follows below.
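
To make the suspicion above concrete, here is a minimal, self-contained sketch, not the driver's actual code, of why a count-based comparison cannot tell which of the two startup taints disappeared once the other driver wins the race, whereas checking for the driver's own taint key can. The variable and function names are invented for illustration.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

const ebsTaintKey = "ebs.csi.aws.com/agent-not-ready"

// removedByCount assumes "our" taint is gone whenever the node now has fewer
// taints than expected; it cannot tell which taint actually disappeared.
func removedByCount(observed, expected []v1.Taint) bool {
	return len(observed) < len(expected)
}

// removedByKey only reports success when the EBS taint itself is absent.
func removedByKey(observed []v1.Taint) bool {
	for _, t := range observed {
		if t.Key == ebsTaintKey {
			return false
		}
	}
	return true
}

func main() {
	expected := []v1.Taint{
		{Key: "ebs.csi.aws.com/agent-not-ready", Effect: v1.TaintEffectNoExecute},
		{Key: "efs.csi.aws.com/agent-not-ready", Effect: v1.TaintEffectNoExecute},
	}
	// The EFS driver won the race and removed its own taint; the EBS taint remains.
	observed := []v1.Taint{
		{Key: "ebs.csi.aws.com/agent-not-ready", Effect: v1.TaintEffectNoExecute},
	}

	fmt.Println("count-based check thinks the EBS taint is gone:", removedByCount(observed, expected)) // true (wrong)
	fmt.Println("key-based check thinks the EBS taint is gone:", removedByKey(observed))               // false (correct)
}
```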

related issue on efs-csi-driver repo: kubernetes-sigs/aws-efs-csi-driver#1491

Environment

  • Kubernetes version (use kubectl version): 1.29.8
  • Driver version: 1.28.0
@k8s-ci-robot added the kind/bug label Oct 28, 2024
@torredil
Member

Hi @ricardov1, thank you for the detailed bug report 🙂

I've added a task in our backlog to reproduce this behavior using a supported version of the EBS CSI driver, as there have been several enhancements in this code path since 1.28.

As I understand the issue, it should be reproducible even in the latest version of the driver. If you are able to upgrade and have the means to easily repro it in your environment, please provide the logs for the relevant CSI node pod with logLevel >= 4; that would really help with the RCA. Thanks!

@ricardov1
Author

Hey @torredil, thanks for the quick response! I can't easily upgrade and reproduce, but I do have some saved debug logs from an occasion where we ran into this issue. In that instance it was the efs-csi-driver that was unable to remove the taint, not the ebs-csi-driver, so these are efs-csi-driver logs.

After taking another look at these logs, it looks like the GET node API call made right before the taints are removed fails to return all of the node attributes, among them spec.taints. That is why the daemonset incorrectly determines that the taint does not need to be removed: the list of taints it sees has length 0 to begin with.

I haven't seen this behavior before; it's strange that the response returns a 200 status. A retry of the API call when the spec.taints attribute is missing would help (a rough sketch follows below), but I'm curious about that partial response.
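
As a rough illustration of the retry idea, here is a minimal client-go sketch (not the actual driver implementation; the helper names are hypothetical). It re-fetches the Node a couple of times when the expected startup taint is missing from the response before giving up, assuming the taint should already be present at that point in node startup.

```go
package noderetry

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"
)

const agentNotReadyTaintKey = "efs.csi.aws.com/agent-not-ready" // the same idea applies to the EBS key

func hasAgentNotReadyTaint(taints []v1.Taint) bool {
	for _, t := range taints {
		if t.Key == agentNotReadyTaintKey {
			return true
		}
	}
	return false
}

// getNodeExpectingTaint re-fetches the Node a few times when the startup taint
// is not present in the response, instead of immediately concluding that there
// is nothing to remove.
func getNodeExpectingTaint(ctx context.Context, client kubernetes.Interface, nodeName string) (*v1.Node, error) {
	var node *v1.Node
	var err error
	for attempt := 1; attempt <= 3; attempt++ {
		node, err = client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return nil, err
		}
		if hasAgentNotReadyTaint(node.Spec.Taints) {
			return node, nil
		}
		klog.V(4).InfoS("Startup taint not in response, retrying", "attempt", attempt)
		time.Sleep(2 * time.Second)
	}
	// Give up and return the last response; the caller will skip taint removal.
	return node, nil
}
```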

I0923 21:33:36.983967       1 config_dir.go:88] Creating symlink from '/etc/amazon/efs' to '/var/amazon/efs'
I0923 21:33:37.076220       1 metadata.go:65] getting MetadataService...
I0923 21:33:37.078102       1 metadata.go:70] retrieving metadata from EC2 metadata service
I0923 21:33:37.079030       1 cloud.go:147] EFS Client created using the following endpoint: https://elasticfilesystem.us-east-1.amazonaws.com
I0923 21:33:37.079042       1 driver.go:87] Node Service capability for Get Volume Stats Not enabled
I0923 21:33:37.081167       1 mount_linux.go:283] Detected umount with safe 'not mounted' behavior
I0923 21:33:37.081222       1 driver.go:150] Did not find any input tags.
I0923 21:33:37.081509       1 driver.go:116] Registering Node Server
I0923 21:33:37.081518       1 driver.go:118] Registering Controller Server
I0923 21:33:37.081527       1 driver.go:121] Starting efs-utils watchdog
I0923 21:33:37.081615       1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.conf since it exists already
I0923 21:33:37.081625       1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.crt since it exists already
I0923 21:33:37.081862       1 driver.go:127] Starting reaper
I0923 21:33:37.082188       1 round_trippers.go:466] curl -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: aws-efs-csi-driver/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Authorization: Bearer <masked>" 'https://192.168.0.1:443/api/v1/nodes/ip-123-123-12-12.ec2.internal'
I0923 21:33:37.082856       1 round_trippers.go:510] HTTP Trace: Dial to tcp:192.168.0.1:443 succeed
I0923 21:33:37.175904       1 round_trippers.go:553] GET https://192.168.0.1:443/api/v1/nodes/ip-123-123-12-12.ec2.internal 200 OK in 93 milliseconds
I0923 21:33:37.175929       1 round_trippers.go:570] HTTP Statistics: DNSLookup 0 ms Dial 0 ms TLSHandshake 3 ms ServerProcessing 89 ms Duration 93 ms
I0923 21:33:37.175936       1 round_trippers.go:577] Response Headers:
I0923 21:33:37.175943       1 round_trippers.go:580]     Audit-Id: 313d0599-ccb4-4ebc-9b64-14d0cb8751ad
I0923 21:33:37.175950       1 round_trippers.go:580]     Cache-Control: no-cache, private
I0923 21:33:37.175958       1 round_trippers.go:580]     Content-Type: application/json
I0923 21:33:37.175964       1 round_trippers.go:580]     X-Kubernetes-Pf-Flowschema-Uid: 24fc1286-36af-483b-a22d-b81c3bb5a261
I0923 21:33:37.175974       1 round_trippers.go:580]     X-Kubernetes-Pf-Prioritylevel-Uid: f38fae65-e6a8-4275-9d79-c41406bb5083
I0923 21:33:37.175980       1 round_trippers.go:580]     Date: Mon, 23 Sep 2024 21:33:37 GMT
I0923 21:33:37.176080       1 request.go:1171] Response Body: { "kind": "Node", "apiVersion": "v1", "metadata": { "name": "ip-123-123-12-12.ec2.internal", "resourceVersion": "1116646436", "creationTimestamp": "2024-09-23T21:33:11Z", "labels": { "beta.kubernetes.io/arch": "amd64", "beta.kubernetes.io/instance-type": "p4de.24xlarge", "beta.kubernetes.io/os": "linux", "failure-domain.beta.kubernetes.io/region": "us-east-1", "failure-domain.beta.kubernetes.io/zone": "us-east-1b", "kubernetes.io/arch": "amd64", "kubernetes.io/hostname": "ip-123-123-12-12.ec2.internal", "kubernetes.io/os": "linux", "node.kubernetes.io/instance-type": "p4de.24xlarge", "topology.kubernetes.io/region": "us-east-1", "topology.kubernetes.io/zone": "us-east-1b" }, "annotations": { "alpha.kubernetes.io/provided-node-ip": "123.123.12.12", "node.alpha.kubernetes.io/ttl": "15", "volumes.kubernetes.io/controller-managed-attach-detach": "true" }, "managedFields": [ { "manager": "aws-cloud-controller-manager", "operation": "Update", "apiVersion": "v1", "time": "2024-09-23T21:33:11Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:status": { "f:addresses": { "k:{\"type\":\"InternalDNS\"}": { ".": {}, "f:address": {}, "f:type": {} } } } }, "subresource": "status" }, { "manager": "kubelet", "operation": "Update", "apiVersion": "v1", "time": "2024-09-23T21:33:11Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:metadata": { "f:annotations": { ".": {}, "f:alpha.kubernetes.io/provided-node-ip": {}, "f:volumes.kubernetes.io/controller-managed-attach-detach": {} }, "f:labels": { ".": {}, "f:beta.kubernetes.io/arch": {}, "f:beta.kubernetes.io/os": {}, "f:kubernetes.io/arch": {}, "f:kubernetes.io/hostname": {}, "f:kubernetes.io/os": {} } }, "f:spec": { "f:providerID": {} } } }, { "manager": "aws-cloud-controller-manager", "operation": "Update", "apiVersion": "v1", "time": "2024-09-23T21:33:19Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:metadata": { "f:labels": { "f:beta.kubernetes.io/instance-type": {}, "f:failure-domain.beta.kubernetes.io/region": {}, "f:failure-domain.beta.kubernetes.io/zone": {}, "f:k8s.io/cloud-provider-aws": {}, "f:node.kubernetes.io/instance-type": {}, "f:topology.kubernetes.io/region": {}, "f:topology.kubernetes.io/zone": {} } } } }, { "manager": "kubelet", "operation": "Update", "apiVersion": "v1", "time": "2024-09-23T21:33:27Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:status": { "f:conditions": { "k:{\"type\":\"DiskPressure\"}": { "f:lastHeartbeatTime": {} }, "k:{\"type\":\"MemoryPressure\"}": { "f:lastHeartbeatTime": {} }, "k:{\"type\":\"PIDPressure\"}": { "f:lastHeartbeatTime": {} }, "k:{\"type\":\"Ready\"}": { "f:lastHeartbeatTime": {}, "f:lastTransitionTime": {}, "f:message": {}, "f:reason": {}, "f:status": {} } } } }, "subresource": "status" }, { "manager": "kube-controller-manager", "operation": "Update", "apiVersion": "v1", "time": "2024-09-23T21:33:32Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:metadata": { "f:annotations": { "f:node.alpha.kubernetes.io/ttl": {} } } } }, { "manager": "cilium-operator-generic", "operation": "Update", "apiVersion": "v1", "time": "2024-09-23T21:33:34Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:status": { "f:conditions": { "k:{\"type\":\"NetworkUnavailable\"}": { ".": {}, "f:lastHeartbeatTime": {}, "f:lastTransitionTime": {}, "f:message": {}, "f:reason": {}, "f:status": {}, "f:type": {} } } } }, "subresource": "status" } ] }, "spec": { "providerID": "aws:///us-east-1b/i-123456789" }, "status": { "capacity": { "cpu": "96", "ephemeral-storage": "7168566Mi", 
"hugepages-1Gi": "0", "hugepages-2Mi": "10256Mi", "memory": "1176298736Ki", "pods": "110" }, "allocatable": { "cpu": "95690m", "ephemeral-storage": "6764033882790", "hugepages-1Gi": "0", "hugepages-2Mi": "10256Mi", "memory": "1157131504Ki", "pods": "110" }, "conditions": [ { "type": "NetworkUnavailable", "status": "False", "lastHeartbeatTime": "2024-09-23T21:33:34Z", "lastTransitionTime": "2024-09-23T21:33:34Z", "reason": "CiliumIsUp", "message": "Cilium is running on this node" }, { "type": "MemoryPressure", "status": "False", "lastHeartbeatTime": "2024-09-23T21:33:27Z", "lastTransitionTime": "2024-09-23T21:33:09Z", "reason": "KubeletHasSufficientMemory", "message": "kubelet has sufficient memory available" }, { "type": "DiskPressure", "status": "False", "lastHeartbeatTime": "2024-09-23T21:33:27Z", "lastTransitionTime": "2024-09-23T21:33:09Z", "reason": "KubeletHasNoDiskPressure", "message": "kubelet has no disk pressure" }, { "type": "PIDPressure", "status": "False", "lastHeartbeatTime": "2024-09-23T21:33:27Z", "lastTransitionTime": "2024-09-23T21:33:09Z", "reason": "KubeletHasSufficientPID", "message": "kubelet has sufficient PID available" }, { "type": "Ready", "status": "True", "lastHeartbeatTime": "2024-09-23T21:33:27Z", "lastTransitionTime": "2024-09-23T21:33:27Z", "reason": "KubeletReady", "message": "kubelet is posting ready status" } ], "addresses": [ { "type": "InternalIP", "address": "123.123.12.12" }, { "type": "InternalDNS", "address": "ip-123-123-12-12.ec2.internal" }, { "type": "Hostname", "address": "ip-123-123-12-12.ec2.internal" } ], "daemonEndpoints": { "kubeletEndpoint": { "Port": 10250 } }, "nodeInfo": { "kernelVersion": "5.10.224-212.876.amzn2.x86_64", "osImage": "Amazon Linux 2", "containerRuntimeVersion": "containerd://1.7.11", "kubeletVersion": "v1.29.6-eks-1552ad0", "kubeProxyVersion": "v1.29.6-eks-1552ad0", "operatingSystem": "linux", "architecture": "amd64" }, "images": [] } }
I0923 21:33:37.176760       1 node.go:486] "No taints to remove on node, skipping taint removal"
I0923 21:33:37.176775       1 driver.go:137] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0923 21:33:38.109926       1 node.go:311] NodeGetInfo: called with args 

@ricardov1
Author

It looks like this issue was fixed in the efs-csi-driver; has there been a similar fix in this driver?

@torredil
Member

torredil commented Nov 5, 2024

@ricardov1 Yes, as I mentioned, there have been several enhancements in this code path since 1.28. Given there is no clear root cause for this behavior, I can't say that upgrading will fix it (or provide insight into the fixes implemented in the efs-csi-driver), but please do upgrade and let us know if you continue to run into this.

@mamoit

mamoit commented Nov 7, 2024

This morning we upgraded both the EBS and EFS helm charts in all our clusters to the latest versions (2.36.0 and 3.0.8 respectively).
A couple of hours later, there are still nodes that come up and get stuck with either the ebs or the efs CSI taint.

@ConnorJC3
Contributor

ConnorJC3 commented Nov 22, 2024

Hi, based on our testing, the issue described here is caused by a bug where Karpenter re-adds the taint from the EBS CSI Driver after the driver has already removed it.

This bug was already fixed on the Karpenter side (fix: kubernetes-sigs/karpenter#1658 cherry-pick: kubernetes-sigs/karpenter#1705), but the fix was not included until Karpenter release v1.0.4.

Please upgrade to Karpenter v1.0.4 or later and see if that fixes the issue for you, thanks!

@mamoit

mamoit commented Nov 22, 2024

After checking the audit logs of a random affected node in one of our clusters I can see that:

  1. the ebs-csi-driver removes its taint
  2. then the efs-csi-driver removes its taint
  3. then Karpenter comes 3 seconds later and patches both taints back onto the node.

So this is neither an ebs nor an efs csi-driver issue, but a Karpenter issue.
The curve ball is that we're running Karpenter v1.0.7 (of karpenter-provider-aws), so the issue should be fixed there too, unless there was a regression of some sort.

@abhiverma001

I am also facing the same issue. If I delete the ebs-csi-driver pod, the new ebs-csi pod shows this error in its logs on startup:
node.go:861] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node ip-xx.xx.xx.xx.eu-west-1.compute.internal"
After a few seconds, though, it stabilizes.

I am using both Karpenter-managed nodes and regular worker nodes; the ebs-csi-driver pods running on both kinds of nodes show the same error.
I updated both the ebs (2.37.0) and efs (3.0.8) drivers and also updated Karpenter (1.0.8), but this error still occurs in the ebs-csi pod logs.

@mamoit

mamoit commented Dec 12, 2024

Just for the record, in case it helps someone (@abhiverma001, maybe it can help you): our issue was that Karpenter was running out of memory when the cluster scaled aggressively.
We added more RAM to it, and it stopped crashing and re-adding the taints to the new nodes that had come up while Karpenter was down.

For reference, we're currently running Karpenter with 4GB of RAM; it uses at most 2.5GB when the cluster scales up, but we want a lot of headroom (the RAM is cheaper than the stuck instances).
We have moments when we add about 600 pods to the cluster in a couple of minutes and Karpenter needs to provision around 60 nodes.

@AndrewSirenko
Contributor

/close

@mamoit thank you for chiming in with your RAM-related suggestions.

Closing this issue to consolidate the discussion to kubernetes-sigs/karpenter's issue: kubernetes-sigs/karpenter#1772.

Alternatively, the EBS/EFS CSI drivers wouldn't need this startup taint at all if kubernetes/kubernetes#95911 were fixed. I will discuss with my team whether we can spend some time tackling that issue.

@k8s-ci-robot
Contributor

@AndrewSirenko: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
