
Taint removal race condition #2199

Closed
ricardov1 opened this issue Oct 28, 2024 · 11 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ricardov1

ricardov1 commented Oct 28, 2024

/kind bug

What happened?
When spinning up an instance in a cluster where both the ebs-csi-driver and the efs-csi-driver are deployed, occasionally (about 10% of the time) one of the two startup taints is not removed, and the node is unable to accept any pods due to the lingering taint.

What you expected to happen?
After the daemonset starts up successfully, the taint should always be removed.

How to reproduce it (as minimally and precisely as possible)?
This appears to be caused by a race condition between the efs and ebs daemonsets (described below), so it is difficult to reproduce. The best way to increase the likelihood of reproducing the issue, rather than launching and relaunching a single instance, is to spin up many (50+) instances where both daemonsets run simultaneously on instance startup. If you encounter the race condition on one of those nodes, the node will still have either the efs or the ebs taint on it, and the pod for the corresponding lingering taint will be missing the Removed taint(s) from local node log. A small helper for spotting affected nodes is sketched below.
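
For anyone attempting the reproduction above, here is a minimal client-go sketch for spotting affected nodes. It is not part of either driver; the package and function names are made up for illustration, and it assumes a clientset has been built elsewhere (from kubeconfig or in-cluster config).

```go
package findstuck

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// The two startup taints involved in this issue.
var agentNotReadyKeys = map[string]bool{
	"ebs.csi.aws.com/agent-not-ready": true,
	"efs.csi.aws.com/agent-not-ready": true,
}

// PrintStuckNodes prints every node that still carries one of the startup
// taints; once all CSI node pods have started cleanly, this should print nothing.
func PrintStuckNodes(ctx context.Context, client kubernetes.Interface) error {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, node := range nodes.Items {
		for _, taint := range node.Spec.Taints {
			if agentNotReadyKeys[taint.Key] {
				fmt.Printf("node %s still has taint %s\n", node.Name, taint.Key)
			}
		}
	}
	return nil
}
```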

Anything else we need to know?:
This issue appears to be caused by a race condition where the daemonset incorrectly determines that the ebs.csi.aws.com/agent-not-ready taint has been removed. This block fetches the taints on the node and compares the length of the node's taint list with the length of taintsToRemove to decide whether the taint-removal step is needed. When we run into this issue, we suspect that the efs-csi-driver "beats" the ebs-csi-driver to the taint removal, so when the lengths of the slices are compared they are equal, even though it was the efs.csi.aws.com/agent-not-ready taint that was removed, not the ebs.csi.aws.com/agent-not-ready taint. A simplified sketch of this failure mode follows below.
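
To make the suspicion above concrete, here is a minimal, self-contained sketch, not the driver's actual code, of why a count-based comparison cannot tell which of the two startup taints disappeared once the other driver wins the race, whereas checking for the driver's own taint key can. The variable and function names are invented for illustration.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

const ebsTaintKey = "ebs.csi.aws.com/agent-not-ready"

// removedByCount assumes "our" taint is gone whenever the node now has fewer
// taints than expected; it cannot tell which taint actually disappeared.
func removedByCount(observed, expected []v1.Taint) bool {
	return len(observed) < len(expected)
}

// removedByKey only reports success when the EBS taint itself is absent.
func removedByKey(observed []v1.Taint) bool {
	for _, t := range observed {
		if t.Key == ebsTaintKey {
			return false
		}
	}
	return true
}

func main() {
	expected := []v1.Taint{
		{Key: "ebs.csi.aws.com/agent-not-ready", Effect: v1.TaintEffectNoExecute},
		{Key: "efs.csi.aws.com/agent-not-ready", Effect: v1.TaintEffectNoExecute},
	}
	// The EFS driver won the race and removed its own taint; the EBS taint remains.
	observed := []v1.Taint{
		{Key: "ebs.csi.aws.com/agent-not-ready", Effect: v1.TaintEffectNoExecute},
	}

	fmt.Println("count-based check thinks the EBS taint is gone:", removedByCount(observed, expected)) // true (wrong)
	fmt.Println("key-based check thinks the EBS taint is gone:", removedByKey(observed))               // false (correct)
}
```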

related issue on efs-csi-driver repo: kubernetes-sigs/aws-efs-csi-driver#1491

Environment

  • Kubernetes version (use kubectl version): 1.29.8
  • Driver version: 1.28.0
@k8s-ci-robot added the kind/bug label Oct 28, 2024
@torredil
Member

Hi @ricardov1, thank you for the detailed bug report 🙂

I've added a task in our backlog to reproduce this behavior using a supported version of the EBS CSI driver, as there have been several enhancements in this code path since 1.28.

As I understand the issue, it should be reproducible even in the latest version of the driver. If you are able to upgrade and have the means to easily repro it in your environment, please provide the logs for the relevant CSI node pod with logLevel >= 4; that would really help with the RCA. Thanks!

@ricardov1
Author

Hey @torredil, thanks for the quick response! I can't easily upgrade and reproduce, but I do have some saved debug logs from an occasion where we ran into this issue. In that instance it was the efs-csi-driver that was unable to remove the taint, not the ebs-csi-driver, so these are efs-csi-driver logs.

After taking another look at these logs, it looks like the GET node API call made right before the taints are removed fails to return all of the node attributes, among them spec.taints. That is why the daemonset incorrectly determines that the taint does not need to be removed: the list of taints it sees has length 0 to begin with.

I haven't seen this behavior before; it's strange that the response returns a 200 status. A retry of the API call when the spec.taints attribute is missing would help (a rough sketch follows below), but I'm curious about that partial response.
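
As a rough illustration of the retry idea, here is a minimal client-go sketch (not the actual driver implementation; the helper names are hypothetical). It re-fetches the Node a couple of times when the expected startup taint is missing from the response before giving up, assuming the taint should already be present at that point in node startup.

```go
package noderetry

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"
)

const agentNotReadyTaintKey = "efs.csi.aws.com/agent-not-ready" // the same idea applies to the EBS key

func hasAgentNotReadyTaint(taints []v1.Taint) bool {
	for _, t := range taints {
		if t.Key == agentNotReadyTaintKey {
			return true
		}
	}
	return false
}

// getNodeExpectingTaint re-fetches the Node a few times when the startup taint
// is not present in the response, instead of immediately concluding that there
// is nothing to remove.
func getNodeExpectingTaint(ctx context.Context, client kubernetes.Interface, nodeName string) (*v1.Node, error) {
	var node *v1.Node
	var err error
	for attempt := 1; attempt <= 3; attempt++ {
		node, err = client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return nil, err
		}
		if hasAgentNotReadyTaint(node.Spec.Taints) {
			return node, nil
		}
		klog.V(4).InfoS("Startup taint not in response, retrying", "attempt", attempt)
		time.Sleep(2 * time.Second)
	}
	// Give up and return the last response; the caller will skip taint removal.
	return node, nil
}
```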

I0923 21:33:36.983967       1 config_dir.go:88] Creating symlink from '/etc/amazon/efs' to '/var/amazon/efs'
I0923 21:33:37.076220       1 metadata.go:65] getting MetadataService...
I0923 21:33:37.078102       1 metadata.go:70] retrieving metadata from EC2 metadata service
I0923 21:33:37.079030       1 cloud.go:147] EFS Client created using the following endpoint: https://elasticfilesystem.us-east-1.amazonaws.com
I0923 21:33:37.079042       1 driver.go:87] Node Service capability for Get Volume Stats Not enabled
I0923 21:33:37.081167       1 mount_linux.go:283] Detected umount with safe 'not mounted' behavior
I0923 21:33:37.081222       1 driver.go:150] Did not find any input tags.
I0923 21:33:37.081509       1 driver.go:116] Registering Node Server
I0923 21:33:37.081518       1 driver.go:118] Registering Controller Server
I0923 21:33:37.081527       1 driver.go:121] Starting efs-utils watchdog
I0923 21:33:37.081615       1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.conf since it exists already
I0923 21:33:37.081625       1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.crt since it exists already
I0923 21:33:37.081862       1 driver.go:127] Starting reaper
I0923 21:33:37.082188       1 round_trippers.go:466] curl -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: aws-efs-csi-driver/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Authorization: Bearer <masked>" 'https://192.168.0.1:443/api/v1/nodes/ip-123-123-12-12.ec2.internal'
I0923 21:33:37.082856       1 round_trippers.go:510] HTTP Trace: Dial to tcp:192.168.0.1:443 succeed
I0923 21:33:37.175904       1 round_trippers.go:553] GET https://192.168.0.1:443/api/v1/nodes/ip-123-123-12-12.ec2.internal 200 OK in 93 milliseconds
I0923 21:33:37.175929       1 round_trippers.go:570] HTTP Statistics: DNSLookup 0 ms Dial 0 ms TLSHandshake 3 ms ServerProcessing 89 ms Duration 93 ms
I0923 21:33:37.175936       1 round_trippers.go:577] Response Headers:
I0923 21:33:37.175943       1 round_trippers.go:580]     Audit-Id: 313d0599-ccb4-4ebc-9b64-14d0cb8751ad
I0923 21:33:37.175950       1 round_trippers.go:580]     Cache-Control: no-cache, private
I0923 21:33:37.175958       1 round_trippers.go:580]     Content-Type: application/json
I0923 21:33:37.175964       1 round_trippers.go:580]     X-Kubernetes-Pf-Flowschema-Uid: 24fc1286-36af-483b-a22d-b81c3bb5a261
I0923 21:33:37.175974       1 round_trippers.go:580]     X-Kubernetes-Pf-Prioritylevel-Uid: f38fae65-e6a8-4275-9d79-c41406bb5083
I0923 21:33:37.175980       1 round_trippers.go:580]     Date: Mon, 23 Sep 2024 21:33:37 GMT
I0923 21:33:37.176080       1 request.go:1171] Response Body: { "kind": "Node", "apiVersion": "v1", "metadata": { "name": "ip-123-123-12-12.ec2.internal", "resourceVersion": "1116646436", "creationTimestamp": "2024-09-23T21:33:11Z", "labels": { "beta.kubernetes.io/arch": "amd64", "beta.kubernetes.io/instance-type": "p4de.24xlarge", "beta.kubernetes.io/os": "linux", "failure-domain.beta.kubernetes.io/region": "us-east-1", "failure-domain.beta.kubernetes.io/zone": "us-east-1b", "kubernetes.io/arch": "amd64", "kubernetes.io/hostname": "ip-123-123-12-12.ec2.internal", "kubernetes.io/os": "linux", "node.kubernetes.io/instance-type": "p4de.24xlarge", "topology.kubernetes.io/region": "us-east-1", "topology.kubernetes.io/zone": "us-east-1b" }, "annotations": { "alpha.kubernetes.io/provided-node-ip": "123.123.12.12", "node.alpha.kubernetes.io/ttl": "15", "volumes.kubernetes.io/controller-managed-attach-detach": "true" }, "managedFields": [ { "manager": "aws-cloud-controller-manager", "operation": "Update", "apiVersion": "v1", "time": "2024-09-23T21:33:11Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:status": { "f:addresses": { "k:{\"type\":\"InternalDNS\"}": { ".": {}, "f:address": {}, "f:type": {} } } } }, "subresource": "status" }, { "manager": "kubelet", "operation": "Update", "apiVersion": "v1", "time": "2024-09-23T21:33:11Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:metadata": { "f:annotations": { ".": {}, "f:alpha.kubernetes.io/provided-node-ip": {}, "f:volumes.kubernetes.io/controller-managed-attach-detach": {} }, "f:labels": { ".": {}, "f:beta.kubernetes.io/arch": {}, "f:beta.kubernetes.io/os": {}, "f:kubernetes.io/arch": {}, "f:kubernetes.io/hostname": {}, "f:kubernetes.io/os": {} } }, "f:spec": { "f:providerID": {} } } }, { "manager": "aws-cloud-controller-manager", "operation": "Update", "apiVersion": "v1", "time": "2024-09-23T21:33:19Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:metadata": { "f:labels": { "f:beta.kubernetes.io/instance-type": {}, "f:failure-domain.beta.kubernetes.io/region": {}, "f:failure-domain.beta.kubernetes.io/zone": {}, "f:k8s.io/cloud-provider-aws": {}, "f:node.kubernetes.io/instance-type": {}, "f:topology.kubernetes.io/region": {}, "f:topology.kubernetes.io/zone": {} } } } }, { "manager": "kubelet", "operation": "Update", "apiVersion": "v1", "time": "2024-09-23T21:33:27Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:status": { "f:conditions": { "k:{\"type\":\"DiskPressure\"}": { "f:lastHeartbeatTime": {} }, "k:{\"type\":\"MemoryPressure\"}": { "f:lastHeartbeatTime": {} }, "k:{\"type\":\"PIDPressure\"}": { "f:lastHeartbeatTime": {} }, "k:{\"type\":\"Ready\"}": { "f:lastHeartbeatTime": {}, "f:lastTransitionTime": {}, "f:message": {}, "f:reason": {}, "f:status": {} } } } }, "subresource": "status" }, { "manager": "kube-controller-manager", "operation": "Update", "apiVersion": "v1", "time": "2024-09-23T21:33:32Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:metadata": { "f:annotations": { "f:node.alpha.kubernetes.io/ttl": {} } } } }, { "manager": "cilium-operator-generic", "operation": "Update", "apiVersion": "v1", "time": "2024-09-23T21:33:34Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:status": { "f:conditions": { "k:{\"type\":\"NetworkUnavailable\"}": { ".": {}, "f:lastHeartbeatTime": {}, "f:lastTransitionTime": {}, "f:message": {}, "f:reason": {}, "f:status": {}, "f:type": {} } } } }, "subresource": "status" } ] }, "spec": { "providerID": "aws:///us-east-1b/i-123456789" }, "status": { "capacity": { "cpu": "96", "ephemeral-storage": "7168566Mi", 
"hugepages-1Gi": "0", "hugepages-2Mi": "10256Mi", "memory": "1176298736Ki", "pods": "110" }, "allocatable": { "cpu": "95690m", "ephemeral-storage": "6764033882790", "hugepages-1Gi": "0", "hugepages-2Mi": "10256Mi", "memory": "1157131504Ki", "pods": "110" }, "conditions": [ { "type": "NetworkUnavailable", "status": "False", "lastHeartbeatTime": "2024-09-23T21:33:34Z", "lastTransitionTime": "2024-09-23T21:33:34Z", "reason": "CiliumIsUp", "message": "Cilium is running on this node" }, { "type": "MemoryPressure", "status": "False", "lastHeartbeatTime": "2024-09-23T21:33:27Z", "lastTransitionTime": "2024-09-23T21:33:09Z", "reason": "KubeletHasSufficientMemory", "message": "kubelet has sufficient memory available" }, { "type": "DiskPressure", "status": "False", "lastHeartbeatTime": "2024-09-23T21:33:27Z", "lastTransitionTime": "2024-09-23T21:33:09Z", "reason": "KubeletHasNoDiskPressure", "message": "kubelet has no disk pressure" }, { "type": "PIDPressure", "status": "False", "lastHeartbeatTime": "2024-09-23T21:33:27Z", "lastTransitionTime": "2024-09-23T21:33:09Z", "reason": "KubeletHasSufficientPID", "message": "kubelet has sufficient PID available" }, { "type": "Ready", "status": "True", "lastHeartbeatTime": "2024-09-23T21:33:27Z", "lastTransitionTime": "2024-09-23T21:33:27Z", "reason": "KubeletReady", "message": "kubelet is posting ready status" } ], "addresses": [ { "type": "InternalIP", "address": "123.123.12.12" }, { "type": "InternalDNS", "address": "ip-123-123-12-12.ec2.internal" }, { "type": "Hostname", "address": "ip-123-123-12-12.ec2.internal" } ], "daemonEndpoints": { "kubeletEndpoint": { "Port": 10250 } }, "nodeInfo": { "kernelVersion": "5.10.224-212.876.amzn2.x86_64", "osImage": "Amazon Linux 2", "containerRuntimeVersion": "containerd://1.7.11", "kubeletVersion": "v1.29.6-eks-1552ad0", "kubeProxyVersion": "v1.29.6-eks-1552ad0", "operatingSystem": "linux", "architecture": "amd64" }, "images": [] } }
I0923 21:33:37.176760       1 node.go:486] "No taints to remove on node, skipping taint removal"
I0923 21:33:37.176775       1 driver.go:137] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0923 21:33:38.109926       1 node.go:311] NodeGetInfo: called with args 

@ricardov1
Author

It looks like this issue was fixed in the efs-csi-driver; has there been a similar fix in this driver?

@torredil
Member

torredil commented Nov 5, 2024

@ricardov1 Yes, as I mentioned, there have been several enhancements in this code path since 1.28. Given there is no clear root cause for this behavior, I can't say that upgrading will fix it (or provide insight into the fixes implemented in the efs-csi-driver), but please do upgrade and let us know if you continue to run into this.

@mamoit

mamoit commented Nov 7, 2024

This morning we upgraded both the EBS and EFS helm charts in all our clusters to the latest versions (2.36.0 and 3.0.8 respectively).
A couple of hours later, there are still nodes that come up and get stuck with either the ebs or the efs CSI taint.

@ConnorJC3
Contributor

ConnorJC3 commented Nov 22, 2024

Hi, based on our testing, the issue described here is caused by a bug where Karpenter re-adds the taint from the EBS CSI Driver after the driver has already removed it.

This bug was already fixed on the Karpenter side (fix: kubernetes-sigs/karpenter#1658 cherry-pick: kubernetes-sigs/karpenter#1705), but the fix was not included until Karpenter release v1.0.4.

Please upgrade to Karpenter v1.0.4 or later and see if that fixes the issue for you, thanks!

@mamoit

mamoit commented Nov 22, 2024

After checking the audit logs of a random affected node in one of our clusters I can see that:

  1. the ebs-csi-driver removes its taint
  2. then the efs-csi-driver removes its taint
  3. then Karpenter comes 3 seconds later and patches both taints back onto the node.

So this is neither an ebs nor an efs csi-driver issue, but a Karpenter issue.
The curve ball is that we're running Karpenter v1.0.7 (of karpenter-provider-aws), so the issue should be fixed there too, unless there was a regression of some sort.

@abhiverma001

I am also facing the same issue. If I delete the ebs-csi-driver pod, the new ebs-csi pod shows this error in its logs on startup:
node.go:861] "Unexpected failure when attempting to remove node taint(s)" err="isAllocatableSet: driver not found on node ip-xx.xx.xx.xx.eu-west-1.compute.internal"
After a few seconds, though, it stabilizes.

I am using both Karpenter-managed nodes and regular worker nodes; the ebs-csi-driver pods running on both kinds of nodes show the same error.
I updated both the ebs (2.37.0) and efs (3.0.8) drivers and also updated Karpenter (1.0.8), but this error still occurs in the ebs-csi pod logs.

@mamoit

mamoit commented Dec 12, 2024

Just for the record, in case it helps someone (@abhiverma001, maybe it can help you): our issue was that Karpenter was running out of memory when the cluster scaled aggressively.
We added more RAM to it, and it stopped crashing and re-adding the taints to the new nodes that had come up while Karpenter was down.

For reference, we're currently running Karpenter with 4GB of RAM; it uses at most 2.5GB when the cluster scales up, but we want a lot of headroom (the RAM is cheaper than the stuck instances).
We have moments when we add about 600 pods to the cluster in a couple of minutes and Karpenter needs to provision around 60 nodes.

@AndrewSirenko
Contributor

/close

@mamoit thank you for chiming in with your RAM-related suggestions.

Closing this issue to consolidate the discussion to kubernetes-sigs/karpenter's issue: kubernetes-sigs/karpenter#1772.

Alternatively, the EBS/EFS CSI drivers wouldn't need this startup taint at all if kubernetes/kubernetes#95911 were fixed. I will discuss with my team whether we can spend some time tackling that issue.

@k8s-ci-robot
Contributor

@AndrewSirenko: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
