Weird Rpc error: code = DeadlineExceeded desc = context deadline exceeded and error listing AWS instances: RequestCanceled: request context canceled #1783

zjalicflw · 2023-10-13T13:06:16Z

/kind bug

What happened?

After uninstalling and reinstalling the bitnami/kafka Helm chart on my EKS cluster a couple of times due to some errors, a new blocking error occurred: suddenly, all pods are stuck in ContainerCreating status. The describe pod command shows:

Warning FailedAttachVolume 10s (x6 over 29s) attachdetach-controller AttachVolume.Attach failed for volume "pvc-95a5209c-797c-49de-ae30-9def18935393" : rpc error: code = DeadlineExceeded desc = context deadline exceeded

After this, I tried today to delete and recreate the PVCs, but a similar error occurs when recreating them:

Warning ProvisioningFailed 20m  ebs.csi.aws.com_ebs-csi-controller-7cb6bff767-8f9jj_ff3337d4-2a27-4593-b371-0c78b6b73fe0 failed to provision volume with StorageClass "gp2": rpc error: code = Internal desc = Could not create volume "pvc-3b745751-ce69-446d-a094-89f84900bdbc": could not create volume in EC2: RequestCanceled: request context canceled
caused by: context deadline exceeded
 Normal  Provisioning     6m33s (x12 over 21m) ebs.csi.aws.com_ebs-csi-controller-7cb6bff767-8f9jj_ff3337d4-2a27-4593-b371-0c78b6b73fe0 External provisioner is provisioning volume for claim "default/data-kafka-0"
 Warning ProvisioningFailed  6m23s (x11 over 21m) ebs.csi.aws.com_ebs-csi-controller-7cb6bff767-8f9jj_ff3337d4-2a27-4593-b371-0c78b6b73fe0 failed to provision volume with StorageClass "gp2": rpc error: code = DeadlineExceeded desc = context deadline exceeded
 Normal  ExternalProvisioning 100s (x83 over 21m)  persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.

The CSI driver pod shows these errors:

E1013 12:43:23.647806       1 driver.go:124] "GRPC error" err=<
	rpc error: code = Internal desc = Could not detach volume "vol-0d61e5511a40db185" from node "i-0a7f1ad09359b3374": error listing AWS instances: RequestCanceled: request context canceled
	caused by: context canceled
E1013 12:43:23.652891       1 driver.go:124] "GRPC error" err=<
	rpc error: code = Internal desc = Could not detach volume "vol-0e37dabb932ace606" from node "i-0187ea34d2b675a5c": error listing AWS instances: RequestCanceled: request context canceled
	caused by: context deadline exceeded
I1013 12:43:23.664699       1 controller.go:444] "ControllerUnpublishVolume: detaching" volumeID="vol-0d61e5511a40db185" nodeID="i-0a7f1ad09359b3374"
I1013 12:43:23.667103       1 controller.go:444] "ControllerUnpublishVolume: detaching" volumeID="vol-0e37dabb932ace606" nodeID="i-0187ea34d2b675a5c"
E1013 12:43:23.774055       1 driver.go:124] "GRPC error" err=<
	rpc error: code = Internal desc = Could not detach volume "vol-0fb663d85437897ab" from node "i-05b75e1891fb38735": error listing AWS instances: RequestCanceled: request context canceled
	caused by: context canceled
E1013 12:43:23.776023       1 driver.go:124] "GRPC error" err=<
	rpc error: code = Internal desc = Could not detach volume "vol-0163a5d445e993518" from node "i-0187ea34d2b675a5c": error listing AWS instances: RequestCanceled: request context canceled
	caused by: context canceled
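
The two "caused by" variants in these logs are the messages of Go's standard context errors. As a minimal sketch of how each one arises (this is not the driver's code; slowCall, deadlineCase, canceledCase, and classify are hypothetical names used only for illustration), a call whose context times out reports "context deadline exceeded", while a call whose context is canceled by the caller reports "context canceled":

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowCall stands in for a remote API call (hypothetical, not the driver's code).
// It returns when the work finishes or the context ends, whichever comes first.
func slowCall(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// classify maps a context error to a short label.
func classify(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, context.DeadlineExceeded):
		return "deadline"
	case errors.Is(err, context.Canceled):
		return "canceled"
	default:
		return "other"
	}
}

// deadlineCase: the context's timeout expires before the call finishes.
func deadlineCase() error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	return slowCall(ctx, time.Second)
}

// canceledCase: the caller cancels the context before the call finishes.
func canceledCase() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return slowCall(ctx, time.Second)
}

func main() {
	for _, err := range []error{deadlineCase(), canceledCase()} {
		fmt.Printf("%s: %v\n", classify(err), err)
	}
}
```

Either way, the request was abandoned on the client side before a response arrived, so these messages alone do not say why the AWS call was slow.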

What you expected to happen?

CSI driver should reattach properly to volumes.

How to reproduce it (as minimally and precisely as possible)?

Not sure, very specific situation

Anything else we need to know?:

Is this some kind of AWS quota block? While testing, I uninstalled and installed the kafka chart many times, and each time there was no problem with the PVCs; then suddenly describing the pod gives context deadline exceeded errors.

Environment

Kubernetes version (use kubectl version):

Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.1-eks-43840fb

Driver version: v1.23.1-eksbuild.1


zjalicflw · 2023-10-17T07:34:00Z

#214

This seems similar. However, I have tried everything to solve this, and no matter what, I get the same error: context deadline exceeded

debdutdeb · 2023-10-19T22:44:12Z

Facing this right now

zjalicflw · 2023-10-20T11:30:30Z

Hi @debdutdeb

I managed to solve my issue by reinstalling the CoreDNS, VPC CNI, and EBS Driver add-ons and updating them to the latest version. After this, my kafka pods were running.

This should be easily fixed by uninstalling all add-ons (making sure to also uninstall the ones that are NOT installed through the AWS add-ons console), installing them all again, and then deleting any PVCs that are stuck on attaching. Of course, this only works if you use dynamic provisioning. If you use static provisioning, just attach and reattach the volumes.

Taking a look at your PVCs, PVs, and the EBS volumes attached to your EKS cluster's instances, and carefully inspecting them, should help you fix your problem.

You can elaborate more if you need help; I will do my best.

Filip

j-land · 2023-11-13T19:35:22Z

We are running into the same issue in an EKS environment.

Kubernetes version: v1.24.17-eks-4f4795d
Driver version: 1.24.0 (from helm chart version aws-ebs-csi-driver-2.24.0)

I1113 08:42:54.079730       1 csi_handler.go:251] Attaching "csi-57939a06730aa4167c1609c46f5d8a3f6196360670b974e355bf2f6cf01a746c"
I1113 08:42:54.079786       1 csi_handler.go:251] Attaching "csi-b394ecc409f06a620fbce7118bdf4db434e5f359196317f98a42cdcac85eacdb"
I1113 08:42:54.080160       1 controller.go:415] "ControllerPublishVolume: attaching" volumeID="vol-0934dc0da8301b04d" nodeID="i-0c8e24cd69c5ca516"
I1113 08:42:54.080160       1 controller.go:415] "ControllerPublishVolume: attaching" volumeID="vol-056e1e688e7a0aa8c" nodeID="i-0c8e24cd69c5ca516"
E1113 08:43:09.080470       1 driver.go:124] "GRPC error" err=<
	rpc error: code = Internal desc = Could not attach volume "vol-056e1e688e7a0aa8c" to node "i-0c8e24cd69c5ca516": error listing AWS instances: RequestCanceled: request context canceled
	caused by: context canceled
 >
E1113 08:43:09.080469       1 driver.go:124] "GRPC error" err=<
	rpc error: code = Internal desc = Could not attach volume "vol-0934dc0da8301b04d" to node "i-0c8e24cd69c5ca516": error listing AWS instances: RequestCanceled: request context canceled
	caused by: context canceled
 >
I1113 08:43:09.087184       1 csi_handler.go:234] Error processing "csi-b394ecc409f06a620fbce7118bdf4db434e5f359196317f98a42cdcac85eacdb": failed to attach: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1113 08:43:09.089415       1 csi_handler.go:234] Error processing "csi-57939a06730aa4167c1609c46f5d8a3f6196360670b974e355bf2f6cf01a746c": failed to attach: rpc error: code = DeadlineExceeded desc = context deadline exceeded
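
As a rough check on where the time goes, the gap between the "attaching" line and the matching "GRPC error" line can be computed from the klog timestamps. The sketch below assumes both stamps fall on the same day and uses a hypothetical helper named elapsed. For vol-056e1e688e7a0aa8c it comes out at almost exactly 15 seconds, which is consistent with a fixed call timeout of about that length; that reading is an inference from these timestamps, not something the logs state.

```go
package main

import (
	"fmt"
	"time"
)

// klogLayout matches the time-of-day portion of a klog header such as
// "I1113 08:42:54.080160" (the severity letter and MMDD prefix are left out).
const klogLayout = "15:04:05.000000"

// elapsed returns the duration between two klog time-of-day stamps
// taken on the same day (hypothetical helper for this example).
func elapsed(start, end string) (time.Duration, error) {
	s, err := time.Parse(klogLayout, start)
	if err != nil {
		return 0, err
	}
	e, err := time.Parse(klogLayout, end)
	if err != nil {
		return 0, err
	}
	return e.Sub(s), nil
}

func main() {
	// Timestamps copied from the log lines above for vol-056e1e688e7a0aa8c:
	// "ControllerPublishVolume: attaching" and the following "GRPC error".
	d, err := elapsed("08:42:54.080160", "08:43:09.080470")
	if err != nil {
		panic(err)
	}
	fmt.Println("time from attach request to error:", d)
}
```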

"I managed to solve my issue by reinstalling both CoreDNS plugins and VPC CNI and EBS Driver. ... This should be easily fixed by uninstalling all addons, making sure to uninstall ones that are NOT installed through AWS addons console, install them all again and then delete some PVCs if stuck on attaching. ..."

These steps may be fine for one-off cases, but they aren't feasible for our production environment. I would like to work towards a more durable fix in the ebs-csi-driver application.

j-land · 2023-11-13T19:38:04Z

@zjalicflw Can you reopen this issue?

torredil · 2023-11-13T20:38:08Z

Hi @j-land, as a first step, I recommend upgrading to the latest version of the driver, which sets a more sensible default timeout value for the external attacher. See our release notes here for more information: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/CHANGELOG.md#v1250.

Beyond that, if you are still running into issues, I'd recommend enabling SDK logs via the sdkDebugLog parameter to help provide further insight into networking or auth-related issues. Feel free to open a new issue if you need any help.

j-land · 2023-11-13T21:11:00Z

@torredil That's helpful, I appreciate it! Hopefully upgrading does the trick, but I'll enable SDK logs to debug if not.

nookseal · 2024-10-01T05:04:03Z

Did upgrading solve the problem? @j-land

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 13, 2023

zjalicflw closed this as completed Oct 23, 2023
