Weird Rpc error: code = DeadlineExceeded desc = context deadline exceeded and error listing AWS instances: RequestCanceled: request context canceled #1783

Closed
zjalicflw opened this issue Oct 13, 2023 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


zjalicflw commented Oct 13, 2023

/kind bug

What happened?

After uninstalling and reinstalling the bitnami/kafka Helm chart on my EKS cluster a couple of times due to some errors, a new blocking error occurred: suddenly, all pods are stuck in ContainerCreating status. The describe pod command shows:

Warning FailedAttachVolume 10s (x6 over 29s) attachdetach-controller AttachVolume.Attach failed for volume "pvc-95a5209c-797c-49de-ae30-9def18935393" : rpc error: code = DeadlineExceeded desc = context deadline exceeded

After this, I tried today to delete and recreate the PVCs, but a similar error occurs when they are recreated:

Warning ProvisioningFailed 20m  ebs.csi.aws.com_ebs-csi-controller-7cb6bff767-8f9jj_ff3337d4-2a27-4593-b371-0c78b6b73fe0 failed to provision volume with StorageClass "gp2": rpc error: code = Internal desc = Could not create volume "pvc-3b745751-ce69-446d-a094-89f84900bdbc": could not create volume in EC2: RequestCanceled: request context canceled
caused by: context deadline exceeded
 Normal  Provisioning     6m33s (x12 over 21m) ebs.csi.aws.com_ebs-csi-controller-7cb6bff767-8f9jj_ff3337d4-2a27-4593-b371-0c78b6b73fe0 External provisioner is provisioning volume for claim "default/data-kafka-0"
 Warning ProvisioningFailed  6m23s (x11 over 21m) ebs.csi.aws.com_ebs-csi-controller-7cb6bff767-8f9jj_ff3337d4-2a27-4593-b371-0c78b6b73fe0 failed to provision volume with StorageClass "gp2": rpc error: code = DeadlineExceeded desc = context deadline exceeded
 Normal  ExternalProvisioning 100s (x83 over 21m)  persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.

Output from the pod with the CSI driver:

E1013 12:43:23.647806       1 driver.go:124] "GRPC error" err=<
	rpc error: code = Internal desc = Could not detach volume "vol-0d61e5511a40db185" from node "i-0a7f1ad09359b3374": error listing AWS instances: RequestCanceled: request context canceled
	caused by: context canceled                        
E1013 12:43:23.652891       1 driver.go:124] "GRPC error" err=<
	rpc error: code = Internal desc = Could not detach volume "vol-0e37dabb932ace606" from node "i-0187ea34d2b675a5c": error listing AWS instances: RequestCanceled: request context canceled
	caused by: context deadline exceeded                                         
I1013 12:43:23.664699       1 controller.go:444] "ControllerUnpublishVolume: detaching" volumeID="vol-0d61e5511a40db185" nodeID="i-0a7f1ad09359b3374"                                    
I1013 12:43:23.667103       1 controller.go:444] "ControllerUnpublishVolume: detaching" volumeID="vol-0e37dabb932ace606" nodeID="i-0187ea34d2b675a5c"                                       
E1013 12:43:23.774055       1 driver.go:124] "GRPC error" err=<
	rpc error: code = Internal desc = Could not detach volume "vol-0fb663d85437897ab" from node "i-05b75e1891fb38735": error listing AWS instances: RequestCanceled: request context canceled
	caused by: context canceled                                                                              
E1013 12:43:23.776023       1 driver.go:124] "GRPC error" err=<
	rpc error: code = Internal desc = Could not detach volume "vol-0163a5d445e993518" from node "i-0187ea34d2b675a5c": error listing AWS instances: RequestCanceled: request context canceled
	caused by: context canceled
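
For readers wondering where the two "caused by" strings come from: they match the messages of Go's standard context.Canceled and context.DeadlineExceeded errors. The sketch below is not driver code. slowCall, deadlineDemo, and cancelDemo are hypothetical names, and the durations are arbitrary; it only illustrates how a call that outlives its context surfaces one error or the other.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowCall stands in for a remote API call (hypothetical helper, not driver code).
// It gives up and wraps ctx.Err() if the context ends before the work finishes.
func slowCall(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return fmt.Errorf("request context canceled: %w", ctx.Err())
	}
}

// deadlineDemo lets the context's deadline expire before the call completes.
func deadlineDemo() error {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	return slowCall(ctx, 2*time.Second)
}

// cancelDemo cancels the context explicitly before the call completes.
func cancelDemo() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return slowCall(ctx, 2*time.Second)
}

func main() {
	err := deadlineDemo()
	fmt.Println(err, errors.Is(err, context.DeadlineExceeded))

	err = cancelDemo()
	fmt.Println(err, errors.Is(err, context.Canceled))
}
```

Both cases end the call early on the client side; the only difference is whether a timer or the caller ended the context. This says nothing about why the underlying call was slow in the first place.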

What you expected to happen?

The CSI driver should reattach the volumes properly.

How to reproduce it (as minimally and precisely as possible)?

Not sure; this is a very specific situation.

Anything else we need to know?:

Is this some kind of AWS quota limit? While testing, I uninstalled and reinstalled the kafka chart many times without any problem with the PVCs, and then describe pod suddenly started showing context deadline exceeded errors.

Environment

  • Kubernetes version (use kubectl version):
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.1-eks-43840fb
  • Driver version: v1.23.1-eksbuild.1
k8s-ci-robot added the kind/bug label on Oct 13, 2023
@zjalicflw
Author

#214 seems similar. However, I have tried everything to solve this, and no matter what I do I get the same error: context deadline exceeded

@debdutdeb

Facing this right now

@zjalicflw
Author

Hi @debdutdeb

I managed to solve my issue by reinstalling both CoreDNS plugins and VPC CNI and EBS Driver. I updated them to the latest version. After this, my kafka pods were running.

This should be easily fixed by uninstalling all addons, making sure to uninstall ones that are NOT installed through AWS addons console, install them all again and then delete some PVCs if stuck on attaching. Of course, this will only work if you use dynamic provisioning. If you use static provisioning, just detach and reattach the volumes.

Carefully inspecting your PVCs, PVs, and the EBS volumes attached to your EKS cluster's instances should help you fix your problem.

Feel free to elaborate if you need help; I will do my best.

Filip


j-land commented Nov 13, 2023

We are running into the same issue in an EKS environment.

Kubernetes version: v1.24.17-eks-4f4795d
Driver version: 1.24.0 (from helm chart version aws-ebs-csi-driver-2.24.0)

I1113 08:42:54.079730       1 csi_handler.go:251] Attaching "csi-57939a06730aa4167c1609c46f5d8a3f6196360670b974e355bf2f6cf01a746c"
I1113 08:42:54.079786       1 csi_handler.go:251] Attaching "csi-b394ecc409f06a620fbce7118bdf4db434e5f359196317f98a42cdcac85eacdb"
I1113 08:42:54.080160       1 controller.go:415] "ControllerPublishVolume: attaching" volumeID="vol-0934dc0da8301b04d" nodeID="i-0c8e24cd69c5ca516"
I1113 08:42:54.080160       1 controller.go:415] "ControllerPublishVolume: attaching" volumeID="vol-056e1e688e7a0aa8c" nodeID="i-0c8e24cd69c5ca516"
E1113 08:43:09.080470       1 driver.go:124] "GRPC error" err=<
	rpc error: code = Internal desc = Could not attach volume "vol-056e1e688e7a0aa8c" to node "i-0c8e24cd69c5ca516": error listing AWS instances: RequestCanceled: request context canceled
	caused by: context canceled
 >
E1113 08:43:09.080469       1 driver.go:124] "GRPC error" err=<
	rpc error: code = Internal desc = Could not attach volume "vol-0934dc0da8301b04d" to node "i-0c8e24cd69c5ca516": error listing AWS instances: RequestCanceled: request context canceled
	caused by: context canceled
 >	
I1113 08:43:09.087184       1 csi_handler.go:234] Error processing "csi-b394ecc409f06a620fbce7118bdf4db434e5f359196317f98a42cdcac85eacdb": failed to attach: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1113 08:43:09.089415       1 csi_handler.go:234] Error processing "csi-57939a06730aa4167c1609c46f5d8a3f6196360670b974e355bf2f6cf01a746c": failed to attach: rpc error: code = DeadlineExceeded desc = context deadline exceeded

"I managed to solve my issue by reinstalling both CoreDNS plugins and VPC CNI and EBS Driver. ... This should be easily fixed by uninstalling all addons, making sure to uninstall ones that are NOT installed through AWS addons console, install them all again and then delete some PVCs if stuck on attaching. ..."

These steps may be fine for one-off cases, but they aren't feasible for our production environment. I would like to work towards a more durable fix in the ebs-csi-driver application.


j-land commented Nov 13, 2023

@zjalicflw Can you reopen this issue?

@torredil
Member

Hi @j-land, as a first step, I recommend upgrading to the latest version of the driver, which sets a more sensible default timeout value for the external attacher. See our release notes here for more information: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/CHANGELOG.md#v1250.

Beyond that, if you are still running into issues, I'd recommend enabling SDK logs via the sdkDebugLog parameter to gain further insight into networking- or auth-related issues. Feel free to open a new issue if you need any help.


j-land commented Nov 13, 2023

@torredil That's helpful, I appreciate it! Hopefully upgrading does the trick, but I'll enable SDK logs to debug if not.


nookseal commented Oct 1, 2024

Did upgrading solve the problem, @j-land?
