Pods stuck in ContainerCreating due to CNI failed to set up pod: Pod Event says : Unauthorized; until aws-node pod restart. #1831
Comments
@malpania Was there an AMI upgrade on these old nodes? So, even a previously running |
Thanks, @achevuru, for your response.
**Was there an AMI upgrade on these old nodes?**
EKS Version - v1.21.4-eks-033ce7e
The latest node (35 days old now) is using: EKS Version - v1.21.5-eks-bc4871b
**Even a previously running Datadog pod on this node started failing as well**
**Did you check the resource usage on these nodes?** |
@malpania Is this an EKS cluster? This looks like some sort of cert expiry or SA token expiry, and these pods are not able to access certain resources (pods in this case). Restarting the CNI daemonset probably regenerated the SA token and restored access to these resources. This is not a CNI issue, and we should check what contributed to the expiry. The 100 days also reminds me that certs signed by providers like Let's Encrypt come with a default 90-day expiry. Not sure if you are using it in any way, but I thought it was worth calling out... |
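One way to test the token-expiry theory is to look at the `exp` claim of the service account token mounted in the affected pod. The sketch below is illustrative only: the helper names `decode_jwt_claims` and `seconds_until_expiry` are hypothetical, and it assumes the token is a standard three-segment JWT read from the usual mount path `/var/run/secrets/kubernetes.io/serviceaccount/token`. It decodes the payload without verifying the signature, so it is only good for inspection.

```python
import base64
import json
import time
from typing import Optional


def decode_jwt_claims(token: str) -> dict:
    """Return the unverified claims from a JWT's payload segment."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload = parts[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def seconds_until_expiry(token: str, now: Optional[float] = None) -> Optional[float]:
    """Seconds until the token's `exp` claim, or None if there is no `exp`."""
    exp = decode_jwt_claims(token).get("exp")
    if exp is None:
        return None
    current = time.time() if now is None else now
    return exp - current
```

Inside a pod this could be called as `seconds_until_expiry(open("/var/run/secrets/kubernetes.io/serviceaccount/token").read())`; a negative result would mean the token on disk has already passed its `exp` time.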
Is this an EKS cluster? @achevuru: I have captured the zip file after running /opt/cni/bin/aws-cni-support.sh. Let me know if you would like me to send it to you by email. |
@achevuru wondering if you've come across anything regarding this? We're starting to see this happen in multiple production EKS clusters and just wanted to check |
@envybee EKS/K8S 1.21 has the BoundServiceAccountToken feature enabled by default, so ServiceAccount tokens are time and audience bound. If you're using VPC CNI 1.9.x+, you shouldn't be affected by this issue with regard to VPC CNI pods. If you're on an older CNI version, you can upgrade to the latest version. If your application pods are running into this, you can check whether they read the refreshed token periodically. @malpania Sorry, I missed your reply. Were you able to figure out what contributed to the expired SA tokens? Also, was it a brand new EKS 1.21 cluster, or did you upgrade your existing EKS clusters to EKS 1.21? You can send your logs to |
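To make "read the refreshed token periodically" concrete, here is a minimal sketch of an application-side cache that re-reads the projected token file instead of loading it once at startup. The class name `RefreshingToken` and the 60-second `max_age_seconds` default are assumptions for illustration, not anything prescribed by EKS or the VPC CNI; the path is the standard service account mount. Recent Kubernetes client libraries generally handle this themselves, so hand-rolled code like this would only matter for custom clients.

```python
import time
from typing import Callable, Optional

# Standard mount path for the pod's service account token.
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


class RefreshingToken:
    """Cache a service account token and re-read it from disk when stale."""

    def __init__(
        self,
        path: str = TOKEN_PATH,
        max_age_seconds: float = 60.0,  # illustrative value, not a recommendation
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._path = path
        self._max_age = max_age_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._loaded_at: Optional[float] = None

    def get(self) -> str:
        """Return the cached token, reloading it if the cache is too old."""
        now = self._clock()
        stale = self._loaded_at is None or now - self._loaded_at >= self._max_age
        if stale:
            # The kubelet rotates the file in place, so a fresh read picks up
            # the new token without restarting the pod.
            with open(self._path, encoding="utf-8") as f:
                self._token = f.read().strip()
            self._loaded_at = now
        return self._token
```

A caller would build the `Authorization: Bearer ...` header from `token.get()` on each request rather than storing the string once.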
Hi @achevuru, we upgraded our cluster from 1.19 to 1.20 and are now running 1.21. We are running CNI version 1.9.0. |
We have raised a similar issue at istio/istio#38077 |
I am having this exact same issue on multiple clusters. It happens around twice a week, which is REALLY annoying. Running EKS 1.21, CNI 1.10.1, managed node groups. When this happens, NO pods can spawn at all, since the aws-node CNI is unable to set up networking for them. I've collected logs and created support tickets to no avail. If you check the aws-node daemonset logs, there are no errors or anything; everything seems fine. |
@jbilliau-rcd Are you using the Security Groups Per Pod feature? Please refer to https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-1.21 |
@achevuru nope, not using the security group feature at all, and we are already on 1.10.1 of the CNI. It happened on 1.7.5, and we thought upgrading would fix it, but it is still happening. As for an application being dependent on an old k8s client SDK, this has nothing to do with our applications; the pods themselves won't even attempt to schedule because of this error, so the code isn't even being called at this point. Just had it happen again last night; restarting the aws-node daemonset fixes it:
I'm not quite sure I understand this |
Found this issue on the Istio GitHub. Maybe aws-node isn't actually the problem and is a red herring? |
This issue is more Istio-related. Closing this, as the attached ticket has to be fixed by the Istio team. |
On nodes more than 100 days old, when a few pods such as nats or the Datadog agent get scheduled, the pod gets stuck in ContainerCreating. Node sizes vary and the subnet has sufficient IPs available. The newer nodes (say 20-30 days old) do not have this issue. If these pods get scheduled on recently launched nodes, they come up fine. Also, the nodes are not spot instances. We have a pod scheduling policy set to schedule only on on-demand nodes.
We have 2 different EKS clusters running in different accounts. One is over 100 days old and the other is 90 days old. This problem started happening recently, from about 26 Jan 2022 onwards. Prior to that we never had this issue.
The error message in the pod events is shown below:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "551807b9b2e97601e6779a8435b7650d6a54b0c11292c4e6a63365659d0dc846" network for pod "nats-2": networkPlugin cni failed to set up pod "nats-2_nats" network: Unauthorized
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "0aa5c3567cf0438bae62b6a9f61567bb66cd33b17ffa533fd0c9341639e864e9" network for pod "datadog-l8jct": networkPlugin cni failed to set up pod "datadog-l8jct_datadog-system" network: Unauthorized
Workaround:
I tried the solution from #59: restarting the aws-node pod on the node where the issue happened solved it.
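The restart described here can be scoped to the single affected node. The sketch below only builds the `kubectl` argument list; the function name `aws_node_restart_command` is hypothetical, and it assumes the aws-node daemonset pods run in `kube-system` with the label `k8s-app=aws-node`, which is worth confirming on your own cluster before running anything. Deleting the pod lets the daemonset recreate it on that node.

```python
from typing import List


def aws_node_restart_command(node_name: str, namespace: str = "kube-system") -> List[str]:
    """Build a kubectl command that deletes the aws-node pod on one node.

    Assumes the daemonset pods carry the label k8s-app=aws-node.
    """
    if not node_name:
        raise ValueError("node_name must not be empty")
    return [
        "kubectl", "delete", "pod",
        "--namespace", namespace,
        "--selector", "k8s-app=aws-node",
        # Restrict the deletion to pods scheduled on the affected node.
        "--field-selector", f"spec.nodeName={node_name}",
    ]


# To actually run it (requires kubectl and cluster access):
#   import subprocess
#   subprocess.run(aws_node_restart_command("<node-name>"), check=True)
```

This is only a workaround: it gets pods scheduling again on that node but does not explain why the `Unauthorized` error appeared.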
Attach logs
IPAMD.log file and similar logs in plugin.log
Datadog pod log (a running pod started failing and was stuck after restart)
How to reproduce it (as minimally and precisely as possible):
I have no idea how to reproduce it as it started appearing suddenly.
Environment: