
Pods stuck in ContainerCreating or aws-node restarting after Service Account role for managed Addon is added/updated #1338

Closed
abhipth opened this issue Dec 17, 2020 · 6 comments

abhipth (Contributor) commented Dec 17, 2020

What happened:
On adding or updating the aws-node managed add-on's service account role from the console, the service account is updated with the new role ARN, but a rolling restart of the aws-node DaemonSet is not triggered, so the existing pods keep using the previous identity for all EC2 API calls. If the previous identity no longer has the AmazonEKS_CNI_Policy attached, the EC2 API calls start failing with UnauthorizedOperation, and aws-node may not come up or new pods may get stuck in ContainerCreating.

This can be verified by checking the ipamd logs for UnauthorizedOperation errors on EC2 API calls:

kubectl exec -ti -n kube-system <aws-node-name> -- grep "UnauthorizedOperation" /host/var/log/aws-routed-eni/ipamd.log | wc -l

Current Workaround

After updating the role associated with the service account from the console, manually triggering a rolling restart recreates the pods with the right identity:

kubectl rollout restart daemonset -n kube-system aws-node 

To verify that the new pods are using the right role ARN:

kubectl exec -ti -n kube-system <aws-node-name> -- env | grep AWS

AWS_ROLE_ARN=<NEW-ROLE-ARN>
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
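
If UnauthorizedOperation errors persist after the restart, it is also worth confirming that the new role actually has the CNI permissions attached. A minimal check (the role name is a placeholder for whatever role the add-on's service account now points at):

# AmazonEKS_CNI_Policy (or an equivalent custom policy) should show up in this output
aws iam list-attached-role-policies --role-name <NEW-ROLE-NAME>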

Attach logs
If the aws-node keeps restarting
Pod Events

Warning  Unhealthy  4m24s  kubelet Liveness probe failed: {"level":"info","ts":"2020-12-17T17:44:25.283Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
Warning  Unhealthy  35s   kubelet Readiness probe failed: {"level":"info","ts":"2020-12-17T17:43:34.677Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}

Logs

{"level":"error","ts":"2020-12-17T17:44:56.063Z","caller":"aws-k8s-agent/main.go:28","msg":"Initialization failure: ipamd: can not initialize with AWS SDK interface: refreshSGIDs: unable to update the ENI's SG: UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: MESSAGE status code: 403, request id: 1be865fa-f79b-4ecc-a07c-8238e546372e"}

If aws-node starts but is unable to invoke the EC2 API

{"level":"error","ts":"2020-12-15T06:22:54.435Z","caller":"ipamd/ipamd.go:627","msg":"Failed to increase pool size due to not able to allocate ENI AllocENI: failed to create ENI: failed to create network interface: UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: <MESSAGE> status code: 403, request id: 85e9f1e1-0cc1-4752-afce-aa18c4dce347"}
{"level":"error","ts":"2020-12-15T06:22:59.553Z","caller":"awsutils/awsutils.go:724","msg":"Failed to CreateNetworkInterface UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: <MESSAGE> status code: 403, request id: e86cd6eb-3f08-46e6-b245-83446a2c5285"}                       

What you expected to happen:
The managed add-on should either trigger a rolling restart of aws-node when the service account role is updated, or document that a manual rolling restart is required after updating the service account role.
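
For reference, the equivalent manual workflow from the CLI would look roughly like this (cluster name, account ID, and role name are placeholders; this is a sketch of the same steps done in the console):

# Point the vpc-cni managed add-on at the new IRSA role...
aws eks update-addon --cluster-name <cluster-name> --addon-name vpc-cni \
    --service-account-role-arn arn:aws:iam::<account-id>:role/<NEW-ROLE-NAME>

# ...then restart aws-node so the pods pick up the new AWS_ROLE_ARN.
kubectl rollout restart daemonset -n kube-system aws-node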

How to reproduce it (as minimally and precisely as possible):

  • Create aws-node as a managed add-on with the node instance role.
  • Update the service account role to a new role from the console.
  • Remove the AmazonEKS_CNI_Policy from the node instance role (a rough CLI sketch of these steps follows below).
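
As a rough CLI equivalent (cluster, account, and role names are placeholders; I reproduced this via the console, so treat these commands as a sketch):

# 1. Install the managed add-on; without a service account role it uses the node instance role.
aws eks create-addon --cluster-name <cluster-name> --addon-name vpc-cni

# 2. Switch the add-on's service account to a new IRSA role (note: no pod restart is triggered).
aws eks update-addon --cluster-name <cluster-name> --addon-name vpc-cni \
    --service-account-role-arn arn:aws:iam::<account-id>:role/<NEW-ROLE-NAME>

# 3. Detach the CNI policy from the node instance role so the old identity loses EC2 permissions.
aws iam detach-role-policy --role-name <node-instance-role-name> \
    --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy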

Environment:

  • Kubernetes version (use kubectl version): v1.18
  • CNI Version: v1.7.5
  • Managed Add On Enabled: Yes
@abhipth abhipth added the bug label Dec 17, 2020
@jayanthvn jayanthvn changed the title Pods stuck in ContainerCreating or aws-node restarting after Service Account role for managed on is added/updated Pods stuck in ContainerCreating or aws-node restarting after Service Account role for managed Addon is added/updated Dec 17, 2020
0xlen commented Jun 9, 2021

I experienced the same issue and was able to see that ipamd got the UnauthorizedOperation error when using CNI plugin v1.7.5 because the AmazonEKS_CNI_Policy was missing; the CNI plugin keeps restarting:

ipamd

{"level":"error","ts":"2021-06-09T08:43:25.353Z","caller":"aws-k8s-agent/main.go:28","msg":"Initialization failure: ipamd: can not initialize with AWS SDK interface: refreshSGIDs: unable to update the ENI's SG: UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: ....\n\tstatus code: 403, request id: 14fef5de-6046-4fa7-894f-b5a7559430ba"}

kubelet

Jun 09 08:30:23 ip-172-31-3-84.ec2.internal kubelet[4561]: I0609 08:30:23.934720    4561 prober.go:117] Liveness probe for "aws-node-4k7pr_kube-system(352b46ea-0570-4e98-960e-267b2c732da9):aws-node" failed (failure):

Jun 09 08:30:25 ip-172-31-3-84.ec2.internal kubelet[4561]: E0609 08:30:25.094183    4561 kubelet.go:2187] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Jun 09 08:30:25 ip-172-31-3-84.ec2.internal kubelet[4561]: E0609 08:30:25.499488    4561 remote_runtime.go:392] ExecSync 914c6eeef5949b91e3fb8361aae032ffb00b6eca29ce954326ffb4567286b1ca '/app/grpc-health-probe -addr=:50051' from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Jun 09 08:30:25 ip-172-31-3-84.ec2.internal kubelet[4561]: I0609 08:30:25.499542    4561 prober.go:117] Readiness probe for "aws-node-4k7pr_kube-system(352b46ea-0570-4e98-960e-267b2c732da9):aws-node" failed (failure):

...
Jun 09 08:47:30 ip-172-31-3-84.ec2.internal kubelet[4561]: E0609 08:47:30.740652    4561 pod_workers.go:191] Error syncing pod 352b46ea-0570-4e98-960e-267b2c732da9 ("aws-node-4k7pr_kube-system(352b46ea-0570-4e98-960e-267b2c732da9)"), skipping: failed to "StartContainer" for "aws-node" with CrashLoopBackOff: "back-off 5m0s restarting failed container=aws-node pod=aws-node-4k7pr_kube-system(352b46ea-0570-4e98-960e-267b2c732da9)"
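
When hitting this, a quick check (a sketch; the jsonpath escaping assumes a bash-like shell) is to compare the role ARN annotated on the service account with the one an existing pod actually received, since running pods keep the old value until they are recreated:

# Role ARN currently annotated on the aws-node service account
kubectl get sa aws-node -n kube-system -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'

# Role ARN injected into an existing aws-node pod (stale until the pod is restarted)
kubectl exec -n kube-system <aws-node-name> -- env | grep AWS_ROLE_ARN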

cp38510 commented Jul 14, 2021

Hello!
With amazon-k8s-cni:v1.7.5-eksbuild.1 we have already hit a similar error twice, but we don't use managed nodes. We have several clusters in EKS; this error appears suddenly and it's not clear how to reproduce it.

Has anyone tried updating amazon-k8s-cni?

logs:

 aws-node {"level":"info","ts":"2021-07-14T07:13:49.321Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
 aws-node {"level":"info","ts":"2021-07-14T07:13:49.334Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
 aws-node {"level":"info","ts":"2021-07-14T07:13:49.336Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
 aws-node stream closed

describe:

  Normal   Pulling    58m (x356 over 25h)     kubelet, ip-10-8-137-214.eu-central-1.compute.internal  Pulling image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1"
  Warning  Unhealthy  8m6s (x4735 over 25h)   kubelet, ip-10-8-137-214.eu-central-1.compute.internal  (combined from similar events): Liveness probe failed: {"level":"info","ts":"2021-07-14T07:13:28.981Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  BackOff    2m56s (x4464 over 25h)  kubelet, ip-10-8-137-214.eu-central-1.compute.internal  Back-off restarting failed container
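
Since the entrypoint dies right after "Checking for IPAM connectivity", it can help to read the ipamd log directly on the node while the pod is crash-looping (paths below are the CNI defaults; SSH/SSM access to the node is assumed):

# On the affected node: ipamd usually logs why initialization failed
sudo tail -n 50 /var/log/aws-routed-eni/ipamd.log

# CNI plugin log, in case ipamd never came up at all
sudo tail -n 50 /var/log/aws-routed-eni/plugin.log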

@katherinel

@cp38510 Exact same problem here. Unmanaged EKS 1.17 with amazon-k8s-cni:v1.7.9

gerardgorrion commented Aug 27, 2021

Same problem here. We are using EKS v1.21 and CNI v1.7.5. We need to set some custom configuration on the cluster's CNI DaemonSet, so we deploy the aws-k8s-cni.yaml from v1.7.5. The problem is that the aws-node pods get stuck and restart again and again, and some time later some cluster nodes go into NotReady status. Any ideas?

Node description:

Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Fri, 27 Aug 2021 13:27:29 +0200   Fri, 27 Aug 2021 13:32:42 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Fri, 27 Aug 2021 13:27:29 +0200   Fri, 27 Aug 2021 13:32:42 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Fri, 27 Aug 2021 13:27:29 +0200   Fri, 27 Aug 2021 13:32:42 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Fri, 27 Aug 2021 13:27:29 +0200   Fri, 27 Aug 2021 13:32:42 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
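
For nodes in this state it may be worth checking, on the node itself, whether kubelet is still running and whether the CNI config written by aws-node is present (the file name below is the default installed by the VPC CNI; node access via SSH/SSM is assumed):

# "Kubelet stopped posting node status" usually means kubelet itself is down or wedged
sudo systemctl status kubelet

# An empty directory here matches the "cni config uninitialized" errors seen earlier in this thread
ls /etc/cni/net.d/    # should contain 10-aws.conflist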

@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Apr 14, 2022
@github-actions

Issue closed due to inactivity.
