
Pods stuck in ContainerCreating or aws-node restarting after Service Account role for managed Addon is added/updated #1338

Closed
abhipth opened this issue Dec 17, 2020 · 6 comments

abhipth (Contributor) commented Dec 17, 2020

What happened:
On adding or updating the aws-node managed add-on's service account role from the console, the service account is updated with the new role ARN, but a rolling restart of the aws-node DaemonSet is not triggered, so the existing pods keep using the previous identity for all EC2 API calls. If the previous identity no longer has the AmazonEKS_CNI_Policy attached, the EC2 API calls start failing with UnauthorizedOperation, and aws-node may not come up or new pods may get stuck in ContainerCreating.

This can be verified by checking the ipamd logs for UnauthorizedOperation errors on EC2 API calls:

kubectl exec -ti -n kube-system <aws-node-name> -- grep "UnauthorizedOperation" /host/var/log/aws-routed-eni/ipamd.log | wc -l

Current Workaround

After updating the role associated with the service account from the console, manually triggering a rolling restart recreates the pods with the right identity:

kubectl rollout restart daemonset -n kube-system aws-node 

To verify that the new pods are using the right role ARN:

kubectl exec -ti -n kube-system <aws-node-name> -- env | grep AWS

AWS_ROLE_ARN=<NEW-ROLE-ARN>
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
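
If UnauthorizedOperation errors persist after the restart, it is also worth confirming that the new role actually has the CNI permissions attached. A minimal check (the role name is a placeholder for whatever role the add-on's service account now points at):

# AmazonEKS_CNI_Policy (or an equivalent custom policy) should show up in this output
aws iam list-attached-role-policies --role-name <NEW-ROLE-NAME>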

Attach logs
If the aws-node keeps restarting
Pod Events

Warning  Unhealthy  4m24s  kubelet Liveness probe failed: {"level":"info","ts":"2020-12-17T17:44:25.283Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
Warning  Unhealthy  35s   kubelet Readiness probe failed: {"level":"info","ts":"2020-12-17T17:43:34.677Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}

Logs

{"level":"error","ts":"2020-12-17T17:44:56.063Z","caller":"aws-k8s-agent/main.go:28","msg":"Initialization failure: ipamd: can not initialize with AWS SDK interface: refreshSGIDs: unable to update the ENI's SG: UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: MESSAGE status code: 403, request id: 1be865fa-f79b-4ecc-a07c-8238e546372e"}

If aws-node starts but is unable to invoke the EC2 API

{"level":"error","ts":"2020-12-15T06:22:54.435Z","caller":"ipamd/ipamd.go:627","msg":"Failed to increase pool size due to not able to allocate ENI AllocENI: failed to create ENI: failed to create network interface: UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: <MESSAGE> status code: 403, request id: 85e9f1e1-0cc1-4752-afce-aa18c4dce347"}
{"level":"error","ts":"2020-12-15T06:22:59.553Z","caller":"awsutils/awsutils.go:724","msg":"Failed to CreateNetworkInterface UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: <MESSAGE> status code: 403, request id: e86cd6eb-3f08-46e6-b245-83446a2c5285"}                       

What you expected to happen:
The managed add-on should either trigger a rolling restart of aws-node when the service account role is updated, or document that a manual rolling restart is required after updating the service account role.
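
For reference, the equivalent manual workflow from the CLI would look roughly like this (cluster name, account ID, and role name are placeholders; this is a sketch of the same steps done in the console):

# Point the vpc-cni managed add-on at the new IRSA role...
aws eks update-addon --cluster-name <cluster-name> --addon-name vpc-cni \
    --service-account-role-arn arn:aws:iam::<account-id>:role/<NEW-ROLE-NAME>

# ...then restart aws-node so the pods pick up the new AWS_ROLE_ARN.
kubectl rollout restart daemonset -n kube-system aws-node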

How to reproduce it (as minimally and precisely as possible):

  • Create aws-node as a managed add-on with the node instance role.
  • Update the service account role to a new role from the console.
  • Remove the AmazonEKS_CNI_Policy from the node instance role (a rough CLI sketch of these steps follows below).
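
As a rough CLI equivalent (cluster, account, and role names are placeholders; I reproduced this via the console, so treat these commands as a sketch):

# 1. Install the managed add-on; without a service account role it uses the node instance role.
aws eks create-addon --cluster-name <cluster-name> --addon-name vpc-cni

# 2. Switch the add-on's service account to a new IRSA role (note: no pod restart is triggered).
aws eks update-addon --cluster-name <cluster-name> --addon-name vpc-cni \
    --service-account-role-arn arn:aws:iam::<account-id>:role/<NEW-ROLE-NAME>

# 3. Detach the CNI policy from the node instance role so the old identity loses EC2 permissions.
aws iam detach-role-policy --role-name <node-instance-role-name> \
    --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy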

Environment:

  • Kubernetes version (use kubectl version): v1.18
  • CNI Version: v1.7.5
  • Managed Add On Enabled: Yes
@abhipth abhipth added the bug label Dec 17, 2020
@jayanthvn jayanthvn changed the title Pods stuck in ContainerCreating or aws-node restarting after Service Account role for managed on is added/updated Pods stuck in ContainerCreating or aws-node restarting after Service Account role for managed Addon is added/updated Dec 17, 2020
0xlen commented Jun 9, 2021

I experienced the same issue and was able to see that ipamd got the UnauthorizedOperation error when using CNI plugin v1.7.5 because the AmazonEKS_CNI_Policy was missing; the CNI plugin keeps restarting:

ipamd

{"level":"error","ts":"2021-06-09T08:43:25.353Z","caller":"aws-k8s-agent/main.go:28","msg":"Initialization failure: ipamd: can not initialize with AWS SDK interface: refreshSGIDs: unable to update the ENI's SG: UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: ....\n\tstatus code: 403, request id: 14fef5de-6046-4fa7-894f-b5a7559430ba"}

kubelet

Jun 09 08:30:23 ip-172-31-3-84.ec2.internal kubelet[4561]: I0609 08:30:23.934720    4561 prober.go:117] Liveness probe for "aws-node-4k7pr_kube-system(352b46ea-0570-4e98-960e-267b2c732da9):aws-node" failed (failure):

Jun 09 08:30:25 ip-172-31-3-84.ec2.internal kubelet[4561]: E0609 08:30:25.094183    4561 kubelet.go:2187] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Jun 09 08:30:25 ip-172-31-3-84.ec2.internal kubelet[4561]: E0609 08:30:25.499488    4561 remote_runtime.go:392] ExecSync 914c6eeef5949b91e3fb8361aae032ffb00b6eca29ce954326ffb4567286b1ca '/app/grpc-health-probe -addr=:50051' from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Jun 09 08:30:25 ip-172-31-3-84.ec2.internal kubelet[4561]: I0609 08:30:25.499542    4561 prober.go:117] Readiness probe for "aws-node-4k7pr_kube-system(352b46ea-0570-4e98-960e-267b2c732da9):aws-node" failed (failure):

...
Jun 09 08:47:30 ip-172-31-3-84.ec2.internal kubelet[4561]: E0609 08:47:30.740652    4561 pod_workers.go:191] Error syncing pod 352b46ea-0570-4e98-960e-267b2c732da9 ("aws-node-4k7pr_kube-system(352b46ea-0570-4e98-960e-267b2c732da9)"), skipping: failed to "StartContainer" for "aws-node" with CrashLoopBackOff: "back-off 5m0s restarting failed container=aws-node pod=aws-node-4k7pr_kube-system(352b46ea-0570-4e98-960e-267b2c732da9)"
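
When hitting this, a quick check (a sketch; the jsonpath escaping assumes a bash-like shell) is to compare the role ARN annotated on the service account with the one an existing pod actually received, since running pods keep the old value until they are recreated:

# Role ARN currently annotated on the aws-node service account
kubectl get sa aws-node -n kube-system -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'

# Role ARN injected into an existing aws-node pod (stale until the pod is restarted)
kubectl exec -n kube-system <aws-node-name> -- env | grep AWS_ROLE_ARN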

cp38510 commented Jul 14, 2021

Hello!
With amazon-k8s-cni:v1.7.5-eksbuild.1 we have already hit a similar error twice, but we don't use managed nodes. We have several clusters in EKS; this error appears suddenly and it's not clear how to reproduce it.

Has anyone tried updating amazon-k8s-cni?

logs:

 aws-node {"level":"info","ts":"2021-07-14T07:13:49.321Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
 aws-node {"level":"info","ts":"2021-07-14T07:13:49.334Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
 aws-node {"level":"info","ts":"2021-07-14T07:13:49.336Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
 aws-node stream closed

describe:

  Normal   Pulling    58m (x356 over 25h)     kubelet, ip-10-8-137-214.eu-central-1.compute.internal  Pulling image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1"
  Warning  Unhealthy  8m6s (x4735 over 25h)   kubelet, ip-10-8-137-214.eu-central-1.compute.internal  (combined from similar events): Liveness probe failed: {"level":"info","ts":"2021-07-14T07:13:28.981Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  BackOff    2m56s (x4464 over 25h)  kubelet, ip-10-8-137-214.eu-central-1.compute.internal  Back-off restarting failed container
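
Since the entrypoint dies right after "Checking for IPAM connectivity", it can help to read the ipamd log directly on the node while the pod is crash-looping (paths below are the CNI defaults; SSH/SSM access to the node is assumed):

# On the affected node: ipamd usually logs why initialization failed
sudo tail -n 50 /var/log/aws-routed-eni/ipamd.log

# CNI plugin log, in case ipamd never came up at all
sudo tail -n 50 /var/log/aws-routed-eni/plugin.log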

@katherinel

@cp38510 Exact same problem here. Unmanaged EKS 1.17 with amazon-k8s-cni:v1.7.9

gerardgorrion commented Aug 27, 2021

Same problem here. We are using EKS v1.21 and CNI v1.7.5. We need to set some custom configuration on the cluster's CNI DaemonSet, so we deploy the aws-k8s-cni.yaml from v1.7.5. The problem is that the aws-node pods get stuck and restart again and again, and some time later some cluster nodes go into NotReady status. Any ideas?

Node description:

Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Fri, 27 Aug 2021 13:27:29 +0200   Fri, 27 Aug 2021 13:32:42 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Fri, 27 Aug 2021 13:27:29 +0200   Fri, 27 Aug 2021 13:32:42 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Fri, 27 Aug 2021 13:27:29 +0200   Fri, 27 Aug 2021 13:32:42 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Fri, 27 Aug 2021 13:27:29 +0200   Fri, 27 Aug 2021 13:32:42 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
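
For nodes in this state it may be worth checking, on the node itself, whether kubelet is still running and whether the CNI config written by aws-node is present (the file name below is the default installed by the VPC CNI; node access via SSH/SSM is assumed):

# "Kubelet stopped posting node status" usually means kubelet itself is down or wedged
sudo systemctl status kubelet

# An empty directory here matches the "cni config uninitialized" errors seen earlier in this thread
ls /etc/cni/net.d/    # should contain 10-aws.conflist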

@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Apr 14, 2022
@github-actions

Issue closed due to inactivity.
