Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eventually some pods Stucks on "container creating" #1479

Closed
gabrieladt opened this issue Jun 3, 2021 · 2 comments
Closed

Eventually some pods Stucks on "container creating" #1479

gabrieladt opened this issue Jun 3, 2021 · 2 comments
Labels

Comments

@gabrieladt
Copy link

From time to time, we have a problem in our production environment, where some pods stucks in ContainerCreating.

Attach logs
eks_i-0f7c80b4e860e2978_2021-06-03_1124-UTC_0.6.2.tar.gz

What you expected to happen:

Depending on the node which it was scheduled, here is an example:
Screenshot 2021-06-03 at 14 56 20

Now, analyzing, I could see what is happening on the node ip-10-38-58-208.ec2.internal

{"level":"info","ts":"2021-06-02T14:56:59.994Z","caller":"awsutils/awsutils.go:724","msg":"Created a new ENI: eni-05ec4b26a3b50051c"} <this when the eni is created, all good here>
...
...
{"level":"debug","ts":"2021-06-02T23:03:03.702Z","caller":"datastore/data_store.go:675","msg":"ENI eni-05ec4b26a3b50051c cannot be deleted because it is required for WARM_IP_TARGET: 5"} < controller starts to try to delete the ENI, all good until here >
...
...
After some time, He actually manage to delete the ENI,
{"level":"debug","ts":"2021-06-02T23:05:40.462Z","caller":"datastore/data_store.go:675","msg":"getDeletableENI: found a deletable ENI eni-05ec4b26a3b50051c"}
{"level":"info","ts":"2021-06-02T23:05:40.462Z","caller":"ipamd/ipamd.go:523","msg":"RemoveUnusedENIFromStore eni-05ec4b26a3b50051c: IP address pool stats: free 1 addresses, total: 52, assigned: 46"}
{"level":"debug","ts":"2021-06-02T23:05:40.462Z","caller":"ipamd/ipamd.go:489","msg":"Start freeing ENI eni-05ec4b26a3b50051c"}
{"level":"info","ts":"2021-06-02T23:05:40.462Z","caller":"awsutils/awsutils.go:946","msg":"Trying to free ENI: eni-05ec4b26a3b50051c"}
{"level":"debug","ts":"2021-06-02T23:05:40.577Z","caller":"awsutils/awsutils.go:946","msg":"Found ENI eni-05ec4b26a3b50051c attachment id: eni-attach-0b34b3738a21e3dce "}
{"level":"info","ts":"2021-06-02T23:05:41.089Z","caller":"retry/retry.go:69","msg":"Successfully detached ENI: eni-05ec4b26a3b50051c"}
{"level":"debug","ts":"2021-06-02T23:05:43.090Z","caller":"awsutils/awsutils.go:990","msg":"Trying to delete ENI: eni-05ec4b26a3b50051c"}
{"level":"debug","ts":"2021-06-02T23:05:43.376Z","caller":"retry/retry.go:69","msg":"Not able to delete ENI: InvalidParameterValue: Network interface 'eni-05ec4b26a3b50051c' is currently in use.\n\tstatus code: 400, request id: 4e7d28b4-2d61-4b58-9bfc-fe8ddc1c9077 "}
{"level":"debug","ts":"2021-06-02T23:05:44.199Z","caller":"retry/retry.go:69","msg":"Not able to delete ENI: InvalidParameterValue: Network interface 'eni-05ec4b26a3b50051c' is currently in use.\n\tstatus code: 400, request id: 9b345d05-c090-47e7-8373-9177f929dfcf "}
{"level":"info","ts":"2021-06-02T23:05:45.730Z","caller":"retry/retry.go:69","msg":"Successfully deleted ENI: eni-05ec4b26a3b50051c"}
{"level":"info","ts":"2021-06-02T23:05:45.730Z","caller":"awsutils/awsutils.go:946","msg":"Successfully freed ENI: eni-05ec4b26a3b50051c"}
... after some time, we had the following situation
{"level":"debug","ts":"2021-06-02T23:07:10.019Z","caller":"awsutils/awsutils.go:569","msg":"Found ENI: eni-05ec4b26a3b50051c, MAC 0e:c3:3c:06:0d:55, device 2"}
...
{"level":"error","ts":"2021-06-02T23:07:10.090Z","caller":"ipamd/ipamd.go:934","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-0770f69b1b7fd3933 eni-05ec4b26a3b50051c eni-003f996afb640642a]: InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-05ec4b26a3b50051c' does not exist\n\tstatus code: 400, request id: 85313606-a8dd-48a0-ad57-34d808d7e7a0"}
{"level":"debug","ts":"2021-06-02T23:07:10.090Z","caller":"ipamd/ipamd.go:934","msg":"Could not find interface: The networkInterface ID 'eni-05ec4b26a3b50051c' does not exist, ID: eni-05ec4b26a3b50051c"}

... end after some hours we end up having the aws-cni pod (agent) not able to start anymore

{"level":"debug","ts":"2021-06-03T11:20:23.645Z","caller":"awsutils/awsutils.go:388","msg":"Update ENI eni-05ec4b26a3b50051c"}
{"level":"error","ts":"2021-06-03T11:20:23.808Z","caller":"aws-k8s-agent/main.go:28","msg":"Initialization failure: ipamd: can not initialize with AWS SDK interface: refreshSGIDs: unable to update the ENI's SG: InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-05ec4b26a3b50051c' does not exist\n\tstatus code: 400, request id: 9a2ec4d8-0b6a-4bbd-ab57-0995d1db9762"}

Then as expected, Kubelet start to have issues as well

Jun 3 11:24:32 ip-10-38-58-208 kubelet: E0603 11:24:32.026947 4869 pod_workers.go:191] Error syncing pod 5b7ad595-0a55-490d-af50-a83800f8cff4 ("cash-collection-hn-6fd657fc96-wnsrx_default(5b7ad595-0a55-490d-af50-a83800f8cff4)"), skipping: failed to "KillPodSandbox" for "5b7ad595-0a55-490d-af50-a83800f8cff4" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "cash-collection-hn-6fd657fc96-wnsrx_default" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused""

Extra
describe pod on aws-node-p62dc
Events:
Type Reason Age From Message


Warning Unhealthy 14m (x1318 over 7h18m) kubelet (combined from similar events): Readiness probe failed: {"level":"info","ts":"2021-06-03T12:52:35.683Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service ":50051" within 1s"}
Warning BackOff 4m23s (x1244 over 7h7m) kubelet Back-off restarting failed container

describe pod on dashboard-ar-web-598cf9b987-jq4pv
Events:
Type Reason Age From Message


Normal SandboxChanged 105s (x1019 over 3h41m) kubelet Pod sandbox changed, it will be killed and re-created.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
Deleting the aws-cni pod, did not solve, still trying to get the eni-05ec4b26a3b50051c interface, which as we saw does not exist anymore.
{"level":"error","ts":"2021-06-03T13:18:15.960Z","caller":"aws-k8s-agent/main.go:28","msg":"Initialization failure: ipamd: can not initialize with AWS SDK interface: refreshSGIDs: unable to update the ENI's SG: InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-05ec4b26a3b50051c' does not exist\n\tstatus code: 400, request id: db7e1487-0a22-4873-af78-f9c343b57b20"}

Environment: eks Production

  • Kubernetes version (use kubectl version): v1.18.9-eks-d1db3c
  • CNI Version: v1.7.5
  • OS (e.g: cat /etc/os-release):
    NAME="Amazon Linux"
    VERSION="2"
    ID="amzn"
    ID_LIKE="centos rhel fedora"
    VERSION_ID="2"
    PRETTY_NAME="Amazon Linux 2"
    ANSI_COLOR="0;33"
    CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
    HOME_URL="https://amazonlinux.com/"
  • Kernel (e.g. uname -a):
    Linux ip-10-38-58-208.ec2.internal 4.14.209-160.339.amzn2.x86_64 Initial commit of amazon-vpc-cni-k8s #1 SMP Wed Dec 16 22:44:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
@gabrieladt gabrieladt added the bug label Jun 3, 2021
@jayanthvn
Copy link
Contributor

Hi @gabrieladt

Nice debugging :), based on the logs seems like IMDS has never synced after the delete -

{"level":"info","ts":"2021-06-02T23:05:45.730Z","caller":"awsutils/awsutils.go:946","msg":"Successfully freed ENI: eni-05ec4b26a3b50051c"}

https://github.com/aws/amazon-vpc-cni-k8s/blob/release-1.7/pkg/awsutils/awsutils.go#L602

{"level":"debug","ts":"2021-06-02T23:07:10.019Z","caller":"awsutils/awsutils.go:569","msg":"Found ENI: eni-05ec4b26a3b50051c, MAC 0e:c3:3c:06:0d:55, device 2"}

The issue you are seeing after restart is known which is again because of this stale ENI - #1340

@jayanthvn
Copy link
Contributor

Please ref this - #1340 (comment) and will update #1340 once the race condition issue is resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants