
Application pods in ContainerCreating status when aws cni pods restart #739

Closed
uruddarraju opened this issue Dec 2, 2019 · 2 comments

@uruddarraju
Contributor

Cluster setup: Kubernetes 1.12, aws-cni 1.6.rc1

We have been doing some chaos testing in our development cluster and found an interesting edge case that has been impacting our integration test runs in the Kubernetes cluster.

Scenario:

  1. AWS CNI stops running (as part of an upgrade)
  2. Kubelet tries provisioning a pod and gets an error from the CNI plugin because it cannot connect
  3. Kubelet requeues the pod for a later retry
  4. AWS CNI comes back to a running state
  5. Kubelet retries provisioning the container; it sees that this is a retry and tries to delete and recreate the PodSandbox (network namespace)
  6. The AWS CNI plugin gets a DEL container request from kubelet
  7. IPAMD gets an UnassignIP request for the pod
  8. IPAMD sees that an IP was never assigned and returns an error
  9. Kubelet fails setting up the pod

Logs

Kubelet:

Dec 02 22:42:42 ip-10-20-221-183 kubelet[6338]: E1202 22:42:42.730456    6338 remote_runtime.go:119] StopPodSandbox "ea67ed192d25d91ea4ae03a455754c95751641f521626cea6670ed85ca30682e" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "nginx-21-d88d97665-ngdpb_integration-test-jenkins-k8s-integration-flake-8433-dvr" network: rpc error: code = Unknown desc = datastore: unknown pod
Dec 02 22:42:42 ip-10-20-221-183 kubelet[6338]: E1202 22:42:42.730492    6338 kuberuntime_manager.go:810] Failed to stop sandbox {"docker" "ea67ed192d25d91ea4ae03a455754c95751641f521626cea6670ed85ca30682e"}
Dec 02 22:42:42 ip-10-20-221-183 kubelet[6338]: E1202 22:42:42.730528    6338 kuberuntime_manager.go:605] killPodWithSyncResult failed: failed to "KillPodSandbox" for "076c43cb-1551-11ea-a01c-06e059ec4554" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"nginx-21-d88d97665-ngdpb_integration-test-jenkins-k8s-integration-flake-8433-dvr\" network: rpc error: code = Unknown desc = datastore: unknown pod"
Dec 02 22:42:42 ip-10-20-221-183 kubelet[6338]: E1202 22:42:42.730546    6338 pod_workers.go:186] Error syncing pod 076c43cb-1551-11ea-a01c-06e059ec4554 ("nginx-21-d88d97665-ngdpb_integration-test-jenkins-k8s-integration-flake-8433-dvr(076c43cb-1551-11ea-a01c-06e059ec4554)"), skipping: failed to "KillPodSandbox" for "076c43cb-1551-11ea-a01c-06e059ec4554" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"nginx-21-d88d97665-ngdpb_integration-test-jenkins-k8s-integration-flake-8433-dvr\" network: rpc error: code = Unknown desc = datastore: unknown pod"

CNI:

/var/log/aws-routed-eni/plugin.log.2019-12-02-22:2019-12-02T22:44:20.764Z [INFO] 	Received CNI del request: ContainerID(ea67ed192d25d91ea4ae03a455754c95751641f521626cea6670ed85ca30682e) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=integration-test-jenkins-k8s-integration-flake-8433-dvr;K8S_POD_NAME=nginx-21-d88d97665-ngdpb;K8S_POD_INFRA_CONTAINER_ID=ea67ed192d25d91ea4ae03a455754c95751641f521626cea6670ed85ca30682e) Path(/opt/cni/bin/) argsStdinData({"cniVersion":"0.3.1","mtu":"9001","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
/var/log/aws-routed-eni/plugin.log.2019-12-02-22:2019-12-02T22:44:20.766Z [ERROR] 	Error received from DelNetwork grpc call for pod nginx-21-d88d97665-ngdpb namespace integration-test-jenkins-k8s-integration-flake-8433-dvr container ea67ed192d25d91ea4ae03a455754c95751641f521626cea6670ed85ca30682e: rpc error: code = Unknown desc = datastore: unknown pod

ipamD:

2019-12-02T22:46:55.726Z [DEBUG] 	UnassignPodIPv4Address: IP address pool stats: total:180, assigned 55, pod(Name: nginx-21-d88d97665-ngdpb, Namespace: integration-test-jenkins-k8s-integration-flake-8433-dvr, Container ea67ed192d25d91ea4ae03a455754c95751641f521626cea6670ed85ca30682e)
2019-12-02T22:46:55.726Z [WARN] 	UnassignPodIPv4Address: Failed to find pod nginx-21-d88d97665-ngdpb namespace integration-test-jenkins-k8s-integration-flake-8433-dvr Container ea67ed192d25d91ea4ae03a455754c95751641f521626cea6670ed85ca30682e
2019-12-02T22:46:55.726Z [DEBUG] 	UnassignPodIPv4Address: IP address pool stats: total:180, assigned 55, pod(Name: nginx-21-d88d97665-ngdpb, Namespace: integration-test-jenkins-k8s-integration-flake-8433-dvr, Container )
2019-12-02T22:46:55.726Z [WARN] 	UnassignPodIPv4Address: Failed to find pod nginx-21-d88d97665-ngdpb namespace integration-test-jenkins-k8s-integration

Looking at the code for UnassignIP, I see that ipamd returns an error when UnassignIP is called for a pod that doesn't have an IP assigned yet. We might have to make this path idempotent to fix the issue.

@uruddarraju
Contributor Author

A potential fix: #740

@jaypipes
Contributor

Should be fixed now with #740
