
ip rule leaked under some conditions #2048

Closed
cdemonchy opened this issue Aug 1, 2022 · 12 comments
@cdemonchy

Hi,

What happened:

Sometimes the ip rules associated with a pod are not deleted when the pod is deleted. The leaked rules are never cleaned up, but the IP is freed in the IPAM.

The IP can then be reallocated on another node in the cluster. When this happens, the pods on the node with the leaked rules can't reach the pod with the reallocated IP.

We hit this case when the cluster autoscaler fails to evict a node:

  • the cluster autoscaler wants to evict the node
  • pods on the node are stopped, calling the CNI plugin; the aws-node pod is stopped and restarts quickly
  • for some pods the CNI plugin can't reach ipamd before the container is deleted; in this case the rules are leaked
  • after 2 minutes, since it didn't manage to move all the pods quickly (due to a very conservative PDB), the eviction is stopped and the node comes back into the pool (see the cluster autoscaler FAQ)

Attach logs

The logs for a pod during the delete phase:

{"level":"debug","ts":"2022-07-31T11:35:11.678Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Prev Result: &{0.4.0 [{Name:eni054dbc5e65c Mac: Sandbox:} {Name:eth0 Mac: Sandbox:/proc/12533/ns/net}] [{Version:4 Interface:0xc000270280 Address:{IP:10.109.167.233 Mask:ffffffff} Gateway:<nil>}] [] {[]  [] []}}\n"}
{"level":"info","ts":"2022-07-31T11:35:11.678Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Received CNI del request: ContainerID(2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69) Netns(/proc/12533/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=prod-annonce;K8S_POD_NAME=XXX-84c9dd9cc5-5s86q;K8S_POD_INFRA_CONTAINER_ID=2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.4.0\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"podSGEnforcingMode\":\"strict\",\"prevResult\":{\"cniVersion\":\"0.4.0\",\"interfaces\":[{\"name\":\"eni054dbc5e65c\"},{\"name\":\"eth0\",\"sandbox\":\"/proc/12533/ns/net\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.109.167.233/32\"}],\"dns\":{}},\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"error","ts":"2022-07-31T11:35:11.678Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Error received from DelNetwork gRPC call for container 2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused\""}
{"level":"debug","ts":"2022-07-31T11:35:12.404Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Prev Result: &{0.4.0 [{Name:eni054dbc5e65c Mac: Sandbox:} {Name:eth0 Mac: Sandbox:/proc/12533/ns/net}] [{Version:4 Interface:0xc0002e0280 Address:{IP:10.109.167.233 Mask:ffffffff} Gateway:<nil>}] [] {[]  [] []}}\n"}
{"level":"info","ts":"2022-07-31T11:35:12.404Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Received CNI del request: ContainerID(2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=prod-annonce;K8S_POD_NAME=XXX-84c9dd9cc5-5s86q;K8S_POD_INFRA_CONTAINER_ID=2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.4.0\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"podSGEnforcingMode\":\"strict\",\"prevResult\":{\"cniVersion\":\"0.4.0\",\"interfaces\":[{\"name\":\"eni054dbc5e65c\"},{\"name\":\"eth0\",\"sandbox\":\"/proc/12533/ns/net\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.109.167.233/32\"}],\"dns\":{}},\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"error","ts":"2022-07-31T11:35:12.404Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Error received from DelNetwork gRPC call for container 2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused\""}
{"level":"debug","ts":"2022-07-31T11:35:13.476Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Prev Result: &{0.4.0 [{Name:eni054dbc5e65c Mac: Sandbox:} {Name:eth0 Mac: Sandbox:/proc/12533/ns/net}] [{Version:4 Interface:0xc00027a2e0 Address:{IP:10.109.167.233 Mask:ffffffff} Gateway:<nil>}] [] {[]  [] []}}\n"}
{"level":"info","ts":"2022-07-31T11:35:13.476Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Received CNI del request: ContainerID(2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=prod-annonce;K8S_POD_NAME=XXX-84c9dd9cc5-5s86q;K8S_POD_INFRA_CONTAINER_ID=2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.4.0\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"podSGEnforcingMode\":\"strict\",\"prevResult\":{\"cniVersion\":\"0.4.0\",\"interfaces\":[{\"name\":\"eni054dbc5e65c\"},{\"name\":\"eth0\",\"sandbox\":\"/proc/12533/ns/net\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.109.167.233/32\"}],\"dns\":{}},\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"error","ts":"2022-07-31T11:35:13.477Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Error received from DelNetwork gRPC call for container 2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused\""}
{"level":"debug","ts":"2022-07-31T11:36:10.189Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Prev Result: &{0.4.0 [{Name:eni054dbc5e65c Mac: Sandbox:} {Name:eth0 Mac: Sandbox:/proc/12533/ns/net}] [{Version:4 Interface:0xc0002a0a70 Address:{IP:10.109.167.233 Mask:ffffffff} Gateway:<nil>}] [] {[]  [] []}}\n"}
{"level":"info","ts":"2022-07-31T11:36:10.189Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Received CNI del request: ContainerID(2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=prod-annonce;K8S_POD_NAME=XXX-84c9dd9cc5-5s86q;K8S_POD_INFRA_CONTAINER_ID=2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.4.0\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"podSGEnforcingMode\":\"strict\",\"prevResult\":{\"cniVersion\":\"0.4.0\",\"interfaces\":[{\"name\":\"eni054dbc5e65c\"},{\"name\":\"eth0\",\"sandbox\":\"/proc/12533/ns/net\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.109.167.233/32\"}],\"dns\":{}},\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"info","ts":"2022-07-31T11:36:10.190Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Container 2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69 not found"}

We can see three attempts that fail with a connection refused when connecting to ipamd. The final attempt fails because the container has already been removed.

When looking at the rule tables, the rules associated with the IP are still present.

# ip rule | grep 10.109.167.233
from all to 10.109.167.233 lookup main 
from 10.109.167.233 lookup 2

I'll send the complete support tarball by mail.

What you expected to happen:
We expect that the rules are removed when a container is stopped.

Anything else we need to know?:

Looking at the logic of the del function in cni.go:

In our case, during the first call the function tryDelWithPrevResult didn't do anything, as the code seems to only handle pods with a security group attached. We're not attaching security groups to pods; we're using the "individual IP addresses assigned to network interface" mode.

During the last call, the call to ipamd seems to succeed, but the cleanup in the del function is not executed (cni.go#L405) because we hit a container-not-found error before reaching it (cni.go#L364).

Maybe it's possible to do the cleanup in tryDelWithPrevResult, like it is done for branch ENIs?
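
To make the suggestion concrete, here is a minimal, hypothetical sketch (not the plugin's actual code) of what cleaning up the two per-pod rules from the prevResult IP could look like with the vishvananda/netlink library, assuming the rule layout shown in the ip rule output above:

```go
// Hypothetical sketch only: delete the two host-side rules the plugin
// programs for a pod IP, using the IP recovered from prevResult.
// The rule shapes are assumed from the `ip rule` output above:
//   from all to <podIP> lookup main   (to-pod rule)
//   from <podIP> lookup <table>       (from-pod rule)
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

func cleanupPodRules(podIP net.IP, fromPodTable int) {
	podCIDR := &net.IPNet{IP: podIP, Mask: net.CIDRMask(32, 32)}

	// "from all to <podIP> lookup main"
	toPod := netlink.NewRule()
	toPod.Dst = podCIDR
	toPod.Table = unix.RT_TABLE_MAIN
	if err := netlink.RuleDel(toPod); err != nil {
		// A real implementation would ignore "no such rule" errors.
		log.Printf("failed to delete to-pod rule for %s: %v", podIP, err)
	}

	// "from <podIP> lookup <table>"
	fromPod := netlink.NewRule()
	fromPod.Src = podCIDR
	fromPod.Table = fromPodTable
	if err := netlink.RuleDel(fromPod); err != nil {
		log.Printf("failed to delete from-pod rule for %s: %v", podIP, err)
	}
}

func main() {
	// Values taken from the logs above: leaked IP 10.109.167.233, table 2.
	cleanupPodRules(net.ParseIP("10.109.167.233"), 2)
}
```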

Environment:

  • Kubernetes version (use kubectl version): 1.21
  • CNI version: v1.11.2
@cdemonchy cdemonchy added the bug label Aug 1, 2022
@jayanthvn
Contributor

Hi @cdemonchy, we will land here (cni.go#L364) because of

if strings.Contains(err.Error(), datastore.ErrUnknownPod.Error()) {
which happens because the IPAMD state file is missing the container. One possible cause: when IPAMD started during the last retry, the container was already deleted, so IPAMD marked it as stale, never added it to the state file, and the IP is free. Now when the CNI tries to reach IPAMD, the container is not found in the state file, so an unknown-pod error is returned. I can confirm this once we have the logs. One solution we could implement in the plugin: when there is a new add, delete any stale rule that is present and then program the new rule. We do something similar to clean rules for SGPP pods.
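
A rough sketch of that idea (the function name and structure are illustrative, not the plugin's actual implementation): before programming the rules for a newly assigned pod IP, scan the existing IPv4 policy rules and delete any stale ones that still reference that IP.

```go
package rules

import (
	"net"

	"github.com/vishvananda/netlink"
)

// deleteStaleRulesForIP is an illustrative sketch, not the plugin's code:
// before adding rules for a freshly assigned pod IP, drop any leftover
// policy rules that still reference it as source or destination.
func deleteStaleRulesForIP(podIP net.IP) error {
	ruleList, err := netlink.RuleList(netlink.FAMILY_V4)
	if err != nil {
		return err
	}
	for _, r := range ruleList {
		staleDst := r.Dst != nil && r.Dst.IP.Equal(podIP)
		staleSrc := r.Src != nil && r.Src.IP.Equal(podIP)
		if !staleDst && !staleSrc {
			continue
		}
		rule := r // RuleDel takes a pointer; copy the loop variable
		if err := netlink.RuleDel(&rule); err != nil {
			return err
		}
	}
	return nil
}
```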

@cdemonchy
Author

I forgot to send the logs; I've just sent them.

@jayanthvn
Contributor

Hi @cdemonchy

Based on the logs, the CRI is missing the container, hence the IP was never recovered.

Here is a snapshot of the events:

IPAMD started ->

{"level":"info","ts":"2022-07-31T05:05:07.289Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}

Pod created ->

{"level":"info","ts":"2022-07-31T06:12:11.301Z","caller":"rpc/rpc.pb.go:713","msg":"Received AddNetwork for NS /proc/12533/ns/net, Sandbox 2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69, ifname eth0"}
{"level":"debug","ts":"2022-07-31T06:12:11.301Z","caller":"rpc/rpc.pb.go:713","msg":"AddNetworkRequest: K8S_POD_NAME:\"ad-geoloc-84c9dd9cc5-5s86q\" K8S_POD_NAMESPACE:\"prod-annonce\" K8S_POD_INFRA_CONTAINER_ID:\"2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69\" ContainerID:\"2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/proc/12533/ns/net\""}
{"level":"debug","ts":"2022-07-31T06:12:11.301Z","caller":"datastore/data_store.go:1396","msg":"Found a free IP not in DB - 10.109.167.233"}
{"level":"debug","ts":"2022-07-31T06:12:11.301Z","caller":"datastore/data_store.go:790","msg":"Returning Free IP 10.109.167.233"}
{"level":"debug","ts":"2022-07-31T06:12:11.301Z","caller":"datastore/data_store.go:713","msg":"New IP from CIDR pool- 10.109.167.233"}
{"level":"info","ts":"2022-07-31T06:12:11.301Z","caller":"datastore/data_store.go:817","msg":"AssignPodIPv4Address: Assign IP 10.109.167.233 to sandbox aws-cni/2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69/eth0"}

Restart of IPAMD ->

{"level":"info","ts":"2022-07-31T11:35:25.785Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}

Container 2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69 is missing during recovery ->

{"level":"info","ts":"2022-07-31T11:35:29.123Z","caller":"ipamd/ipamd.go:509","msg":"Reading ipam state from CRI"}
{"level":"debug","ts":"2022-07-31T11:35:29.123Z","caller":"datastore/data_store.go:389","msg":"Getting running pod sandboxes from \"unix:///var/run/dockershim.sock\""}
{"level":"debug","ts":"2022-07-31T11:35:29.128Z","caller":"datastore/data_store.go:389","msg":"Ignoring sandbox 9f2cd1aa9ce3aecbdf5028ce9bc92728eef0f304a4a49f7a25f5fb77e3f1462d with non-pod netns mode NODE"}
{"level":"debug","ts":"2022-07-31T11:35:29.129Z","caller":"datastore/data_store.go:389","msg":"Ignoring sandbox 53251f6714ee1fce8a0623625e8efb3bb44a24c74996fc15f0e303e861f5ca2c with non-pod netns mode NODE"}
{"level":"debug","ts":"2022-07-31T11:35:29.130Z","caller":"datastore/data_store.go:389","msg":"Ignoring sandbox 272a4c341017e88f48123834b81cafffb61d67bfa2e90af689283cc9ef05652c with non-pod netns mode NODE"}
{"level":"debug","ts":"2022-07-31T11:35:29.130Z","caller":"datastore/data_store.go:389","msg":"Ignoring sandbox 62eb1680e11e345c9c8f9fca2d72219a040d902ea75f73265936e9ca4ef008cb with non-pod netns mode NODE"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: 30a9bf31700eea808a0d1186d616d22da860fac60abb8f0eb635194de2ba0946"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: afe52528fb37be9d0776c44cc3d3023866e337bfc4ed4ac614ad9d0a9a3f1cc7"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: ce8b7f18b62b1317879b1c6a34789f14d9f9a59ca960f9a14d558344d0a6e405"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: 570d7d7a78a583303e4e0c4346de10b4972dca06ecdc203b2d32d5e101f50014"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: bbce29efa2b9a0bf7238828003b13dbd2bf019d9e2e99a4e75da3a6d9cb6a6d7"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: a3c5b33a1f742b935c84448c69c63527327a9e0ef95311b6fedae1b0400c658f"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: 39032e83a36e9214c77f668f92750efd1b977ecd015917b8d7d161c845336334"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: 1b5854f94f358b38f07b40be06d549707db5376f7d3b403558603572cffa27ac"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: a0f858f864d7fbb0464889b51bef0b9bd10cbd3b6fec7806149c66d4edc7a2cb"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: c5b8baf95e927ef636ea24501f932da9c114457d9bd3c13930d7d784d27d23db"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: 514b331fb6d9b54b20db37124a8d371a21deaffccc0350209673b9646b4ad217"}
{"level":"info","ts":"2022-07-31T11:35:29.172Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.163.138 to sandbox _migrated-from-cri/30a9bf31700eea808a0d1186d616d22da860fac60abb8f0eb635194de2ba0946/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/30a9bf31700eea808a0d1186d616d22da860fac60abb8f0eb635194de2ba0946/unknown => eni-0bb48f07f8e097136/10.109.163.138"}
{"level":"info","ts":"2022-07-31T11:35:29.173Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.162.56 to sandbox _migrated-from-cri/afe52528fb37be9d0776c44cc3d3023866e337bfc4ed4ac614ad9d0a9a3f1cc7/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.173Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/afe52528fb37be9d0776c44cc3d3023866e337bfc4ed4ac614ad9d0a9a3f1cc7/unknown => eni-0bb48f07f8e097136/10.109.162.56"}
{"level":"info","ts":"2022-07-31T11:35:29.173Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.161.223 to sandbox _migrated-from-cri/ce8b7f18b62b1317879b1c6a34789f14d9f9a59ca960f9a14d558344d0a6e405/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.173Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/ce8b7f18b62b1317879b1c6a34789f14d9f9a59ca960f9a14d558344d0a6e405/unknown => eni-0bb48f07f8e097136/10.109.161.223"}
{"level":"info","ts":"2022-07-31T11:35:29.173Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.166.12 to sandbox _migrated-from-cri/570d7d7a78a583303e4e0c4346de10b4972dca06ecdc203b2d32d5e101f50014/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.173Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/570d7d7a78a583303e4e0c4346de10b4972dca06ecdc203b2d32d5e101f50014/unknown => eni-0bb48f07f8e097136/10.109.166.12"}
{"level":"info","ts":"2022-07-31T11:35:29.173Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.166.45 to sandbox _migrated-from-cri/bbce29efa2b9a0bf7238828003b13dbd2bf019d9e2e99a4e75da3a6d9cb6a6d7/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.173Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/bbce29efa2b9a0bf7238828003b13dbd2bf019d9e2e99a4e75da3a6d9cb6a6d7/unknown => eni-0bb48f07f8e097136/10.109.166.45"}
{"level":"info","ts":"2022-07-31T11:35:29.173Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.160.14 to sandbox _migrated-from-cri/a3c5b33a1f742b935c84448c69c63527327a9e0ef95311b6fedae1b0400c658f/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.173Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/a3c5b33a1f742b935c84448c69c63527327a9e0ef95311b6fedae1b0400c658f/unknown => eni-0bb48f07f8e097136/10.109.160.14"}
{"level":"info","ts":"2022-07-31T11:35:29.173Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.166.243 to sandbox _migrated-from-cri/39032e83a36e9214c77f668f92750efd1b977ecd015917b8d7d161c845336334/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.173Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/39032e83a36e9214c77f668f92750efd1b977ecd015917b8d7d161c845336334/unknown => eni-0bb48f07f8e097136/10.109.166.243"}
{"level":"info","ts":"2022-07-31T11:35:29.174Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.164.251 to sandbox _migrated-from-cri/1b5854f94f358b38f07b40be06d549707db5376f7d3b403558603572cffa27ac/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.174Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/1b5854f94f358b38f07b40be06d549707db5376f7d3b403558603572cffa27ac/unknown => eni-0baf81407690ff8c1/10.109.164.251"}
{"level":"info","ts":"2022-07-31T11:35:29.174Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.165.194 to sandbox _migrated-from-cri/a0f858f864d7fbb0464889b51bef0b9bd10cbd3b6fec7806149c66d4edc7a2cb/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.174Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/a0f858f864d7fbb0464889b51bef0b9bd10cbd3b6fec7806149c66d4edc7a2cb/unknown => eni-0bb48f07f8e097136/10.109.165.194"}
{"level":"info","ts":"2022-07-31T11:35:29.174Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.167.108 to sandbox _migrated-from-cri/c5b8baf95e927ef636ea24501f932da9c114457d9bd3c13930d7d784d27d23db/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.174Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/c5b8baf95e927ef636ea24501f932da9c114457d9bd3c13930d7d784d27d23db/unknown => eni-0baf81407690ff8c1/10.109.167.108"}
{"level":"info","ts":"2022-07-31T11:35:29.174Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.164.54 to sandbox _migrated-from-cri/514b331fb6d9b54b20db37124a8d371a21deaffccc0350209673b9646b4ad217/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.174Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/514b331fb6d9b54b20db37124a8d371a21deaffccc0350209673b9646b4ad217/unknown => eni-0bb48f07f8e097136/10.109.164.54"}
{"level":"debug","ts":"2022-07-31T11:35:29.174Z","caller":"ipamd/ipamd.go:509","msg":"Completed ipam state recovery"}

Hence the delete failed, and 10.109.167.233 is free because we never recovered the container, but the rules would have remained on the host ->

{"level":"info","ts":"2022-07-31T11:36:10.189Z","caller":"rpc/rpc.pb.go:731","msg":"Received DelNetwork for Sandbox 2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69"}
{"level":"debug","ts":"2022-07-31T11:36:10.189Z","caller":"rpc/rpc.pb.go:731","msg":"DelNetworkRequest: K8S_POD_NAME:\"ad-geoloc-84c9dd9cc5-5s86q\" K8S_POD_NAMESPACE:\"prod-annonce\" K8S_POD_INFRA_CONTAINER_ID:\"2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69\" Reason:\"PodDeleted\" ContainerID:\"2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69\" IfName:\"eth0\" NetworkName:\"aws-cni\""}
{"level":"debug","ts":"2022-07-31T11:36:10.189Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: IP address pool stats: total:30, assigned 8, sandbox aws-cni/2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69/eth0"}
{"level":"debug","ts":"2022-07-31T11:36:10.189Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find IPAM entry under full key, trying CRI-migrated version"}
{"level":"warn","ts":"2022-07-31T11:36:10.189Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69/unknown"}

We should have cleaned up the old rules before adding new ones; I will check if that is in place.

@jayanthvn
Contributor

Ideally, when aws-node restarts, we will have to clean up the rules associated with free IPs if any exist.
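
As a hedged illustration of that reconciliation (the assignedIPs lookup and the function name are placeholders, not the real datastore API), a startup pass could walk the IPv4 policy rules and drop per-pod /32 rules whose IP is no longer assigned:

```go
package rules

import (
	"net"

	"github.com/vishvananda/netlink"
)

// cleanupLeakedRules is a sketch of the reconciliation idea above: after
// IPAMD finishes its state recovery, walk the IPv4 policy rules and delete
// any per-pod (/32) rule whose IP is no longer assigned in the datastore.
// assignedIPs stands in for whatever lookup the real datastore exposes; a
// real implementation would also restrict the scan to the VPC/pod CIDRs.
func cleanupLeakedRules(assignedIPs map[string]bool) error {
	ruleList, err := netlink.RuleList(netlink.FAMILY_V4)
	if err != nil {
		return err
	}
	for _, r := range ruleList {
		var ipNet *net.IPNet
		switch {
		case r.Src != nil:
			ipNet = r.Src
		case r.Dst != nil:
			ipNet = r.Dst
		default:
			continue // default "from all" rules carry neither src nor dst
		}
		if ones, bits := ipNet.Mask.Size(); ones != 32 || bits != 32 {
			continue // only per-pod /32 rules are candidates
		}
		if assignedIPs[ipNet.IP.String()] {
			continue // IP still assigned, keep its rules
		}
		rule := r
		if err := netlink.RuleDel(&rule); err != nil {
			return err
		}
	}
	return nil
}
```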

@cdemonchy
Author

We managed to mitigate the issue by updating the cluster autoscaler configuration to not evict daemonset pods. We haven't had leaked rules since the update.

aws-node is no longer evicted by the cluster autoscaler and doesn't restart. This confirms that the problem occurs when pods are deleted during a restart of aws-node.

Regards,

@jdn5126 jdn5126 self-assigned this Sep 27, 2022
@jdn5126
Contributor

jdn5126 commented Oct 5, 2022

@cdemonchy were you previously enabling daemonset eviction? It should be disabled by default

@cdemonchy
Author

Hi @jdn5126

Daemonset eviction was enabled automatically when we upgraded the cluster autoscaler from v1.20.2 to v1.21.2.

To solve our issue we upgraded to cluster autoscaler v1.21.3 and added the flag --daemonset-eviction-for-occupied-nodes=false.

@jdn5126
Contributor

jdn5126 commented Oct 6, 2022

Ah, thank you for pointing that out. I see that the autoscaler has indeed changed the default behavior for occupied nodes. aws-node is not a daemonset that is meant to be evicted, so we will discuss the best way to handle this.

@github-actions

github-actions bot commented Dec 6, 2022

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Dec 6, 2022
@jdn5126 jdn5126 removed the stale Issue or PR is stale label Dec 6, 2022
@jayanthvn
Contributor

/notstale

@jdn5126
Contributor

jdn5126 commented Jan 12, 2023

Closing as PR has merged

@jdn5126 jdn5126 closed this as completed Jan 12, 2023
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
