
ip rule leaked under some conditions #2048

Closed
cdemonchy opened this issue Aug 1, 2022 · 12 comments
@cdemonchy

Hi,

What happened:

Sometimes the ip rules associated with a pod are not deleted when the pod is deleted. The leaked rules are never cleaned up, but the IP is freed in the IPAM.

The IP can then be reallocated on another node in the cluster. When this happens, the pods on the node with the leaked rules can't reach the pod with the reallocated IP.

We hit this case when the cluster autoscaler fails to evict a node:

  • the cluster autoscaler wants to evict the node
  • pods on the node are stopped, calling the CNI plugin; the aws-node pod is stopped and restarts quickly
  • for some pods the CNI plugin can't reach ipamd before the container is deleted; in this case the rules are leaked
  • after 2 minutes, since it didn't manage to move all the pods quickly (due to a very conservative PDB), the eviction is stopped and the node comes back into the pool (see the cluster autoscaler FAQ)

Attach logs

The logs for a pod during the delete phase:

{"level":"debug","ts":"2022-07-31T11:35:11.678Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Prev Result: &{0.4.0 [{Name:eni054dbc5e65c Mac: Sandbox:} {Name:eth0 Mac: Sandbox:/proc/12533/ns/net}] [{Version:4 Interface:0xc000270280 Address:{IP:10.109.167.233 Mask:ffffffff} Gateway:<nil>}] [] {[]  [] []}}\n"}
{"level":"info","ts":"2022-07-31T11:35:11.678Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Received CNI del request: ContainerID(2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69) Netns(/proc/12533/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=prod-annonce;K8S_POD_NAME=XXX-84c9dd9cc5-5s86q;K8S_POD_INFRA_CONTAINER_ID=2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.4.0\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"podSGEnforcingMode\":\"strict\",\"prevResult\":{\"cniVersion\":\"0.4.0\",\"interfaces\":[{\"name\":\"eni054dbc5e65c\"},{\"name\":\"eth0\",\"sandbox\":\"/proc/12533/ns/net\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.109.167.233/32\"}],\"dns\":{}},\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"error","ts":"2022-07-31T11:35:11.678Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Error received from DelNetwork gRPC call for container 2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused\""}
{"level":"debug","ts":"2022-07-31T11:35:12.404Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Prev Result: &{0.4.0 [{Name:eni054dbc5e65c Mac: Sandbox:} {Name:eth0 Mac: Sandbox:/proc/12533/ns/net}] [{Version:4 Interface:0xc0002e0280 Address:{IP:10.109.167.233 Mask:ffffffff} Gateway:<nil>}] [] {[]  [] []}}\n"}
{"level":"info","ts":"2022-07-31T11:35:12.404Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Received CNI del request: ContainerID(2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=prod-annonce;K8S_POD_NAME=XXX-84c9dd9cc5-5s86q;K8S_POD_INFRA_CONTAINER_ID=2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.4.0\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"podSGEnforcingMode\":\"strict\",\"prevResult\":{\"cniVersion\":\"0.4.0\",\"interfaces\":[{\"name\":\"eni054dbc5e65c\"},{\"name\":\"eth0\",\"sandbox\":\"/proc/12533/ns/net\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.109.167.233/32\"}],\"dns\":{}},\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"error","ts":"2022-07-31T11:35:12.404Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Error received from DelNetwork gRPC call for container 2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused\""}
{"level":"debug","ts":"2022-07-31T11:35:13.476Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Prev Result: &{0.4.0 [{Name:eni054dbc5e65c Mac: Sandbox:} {Name:eth0 Mac: Sandbox:/proc/12533/ns/net}] [{Version:4 Interface:0xc00027a2e0 Address:{IP:10.109.167.233 Mask:ffffffff} Gateway:<nil>}] [] {[]  [] []}}\n"}
{"level":"info","ts":"2022-07-31T11:35:13.476Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Received CNI del request: ContainerID(2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=prod-annonce;K8S_POD_NAME=XXX-84c9dd9cc5-5s86q;K8S_POD_INFRA_CONTAINER_ID=2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.4.0\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"podSGEnforcingMode\":\"strict\",\"prevResult\":{\"cniVersion\":\"0.4.0\",\"interfaces\":[{\"name\":\"eni054dbc5e65c\"},{\"name\":\"eth0\",\"sandbox\":\"/proc/12533/ns/net\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.109.167.233/32\"}],\"dns\":{}},\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"error","ts":"2022-07-31T11:35:13.477Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Error received from DelNetwork gRPC call for container 2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused\""}
{"level":"debug","ts":"2022-07-31T11:36:10.189Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Prev Result: &{0.4.0 [{Name:eni054dbc5e65c Mac: Sandbox:} {Name:eth0 Mac: Sandbox:/proc/12533/ns/net}] [{Version:4 Interface:0xc0002a0a70 Address:{IP:10.109.167.233 Mask:ffffffff} Gateway:<nil>}] [] {[]  [] []}}\n"}
{"level":"info","ts":"2022-07-31T11:36:10.189Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Received CNI del request: ContainerID(2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=prod-annonce;K8S_POD_NAME=XXX-84c9dd9cc5-5s86q;K8S_POD_INFRA_CONTAINER_ID=2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.4.0\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"podSGEnforcingMode\":\"strict\",\"prevResult\":{\"cniVersion\":\"0.4.0\",\"interfaces\":[{\"name\":\"eni054dbc5e65c\"},{\"name\":\"eth0\",\"sandbox\":\"/proc/12533/ns/net\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.109.167.233/32\"}],\"dns\":{}},\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"info","ts":"2022-07-31T11:36:10.190Z","caller":"routed-eni-cni-plugin/cni.go:303","msg":"Container 2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69 not found"}

We can see three attempts that fail with a connection refused when connecting to ipamd. The final attempt fails because the container has already been removed.

When looking at the rule tables, the rules associated with the IP are still present.

# ip rule | grep 10.109.167.233
from all to 10.109.167.233 lookup main 
from 10.109.167.233 lookup 2

I'll send the complete support tarball by mail.

What you expected to happen:
We expect that the rules are removed when a container is stopped.

Anything else we need to know?:

Looking at the logic of the del function in cni.go:

In our case, during the first call the function tryDelWithPrevResult didn't do anything, as the code seems to only handle pods with a security group attached. We're not attaching security groups to pods; we're using the "individual IP addresses assigned to network interface" mode.

During the last call, the call to ipamd seems to succeed, but the cleanup in the del function is not executed (cni.go#L405) because we hit a container-not-found error before reaching it (cni.go#L364).

Maybe it's possible to do the cleanup in tryDelWithPrevResult, like it is done for branch ENIs?
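
To make the suggestion concrete, here is a minimal, hypothetical sketch (not the plugin's actual code) of what cleaning up the two per-pod rules from the prevResult IP could look like with the vishvananda/netlink library, assuming the rule layout shown in the ip rule output above:

```go
// Hypothetical sketch only: delete the two host-side rules the plugin
// programs for a pod IP, using the IP recovered from prevResult.
// The rule shapes are assumed from the `ip rule` output above:
//   from all to <podIP> lookup main   (to-pod rule)
//   from <podIP> lookup <table>       (from-pod rule)
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

func cleanupPodRules(podIP net.IP, fromPodTable int) {
	podCIDR := &net.IPNet{IP: podIP, Mask: net.CIDRMask(32, 32)}

	// "from all to <podIP> lookup main"
	toPod := netlink.NewRule()
	toPod.Dst = podCIDR
	toPod.Table = unix.RT_TABLE_MAIN
	if err := netlink.RuleDel(toPod); err != nil {
		// A real implementation would ignore "no such rule" errors.
		log.Printf("failed to delete to-pod rule for %s: %v", podIP, err)
	}

	// "from <podIP> lookup <table>"
	fromPod := netlink.NewRule()
	fromPod.Src = podCIDR
	fromPod.Table = fromPodTable
	if err := netlink.RuleDel(fromPod); err != nil {
		log.Printf("failed to delete from-pod rule for %s: %v", podIP, err)
	}
}

func main() {
	// Values taken from the logs above: leaked IP 10.109.167.233, table 2.
	cleanupPodRules(net.ParseIP("10.109.167.233"), 2)
}
```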

Environment:

  • Kubernetes version (use kubectl version): 1.21
  • CNI version: v1.11.2
@cdemonchy cdemonchy added the bug label Aug 1, 2022
@jayanthvn
Contributor

Hi @cdemonchy, we will land here (cni.go#L364) because of

if strings.Contains(err.Error(), datastore.ErrUnknownPod.Error()) {
which happens because the IPAMD state file is missing the container. One possible cause: when IPAMD started during the last retry, the container was already deleted, so IPAMD marked it as stale, never added it to the state file, and the IP is free. Now when the CNI tries to reach IPAMD, the container is not found in the state file, so an unknown-pod error is returned. I can confirm this once we have the logs. One solution we could implement in the plugin: when there is a new add, delete any stale rule that is present and then program the new rule. We do something similar to clean rules for SGPP pods.
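
A rough sketch of that idea (the function name and structure are illustrative, not the plugin's actual implementation): before programming the rules for a newly assigned pod IP, scan the existing IPv4 policy rules and delete any stale ones that still reference that IP.

```go
package rules

import (
	"net"

	"github.com/vishvananda/netlink"
)

// deleteStaleRulesForIP is an illustrative sketch, not the plugin's code:
// before adding rules for a freshly assigned pod IP, drop any leftover
// policy rules that still reference it as source or destination.
func deleteStaleRulesForIP(podIP net.IP) error {
	ruleList, err := netlink.RuleList(netlink.FAMILY_V4)
	if err != nil {
		return err
	}
	for _, r := range ruleList {
		staleDst := r.Dst != nil && r.Dst.IP.Equal(podIP)
		staleSrc := r.Src != nil && r.Src.IP.Equal(podIP)
		if !staleDst && !staleSrc {
			continue
		}
		rule := r // RuleDel takes a pointer; copy the loop variable
		if err := netlink.RuleDel(&rule); err != nil {
			return err
		}
	}
	return nil
}
```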

@cdemonchy
Author

I forgot to send the logs; I've just sent them.

@jayanthvn
Contributor

Hi @cdemonchy

Based on the logs, the CRI is missing the container, hence the IP was never recovered.

Here is a snapshot of the events:

IPAMD started ->

{"level":"info","ts":"2022-07-31T05:05:07.289Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}

Pod created ->

{"level":"info","ts":"2022-07-31T06:12:11.301Z","caller":"rpc/rpc.pb.go:713","msg":"Received AddNetwork for NS /proc/12533/ns/net, Sandbox 2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69, ifname eth0"}
{"level":"debug","ts":"2022-07-31T06:12:11.301Z","caller":"rpc/rpc.pb.go:713","msg":"AddNetworkRequest: K8S_POD_NAME:\"ad-geoloc-84c9dd9cc5-5s86q\" K8S_POD_NAMESPACE:\"prod-annonce\" K8S_POD_INFRA_CONTAINER_ID:\"2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69\" ContainerID:\"2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/proc/12533/ns/net\""}
{"level":"debug","ts":"2022-07-31T06:12:11.301Z","caller":"datastore/data_store.go:1396","msg":"Found a free IP not in DB - 10.109.167.233"}
{"level":"debug","ts":"2022-07-31T06:12:11.301Z","caller":"datastore/data_store.go:790","msg":"Returning Free IP 10.109.167.233"}
{"level":"debug","ts":"2022-07-31T06:12:11.301Z","caller":"datastore/data_store.go:713","msg":"New IP from CIDR pool- 10.109.167.233"}
{"level":"info","ts":"2022-07-31T06:12:11.301Z","caller":"datastore/data_store.go:817","msg":"AssignPodIPv4Address: Assign IP 10.109.167.233 to sandbox aws-cni/2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69/eth0"}

Restart of IPAMD ->

{"level":"info","ts":"2022-07-31T11:35:25.785Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}

Container 2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69 is missing during recovery ->

{"level":"info","ts":"2022-07-31T11:35:29.123Z","caller":"ipamd/ipamd.go:509","msg":"Reading ipam state from CRI"}
{"level":"debug","ts":"2022-07-31T11:35:29.123Z","caller":"datastore/data_store.go:389","msg":"Getting running pod sandboxes from \"unix:///var/run/dockershim.sock\""}
{"level":"debug","ts":"2022-07-31T11:35:29.128Z","caller":"datastore/data_store.go:389","msg":"Ignoring sandbox 9f2cd1aa9ce3aecbdf5028ce9bc92728eef0f304a4a49f7a25f5fb77e3f1462d with non-pod netns mode NODE"}
{"level":"debug","ts":"2022-07-31T11:35:29.129Z","caller":"datastore/data_store.go:389","msg":"Ignoring sandbox 53251f6714ee1fce8a0623625e8efb3bb44a24c74996fc15f0e303e861f5ca2c with non-pod netns mode NODE"}
{"level":"debug","ts":"2022-07-31T11:35:29.130Z","caller":"datastore/data_store.go:389","msg":"Ignoring sandbox 272a4c341017e88f48123834b81cafffb61d67bfa2e90af689283cc9ef05652c with non-pod netns mode NODE"}
{"level":"debug","ts":"2022-07-31T11:35:29.130Z","caller":"datastore/data_store.go:389","msg":"Ignoring sandbox 62eb1680e11e345c9c8f9fca2d72219a040d902ea75f73265936e9ca4ef008cb with non-pod netns mode NODE"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: 30a9bf31700eea808a0d1186d616d22da860fac60abb8f0eb635194de2ba0946"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: afe52528fb37be9d0776c44cc3d3023866e337bfc4ed4ac614ad9d0a9a3f1cc7"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: ce8b7f18b62b1317879b1c6a34789f14d9f9a59ca960f9a14d558344d0a6e405"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: 570d7d7a78a583303e4e0c4346de10b4972dca06ecdc203b2d32d5e101f50014"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: bbce29efa2b9a0bf7238828003b13dbd2bf019d9e2e99a4e75da3a6d9cb6a6d7"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: a3c5b33a1f742b935c84448c69c63527327a9e0ef95311b6fedae1b0400c658f"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: 39032e83a36e9214c77f668f92750efd1b977ecd015917b8d7d161c845336334"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: 1b5854f94f358b38f07b40be06d549707db5376f7d3b403558603572cffa27ac"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: a0f858f864d7fbb0464889b51bef0b9bd10cbd3b6fec7806149c66d4edc7a2cb"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: c5b8baf95e927ef636ea24501f932da9c114457d9bd3c13930d7d784d27d23db"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Adding container ID: 514b331fb6d9b54b20db37124a8d371a21deaffccc0350209673b9646b4ad217"}
{"level":"info","ts":"2022-07-31T11:35:29.172Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.163.138 to sandbox _migrated-from-cri/30a9bf31700eea808a0d1186d616d22da860fac60abb8f0eb635194de2ba0946/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.172Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/30a9bf31700eea808a0d1186d616d22da860fac60abb8f0eb635194de2ba0946/unknown => eni-0bb48f07f8e097136/10.109.163.138"}
{"level":"info","ts":"2022-07-31T11:35:29.173Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.162.56 to sandbox _migrated-from-cri/afe52528fb37be9d0776c44cc3d3023866e337bfc4ed4ac614ad9d0a9a3f1cc7/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.173Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/afe52528fb37be9d0776c44cc3d3023866e337bfc4ed4ac614ad9d0a9a3f1cc7/unknown => eni-0bb48f07f8e097136/10.109.162.56"}
{"level":"info","ts":"2022-07-31T11:35:29.173Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.161.223 to sandbox _migrated-from-cri/ce8b7f18b62b1317879b1c6a34789f14d9f9a59ca960f9a14d558344d0a6e405/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.173Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/ce8b7f18b62b1317879b1c6a34789f14d9f9a59ca960f9a14d558344d0a6e405/unknown => eni-0bb48f07f8e097136/10.109.161.223"}
{"level":"info","ts":"2022-07-31T11:35:29.173Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.166.12 to sandbox _migrated-from-cri/570d7d7a78a583303e4e0c4346de10b4972dca06ecdc203b2d32d5e101f50014/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.173Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/570d7d7a78a583303e4e0c4346de10b4972dca06ecdc203b2d32d5e101f50014/unknown => eni-0bb48f07f8e097136/10.109.166.12"}
{"level":"info","ts":"2022-07-31T11:35:29.173Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.166.45 to sandbox _migrated-from-cri/bbce29efa2b9a0bf7238828003b13dbd2bf019d9e2e99a4e75da3a6d9cb6a6d7/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.173Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/bbce29efa2b9a0bf7238828003b13dbd2bf019d9e2e99a4e75da3a6d9cb6a6d7/unknown => eni-0bb48f07f8e097136/10.109.166.45"}
{"level":"info","ts":"2022-07-31T11:35:29.173Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.160.14 to sandbox _migrated-from-cri/a3c5b33a1f742b935c84448c69c63527327a9e0ef95311b6fedae1b0400c658f/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.173Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/a3c5b33a1f742b935c84448c69c63527327a9e0ef95311b6fedae1b0400c658f/unknown => eni-0bb48f07f8e097136/10.109.160.14"}
{"level":"info","ts":"2022-07-31T11:35:29.173Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.166.243 to sandbox _migrated-from-cri/39032e83a36e9214c77f668f92750efd1b977ecd015917b8d7d161c845336334/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.173Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/39032e83a36e9214c77f668f92750efd1b977ecd015917b8d7d161c845336334/unknown => eni-0bb48f07f8e097136/10.109.166.243"}
{"level":"info","ts":"2022-07-31T11:35:29.174Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.164.251 to sandbox _migrated-from-cri/1b5854f94f358b38f07b40be06d549707db5376f7d3b403558603572cffa27ac/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.174Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/1b5854f94f358b38f07b40be06d549707db5376f7d3b403558603572cffa27ac/unknown => eni-0baf81407690ff8c1/10.109.164.251"}
{"level":"info","ts":"2022-07-31T11:35:29.174Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.165.194 to sandbox _migrated-from-cri/a0f858f864d7fbb0464889b51bef0b9bd10cbd3b6fec7806149c66d4edc7a2cb/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.174Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/a0f858f864d7fbb0464889b51bef0b9bd10cbd3b6fec7806149c66d4edc7a2cb/unknown => eni-0bb48f07f8e097136/10.109.165.194"}
{"level":"info","ts":"2022-07-31T11:35:29.174Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.167.108 to sandbox _migrated-from-cri/c5b8baf95e927ef636ea24501f932da9c114457d9bd3c13930d7d784d27d23db/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.174Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/c5b8baf95e927ef636ea24501f932da9c114457d9bd3c13930d7d784d27d23db/unknown => eni-0baf81407690ff8c1/10.109.167.108"}
{"level":"info","ts":"2022-07-31T11:35:29.174Z","caller":"datastore/data_store.go:483","msg":"AssignPodIPv4Address: Assign IP 10.109.164.54 to sandbox _migrated-from-cri/514b331fb6d9b54b20db37124a8d371a21deaffccc0350209673b9646b4ad217/unknown"}
{"level":"debug","ts":"2022-07-31T11:35:29.174Z","caller":"ipamd/ipamd.go:509","msg":"Recovered _migrated-from-cri/514b331fb6d9b54b20db37124a8d371a21deaffccc0350209673b9646b4ad217/unknown => eni-0bb48f07f8e097136/10.109.164.54"}
{"level":"debug","ts":"2022-07-31T11:35:29.174Z","caller":"ipamd/ipamd.go:509","msg":"Completed ipam state recovery"}

Hence the delete failed, and 10.109.167.233 is free because we never recovered the container, but the rules would have remained on the host ->

{"level":"info","ts":"2022-07-31T11:36:10.189Z","caller":"rpc/rpc.pb.go:731","msg":"Received DelNetwork for Sandbox 2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69"}
{"level":"debug","ts":"2022-07-31T11:36:10.189Z","caller":"rpc/rpc.pb.go:731","msg":"DelNetworkRequest: K8S_POD_NAME:\"ad-geoloc-84c9dd9cc5-5s86q\" K8S_POD_NAMESPACE:\"prod-annonce\" K8S_POD_INFRA_CONTAINER_ID:\"2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69\" Reason:\"PodDeleted\" ContainerID:\"2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69\" IfName:\"eth0\" NetworkName:\"aws-cni\""}
{"level":"debug","ts":"2022-07-31T11:36:10.189Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: IP address pool stats: total:30, assigned 8, sandbox aws-cni/2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69/eth0"}
{"level":"debug","ts":"2022-07-31T11:36:10.189Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find IPAM entry under full key, trying CRI-migrated version"}
{"level":"warn","ts":"2022-07-31T11:36:10.189Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/2fffde85e1f7d2009a5a7b6bf2a8ea011a26aced26faf2ceea4123d14b655f69/unknown"}

We should have cleaned up the old rules before adding new ones; I will check if that is in place.

@jayanthvn
Contributor

Ideally, when aws-node restarts, we will have to clean up the rules associated with free IPs if any exist.
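
As a hedged illustration of that reconciliation (the assignedIPs lookup and the function name are placeholders, not the real datastore API), a startup pass could walk the IPv4 policy rules and drop per-pod /32 rules whose IP is no longer assigned:

```go
package rules

import (
	"net"

	"github.com/vishvananda/netlink"
)

// cleanupLeakedRules is a sketch of the reconciliation idea above: after
// IPAMD finishes its state recovery, walk the IPv4 policy rules and delete
// any per-pod (/32) rule whose IP is no longer assigned in the datastore.
// assignedIPs stands in for whatever lookup the real datastore exposes; a
// real implementation would also restrict the scan to the VPC/pod CIDRs.
func cleanupLeakedRules(assignedIPs map[string]bool) error {
	ruleList, err := netlink.RuleList(netlink.FAMILY_V4)
	if err != nil {
		return err
	}
	for _, r := range ruleList {
		var ipNet *net.IPNet
		switch {
		case r.Src != nil:
			ipNet = r.Src
		case r.Dst != nil:
			ipNet = r.Dst
		default:
			continue // default "from all" rules carry neither src nor dst
		}
		if ones, bits := ipNet.Mask.Size(); ones != 32 || bits != 32 {
			continue // only per-pod /32 rules are candidates
		}
		if assignedIPs[ipNet.IP.String()] {
			continue // IP still assigned, keep its rules
		}
		rule := r
		if err := netlink.RuleDel(&rule); err != nil {
			return err
		}
	}
	return nil
}
```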

@cdemonchy
Author

We managed to mitigate the issue by updating the cluster autoscaler configuration to not evict daemonset pods. We haven't had leaked rules since the update.

aws-node is no longer evicted by the cluster autoscaler and doesn't restart. This confirms that the problem occurs when pods are deleted during a restart of aws-node.

Regards,

@jdn5126 jdn5126 self-assigned this Sep 27, 2022
@jdn5126
Contributor

jdn5126 commented Oct 5, 2022

@cdemonchy were you previously enabling daemonset eviction? It should be disabled by default

@cdemonchy
Author

Hi @jdn5126

Daemonset eviction was enabled automatically when we upgraded the cluster autoscaler from v1.20.2 to v1.21.2.

To solve our issue we upgraded to cluster autoscaler v1.21.3 and added the flag --daemonset-eviction-for-occupied-nodes=false.

@jdn5126
Contributor

jdn5126 commented Oct 6, 2022

Ah, thank you for pointing that out. I see that the autoscaler has indeed changed the default behavior for occupied nodes. aws-node is not a daemonset that is meant to be evicted, so we will discuss the best way to handle this.

@github-actions

github-actions bot commented Dec 6, 2022

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Dec 6, 2022
@jdn5126 jdn5126 removed the stale Issue or PR is stale label Dec 6, 2022
@jayanthvn
Contributor

/notstale

@jdn5126
Contributor

jdn5126 commented Jan 12, 2023

Closing as PR has merged

@jdn5126 jdn5126 closed this as completed Jan 12, 2023
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
