[CNI]: Teardown pod networking resources without IPAMD when possible #2125

Closed

Conversation

@jdn5126 (Contributor) commented Oct 28, 2022

What type of PR is this?
bug

Which issue does this PR fix:
#2048

What does this PR do / Why do we need it:
This PR resolves an issue in which IP rules were leaked by the CNI. When processing a pod deletion, the CNI would wait for an IPAMD response before tearing down pod networking resources. If IPAMD could not be reached, the CNI would return an error and wait for kubelet to retry the delete. If IPAMD was then restarted, the state for this pod would be cleared without the CNI tearing down the associated networking resources. The trigger for the linked issue was the k8s cluster autoscaler evicting the aws-node daemonset pod before other pods and then later cancelling the pod evictions. kubernetes/autoscaler#5240 was filed to ask the k8s cluster autoscaler to change this behavior.

With the changes in this PR, we now store state for cleanup in PrevResult for all pods, whether they use branch ENIs (SGPP mode) or not. During cleanup, we tear down networking resources whenever possible. If the CNI cannot reach IPAMD, we still return an error in non-SGPP mode and expect kubelet to retry the delete, so the delete is purely opportunistic in non-SGPP mode. It will only occur for pods created with a CNI version that includes this change. This was done to avoid backward compatibility issues and preserve existing semantics, while still making an improvement.

Note that this also makes the CNI less dependent on IPAMD. The only behavior change to call out is that tryDelWithPrevResult now returns a skipIpamd boolean to indicate whether the IPAMD request can definitely be skipped.
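
To make the new contract concrete, here is a minimal, self-contained sketch of the shape tryDelWithPrevResult takes on. The types and helpers are illustrative stand-ins, not the plugin's actual code; the real call site appears in the review excerpts below.

package main

import "fmt"

// prevResult is a stand-in for the per-pod state stored at ADD time.
type prevResult struct {
	podVlanID int // non-zero for SGPP (branch ENI) pods
}

// tryDelWithPrevResult models the new return shape: whether the teardown was
// handled from stored state, and whether the IPAMD call can definitely be skipped.
func tryDelWithPrevResult(pr *prevResult, teardown func() error) (handled, skipIpamd bool, err error) {
	if pr == nil {
		// Pod was added by an older CNI version: no stored state, defer to IPAMD.
		return false, false, nil
	}
	branchEni := pr.podVlanID != 0
	if err := teardown(); err != nil {
		// Only branch ENI pods may definitely skip IPAMD; others fall back to it.
		return branchEni, branchEni, err
	}
	return true, branchEni, nil
}

func main() {
	handled, skipIpamd, err := tryDelWithPrevResult(&prevResult{podVlanID: 7}, func() error { return nil })
	fmt.Println(handled, skipIpamd, err) // true true <nil>: IPAMD can be skipped
}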

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
N/A

Testing done on this change:
Added more test cases to cni_test.go and verified that all CNI and IPAMD integration tests pass with this change. Also manually verified the fix with IPv4 and IPv6 clusters.

Automation added to e2e:
N/A

Will this PR introduce any new dependencies?:
No

Will this break upgrades or downgrades. Has updating a running cluster been tested?:
Upgrades and downgrades will not be broken. A running cluster has been tested.

Does this change require updates to the CNI daemonset config files to work?:
No

Does this PR introduce any user-facing change?:
Yes

Clean up pod networking resources without IPAMD when possible.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jdn5126 requested a review from a team as a code owner on October 28, 2022 20:55
handled, skipIpamd, err := tryDelWithPrevResult(driverClient, conf, k8sArgs, args.IfName, args.Netns, log)
// On error, return immediately if pod was created with branch ENI. Otherwise, let IPAMD handle cleanup. Note that we cannot
// check conf.PodSGEnforcingMode as the pod we are deleting may have been added before the mode switch.
if err != nil && skipIpamd {

Contributor

What if err != nil and !skipIpamd? It seems the error is ignored.

Contributor Author

The error still gets printed in tryDelWithPrevResult, but this is a case where we would still want to go through IPAMD for backwards compatibility. If we cannot delete from PrevResult and this pod was not created with a branch ENI (or we are not sure), then we give IPAMD a chance to do the delete.

In theory, this should never happen, but I figured we wanted to preserve backwards compatibility.

Contributor Author

The error is effectively ignored, as we are assuming that IPAMD would return the same error if there is a real error.

}

if err != nil {
	return true, branchEni, err

Contributor

Since we return true here, does it mean we'll call IPAMD to release the IP but skip TeardownPodNetwork, and the CNI won't retry because we returned nil? It seems we could leak some IP rules if TeardownPodNetwork failed for some reason.

if handled {
	return nil
}

Contributor Author

The TeardownPodNetwork call that we make based on the IPAMD response will be the same as the call we made here, but it does seem safer to try again. I will change this to return branchEni, branchEni, err.
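
Concretely, the change described here, sketched in diff form against the line under review at the top of this thread:

-	return true, branchEni, err
+	return branchEni, branchEni, err

With this, a non-branch pod whose PrevResult teardown fails is not reported as handled, so the flow falls through to IPAMD and TeardownPodNetwork is retried rather than skipped.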

 	return errors.Wrap(err, "del cmd: failed to delete with prevResult")
 }
 if handled {
 	log.Infof("Handled CNI del request with prevResult: ContainerID(%s) Netns(%s) IfName(%s) PodNamespace(%s) PodName(%s)",
 		args.ContainerID, args.Netns, args.IfName, string(k8sArgs.K8S_POD_NAMESPACE), string(k8sArgs.K8S_POD_NAME))
-	return nil
+	// When deleting a pod with a branch ENI configured, IPAMD does not need to be notified, so we can return early.
+	if skipIpamd {

Contributor

If handled, we used to return nil. Why the additional check?

Contributor Author

We still need to call into IPAMD to release the IP allocation in the non-branch ENI case.

Contributor

I feel we just need to keep either handled or skipIpamd; returning two conditions from tryDelWithPrevResult can be changed.

if branchEni {
	if isNetnsEmpty(netNS) {
		log.Infof("Ignoring TeardownPodENI as Netns is empty for SG pod: %s namespace: %s containerID: %s", k8sArgs.K8S_POD_NAME, k8sArgs.K8S_POD_NAMESPACE, k8sArgs.K8S_POD_INFRA_CONTAINER_ID)
		return true, branchEni, nil

Contributor

If netns is already empty for a branch ENI pod, why don't we let IPAMD get the VLAN information from the API server using the old flow (if the pod is still lingering around in etcd)? If the current flow can get that for us, we can use it to clear the SGPP rules (if any). This will be a corner case, and deferring to IPAMD to get the VLAN info might not be a bad idea for such cases...

Contributor Author

Sure, I was only carrying this check over to preserve the existing behavior. Going through IPAMD when netns is already empty makes sense to me.

@achevuru (Contributor) commented Nov 11, 2022

General comment: We're changing the delete flow, i.e., tearing down the pod network first before deciding whether to call IPAMD to release the IP address from the assigned list. The current flow is the recommended sequence, so I believe we should stick to that and fall back to PrevResult for non-branch ENI pods only if the IPAMD call fails. It makes sense to do this for branch ENI pods, as IPAMD is not the IP address manager for them anyway, so there is not much value in invoking IPAMD. By sticking to the current flow sequence for non-branch ENI pods, we can also do away with the skipIpamd check, as this will purely be a fallback mechanism for some edge scenarios...

@jayanthvn (Contributor)

Agreed with what Apurup mentioned; we should keep the current flow and have a fallback for when the IPAMD call fails. Returning two conditions from tryDelWithPrevResult would then not be required.

@jdn5126 (Contributor Author) commented Nov 12, 2022

> General comment: We're changing the delete flow, i.e., tearing down the pod network first before deciding whether to call IPAMD to release the IP address from the assigned list. The current flow is the recommended sequence, so I believe we should stick to that and fall back to PrevResult for non-branch ENI pods only if the IPAMD call fails. It makes sense to do this for branch ENI pods, as IPAMD is not the IP address manager for them anyway, so there is not much value in invoking IPAMD. By sticking to the current flow sequence for non-branch ENI pods, we can also do away with the skipIpamd check, as this will purely be a fallback mechanism for some edge scenarios...

I did consider that, but that would involve falling back to PrevResult in three cases:

  1. IPAMD is unreachable
  2. IPAMD returns error
  3. IPAMD returns failure (!success)

I am fine with that approach; what do you think?
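
For concreteness, a sketch of the fallback those three cases would imply; grpcClient, ipamdAddress, client, and teardownWithPrevResult are illustrative stand-ins rather than the plugin's exact names:

// Sketch: all three cases funnel into the same PrevResult-based teardown.
conn, err := grpcClient.Dial(ipamdAddress)
if err != nil {
	return teardownWithPrevResult() // 1. IPAMD is unreachable
}
defer conn.Close()
reply, err := client.DelNetwork(ctx, req)
if err != nil {
	return teardownWithPrevResult() // 2. IPAMD returns an error
}
if !reply.Success {
	return teardownWithPrevResult() // 3. IPAMD returns failure (!success)
}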

@achevuru (Contributor)

But is there a need to treat them all differently? In all of the above scenarios, we clean up what we can based on the PrevResult, similar to what you're trying to do in the PR before reaching out to IPAMD. If the IP is missing in the state file, IPAMD anyway gives us a relevant error message, and we return success in that scenario back to kubelet; for any other error scenario, we just let kubelet retry.

@jdn5126 (Contributor Author) commented Nov 15, 2022

I thought about this, and I think the cleanest solution is to only call TeardownPodNetwork using PrevResult in the case where we cannot establish a connection with IPAMD. This means tryDelWithPrevResult would not change, other than no longer returning an error when podVlanId == 0.

All IPAMD error conditions will cause kubelet to retry, but it is only in the case where IPAMD is truly not running that the delete is critical to prevent stale rules. I will make these changes.
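
A minimal sketch of that final flow; names and signatures are simplified stand-ins for the actual plugin code:

// Sketch: PrevResult teardown runs only when IPAMD cannot be reached at all;
// every other IPAMD error still propagates so that kubelet retries the delete.
conn, dialErr := grpcClient.Dial(ipamdAddress)
if dialErr != nil {
	// IPAMD is truly not running (e.g., aws-node was evicted): this is the one
	// case where cleaning up from PrevResult is critical, since an IPAMD
	// restart would clear the pod's state without removing its IP rules.
	if handled := tryDelWithPrevResult(driverClient, conf, k8sArgs, args.IfName, args.Netns, log); handled {
		return nil
	}
	return errors.Wrap(dialErr, "del cmd: failed to connect to IPAMD")
}
defer conn.Close()
// ...the normal IPAMD-driven delete continues here...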
