[CNI]: Teardown pod networking resources without IPAMD when possible #2125

Closed

Conversation

@jdn5126 (Contributor) commented Oct 28, 2022

What type of PR is this?
bug

Which issue does this PR fix:
#2048

What does this PR do / Why do we need it:
This PR resolves an issue in which IP rules were leaked by the CNI. When processing a pod deletion, the CNI would wait for an IPAMD response before tearing down pod networking resources. If IPAMD could not be reached, the CNI would return an error and wait for kubelet to retry the delete. If IPAMD was then restarted, the state for this pod would be cleared without the CNI tearing down the associated networking resources. The trigger for the linked issue was the k8s cluster autoscaler evicting the aws-node daemonset pod before other pods and then later cancelling the pod evictions. kubernetes/autoscaler#5240 was filed to ask the k8s cluster autoscaler to change this behavior.

With the changes in this PR, we now store state for cleanup in PrevResult for all pods, whether they use branch ENIs (SGPP mode) or not. During cleanup, we tear down networking resources whenever possible. If the CNI cannot reach IPAMD, we still return an error in non-SGPP mode and expect kubelet to retry the delete, so the delete is purely opportunistic in non-SGPP mode. It will only occur for pods created with a CNI version that includes this change. This was done to avoid backward compatibility issues and preserve existing semantics, while still making an improvement.

Note that this also makes the CNI less dependent on IPAMD. The only behavior change to call out is that tryDelWithPrevResult now returns a skipIpamd boolean to indicate whether the IPAMD request can definitely be skipped.
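
To make the new contract concrete, here is a minimal, self-contained sketch of the shape tryDelWithPrevResult takes on. The types and helpers are illustrative stand-ins, not the plugin's actual code; the real call site appears in the review excerpts below.

package main

import "fmt"

// prevResult is a stand-in for the per-pod state stored at ADD time.
type prevResult struct {
	podVlanID int // non-zero for SGPP (branch ENI) pods
}

// tryDelWithPrevResult models the new return shape: whether the teardown was
// handled from stored state, and whether the IPAMD call can definitely be skipped.
func tryDelWithPrevResult(pr *prevResult, teardown func() error) (handled, skipIpamd bool, err error) {
	if pr == nil {
		// Pod was added by an older CNI version: no stored state, defer to IPAMD.
		return false, false, nil
	}
	branchEni := pr.podVlanID != 0
	if err := teardown(); err != nil {
		// Only branch ENI pods may definitely skip IPAMD; others fall back to it.
		return branchEni, branchEni, err
	}
	return true, branchEni, nil
}

func main() {
	handled, skipIpamd, err := tryDelWithPrevResult(&prevResult{podVlanID: 7}, func() error { return nil })
	fmt.Println(handled, skipIpamd, err) // true true <nil>: IPAMD can be skipped
}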

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
N/A

Testing done on this change:
Added more test cases to cni_test.go and verified that all CNI and IPAMD integration tests pass with this change. Also manually verified the fix with IPv4 and IPv6 clusters.

Automation added to e2e:
N/A

Will this PR introduce any new dependencies?:
No

Will this break upgrades or downgrades. Has updating a running cluster been tested?:
Upgrades and downgrades will not be broken. A running cluster has been tested.

Does this change require updates to the CNI daemonset config files to work?:
No

Does this PR introduce any user-facing change?:
Yes

Clean up pod networking resources without IPAMD when possible.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jdn5126 requested a review from a team as a code owner on October 28, 2022 20:55
handled, skipIpamd, err := tryDelWithPrevResult(driverClient, conf, k8sArgs, args.IfName, args.Netns, log)
// On error, return immediately if pod was created with branch ENI. Otherwise, let IPAMD handle cleanup. Note that we cannot
// check conf.PodSGEnforcingMode as the pod we are deleting may have been added before the mode switch.
if err != nil && skipIpamd {

Contributor

What if err != nil and !skipIpamd? It seems the error is ignored.

Contributor Author

The error still gets printed in tryDelWithPrevResult, but this is a case where we would still want to go through IPAMD for backwards compatibility. If we cannot delete from PrevResult and this pod was not created with a branch ENI (or we are not sure), then we give IPAMD a chance to do the delete.

In theory, this should never happen, but I figured we wanted to preserve backwards compatibility.

Contributor Author

The error is effectively ignored, as we are assuming that IPAMD would return the same error if there is a real error.

}

if err != nil {
	return true, branchEni, err

Contributor

Since we return true here, does it mean we'll call IPAMD to release the IP but skip TeardownPodNetwork, and the CNI won't retry because we returned nil? It seems we could leak some IP rules if TeardownPodNetwork failed for some reason.

if handled {
	return nil
}

Contributor Author

The TeardownPodNetwork call that we make based on the IPAMD response will be the same as the call we made here, but it does seem safer to try again. I will change this to return branchEni, branchEni, err.
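
Concretely, the change described here, sketched in diff form against the line under review at the top of this thread:

-	return true, branchEni, err
+	return branchEni, branchEni, err

With this, a non-branch pod whose PrevResult teardown fails is not reported as handled, so the flow falls through to IPAMD and TeardownPodNetwork is retried rather than skipped.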

 	return errors.Wrap(err, "del cmd: failed to delete with prevResult")
 }
 if handled {
 	log.Infof("Handled CNI del request with prevResult: ContainerID(%s) Netns(%s) IfName(%s) PodNamespace(%s) PodName(%s)",
 		args.ContainerID, args.Netns, args.IfName, string(k8sArgs.K8S_POD_NAMESPACE), string(k8sArgs.K8S_POD_NAME))
-	return nil
+	// When deleting a pod with a branch ENI configured, IPAMD does not need to be notified, so we can return early.
+	if skipIpamd {

Contributor

If handled, we used to return nil. Why the additional check?

Contributor Author

We still need to call into IPAMD to release the IP allocation in the non-branch ENI case.

Contributor

I feel we just need to keep either handled or skipIpamd; returning two conditions from tryDelWithPrevResult can be changed.

if branchEni {
	if isNetnsEmpty(netNS) {
		log.Infof("Ignoring TeardownPodENI as Netns is empty for SG pod: %s namespace: %s containerID: %s", k8sArgs.K8S_POD_NAME, k8sArgs.K8S_POD_NAMESPACE, k8sArgs.K8S_POD_INFRA_CONTAINER_ID)
		return true, branchEni, nil

Contributor

If netns is already empty for a branch ENI pod, why don't we let IPAMD get the VLAN information from the API server using the old flow (if the pod is still lingering around in etcd)? If the current flow can get that for us, we can use it to clear the SGPP rules (if any). This will be a corner case, and deferring to IPAMD to get the VLAN info might not be a bad idea for such cases...

Contributor Author

Sure, I was only carrying this check over to preserve the existing behavior. Going through IPAMD when netns is already empty makes sense to me.

@achevuru (Contributor) commented Nov 11, 2022

General comment: We're changing the delete flow, i.e., tearing down the pod network first before deciding whether to call IPAMD to release the IP address from the assigned list. The current flow is the recommended sequence, so I believe we should stick to that and fall back to PrevResult for non-branch ENI pods only if the IPAMD call fails. It makes sense to do this for branch ENI pods, as IPAMD is not the IP address manager for them anyway, so there is not much value in invoking IPAMD. By sticking to the current flow sequence for non-branch ENI pods, we can also do away with the skipIpamd check, as this will purely be a fallback mechanism for some edge scenarios...

@jayanthvn (Contributor)

Agreed with what Apurup mentioned; we should keep the current flow and have a fallback for when the IPAMD call fails. Returning two conditions from tryDelWithPrevResult would then not be required.

@jdn5126 (Contributor Author) commented Nov 12, 2022

> General comment: We're changing the delete flow, i.e., tearing down the pod network first before deciding whether to call IPAMD to release the IP address from the assigned list. The current flow is the recommended sequence, so I believe we should stick to that and fall back to PrevResult for non-branch ENI pods only if the IPAMD call fails. It makes sense to do this for branch ENI pods, as IPAMD is not the IP address manager for them anyway, so there is not much value in invoking IPAMD. By sticking to the current flow sequence for non-branch ENI pods, we can also do away with the skipIpamd check, as this will purely be a fallback mechanism for some edge scenarios...

I did consider that, but that would involve falling back to PrevResult in three cases:

  1. IPAMD is unreachable
  2. IPAMD returns error
  3. IPAMD returns failure (!success)

I am fine with that approach; what do you think?
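
For concreteness, a sketch of the fallback those three cases would imply; grpcClient, ipamdAddress, client, and teardownWithPrevResult are illustrative stand-ins rather than the plugin's exact names:

// Sketch: all three cases funnel into the same PrevResult-based teardown.
conn, err := grpcClient.Dial(ipamdAddress)
if err != nil {
	return teardownWithPrevResult() // 1. IPAMD is unreachable
}
defer conn.Close()
reply, err := client.DelNetwork(ctx, req)
if err != nil {
	return teardownWithPrevResult() // 2. IPAMD returns an error
}
if !reply.Success {
	return teardownWithPrevResult() // 3. IPAMD returns failure (!success)
}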

@achevuru (Contributor)

But is there a need to treat them all differently? In all of the above scenarios, we clean up what we can based on the PrevResult, similar to what you're trying to do in the PR before reaching out to IPAMD. If the IP is missing in the state file, IPAMD anyway gives us a relevant error message, and we return success in that scenario back to kubelet; for any other error scenario, we just let kubelet retry.

@jdn5126 (Contributor Author) commented Nov 15, 2022

I thought about this, and I think the cleanest solution is to only call TeardownPodNetwork using PrevResult in the case where we cannot establish a connection with IPAMD. This means tryDelWithPrevResult would not change, other than no longer returning an error when podVlanId == 0.

All IPAMD error conditions will cause kubelet to retry, but it is only in the case where IPAMD is truly not running that the delete is critical to prevent stale rules. I will make these changes.
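
A minimal sketch of that final flow; names and signatures are simplified stand-ins for the actual plugin code:

// Sketch: PrevResult teardown runs only when IPAMD cannot be reached at all;
// every other IPAMD error still propagates so that kubelet retries the delete.
conn, dialErr := grpcClient.Dial(ipamdAddress)
if dialErr != nil {
	// IPAMD is truly not running (e.g., aws-node was evicted): this is the one
	// case where cleaning up from PrevResult is critical, since an IPAMD
	// restart would clear the pod's state without removing its IP rules.
	if handled := tryDelWithPrevResult(driverClient, conf, k8sArgs, args.IfName, args.Netns, log); handled {
		return nil
	}
	return errors.Wrap(dialErr, "del cmd: failed to connect to IPAMD")
}
defer conn.Close()
// ...the normal IPAMD-driven delete continues here...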
