[CNI]: Teardown pod networking resources without IPAMD when possible #2125
Conversation
handled, skipIpamd, err := tryDelWithPrevResult(driverClient, conf, k8sArgs, args.IfName, args.Netns, log)
// On error, return immediately if pod was created with branch ENI. Otherwise, let IPAMD handle cleanup. Note that we cannot
// check conf.PodSGEnforcingMode as the pod we are deleting may have been added before the mode switch.
if err != nil && skipIpamd {
What if err != nil and !skipIpamd? It seems the error is ignored.
The error still gets printed in tryDelWithPrevResult, but this is a case where we would still want to go through IPAMD for backwards compatibility. If we cannot delete from PrevResult and this pod was not created with a branch ENI (or we are not sure), then we give IPAMD a chance to do the delete. In theory, this should never happen, but I figured we wanted to preserve backwards compatibility.
The error is effectively ignored, as we are assuming that IPAMD would return the same error if there is a real error.
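To make the behavior under discussion concrete, here is a minimal sketch of the branch in question, assuming the surrounding cmdDel context and the errors package the plugin already uses:

handled, skipIpamd, err := tryDelWithPrevResult(driverClient, conf, k8sArgs, args.IfName, args.Netns, log)
if err != nil {
	if skipIpamd {
		// Branch ENI pod: IPAMD holds no state for it, so surface the error
		// now and let kubelet retry the delete.
		return errors.Wrap(err, "del cmd: failed to delete with prevResult")
	}
	// Non-branch (or unknown) pod: the error was already logged inside
	// tryDelWithPrevResult; fall through and give IPAMD a chance to do the
	// delete for backwards compatibility.
}
// ... continue to the IPAMD path ...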
}

if err != nil {
	return true, branchEni, err
Since we return true here, does that mean we'll call IPAMD to release the IP but skip teardownPodNetwork? CNI won't retry because we returned nil, so it seems we could leak some IP rules if TeardownPodNetwork fails for some reason.
if handled {
	return nil
}
The TeardownPodNetwork call that we make based on the IPAMD response will be the same as the call we made here, but it does seem safer to try again. I will change this to return branchEni, branchEni, err.
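A sketch of what that change would look like at the return site (names follow the snippet above; the enclosing function is tryDelWithPrevResult):

if err != nil {
	// Previously: return true, branchEni, err. Returning handled=branchEni
	// means a failed teardown for a non-branch pod is reported as unhandled,
	// so the caller still runs the IPAMD-driven TeardownPodNetwork and can
	// retry cleaning up the IP rules.
	return branchEni, branchEni, err
}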
return errors.Wrap(err, "del cmd: failed to delete with prevResult")
}
if handled {
	log.Infof("Handled CNI del request with prevResult: ContainerID(%s) Netns(%s) IfName(%s) PodNamespace(%s) PodName(%s)",
		args.ContainerID, args.Netns, args.IfName, string(k8sArgs.K8S_POD_NAMESPACE), string(k8sArgs.K8S_POD_NAME))
	return nil
	// When deleting a pod with a branch ENI configured, IPAMD does not need to be notified, so we can return early.
	if skipIpamd {
If handled, we used to return nil. Why the additional check?
We still need to call into IPAMD to release the IP allocation in the non-branch ENI case.
I feel we just need to keep either handled or skipIpamd; returning two conditions from tryDelWithPrevResult can be changed.
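One possible shape for that simplification, purely as an illustration (the PR does not necessarily adopt this): collapse the two booleans into a single tri-state result.

// delOutcome collapses (handled, skipIpamd) into one value.
type delOutcome int

const (
	delNotHandled       delOutcome = iota // fall through to the IPAMD path
	delHandled                            // teardown done; still call IPAMD to release the IP
	delHandledSkipIpamd                   // branch ENI: teardown done and IPAMD can be skipped
)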
if branchEni {
	if isNetnsEmpty(netNS) {
		log.Infof("Ignoring TeardownPodENI as Netns is empty for SG pod: %s namespace: %s containerID: %s", k8sArgs.K8S_POD_NAME, k8sArgs.K8S_POD_NAMESPACE, k8sArgs.K8S_POD_INFRA_CONTAINER_ID)
		return true, branchEni, nil
If netns is already empty for a branch ENI pod, why don't we let IPAMD get the VLAN information from the API server using the old flow (if the pod is still lingering around in etcd)? If the current flow can get that for us, we can use it to clear the SGPP rules (if any). This will be a corner case, and deferring to IPAMD to get the VLAN info might not be a bad idea for such cases...
Sure, I was only carrying this check over to preserve the existing behavior. Going through IPAMD when netns is already empty makes sense to me.
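A sketch of the agreed behavior, assuming the existing isNetnsEmpty helper and the three-value return discussed above: instead of returning early, report the request as unhandled so it falls through to IPAMD.

if branchEni && isNetnsEmpty(netNS) {
	// Netns is already empty for this SG pod; defer to IPAMD, which can
	// recover the VLAN information from the API server and clear any
	// remaining SGPP rules.
	log.Infof("Netns empty for SG pod %s; deferring teardown to IPAMD", string(k8sArgs.K8S_POD_NAME))
	return false, false, nil // handled=false, skipIpamd=false
}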
General comment: We're changing the delete flow, i.e., tearing down the pod network first before deciding whether to call IPAMD to release the IP address from the assigned list. The current flow is the recommended sequence, so I believe we should stick to that and fall back to the PrevResult-based teardown only when the IPAMD call fails.
Agreed with what Apurup mentioned; we should keep the current flow and have a fallback when the IPAMD call fails. The returning of two conditions from tryDelWithPrevResult can also be changed.
I did consider that, but it would involve making the call in three cases. I am fine with that approach, though; what do you think?
But is there a need to treat them all differently? In all the above scenarios, we clean up what we can based on the PrevResult, similar to what you're trying to do in the PR before reaching out to IPAMD. If the IP is missing in the state file, IPAMD gives us a relevant error message anyway, and we return it.
I thought about this, and I think the cleanest solution is to only call tryDelWithPrevResult when IPAMD cannot be reached. All IPAMD error conditions will cause kubelet to retry, but it is only the case where IPAMD is truly not running where the delete is critical to prevent stale rules. I will make these changes.
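As I read it, the resulting flow looks roughly like this sketch. delViaIpamd and isConnectionRefused are hypothetical stand-ins for the existing gRPC path and error check, and the cmdDel arguments are assumed to be in scope:

func delPod() error {
	err := delViaIpamd() // existing flow: ask IPAMD to release the IP, then tear down
	if err == nil {
		return nil
	}
	if !isConnectionRefused(err) {
		// A real IPAMD error: return it and let kubelet retry the delete.
		return err
	}
	// IPAMD is unreachable. This is the case where waiting would leak IP
	// rules if IPAMD restarts and drops its state, so tear down from
	// PrevResult opportunistically.
	handled, _, delErr := tryDelWithPrevResult(driverClient, conf, k8sArgs, args.IfName, args.Netns, log)
	if handled && delErr == nil {
		return nil
	}
	return err
}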
What type of PR is this?
bug
Which issue does this PR fix:
#2048
What does this PR do / Why do we need it:
This PR resolves an issue in which IP rules were leaked by the CNI. When processing a pod deletion, the CNI would wait for the IPAMD response before tearing down pod networking resources. If IPAMD could not be reached, the CNI would return an error and wait for kubelet to retry the delete. If IPAMD were restarted, the state for this pod would be cleared without the CNI tearing down the associated networking resources. The trigger for the linked issue was the k8s cluster autoscaler evicting the aws-node daemonset pod before other pods and then later cancelling the pod evictions. kubernetes/autoscaler#5240 was filed to ask the k8s cluster autoscaler to change its behavior.

With the changes in this PR, we now store state for cleanup in PrevResult for pods whether they use branch ENIs (SGPP mode) or not. During cleanup, we tear down networking resources whenever possible. If the CNI cannot reach IPAMD, we still return an error in non-SGPP mode and expect kubelet to retry the delete; the delete is therefore purely opportunistic in non-SGPP mode. It will only occur for pods created after the CNI version in which this change merges. This was done to avoid backward compatibility issues and preserve existing semantics, while also making an improvement.

Note that this also makes the CNI less dependent on IPAMD. The only behavior change to call out is that tryDelWithPrevResult now returns a skipIpamd boolean to indicate whether the IPAMD request can definitely be skipped.
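For context, consuming the stored PrevResult on delete typically looks like the following sketch using the containernetworking packages; this is illustrative and not necessarily the PR's exact code (NetConf is assumed to embed types.NetConf):

import (
	current "github.com/containernetworking/cni/pkg/types/100"
	"github.com/containernetworking/cni/pkg/version"
)

// getPrevResult parses the PrevResult stored in the network conf, if any,
// so teardown can proceed without querying IPAMD.
func getPrevResult(conf *NetConf) (*current.Result, error) {
	if err := version.ParsePrevResult(&conf.NetConf); err != nil {
		return nil, err
	}
	if conf.NetConf.PrevResult == nil {
		// Pod was added before this change: no local state to tear down.
		return nil, nil
	}
	return current.NewResultFromResult(conf.NetConf.PrevResult)
}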
If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
Testing done on this change:
Added more test cases to cni_test.go and verified that all CNI and IPAMD integration tests pass with this change. Also manually verified the fix with IPv4 and IPv6 clusters.

Automation added to e2e:
N/A
Will this PR introduce any new dependencies?:
No
Will this break upgrades or downgrades. Has updating a running cluster been tested?:
Upgrades and downgrades will not be broken. A running cluster has been tested.
Does this change require updates to the CNI daemonset config files to work?:
No
Does this PR introduce any user-facing change?:
Yes
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.