Retry LinkByMac when link not found #346
Conversation
Thanks, looks good.
Any more data from the soak tests?
// number of attempts to find an ENI by MAC address after it is attached
maxAttemptsLinkByMac = 5
retryLinkByMacInterval = 5 * time.Second
I would like to refactor this to use exponential backoff instead, but this change is consistent with the current code and looks good for now. I'll create an issue for that.
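For reference, a minimal sketch of what that exponential-backoff refactor could look like; the helper name, doubling schedule, and cap are illustrative assumptions, not the project's code:

package network

import "time"

// retryWithBackoff is a hypothetical helper: it calls fn up to maxAttempts
// times, sleeping between attempts and doubling the delay (capped at max)
// after each failure. The last error is returned if every attempt fails.
func retryWithBackoff(maxAttempts int, initial, max time.Duration, fn func() error) error {
	delay := initial
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(delay)
		delay *= 2
		if delay > max {
			delay = max
		}
	}
	return err
}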
Looking good so far, we haven't encountered any further problems in our soak test. We are promoting to the next stage, and should have another update tomorrow. If you are considering follow-on changes, I wonder if it's worth exploring how to exclude failed ENI attaches from entering the IPAM tool. All indications are we've covered the timing scenarios we know about, but I wonder if there are other potential failures in this code path. Running out of IPs on a node, with an error logged at sandbox creation, would be much easier to diagnose than the disconnected running pods we hit.
Good point, "fail-fast" is generally something we want.
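To make the fail-fast suggestion concrete, here is a hypothetical sketch of that ordering; the types and function names are stand-ins for illustration, not the plugin's actual code:

package main

import (
	"errors"
	"fmt"
)

// eni is a hypothetical stand-in for the plugin's ENI metadata.
type eni struct {
	id  string
	ips []string
}

// pool stands in for the IPAM pool of assignable addresses.
var pool []string

// setupENINetwork stands in for the link lookup and route-table setup;
// here it always fails, to exercise the fail-fast path.
func setupENINetwork(e eni) error {
	return errors.New("link not found")
}

// addENI publishes the ENI's IPs to the pool only after network setup
// succeeds, so a failed attach surfaces as one logged error rather than
// as disconnected running pods later.
func addENI(e eni) error {
	if err := setupENINetwork(e); err != nil {
		return fmt.Errorf("not adding IPs for ENI %s to pool: %w", e.id, err)
	}
	pool = append(pool, e.ips...)
	return nil
}

func main() {
	if err := addENI(eni{id: "eni-0abc", ips: []string{"10.0.0.5"}}); err != nil {
		fmt.Println(err)
	}
}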
Btw, I'd prefer to merge this into master, not the release-1.3 branch. @peterbroadhurst
Fix for #204
Description of changes:
We have observed that the first time LinkByMac is run after attaching a new ENI, there is a timing condition where it can fail to find the ENI. An error is reported in the logs, but processing on the node continues: the routing table for that ENI never gets set up, although the IP addresses associated with the ENI are still added to the IPAM pool. We have example logs showing the IPs being allocated to the failed ENI.

This means that any pods scheduled with IPs on that ENI have no working networking. They are unable to talk to the internet through a NAT gateway (we use AWS_VPC_K8S_CNI_EXTERNALSNAT) or to other pods in the local k8s cluster.

This manifests as difficult-to-diagnose and, at first inspection, quite random networking failures in the cluster. Hopefully it is the root cause of all of the various symptoms reported in #204.
The simple fix proposed here is to follow the pattern of other functions in network.go and retry. To allow efficient unit testing, the retry delay retryLinkByMacInterval is set as a code-configurable constant for callers from outside the package, but passed as a parameter in the call so the unit test can set it to zero.

In our environment, we have a pipeline for our fork and have installed it into a number of test clusters. As of 14th March we are continuing to soak test, but we have already encountered one example where the retry was required and succeeded after the first 5s retry, hence raising the PR now.
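As a rough illustration of the retry pattern described above (a sketch only; aside from the two constants quoted in the diff, the function name and the use of netlink.LinkList are assumptions rather than the PR's exact code):

package network

import (
	"fmt"
	"time"

	"github.com/vishvananda/netlink"
)

const (
	// number of attempts to find an ENI by MAC address after it is attached
	maxAttemptsLinkByMac = 5
	// delay between attempts; passed as a parameter so unit tests can use zero
	retryLinkByMacInterval = 5 * time.Second
)

// linkByMac scans the host's links for one whose hardware address matches
// mac, retrying because the ENI may not be visible immediately after attach.
func linkByMac(mac string, retryInterval time.Duration) (netlink.Link, error) {
	for attempt := 1; attempt <= maxAttemptsLinkByMac; attempt++ {
		links, err := netlink.LinkList()
		if err != nil {
			return nil, err
		}
		for _, link := range links {
			if link.Attrs().HardwareAddr.String() == mac {
				return link, nil
			}
		}
		time.Sleep(retryInterval)
	}
	return nil, fmt.Errorf("no interface found with MAC address %s after %d attempts", mac, maxAttemptsLinkByMac)
}

Callers inside the package would pass retryLinkByMacInterval here, while the unit test passes zero to avoid real sleeps.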
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.