Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry LinkByMac when link not found #346

Closed

Conversation

peterbroadhurst
Copy link
Contributor

Fix for #204

Description of changes:

We have observed that the first time LinkByMac is run after attaching a new ENI, there is a timing condition where can fail to find the ENI. An error is reported in the logs, but processing on the node continues. The routing table for that ENI never get set up, although IP addresses associated with the ENI are included in the IPAM pool. Example logs showing the IPs being allocated to the failed ENI.

2019-03-10T13:11:16Z [INFO] Successfully created and attached a new eni eni-0a975df1b2b06bc79 to instance
2019-03-10T13:11:16Z [INFO] Trying to allocate 14 IP address on eni eni-0a975df1b2b06bc79
2019-03-10T13:11:16Z [DEBUG] Found eni: eni-0a975df1b2b06bc79, mac 02:5a:96:2b:9b:a2, device 3
2019-03-10T13:11:16Z [DEBUG] DataStore Add an ENI eni-0a975df1b2b06bc79
2019-03-10T13:11:16Z [DEBUG] AssignPodIPv4Address, skip ENI eni-0a975df1b2b06bc79 that does not have available addresses
2019-03-10T13:11:17Z [ERROR] Failed to increase pool size: failed to setup eni eni-0a975df1b2b06bc79 network: eni network setup: failed to find the link which uses mac address 02:5a:96:2b:9b:a2: no interface found which uses mac address 02:5a:96:2b:9b:a2 
2019-03-10T13:11:22Z [DEBUG] AssignPodIPv4Address, skip ENI eni-0a975df1b2b06bc79 that does not have available addresses
2019-03-10T13:11:22Z [DEBUG] Found eni eni-0a975df1b2b06bc79 that have less IP address allocated: cur=0, max=14
2019-03-10T13:11:22Z [DEBUG] Attempt again to allocate IP address for eni :eni-0a975df1b2b06bc79
2019-03-10T13:11:22Z [INFO] Trying to allocate 14 IP address on eni eni-0a975df1b2b06bc79
2019-03-10T13:11:22Z [DEBUG] Adding ENI(eni-0a975df1b2b06bc79)'s IPv4 address 10.10.79.106 to datastore

This means that any pods scheduled with IPs on that ENI fail to perform networking. They are unable to talk to the internet through a NAT gateway (we use AWS_VPC_K8S_CNI_EXTERNALSNAT), or with other pods in the local k8s cluster.

This manifests as difficult to diagnose and at first inspection quite random networking failures in the cluster. Hopefully, it is root cause for all of the various symptoms reported in #204.

The simple fix proposed here, is to follow the pattern of other functions in network.go and perform retry. To allow efficient unit testing, the retry delay retryLinkByMacInterval is set as a code-configurable constant for callers from outside of the package, but passed as a parameter in the call so the unit test can set it to zero.

In our environment, we have a pipeline for our fork and have installed it into a number of test clusters. As of 14th March we are continuing to soak test, but have already encountered one example where the retry was required and succeeded after the first 5s retry, hence raising the PR:

2019-03-14T03:46:32Z [DEBUG] no interface found which uses mac address 02:bd:cf:ff:a6:e6 (attempt 1/5)
2019-03-14T03:46:37Z [DEBUG] Found the Link that uses mac address 02:bd:cf:ff:a6:e6 and its index is 13 (attempt 2/5)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Copy link
Contributor

@mogren mogren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good.

Any more data from the soak tests?

// number of attempts to find an ENI by MAC address after it is attached
maxAttemptsLinkByMac = 5

retryLinkByMacInterval = 5 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to refactor this to use exponential backoff instead, but this change is consistent with the current code and looks good for now. I'll create an issue for that.

@peterbroadhurst
Copy link
Contributor Author

Looking good so far, we haven’t encountered any further problems in our soak test. We are promoting to the next stage, and should have another update tomorrow.

If you are considering follow-on changes, I wonder if it’s worth exploring how to exclude failed ENI attaches from entering the IPAM tool. All indications are we’ve covered the timing scenarios we know about, but I wonder if there are other potential failures in this code path.

Running out of IPs on a node, with an error logged in the sandbox creation, would be much easier to diagnose than the disconnected running pods we hit.

@mogren
Copy link
Contributor

mogren commented Mar 15, 2019

Good point, "fail-fast" is generally a something we want.

@mogren
Copy link
Contributor

mogren commented Mar 18, 2019

Btw, I'd prefer to merge this into master, not the release-1.3 branch. @peterbroadhurst

@mogren mogren added this to the v1.4 milestone Mar 18, 2019
@peterbroadhurst
Copy link
Contributor Author

Thanks @mogren - I will close and re-open against master.
Note we have seen a follow-on issue #359 in a live environment, that we're chasing down.
However, we do not see any evidence it is related to this fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants