Retry LinkByMac when link not found #346
Conversation
Thanks, looks good.
Any more data from the soak tests?
// number of attempts to find an ENI by MAC address after it is attached
maxAttemptsLinkByMac = 5
retryLinkByMacInterval = 5 * time.Second
I would like to refactor this to use exponential backoff instead, but this change is consistent with the current code and looks good for now. I'll create an issue for that.
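For reference, a minimal sketch of what that exponential-backoff refactor could look like; the helper name, doubling schedule, and cap are illustrative assumptions, not the project's code:

package network

import "time"

// retryWithBackoff is a hypothetical helper: it calls fn up to maxAttempts
// times, sleeping between attempts and doubling the delay (capped at max)
// after each failure. The last error is returned if every attempt fails.
func retryWithBackoff(maxAttempts int, initial, max time.Duration, fn func() error) error {
	delay := initial
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(delay)
		delay *= 2
		if delay > max {
			delay = max
		}
	}
	return err
}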
Looking good so far, we haven't encountered any further problems in our soak test. We are promoting to the next stage, and should have another update tomorrow. If you are considering follow-on changes, I wonder if it's worth exploring how to exclude failed ENI attaches from entering the IPAM tool. All indications are we've covered the timing scenarios we know about, but I wonder if there are other potential failures in this code path. Running out of IPs on a node, with an error logged at sandbox creation, would be much easier to diagnose than the disconnected running pods we hit.
Good point, "fail-fast" is generally something we want.
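To make the fail-fast suggestion concrete, here is a hypothetical sketch of that ordering; the types and function names are stand-ins for illustration, not the plugin's actual code:

package main

import (
	"errors"
	"fmt"
)

// eni is a hypothetical stand-in for the plugin's ENI metadata.
type eni struct {
	id  string
	ips []string
}

// pool stands in for the IPAM pool of assignable addresses.
var pool []string

// setupENINetwork stands in for the link lookup and route-table setup;
// here it always fails, to exercise the fail-fast path.
func setupENINetwork(e eni) error {
	return errors.New("link not found")
}

// addENI publishes the ENI's IPs to the pool only after network setup
// succeeds, so a failed attach surfaces as one logged error rather than
// as disconnected running pods later.
func addENI(e eni) error {
	if err := setupENINetwork(e); err != nil {
		return fmt.Errorf("not adding IPs for ENI %s to pool: %w", e.id, err)
	}
	pool = append(pool, e.ips...)
	return nil
}

func main() {
	if err := addENI(eni{id: "eni-0abc", ips: []string{"10.0.0.5"}}); err != nil {
		fmt.Println(err)
	}
}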
Btw, I'd prefer to merge this into master, not the release-1.3 branch. @peterbroadhurst
Fix for #204
Description of changes:
We have observed that the first time LinkByMac is run after attaching a new ENI, there is a timing condition where it can fail to find the ENI. An error is reported in the logs, but processing on the node continues: the routing table for that ENI never gets set up, although the IP addresses associated with the ENI are still added to the IPAM pool. We have example logs showing the IPs being allocated to the failed ENI.

This means that any pods scheduled with IPs on that ENI have no working networking. They are unable to talk to the internet through a NAT gateway (we use AWS_VPC_K8S_CNI_EXTERNALSNAT) or to other pods in the local k8s cluster.

This manifests as difficult-to-diagnose and, at first inspection, quite random networking failures in the cluster. Hopefully it is the root cause of all of the various symptoms reported in #204.
The simple fix proposed here is to follow the pattern of other functions in network.go and retry. To allow efficient unit testing, the retry delay retryLinkByMacInterval is set as a code-configurable constant for callers from outside the package, but passed as a parameter in the call so the unit test can set it to zero.

In our environment, we have a pipeline for our fork and have installed it into a number of test clusters. As of 14th March we are continuing to soak test, but we have already encountered one example where the retry was required and succeeded after the first 5s retry, hence raising the PR now.
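As a rough illustration of the retry pattern described above (a sketch only; aside from the two constants quoted in the diff, the function name and the use of netlink.LinkList are assumptions rather than the PR's exact code):

package network

import (
	"fmt"
	"time"

	"github.com/vishvananda/netlink"
)

const (
	// number of attempts to find an ENI by MAC address after it is attached
	maxAttemptsLinkByMac = 5
	// delay between attempts; passed as a parameter so unit tests can use zero
	retryLinkByMacInterval = 5 * time.Second
)

// linkByMac scans the host's links for one whose hardware address matches
// mac, retrying because the ENI may not be visible immediately after attach.
func linkByMac(mac string, retryInterval time.Duration) (netlink.Link, error) {
	for attempt := 1; attempt <= maxAttemptsLinkByMac; attempt++ {
		links, err := netlink.LinkList()
		if err != nil {
			return nil, err
		}
		for _, link := range links {
			if link.Attrs().HardwareAddr.String() == mac {
				return link, nil
			}
		}
		time.Sleep(retryInterval)
	}
	return nil, fmt.Errorf("no interface found with MAC address %s after %d attempts", mac, maxAttemptsLinkByMac)
}

Callers inside the package would pass retryLinkByMacInterval here, while the unit test passes zero to avoid real sleeps.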
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.