-
Notifications
You must be signed in to change notification settings - Fork 558
Check outbount connection at the start of CSE #3083
Conversation
/assign @slack @JunSun17 @jackfrancis |
Discussion: How conservative do we need to be in retries and timeout? We need to find the right balance between failing fast if we cannot resolve bing.com and ubuntu.com but we also don't want to introduce false negatives (ie. network is flaky, it does eventually become available but we fail before it does). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thanks for working on this!
@CecileRobertMichon not sure what the right time budget for the network should be. Shot from the hip here, but if we haven't seen stable network in ~5m we should be safe to fail. Do we have any data on a VM coming up with network but then later calls being flaky? |
Data acquisition in progress |
My primary concern is that we implement timeout T for network to come up and then introduce add'l failure for existing cluster deployment operations that succeed with T + (insert reasonable unit of time) network availability delay. We'll test our way into figuring out real-world implications. For example, let's say we go from:
Then we probably need to eat the increase to our additional average deployment time (to, say, accommodate deployments that succeed because they're willing to retry during 10 mins of flaky network) in order to save our 99.99% success rate. |
parts/k8s/kubernetescustomscript.sh
Outdated
@@ -51,6 +52,10 @@ else | |||
REBOOTREQUIRED=false | |||
fi | |||
|
|||
function testOutboundConnection() { | |||
retrycmd_if_failure 20 5 20 curl www.bing.com || retrycmd_if_failure 20 5 20 curl www.ubuntu.com || exit $ERR_OUTBOUND_CONN_FAIL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest, to test simply "I have outbound internet access", we should do a port 53 test against a highly available DNS resolver (e.g., google public dns: 8.8.8.8
/8.8.4.4
, or one the root name servers, see a current root hints file here: http://www.internic.net/domain/named.root)
@khenidak @slack do you agree?
Also, are we in agreement that "we have outbound internet access" == "a single successful connection response within an interval of time"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, are we in agreement that "we have outbound internet access" == "a single successful connection response within an interval of time"?
This test is meant to quickly error out in the case of VNETs with firewalls / NSGs that disallow outbound connection (100% of the time) so yes in that case I would say "a single successful connection" is enough to determine that we are in that situation and we want to bail.
This test is not designed to catch transient network flakiness (for now), that would need to be checked differently.
39522ff
to
843d363
Compare
Codecov Report
@@ Coverage Diff @@
## master #3083 +/- ##
=======================================
Coverage 51.89% 51.89%
=======================================
Files 99 99
Lines 15161 15161
=======================================
Hits 7868 7868
Misses 6571 6571
Partials 722 722 |
parts/k8s/kubernetescustomscript.sh
Outdated
@@ -53,6 +54,10 @@ else | |||
REBOOTREQUIRED=false | |||
fi | |||
|
|||
function testOutboundConnection() { | |||
retrycmd_if_failure 20 5 20 nc -v 8.8.8.8 53 || retrycmd_if_failure 20 5 20 nc -v 8.8.4.4 53 || exit $ERR_OUTBOUND_CONN_FAIL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good! Let's do more aggressive timeout and retry. How does 120 1 5 sound? That should give us between ~2 and ~10 min (depending on the type of error) to achieve a single successful response
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good let's try that and get success rate data for a few hours and adjust if needed
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jackfrancis The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it: In some deployments with a custom vnet, a network config on the vnet prevents us from installing dependencies such as docker and ubuntu packages. This PR adds a basic connectivity test at the beginning of every deployment and exits if we are not able to access Google public DNS.
Which issue this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close that issue when PR gets merged): fixes #2951Special notes for your reviewer:
If applicable:
Release note: