Skip to content
This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

Check outbount connection at the start of CSE #3083

Merged
merged 5 commits into from
Jun 1, 2018

Conversation

CecileRobertMichon
Copy link
Contributor

@CecileRobertMichon CecileRobertMichon commented May 25, 2018

What this PR does / why we need it: In some deployments with a custom vnet, a network config on the vnet prevents us from installing dependencies such as docker and ubuntu packages. This PR adds a basic connectivity test at the beginning of every deployment and exits if we are not able to access Google public DNS.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #2951

Special notes for your reviewer:

If applicable:

  • documentation
  • unit tests
  • tested backward compatibility (ie. deploy with previous version, upgrade with this branch)

Release note:

@ghost ghost added the in progress label May 25, 2018
@CecileRobertMichon
Copy link
Contributor Author

/assign @slack @JunSun17 @jackfrancis

@CecileRobertMichon
Copy link
Contributor Author

Discussion: How conservative do we need to be in retries and timeout? We need to find the right balance between failing fast if we cannot resolve bing.com and ubuntu.com but we also don't want to introduce false negatives (ie. network is flaky, it does eventually become available but we fail before it does).

JunSun17
JunSun17 previously approved these changes May 25, 2018
Copy link
Collaborator

@JunSun17 JunSun17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for working on this!

@slack
Copy link
Contributor

slack commented May 25, 2018

@CecileRobertMichon not sure what the right time budget for the network should be. Shot from the hip here, but if we haven't seen stable network in ~5m we should be safe to fail.

Do we have any data on a VM coming up with network but then later calls being flaky?

@CecileRobertMichon
Copy link
Contributor Author

Do we have any data on a VM coming up with network but then later calls being flaky?

Data acquisition in progress

@jackfrancis
Copy link
Member

My primary concern is that we implement timeout T for network to come up and then introduce add'l failure for existing cluster deployment operations that succeed with T + (insert reasonable unit of time) network availability delay.

We'll test our way into figuring out real-world implications. For example, let's say we go from:

  • 99.99% deployment success w/ average deploy time of 9 mins
  • 97% deployment success w/ average deploy time of 8.5 mins

Then we probably need to eat the increase to our additional average deployment time (to, say, accommodate deployments that succeed because they're willing to retry during 10 mins of flaky network) in order to save our 99.99% success rate.

@@ -51,6 +52,10 @@ else
REBOOTREQUIRED=false
fi

function testOutboundConnection() {
retrycmd_if_failure 20 5 20 curl www.bing.com || retrycmd_if_failure 20 5 20 curl www.ubuntu.com || exit $ERR_OUTBOUND_CONN_FAIL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest, to test simply "I have outbound internet access", we should do a port 53 test against a highly available DNS resolver (e.g., google public dns: 8.8.8.8/8.8.4.4, or one the root name servers, see a current root hints file here: http://www.internic.net/domain/named.root)

@khenidak @slack do you agree?

Also, are we in agreement that "we have outbound internet access" == "a single successful connection response within an interval of time"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, are we in agreement that "we have outbound internet access" == "a single successful connection response within an interval of time"?

This test is meant to quickly error out in the case of VNETs with firewalls / NSGs that disallow outbound connection (100% of the time) so yes in that case I would say "a single successful connection" is enough to determine that we are in that situation and we want to bail.

This test is not designed to catch transient network flakiness (for now), that would need to be checked differently.

@codecov
Copy link

codecov bot commented May 31, 2018

Codecov Report

Merging #3083 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #3083   +/-   ##
=======================================
  Coverage   51.89%   51.89%           
=======================================
  Files          99       99           
  Lines       15161    15161           
=======================================
  Hits         7868     7868           
  Misses       6571     6571           
  Partials      722      722

@@ -53,6 +54,10 @@ else
REBOOTREQUIRED=false
fi

function testOutboundConnection() {
retrycmd_if_failure 20 5 20 nc -v 8.8.8.8 53 || retrycmd_if_failure 20 5 20 nc -v 8.8.4.4 53 || exit $ERR_OUTBOUND_CONN_FAIL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! Let's do more aggressive timeout and retry. How does 120 1 5 sound? That should give us between ~2 and ~10 min (depending on the type of error) to achieve a single successful response

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good let's try that and get success rate data for a few hours and adjust if needed

@jackfrancis
Copy link
Member

/approve
/lgtm

@acs-bot acs-bot added the lgtm label Jun 1, 2018
@acs-bot
Copy link

acs-bot commented Jun 1, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jackfrancis
Copy link
Member

@slack @amanohar @qike-ms FYI: CSE Exit Code 50 = no outbound internet access :)

@jackfrancis jackfrancis merged commit b6b0e12 into Azure:master Jun 1, 2018
@ghost ghost removed the in progress label Jun 1, 2018
@CecileRobertMichon CecileRobertMichon deleted the check-connection branch June 6, 2018 19:13
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Discuss: Fail fast when no outbound network connectivity.
5 participants