
When docker takes too long to install, ensureK8s fails #2645

Closed
khaldoune opened this issue Apr 9, 2018 · 4 comments

@khaldoune

Is this a request for help?:
YES

Is this an ISSUE or FEATURE REQUEST? (choose one):
BOTH

What version of acs-engine?:
commit d5df071

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes 1.9.6

What happened:
Deployment fails while cluster is healthy

What you expected to happen:
Deployment OK

How to reproduce it (as minimally and precisely as possible):

Do you have any explanation?

From my understanding:

We wait 900 seconds for the docker install (ensureDockerInstallCompleted), and even if it is not installed by then, no error is thrown; we continue with ensureDocker(), which calls systemctlEnableAndCheck. That tries for up to 900 seconds to enable docker; if it is not enabled, we exit with error code 5, and if it is enabled, we try to start it (900 1 60).

That means docker has 900+900+60 seconds to come up, while k8s has only 600 seconds.

In some cases, when docker takes too long to start, ensureK8s() fails because 600 seconds are no longer enough for kubectl cluster-info to respond.
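
In shell terms, the chain looks roughly like this (a sketch to make the timeouts comparable; wait_for and the exact checks are my guesses, not the actual acs-engine code):

```bash
wait_for() {                       # usage: wait_for <timeout_secs> <cmd...>
    local deadline=$(( $(date +%s) + $1 ))
    shift
    until "$@"; do
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep 1
    done
}

# ensureDockerInstallCompleted: waits up to 900s, but the result is ignored
wait_for 900 command -v docker >/dev/null

# ensureDocker -> systemctlEnableAndCheck: up to 900s to enable, exit 5 if not
wait_for 900 systemctl enable docker || exit 5
# ...then more retries to start it (the "(900 1 60)" arguments above)
wait_for 900 systemctl start docker

# ensureK8s: only 600s for the API server to answer
wait_for 600 kubectl cluster-info >/dev/null 2>&1 || exit 1
```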

We still need to understand why docker took 198 seconds to install on the node that succeeded and much longer on the failing one. If I look at cloud-init-output.log, I can see:

apt-get update
W: Failed to fetch http://azure.archive.ubuntu.com/ubuntu/dists/xenial/InRelease Could not connect to azure.archive.ubuntu.com:80 (51.137.52.58), connection timed out
W: Failed to fetch http://azure.archive.ubuntu.com/ubuntu/dists/xenial-updates/InRelease Unable to connect to azure.archive.ubuntu.com:http:
W: Failed to fetch http://azure.archive.ubuntu.com/ubuntu/dists/xenial-backports/InRelease Unable to connect to azure.archive.ubuntu.com:http:
W: Some index files failed to download. They have been ignored, or old ones used instead.

Anything else we need to know:
@CecileRobertMichon @jackfrancis

@CecileRobertMichon
Contributor

CecileRobertMichon commented Apr 9, 2018

@khaldoune we have been seeing a lot of these issues today; #2641 was just merged to ensure we wait long enough for Docker to be installed. This is mainly due to networking errors while running apt-get update in cloud-init (as shown by the log you pasted). The next step is to figure out why those errors are happening and how we can fix or mitigate them so we don't have to wait an hour for a working cluster...

edit: note that ensureK8s() runs after ensureDocker() (not in parallel), so if Docker fails to be enabled before the timeout, we will exit before even checking for nodes.
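
For what it's worth, the usual way to absorb transient mirror timeouts like the ones in the log above is a retry with backoff around apt-get update; a rough sketch (retry_apt_update is a made-up name, not an acs-engine helper):

```bash
retry_apt_update() {
    local i
    for i in 1 2 3 4 5 6 7 8 9 10; do
        apt-get update && return 0
        sleep $(( i * 5 ))   # back off: 5s, 10s, ... up to 50s
    done
    return 1
}
```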

@khaldoune
Author

@CecileRobertMichon On the same page for 1 and 2.
Point 2 means that 10 minutes are sometimes not enough to get a response from kubectl cluster-info.
Increase to 15?
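
Something along these lines is what I have in mind, with the timeout pulled out so bumping it is a one-line change (a sketch; K8S_HEALTH_TIMEOUT is a name I made up, not the script's actual variable):

```bash
K8S_HEALTH_TIMEOUT=${K8S_HEALTH_TIMEOUT:-600}   # 600 today; 900 would be 15 min
deadline=$(( $(date +%s) + K8S_HEALTH_TIMEOUT ))
until kubectl cluster-info >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
        echo "kubectl cluster-info did not answer within ${K8S_HEALTH_TIMEOUT}s" >&2
        exit 1
    fi
    sleep 5
done
```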

@CecileRobertMichon
Contributor

By the time ensureK8s() runs, everything should be installed, so 10 minutes should be plenty. The issue we were seeing was ensureDocker() running in parallel with the Docker installation in cloud-init, which meant that ensureDocker() would sometimes time out before Docker was installed. We increased the timeout to 60 minutes, which should be enough to catch > 99.9% of network flakiness. Did I answer your question?
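
For reference, the behaviour after #2641 is roughly this (my sketch, assuming a dpkg-based check; the real script may test for completion differently):

```bash
DOCKER_INSTALL_TIMEOUT=3600   # 60 minutes, up from the old 900s
deadline=$(( $(date +%s) + DOCKER_INSTALL_TIMEOUT ))
# block until cloud-init has actually finished installing Docker
until dpkg -s docker-engine >/dev/null 2>&1; do
    [ "$(date +%s)" -ge "$deadline" ] && exit 1
    sleep 10
done
# only now run ensureDocker()/ensureK8s(), so their own timeouts start
# from a machine where Docker is actually present
```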

@khaldoune
Author

Thanks @CecileRobertMichon
