When docker takes too long to install, ensureK8s fails #2645
Comments
@khaldoune we have been seeing a lot of these issues today. #2641 was just merged to ensure we wait long enough for Docker to be installed. The failures are mainly due to networking errors while running apt-get update in cloud-init (as shown by the log you pasted). The next step is to figure out why those errors are happening and how we can fix or mitigate them so we don't have to wait an hour for a working cluster. Edit: note that ensureK8s() runs after ensureDocker() (not in parallel), so if Docker fails to be enabled before the timeout, we exit before even checking for nodes.
@CecileRobertMichon On the same page for 1 and 2.
By the time
ensureDocker() would sometimes time out before Docker was installed. We increased the timeout to 60 minutes, which should be enough to cover more than 99.9% of network flakiness. Did I answer your question?
Thanks @CecileRobertMichon
Is this a request for help?:
YES
Is this an ISSUE or FEATURE REQUEST? (choose one):
BOTH
What version of acs-engine?:
commit #d5df071
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes 1.9.6
What happened:
Deployment fails while cluster is healthy
What you expected to happen:
Deployment OK
How to reproduce it (as minimally and precisely as possible):
Do you have any explanation?
From my understanding:
We wait up to 900 seconds for the Docker install (ensureDockerInstallCompleted), and even if Docker is not installed by then, no error is thrown. We continue with ensureDocker(), which calls systemctlEnableAndCheck, which tries for at most 900 seconds to enable Docker. If it is not enabled, we exit with error code 5; if it is enabled, we try to start it (900 1 60).
That means Docker has 900+900+60 seconds to start, while k8s has only 600 seconds.
In some cases, when Docker takes too long to start, ensureK8s() fails because 600 seconds are no longer enough for kubectl cluster-info to respond.
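To make the arithmetic above concrete, here is a minimal sketch of the worst-case wait budgets. The stage labels and the `worst_case_seconds` helper are hypothetical, not names from the acs-engine scripts; the numbers are the ones quoted in this issue, and the real scripts may count retries differently.

```python
# Worst-case wait budgets, in seconds, as described in this issue.
# The keys are illustrative labels only (an assumption), not script names.
DOCKER_STAGES = {
    "wait for install": 900,
    "enable": 900,
    "start": 60,
}
K8S_BUDGET = 600


def worst_case_seconds(stages):
    """Stages run one after another, so the worst case is the sum of their timeouts."""
    return sum(stages.values())


docker_budget = worst_case_seconds(DOCKER_STAGES)
print(docker_budget)               # 1860
print(docker_budget - K8S_BUDGET)  # 1260
```

Under these assumptions, Docker can consume up to 1860 seconds before the 600-second k8s check starts, which is 1260 seconds more than the k8s check itself is allowed.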
We still have to understand why Docker took 198 seconds to install on the node that succeeded and much longer on the failing one. Looking at cloud-init-output.log, I can see:
apt-get update
W: Failed to fetch http://azure.archive.ubuntu.com/ubuntu/dists/xenial/InRelease Could not connect to azure.archive.ubuntu.com:80 (51.137.52.58), connection timed out
W: Failed to fetch http://azure.archive.ubuntu.com/ubuntu/dists/xenial-updates/InRelease Unable to connect to azure.archive.ubuntu.com:http:
W: Failed to fetch http://azure.archive.ubuntu.com/ubuntu/dists/xenial-backports/InRelease Unable to connect to azure.archive.ubuntu.com:http:
W: Some index files failed to download. They have been ignored, or old ones used instead.
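One way to reason about transient fetch failures like the ones in this log is a bounded retry loop with a per-attempt timeout. The sketch below is only an illustration: `retry_command` and its parameters are hypothetical names, and this is not how the acs-engine provisioning scripts are implemented. The `run` and `sleep` arguments exist only so the loop can be exercised without running a real command.

```python
import subprocess
import time


def retry_command(cmd, retries, wait_seconds, timeout_seconds,
                  run=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with status 0, at most `retries` times.

    Returns the 1-based number of the attempt that succeeded,
    or 0 if every attempt failed or timed out.
    """
    for attempt in range(1, retries + 1):
        try:
            result = run(cmd, timeout=timeout_seconds)
            if result.returncode == 0:
                return attempt
        except subprocess.TimeoutExpired:
            # Treat a hung attempt the same as a failed one.
            pass
        if attempt < retries:
            sleep(wait_seconds)
    return 0
```

A call such as `retry_command(["apt-get", "update"], retries=10, wait_seconds=5, timeout_seconds=120)` would retry up to ten times. Note that the failures in the log above are reported as warnings (`W:`), so the exit status alone may not reveal them; a real check would probably also need to inspect the command output.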
Anything else we need to know:
@CecileRobertMichon @jackfrancis