
When docker takes too long to install, ensureK8s fails #2645

Closed
khaldoune opened this issue Apr 9, 2018 · 4 comments

@khaldoune

Is this a request for help?:
YES

Is this an ISSUE or FEATURE REQUEST? (choose one):
BOTH

What version of acs-engine?:
commit d5df071

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes 1.9.6

What happened:
Deployment fails while cluster is healthy

What you expected to happen:
Deployment OK

How to reproduce it (as minimally and precisely as possible):

Do you have any explanation?

From my understanding:

We wait 900 seconds for the docker install (ensureDockerInstallCompleted), and even if it is not installed by then, no error is thrown; we continue with ensureDocker(), which calls systemctlEnableAndCheck. That tries for up to 900 seconds to enable docker; if it is not enabled, we exit with error code 5, and if it is enabled, we try to start it (900 1 60).

That means docker has 900+900+60 seconds to come up, while k8s has only 600 seconds.

In some cases, when docker takes too long to start, ensureK8s() fails because 600 seconds are no longer enough for kubectl cluster-info to respond.
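
In shell terms, the chain looks roughly like this (a sketch to make the timeouts comparable; wait_for and the exact checks are my guesses, not the actual acs-engine code):

```bash
wait_for() {                       # usage: wait_for <timeout_secs> <cmd...>
    local deadline=$(( $(date +%s) + $1 ))
    shift
    until "$@"; do
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep 1
    done
}

# ensureDockerInstallCompleted: waits up to 900s, but the result is ignored
wait_for 900 command -v docker >/dev/null

# ensureDocker -> systemctlEnableAndCheck: up to 900s to enable, exit 5 if not
wait_for 900 systemctl enable docker || exit 5
# ...then more retries to start it (the "(900 1 60)" arguments above)
wait_for 900 systemctl start docker

# ensureK8s: only 600s for the API server to answer
wait_for 600 kubectl cluster-info >/dev/null 2>&1 || exit 1
```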

We still need to understand why docker took 198 seconds to install on the node that succeeded and much longer on the failing one. If I look at cloud-init-output.log, I can see:

apt-get update
W: Failed to fetch http://azure.archive.ubuntu.com/ubuntu/dists/xenial/InRelease Could not connect to azure.archive.ubuntu.com:80 (51.137.52.58), connection timed out
W: Failed to fetch http://azure.archive.ubuntu.com/ubuntu/dists/xenial-updates/InRelease Unable to connect to azure.archive.ubuntu.com:http:
W: Failed to fetch http://azure.archive.ubuntu.com/ubuntu/dists/xenial-backports/InRelease Unable to connect to azure.archive.ubuntu.com:http:
W: Some index files failed to download. They have been ignored, or old ones used instead.

Anything else we need to know:
@CecileRobertMichon @jackfrancis

@CecileRobertMichon
Contributor

CecileRobertMichon commented Apr 9, 2018

@khaldoune we have been seeing a lot of these issues today; #2641 was just merged to ensure we wait long enough for Docker to be installed. This is mainly due to networking errors while running apt-get update in cloud-init (as shown by the log you pasted). The next step is to figure out why those errors are happening and how we can fix or mitigate them so we don't have to wait an hour for a working cluster...

edit: note that ensureK8s() runs after ensureDocker() (not in parallel), so if Docker fails to be enabled before the timeout, we will exit before even checking for nodes.
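
For what it's worth, the usual way to absorb transient mirror timeouts like the ones in the log above is a retry with backoff around apt-get update; a rough sketch (retry_apt_update is a made-up name, not an acs-engine helper):

```bash
retry_apt_update() {
    local i
    for i in 1 2 3 4 5 6 7 8 9 10; do
        apt-get update && return 0
        sleep $(( i * 5 ))   # back off: 5s, 10s, ... up to 50s
    done
    return 1
}
```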

@khaldoune
Author

@CecileRobertMichon On the same page for 1 and 2.
Point 2 means that 10 minutes are sometimes not enough to get a response from kubectl cluster-info.
Increase to 15?
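
Something along these lines is what I have in mind, with the timeout pulled out so bumping it is a one-line change (a sketch; K8S_HEALTH_TIMEOUT is a name I made up, not the script's actual variable):

```bash
K8S_HEALTH_TIMEOUT=${K8S_HEALTH_TIMEOUT:-600}   # 600 today; 900 would be 15 min
deadline=$(( $(date +%s) + K8S_HEALTH_TIMEOUT ))
until kubectl cluster-info >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
        echo "kubectl cluster-info did not answer within ${K8S_HEALTH_TIMEOUT}s" >&2
        exit 1
    fi
    sleep 5
done
```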

@CecileRobertMichon
Contributor

By the time ensureK8s() runs, everything should be installed, so 10 minutes should be plenty. The issue we were seeing was ensureDocker() running in parallel with the Docker installation in cloud-init, which meant that ensureDocker() would sometimes time out before Docker was installed. We increased the timeout to 60 minutes, which should be enough to catch > 99.9% of network flakiness. Did I answer your question?
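
For reference, the behaviour after #2641 is roughly this (my sketch, assuming a dpkg-based check; the real script may test for completion differently):

```bash
DOCKER_INSTALL_TIMEOUT=3600   # 60 minutes, up from the old 900s
deadline=$(( $(date +%s) + DOCKER_INSTALL_TIMEOUT ))
# block until cloud-init has actually finished installing Docker
until dpkg -s docker-engine >/dev/null 2>&1; do
    [ "$(date +%s)" -ge "$deadline" ] && exit 1
    sleep 10
done
# only now run ensureDocker()/ensureK8s(), so their own timeouts start
# from a machine where Docker is actually present
```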

@khaldoune
Author

Thanks @CecileRobertMichon
