Race condition in master setup #2162
Comments
I posted this in #sig-azure, where it was also seen by @mwieczorek, who noted that there was a related error in the cloud-init logs. We also get the same:
This would seem to strengthen the idea that it is some type of race?
Provisioning by cloud-init and by the VM extensions runs in parallel. For the etcd setup, the relevant steps each one performs are the following.
VM extensions
Runs
cloud-init
During a failed deployment this morning, we saw that the cloud-init script starts right away when the machine boots, but it took 3 minutes longer for
PR #2163 increases the timeout for waiting for the certificates to become available and aborts
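As an illustration of the wait-then-abort behaviour described above, here is a minimal sketch. It is not taken from PR #2163; the function name `wait_for_file` and the timeout handling are hypothetical, and the real script may poll differently.

```shell
# Hypothetical helper (not from acs-engine): block until a file exists,
# or return non-zero once the timeout (in seconds) has elapsed.
wait_for_file() {
    local path="$1"
    local timeout_secs="$2"
    local waited=0
    while [ ! -f "$path" ]; do
        if [ "$waited" -ge "$timeout_secs" ]; then
            echo "timed out after ${timeout_secs}s waiting for ${path}" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A caller could then abort provisioning explicitly, for example `wait_for_file /etc/kubernetes/certs/etcdserver.key 600 || exit 1`, where the 600 second value is only a placeholder.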
Is this a request for help?:
No
Is this an ISSUE or FEATURE REQUEST? (choose one):
Issue
What version of acs-engine?:
master
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes 1.9.2
What happened:
Cluster with three masters fails to deploy. Some masters (different number each deploy) fail with
Looking at the logs in /var/log/azure/cluster-provision.log, the masters that fail to come up drop out of the script here. However, this does not seem to be the root cause.
Investigating further, etcd never starts up, which turns out to be because /etc/kubernetes/certs/etcdserver.key and etcdpeer(number).key are owned by root. This is set here. On the machines that successfully start, the keys are owned by etcd, so there seems to be some race condition here. It seems the correct permissions are set by cloud-init? I can't easily see how these two interact, so the race condition is a guess from available data.
Anyway, is there a good reason for not setting etcd as the owner in the customscript.sh also?
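To confirm the ownership mismatch on a given master, a check along these lines may help. This is a sketch only: `owned_by` is a hypothetical helper, and it assumes GNU `stat` (as found on the Ubuntu-based VMs), where `-c '%U'` prints the owning user name.

```shell
# Hypothetical helper: succeed only if the file is owned by the expected user.
# Returns 2 if the file cannot be inspected (for example, it does not exist yet).
owned_by() {
    local path="$1"
    local expected="$2"
    local actual
    # Assign separately from 'local' so the exit status of stat is preserved.
    actual=$(stat -c '%U' "$path" 2>/dev/null) || return 2
    [ "$actual" = "$expected" ]
}
```

For example, `owned_by /etc/kubernetes/certs/etcdserver.key etcd || echo "wrong owner"` would flag a key that is still owned by root.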
What you expected to happen:
Successful deploy
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know:
This took a bit longer to diagnose than necessary since the previously run function ensureEtcd does not exit the script if it fails. Shouldn't it?
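One way to make such a step fail loudly is to have the retry loop return a non-zero status and let the caller exit. The sketch below is an assumption about how this could be structured, not the actual `ensureEtcd` code; `retry_or_fail` is a hypothetical name.

```shell
# Hypothetical retry wrapper: run a command up to N times with a delay
# between attempts, and return non-zero if every attempt fails.
retry_or_fail() {
    local attempts="$1"
    local delay_secs="$2"
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Only pause when another attempt will follow.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay_secs"
        fi
        i=$((i + 1))
    done
    echo "command failed after ${attempts} attempts: $*" >&2
    return 1
}
```

The calling script would then stop at the first failing step, for example `retry_or_fail 60 5 some_etcd_health_check || exit 1`, where `some_etcd_health_check` stands in for whatever check the real function performs.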