
Race condition in master setup #2162

Closed
carlpett opened this issue Jan 26, 2018 · 2 comments · Fixed by #2160

Comments

@carlpett
Contributor

Is this a request for help?:
No

Is this an ISSUE or FEATURE REQUEST? (choose one):
Issue

What version of acs-engine?:
master


Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes 1.9.2

What happened:
Cluster with three masters fails to deploy. Some masters (different number each deploy) fail with

statusMessage:{"status":"Failed","error":{"code":"ResourceOperationFailure","message":"The resource operation completed with terminal provisioning state 'Failed'.","details":[{"code":"VMExtensionProvisioningError","message":"VM has reported a failure when processing extension 'cse2'. Error message: \"Enable failed: failed to execute command: command terminated with exit status=3\n[stdout]\n\n[stderr]\n\"."}]}}

Looking at the logs in /var/log/azure/cluster-provision.log, the masters that fail to come up drop out of the script here. However, this does not seem to be the root cause.

Investigating further, etcd never starts up, which turns out to be because /etc/kubernetes/certs/etcdserver.key and etcdpeer(number).key are owned by root. This is set here. On the machines that successfully start, the keys are owned by etcd, so there seems to be some race condition here. It seems the correct ownership is supposed to be set by cloud-init? I can't easily see how these two interact, so the race condition is a guess from the available data.

Anyway, is there a good reason for not setting etcd as the owner in customscript.sh as well?
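For illustration, a minimal sketch of what that could look like (the paths are the ones from the errors below, the etcd user/group name is an assumption, and this is not the actual customscript.sh):

  # Sketch only: after the certificates are written, hand them to etcd
  # so the service can start regardless of when cloud-init gets there.
  for key in /etc/kubernetes/certs/etcdserver.key /etc/kubernetes/certs/etcdpeer*.key; do
    [ -e "$key" ] && chown etcd:etcd "$key"
  done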

What you expected to happen:
Successful deploy

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:
This took a bit longer to diagnose than necessary since the previously run function ensureEtcd does not exit the script if it fails. Shouldn't it?
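For illustration only (the real ensureEtcd in provision.sh may be structured differently, and the systemctl-based health check here is an assumption), the behaviour being asked for is roughly:

  # Sketch: make ensureEtcd fail the provisioning script instead of
  # silently continuing to the later ensureApiserver step.
  ensureEtcd() {
    for i in $(seq 1 60); do
      systemctl is-active --quiet etcd && return 0
      sleep 5
    done
    echo "etcd did not become active in time" >&2
    exit 1
  }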

@carlpett
Contributor Author

carlpett commented Jan 26, 2018

I posted this in #sig-azure, where it was also seen by @mwieczorek, who noted a related error in the cloud-init logs. We get the same:

/var/log/cloud-init-output.log:
/bin/chown: cannot access '/etc/kubernetes/certs/etcdserver.key': No such file or directory
/bin/chown: cannot access '/etc/kubernetes/certs/etcdpeer2.key': No such file or directory

This would seem to strengthen the idea that it is some type of race?

@duga3

duga3 commented Jan 26, 2018

The provisioning by cloud-init and VM extensions runs in parallel. The steps relevant to the etcd setup are the following.

VM extensions

Runs provision.sh, which:

  • Stores the certificates on the system
  • Sets their ownership to root

cloud-init

  • Sets up etcd (setup-etcd.sh)
  • Waits 60 seconds for the certificates to become available; after that timeout the script just continues
  • Sets ownership of the certificates to etcd; that step fails as mentioned above (see the sketch below). Without it, etcd can't start, but provision.sh doesn't abort until the timeout of ensureApiserver
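A rough sketch of that cloud-init step, to make the race concrete (the 60-second figure comes from the description above; the paths and the details of the real setup-etcd.sh are assumptions):

  # Wait a fixed time for the certs that provision.sh writes, then chown them.
  # If the VM extension starts late, the loop times out, the chown fails,
  # and etcd is left unable to read its keys.
  for i in $(seq 1 60); do
    [ -f /etc/kubernetes/certs/etcdserver.key ] && break
    sleep 1
  done
  chown etcd:etcd /etc/kubernetes/certs/etcd*.key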

During a failed deployment this morning, we saw that the cloud-init script starts right away when the machine boots, but it took another 3 minutes for provision.sh to start. That is longer than the timeout in setup-etcd.sh, so the ownership of the certificates couldn't be set correctly: the files simply did not exist yet when the cloud-init script reached that step.

PR #2163 increases the timeout for waiting for the certificates to become available and aborts provision.sh at the ensureEtcd step if etcd doesn't become available in time.
