
Race condition in master setup #2162

Closed
carlpett opened this issue Jan 26, 2018 · 2 comments · Fixed by #2160

Comments

@carlpett
Contributor

Is this a request for help?:
No

Is this an ISSUE or FEATURE REQUEST? (choose one):
Issue

What version of acs-engine?:
master


Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes 1.9.2

What happened:
Cluster with three masters fails to deploy. Some masters (different number each deploy) fail with

statusMessage:{"status":"Failed","error":{"code":"ResourceOperationFailure","message":"The resource operation completed with terminal provisioning state 'Failed'.","details":[{"code":"VMExtensionProvisioningError","message":"VM has reported a failure when processing extension 'cse2'. Error message: \"Enable failed: failed to execute command: command terminated with exit status=3\n[stdout]\n\n[stderr]\n\"."}]}}

Looking at the logs in /var/log/azure/cluster-provision.log, the masters that fail to come up drop out of the script here. However, this does not seem to be the root cause.

Investigating further, etcd never starts up, which turns out to be because /etc/kubernetes/certs/etcdserver.key and etcdpeer(number).key are owned by root. This is set here. On the machines that successfully start, the keys are owned by etcd, so there seems to be some race condition here. It seems the correct ownership is supposed to be set by cloud-init? I can't easily see how these two interact, so the race condition is a guess from the available data.

Anyway, is there a good reason for not setting etcd as the owner in customscript.sh as well?
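For illustration, a minimal sketch of what that could look like (the paths are the ones from the errors below, the etcd user/group name is an assumption, and this is not the actual customscript.sh):

  # Sketch only: after the certificates are written, hand them to etcd
  # so the service can start regardless of when cloud-init gets there.
  for key in /etc/kubernetes/certs/etcdserver.key /etc/kubernetes/certs/etcdpeer*.key; do
    [ -e "$key" ] && chown etcd:etcd "$key"
  done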

What you expected to happen:
Successful deploy

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:
This took a bit longer to diagnose than necessary since the previously run function ensureEtcd does not exit the script if it fails. Shouldn't it?
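For illustration only (the real ensureEtcd in provision.sh may be structured differently, and the systemctl-based health check here is an assumption), the behaviour being asked for is roughly:

  # Sketch: make ensureEtcd fail the provisioning script instead of
  # silently continuing to the later ensureApiserver step.
  ensureEtcd() {
    for i in $(seq 1 60); do
      systemctl is-active --quiet etcd && return 0
      sleep 5
    done
    echo "etcd did not become active in time" >&2
    exit 1
  }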

@carlpett
Contributor Author

carlpett commented Jan 26, 2018

I posted this in #sig-azure, where it was also seen by @mwieczorek, who noted a related error in the cloud-init logs. We get the same:

/var/log/cloud-init-output.log:
/bin/chown: cannot access '/etc/kubernetes/certs/etcdserver.key': No such file or directory
/bin/chown: cannot access '/etc/kubernetes/certs/etcdpeer2.key': No such file or directory

This would seem to strengthen the idea that it is some type of race?

@duga3

duga3 commented Jan 26, 2018

The provisioning by cloud-init and VM extensions runs in parallel. The steps relevant to the etcd setup are the following.

VM extensions

Runs provision.sh, which:

  • Stores the certificates on the system
  • Sets their ownership to root

cloud-init

  • Sets up etcd (setup-etcd.sh)
  • Waits 60 seconds for the certificates to become available; after that timeout the script just continues
  • Sets ownership of the certificates to etcd; that step fails as mentioned above (see the sketch below). Without it, etcd can't start, but provision.sh doesn't abort until the timeout of ensureApiserver
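A rough sketch of that cloud-init step, to make the race concrete (the 60-second figure comes from the description above; the paths and the details of the real setup-etcd.sh are assumptions):

  # Wait a fixed time for the certs that provision.sh writes, then chown them.
  # If the VM extension starts late, the loop times out, the chown fails,
  # and etcd is left unable to read its keys.
  for i in $(seq 1 60); do
    [ -f /etc/kubernetes/certs/etcdserver.key ] && break
    sleep 1
  done
  chown etcd:etcd /etc/kubernetes/certs/etcd*.key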

During a failed deployment this morning, we saw that the cloud-init script starts right away when the machine boots, but it took another 3 minutes for provision.sh to start. That is longer than the timeout in setup-etcd.sh, so the ownership of the certificates couldn't be set correctly: the files simply did not exist yet when the cloud-init script reached that step.

PR #2163 increases the timeout for waiting for the certificates to become available and aborts provision.sh at the ensureEtcd step if etcd doesn't become available in time.
