
Control plane expansion may lead to broken etcd cluster #2028

Closed · thomasdanan opened this issue Nov 8, 2019 · 2 comments · Fixed by #2147
Assignees: Ebaneck
Labels: kind:bug (Something isn't working), topic:deployment (Bugs in or enhancements to deployment stages)

Comments

thomasdanan (Contributor) commented Nov 8, 2019

Component:

'etcd', 'salt'

What happened:

While doing a control plane expansion (on a Vagrant environment), the system tries to register the new node in the etcd cluster even though etcd is not running on the new node. As a result, etcd cannot start and the cluster becomes unusable.

What was expected:

If, for any reason, we are not able to start etcd on the newly added control plane node, we should abort immediately and, more importantly, not try to register it with the existing etcd cluster.
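
To make that expectation concrete, here is a minimal sketch (not the MetalK8s implementation) of the kind of guard and rollback the orchestration could apply. It assumes the python-etcd3 client is available; node_ready_for_etcd and deploy_etcd_manifest are hypothetical helpers standing in for the actual Salt states.

# Sketch only: register the new etcd peer only once the node is confirmed
# ready, and undo the registration if the member never starts, so the
# existing cluster keeps quorum.
import time

import etcd3  # python-etcd3, assumed available


def node_ready_for_etcd(node):
    """Hypothetical check: kubelet and containerd up, etcd image pre-pulled."""
    raise NotImplementedError


def deploy_etcd_manifest(node):
    """Hypothetical step: write the etcd static Pod manifest on the node."""
    raise NotImplementedError


def expand_etcd_cluster(existing_host, new_node, new_peer_url, timeout=120):
    if not node_ready_for_etcd(new_node):
        # Abort before touching the member list of the existing cluster.
        raise RuntimeError("%s is not ready to run etcd, aborting" % new_node)

    client = etcd3.client(host=existing_host)  # TLS options omitted
    member = client.add_member([new_peer_url])

    # Only now create the manifest on the new node, then wait for the member
    # to actually join (an unstarted member has an empty name in etcd).
    deploy_etcd_manifest(new_node)
    deadline = time.time() + timeout
    while time.time() < deadline:
        if any(m.id == member.id and m.name for m in client.members):
            return
        time.sleep(5)

    # The member never started: roll the registration back instead of leaving
    # a dead peer that breaks quorum.
    client.remove_member(member.id)
    raise RuntimeError("etcd never started on %s; registration undone" % new_node)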

Steps to reproduce

On a Vagrant environment:
1. Deploy the bootstrap node
2. Add a node with the infra / control plane / worker plane roles
3. Click on "Deploy node" and follow the progress

The deployment ends with the following error:

Error: Register the node into etcd cluster - Runner function 'state.orchestrate' failed.

and etcd on the bootstrap node is no longer able to start:

[root@bootstrap containers]# crictl ps -a | grep etcd
cb9f42e21e7a4       2c4adeb21b4ff       5 minutes ago       Exited              etcd                            13                  5eaa0b27d9c3e

etcd on the newly added node was never scheduled (no logs available).

Restarting kubelet on both nodes fixes the issue.

Resolution proposal (optional):

thomasdanan added the kind:bug, topic:deployment and moonshot labels on Nov 8, 2019
thomasdanan added this to the MetalK8s 2.4.2 milestone on Nov 8, 2019
Ebaneck self-assigned this on Dec 16, 2019

NicolasT (Contributor) commented:

Is the work on this (design of the fix) laid out somewhere?

gdemonet (Contributor) commented:

> Is the work on this (design of the fix) laid out somewhere?

No, we forgot to add it here. The commit is small; maybe its message should be more detailed to explain the rationale.

In the meantime, here is what I think we should do:
• split up deploy_node a bit, which makes it more general/versatile in 'higher level' scripted environments (it would help with deployment automation)
• When using deploy_node proper, have it run in more stages: after salt-ssh brings the minion up, first put the node in a metalk8s.roles.etcd.prepared (or whatever) state that installs all the dependencies (kubelet, containerd, ...) and pre-pulls the etcd image but does not create the manifest; then run the 'register etcd peer to cluster' step on the existing cluster; then apply the metalk8s.roles.etcd.running (or whatever) state on the new node, which basically only creates the manifest YAML and waits for the member to be up and running (see the sketch after this list)
• Then, we have metalk8s.roles.etcd (as used by the highstate), which basically includes .not-running and .running
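
For illustration, here is a rough sketch of that staged flow. The helper and the 'register etcd peer' state name are hypothetical (only the metalk8s.roles.etcd.prepared/.running names come from the proposal above), and the real orchestrate would of course be written as Salt states rather than Python; the point is the ordering of the stages.

# Sketch only: the ordering of the stages is what matters, not the names.
def apply_state(target, state):
    """Stand-in for applying a Salt state (or orchestrate step) on a target."""
    raise NotImplementedError


def deploy_control_plane_node(new_node, existing_etcd_node):
    # 1. Install the dependencies (kubelet, containerd, ...) and pre-pull the
    #    etcd image, but do not create the manifest yet.
    apply_state(new_node, "metalk8s.roles.etcd.prepared")

    # 2. Register the new peer from the existing etcd cluster. If stage 1
    #    failed we never get here, so the member list stays untouched.
    apply_state(existing_etcd_node, "register-etcd-peer")  # hypothetical name

    # 3. Create the etcd manifest on the new node and wait for the member to
    #    be up and running before declaring success.
    apply_state(new_node, "metalk8s.roles.etcd.running")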

@NicolasT <private conversation>
