
Control plane expansion may lead to broken etcd cluster #2028

Closed · thomasdanan opened this issue Nov 8, 2019 · 2 comments · Fixed by #2147
Assignees: Ebaneck
Labels: kind:bug (Something isn't working), topic:deployment (Bugs in or enhancements to deployment stages)

Comments

thomasdanan (Contributor) commented Nov 8, 2019

Component:

'etcd', 'salt'

What happened:

While doing a control plane expansion (on a Vagrant environment), the system tries to register the new node in the etcd cluster even though etcd is not running on the new node. As a result, etcd cannot start and the cluster becomes unusable.

What was expected:

If, for any reason, we are not able to start etcd on the newly added control plane node, we should abort immediately and, more importantly, not try to register it with the existing etcd cluster.
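
To make that expectation concrete, here is a minimal sketch (not the MetalK8s implementation) of the kind of guard and rollback the orchestration could apply. It assumes the python-etcd3 client is available; node_ready_for_etcd and deploy_etcd_manifest are hypothetical helpers standing in for the actual Salt states.

# Sketch only: register the new etcd peer only once the node is confirmed
# ready, and undo the registration if the member never starts, so the
# existing cluster keeps quorum.
import time

import etcd3  # python-etcd3, assumed available


def node_ready_for_etcd(node):
    """Hypothetical check: kubelet and containerd up, etcd image pre-pulled."""
    raise NotImplementedError


def deploy_etcd_manifest(node):
    """Hypothetical step: write the etcd static Pod manifest on the node."""
    raise NotImplementedError


def expand_etcd_cluster(existing_host, new_node, new_peer_url, timeout=120):
    if not node_ready_for_etcd(new_node):
        # Abort before touching the member list of the existing cluster.
        raise RuntimeError("%s is not ready to run etcd, aborting" % new_node)

    client = etcd3.client(host=existing_host)  # TLS options omitted
    member = client.add_member([new_peer_url])

    # Only now create the manifest on the new node, then wait for the member
    # to actually join (an unstarted member has an empty name in etcd).
    deploy_etcd_manifest(new_node)
    deadline = time.time() + timeout
    while time.time() < deadline:
        if any(m.id == member.id and m.name for m in client.members):
            return
        time.sleep(5)

    # The member never started: roll the registration back instead of leaving
    # a dead peer that breaks quorum.
    client.remove_member(member.id)
    raise RuntimeError("etcd never started on %s; registration undone" % new_node)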

Steps to reproduce

On a Vagrant environment:
1. Deploy the bootstrap node
2. Add a node with the infra / control plane / worker plane roles
3. Click on "Deploy node" and follow the progress

The deployment ends with the following error:

Error: Register the node into etcd cluster - Runner function 'state.orchestrate' failed.

and etcd on the bootstrap node is no longer able to start:

[root@bootstrap containers]# crictl ps -a | grep etcd
cb9f42e21e7a4       2c4adeb21b4ff       5 minutes ago       Exited              etcd                            13                  5eaa0b27d9c3e

etcd on the newly added node was never scheduled (no logs available).

Restarting kubelet on both nodes fixes the issue.

Resolution proposal (optional):

thomasdanan added the kind:bug, topic:deployment and moonshot labels on Nov 8, 2019
thomasdanan added this to the MetalK8s 2.4.2 milestone on Nov 8, 2019
Ebaneck self-assigned this on Dec 16, 2019

NicolasT (Contributor) commented:

Is the work on this (design of the fix) laid out somewhere?

gdemonet (Contributor) commented:

> Is the work on this (design of the fix) laid out somewhere?

No, we forgot to add it here. The commit is small; maybe its message should be more detailed to explain the rationale.

In the meantime, here is what I think we should do:
• split up deploy_node a bit, which makes it more general/versatile in 'higher level' scripted environments (it would help with deployment automation)
• When using deploy_node proper, have it run in more stages: after salt-ssh brings the minion up, first put the node in a metalk8s.roles.etcd.prepared (or whatever) state that installs all the dependencies (kubelet, containerd, ...) and pre-pulls the etcd image but does not create the manifest; then run the 'register etcd peer to cluster' step on the existing cluster; then apply the metalk8s.roles.etcd.running (or whatever) state on the new node, which basically only creates the manifest YAML and waits for the member to be up and running (see the sketch after this list)
• Then, we have metalk8s.roles.etcd (as used by the highstate), which basically includes .not-running and .running
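
For illustration, here is a rough sketch of that staged flow. The helper and the 'register etcd peer' state name are hypothetical (only the metalk8s.roles.etcd.prepared/.running names come from the proposal above), and the real orchestrate would of course be written as Salt states rather than Python; the point is the ordering of the stages.

# Sketch only: the ordering of the stages is what matters, not the names.
def apply_state(target, state):
    """Stand-in for applying a Salt state (or orchestrate step) on a target."""
    raise NotImplementedError


def deploy_control_plane_node(new_node, existing_etcd_node):
    # 1. Install the dependencies (kubelet, containerd, ...) and pre-pull the
    #    etcd image, but do not create the manifest yet.
    apply_state(new_node, "metalk8s.roles.etcd.prepared")

    # 2. Register the new peer from the existing etcd cluster. If stage 1
    #    failed we never get here, so the member list stays untouched.
    apply_state(existing_etcd_node, "register-etcd-peer")  # hypothetical name

    # 3. Create the etcd manifest on the new node and wait for the member to
    #    be up and running before declaring success.
    apply_state(new_node, "metalk8s.roles.etcd.running")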

@NicolasT <private conversation>
