Component:
'etcd', 'salt'
What happened:
While performing a control plane expansion (in a Vagrant environment), the system tries to register the new node into the etcd cluster even though etcd is not running on that node. As a result, etcd cannot start and the cluster becomes unusable.
What was expected:
If, for any reason, etcd cannot be started on the newly added control plane node, we should abort immediately and, more importantly, not register it into the existing etcd cluster.
Steps to reproduce
On a Vagrant environment:
Deploy bootstrap
Add a node with infra / control plane / worker plane roles
Click on Deploy node and follow the progress
It ends up with the following error:
Error: Register the node into etcd cluster - Runner function 'state.orchestrate' failed.
and etcd on bootstrap is not able to start anymore:
[root@bootstrap containers]# crictl ps -a | grep etcd
cb9f42e21e7a4 2c4adeb21b4ff 5 minutes ago Exited etcd 13 5eaa0b27d9c3e
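To see why the etcd container keeps exiting on the bootstrap node, its logs can be inspected with crictl (the container ID below is the one from the crictl ps output above):
[root@bootstrap containers]# crictl logs cb9f42e21e7a4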
etcd on newly added node was never scheduled (no logs available)
Restarting kubelet on both nodes fixes the issue.
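For the record, the workaround boils down to the following (assuming kubelet is managed by systemd on both nodes; "node-1" stands for the newly added control plane node):
[root@bootstrap ~]# systemctl restart kubelet
[root@node-1 ~]# systemctl restart kubelet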
Resolution proposal (optional):
Is the work on this (design of the fix) laid out somewhere?
No, we forgot to add it here. The commit is small; maybe its message should be more detailed to explain the rationale.
In the meantime:
What I think we should do is:
• split up deploy_node a bit, which makes it more general/versatile in 'higher level' scripted environments (it would help with deployment automation)
• When using deploy_node proper, run it in more stages (see the sketch after this list): after salt-ssh brings the minion up, first apply a metalk8s.roles.etcd.prepared (or similarly named) state which installs all dependencies (kubelet, containerd, ...) and pre-pulls the etcd image but does not create the manifest; then run the 'register etcd peer to cluster' step on the existing cluster; finally apply the metalk8s.roles.etcd.running (or similarly named) state on the new node, which basically only creates the manifest YAML and waits for the member to be up and running
• Then, we have metalk8s.roles.etcd (as used by the highstate) which basically includes .not-running and .running
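Roughly, the staged flow driven from the bootstrap node could then look like this. This is only a sketch: the state names are the placeholders used above, and the orchestration name, target, and pillar key are assumptions for illustration, not the actual MetalK8s ones.
# 1. Prepare the new node: install kubelet/containerd and pre-pull the etcd image, without creating the manifest
salt 'new-node' state.apply metalk8s.roles.etcd.prepared
# 2. Register the new peer against the existing etcd cluster (hypothetical orchestration name and pillar)
salt-run state.orchestrate metalk8s.orchestrate.register-etcd-member pillar='{"node": "new-node"}'
# 3. Only now drop the static pod manifest on the new node and wait for the member to join
salt 'new-node' state.apply metalk8s.roles.etcd.running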