Proposal
#9672 fixed a bug, introduced in 0.11, that caused nodes with no bootstrap_expect, or with bootstrap_expect set to 0, to bootstrap as standalone nodes rather than join a cluster.
In our experience, new instances on 0.11 successfully joined an existing cluster about 30% of the time. On 0.12, every node failed to auto-join a cluster.
Given the potential severity of this and the lack of a correct workaround, I think this fix should be backported to 0.11 and 0.12 if these versions are considered at all viable to run in a production setting.
Use-cases
Operators should be able to safely auto-join nodes to an existing cluster by setting bootstrap_expect to 0.
Setting bootstrap_expect to 0 is recommended to avoid potential split-brain scenarios where multiple Nomad clusters register in Consul. As I'm sure you know, this can cause anything from confusion to a major outage. The only workaround for the bug fixed in #9672 is to set bootstrap_expect to a higher value, which introduces the risk of a split-brain.
In addition to making a common, intended operational mode unusable, this bug is extremely hard to identify, because Nomad starts without errors on the affected node and registers as healthy in Consul. The only indication of a problem appears on nodes that have ACLs enabled, where any interaction with the agent receives a 403, though again, Nomad registers as completely healthy.
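Because the affected node looks healthy in Consul, one way to spot the problem is to ask the node for its Raft peer set and check whether it contains anyone other than itself. The following is a minimal sketch, not an official check: it assumes the agent's HTTP API is reachable at the default address, that GET /v1/status/peers returns a JSON list of "host:port" strings, and that an ACL token, if needed, is passed in the X-Nomad-Token header. The helper names looks_standalone and fetch_raft_peers are hypothetical.

```python
import json
import urllib.request


def looks_standalone(raft_peers, this_node_addr):
    """Return True when the peer set holds no server other than this node.

    raft_peers is assumed to be a list of "host:port" strings, and
    this_node_addr is this server's own RPC address in the same form.
    """
    others = [peer for peer in raft_peers if peer != this_node_addr]
    return len(others) == 0


def fetch_raft_peers(base_url="http://127.0.0.1:4646", token=None):
    """Fetch the Raft peer list from a Nomad agent (address is an assumption)."""
    request = urllib.request.Request(f"{base_url}/v1/status/peers")
    if token:
        # Assumption: the cluster has ACLs enabled and this token is valid.
        request.add_header("X-Nomad-Token", token)
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)
```

A node that was expected to join an existing cluster but for which looks_standalone(fetch_raft_peers(), "10.0.0.5:4647") is true (with the address replaced by the node's own) has probably bootstrapped on its own. A legitimately single-server cluster would also report true, so the result only means something when more than one server is expected.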
Attempted Solutions
The only solutions are to increase bootstrap_expect or to manually join nodes to clusters, both of which degrade (or eliminate) the use of auto-joining nodes to a running cluster.
Hi @luckymike! Thanks for raising it. I agree that the issue is severe and warrants a backport. I'll cut a 0.12 release that backports #9672 later this week.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.