Remove randomized startup delays #1198

Merged: 2 commits, Jun 10, 2021
site/cluster-formation.md: 76 changes (11 additions & 65 deletions)
@@ -121,12 +121,6 @@ so nodes don't have to (or cannot) explicitly register. However, the list of clu
is not predefined. Such backends usually include a no-op registration step
and apply one of the [race condition mitigation mechanisms](#initial-formation-race-condition) described below.

When a cluster is first formed and there are no registered nodes yet,
a natural race condition between booting
nodes occurs. Different backends [address this problem](#initial-formation-race-condition)
differently: some try to acquire a lock with an external service, others rely on randomized
delays. This problem does not apply to the backends that require listing all nodes ahead of time.

When the configured backend supports registration, nodes unregister when they stop.

If peer discovery isn't configured, or it [repeatedly fails](#discovery-retries),
@@ -216,9 +210,6 @@ The most basic way for a node to discover its cluster peers is to read a list
of nodes from the config file. The set of cluster members is assumed to be known at deployment
time.

[Race condition during initial cluster formation](#initial-formation-race-condition) is addressed
by using a randomized startup delay.

### Configuration

The peer nodes are listed using the `cluster_formation.classic_config.nodes` config setting:
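
A minimal sketch of what this setting typically looks like (the node names below are placeholders used purely for illustration):

<pre class="lang-ini">
cluster_formation.peer_discovery_backend = classic_config

# hypothetical node names used for illustration only
cluster_formation.classic_config.nodes.1 = rabbit@hostname1.eng.example.local
cluster_formation.classic_config.nodes.2 = rabbit@hostname2.eng.example.local
</pre>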
@@ -296,10 +287,6 @@ Both methods rely on AWS-specific APIs (endpoints) and features and thus cannot
other IaaS environments. Once a list of cluster member instances is retrieved,
final node names are computed using instance hostnames or IP addresses.
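
As an illustration, here is a minimal sketch of an AWS backend configuration that discovers peers via EC2 instance tags; the region and tag values are assumptions made up for the example:

<pre class="lang-ini">
cluster_formation.peer_discovery_backend = aws

# hypothetical region and tags used for illustration only
cluster_formation.aws.region = us-east-1
cluster_formation.aws.use_autoscaling_group = false
cluster_formation.aws.instance_tags.service = rabbitmq
cluster_formation.aws.instance_tags.environment = staging
</pre>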

When the AWS peer discovery mechanism is used, nodes will
delay their startup for a randomly picked value to reduce the
probability of a [race condition during initial cluster formation](#initial-formation-race-condition).

### <a id="peer-discovery-aws-credentials" class="anchor" href="#peer-discovery-aws-credentials">Configuration and Credentials</a>

Before a node can perform any operations on AWS, it needs to have a set of
@@ -445,9 +432,6 @@ This can lead to data loss and higher network traffic volume due to more frequen
data synchronisation of both [quorum queues](/quorum-queues.html)
and [classic queue mirrors](/ha.html) on newly joining nodes.

Stateless sets are also prone to the [natural race condition](#initial-formation-race-condition) during initial
cluster formation, unlike stateful sets that initialise pods [one by one](https://kubernetes.io/docs/tasks/run-application/run-replicated-stateful-application/#understanding-stateful-pod-initialization).

#### Use Persistent Volumes

How [storage is configured](https://kubernetes.io/docs/concepts/storage/persistent-volumes/)
@@ -465,25 +449,12 @@ and so on.
It is therefore highly recommended that `/etc/rabbitmq` is mounted as writeable and owned by
RabbitMQ's effective user (typically `rabbitmq`).

#### Use `OrderedReady` Pod Management Policy

The peer discovery mechanism will filter out nodes whose pods are not yet ready
(initialised) according to their readiness probe as reported by the Kubernetes API.
For example, if [pod management policy](https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#pod-management-policy)
of a stateful set is set to `Parallel`, some nodes may be discovered but will not be joined.
To work around this, the Kubernetes peer discovery plugin uses [randomized startup delays](/cluster-formation.html#initial-formation-race-condition).

Deployments that use the `OrderedReady` pod management policy start pods one by one and therefore
all discovered nodes will be ready to join. This policy is used by default by Kubernetes.

However, such deployments can run into a deadlock if the readiness probe used expects
the node to be fully booted. This is covered in the following section.
#### Use Most Basic Health Checks for RabbitMQ Pod Readiness Probes

A readiness probe that expects the node to be fully booted and have rejoined its cluster peers
can prevent a deployment that restarts all RabbitMQ pods and relies on the `OrderedReady` pod management policy from completing.
Deployments that use the `Parallel` pod management policy
will not be affected but must worry about the [natural race condition during initial cluster formation](#initial-formation-race-condition).
will not be affected.

One health check that does not expect a node to be fully booted and have schema tables synced is
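
A minimal sketch of such a check, assuming the standard `rabbitmq-diagnostics` CLI that ships with RabbitMQ:

<pre class="lang-bash">
# succeeds once the node's runtime is up and reachable,
# without requiring the node to be fully booted or to have synced schema tables
rabbitmq-diagnostics ping
</pre>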

@@ -623,23 +594,6 @@ cluster_formation.k8s.namespace_path = /var/run/secrets/kubernetes.io/serviceacc
cluster_formation.k8s.service_name = rmq-qa
</pre>

As mentioned above, stateful sets are the recommended way of running RabbitMQ on Kubernetes.
Stateful set pods are initialised one at a time. That effectively addresses
the natural [race condition during the initial cluster formation](#initial-formation-race-condition).
The randomized startup delay in such scenarios can use a significantly lower delay value range (e.g. 0 to 2 seconds):

<pre class="lang-ini">
cluster_formation.randomized_startup_delay_range.min = 0
cluster_formation.randomized_startup_delay_range.max = 2

cluster_formation.peer_discovery_backend = k8s

cluster_formation.k8s.host = kubernetes.default.example.local

# ...
</pre>


## <a id="peer-discovery-consul" class="anchor" href="#peer-discovery-consul">Peer Discovery Using Consul</a>

### Consul Peer Discovery Overview
@@ -1238,26 +1192,18 @@ cluster_formation.etcd.ssl_options.ciphers.12 = DHE-DSS-AES128-GCM-SHA256

## <a id="initial-formation-race-condition" class="anchor" href="#initial-formation-race-condition">Race Conditions During Initial Cluster Formation</a>

Consider a deployment where the entire cluster is provisioned at once and all nodes
start in parallel. In this case there is a natural race
condition between node registrations, and more than one node
can become "first to register" (that is, it discovers no existing peers
and thus starts as standalone).

Different peer discovery backends use different approaches to
minimize the probability of such a scenario. Some use locking
(etcd, Consul), while others use a technique known as randomized startup delay.
With a randomized startup delay, nodes delay their startup
for a randomly picked amount of time (between 5 and 60 seconds by default).

Some backends (config file, DNS) rely on a pre-configured set of peers and avoid
the issue that way.

The effective delay interval, if used, is logged on node boot.
For successful cluster formation, only one node should form the cluster initially, that is,
start as a standalone node and initialize its database. If this were not the case, multiple
clusters would be formed instead of just one, violating operator expectations.

Lastly, some mechanisms rely on ordered node startup provided by the underlying
provisioning and orchestration tool. [Kubernetes stateful sets](https://kubernetes.io/docs/tasks/run-application/run-replicated-stateful-application/#understanding-stateful-pod-initialization) are one example of an environment that offers such a guarantee.
Consider a deployment where the entire cluster is provisioned at once and all nodes start in parallel.
In this case, a natural race condition occurs between the starting nodes.
To prevent multiple nodes from forming separate clusters, peer discovery backends try to acquire a lock when either
forming the cluster (seeding) or joining a peer. What locks are used varies from backend to backend:

* Classic config file, K8s, and AWS backends use a built-in [locking library](https://erlang.org/doc/man/global.html#set_lock-3) provided by the runtime
* The Consul peer discovery backend sets a lock in Consul
* The etcd peer discovery backend sets a lock in etcd

## <a id="node-health-checks-and-cleanup" class="anchor" href="#node-health-checks-and-cleanup">Node Health Checks and Forced Removal</a>
