
Delay start by 30 seconds #1156

Merged
merged 2 commits into main on Oct 10, 2022

Conversation

ansd
Member

@ansd ansd commented Oct 7, 2022

The rabbitmq/cluster-operator relies on

PublishNotReadyAddresses: true,

in its headless service for peer discovery, so that the RabbitMQ cluster gets formed before the Pods are marked as Ready.
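
For illustration, here is a minimal sketch of such a headless Service using client-go types (the name suffix and selector are assumptions for the sketch, not the operator's actual values):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// headlessService sketches the kind of Service used for peer discovery:
// ClusterIP "None" makes the Service headless, and
// PublishNotReadyAddresses asks DNS to publish Pod records before the
// Pods pass their readiness checks.
func headlessService(name, namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name + "-nodes", // illustrative naming only
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			ClusterIP:                corev1.ClusterIPNone,
			PublishNotReadyAddresses: true,
			Selector:                 map[string]string{"app": name},
		},
	}
}
```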

However, as explained in kubernetes/kubernetes#92559, this promise is broken in Kubernetes, specifically in CoreDNS: by default, the Pods' DNS entries are not updated correctly within the first 30 seconds.

These networking issues early in the RabbitMQ boot process cause a series of problems:

In some of the issues above, the solution taken was to retry N times within RabbitMQ during the boot process.
In other cases, simply retrying within RabbitMQ is not even possible because things get stuck due to bugs in Erlang's global module (more details in the issues above).

Instead of using brittle workarounds within RabbitMQ, I suggest we start RabbitMQ only once we know that the Pods' DNS entries are up-to-date.

How can we ensure that the Pods' DNS entries are up-to-date?

One possibility is to query the DNS service in the init container, as demonstrated in rabbitmq/rabbitmq-server#5322 (comment). However, since some DNS services put a K8s Service object in front of their backends, it's not certain that all DNS backend pods are queried. The init container could therefore get false positives while, during the RabbitMQ boot process, other DNS servers with outdated caches are queried.
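
Roughly, such an init-container check could look like the following sketch (a hypothetical helper, not the code from the linked comment):

```go
package sketch

import (
	"net"
	"time"
)

// dnsReady sketches the init-container check discussed above: keep
// resolving the Pod's headless-service hostname until an address shows
// up. As noted in the text, this can yield a false positive: each
// lookup may be answered by a different DNS backend behind the
// Service, so a positive answer says nothing about the other
// backends' caches.
func dnsReady(hostname string, attempts int, wait time.Duration) bool {
	for i := 0; i < attempts; i++ {
		if addrs, err := net.LookupHost(hostname); err == nil && len(addrs) > 0 {
			return true
		}
		time.Sleep(wait)
	}
	return false
}
```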

Therefore, to keep things simple, I suggest the new default be to artificially delay starting RabbitMQ by 30 seconds. I chose 30 seconds because that's CoreDNS's default cache value. It's an inelegant solution which I've tried to avoid until now. However, reliable rolling updates (and cluster formations) are more important than 30 seconds of additional downtime while a node is being updated.

This new value will be configurable via spec.DelayStartSeconds. Users who have never experienced RabbitMQ boot issues (for example, because they use DNS backends which actually respect publishNotReadyAddresses=true, or because they explicitly decrease CoreDNS's default cache value) can safely reduce this artificial delay or disable it by setting the value to 0.
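
As a sketch, the new field could look like this in the CRD types (the Go type, pointer semantics, and JSON tag are assumptions for illustration, not the operator's actual code):

```go
package v1beta1 // hypothetical API package for the sketch

// RabbitmqClusterSpec shows only the new field.
type RabbitmqClusterSpec struct {
	// DelayStartSeconds is the artificial delay between Pod creation
	// and starting RabbitMQ. It defaults to 30 (CoreDNS's default
	// cache value); set it to 0 to disable the delay entirely.
	DelayStartSeconds *int32 `json:"delayStartSeconds,omitempty"`
}
```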

I tested 50 five-node cluster formations and rolling updates. With the new 30-second delay between creating the Pod and starting RabbitMQ, I never saw any clustering or upgrade issues.

ansd added 2 commits October 7, 2022 08:54
to make the vulnerability checker in GitHub actions happy
@ansd ansd marked this pull request as ready for review October 7, 2022 15:00
@ansd ansd requested a review from mkuratczyk October 7, 2022 15:02
@ansd
Member Author

ansd commented Oct 10, 2022

Without this PR, I sometimes see the following in the logs of a node being booted during a rolling update:

2022-10-10 09:51:52.533123+00:00 [debug] <0.606.0> Getting epmd port node 'rabbit', 10 retries left
2022-10-10 09:54:07.004227+00:00 [debug] <0.606.0> Getting epmd port node 'rabbit', 9 retries left

... more than 2 minutes of delay. In that regard, this PR even makes booting faster.

@MirahImage MirahImage merged commit 35a04b8 into main Oct 10, 2022
@MirahImage MirahImage deleted the delay-start branch October 10, 2022 10:30
ansd referenced this pull request in rabbitmq/rabbitmq-server Nov 8, 2023
Prior to this commit, global:sync/0 sometimes gets stuck, either when
performing a rolling update on Kubernetes or when creating a new
RabbitMQ cluster on Kubernetes.

When performing a rolling update, the node being booted will be stuck
in:
```
2022-07-26 10:49:58.891896+00:00 [debug] <0.226.0> == Plugins (prelaunch phase) ==
2022-07-26 10:49:58.891908+00:00 [debug] <0.226.0> Setting plugins up
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> Loading the following plugins: [cowlib,cowboy,rabbitmq_web_dispatch,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 rabbitmq_management_agent,amqp_client,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 rabbitmq_management,quantile_estimator,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 prometheus,rabbitmq_peer_discovery_common,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 accept,rabbitmq_peer_discovery_k8s,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 rabbitmq_prometheus]
2022-07-26 10:49:58.926373+00:00 [debug] <0.226.0> Feature flags: REFRESHING after applications load...
2022-07-26 10:49:58.926416+00:00 [debug] <0.372.0> Feature flags: registering controller globally before proceeding with task: refresh_after_app_load
2022-07-26 10:49:58.926450+00:00 [debug] <0.372.0> Feature flags: [global sync] @ [email protected]
```

During cluster creation, an example log of global:sync/0 being stuck can
be found in bullet point 2 of
#5331 (review)

When global:sync/0 is stuck, it never receives a message in line
https://github.com/erlang/otp/blob/bd05b07f973f11d73c4fc77d59b69f212f121c2d/lib/kernel/src/global.erl#L2942

This issue can be observed in both `kind` and GKE: `kind` uses
CoreDNS, GKE uses kube-dns.

CoreDNS does not resolve the hostnames of RabbitMQ and its peers
correctly for up to 30 seconds after node startup. This is because the
default cache value of CoreDNS is 30 seconds and CoreDNS has a bug
described in kubernetes/kubernetes#92559.

global:sync/0 is known to be buggy "in the presence of network failures"
unless the kernel parameter `prevent_overlapping_partitions` is set to
`true`.

When either:
1. setting the CoreDNS cache value to 1 second (see
#5322 (comment)
on how to set this value), or
2. setting the kernel parameter `prevent_overlapping_partitions` to `true`,

rolling updates do NOT get stuck anymore.

This means we are hitting a combination of:
1. Kubernetes DNS bug not updating DNS caches promptly for headless
   services with `publishNotReadyAddresses: true`, and
2. Erlang bug which causes global:sync/0 to hang forever in the presence
   of network failures.

The Erlang bug is fixed by setting `prevent_overlapping_partitions` to
`true` (the default in Erlang/OTP 25).

In RabbitMQ, however, we explicitly set `prevent_overlapping_partitions`
to `false` because we fear other issues could arise if we set this
parameter to `true`.

Luckily, to resolve this issue of global:sync/0 being stuck, we can
just call the function rabbit_node_monitor:global_sync/0, which
provides a workaround. This function was introduced 8 years ago in
9fcb31f.

With this commit applied, rolling updates no longer get stuck, and we
see in the debug log that the workaround is sometimes applied.

(cherry picked from commit 4bf78d8)
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Nov 10, 2023
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. As part of that task, we missed the fact that
`rabbit_peer_discovery:maybe_create_cluster/1` was called from the
Mnesia-specific code only, even though we knew about it: we had hit
many issues caused by the fact that `join_cluster` and peer discovery
use different code paths to create a cluster.

To add support for Khepri, the first version of this patch was to move
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node
to cluster with was selected. It could cause situations where multiple
clusters are created instead of one, unless out-of-band
counter-measures are used, like the 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same node, regardless of the order in which
   they become available. The adopted solution is to sort the list of
   discovered nodes with the following criteria, in that order (see
   the sketch after this list):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraints is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this lets the first node
   start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.
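
To make property 2 concrete, here is a minimal sketch of the
deterministic ordering in Go (the actual implementation is Erlang; the
record fields and start-time representation below are assumptions for
illustration):

```go
package sketch

import "sort"

// discoveredNode carries the membership view queried from one
// discovered node; field names are illustrative, not RabbitMQ's
// actual data structures.
type discoveredNode struct {
	Name        string
	ClusterSize int   // size of the cluster this node is part of
	StartTime   int64 // node start time, e.g. Unix seconds
}

// sortCandidates orders discovered nodes by the three criteria of
// property 2: larger clusters first, then older nodes, then names in
// alphabetical order. Every node computing this over the same inputs
// picks the same head of the list, which is the node that proceeds
// with boot without joining anyone.
func sortCandidates(nodes []discoveredNode) {
	sort.Slice(nodes, func(i, j int) bool {
		if nodes[i].ClusterSize != nodes[j].ClusterSize {
			return nodes[i].ClusterSize > nodes[j].ClusterSize // bigger cluster wins
		}
		if nodes[i].StartTime != nodes[j].StartTime {
			return nodes[i].StartTime < nodes[j].StartTime // older node wins
		}
		return nodes[i].Name < nodes[j].Name // tie-break alphabetically
	})
}
```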

This commit covers properties 1 to 4. The remaining properties will be
the scope of additional pull requests once this one works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.
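
As a sketch, these defaults would look roughly like this in
rabbitmq.conf (assuming the interval is expressed in milliseconds, as
for the pre-existing parameter):

```
cluster_formation.discovery_retry_limit = 30
cluster_formation.discovery_retry_interval = 1000
```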

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` anyway if the node was not virgin, so
  make it simpler and always call it in `rabbit_db_cluster`, regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Nov 13, 2023
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only. Even though we knew about it
because we hit many issues caused by the fact the `join_cluster` and
peer discovery use different code path to create a cluster.

To add support for Khepri, the first version of this patch was to move
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could cause situations where
multiple clusters are created instead of one, without resorting to
out-of-band counter-measures, like a 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criterias (in that order):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If the RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraint is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this will let the first node
   to start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining
properties will be the scope of additional pull requests after this one
works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` if the node was not virgin anyway. So
  make it simpler and always call it in `rabbit_db_cluster` regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Nov 16, 2023
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only. Even though we knew about it
because we hit many issues caused by the fact the `join_cluster` and
peer discovery use different code path to create a cluster.

To add support for Khepri, the first version of this patch was to move
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could cause situations where
multiple clusters are created instead of one, without resorting to
out-of-band counter-measures, like a 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criterias (in that order):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If the RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraint is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this will let the first node
   to start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining
properties will be the scope of additional pull requests after this one
works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` if the node was not virgin anyway. So
  make it simpler and always call it in `rabbit_db_cluster` regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Nov 23, 2023
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only. Even though we knew about it
because we hit many issues caused by the fact the `join_cluster` and
peer discovery use different code path to create a cluster.

To add support for Khepri, the first version of this patch was to move
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could cause situations where
multiple clusters are created instead of one, without resorting to
out-of-band counter-measures, like a 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criterias (in that order):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If the RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraint is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this will let the first node
   to start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining
properties will be the scope of additional pull requests after this one
works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` if the node was not virgin anyway. So
  make it simpler and always call it in `rabbit_db_cluster` regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Nov 27, 2023
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only. Even though we knew about it
because we hit many issues caused by the fact the `join_cluster` and
peer discovery use different code path to create a cluster.

To add support for Khepri, the first version of this patch was to move
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could cause situations where
multiple clusters are created instead of one, without resorting to
out-of-band counter-measures, like a 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criterias (in that order):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If the RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraint is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this will let the first node
   to start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining
properties will be the scope of additional pull requests after this one
works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` if the node was not virgin anyway. So
  make it simpler and always call it in `rabbit_db_cluster` regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Nov 29, 2023
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only. Even though we knew about it
because we hit many issues caused by the fact the `join_cluster` and
peer discovery use different code path to create a cluster.

To add support for Khepri, the first version of this patch was to move
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could cause situations where
multiple clusters are created instead of one, without resorting to
out-of-band counter-measures, like a 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criterias (in that order):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If the RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraint is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this will let the first node
   to start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining
properties will be the scope of additional pull requests after this one
works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` if the node was not virgin anyway. So
  make it simpler and always call it in `rabbit_db_cluster` regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Nov 30, 2023
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only. Even though we knew about it
because we hit many issues caused by the fact the `join_cluster` and
peer discovery use different code path to create a cluster.

To add support for Khepri, the first version of this patch was to move
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could cause situations where
multiple clusters are created instead of one, without resorting to
out-of-band counter-measures, like a 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criterias (in that order):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If the RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraint is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this will let the first node
   to start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining
properties will be the scope of additional pull requests after this one
works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` if the node was not virgin anyway. So
  make it simpler and always call it in `rabbit_db_cluster` regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Dec 4, 2023
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only. Even though we knew about it
because we hit many issues caused by the fact the `join_cluster` and
peer discovery use different code path to create a cluster.

To add support for Khepri, the first version of this patch was to move
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could cause situations where
multiple clusters are created instead of one, without resorting to
out-of-band counter-measures, like a 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criterias (in that order):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If the RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraint is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this will let the first node
   to start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining
properties will be the scope of additional pull requests after this one
works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` if the node was not virgin anyway. So
  make it simpler and always call it in `rabbit_db_cluster` regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Dec 5, 2023
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only. Even though we knew about it
because we hit many issues caused by the fact the `join_cluster` and
peer discovery use different code path to create a cluster.

To add support for Khepri, the first version of this patch was to move
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could cause situations where
multiple clusters are created instead of one, without resorting to
out-of-band counter-measures, like a 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criterias (in that order):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If the RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraint is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this will let the first node
   to start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining
properties will be the scope of additional pull requests after this one
works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` if the node was not virgin anyway. So
  make it simpler and always call it in `rabbit_db_cluster` regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Dec 5, 2023
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Dec 6, 2023
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Dec 7, 2023
michaelklishin pushed a commit to rabbitmq/rabbitmq-server that referenced this pull request Feb 29, 2024