rabbit_peer_discovery: Rewrite the core logic
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. As part of that task, we missed the fact that
`rabbit_peer_discovery:maybe_create_cluster/1` was called from the
Mnesia-specific code only, even though we were aware of the underlying
problem: we had already hit many issues caused by the fact that
`join_cluster` and peer discovery use different code paths to create a
cluster.

To add support for Khepri, the first version of this patch moved the
call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_mnesia` to `rabbit_db_cluster`. To achieve that, it made sense
to unify the code and simply call `rabbit_db_cluster:join/2` instead of
duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. It could lead to multiple clusters being
created instead of one unless out-of-band counter-measures were used,
like the 30-second delay added in the Kubernetes operator
(rabbitmq/cluster-operator#1156). This problem became even more
frequent when we tried to unify the code path and call `join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same one, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criteria, in that order (a
   sketch of the sorting and constraint checks follows this list):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraints is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this lets the first node
   start without acquiring the lock at all (see the flow sketch after
   the retry parameters below).

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.
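
To make properties 2 and 3 concrete, here is a minimal Erlang sketch of
the sorting and constraint checks. The `node_info()` map, the helper
names and the way the cluster size hint is passed in are illustrative
assumptions, not the actual internal API of `rabbit_peer_discovery`.

```erlang
-module(peer_discovery_selection_sketch).
-export([select_node_to_join/2]).

%% Each discovered node is annotated with the view it reports about its own
%% cluster: the size of that cluster, the node's start time and its name.
-type node_info() :: #{node := node(),
                       cluster_size := pos_integer(),
                       start_time := integer()}.

%% Sort criteria from property 2: bigger clusters first, then older nodes,
%% then alphabetical order of the node name as the final tie breaker.
-spec sort_nodes([node_info()]) -> [node_info()].
sort_nodes(NodeInfos) ->
    lists:sort(
      fun(#{node := NameA, cluster_size := SizeA, start_time := StartA},
          #{node := NameB, cluster_size := SizeB, start_time := StartB}) ->
              {-SizeA, StartA, NameA} =< {-SizeB, StartB, NameB}
      end, NodeInfos).

%% Constraints from property 3: the list must contain this node and, if the
%% cluster size hint is greater than 1, at least two nodes in total.
%% `ClusterSizeHint` is expected to already be the maximum of the configured
%% target cluster size hint and the length of the backend's node list.
-spec select_node_to_join([node_info()], pos_integer()) -> {ok, node()} | retry.
select_node_to_join(NodeInfos, ClusterSizeHint) ->
    Names = [N || #{node := N} <- sort_nodes(NodeInfos)],
    ContainsSelf = lists:member(node(), Names),
    EnoughNodes = ClusterSizeHint =< 1 orelse length(Names) >= 2,
    case ContainsSelf andalso EnoughNodes of
        true ->
            %% The first node in the sorted list proceeds with its boot;
            %% every other node will try to join it.
            {ok, hd(Names)};
        false ->
            %% The caller restarts the whole peer discovery process after a
            %% delay (see the retry parameters below).
            retry
    end.
```

Because the sort key is identical everywhere, all nodes converge on the
same first node regardless of the order in which they run the
selection: that node simply continues booting, while every other node
joins it.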

This commit covers properties 1 to 4. The remaining properties will be
the scope of additional pull requests once this one works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.
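
The retry loop wraps the whole sequence described in properties 1 to 4.
Below is a hedged Erlang sketch of that control flow;
`discover_and_select_node/0`, `acquire_join_lock/0` and
`release_join_lock/1` are placeholder stubs, and the assumption that the
retry settings live under the `cluster_formation` key of the `rabbit`
application environment, with the interval expressed in milliseconds,
is ours rather than something stated in this commit message.

```erlang
-module(peer_discovery_flow_sketch).
-export([sync_desired_cluster/0]).

sync_desired_cluster() ->
    RetryLimit = cluster_formation_setting(discovery_retry_limit, 30),
    RetryDelay = cluster_formation_setting(discovery_retry_interval, 1000),
    sync_desired_cluster(RetryLimit, RetryDelay).

sync_desired_cluster(RetriesLeft, RetryDelay) ->
    %% Discovery, filtering and sorting happen without holding the lock.
    case discover_and_select_node() of
        {ok, SelectedNode} when SelectedNode =:= node() ->
            %% This node sorts first: proceed with boot, no lock, no join.
            ok;
        {ok, SelectedNode} ->
            try
                %% The lock only protects the actual join (property 4).
                Lock = acquire_join_lock(),
                try
                    %% The second argument of `rabbit_db_cluster:join/2` is
                    %% the node type; `disc` is shown only for illustration.
                    rabbit_db_cluster:join(SelectedNode, disc)
                after
                    release_join_lock(Lock)
                end
            catch
                _:_ when RetriesLeft > 0 ->
                    %% Any failure while locking or joining restarts the
                    %% whole process after a delay.
                    retry_after_delay(RetriesLeft, RetryDelay)
            end;
        retry when RetriesLeft > 0 ->
            %% Constraints not met or the backend failed: start over.
            retry_after_delay(RetriesLeft, RetryDelay);
        retry ->
            %% Retry budget exhausted: give up and let the boot fail.
            {error, cluster_formation_failed}
    end.

retry_after_delay(RetriesLeft, RetryDelay) ->
    timer:sleep(RetryDelay),
    sync_desired_cluster(RetriesLeft - 1, RetryDelay).

cluster_formation_setting(Key, Default) ->
    Settings = application:get_env(rabbit, cluster_formation, []),
    proplists:get_value(Key, Settings, Default).

%% Placeholder: the real code queries the backend, filters and sorts the
%% nodes and applies the constraints (see the selection sketch above),
%% returning either {ok, Node} or retry.
discover_and_select_node() ->
    {ok, node()}.

%% Placeholders for the backend-provided lock.
acquire_join_lock() ->
    no_lock.

release_join_lock(_Lock) ->
    ok.
```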

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0`, even if the configuration changed
  for whatever reason in between (see the sketch after this list).
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` if the node was not virgin anyway. So
  make it simpler and always call it in `rabbit_db_cluster` regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
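
A minimal sketch of the backend caching described in the first bullet
above. The persistent term key and the `backend_from_config/0` helper
are illustrative; only the idea of pinning the backend module for the
whole process is taken from this commit.

```erlang
-module(peer_discovery_backend_cache_sketch).
-export([maybe_init/0, cached_backend/0]).

%% Illustrative key; the real code may use a different one.
-define(PT_PEER_DISC_BACKEND, {peer_discovery_sketch, backend}).

maybe_init() ->
    %% Resolve the backend module from the configuration once, then pin it
    %% in a persistent term so that every later step (discovery, register,
    %% unregister) uses the very same module, even if the configuration
    %% changes in between.
    Backend = backend_from_config(),
    persistent_term:put(?PT_PEER_DISC_BACKEND, Backend),
    ok.

%% Later calls, including maybe_unregister/0, read the cached module
%% instead of re-reading the configuration.
cached_backend() ->
    persistent_term:get(?PT_PEER_DISC_BACKEND).

%% Placeholder for the configuration lookup.
backend_from_config() ->
    rabbit_peer_discovery_classic_config.
```
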
dumbbell committed Dec 4, 2023
1 parent 8a59ae3 commit 2a0a0df
Showing 10 changed files with 916 additions and 340 deletions.
8 changes: 8 additions & 0 deletions deps/rabbit/BUILD.bazel
@@ -958,6 +958,14 @@ rabbitmq_suite(
],
)

rabbitmq_suite(
name = "unit_cluster_formation_sort_nodes_SUITE",
size = "small",
deps = [
"@meck//:erlang_app",
],
)

rabbitmq_suite(
name = "unit_collections_SUITE",
size = "small",
8 changes: 8 additions & 0 deletions deps/rabbit/app.bzl
@@ -1735,6 +1735,14 @@ def test_suite_beam_files(name = "test_suite_beam_files"):
app_name = "rabbit",
erlc_opts = "//:test_erlc_opts",
)
erlang_bytecode(
name = "unit_cluster_formation_sort_nodes_SUITE_beam_files",
testonly = True,
srcs = ["test/unit_cluster_formation_sort_nodes_SUITE.erl"],
outs = ["test/unit_cluster_formation_sort_nodes_SUITE.beam"],
app_name = "rabbit",
erlc_opts = "//:test_erlc_opts",
)
erlang_bytecode(
name = "unit_collections_SUITE_beam_files",
testonly = True,
9 changes: 1 addition & 8 deletions deps/rabbit/src/rabbit_db.erl
@@ -51,7 +51,6 @@ init() ->
#{domain => ?RMQLOG_DOMAIN_DB}),

ensure_dir_exists(),
rabbit_peer_discovery:log_configured_backend(),
rabbit_peer_discovery:maybe_init(),

pre_init(IsVirgin),
Expand All @@ -66,8 +65,8 @@ init() ->
"DB: initialization successeful",
#{domain => ?RMQLOG_DOMAIN_DB}),

rabbit_peer_discovery:maybe_register(),
init_finished(),
post_init(IsVirgin),

ok;
Error ->
Expand All @@ -82,12 +81,6 @@ pre_init(IsVirgin) ->
OtherMembers = rabbit_nodes:nodes_excl_me(Members),
rabbit_db_cluster:ensure_feature_flags_are_in_sync(OtherMembers, IsVirgin).

post_init(false = _IsVirgin) ->
rabbit_peer_discovery:maybe_register();
post_init(true = _IsVirgin) ->
%% Registration handled by rabbit_peer_discovery.
ok.

init_using_mnesia() ->
?LOG_DEBUG(
"DB: initialize Mnesia",
32 changes: 13 additions & 19 deletions deps/rabbit/src/rabbit_mnesia.erl
@@ -114,13 +114,24 @@
init() ->
ensure_mnesia_running(),
ensure_mnesia_dir(),
%% If this node is virgin, we call peer discovery to see if this node
%% should start as a standalone node or if it should join a cluster.
case is_virgin_node() of
true ->
rabbit_log:info("Node database directory at ~ts is empty. "
"Assuming we need to join an existing cluster or initialise from scratch...",
[dir()]),
rabbit_peer_discovery:maybe_create_cluster(
fun create_cluster_callback/2);
rabbit_peer_discovery:sync_desired_cluster();
false ->
ok
end,
%% Peer discovery may have been a no-op if it decided that all other nodes
%% should join this one. Therefore, we need to check whether this node is
%% still virgin and finish our use of Mnesia accordingly. In particular,
%% this second part creates all our Mnesia tables.
case is_virgin_node() of
true ->
init_db_and_upgrade([node()], disc, true, _Retry = true);
false ->
NodeType = node_type(),
case is_node_type_permitted(NodeType) of
Expand All @@ -141,23 +152,6 @@ init() ->
ok = rabbit_node_monitor:global_sync(),
ok.

create_cluster_callback(none, NodeType) ->
DiscNodes = [node()],
NodeType1 = case is_node_type_permitted(NodeType) of
false -> disc;
true -> NodeType
end,
init_db_and_upgrade(DiscNodes, NodeType1, true, _Retry = true),
ok;
create_cluster_callback(RemoteNode, NodeType) ->
{ok, {_, DiscNodes, _}} = discover_cluster0(RemoteNode),
NodeType1 = case is_node_type_permitted(NodeType) of
false -> disc;
true -> NodeType
end,
init_db_and_upgrade(DiscNodes, NodeType1, true, _Retry = true),
ok.

%% Make the node join a cluster. The node will be reset automatically
%% before we actually cluster it. The nodes provided will be used to
%% find out about the nodes in the cluster.
