
Delay start by 30 seconds #1156

Merged
merged 2 commits into main on Oct 10, 2022

Conversation

ansd
Member

@ansd ansd commented Oct 7, 2022

The rabbitmq/cluster-operator relies on

PublishNotReadyAddresses: true,

in its headless service for peer discovery, so that the RabbitMQ cluster gets formed before the Pods are marked as Ready.
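
For illustration, here is a minimal sketch of such a headless Service using client-go types (the name suffix and selector are assumptions for the sketch, not the operator's actual values):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// headlessService sketches the kind of Service used for peer discovery:
// ClusterIP "None" makes the Service headless, and
// PublishNotReadyAddresses asks DNS to publish Pod records before the
// Pods pass their readiness checks.
func headlessService(name, namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name + "-nodes", // illustrative naming only
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			ClusterIP:                corev1.ClusterIPNone,
			PublishNotReadyAddresses: true,
			Selector:                 map[string]string{"app": name},
		},
	}
}
```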

However, as explained in kubernetes/kubernetes#92559, this promise is broken in Kubernetes, specifically in CoreDNS: by default, the Pods' DNS entries are not updated correctly within the first 30 seconds.

These networking issues early in the RabbitMQ boot process cause a series of problems:

In some of the issues above, the solution taken was to retry N times within RabbitMQ during the boot process.
In other cases, simply retrying within RabbitMQ is not even possible because things get stuck due to bugs in Erlang's global module (more details in the issues above).

Instead of using brittle workarounds within RabbitMQ, I suggest we start RabbitMQ only once we know that the Pods' DNS entries are up-to-date.

How can we ensure that the Pods' DNS entries are up-to-date?

One possibility is to query the DNS service in the init container, as demonstrated in rabbitmq/rabbitmq-server#5322 (comment). However, since some DNS services put a K8s Service object in front of their backends, it's not certain that all DNS backend pods are queried. The init container could therefore get false positives while, during the RabbitMQ boot process, other DNS servers with outdated caches are queried.
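
Roughly, such an init-container check could look like the following sketch (a hypothetical helper, not the code from the linked comment):

```go
package sketch

import (
	"net"
	"time"
)

// dnsReady sketches the init-container check discussed above: keep
// resolving the Pod's headless-service hostname until an address shows
// up. As noted in the text, this can yield a false positive: each
// lookup may be answered by a different DNS backend behind the
// Service, so a positive answer says nothing about the other
// backends' caches.
func dnsReady(hostname string, attempts int, wait time.Duration) bool {
	for i := 0; i < attempts; i++ {
		if addrs, err := net.LookupHost(hostname); err == nil && len(addrs) > 0 {
			return true
		}
		time.Sleep(wait)
	}
	return false
}
```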

Therefore, to keep things simple, I suggest the new default be to artificially delay starting RabbitMQ by 30 seconds. I chose 30 seconds because that's CoreDNS's default cache value. It's an inelegant solution which I've tried to avoid until now. However, reliable rolling updates (and cluster formations) are more important than 30 seconds of additional downtime while a node is being updated.

This new value will be configurable via spec.DelayStartSeconds. Users who have never experienced RabbitMQ boot issues (for example, because they use DNS backends which actually respect publishNotReadyAddresses=true, or because they explicitly decrease CoreDNS's default cache value) can safely reduce this artificial delay or disable it by setting the value to 0.
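
As a sketch, the new field could look like this in the CRD types (the Go type, pointer semantics, and JSON tag are assumptions for illustration, not the operator's actual code):

```go
package v1beta1 // hypothetical API package for the sketch

// RabbitmqClusterSpec shows only the new field.
type RabbitmqClusterSpec struct {
	// DelayStartSeconds is the artificial delay between Pod creation
	// and starting RabbitMQ. It defaults to 30 (CoreDNS's default
	// cache value); set it to 0 to disable the delay entirely.
	DelayStartSeconds *int32 `json:"delayStartSeconds,omitempty"`
}
```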

I tested 50 five-node cluster formations and rolling updates. With the new 30-second delay between creating the Pod and starting RabbitMQ, I never saw any clustering or upgrade issues.

ansd added 2 commits October 7, 2022 08:54
to make the vulnerability checker in GitHub actions happy
@ansd ansd marked this pull request as ready for review October 7, 2022 15:00
@ansd ansd requested a review from mkuratczyk October 7, 2022 15:02
@ansd
Member Author

ansd commented Oct 10, 2022

Without this PR, I sometimes see the following in the logs of a node being booted during a rolling update:

2022-10-10 09:51:52.533123+00:00 [debug] <0.606.0> Getting epmd port node 'rabbit', 10 retries left
2022-10-10 09:54:07.004227+00:00 [debug] <0.606.0> Getting epmd port node 'rabbit', 9 retries left

... more than 2 minutes of delay. In that regard, this PR even makes booting faster.

@MirahImage MirahImage merged commit 35a04b8 into main Oct 10, 2022
@MirahImage MirahImage deleted the delay-start branch October 10, 2022 10:30
ansd referenced this pull request in rabbitmq/rabbitmq-server Nov 8, 2023
Prior to this commit, global:sync/0 sometimes gets stuck, either when
performing a rolling update on Kubernetes or when creating a new
RabbitMQ cluster on Kubernetes.

When performing a rolling update, the node being booted will be stuck
in:
```
2022-07-26 10:49:58.891896+00:00 [debug] <0.226.0> == Plugins (prelaunch phase) ==
2022-07-26 10:49:58.891908+00:00 [debug] <0.226.0> Setting plugins up
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> Loading the following plugins: [cowlib,cowboy,rabbitmq_web_dispatch,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 rabbitmq_management_agent,amqp_client,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 rabbitmq_management,quantile_estimator,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 prometheus,rabbitmq_peer_discovery_common,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 accept,rabbitmq_peer_discovery_k8s,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 rabbitmq_prometheus]
2022-07-26 10:49:58.926373+00:00 [debug] <0.226.0> Feature flags: REFRESHING after applications load...
2022-07-26 10:49:58.926416+00:00 [debug] <0.372.0> Feature flags: registering controller globally before proceeding with task: refresh_after_app_load
2022-07-26 10:49:58.926450+00:00 [debug] <0.372.0> Feature flags: [global sync] @ [email protected]
```

During cluster creation, an example log of global:sync/0 being stuck can
be found in bullet point 2 of
#5331 (review)

When global:sync/0 is stuck, it never receives a message in line
https://github.com/erlang/otp/blob/bd05b07f973f11d73c4fc77d59b69f212f121c2d/lib/kernel/src/global.erl#L2942

This issue can be observed in both `kind` and GKE: `kind` uses
CoreDNS, GKE uses kube-dns.

CoreDNS does not resolve the hostnames of RabbitMQ and its peers
correctly for up to 30 seconds after node startup. This is because the
default cache value of CoreDNS is 30 seconds and CoreDNS has a bug
described in kubernetes/kubernetes#92559.

global:sync/0 is known to be buggy "in the presence of network failures"
unless the kernel parameter `prevent_overlapping_partitions` is set to
`true`.

When either:
1. setting the CoreDNS cache value to 1 second (see
#5322 (comment)
on how to set this value), or
2. setting the kernel parameter `prevent_overlapping_partitions` to `true`,

rolling updates do NOT get stuck anymore.

This means we are hitting a combination of:
1. Kubernetes DNS bug not updating DNS caches promptly for headless
   services with `publishNotReadyAddresses: true`, and
2. Erlang bug which causes global:sync/0 to hang forever in the presence
   of network failures.

The Erlang bug is fixed by setting `prevent_overlapping_partitions` to
`true` (the default in Erlang/OTP 25).

In RabbitMQ, however, we explicitly set `prevent_overlapping_partitions`
to `false` because we fear other issues could arise if we set this
parameter to `true`.

Luckily, to resolve this issue of global:sync/0 being stuck, we can
just call the function rabbit_node_monitor:global_sync/0, which
provides a workaround. This function was introduced 8 years ago in
9fcb31f.

With this commit applied, rolling updates no longer get stuck, and we
see in the debug log that the workaround is sometimes applied.

(cherry picked from commit 4bf78d8)
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Nov 10, 2023
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. As part of that task, we missed the fact that
`rabbit_peer_discovery:maybe_create_cluster/1` was called from the
Mnesia-specific code only, even though we knew about it: we had hit
many issues caused by the fact that `join_cluster` and peer discovery
use different code paths to create a cluster.

To add support for Khepri, the first version of this patch was to move
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node
to cluster with was selected. It could cause situations where multiple
clusters are created instead of one, unless out-of-band
counter-measures are used, like the 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same node, regardless of the order in which
   they become available. The adopted solution is to sort the list of
   discovered nodes with the following criteria, in that order (see
   the sketch after this list):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraints is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this lets the first node
   start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.
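
To make property 2 concrete, here is a minimal sketch of the
deterministic ordering in Go (the actual implementation is Erlang; the
record fields and start-time representation below are assumptions for
illustration):

```go
package sketch

import "sort"

// discoveredNode carries the membership view queried from one
// discovered node; field names are illustrative, not RabbitMQ's
// actual data structures.
type discoveredNode struct {
	Name        string
	ClusterSize int   // size of the cluster this node is part of
	StartTime   int64 // node start time, e.g. Unix seconds
}

// sortCandidates orders discovered nodes by the three criteria of
// property 2: larger clusters first, then older nodes, then names in
// alphabetical order. Every node computing this over the same inputs
// picks the same head of the list, which is the node that proceeds
// with boot without joining anyone.
func sortCandidates(nodes []discoveredNode) {
	sort.Slice(nodes, func(i, j int) bool {
		if nodes[i].ClusterSize != nodes[j].ClusterSize {
			return nodes[i].ClusterSize > nodes[j].ClusterSize // bigger cluster wins
		}
		if nodes[i].StartTime != nodes[j].StartTime {
			return nodes[i].StartTime < nodes[j].StartTime // older node wins
		}
		return nodes[i].Name < nodes[j].Name // tie-break alphabetically
	})
}
```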

This commit covers properties 1 to 4. The remaining properties will be
the scope of additional pull requests once this one works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.
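
As a sketch, these defaults would look roughly like this in
rabbitmq.conf (assuming the interval is expressed in milliseconds, as
for the pre-existing parameter):

```
cluster_formation.discovery_retry_limit = 30
cluster_formation.discovery_retry_interval = 1000
```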

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` anyway if the node was not virgin, so
  make it simpler and always call it in `rabbit_db_cluster`, regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Nov 13, 2023
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only. Even though we knew about it
because we hit many issues caused by the fact the `join_cluster` and
peer discovery use different code path to create a cluster.

To add support for Khepri, the first version of this patch was to move
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could cause situations where
multiple clusters are created instead of one, without resorting to
out-of-band counter-measures, like a 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criterias (in that order):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If the RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraint is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this will let the first node
   to start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining
properties will be the scope of additional pull requests after this one
works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` if the node was not virgin anyway. So
  make it simpler and always call it in `rabbit_db_cluster` regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Nov 16, 2023
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only. Even though we knew about it
because we hit many issues caused by the fact the `join_cluster` and
peer discovery use different code path to create a cluster.

To add support for Khepri, the first version of this patch was to move
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could cause situations where
multiple clusters are created instead of one, without resorting to
out-of-band counter-measures, like a 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criterias (in that order):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If the RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraint is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this will let the first node
   to start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining
properties will be the scope of additional pull requests after this one
works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` if the node was not virgin anyway. So
  make it simpler and always call it in `rabbit_db_cluster` regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Nov 23, 2023
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only. Even though we knew about it
because we hit many issues caused by the fact the `join_cluster` and
peer discovery use different code path to create a cluster.

To add support for Khepri, the first version of this patch was to move
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could cause situations where
multiple clusters are created instead of one, without resorting to
out-of-band counter-measures, like a 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criterias (in that order):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If the RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraint is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this will let the first node
   to start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining
properties will be the scope of additional pull requests after this one
works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` if the node was not virgin anyway. So
  make it simpler and always call it in `rabbit_db_cluster` regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Nov 27, 2023
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only. Even though we knew about it
because we hit many issues caused by the fact the `join_cluster` and
peer discovery use different code path to create a cluster.

To add support for Khepri, the first version of this patch was to move
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could cause situations where
multiple clusters are created instead of one, without resorting to
out-of-band counter-measures, like a 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criterias (in that order):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If the RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraint is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this will let the first node
   to start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining
properties will be the scope of additional pull requests after this one
works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` if the node was not virgin anyway. So
  make it simpler and always call it in `rabbit_db_cluster` regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Nov 29, 2023
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only. Even though we knew about it
because we hit many issues caused by the fact the `join_cluster` and
peer discovery use different code path to create a cluster.

To add support for Khepri, the first version of this patch was to move
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could cause situations where
multiple clusters are created instead of one, without resorting to
out-of-band counter-measures, like a 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criterias (in that order):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If the RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraint is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this will let the first node
   to start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining
properties will be the scope of additional pull requests after this one
works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` if the node was not virgin anyway. So
  make it simpler and always call it in `rabbit_db_cluster` regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Nov 30, 2023
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only. Even though we knew about it
because we hit many issues caused by the fact the `join_cluster` and
peer discovery use different code path to create a cluster.

To add support for Khepri, the first version of this patch was to move
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could cause situations where
multiple clusters are created instead of one, without resorting to
out-of-band counter-measures, like a 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criterias (in that order):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If the RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraint is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this will let the first node
   to start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining
properties will be the scope of additional pull requests after this one
works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` if the node was not virgin anyway. So
  make it simpler and always call it in `rabbit_db_cluster` regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Dec 4, 2023
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only. Even though we knew about it
because we hit many issues caused by the fact the `join_cluster` and
peer discovery use different code path to create a cluster.

To add support for Khepri, the first version of this patch was to move
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could cause situations where
multiple clusters are created instead of one, without resorting to
out-of-band counter-measures, like a 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criterias (in that order):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If the RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraint is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this will let the first node
   to start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining
properties will be the scope of additional pull requests after this one
works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` if the node was not virgin anyway. So
  make it simpler and always call it in `rabbit_db_cluster` regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Dec 5, 2023
[Why]
This work started as an effort to add peer discovery support to our
Khepri integration. Indeed, as part of the task to integrate Khepri, we
missed the fact that `rabbit_peer_discovery:maybe_create_cluster/1` was
called from the Mnesia-specific code only. Even though we knew about it
because we hit many issues caused by the fact the `join_cluster` and
peer discovery use different code path to create a cluster.

To add support for Khepri, the first version of this patch was to move
the call to `rabbit_peer_discovery:maybe_create_cluster/1` from
`rabbit_db_cluster` instead of `rabbit_mnesia`. To achieve that, it made
sense to unify the code and simply call `rabbit_db_cluster:join/2`
instead of duplicating the work.

Unfortunately, doing so highlighted another issue: the way the node to
cluster with was selected. Indeed, it could cause situations where
multiple clusters are created instead of one, without resorting to
out-of-band counter-measures, like a 30-second delay added in the
Kubernetes operator (rabbitmq/cluster-operator#1156). This problem was
even more frequent when we tried to unify the code path and call
`join_cluster`.

After several iterations on the patch and even more discussions with the
team, we decided to rewrite the algorithm to make node selection more
robust and still use `rabbit_db_cluster:join/2` to create the cluster.

[How]
This commit is only about the rewrite of the algorithm. Calling peer
discovery from `rabbit_db_cluster` instead of `rabbit_mnesia` (and thus
making peer discovery work with Khepri) will be done in a follow-up
commit.

We wanted the new algorithm to fulfill the following properties:

1. `rabbit_peer_discovery` should provide the ability to re-trigger it
   easily to re-evaluate the cluster. The new public API is
   `rabbit_peer_discovery:sync_desired_cluster/0`.

2. The selection of the node to join should be designed in a way that
   all nodes select the same, regardless of the order in which they
   become available. The adopted solution is to sort the list of
   discovered nodes with the following criterias (in that order):

    1. the size of the cluster a discovered node is part of; sorted from
       bigger to smaller clusters
    2. the start time of a discovered node; sorted from older to younger
       nodes
    3. the name of a discovered node; sorted alphabetically

   The first node in that list will not join anyone and simply proceed
   with its boot process. Other nodes will try to join the first node.

3. To reduce the chance of incorrectly having multiple standalone nodes
   because the discovery backend returned only a single node, we want to
   apply the following constraints to the list of nodes after it is
   filtered and sorted (see property 2 above):

    * The list must contain `node()` (i.e. the node running peer
      discovery itself).
    * If the RabbitMQ's cluster size hint is greater than 1, the list
      must have at least two nodes. The cluster size hint is the maximum
      between the configured target cluster size hint and the number of
      elements in the nodes list returned by the backend.

   If one of the constraint is not met, the entire peer discovery
   process is restarted after a delay.

4. The lock is acquired only to protect the actual join, not the
   discovery step where the backend is queried to get the list of peers.
   With the node selection described above, this will let the first node
   to start without acquiring the lock.

5. The cluster membership views queried as part of the algorithm to sort
   the list of nodes will be used to detect additional clusters or
   standalone nodes that did not cluster correctly. These nodes will be
   asked to re-evaluate peer discovery to increase the chance of forming
   a single cluster.

6. After some delay, peer discovery will be re-evaluated to further
   eliminate the chances of having multiple clusters instead of one.

This commit covers properties from point 1 to point 4. Remaining
properties will be the scope of additional pull requests after this one
works.

If there is a failure at any point during discovery, filtering/sorting,
locking or joining, the entire process is restarted after a delay. This
is configured using the following parameters:
* cluster_formation.discovery_retry_limit
* cluster_formation.discovery_retry_interval

The default parameters were bumped to 30 retries with a delay of 1
second between each.

The locking retries/interval parameters are not used by the new
algorithm anymore.

There are extra minor changes that come with the rewrite:
* The configured backend is cached in a persistent term. The goal is to
  make sure we use the same backend throughout the entire process and
  when we call `maybe_unregister/0` even if the configuration changed
  for whatever reason in between.
* `maybe_register/0` is called from `rabbit_db_cluster` instead of at
  the end of a successful peer discovery process. `rabbit_db_cluster`
  had to call `maybe_register/0` if the node was not virgin anyway. So
  make it simpler and always call it in `rabbit_db_cluster` regardless
  of the state of the node.
* `log_configured_backend/0` is gone. `maybe_init/0` can log the backend
  directly. There is no need to explicitly call another function for
  that.
* Messages are logged using `?LOG_*()` macros instead of the old
  `rabbit_log` module.
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Dec 5, 2023
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Dec 6, 2023
dumbbell added a commit to rabbitmq/rabbitmq-server that referenced this pull request Dec 7, 2023
michaelklishin pushed a commit to rabbitmq/rabbitmq-server that referenced this pull request Feb 29, 2024