Add retry for erl_epmd:port_please #5331

Merged: 4 commits merged into master from empd_not_found_5322 on Jul 26, 2022
Conversation

@Gsantomaggio (Member) commented Jul 26, 2022

Fixes #5322.

The Docker image is: pivotalrabbitmq/rabbitmq:empd_not_found_5322-otp-max
You should see:

my-rabbit-server-0 rabbitmq 2022-07-26 07:51:17.266838+00:00 [debug] <0.689.0> Getting epmd port node 'rabbit', 10 retries left
my-rabbit-server-3 rabbitmq 2022-07-26 07:51:18.596978+00:00 [debug] <0.668.0> Getting epmd port node 'rabbit', 10 retries left
my-rabbit-server-1 rabbitmq 2022-07-26 07:51:20.156612+00:00 [debug] <0.678.0> Getting epmd port node 'rabbit', 10 retries left
my-rabbit-server-2 rabbitmq 2022-07-26 07:51:27.894543+00:00 [debug] <0.701.0> Getting epmd port node 'rabbit', 10 retries left
my-rabbit-server-1 rabbitmq 2022-07-26 07:51:50.167880+00:00 [debug] <0.678.0> Getting epmd port node 'rabbit', 9 retries left

The last line shows that the node waited for a bit and then retried:

my-rabbit-server-1 rabbitmq 2022-07-26 07:51:50.167880+00:00 [debug] <0.678.0> Getting epmd port node 'rabbit', 9 retries left
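For context, the change is essentially a bounded retry loop around erl_epmd:port_please/2. A minimal sketch of the idea follows; it is not the exact PR diff, and the module name, retry count, and delay constants are illustrative:

```erlang
%% Minimal sketch of the retry idea (illustrative, not the PR's exact code):
%% retry erl_epmd:port_please/2 a bounded number of times, sleeping between
%% attempts, instead of failing on the first noport.
-module(epmd_port_retry).
-export([port_please/2]).

-define(RETRIES, 10).           %% assumed: matches "10 retries left" above
-define(RETRY_DELAY_MS, 1000).  %% assumed delay between attempts

port_please(Name, Host) ->
    port_please(Name, Host, ?RETRIES).

port_please(Name, Host, 0) ->
    %% Out of retries: return whatever epmd answers this time.
    erl_epmd:port_please(Name, Host);
port_please(Name, Host, RetriesLeft) ->
    logger:debug("Getting epmd port node '~s', ~b retries left",
                 [Name, RetriesLeft]),
    case erl_epmd:port_please(Name, Host) of
        {port, Port, Version} ->
            {port, Port, Version};
        _NoPortYet ->
            timer:sleep(?RETRY_DELAY_MS),
            port_please(Name, Host, RetriesLeft - 1)
    end.
```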

Proposed Changes

Types of Changes

What types of changes does your code introduce to this project?

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

@Gsantomaggio Gsantomaggio marked this pull request as draft July 26, 2022 07:58
Fixes: #5322

Signed-off-by: Gabriele Santomaggio <[email protected]>
Signed-off-by: Gabriele Santomaggio <[email protected]>
@Gsantomaggio Gsantomaggio force-pushed the empd_not_found_5322 branch from 7d4b172 to 3c63075 on July 26, 2022 08:16
@Gsantomaggio Gsantomaggio marked this pull request as ready for review July 26, 2022 08:39
@Gsantomaggio Gsantomaggio requested a review from ansd July 26, 2022 08:40
@ansd (Member) left a comment

Thanks. I did a few tests as described in #5322 (comment) and could no longer observe that specific error.

I still sporadically observed cluster formation issues.

  1. Same as in #4233 (Wait between retries if nxdomain):

BOOT FAILED
===========
ERROR: epmd error for host r1-server-3.r1-nodes.default: nxdomain (non-existing domain)

2022-07-26 09:48:53.919237+00:00 [error] <0.131.0>
2022-07-26 09:48:53.919237+00:00 [error] <0.131.0> BOOT FAILED
2022-07-26 09:48:53.919237+00:00 [error] <0.131.0> ===========
2022-07-26 09:48:53.919237+00:00 [error] <0.131.0> ERROR: epmd error for host r1-server-3.r1-nodes.default: nxdomain (non-existing domain)
2022-07-26 09:48:53.919237+00:00 [error] <0.131.0>
2022-07-26 09:48:54.920058+00:00 [error] <0.131.0>     supervisor: {local,rabbit_prelaunch_sup}
2022-07-26 09:48:54.920058+00:00 [error] <0.131.0>     errorContext: start_error
2022-07-26 09:48:54.920058+00:00 [error] <0.131.0>     reason: {epmd_error,"r1-server-3.r1-nodes.default",nxdomain}
2022-07-26 09:48:54.920058+00:00 [error] <0.131.0>     offender: [{pid,undefined},
2022-07-26 09:48:54.920058+00:00 [error] <0.131.0>                {id,prelaunch},
2022-07-26 09:48:54.920058+00:00 [error] <0.131.0>                {mfargs,{rabbit_prelaunch,run_prelaunch_first_phase,[]}},
2022-07-26 09:48:54.920058+00:00 [error] <0.131.0>                {restart_type,transient},
2022-07-26 09:48:54.920058+00:00 [error] <0.131.0>                {significant,false},
2022-07-26 09:48:54.920058+00:00 [error] <0.131.0>                {shutdown,5000},
2022-07-26 09:48:54.920058+00:00 [error] <0.131.0>                {child_type,worker}]
2022-07-26 09:48:54.920058+00:00 [error] <0.131.0>
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>   crasher:
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>     initial call: application_master:init/4
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>     pid: <0.129.0>
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>     registered_name: []
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>     exception exit: {{shutdown,
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>                          {failed_to_start_child,prelaunch,
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>                              {epmd_error,"r1-server-3.r1-nodes.default",
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>                                  nxdomain}}},
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>                      {rabbit_prelaunch_app,start,[normal,[]]}}
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>       in function  application_master:init/4 (application_master.erl, line 142)
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>     ancestors: [<0.128.0>]
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>     message_queue_len: 1
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>     messages: [{'EXIT',<0.130.0>,normal}]
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>     links: [<0.128.0>,<0.44.0>]
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>     dictionary: []
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>     trap_exit: true
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>     status: running
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>     heap_size: 376
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>     stack_size: 28
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>     reductions: 168
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>   neighbours:
2022-07-26 09:48:54.921138+00:00 [error] <0.129.0>
2022-07-26 09:48:54.931629+00:00 [notice] <0.44.0> Application rabbitmq_prelaunch exited with reason: {{shutdown,{failed_to_start_child,prelaunch,{epmd_error,"r1-server-3.r1-nodes.default",nxdomain}}},{rabbit_prelaunch_app,start,[normal,[]]}}
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbitmq_prelaunch,{{shutdown,{failed_to_start_child,prelaunch,{epmd_error,\"r1-server-3.r1-nodes.default\",nxdomain}}},{rabbit_prelaunch_app,start,[normal,[]]}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbitmq_prelaunch,{{shutdown,{failed_to_start_child,prelaunch,{epmd_error,"r1-server-3.r1-nodes.default",nxdomain}}},{rabbit_prelaunch_app,start,[normal,[]]}}})

It might be worth checking whether the "wait" times are large enough in that change.

  2. Unrelated to this change, and applying to the new feature flags controller (cc @dumbbell): sometimes one pod was stuck in:
2022-07-26 09:52:32.565800+00:00 [debug] <0.226.0> Feature flags: scanned applications: [rabbit,rabbit_common,
2022-07-26 09:52:32.565800+00:00 [debug] <0.226.0>                                       rabbitmq_management,
2022-07-26 09:52:32.565800+00:00 [debug] <0.226.0>                                       rabbitmq_management_agent,
2022-07-26 09:52:32.565800+00:00 [debug] <0.226.0>                                       rabbitmq_peer_discovery_common,
2022-07-26 09:52:32.565800+00:00 [debug] <0.226.0>                                       rabbitmq_peer_discovery_k8s,
2022-07-26 09:52:32.565800+00:00 [debug] <0.226.0>                                       rabbitmq_prelaunch,rabbitmq_prometheus,
2022-07-26 09:52:32.565800+00:00 [debug] <0.226.0>                                       rabbitmq_web_dispatch]
2022-07-26 09:52:32.565800+00:00 [debug] <0.226.0> Feature flags: feature flag states written to disk: yes
2022-07-26 09:52:32.569385+00:00 [debug] <0.226.0> Feature flags: registry module ready, loading it (<0.226.0>)...
2022-07-26 09:52:32.569460+00:00 [debug] <0.226.0> Feature flags: acquired lock before reloading registry module (<0.226.0>)
2022-07-26 09:52:32.571891+00:00 [debug] <0.226.0> Feature flags: releasing lock after reloading registry module (<0.226.0>)
2022-07-26 09:52:32.571965+00:00 [debug] <0.226.0> Feature flags: registry module loaded (vsn: [9257440361403759229452931694312505263] -> [60314197729953461943764115135738606877])
2022-07-26 09:52:32.571989+00:00 [debug] <0.226.0> Feature flags: time to regen registry: 6347 us
2022-07-26 09:52:32.572002+00:00 [debug] <0.226.0> Feature flags: both sides have `feature_flags_v2` enabled; switching to controller's sync
2022-07-26 09:52:32.572040+00:00 [debug] <0.226.0> Feature flags: SYNCING FEATURE FLAGS in cluster...
2022-07-26 09:52:32.572059+00:00 [debug] <0.284.0> Feature flags: registering controller globally before proceeding with task: sync_cluster
2022-07-26 09:52:32.572092+00:00 [debug] <0.284.0> Feature flags: [global sync] @ [email protected]

while another pod was stuck in:

2022-07-26 10:01:45.904021+00:00 [info] <0.229.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2022-07-26 10:01:45.904041+00:00 [debug] <0.229.0> Peer discovery backend supports initialisation
2022-07-26 10:01:45.904055+00:00 [debug] <0.229.0> Peer discovery Kubernetes: initialising...
2022-07-26 10:01:45.904131+00:00 [debug] <0.229.0> HTTP client proxy is not configured
2022-07-26 10:01:45.904166+00:00 [debug] <0.229.0> Peer discovery backend initialisation succeeded
2022-07-26 10:01:45.904179+00:00 [info] <0.229.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2022-07-26 10:01:45.904467+00:00 [debug] <0.229.0> GET https://kubernetes.default:443/api/v1/namespaces/default/endpoints/r1-nodes
2022-07-26 10:01:45.904915+00:00 [warning] <0.329.0> Description: "Authenticity is not established by certificate path validation"
2022-07-26 10:01:45.904915+00:00 [warning] <0.329.0>      Reason: "Option {verify, verify_peer} and cacertfile/cacerts is missing"
2022-07-26 10:01:45.904915+00:00 [warning] <0.329.0>
2022-07-26 10:01:45.911550+00:00 [debug] <0.229.0> Response: {ok,{{"HTTP/1.1",200,"OK"},
2022-07-26 10:01:45.911550+00:00 [debug] <0.229.0>                [{"cache-control","no-cache, private"},
2022-07-26 10:01:45.911550+00:00 [debug] <0.229.0>                 {"date","Tue, 26 Jul 2022 10:01:45 GMT"},
2022-07-26 10:01:45.911550+00:00 [debug] <0.229.0>                 {"content-length","2112"},
2022-07-26 10:01:45.911550+00:00 [debug] <0.229.0>                 {"content-type","application/json"},
2022-07-26 10:01:45.911550+00:00 [debug] <0.229.0>                 {"x-kubernetes-pf-flowschema-uid",
2022-07-26 10:01:45.911550+00:00 [debug] <0.229.0>                  "ae88cc17-8d1e-4639-8969-29b7789e177e"},
2022-07-26 10:01:45.911550+00:00 [debug] <0.229.0>                 {"x-kubernetes-pf-prioritylevel-uid",
2022-07-26 10:01:45.911550+00:00 [debug] <0.229.0>                  "5745bb1e-755a-4df5-a9f0-c22d7d88590e"}],
2022-07-26 10:01:45.911550+00:00 [debug] <0.229.0>                "{\"kind\":\"Endpoints\",\"apiVersion\":\"v1\",\"metadata\":{\"name\":\"r1-nodes\",\"namespace\":\"default\",\"uid\":\"7acc07ad-b3e2-4a7c-a97b-35ff3a6c46c5\",\"resourceVersion\":\"7362\",\"creationTimestamp\":\"2022-07-26T10:01:28Z\",\"labels\":{\"app.kubernetes.io/component\":\"rabbitmq\",\"app.kubernetes.io/name\":\"r1\",\"app.kubernetes.io/part-of\":\"rabbitmq\",\"service.kubernetes.io/headless\":\"\"},\"annotations\":{\"endpoints.kubernetes.io/last-change-trigger-time\":\"2022-07-26T10:01:40Z\"},\"managedFields\":[{\"manager\":\"kube-controller-manager\",\"operation\":\"Update\",\"apiVersion\":\"v1\",\"time\":\"2022-07-26T10:01:44Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:endpoints.kubernetes.io/last-change-trigger-time\":{}},\"f:labels\":{\".\":{},\"f:app.kubernetes.io/component\":{},\"f:app.kubernetes.io/name\":{},\"f:app.kubernetes.io/part-of\":{},\"f:service.kubernetes.io/headless\":{}}},\"f:subsets\":{}}}]},\"subsets\":[{\"addresses\":[{\"ip\":\"10.244.0.145\",\"hostname\":\"r1-server-0\",\"nodeName\":\"kind-control-plane\",\"targetRef\":{\"kind\":\"Pod\",\"namespace\":\"default\",\"name\":\"r1-server-0\",\"uid\":\"392dda7b-c4a9-4ca8-b562-8c56a13ba57d\",\"resourceVersion\":\"7328\"}},{\"ip\":\"10.244.0.146\",\"hostname\":\"r1-server-1\",\"nodeName\":\"kind-control-plane\",\"targetRef\":{\"kind\":\"Pod\",\"namespace\":\"default\",\"name\":\"r1-server-1\",\"uid\":\"c80770db-603d-4d5d-9500-b9a9d763b4f8\",\"resourceVersion\":\"7300\"}},{\"ip\":\"10.244.0.147\",\"hostname\":\"r1-server-2\",\"nodeName\":\"kind-control-plane\",\"targetRef\":{\"kind\":\"Pod\",\"namespace\":\"default\",\"name\":\"r1-server-2\",\"uid\":\"b50a1c45-f51f-4c52-b5a8-94a7c72717d7\",\"resourceVersion\":\"7341\"}},{\"ip\":\"10.244.0.149\",\"hostname\":\"r1-server-3\",\"nodeName\":\"kind-control-plane\",\"targetRef\":{\"kind\":\"Pod\",\"namespace\":\"default\",\"name\":\"r1-server-3\",\"uid\":\"fb1fb0bd-12a8-46f8-9e72-427fd82201ea\",\"resourceVersion\":\"7345\"}},{\"ip\":\"10.244.0.150\",\"hostname\":\"r1-server-4\",\"nodeName\":\"kind-control-plane\",\"targetRef\":{\"kind\":\"Pod\",\"namespace\":\"default\",\"name\":\"r1-server-4\",\"uid\":\"a1c72230-92c5-4bec-ac78-7d7af0125112\",\"resourceVersion\":\"7359\"}}],\"ports\":[{\"name\":\"cluster-rpc\",\"port\":25672,\"protocol\":\"TCP\"},{\"name\":\"epmd\",\"port\":4369,\"protocol\":\"TCP\"}]}]}\n"}}
2022-07-26 10:01:45.912014+00:00 [info] <0.229.0> Will try to lock connecting to nodes ['[email protected]',
2022-07-26 10:01:45.912014+00:00 [info] <0.229.0>                                       '[email protected]',
2022-07-26 10:01:45.912014+00:00 [info] <0.229.0>                                       '[email protected]',
2022-07-26 10:01:45.912014+00:00 [info] <0.229.0>                                       '[email protected]',
2022-07-26 10:01:45.912014+00:00 [info] <0.229.0>                                       '[email protected]']

So, I'm happy with merging this PR since it does fix #5322.

deps/rabbit/src/rabbit_networking.erl (review thread; outdated, resolved)
Signed-off-by: Gabriele Santomaggio <[email protected]>
Signed-off-by: Gabriele Santomaggio <[email protected]>
@Gsantomaggio (Member, Author) commented:

Per a conversation with @ansd, we decided to increase the timeout. This way all the timeouts are the same, which should reduce the chance of:

nxdomain (non-existing domain)

@michaelklishin michaelklishin merged commit 7347614 into master Jul 26, 2022
@Gsantomaggio Gsantomaggio deleted the empd_not_found_5322 branch July 26, 2022 14:58
michaelklishin added a commit that referenced this pull request Jul 26, 2022
Add retry for erl_epmd:port_please (backport #5331)
ansd added a commit that referenced this pull request Aug 4, 2022
Prior to this commit, global:sync/0 sometimes gets stuck when performing a
rolling update on Kubernetes or when creating a new RabbitMQ cluster on
Kubernetes.

When performing a rolling update, the node being booted will be stuck
in:
```
2022-07-26 10:49:58.891896+00:00 [debug] <0.226.0> == Plugins (prelaunch phase) ==
2022-07-26 10:49:58.891908+00:00 [debug] <0.226.0> Setting plugins up
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> Loading the following plugins: [cowlib,cowboy,rabbitmq_web_dispatch,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 rabbitmq_management_agent,amqp_client,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 rabbitmq_management,quantile_estimator,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 prometheus,rabbitmq_peer_discovery_common,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 accept,rabbitmq_peer_discovery_k8s,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 rabbitmq_prometheus]
2022-07-26 10:49:58.926373+00:00 [debug] <0.226.0> Feature flags: REFRESHING after applications load...
2022-07-26 10:49:58.926416+00:00 [debug] <0.372.0> Feature flags: registering controller globally before proceeding with task: refresh_after_app_load
2022-07-26 10:49:58.926450+00:00 [debug] <0.372.0> Feature flags: [global sync] @ [email protected]
```

During cluster creation, an example log of global:sync/0 being stuck can
be found in bullet point 2 of
#5331 (review)

When global:sync/0 is stuck, it never receives a message in line
https://github.com/erlang/otp/blob/bd05b07f973f11d73c4fc77d59b69f212f121c2d/lib/kernel/src/global.erl#L2942

This issue can be observed in both `kind` and GKE.

`kind` uses CoreDNS, GKE uses kubedns.

CoreDNS does not resolve the hostname of RabbitMQ and its peers
correctly for up to 30 seconds after node startup.
This is because the default cache value of CoreDNS is 30 seconds and
CoreDNS has a bug described in
kubernetes/kubernetes#92559

global:sync/0 is known to be buggy "in the presence of network failures"
unless the kernel parameter `prevent_overlapping_partitions` is set to
`true`.

When either:
1. setting CoreDNS cache value to 1 second (see
#5322 (comment)
on how to set this value), or
2. setting the kernel parameter `prevent_overlapping_partitions` to `true`

rolling updates do NOT get stuck anymore.
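On the CoreDNS side, option 1 above is a one-line Corefile change. A sketch, assuming a stock kube-system Corefile (the surrounding plugin list varies per cluster):

```
.:53 {
    errors
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 1    # lowered from the default 30s so stale pod records expire quickly
    loop
    reload
}
```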

This means we are hitting a combination of:
1. Kubernetes DNS bug not updating DNS caches promptly for headless
   services with `publishNotReadyAddresses: true`, and
2. Erlang bug which causes global:sync/0 to hang forever in the presence
   of network failures.

The Erlang bug is fixed by setting `prevent_overlapping_partitions` to `true` (default in Erlang/OTP 25).

In RabbitMQ, however, we explicitly set `prevent_overlapping_partitions`
to `false` because we fear other issues could arise if we set this
parameter to `true`.
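For illustration only, a kernel parameter like this can be set through an Erlang config file such as RabbitMQ's advanced.config (a sketch of the mechanism, not how RabbitMQ applies its own default):

```erlang
%% advanced.config sketch (illustrative): setting the kernel parameter.
%% RabbitMQ itself applies its `false` default elsewhere during startup.
[
  {kernel, [
      {prevent_overlapping_partitions, true}
  ]}
].
```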

Luckily, to resolve this issue of global:sync/0 being stuck, we can just
call rabbit_node_monitor:global_sync/0, which provides a workaround. This
function was introduced 8 years ago in 9fcb31f.

With this commit applied, rolling updates no longer get stuck, and the
debug log shows the workaround sometimes being applied.
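For readers unfamiliar with the workaround, the general pattern is to run global:sync/0 under a timeout instead of letting it block boot forever. A sketch of that pattern only, not the actual rabbit_node_monitor:global_sync/0 code:

```erlang
%% Sketch of a generic "sync with timeout" pattern (hypothetical helper,
%% not the real rabbit_node_monitor implementation): run global:sync/0 in
%% a helper process and stop waiting after TimeoutMs instead of hanging.
sync_with_timeout(TimeoutMs) ->
    {Pid, MRef} = spawn_monitor(fun() -> ok = global:sync() end),
    receive
        {'DOWN', MRef, process, Pid, normal} ->
            ok;
        {'DOWN', MRef, process, Pid, Reason} ->
            {error, Reason}
    after TimeoutMs ->
        erlang:demonitor(MRef, [flush]),
        exit(Pid, kill),
        {error, timeout}
    end.
```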
ansd added a commit that referenced this pull request Aug 4, 2022
(same commit message as above)

mergify bot pushed a commit that referenced this pull request Aug 4, 2022
(same commit message as above; cherry picked from commit 4bf78d8)

ansd added a commit that referenced this pull request Aug 4, 2022
(same commit message as above; cherry picked from commit 4bf78d8)

Successfully merging this pull request may close these issues.

Sporadic error during startup: no_epmd_port (#5322)