Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rolling updates hang on K8s with rabbitmq:3.11.0 #6034

Closed
ansd opened this issue Oct 6, 2022 · 0 comments · Fixed by #6046
Closed

Rolling updates hang on K8s with rabbitmq:3.11.0 #6034

ansd opened this issue Oct 6, 2022 · 0 comments · Fixed by #6046
Labels
Milestone

Comments

@ansd
Copy link
Member

ansd commented Oct 6, 2022

The RabbitMQ “global hang workaround” as described in

%%----------------------------------------------------------------------------
%% "global" hang workaround.
%%----------------------------------------------------------------------------
%% This code works around a possible inconsistency in the "global"
%% state, causing global:sync/0 to never return.
%%
%% 1. A process is spawned.
%% 2. If after 10", global:sync() didn't return, the "global"
%% state is parsed.
%% 3. If it detects that a sync is blocked for more than 10",
%% the process sends fake nodedown/nodeup events to the two
%% nodes involved (one local, one remote).
%% 4. Both "global" instances restart their synchronisation.
%% 5. global:sync() finally returns.

breaks from OTP 25.1 onwards because the messages being sent in
{global_name_server, ThisNode} ! {nodedown, PeerNode},
{global_name_server, PeerNode} ! {nodedown, ThisNode},
{global_name_server, ThisNode} ! {nodeup, PeerNode},
{global_name_server, PeerNode} ! {nodeup, ThisNode},
are not handled anymore in global due to the following commit: erlang/otp@9274e89

This means rolling updates in RabbitMQ on Kubernetes with image rabbitmq:3.11.0 containing Erlang 25.1.1 will get (sometimes) stuck.

The reasoning of using "global hang workaround" is further described in #5438.
In short, a combination of the following issues leads us to relying on that workaround:

  1. the new feature flags controller v2 introduced in 3.11 calls global:sync/0 early on boot
  2. following Kubernetes bug: Headless Service not publishing not ready addresses for statefulset kubernetes/kubernetes#92559
  3. Erlang global bug (got fixed in Erlang by setting parameter prevent_overlapping_partitions to true)
  4. RabbitMQ not being able to set prevent_overlapping_partitions to true as described in Revert "Set kernel param prevent_overlapping_partitions to true" #5483

To reproduce, trigger a few times a rolling update on kindest/node:v1.25.2 using a 5 node RabbitMQ cluster with image rabbitmq:3.11.0-management.

The logs show the following before the node gets stuck:

2022-10-06 07:13:05.032492+00:00 [debug] <0.232.0> Register `rabbit` process (<0.232.0>) for rabbit_node_monitor

  ##  ##      RabbitMQ 3.11.0
  ##  ##
  ##########  Copyright (c) 2007-2022 VMware, Inc. or its affiliates.
  ######  ##
  ##########  Licensed under the MPL 2.0. Website: https://rabbitmq.com

  Erlang:      25.1.1 [jit]
  TLS Library: OpenSSL - OpenSSL 1.1.1q  5 Jul 2022
  Release series support status: supported

  Doc guides:  https://rabbitmq.com/documentation.html
  Support:     https://rabbitmq.com/contact.html
  Tutorials:   https://rabbitmq.com/getstarted.html
  Monitoring:  https://rabbitmq.com/monitoring.html

  Logs: /var/log/rabbitmq/[email protected]_upgrade.log
        <stdout>

  Config file(s): /etc/rabbitmq/conf.d/10-defaults.conf
                  /etc/rabbitmq/conf.d/10-operatorDefaults.conf
                  /etc/rabbitmq/conf.d/11-default_user.conf
                  /etc/rabbitmq/conf.d/90-userDefinedConfiguration.conf

  Starting broker...2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>  node           : [email protected]
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>  home dir       : /var/lib/rabbitmq
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>  config file(s) : /etc/rabbitmq/conf.d/10-defaults.conf
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>                 : /etc/rabbitmq/conf.d/10-operatorDefaults.conf
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>                 : /etc/rabbitmq/conf.d/11-default_user.conf
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>                 : /etc/rabbitmq/conf.d/90-userDefinedConfiguration.conf
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>  cookie hash    : idxtUKhQEU9Dav23AqS/0g==
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>  log(s)         : /var/log/rabbitmq/[email protected]_upgrade.log
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>                 : <stdout>
2022-10-06 07:13:05.033106+00:00 [info] <0.232.0>  database dir   : /var/lib/rabbitmq/mnesia/[email protected]
2022-10-06 07:13:05.033162+00:00 [debug] <0.232.0>
2022-10-06 07:13:05.033173+00:00 [debug] <0.232.0> == Plugins (prelaunch phase) ==
2022-10-06 07:13:05.033184+00:00 [debug] <0.232.0> Setting plugins up
2022-10-06 07:13:05.047274+00:00 [debug] <0.232.0> Plugins discovery: ignoring getopt, not a RabbitMQ plugin
2022-10-06 07:13:05.047316+00:00 [debug] <0.232.0> Plugins discovery: ignoring quantile_estimator, not a RabbitMQ plugin
2022-10-06 07:13:05.062932+00:00 [debug] <0.232.0> Loading the following plugins: [rabbitmq_management_agent,cowlib,cowboy,
2022-10-06 07:13:05.062932+00:00 [debug] <0.232.0>                                 rabbitmq_web_dispatch,amqp_client,
2022-10-06 07:13:05.062932+00:00 [debug] <0.232.0>                                 rabbitmq_management,prometheus,
2022-10-06 07:13:05.062932+00:00 [debug] <0.232.0>                                 rabbitmq_peer_discovery_common,accept,
2022-10-06 07:13:05.062932+00:00 [debug] <0.232.0>                                 rabbitmq_peer_discovery_k8s,
2022-10-06 07:13:05.062932+00:00 [debug] <0.232.0>                                 rabbitmq_prometheus]
2022-10-06 07:13:05.068152+00:00 [debug] <0.232.0> Feature flags: REFRESHING after applications load...
2022-10-06 07:13:05.068197+00:00 [debug] <0.377.0> Feature flags: registering controller globally before proceeding with task: refresh_after_app_load
2022-10-06 07:13:05.068231+00:00 [debug] <0.377.0> Feature flags: [global sync] @ [email protected]
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> Global hang workaround: global state on [email protected] seems inconsistent
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>  * Peer global state:  {status,<25933.55.0>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                         {module,gen_server},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                         [[{'$ancestors',[kernel_sup,<25933.47.0>]},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {rand_seed,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            {#{bits => 58,jump => #Fun<rand.3.34006561>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               next => #Fun<rand.0.34006561>,type => exsss,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               uniform => #Fun<rand.1.34006561>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               uniform_n => #Fun<rand.2.34006561>},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             [122137426785371025|262422651984364825]}},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {creation_extension,269151976804057088},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {'$initial_call',{global,init,1}}],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                          running,<25933.49.0>,[],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                          [{header,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            "Status for generic server global_name_server"},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {data,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            [{"Status",running},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             {"Parent",<25933.49.0>},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             {"Logged events",[]}]},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {data,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            [{"State",
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                              {state,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               {conf,true,false},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               #{'[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  12910273,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  12823848,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  13047845,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  12992514},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               ['[email protected]',
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                '[email protected]',
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                '[email protected]',
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                '[email protected]'],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               [],[],'[email protected]',
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               <25933.56.0>,<25933.57.0>,no_trace,false}}]}]]}
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>  * Local global state: {status,<0.55.0>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                         {module,gen_server},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                         [[{{pre_connect,'[email protected]'},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            {8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             {locker,no_longer_a_pid,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                              #{'[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 12910273,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 12823848,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 13047845,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 12992514},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                              <25933.56.0>},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             -576460752303423477}},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {{prot_vsn,'[email protected]'},8},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {rand_seed,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            {#{bits => 58,jump => #Fun<rand.3.34006561>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               next => #Fun<rand.0.34006561>,type => exsss,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               uniform => #Fun<rand.1.34006561>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               uniform_n => #Fun<rand.2.34006561>},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             [2150540948049099|39771336696590305]}},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {{prot_vsn,'[email protected]'},8},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {{sync_tag_his,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             '[email protected]'},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            -576460752303423477},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {creation_extension,20109732664573952},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {'$initial_call',{global,init,1}},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {'$ancestors',[kernel_sup,<0.47.0>]},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {{sync_tag_his,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             '[email protected]'},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            -576460752303423469},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {{pre_connect,'[email protected]'},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            {8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             {locker,no_longer_a_pid,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                              #{'[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 6631665,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 9346854,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 9344269,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 6616455},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                              <25934.56.0>},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             -576460752303423469}}],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                          running,<0.49.0>,[],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                          [{header,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            "Status for generic server global_name_server"},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {data,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            [{"Status",running},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             {"Parent",<0.49.0>},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                             {"Logged events",[]}]},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                           {data,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                            [{"State",
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                              {state,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               {conf,true,false},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               #{'[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 '[email protected]' => 8,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  12032275,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  1029715,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  1027052,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                 {connection_id,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  '[email protected]'} =>
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                  1021319},
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               ['[email protected]',
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                                '[email protected]'],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               [],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               [<0.426.0>],
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               '[email protected]',<0.56.0>,
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0>                               <0.57.0>,no_trace,false}}]}]]}
2022-10-06 07:13:25.070793+00:00 [debug] <0.425.0> Faking nodedown/nodeup between [email protected] and [email protected]
2022-10-06 07:13:25.073375+00:00 [warning] <0.55.0> The global_name_server received an unexpected message:
2022-10-06 07:13:25.073375+00:00 [warning] <0.55.0> handle_info({nodedown,'[email protected]'}, _)
2022-10-06 07:13:25.073375+00:00 [warning] <0.55.0>
2022-10-06 07:13:25.073521+00:00 [warning] <0.55.0> The global_name_server received an unexpected message:
2022-10-06 07:13:25.073521+00:00 [warning] <0.55.0> handle_info({nodeup,'[email protected]'}, _)
2022-10-06 07:13:25.073521+00:00 [warning] <0.55.0>
</details>
@ansd ansd added the bug label Oct 6, 2022
ansd added a commit that referenced this issue Oct 6, 2022
TODOs:
* Continue to support OTP < 25.1
* Is it right use the same connection instance or do we need to create a
  new connection_id?

See #6034
ansd added a commit that referenced this issue Oct 6, 2022
TODOs:
* Continue to support OTP < 25.1
* Is it right use the same connection instance or do we need to create a
  new connection_id?
* Do not use sys:get_state/1

See #6034
ansd added a commit that referenced this issue Oct 7, 2022
ansd added a commit that referenced this issue Oct 7, 2022
Commit
erlang/otp@9274e89
present in OTP >= 25.1 expects a different nodedown / nodeup message
format: The connection ID needs to be included.

Fixes #6034
mergify bot pushed a commit that referenced this issue Oct 10, 2022
Commit
erlang/otp@9274e89
present in OTP >= 25.1 expects a different nodedown / nodeup message
format: The connection ID needs to be included.

Fixes #6034

(cherry picked from commit 47fe6e6)
@michaelklishin michaelklishin added this to the 3.10.9 milestone Oct 10, 2022
mergify bot pushed a commit that referenced this issue Oct 10, 2022
Commit
erlang/otp@9274e89
present in OTP >= 25.1 expects a different nodedown / nodeup message
format: The connection ID needs to be included.

Fixes #6034

(cherry picked from commit 47fe6e6)
(cherry picked from commit bfcdeec)

# Conflicts:
#	deps/rabbit/src/rabbit_node_monitor.erl
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants