Sporadic error during startup: no_epmd_port #5322
Comments
Related: #2722
Also related: #4233. It's worth mentioning that we've observed all these DNS-related issues on only some K8s environments.
Issues like the ones referenced above are all symptoms of CoreDNS issues such as the caching bug described in kubernetes/kubernetes#92559.

On the headless service of the RabbitMQ cluster, the rabbitmq/cluster-operator sets `publishNotReadyAddresses: true`. Therefore we had to add all kinds of retries within the RabbitMQ code. To work around the problem on the Kubernetes side instead, you can edit the CoreDNS configuration and reduce its cache value (a sketch follows below).
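For reference, a minimal sketch of how the CoreDNS cache value can be lowered. It assumes the stock CoreDNS setup, i.e. a Deployment and ConfigMap both named coredns in the kube-system namespace and a Corefile containing the default `cache 30` line; these names and defaults may differ per distribution.

```
# Show the current Corefile; the cache plugin line typically reads "cache 30".
kubectl -n kube-system get configmap coredns -o yaml

# Edit the Corefile and lower the cache TTL, e.g. change "cache 30" to "cache 1".
kubectl -n kube-system edit configmap coredns

# Restart the CoreDNS pods so they pick up the modified Corefile.
kubectl -n kube-system rollout restart deployment coredns
```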
Reducing the CoreDNS cache value (e.g. to 1 second) works around the problem.

Alternative workaround

Since this is a Kubernetes CoreDNS issue, instead of adding retries into RabbitMQ, and because the rabbitmq/cluster-operator already defines an init container, we could alternatively just wait in that init container until CoreDNS resolves correctly:

```
diff --git a/internal/resource/statefulset.go b/internal/resource/statefulset.go
index 21a90c2..b87c108 100644
--- a/internal/resource/statefulset.go
+++ b/internal/resource/statefulset.go
@@ -722,7 +722,8 @@ func setupContainer(instance *rabbitmqv1beta1.RabbitmqCluster) corev1.Container
"cp /tmp/rabbitmq-plugins/enabled_plugins /operator/enabled_plugins ; " +
"echo '[default]' > /var/lib/rabbitmq/.rabbitmqadmin.conf " +
"&& sed -e 's/default_user/username/' -e 's/default_pass/password/' %s >> /var/lib/rabbitmq/.rabbitmqadmin.conf " +
- "&& chmod 600 /var/lib/rabbitmq/.rabbitmqadmin.conf",
+ "&& chmod 600 /var/lib/rabbitmq/.rabbitmqadmin.conf ; " +
+ "until host \"$HOSTNAME_DOMAIN\"; do echo waiting for 'host \"$HOSTNAME_DOMAIN\"' to succeed; sleep 5; done",
},
Resources: corev1.ResourceRequirements{
Limits: corev1.ResourceList{
@@ -756,6 +757,11 @@ func setupContainer(instance *rabbitmqv1beta1.RabbitmqCluster) corev1.Container
MountPath: "/var/lib/rabbitmq/mnesia/",
},
},
+ Env: append(envVarsK8sObjects(instance),
+ corev1.EnvVar{
+ Name: "HOSTNAME_DOMAIN",
+ Value: "$(MY_POD_NAME).$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE)",
+ }),
}
if instance.VaultDefaultUserSecretEnabled() {
```

However, this works only when the CoreDNS deployment is scaled down from 2 replicas to 1 replica. (I suppose this is because every CoreDNS pod has its own cache.)
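For experimentation only (not a production recommendation), the CoreDNS deployment can be scaled down so that only a single cache exists; the deployment name coredns in kube-system is again an assumption:

```
# Temporarily run a single CoreDNS pod (and therefore a single DNS cache).
kubectl -n kube-system scale deployment coredns --replicas=1

# Restore the original replica count once done testing.
kubectl -n kube-system scale deployment coredns --replicas=2
```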
Prior to this commit, global:sync/0 sometimes gets stuck, either when performing a rolling update on Kubernetes or when creating a new RabbitMQ cluster on Kubernetes. When performing a rolling update, the node being booted will be stuck in:

```
2022-07-26 10:49:58.891896+00:00 [debug] <0.226.0> == Plugins (prelaunch phase) ==
2022-07-26 10:49:58.891908+00:00 [debug] <0.226.0> Setting plugins up
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> Loading the following plugins: [cowlib,cowboy,rabbitmq_web_dispatch,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> rabbitmq_management_agent,amqp_client,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> rabbitmq_management,quantile_estimator,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> prometheus,rabbitmq_peer_discovery_common,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> accept,rabbitmq_peer_discovery_k8s,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> rabbitmq_prometheus]
2022-07-26 10:49:58.926373+00:00 [debug] <0.226.0> Feature flags: REFRESHING after applications load...
2022-07-26 10:49:58.926416+00:00 [debug] <0.372.0> Feature flags: registering controller globally before proceeding with task: refresh_after_app_load
2022-07-26 10:49:58.926450+00:00 [debug] <0.372.0> Feature flags: [global sync] @ [email protected]
```

During cluster creation, an example log of global:sync/0 being stuck can be found in bullet point 2 of #5331 (review).

When global:sync/0 is stuck, it never receives a message in line https://github.com/erlang/otp/blob/bd05b07f973f11d73c4fc77d59b69f212f121c2d/lib/kernel/src/global.erl#L2942

This issue can be observed in both `kind` and GKE. `kind` uses CoreDNS, GKE uses kube-dns. CoreDNS does not resolve the hostnames of the RabbitMQ node and its peers correctly for up to 30 seconds after node startup. This is because the default cache value of CoreDNS is 30 seconds and CoreDNS has a bug described in kubernetes/kubernetes#92559.

global:sync/0 is known to be buggy "in the presence of network failures" unless the kernel parameter `prevent_overlapping_partitions` is set to `true`.

When either:
1. setting the CoreDNS cache value to 1 second (see #5322 (comment) on how to set this value), or
2. setting the kernel parameter `prevent_overlapping_partitions` to `true`

rolling updates do NOT get stuck anymore.

This means we are hitting a combination of:
1. a Kubernetes DNS bug that does not update DNS caches promptly for headless services with `publishNotReadyAddresses: true`, and
2. an Erlang bug which causes global:sync/0 to hang forever in the presence of network failures.

The Erlang bug is fixed by setting `prevent_overlapping_partitions` to `true` (the default in Erlang/OTP 25). In RabbitMQ, however, we explicitly set `prevent_overlapping_partitions` to `false` because we fear other issues could arise if we set this parameter to `true`.

Luckily, to resolve this issue of global:sync/0 being stuck, we can just call the function rabbit_node_monitor:global_sync/0, which provides a workaround. This function was introduced 8 years ago in 9fcb31f.

With this commit applied, rolling updates are not stuck anymore and we see in the debug log the workaround sometimes being applied.
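The DNS propagation delay described above can be observed by polling the pod's cluster-internal name right after the pod (re)starts. This is a rough sketch; the pod, service, and namespace names are hypothetical, and it assumes the container image provides getent:

```
# Hypothetical names: pod rabbitmq-server-0, headless service rabbitmq-nodes,
# namespace default. Adjust to your cluster.
kubectl exec rabbitmq-server-0 -- sh -c \
  'until getent hosts rabbitmq-server-0.rabbitmq-nodes.default.svc.cluster.local; do
     echo waiting for DNS; sleep 1;
   done'
```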
Observed in Kubernetes. To reproduce: