
Sporadic error during startup: no_epmd_port #5322

Closed
ansd opened this issue Jul 25, 2022 · 3 comments · Fixed by #5331

ansd (Member) commented Jul 25, 2022

Observed in Kubernetes. To reproduce:

kind create cluster --image kindest/node:v1.21.12@sha256:f316b33dd88f8196379f38feb80545ef3ed44d9197dca1bfd48bcb1583210207
kubectl rabbitmq install-cluster-operator
kubectl rabbitmq create my-rabbit --image rabbitmq:3.10.6-management --replicas 5
kubectl get pods -lapp.kubernetes.io/component=rabbitmq,app.kubernetes.io/name=my-rabbit --watch

# Wait until all pods are READY.
# If you don't see a RESTART, delete that RabbitMQ cluster and create it again:
kubectl rabbitmq delete my-rabbit

# Wait a few seconds until all pods are deleted.
# In case the pods are stuck in STATUS 'Terminating':
kubectl delete pods -lapp.kubernetes.io/name=my-rabbit --force=true --grace-period=0

# For me, a RESTART already occurs after the 2nd cluster creation.
# In the output below, my-rabbit-server-0 was restarted (but it can be any other pod).
kubectl rabbitmq create my-rabbit --image rabbitmq:3.10.6-management --replicas 5
kubectl get pods -lapp.kubernetes.io/component=rabbitmq,app.kubernetes.io/name=my-rabbit --watch

NAME                 READY   STATUS    RESTARTS   AGE
my-rabbit-server-0   0/1     Running   0          35s
my-rabbit-server-1   1/1     Running   0          35s
my-rabbit-server-2   0/1     Running   0          35s
my-rabbit-server-3   1/1     Running   0          35s
my-rabbit-server-4   1/1     Running   0          35s
my-rabbit-server-2   1/1     Running   0          36s
my-rabbit-server-0   0/1     Completed   0          42s
my-rabbit-server-0   0/1     Running     1          43s
my-rabbit-server-0   1/1     Running     1          54s

# To see the logs of the failed container:
kubectl logs my-rabbit-server-0 -c rabbitmq --previous

...
BOOT FAILED
===========
Error during startup: {error,no_epmd_port}

2022-07-22 16:28:12.538492+00:00 [error] <0.627.0>
2022-07-22 16:28:12.538492+00:00 [error] <0.627.0> BOOT FAILED
2022-07-22 16:28:12.538492+00:00 [error] <0.627.0> ===========
2022-07-22 16:28:12.538492+00:00 [error] <0.627.0> Error during startup: {error,no_epmd_port}
2022-07-22 16:28:12.538492+00:00 [error] <0.627.0>
ansd changed the title from "Intermittent error during startup: no_epmd_port" to "Sporadic error during startup: no_epmd_port" on Jul 25, 2022
HoloRin (Contributor) commented Jul 25, 2022

Related: #2722

ansd (Member, Author) commented Jul 25, 2022

Also related: #4233

It's worth mentioning that we've observed all these DNS-related issues only in some K8s environments, especially in kind.

ansd (Member, Author) commented Jul 26, 2022

The issues referenced above are all symptoms of the following CoreDNS caching behavior:

On the headless service of the RabbitMQ cluster, the rabbitmq/cluster-operator sets publishNotReadyAddresses to true for the purpose of peer discovery, since RabbitMQ nodes need to talk to their peer nodes early during Pod startup. However, with CoreDNS's default cache value of 30 seconds, it can take up to 30 seconds until CoreDNS resolves the Pods' hostnames correctly.
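
To observe the stale resolution directly, you can query a peer's hostname from inside a freshly (re)created Pod. A minimal sketch, assuming the operator's default headless service name my-rabbit-nodes in namespace default:

kubectl exec my-rabbit-server-0 -c rabbitmq -- \
  getent hosts my-rabbit-server-1.my-rabbit-nodes.default.svc.cluster.local
# Right after Pod creation this often returns nothing because the negative answer is cached;
# once the CoreDNS cache entry expires (up to 30 seconds), it returns the Pod IP.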

Therefore we had to add all kinds of retries within the RabbitMQ code.

In kind (kindest/node:v1.24.0), I can't reproduce #5322 (comment) anymore once I set the CoreDNS cache value to 2.

To try it out, you can edit the coredns ConfigMap in namespace kube-system directly. Make sure that reload is present and also add log to the config so you can see when both CoreDNS Pods reload the config. The reload takes up to 2 minutes. For example:

apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
           max_concurrent 1000
        }
        cache 2
        log
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2022-07-26T17:44:22Z"
  name: coredns
  namespace: kube-system
  resourceVersion: "536"
  uid: 00874286-42cb-4310-a9e0-d758d6c6f8e6
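
To apply and verify the change, edit the ConfigMap in place and follow the CoreDNS logs until both Pods have reloaded. A sketch, assuming the standard k8s-app=kube-dns label on the CoreDNS Pods:

kubectl -n kube-system edit configmap coredns
kubectl -n kube-system logs -l k8s-app=kube-dns -f
# With the log plugin enabled, DNS queries for the RabbitMQ Pods show up here,
# which makes it easy to tell when the lower cache value has taken effect.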

Reducing the cache TTL is recommended in https://github.com/kubernetes/website/pull/22446/files. Prior to switching the CoreDNS default cache value to 30, it was set to 5.
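
For reference, you can check which cache and ttl values your cluster currently uses with:

kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}' | grep -E 'cache|ttl'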

Alternative workaround

Since this is a Kubernetes CoreDNS issue, instead of adding retries into RabbitMQ, and because the rabbitmq/cluster-operator already defines an init container, we could alternatively just wait in that init container until CoreDNS resolves correctly.
The following patch against the current rabbitmq/cluster-operator main branch works (provided the host tool is installed in the init container image):

diff --git a/internal/resource/statefulset.go b/internal/resource/statefulset.go
index 21a90c2..b87c108 100644
--- a/internal/resource/statefulset.go
+++ b/internal/resource/statefulset.go
@@ -722,7 +722,8 @@ func setupContainer(instance *rabbitmqv1beta1.RabbitmqCluster) corev1.Container
                                "cp /tmp/rabbitmq-plugins/enabled_plugins /operator/enabled_plugins ; " +
                                "echo '[default]' > /var/lib/rabbitmq/.rabbitmqadmin.conf " +
                                "&& sed -e 's/default_user/username/' -e 's/default_pass/password/' %s >> /var/lib/rabbitmq/.rabbitmqadmin.conf " +
-                               "&& chmod 600 /var/lib/rabbitmq/.rabbitmqadmin.conf",
+                               "&& chmod 600 /var/lib/rabbitmq/.rabbitmqadmin.conf ; " +
+                               "until host \"$HOSTNAME_DOMAIN\"; do echo waiting for 'host \"$HOSTNAME_DOMAIN\"' to succeed; sleep 5; done",
                },
                Resources: corev1.ResourceRequirements{
                        Limits: corev1.ResourceList{
@@ -756,6 +757,11 @@ func setupContainer(instance *rabbitmqv1beta1.RabbitmqCluster) corev1.Container
                                MountPath: "/var/lib/rabbitmq/mnesia/",
                        },
                },
+               Env: append(envVarsK8sObjects(instance),
+                       corev1.EnvVar{
+                               Name:  "HOSTNAME_DOMAIN",
+                               Value: "$(MY_POD_NAME).$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE)",
+                       }),
        }

        if instance.VaultDefaultUserSecretEnabled() {

However, this works only when the CoreDNS deployment is scaled down from 2 replicas to 1 replica. (I suppose this is because every CoreDNS pod has its own cache.)
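
For completeness, scaling CoreDNS down to a single replica for this experiment (fine in a throwaway kind cluster, not something to do in production):

kubectl -n kube-system scale deployment coredns --replicas=1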

ansd added a commit that referenced this issue Aug 4, 2022
Prior to this commit, global:sync/0 sometimes gets stuck when either
performing a rolling update on Kubernetes or when creating a new
RabbitMQ cluster on Kubernetes.

When performing a rolling update, the node being booted will be stuck
in:
```
2022-07-26 10:49:58.891896+00:00 [debug] <0.226.0> == Plugins (prelaunch phase) ==
2022-07-26 10:49:58.891908+00:00 [debug] <0.226.0> Setting plugins up
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> Loading the following plugins: [cowlib,cowboy,rabbitmq_web_dispatch,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 rabbitmq_management_agent,amqp_client,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 rabbitmq_management,quantile_estimator,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 prometheus,rabbitmq_peer_discovery_common,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 accept,rabbitmq_peer_discovery_k8s,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0>                                 rabbitmq_prometheus]
2022-07-26 10:49:58.926373+00:00 [debug] <0.226.0> Feature flags: REFRESHING after applications load...
2022-07-26 10:49:58.926416+00:00 [debug] <0.372.0> Feature flags: registering controller globally before proceeding with task: refresh_after_app_load
2022-07-26 10:49:58.926450+00:00 [debug] <0.372.0> Feature flags: [global sync] @ [email protected]
```

During cluster creation, an example log of global:sync/0 being stuck can
be found in bullet point 2 of
#5331 (review)

When global:sync/0 is stuck, it never receives a message in this line:
https://github.com/erlang/otp/blob/bd05b07f973f11d73c4fc77d59b69f212f121c2d/lib/kernel/src/global.erl#L2942

This issue can be observed in both `kind` and GKE.

`kind` uses CoreDNS; GKE uses kube-dns.

CoreDNS does not resolve the hostname of RabbitMQ and its peers
correctly for up to 30 seconds after node startup.
This is because the default cache value of CoreDNS is 30 seconds and
CoreDNS has a bug described in
kubernetes/kubernetes#92559

global:sync/0 is known to be buggy "in the presence of network failures"
unless the kernel parameter `prevent_overlapping_partitions` is set to `true`.

When either:
1. setting the CoreDNS cache value to 1 second (see
#5322 (comment)
on how to set this value), or
2. setting the kernel parameter `prevent_overlapping_partitions` to `true`

rolling updates do NOT get stuck anymore.

This means we are hitting a combination of:
1. a Kubernetes DNS bug that does not update DNS caches promptly for headless
   services with `publishNotReadyAddresses: true`, and
2. an Erlang bug which causes global:sync/0 to hang forever in the presence
   of network failures.

The Erlang bug is fixed by setting `prevent_overlapping_partitions` to `true` (default in Erlang/OTP 25).

In RabbitMQ, however, we explicitly set `prevent_overlapping_partitions`
to `false` because other issues may arise if we set this parameter to
`true`.

Luckily, to resolve this issue of global:sync/0 being stuck, we can just
call the function rabbit_node_monitor:global_sync/0, which provides a
workaround.

With this commit applied, rolling updates no longer get stuck, and we
can see in the debug log that the workaround is sometimes applied.
ansd added a commit that referenced this issue Aug 4, 2022
Same commit message as above, additionally noting that rabbit_node_monitor:global_sync/0 was introduced 8 years ago in 9fcb31f.
mergify bot pushed a commit that referenced this issue Aug 4, 2022
Same commit message as above.

(cherry picked from commit 4bf78d8)
ansd added a commit that referenced this issue Aug 4, 2022
Same commit message as above.

(cherry picked from commit 4bf78d8)