
RabbitMQ Cluster failed with 0/2 nodes are available: 2 pod has unbound immediate PersistentVolumeClaims on Kubernetes #752

Closed
sathishc58 opened this issue Jun 30, 2021 · 8 comments
Labels
closed-stale Issue or PR closed due to long period of inactivity stale Issue or PR with long period of inactivity

Comments


sathishc58 commented Jun 30, 2021

I am trying to install the RabbitMQ Cluster Operator and a RabbitMQ cluster on a Kubernetes cluster (bare-metal server) by following the steps below.

OS Version: CentOS Linux 7 (Core)
[root@re-ctrl01 tmp]# kubectl version

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:25:06Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}

[root@re-ctrl01 tmp]# kubectl apply -f https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml

Warning: Detected changes to resource rabbitmqclusters.rabbitmq.com which is currently being deleted.
customresourcedefinition.apiextensions.k8s.io/rabbitmqclusters.rabbitmq.com configured
serviceaccount/rabbitmq-cluster-operator created
role.rbac.authorization.k8s.io/rabbitmq-cluster-leader-election-role created
clusterrole.rbac.authorization.k8s.io/rabbitmq-cluster-operator-role created
rolebinding.rbac.authorization.k8s.io/rabbitmq-cluster-leader-election-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/rabbitmq-cluster-operator-rolebinding created
deployment.apps/rabbitmq-cluster-operator created

[root@re-ctrl01 tmp]# kubectl describe pod/test-rabbitmq-cluster-server-0 -n rabbitmq-system

Name:           test-rabbitmq-cluster-server-0
Namespace:      rabbitmq-system
Priority:       0
Node:           <none>
Labels:         app.kubernetes.io/component=rabbitmq
                app.kubernetes.io/name=test-rabbitmq-cluster
                app.kubernetes.io/part-of=rabbitmq
                controller-revision-hash=test-rabbitmq-cluster-server-5bff99dbf9
                statefulset.kubernetes.io/pod-name=test-rabbitmq-cluster-server-0
Annotations:    prometheus.io/port: 15692
                prometheus.io/scrape: true
Status:         Pending
IP:
IPs:            <none>
Controlled By:  StatefulSet/test-rabbitmq-cluster-server
Init Containers:
  setup-container:
    Image:      rabbitmq:3.8.16-management
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
      cp /tmp/erlang-cookie-secret/.erlang.cookie /var/lib/rabbitmq/.erlang.cookie && chown 999:999 /var/lib/rabbitmq/.erlang.cookie && chmod 600 /var/lib/rabbitmq/.erlang.cookie ; cp /tmp/rabbitmq-plugins/enabled_plugins /operator/enabled_plugins && chown 999:999 /operator/enabled_plugins ; chown 999:999 /var/lib/rabbitmq/mnesia/ ; echo '[default]' > /var/lib/rabbitmq/.rabbitmqadmin.conf && sed -e 's/default_user/username/' -e 's/default_pass/password/' /tmp/default_user.conf >> /var/lib/rabbitmq/.rabbitmqadmin.conf && chown 999:999 /var/lib/rabbitmq/.rabbitmqadmin.conf && chmod 600 /var/lib/rabbitmq/.rabbitmqadmin.conf
    Limits:
      cpu:     100m
      memory:  500Mi
    Requests:
      cpu:        100m
      memory:     500Mi
    Environment:  <none>
    Mounts:
      /operator from rabbitmq-plugins (rw)
      /tmp/default_user.conf from rabbitmq-confd (rw,path="default_user.conf")
      /tmp/erlang-cookie-secret/ from erlang-cookie-secret (rw)
      /tmp/rabbitmq-plugins/ from plugins-conf (rw)
      /var/lib/rabbitmq/ from rabbitmq-erlang-cookie (rw)
      /var/lib/rabbitmq/mnesia/ from persistence (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fz87g (ro)
Containers:
  rabbitmq:
    Image:       rabbitmq:3.8.16-management
    Ports:       4369/TCP, 5672/TCP, 15672/TCP, 15692/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP, 0/TCP
    Limits:
      cpu:     2
      memory:  2Gi
    Requests:
      cpu:      1
      memory:   2Gi
    Readiness:  tcp-socket :amqp delay=10s timeout=5s period=10s #success=1 #failure=3
    Environment:
      MY_POD_NAME:                    test-rabbitmq-cluster-server-0 (v1:metadata.name)
      MY_POD_NAMESPACE:               rabbitmq-system (v1:metadata.namespace)
      RABBITMQ_ENABLED_PLUGINS_FILE:  /operator/enabled_plugins
      K8S_SERVICE_NAME:               test-rabbitmq-cluster-nodes
      RABBITMQ_USE_LONGNAME:          true
      RABBITMQ_NODENAME:              rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE)
      K8S_HOSTNAME_SUFFIX:            .$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE)
    Mounts:
      /etc/pod-info/ from pod-info (rw)
      /etc/rabbitmq/conf.d/10-operatorDefaults.conf from rabbitmq-confd (rw,path="operatorDefaults.conf")
      /etc/rabbitmq/conf.d/11-default_user.conf from rabbitmq-confd (rw,path="default_user.conf")
      /etc/rabbitmq/conf.d/90-userDefinedConfiguration.conf from rabbitmq-confd (rw,path="userDefinedConfiguration.conf")
      /operator from rabbitmq-plugins (rw)
      /var/lib/rabbitmq/ from rabbitmq-erlang-cookie (rw)
      /var/lib/rabbitmq/mnesia/ from persistence (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fz87g (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  persistence:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  persistence-test-rabbitmq-cluster-server-0
    ReadOnly:   false
  plugins-conf:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      test-rabbitmq-cluster-plugins-conf
    Optional:  false
  rabbitmq-confd:
    Type:                Projected (a volume that contains injected data from multiple sources)
    SecretName:          test-rabbitmq-cluster-default-user
    SecretOptionalName:  <nil>
    ConfigMapName:       test-rabbitmq-cluster-server-conf
    ConfigMapOptional:   <nil>
  rabbitmq-erlang-cookie:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  erlang-cookie-secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  test-rabbitmq-cluster-erlang-cookie
    Optional:    false
  rabbitmq-plugins:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  pod-info:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels['skipPreStopChecks'] -> skipPreStopChecks
  kube-api-access-fz87g:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  30s (x2 over 32s)  default-scheduler  0/2 nodes are available: 2 pod has unbound immediate PersistentVolumeClaims.

Though the pod was created, it failed to schedule with 0/2 nodes are available: 2 pod has unbound immediate PersistentVolumeClaims.

[root@re-ctrl01 tmp]# kubectl get all -n rabbitmq-system                    
NAME                                             READY   STATUS    RESTARTS   AGE
pod/rabbitmq-cluster-operator-5b9587d6bd-r6fnx   1/1     Running   0          108m
pod/test-rabbitmq-cluster-server-0               0/1     Pending   0          103m

  1. Should we create the PVCs manually and assign them?
  2. How do we add hostAliases (to add an IP entry to /etc/hosts of the rabbitmq:3.8.16-management image) through YAML?
@sathishc58 (Author)

I tried installing the RabbitMQ Operator and a RabbitMQ cluster using Docker Desktop's embedded Kubernetes on Windows.

I repeated the steps mentioned above, i.e. installed the cluster-operator and created a RabbitMQ cluster.

Though the RabbitMQ cluster on Windows generated the following warning, the pod came up successfully:

Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  8m31s  default-scheduler  0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
  Normal   Scheduled         8m29s  default-scheduler  Successfully assigned rabbitmq-system/testrmc-server-0 to docker-desktop
  Normal   Pulling           8m29s  kubelet            Pulling image "rabbitmq:3.8.18-management"
  Normal   Pulled            2m1s   kubelet            Successfully pulled image "rabbitmq:3.8.18-management" in 6m28.1561387s
  Normal   Created           2m     kubelet            Created container setup-container
  Normal   Started           2m     kubelet            Started container setup-container
  Normal   Pulled            2m     kubelet            Container image "rabbitmq:3.8.18-management" already present on machine
  Normal   Created           2m     kubelet            Created container rabbitmq
  Normal   Started           119s   kubelet            Started container rabbitmq

PS C:\RMO> kubectl get all -n rabbitmq-system

NAME                                             READY   STATUS    RESTARTS   AGE
pod/rabbitmq-cluster-operator-5b4b795998-kpmq5   1/1     Running   0          14m
pod/testrmc-server-0                             1/1     Running   0          9m42s

Could someone kindly let me know whether this issue is OS-dependent, or whether I am doing something wrong on CentOS 7?

coro (Contributor) commented Jun 30, 2021

Firstly, this is probably not related to your issue, but around the time you raised it we released a new operator version; as a heads up, if you run kubectl apply -f https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml now, you will likely get the newer version.


Seeing that error message on only one of the two Kubernetes distributions makes me suspect that dynamic provisioning isn't set up on the CentOS cluster.

When setting up your CentOS Kubernetes cluster, did you define any storage classes, or enable the DefaultStorageClass admission controller? If you did neither, your cluster won't know how to automatically reserve disk for persistent volumes through dynamic provisioning, and you would have to create the PersistentVolumes yourself for the operator's PVCs to bind to.

You have two options when deploying:

  • Enable the DefaultStorageClass admission controller and set a storageClass as default, leaving RabbitmqClusters to use that storage class to provision persistent volumes
  • Manually create a storageClass, and tell the operator to use it for PVCs by setting it in RabbitmqCluster.Spec.Persistence.StorageClassName
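
For illustration, a minimal sketch of both options. The names and the provisioner below are assumptions for the example, not values from your cluster; the provisioner must match something actually installed (e.g. a CSI driver or the local-path provisioner):

```yaml
# Option 1: a StorageClass marked as the cluster default, so PVCs with
# no storageClassName (including the operator's) bind to it automatically.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path                      # illustrative name
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rancher.io/local-path      # assumes this provisioner is installed
volumeBindingMode: WaitForFirstConsumer
---
# Option 2: reference the class explicitly in the RabbitmqCluster spec.
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: test-rabbitmq-cluster
spec:
  persistence:
    storageClassName: local-path
    storage: 10Gi
```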

It might be useful to look at the output of kubectl describe pvc persistence-test-rabbitmq-cluster-server-0 -n rabbitmq-system as well.

coro (Contributor) commented Jun 30, 2021

As for your second question, you can use the statefulSet override feature to add additional fields to the StatefulSet template. In your case, that would look something like:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: my-override-rabbit
spec:
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers: []
            hostAliases:
            - ip: "127.0.0.1"
              hostnames:
              - "foo.local"
              - "bar.local"
            - ip: "10.1.2.3"
              hostnames:
              - "foo.remote"
              - "bar.remote"

@sathishc58 (Author)

@coro Thank you for responding quickly. I will try this and update you tomorrow, as it's late here.

sathishc58 (Author) commented Jul 1, 2021

On Windows

Despite the 'unbound immediate PersistentVolumeClaims' warning and readiness-probe failures, Kubernetes on Windows opened the required ports and kept the pod running:

PS C:\WINDOWS\system32> kubectl describe pod testrmc-server-0 -n rabbitmq-system | more

Name:         testrmc-server-0
Namespace:    rabbitmq-system
Priority:     0
Node:         docker-desktop/192.168.xx.xx
  Normal   Pulled          40m   kubelet  Container image "rabbitmq:3.8.18-management" already present on machine
  Normal   Created         40m   kubelet  Created container rabbitmq
  Normal   Started         40m   kubelet  Started container rabbitmq
  Warning  Unhealthy       40m   kubelet  Readiness probe failed: dial tcp 10.1.0.144:5672: connect: connection refused

On CentOS 7, where I am running a Kubernetes cluster with 1 master and 1 worker node:

I see 3 different errors (in the logs, in the describe output, and in the cluster definition) but am not sure how to resolve them:

[root@re-ctrl01 tmp]# kubectl logs definition-server-0 -n rabbitmq-system (trimmed output)

10:06:55.721 [error] BOOT FAILED
10:06:55.721 [error] ===========
10:06:55.721 [error] ERROR: epmd error for host definition-server-0.definition-nodes.rabbitmq-system: nxdomain (non-existing domain)

[root@re-ctrl01 tmp]# kubectl describe pod/definition-server-0 -n rabbitmq-system (trimmed output)

Name:         definition-server-0
Namespace:    rabbitmq-system
Priority:     0
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Created    9m21s (x3 over 11m)    kubelet            Created container rabbitmq
  Normal   Started    9m21s (x3 over 11m)    kubelet            Started container rabbitmq
  Warning  BackOff    8m25s (x3 over 9m34s)  kubelet            Back-off restarting failed container
  Warning  Unhealthy  5m55s (x17 over 10m)   kubelet            Readiness probe failed: dial tcp 10.244.0.173:5672: connect: connection refused
  Normal   Pulled     50s (x7 over 11m)      kubelet            Container image "rabbitmq:3.8.16-management" already present on machine

[root@re-ctrl01 tmp]# kubectl describe rabbitmqcluster.rabbitmq.com/definition -n rabbitmq-system

Name:         definition
Namespace:    rabbitmq-system
Labels:       <none>
Annotations:  <none>
API Version:  rabbitmq.com/v1beta1
Kind:         RabbitmqCluster
Status:
  Binding:
    Name:  definition-default-user
  Conditions:
    Last Transition Time:  2021-07-01T08:55:34Z
    Message:               0/1 Pods ready
    Reason:                NotAllPodsReady
    Status:                False
    Type:                  AllReplicasReady
    Last Transition Time:  2021-07-01T08:55:34Z
    Message:               The service has no endpoints available
    Reason:                NoEndpointsAvailable
    Status:                False
    Type:                  ClusterAvailable
    Last Transition Time:  2021-07-01T08:55:34Z
Events:
  Type    Reason            Age                   From                        Message
  ----    ------            ----                  ----                        -------
  Normal  SuccessfulUpdate  2m1s (x304 over 77m)  rabbitmqcluster-controller  updated resource definition-nodes of Type *v1.Service

The following outputs are also included for your reference:

[root@re-ctrl01 ~]# kubectl exec pod/definition-server-0 -n rabbitmq-system -- tac /etc/hosts

Defaulted container "rabbitmq" out of: rabbitmq, setup-container (init)
10.244.0.169    definition-server-0.definition-nodes.rabbitmq-system.svc.cluster.local        definition-server-0

[root@re-devk8s-ctrl01 ~]# kubectl get pv

NAME              CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                             STORAGECLASS    REASON   AGE
rbmqpv            1Gi        RWO            Retain           Bound    rabbitmq-system/persistence-definition-server-0   local-storage            84m

[root@re-ctrl01 ~]# kubectl get pvc -n rabbitmq-system

NAME                              STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS    AGE
persistence-definition-server-0   Bound    rbmqpv   1Gi        RWO            local-storage   84m
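
As an aside on the local-storage setup above: statically provisioned local PersistentVolumes require a nodeAffinity pinning them to the node where the path exists, and a local StorageClass is usually created with volumeBindingMode: WaitForFirstConsumer; with the default Immediate mode, the scheduler reports 'pod has unbound immediate PersistentVolumeClaims' until the claim binds. A minimal sketch matching the names above (the path and node name are illustrative assumptions, not taken from this cluster):

```yaml
# Sketch of a statically provisioned local PV like rbmqpv above.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: rbmqpv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/rabbitmq-data            # illustrative path on the node
  nodeAffinity:                         # required for local volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-node-1         # illustrative node name
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # static provisioning only
volumeBindingMode: WaitForFirstConsumer
```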

Kindly let me know if I have done something wrong.

@sathishc58 (Author)

I added hostAliases entries to the /etc/hosts file of the rabbitmq container, which resolved the issue, but I am not sure whether this is the correct approach:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: definition
  namespace: rabbitmq-system
spec:
  replicas: 1
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers: []
            hostAliases:
            - ip: "127.0.0.1"
              hostnames:
              - "definition-server-0"
              - "definition-server-0.definition-nodes.rabbitmq-system"

Output of the rabbitmq pod:

[root@re-devk8s-control01 ~]# kubectl describe pod/definition-server-0 -n rabbitmq-system
Name:         definition-server-0
Namespace:    rabbitmq-system
Priority:     0
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Created    31m                kubelet            Created container rabbitmq
  Normal   Started    31m                kubelet            Started container rabbitmq
  Warning  Unhealthy  30m (x5 over 31m)  kubelet            Readiness probe failed: dial tcp 10.244.0.160:5672: connect: connection refused

Though the readiness probe failed, I was able to test the connectivity:

[root@re-devk8s-control01 ~]# telnet 10.244.0.160 5672                        
Trying 10.244.0.160...
Connected to 10.244.0.160.
Escape character is '^]'.

If possible, kindly let me know why the readiness probe fails.

@github-actions

This issue has been marked as stale due to 60 days of inactivity. Stale issues will be closed after a further 30 days of inactivity; please remove the stale label in order to prevent this occurring.

@github-actions github-actions bot added the stale Issue or PR with long period of inactivity label Aug 31, 2021
@github-actions

Closing stale issue due to further inactivity.
