Fix mnesia permissions in PV #501

wbagdon · 2020-11-25T18:31:46Z

Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed

Summary Of Changes

Sets the UID on /var/lib/rabbitmq/mnesia/

Additional Context

Clusters fail to start on Rancher 2.5.2 (Kubernetes 1.19.3) using vSphere CSI, with the following error:

Configuring logger redirection

18:25:52.996 [warning] Failed to write PID file "/var/lib/rabbitmq/mnesia/[email protected]": permission denied
18:25:53.316 [error] Failed to create Ra data directory at '/var/lib/rabbitmq/mnesia/[email protected]/quorum/[email protected]', file system operation error: enoent
18:25:53.317 [error] Supervisor ra_sup had child ra_system_sup started with ra_system_sup:start_link() at undefined exit with reason {error,"Ra could not create its data directory. See the log for details."} in context start_error
18:25:53.317 [error] CRASH REPORT Process <0.247.0> with 0 neighbours exited with reason: {error,"Ra could not create its data directory. See the log for details."} in ra_system_sup:init/1 line 43
18:25:53.318 [error] CRASH REPORT Process <0.241.0> with 0 neighbours exited with reason: {{shutdown,{failed_to_start_child,ra_system_sup,{error,"Ra could not create its data directory. See the log for details."}}},{ra_app,start,[normal,[]]}} in application_master:init/4 line 138
{"Kernel pid terminated",application_controller,"{application_start_failure,ra,{{shutdown,{failed_to_start_child,ra_system_sup,{error,\"Ra could not create its data directory. See the log for details.\"}}},{ra_app,start,[normal,[]]}}}"}

Kernel pid terminated (application_controller) ({application_start_failure,ra,{{shutdown,{failed_to_start_child,ra_system_sup,{error,"Ra could not create its data directory. See the log for details."}

Crash dump is being written to: erl_crash.dump...

Setting the owner UID on the mnesia directory allows the pod to write to the PV.
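One way to confirm that the ownership is what the RabbitMQ process expects is to compare the directory's numeric owner with the expected UID. The following is a minimal sketch and not part of this PR: `owned_by` is a hypothetical helper, and UID 999 is an assumption taken from the `chown 999:999` commands quoted elsewhere in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of the operator): succeed only if DIR is
# owned by the given numeric UID.
# Usage: owned_by DIR UID
owned_by() {
    dir=$1
    want_uid=$2
    # `ls -ldn` prints the numeric owner UID in the third field.
    have_uid=$(ls -ldn -- "$dir" 2>/dev/null | awk '{print $3}')
    [ -n "$have_uid" ] && [ "$have_uid" = "$want_uid" ]
}

# Example, run inside the RabbitMQ container (UID 999 is an assumption):
# owned_by /var/lib/rabbitmq/mnesia 999 && echo "ownership ok"
```

If the check fails on a freshly provisioned volume, that would be consistent with the "permission denied" warning in the log above.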

Local Testing

Please ensure you run the unit, integration and system tests before approving the PR.

To run the unit and integration tests:

$ make unit-tests integration-tests

You will need to target a k8s cluster and have the operator deployed for running the system tests.

For example, for a Kubernetes context named dev-bunny:

$ kubectx dev-bunny
$ make destroy deploy-dev
# wait for operator to be deployed
$ make system-tests

ChunyiLyu · 2020-11-26T11:08:24Z

Hi @wbagdon, thanks for using the operator. We have seen related problems in other environments. Our suspicion is that we mount /var/lib/rabbitmq/ after mounting /var/lib/rabbitmq/mnesia/, which could mess up the permissions of the directory (related code).

I've updated the mount order and pushed the changes as a dev image. Could you test this operator image: rabbitmqoperator/cluster-operator-dev:e3152a and let me know if that fixes the deploy error that you are seeing?

wbagdon · 2020-11-26T14:53:18Z

Thanks for taking a look at this @ChunyiLyu

I tried using the image you provided in the operator but I'm still having the same permissions issue when creating the hello world cluster.

Here's what I did in case I missed a step:

 kubectl delete -f https://raw.githubusercontent.com/rabbitmq/cluster-operator/main/docs/examples/hello-world/rabbitmq.yaml
 kubectl delete -f https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml

Downloaded cluster-operator.yml and changed line 4123 to image: rabbitmqoperator/cluster-operator-dev:e3152a

 kubectl apply -f cluster-operator.yml
 kubectl apply -f https://raw.githubusercontent.com/rabbitmq/cluster-operator/main/docs/examples/hello-world/rabbitmq.yaml
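The manual edit of the image line in the steps above could also be scripted. This is only a sketch: `set_operator_image` is a hypothetical helper, and it assumes the manifest's image line has the form `image: rabbitmqoperator/cluster-operator...`, which may not hold for every release of the manifest.

```shell
#!/bin/sh
# Hypothetical helper: point a downloaded operator manifest at another image.
# Assumes the image line looks like "image: rabbitmqoperator/cluster-operator...".
# Usage: set_operator_image MANIFEST NEW_IMAGE
set_operator_image() {
    manifest=$1
    new_image=$2
    tmp=$(mktemp) || return 1
    # Keep the original indentation and replace only the image reference.
    sed "s|^\([[:space:]]*image:[[:space:]]*\)rabbitmqoperator/cluster-operator[^[:space:]]*|\1${new_image}|" \
        "$manifest" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back in place so the manifest keeps its file permissions.
    cat "$tmp" > "$manifest"
    rm -f "$tmp"
}

# Example:
# set_operator_image cluster-operator.yml rabbitmqoperator/cluster-operator-dev:e3152a
```

Matching on the image name rather than a fixed line number avoids depending on the exact layout of the downloaded file.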

Error is the same:

Configuring logger redirection

14:47:48.976 [warning] Failed to write PID file "/var/lib/rabbitmq/mnesia/[email protected]": permission denied
14:47:49.165 [error] Failed to create Ra data directory at '/var/lib/rabbitmq/mnesia/[email protected]/quorum/[email protected]', file system operation error: enoent
14:47:49.166 [error] Supervisor ra_sup had child ra_system_sup started with ra_system_sup:start_link() at undefined exit with reason {error,"Ra could not create its data directory. See the log for details."} in context start_error
14:47:49.166 [error] CRASH REPORT Process <0.247.0> with 0 neighbours exited with reason: {error,"Ra could not create its data directory. See the log for details."} in ra_system_sup:init/1 line 43
14:47:49.167 [error] CRASH REPORT Process <0.241.0> with 0 neighbours exited with reason: {{shutdown,{failed_to_start_child,ra_system_sup,{error,"Ra could not create its data directory. See the log for details."}}},{ra_app,start,[normal,[]]}} in application_master:init/4 line 138
Kernel pid terminated (application_controller) ({application_start_failure,ra,{{shutdown,{failed_to_start_child,ra_system_sup,{error,"Ra could not create its data directory. See the log for details."}
{"Kernel pid terminated",application_controller,"{application_start_failure,ra,{{shutdown,{failed_to_start_child,ra_system_sup,{error,\"Ra could not create its data directory. See the log for details.\"}}},{ra_app,start,[normal,[]]}}}"}

Crash dump is being written to: erl_crash.dump...

ChunyiLyu · 2020-11-26T17:27:54Z

@wbagdon I see. Sorry that the change didn't fix the issue you are facing. This could be an infrastructure-related issue, as we have never seen this deploy error when testing on other k8s providers.

Before the team decides whether to accept the fix, I suggest that you use our statefulSet override for now, both to verify that this is indeed a fix for you and to deploy. You can use the statefulSet override to patch the statefulSet definition in whichever way suits your use case. For example, you can try to override the initContainer commands like this:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: test
spec:
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers: []
            initContainers:
            - name: setup-container
              command:
              - sh
              - -c
              - cp /tmp/erlang-cookie-secret/.erlang.cookie /var/lib/rabbitmq/.erlang.cookie
                && chown 999:999 /var/lib/rabbitmq/.erlang.cookie && chmod 600 /var/lib/rabbitmq/.erlang.cookie
                ; cp /tmp/rabbitmq-plugins/enabled_plugins /operator/enabled_plugins &&
                chown 999:999 /operator/enabled_plugins ; chown 999:999 /var/lib/rabbitmq/mnesia/
                ; echo '[default]' > /var/lib/rabbitmq/.rabbitmqadmin.conf && sed -e 's/default_user/username/'
                -e 's/default_pass/password/' /tmp/default_user.conf >> /var/lib/rabbitmq/.rabbitmqadmin.conf
                && chown 999:999 /var/lib/rabbitmq/.rabbitmqadmin.conf && chmod 600 /var/lib/rabbitmq/.rabbitmqadmin.conf

I hope this helps :)

wbagdon · 2020-11-30T13:01:50Z

Yes, I was able to get the cluster running using the override spec:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: cluster-dev
spec:
  service:
    type: NodePort
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers: []
            initContainers:
            - name: setup-container
              command:
              - sh
              - -c
              - cp /tmp/erlang-cookie-secret/.erlang.cookie /var/lib/rabbitmq/.erlang.cookie
                && chown 999:999 /var/lib/rabbitmq/.erlang.cookie
                && chmod 600 /var/lib/rabbitmq/.erlang.cookie ;
                cp /tmp/rabbitmq-plugins/enabled_plugins /operator/enabled_plugins
                && chown 999:999 /operator/enabled_plugins ;
                chown 999:999 /var/lib/rabbitmq/mnesia/ ;
                echo '[default]' > /var/lib/rabbitmq/.rabbitmqadmin.conf 
                && sed -e 's/default_user/username/' -e 's/default_pass/password/' /tmp/default_user.conf >> /var/lib/rabbitmq/.rabbitmqadmin.conf
                && chown 999:999 /var/lib/rabbitmq/.rabbitmqadmin.conf 
                && chmod 600 /var/lib/rabbitmq/.rabbitmqadmin.conf

ChunyiLyu · 2020-11-30T14:59:40Z

The unit test is broken because of the PR. Could you fix the test and clean up the branch history? I will merge the PR after that 😄

You can run the unit tests by changing into the repo directory and running make unit-tests. The test that's broken by the change is this one: https://github.com/rabbitmq/cluster-operator/blob/main/internal/resource/statefulset_test.go#L1085

Thanks for contributing!

wbagdon · 2020-12-01T14:46:03Z

Thanks for the assist on the unit test. I believe everything is as requested.
Please let me know if you need anything else.

ChunyiLyu · 2020-12-03T12:07:50Z

@wbagdon merged and thanks for contributing 😃

Fix permissions on mnesia

d1d7e9b

wbagdon force-pushed the patch-1 branch from 2cd50df to d1d7e9b Compare December 1, 2020 14:19

ChunyiLyu merged commit 8f066d1 into rabbitmq:main Dec 3, 2020
