Fix mnesia permissions in PV #501

Merged
merged 1 commit into from
Dec 3, 2020

Conversation

@wbagdon wbagdon commented Nov 25, 2020

Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed

Summary Of Changes

Sets the UID on /var/lib/rabbitmq/mnesia/

Additional Context

Clusters fail to start on Rancher 2.5.2 with Kubernetes 1.19.3 and the vSphere CSI driver. The server pod logs the following error:

Configuring logger redirection

18:25:52.996 [warning] Failed to write PID file "/var/lib/rabbitmq/mnesia/[email protected]": permission denied
18:25:53.316 [error] Failed to create Ra data directory at '/var/lib/rabbitmq/mnesia/[email protected]/quorum/[email protected]', file system operation error: enoent
18:25:53.317 [error] Supervisor ra_sup had child ra_system_sup started with ra_system_sup:start_link() at undefined exit with reason {error,"Ra could not create its data directory. See the log for details."} in context start_error
18:25:53.317 [error] CRASH REPORT Process <0.247.0> with 0 neighbours exited with reason: {error,"Ra could not create its data directory. See the log for details."} in ra_system_sup:init/1 line 43
18:25:53.318 [error] CRASH REPORT Process <0.241.0> with 0 neighbours exited with reason: {{shutdown,{failed_to_start_child,ra_system_sup,{error,"Ra could not create its data directory. See the log for details."}}},{ra_app,start,[normal,[]]}} in application_master:init/4 line 138
{"Kernel pid terminated",application_controller,"{application_start_failure,ra,{{shutdown,{failed_to_start_child,ra_system_sup,{error,\"Ra could not create its data directory. See the log for details.\"}}},{ra_app,start,[normal,[]]}}}"}

Kernel pid terminated (application_controller) ({application_start_failure,ra,{{shutdown,{failed_to_start_child,ra_system_sup,{error,"Ra could not create its data directory. See the log for details."}

Crash dump is being written to: erl_crash.dump...

Setting the owner UID on the mnesia directory allows the pod to write to the PV.
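
One way to confirm this kind of diagnosis is to compare the numeric owner of the data directory with the UID the RabbitMQ process runs as. The following is a minimal sketch, not part of the operator: it assumes the expected UID is 999 (the value used in the chown 999:999 commands quoted later in this thread), and owner_uid and check_owner are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the operator.

# Print the numeric UID that owns a path (third field of `ls -nd`).
owner_uid() {
    ls -nd "$1" 2>/dev/null | awk '{print $3}'
}

# Return 0 if the path is owned by the expected UID,
# 1 on a mismatch, 2 if the path cannot be inspected.
check_owner() {
    dir=$1
    expected=$2
    actual=$(owner_uid "$dir")
    if [ -z "$actual" ]; then
        echo "cannot inspect $dir" >&2
        return 2
    fi
    if [ "$actual" = "$expected" ]; then
        echo "ok: $dir is owned by UID $actual"
    else
        echo "mismatch: $dir is owned by UID $actual, expected $expected" >&2
        return 1
    fi
}

# Example (run wherever the volume is mounted; 999 is an assumption):
# check_owner /var/lib/rabbitmq/mnesia 999
```

A mismatch here would be consistent with the "permission denied" warning in the log above, though it does not rule out other causes such as restrictive directory modes.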

Local Testing

Please ensure you run the unit, integration and system tests before approving the PR.

To run the unit and integration tests:

$ make unit-tests integration-tests

You will need to target a k8s cluster and have the operator deployed for running the system tests.

For example, for a Kubernetes context named dev-bunny:

$ kubectx dev-bunny
$ make destroy deploy-dev
# wait for operator to be deployed
$ make system-tests

ChunyiLyu commented Nov 26, 2020

Hi @wbagdon, thanks for using the operator. We have seen related problems in other environments. Our suspicion is that we mount /var/lib/rabbitmq/ after mounting /var/lib/rabbitmq/mnesia/, which could mess up the permissions of the directory (related code).

I've updated the mount order and pushed the changes as a dev image. Could you test this operator image: rabbitmqoperator/cluster-operator-dev:e3152a and let me know if that fixes the deploy error that you are seeing?

wbagdon commented Nov 26, 2020

Thanks for taking a look at this @ChunyiLyu

I tried using the image you provided in the operator but I'm still having the same permissions issue when creating the hello world cluster.

Here's what I did in case I missed a step:

  1. kubectl delete -f https://raw.githubusercontent.com/rabbitmq/cluster-operator/main/docs/examples/hello-world/rabbitmq.yaml
     kubectl delete -f https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml
  2. Downloaded cluster-operator.yml and changed line 4123 to image: rabbitmqoperator/cluster-operator-dev:e3152a
  3. kubectl apply -f cluster-operator.yml
     kubectl apply -f https://raw.githubusercontent.com/rabbitmq/cluster-operator/main/docs/examples/hello-world/rabbitmq.yaml
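
Editing the manifest by line number is fragile, since the line moves between releases. The same substitution could be sketched with sed, assuming the operator image line has the form "image: rabbitmqoperator/cluster-operator:<tag>" (an assumption; check the downloaded file first). set_operator_image is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper for illustration.
# Rewrites the operator image reference in a downloaded manifest.
# Assumes (not verified here) the line looks like:
#   image: rabbitmqoperator/cluster-operator:<tag>
set_operator_image() {
    manifest=$1
    new_image=$2
    # Keep the indentation and the "image:" key, swap only the value.
    # '|' is the sed delimiter because image names contain '/'.
    sed "s|^\([[:space:]]*image:[[:space:]]*\)rabbitmqoperator/cluster-operator[^[:space:]]*|\1${new_image}|" \
        "$manifest" > "$manifest.tmp" && mv "$manifest.tmp" "$manifest"
}

# Example:
# set_operator_image cluster-operator.yml rabbitmqoperator/cluster-operator-dev:e3152a
# kubectl apply -f cluster-operator.yml
```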
    

Error is the same:

Configuring logger redirection

14:47:48.976 [warning] Failed to write PID file "/var/lib/rabbitmq/mnesia/[email protected]": permission denied
14:47:49.165 [error] Failed to create Ra data directory at '/var/lib/rabbitmq/mnesia/[email protected]/quorum/[email protected]', file system operation error: enoent
14:47:49.166 [error] Supervisor ra_sup had child ra_system_sup started with ra_system_sup:start_link() at undefined exit with reason {error,"Ra could not create its data directory. See the log for details."} in context start_error
14:47:49.166 [error] CRASH REPORT Process <0.247.0> with 0 neighbours exited with reason: {error,"Ra could not create its data directory. See the log for details."} in ra_system_sup:init/1 line 43
14:47:49.167 [error] CRASH REPORT Process <0.241.0> with 0 neighbours exited with reason: {{shutdown,{failed_to_start_child,ra_system_sup,{error,"Ra could not create its data directory. See the log for details."}}},{ra_app,start,[normal,[]]}} in application_master:init/4 line 138
Kernel pid terminated (application_controller) ({application_start_failure,ra,{{shutdown,{failed_to_start_child,ra_system_sup,{error,"Ra could not create its data directory. See the log for details."}
{"Kernel pid terminated",application_controller,"{application_start_failure,ra,{{shutdown,{failed_to_start_child,ra_system_sup,{error,\"Ra could not create its data directory. See the log for details.\"}}},{ra_app,start,[normal,[]]}}}"}

Crash dump is being written to: erl_crash.dump...

@ChunyiLyu

@wbagdon I see. Sorry that the change didn't fix the issue you are facing. This could be an infrastructure-related issue, as we have never seen this deploy error when testing on other k8s providers.

Before the team decides whether to accept the fix, I suggest you use our statefulSet override for now, both to verify that this change does fix the problem for you and to deploy in the meantime. The override lets you patch the statefulSet definition in whichever way suits your use case. For example, you can try to override the initContainer command like this:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: test
spec:
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers: []
            initContainers:
            - name: setup-container
              command:
              - sh
              - -c
              - cp /tmp/erlang-cookie-secret/.erlang.cookie /var/lib/rabbitmq/.erlang.cookie
                && chown 999:999 /var/lib/rabbitmq/.erlang.cookie && chmod 600 /var/lib/rabbitmq/.erlang.cookie
                ; cp /tmp/rabbitmq-plugins/enabled_plugins /operator/enabled_plugins &&
                chown 999:999 /operator/enabled_plugins ; chown 999:999 /var/lib/rabbitmq/mnesia/
                ; echo '[default]' > /var/lib/rabbitmq/.rabbitmqadmin.conf && sed -e 's/default_user/username/'
                -e 's/default_pass/password/' /tmp/default_user.conf >> /var/lib/rabbitmq/.rabbitmqadmin.conf
                && chown 999:999 /var/lib/rabbitmq/.rabbitmqadmin.conf && chmod 600 /var/lib/rabbitmq/.rabbitmqadmin.conf

I hope this helps :)

wbagdon commented Nov 30, 2020

Yes, I was able to get the cluster running using the override spec:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: cluster-dev
spec:
  service:
    type: NodePort
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers: []
            initContainers:
            - name: setup-container
              command:
              - sh
              - -c
              - cp /tmp/erlang-cookie-secret/.erlang.cookie /var/lib/rabbitmq/.erlang.cookie
                && chown 999:999 /var/lib/rabbitmq/.erlang.cookie
                && chmod 600 /var/lib/rabbitmq/.erlang.cookie ;
                cp /tmp/rabbitmq-plugins/enabled_plugins /operator/enabled_plugins
                && chown 999:999 /operator/enabled_plugins ;
                chown 999:999 /var/lib/rabbitmq/mnesia/ ;
                echo '[default]' > /var/lib/rabbitmq/.rabbitmqadmin.conf 
                && sed -e 's/default_user/username/' -e 's/default_pass/password/' /tmp/default_user.conf >> /var/lib/rabbitmq/.rabbitmqadmin.conf
                && chown 999:999 /var/lib/rabbitmq/.rabbitmqadmin.conf 
                && chmod 600 /var/lib/rabbitmq/.rabbitmqadmin.conf
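
The single sh -c string above is easier to audit when laid out one step per line. The following is a sketch of the same sequence, with two changes that are assumptions of this example rather than the operator's behaviour: all paths are taken relative to a hypothetical root prefix, and the owner is a parameter defaulting to 999:999, so the steps can be dry-run outside a pod. As in the original, the four groups are independent, so a failure in one does not stop the next.

```shell
#!/bin/sh
# Hypothetical function for illustration; not the operator's code.
# Usage: setup_rabbitmq_dirs ROOT [OWNER]
setup_rabbitmq_dirs() {
    root=$1
    owner=${2:-999:999}
    data=$root/var/lib/rabbitmq

    # 1. Erlang cookie: copy, hand to the RabbitMQ user, restrict to owner.
    cp "$root/tmp/erlang-cookie-secret/.erlang.cookie" "$data/.erlang.cookie" &&
        chown "$owner" "$data/.erlang.cookie" &&
        chmod 600 "$data/.erlang.cookie"

    # 2. Enabled plugins file.
    cp "$root/tmp/rabbitmq-plugins/enabled_plugins" "$root/operator/enabled_plugins" &&
        chown "$owner" "$root/operator/enabled_plugins"

    # 3. The step under discussion in this PR: give the data directory
    #    to the RabbitMQ user so it can create its PID file and Ra directory.
    chown "$owner" "$data/mnesia/"

    # 4. rabbitmqadmin config derived from the default user file.
    echo '[default]' > "$data/.rabbitmqadmin.conf" &&
        sed -e 's/default_user/username/' -e 's/default_pass/password/' \
            "$root/tmp/default_user.conf" >> "$data/.rabbitmqadmin.conf" &&
        chown "$owner" "$data/.rabbitmqadmin.conf" &&
        chmod 600 "$data/.rabbitmqadmin.conf"
}

# Example dry run against a scratch directory tree, using the current user:
# setup_rabbitmq_dirs /tmp/scratch "$(id -u):$(id -g)"
```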

@ChunyiLyu

A unit test is broken by this PR. Could you fix the test and clean up the branch history? I will merge the PR after that 😄

You can run the unit tests by changing into the repo directory and running make unit-tests. The test broken by the change is this one: https://github.com/rabbitmq/cluster-operator/blob/main/internal/resource/statefulset_test.go#L1085

Thanks for contributing!

wbagdon commented Dec 1, 2020

Thanks for the assist on the unit test, I believe everything is as requested.
Please let me know if you need anything else.

@ChunyiLyu ChunyiLyu merged commit 8f066d1 into rabbitmq:main Dec 3, 2020
@ChunyiLyu

@wbagdon merged and thanks for contributing 😃
