Use RabbitMQ user for init container instead of root #731

coro · 2021-06-14T13:04:51Z

This closes #357

Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed

Summary Of Changes

This changes the init container from running as root to running as the same specified RabbitMQ user - 999. This allows RabbitmqClusters to be deployed on environments with restrictive security policies that do not allow for root containers, while still retaining support for Openshift.

Additional Context

RabbitMQ requires group ownership of the Mnesia database, which is mounted as a persistent volume.

The root init container was added in #303, as a means of changing the GID of this volume on Openshift - in Openshift, we observed that the GID was not being modified by StatefulSet.Spec.SecurityContext.FSGroup, and so the volume had the wrong owner. We added the init container to set the GID of this volume correctly on all systems.

The default set of capabilities on Openshift's container runtime, CRI-O, as well as on containerd & Docker, includes SETGID, so one shouldn't need to run as root to perform these actions. In testing, the initContainer seems to handle running as the RabbitMQ user 999, and correctly sets the permissions & owner of the Mnesia volume on all three container runtimes (CRI-O tested on Openshift, containerd & Docker tested on GKE).

I'm not sure if something has changed to make this work, if we falsely assumed we needed root for this in the past, or if CRC (where we previously tested Openshift) uses a different default set of capabilities.

Local Testing

Please ensure you run the unit, integration and system tests before approving the PR.

To run the unit and integration tests:

$ make unit-tests integration-tests

You will need to target a k8s cluster and have the operator deployed for running the system tests.

For example, for a Kubernetes context named dev-bunny:

$ kubectx dev-bunny
$ make destroy deploy-dev
# wait for operator to be deployed
$ make system-tests

ansd · 2021-06-17T11:03:40Z

Thanks @coro.

With this change, we can omit section 3 (Create a Security Context Constraint that allows the RabbitMQ pod to have the capabilities FOWNER and CHOWN) and section 5 (Assign Security Context Constraint to Service Account) described in https://www.rabbitmq.com/kubernetes/operator/install-operator.html#openshift, correct?

I'm not sure if something has changed to make this work, if we falsely assumed we needed root for this in the past, or if CRC (where we previously tested Openshift) uses a different default set of capabilities.

I just tested this change with the latest CRC and the init container can't change ownership of /var/lib/rabbitmq/mnesia/ anymore:

> crc version
CodeReady Containers version: 1.28.0+08de64bd
OpenShift version: 4.7.13 (not embedded in executable)

> k get pods
NAME          READY   STATUS             RESTARTS   AGE
r1-server-0   0/1     CrashLoopBackOff   7          13m

> k logs r1-server-0 -c setup-container
chown: changing ownership of '/var/lib/rabbitmq/mnesia/': Operation not permitted

> k logs r1-server-0 -c rabbitmq
Configuring logger redirection
10:49:58.466 [warning] Failed to write PID file "/var/lib/rabbitmq/mnesia/[email protected]": permission denied
10:49:58.613 [error] Failed to create Ra data directory at '/var/lib/rabbitmq/mnesia/[email protected]/quorum/[email protected]', file system operation error: eacces
10:49:58.614 [error] Supervisor ra_sup had child ra_system_sup started with ra_system_sup:start_link() at undefined exit with reason {error,"Ra could not create its data directory. See the log for details."} in context start_error
10:49:58.614 [error] CRASH REPORT Process <0.248.0> with 0 neighbours exited with reason: {error,"Ra could not create its data directory. See the log for details."} in ra_system_sup:init/1 line 43
10:49:58.614 [error] CRASH REPORT Process <0.242.0> with 0 neighbours exited with reason: {{shutdown,{failed_to_start_child,ra_system_sup,{error,"Ra could not create its data directory. See the log for details."}}},{ra_app,start,[normal,[]]}} in application_master:init/4 line 142
{"Kernel pid terminated",application_controller,"{application_start_failure,ra,{{shutdown,{failed_to_start_child,ra_system_sup,{error,\"Ra could not create its data directory. See the log for details.\"}}},{ra_app,start,[normal,[]]}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,ra,{{shutdown,{failed_to_start_child,ra_system_sup,{error,"Ra could not create its data directory. See the log for details."}

Crash dump is being written to: erl_crash.dump...%

The K8s node of CRC uses cri-o container runtime:

  Container Runtime Version:                  cri-o://1.20.2-12.rhaos4.7.git9f7be76.el8

Are you getting the same behaviour on CRC?
If this issue applies only to CRC, and not to any other K8s runtime, I'm happy with this change.

coro · 2021-06-17T14:23:42Z

I just tested this with @ansd . We're observing that fsGroup is successfully applied on a real Openshift cluster hosted in GCP, meaning that we don't need the root user in the init container. This is likely because the CRC volume driver does not support the fsGroup flag, whereas the one used in the real cluster does.

We also noticed that the chown actions done by the initContainer were redundant on systems where the storage driver supports fsGroup.

Another thing we noticed is that the rabbitmq user 999 is not actually strictly necessary for the container to run. Replacing this number with an arbitrary value in runAsUser and fsGroup in the security contexts of the containers still led to the cluster being successfully created.

Use RabbitMQ user for init container instead of root

5f3d7fa

Zerpet approved these changes Jun 15, 2021

View reviewed changes

ansd assigned ansd and unassigned ansd Jun 17, 2021

ansd self-requested a review June 17, 2021 10:46

ansd removed their request for review June 17, 2021 11:04

ansd approved these changes Jun 17, 2021

View reviewed changes

coro merged commit 39ab8b1 into main Jun 17, 2021

coro deleted the non-root-initcon branch June 17, 2021 14:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use RabbitMQ user for init container instead of root #731

Use RabbitMQ user for init container instead of root #731

coro commented Jun 14, 2021

ansd commented Jun 17, 2021

coro commented Jun 17, 2021

Use RabbitMQ user for init container instead of root #731

Use RabbitMQ user for init container instead of root #731

Conversation

coro commented Jun 14, 2021

Summary Of Changes

Additional Context

Local Testing

ansd commented Jun 17, 2021

coro commented Jun 17, 2021