Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use RabbitMQ user for init container instead of root #731

Merged
merged 1 commit into from
Jun 17, 2021
Merged

Conversation

coro
Copy link
Contributor

@coro coro commented Jun 14, 2021

This closes #357

Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed

Summary Of Changes

This changes the init container from running as root to running as the same specified RabbitMQ user - 999. This allows RabbitmqClusters to be deployed on environments with restrictive security policies that do not allow for root containers, while still retaining support for Openshift.

Additional Context

RabbitMQ requires group ownership of the Mnesia database, which is mounted as a persistent volume.

The root init container was added in #303, as a means of changing the GID of this volume on Openshift - in Openshift, we observed that the GID was not being modified by StatefulSet.Spec.SecurityContext.FSGroup, and so the volume had the wrong owner. We added the init container to set the GID of this volume correctly on all systems.

The default set of capabilities on Openshift's container runtime, CRI-O, as well as on containerd & Docker, includes SETGID, so one shouldn't need to run as root to perform these actions. In testing, the initContainer seems to handle running as the RabbitMQ user 999, and correctly sets the permissions & owner of the Mnesia volume on all three container runtimes (CRI-O tested on Openshift, containerd & Docker tested on GKE).

I'm not sure if something has changed to make this work, if we falsely assumed we needed root for this in the past, or if CRC (where we previously tested Openshift) uses a different default set of capabilities.

Local Testing

Please ensure you run the unit, integration and system tests before approving the PR.

To run the unit and integration tests:

$ make unit-tests integration-tests

You will need to target a k8s cluster and have the operator deployed for running the system tests.

For example, for a Kubernetes context named dev-bunny:

$ kubectx dev-bunny
$ make destroy deploy-dev
# wait for operator to be deployed
$ make system-tests

@ansd ansd assigned ansd and unassigned ansd Jun 17, 2021
@ansd ansd self-requested a review June 17, 2021 10:46
@ansd
Copy link
Member

ansd commented Jun 17, 2021

Thanks @coro.

With this change, we can omit section 3 (Create a Security Context Constraint that allows the RabbitMQ pod to have the capabilities FOWNER and CHOWN) and section 5 (Assign Security Context Constraint to Service Account) described in https://www.rabbitmq.com/kubernetes/operator/install-operator.html#openshift, correct?

I'm not sure if something has changed to make this work, if we falsely assumed we needed root for this in the past, or if CRC (where we previously tested Openshift) uses a different default set of capabilities.

I just tested this change with the latest CRC and the init container can't change ownership of /var/lib/rabbitmq/mnesia/ anymore:

> crc version
CodeReady Containers version: 1.28.0+08de64bd
OpenShift version: 4.7.13 (not embedded in executable)

> k get pods
NAME          READY   STATUS             RESTARTS   AGE
r1-server-0   0/1     CrashLoopBackOff   7          13m

> k logs r1-server-0 -c setup-container
chown: changing ownership of '/var/lib/rabbitmq/mnesia/': Operation not permitted

> k logs r1-server-0 -c rabbitmq
Configuring logger redirection
10:49:58.466 [warning] Failed to write PID file "/var/lib/rabbitmq/mnesia/[email protected]": permission denied
10:49:58.613 [error] Failed to create Ra data directory at '/var/lib/rabbitmq/mnesia/[email protected]/quorum/[email protected]', file system operation error: eacces
10:49:58.614 [error] Supervisor ra_sup had child ra_system_sup started with ra_system_sup:start_link() at undefined exit with reason {error,"Ra could not create its data directory. See the log for details."} in context start_error
10:49:58.614 [error] CRASH REPORT Process <0.248.0> with 0 neighbours exited with reason: {error,"Ra could not create its data directory. See the log for details."} in ra_system_sup:init/1 line 43
10:49:58.614 [error] CRASH REPORT Process <0.242.0> with 0 neighbours exited with reason: {{shutdown,{failed_to_start_child,ra_system_sup,{error,"Ra could not create its data directory. See the log for details."}}},{ra_app,start,[normal,[]]}} in application_master:init/4 line 142
{"Kernel pid terminated",application_controller,"{application_start_failure,ra,{{shutdown,{failed_to_start_child,ra_system_sup,{error,\"Ra could not create its data directory. See the log for details.\"}}},{ra_app,start,[normal,[]]}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,ra,{{shutdown,{failed_to_start_child,ra_system_sup,{error,"Ra could not create its data directory. See the log for details."}

Crash dump is being written to: erl_crash.dump...%

The K8s node of CRC uses cri-o container runtime:

  Container Runtime Version:                  cri-o://1.20.2-12.rhaos4.7.git9f7be76.el8

Are you getting the same behaviour on CRC?
If this issue applies only to CRC, and not to any other K8s runtime, I'm happy with this change.

@ansd ansd removed their request for review June 17, 2021 11:04
@coro
Copy link
Contributor Author

coro commented Jun 17, 2021

I just tested this with @ansd . We're observing that fsGroup is successfully applied on a real Openshift cluster hosted in GCP, meaning that we don't need the root user in the init container. This is likely because the CRC volume driver does not support the fsGroup flag, whereas the one used in the real cluster does.

We also noticed that the chown actions done by the initContainer were redundant on systems where the storage driver supports fsGroup.

Another thing we noticed is that the rabbitmq user 999 is not actually strictly necessary for the container to run. Replacing this number with an arbitrary value in runAsUser and fsGroup in the security contexts of the containers still led to the cluster being successfully created.

@coro coro merged commit 39ab8b1 into main Jun 17, 2021
@coro coro deleted the non-root-initcon branch June 17, 2021 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Init container runs as root
3 participants