-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chown cgroup to process uid in container namespace #3057
chown cgroup to process uid in container namespace #3057
Conversation
b30bb92
to
9e9c973
Compare
In general this kinda makes sense, although I'm afraid systemd might overwrite that at some random moment. This also needs test(s). |
You can use systemd transient unit API to set a different user. Unfortunately, the uid needs to I filed an issue against systemd to relax this requirement. It was provisionally rejected: I would like if we can push back and convince systemd project (i.e. Lennart) to just relax the checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For security reason, the default behavior should not be changed.
We can probably have a feature gate as an annotation
https://github.com/containers/crun/blob/main/crun.1.md#extensions-to-oci
https://github.com/opencontainers/runc/blob/master/docs/systemd.md#auxiliary-properties
User namespaces is already behind an annotation ( |
I'm talking about OCI Runtime Spec annotations, not about Kubernetes annotations. CRI-O may set the suggest OCI Runtime Spec annotation by default at their own risk. |
@AkihiroSuda thanks for the clarification. Fair enough. I'll gate it. I'll try to make all the suggested changes next week. |
@AkihiroSuda what would be an appropriate key/value for the annotation? |
SGTM, but in the long term we should have proper JSON entry in OCI Runtime Spec https://github.com/opencontainers/runtime-spec |
Better would be to adjust the semantics such that the container process can create scopes/slices in its own cgroup, and manage its child processes/threads. But that is a discussion to be had over there, not here :) |
9e9c973
to
3c94bb5
Compare
Thanks @kolyshkin and @AkihiroSuda for your reviews. I implemented all of your requested changes, and added a test. By the way, for testing with OpenShift, do not use 4.8.0, there is a regression in CRI-O 1.21 that prevents k8s annotations from being propagated to the OCI configuration. CRI-O 1.20 (OpenShift 4.7.x) is fine. |
43e5c14
to
0f4ffc4
Compare
Ping @kolyshkin and @AkihiroSuda - are you able to re-review this PR soon? |
4efd94e
to
99cf0a8
Compare
99cf0a8
to
e155172
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good and ready; only left a single suggestion.
e155172
to
cce31b0
Compare
cce31b0
to
8212dfa
Compare
Delegating cgroups to the container enables more complex workloads, including systemd-based workloads. The OCI runtime-spec was recently updated to explicitly admit such delegation, through specification of cgroup ownership semantics: opencontainers/runtime-spec#1123 Pursuant to the updated OCI runtime-spec, change the ownership of the container's cgroup directory and particular files therein, when using cgroups v2 and when the cgroupfs is to be mounted read/write. As a result of this change, systemd workloads can run in isolated user namespaces on OpenShift when the sandbox's cgroupfs is mounted read/write. It might be possible to implement this feature in other cgroup managers, but that work is deferred. Signed-off-by: Fraser Tweedale <[email protected]>
8212dfa
to
35d20c4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks to all who have reviewed this PR and given feedback. We need more approvals and then someone can press the merge button. Ping @cyphar, @AkihiroSuda, @giuseppe, @jfroy. |
Any takers? @mrunalp? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Many thanks, @cyphar and all reviewers. |
I created #3311 to backport this feature to the |
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads. The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:
opencontainers/runtime-spec#1123
Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.
As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.
It might be possible to implement this feature in other cgroup
managers, but that work is deferred.