-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nsenter: improve namespace creation and SELinux IPC handling #1562
nsenter: improve namespace creation and SELinux IPC handling #1562
Conversation
libcontainer/nsenter/nsexec.c
Outdated
*/ | ||
if (leftover_cloneflags) { | ||
if (unshare(leftover_cloneflags) < 0) | ||
bail("failed to unshare leftover namespaces"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How unshare(CLONE_NEWUSER | CLONE_NEWIPC) is different from
unshare(CLONE_NEWUSER)followed by
unshare(CLONE_NEWIPC)`. IIUC, in both the cases new user namespace will be owner of IPC namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I agree that that is true on paper, it isn't. There's a kernel bug (as discussed in #959) which results in SELinux labels (on /dev/mqueue
) not being settable in a user namespace unless the IPC namespace was created after the user namespace was set up fully (including the uid_map
and setuid
stuff).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That issue really does not give exact technical details. I am looking at kernel code, and owner of new ipc namespace is going to be newly created user namespace. If that's the case, then we should be able to just do.
unshare(CLONE_NEWUSER | CLONE_NEWIPC)
set_uid_gid_map
switch to root inside container
mount_dev_mqueue_and_label.
I am wondering why above workflow is not working and if it is a kernel bug which needs fixing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried reproducing problem on fedora 26 and I can reproduce it. I see following error message in journal.
SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at strace() output. Looks like first we tried selinux context mount of mqueue and that failed. I think due to above kernel message. And the we tried a non-context mount and tried lsexattr() and that lsetxattr failed.
29777 mount("mqueue", "/root/runc-testing/rootfs/dev/mqueue", "mqueue", MS_NOSUID|MS_NODEV|MS_NOEXEC, "context="system_u:object_r:svirt"... <unfinished ...>
29777 <... mount resumed> ) = -1 EINVAL (Invalid argument)
29777 mount("mqueue", "/root/runc-testing/rootfs/dev/mqueue", "mqueue", MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL <unfinished ...>
29777 <... mount resumed> ) = 0
29777 lsetxattr("/root/runc-testing/rootfs/dev/mqueue", "security.selinux", "system_u:object_r:svirt_sandbox_"..., 47, 0) = -1 EPERM (Operation not permitted)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More debugging. I think following check in selinux fails.
selinux_inode_setxattr() {
if (!inode_owner_or_capable(inode)) {
}
}
This inode should belong to /rootfs/dev/mqueue (one belonging mqueuefs). IIUC, calling thread is the one which is already inside container and has effective fsuid=1000. But interestingly inod->i_uid is 0. And 0 is not mapped inside container, so kuid_has_mapping() check fails too.
So question is, how did inode->i_uid is 0. IIUC, it is the fsuid=1000 which created this mqueue directory and mqueue mount point. I would think that inode->i_uid should been 1000 instead? What am I missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhvgoyal I'll admit I never went through debugging this particular issue, but I agree with your conclusion. It looks like the reason why it's done that way is because of the mq_open
(and related mq_*
syscalls) that don't go through the VFS but need access to the mqueue
mountpoint -- hence the mount is created each time a new IPC namespace is created.
More importantly, it looks like we actually cannot ever set mount options if we are not in a user namespace -- mqueue
is not whitelisted. So xattr
is the only way to set them, and you need to pass this uid
-based check. And in order for the uid
-based check to pass you need to have your user namespace set up properly (namely kuid_has_mapping
needs to work).
In short we have to set up CLONE_NEWUSER
before everything else. Actually I think I should just do CLONE_NEWUSER
first and then do the rest of the flags afterwards rather than whitelisting CLONE_NEWIPC
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhvgoyal Ah, you saw the same thing as me (I was debugging in parallel 😸).
This inode should belong to /rootfs/dev/mqueue (one belonging mqueuefs). IIUC, calling thread is the one which is already inside container and has effective fsuid=1000. But interestingly inode->i_uid is 0.
This will only be true in rootless containers. In a container that uses user namespaces but is running as root, the unshare(...)
will happen with an fsuid
of 0
(which will carry on to mqueue
). So when later we have everything mapped, we have a different fsuid
(1000
in your example).
And 0 is not mapped inside container, so kuid_has_mapping() check fails too.
Yeah. I think making kuid_has_mapping
pass is the only way to be sure that things work out the way we want. But we could try something like setfsuid(rootuid)
which should not affect anything else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm..., I think it ties back to clone(CLONE_NEWUSER | CLONE_NEWIPC) call. This is called by process with fsuid=0. And kernel creates new ipc mount namespace, and also creates an internal mount point of mqueuefs. That in turn instantiates root inode and assigns i_uid from the calling thread.
mqueue_get_inode() {
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
}
And that's how root inode of mqueuefs gets i_uid=0. When container process later tries to change the label, it fails as host uid 0 is not mapped inside container.
So now I atleast understand the problem.
And Mrunal's patch is helping because unshare(CLONE_NEWIPC) is called by container process with fsuid=1000. That means mqeueufs root inode will get i_uid=1000 and container process will have the privileges to change selinux label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhvgoyal We can set our fsuid
though (see my above comment). I'm working on another patch which try to do it that way.
[do not merge yet] I think the |
@cyphar Any update? |
@mrunalp I've just pushed the new patch now. I have code to handle the full I'd be interested to know whether this fixes the problem. However I had some discussions with @brauner at OSS, and it looks like trying to work around these sorts of bugs is a waste of time and we should always do a |
I'll check this out. Thanks! |
Can you look into the ci failure on this?
|
@cyphar ping |
I think I'm going to rewrite this one quite significantly, to just delay all namespace unsharing until after user namespaces are set up. This is what LXC does and I think they're right about not trusting that the kernel does the right thing in all cases here. |
I've rebased this to just delay unsharing of all namespaces. PTAL. |
Test failure is because of spec validator. |
Yes, I will rebase. /ping @mrunalp to verify that this fixes the issue originally reported. |
Technically, this change should not be necessary, as the kernel documentation claims that if you call clone(flags|CLONE_NEWUSER), the new user namespace will be the owner of all other namespaces created in @flags. Unfortunately this isn't always the case, due to various additional semantics and kernel bugs. One particular instance is SELinux, which acts very strangely towards the IPC namespace and mqueue. If you unshare the IPC namespace *before* you map a user in the user namespace, the IPC namespace's internal kern-mount for mqueue will be labelled incorrectly and the container won't be able to access it. The only way of solving this is to unshare IPC *after* the user has been mapped and we have changed to that user. I've also heard of this happening to the NET namespace while talking to some LXC folks, though I haven't personally seen that issue. This change matches our handling of user namespaces to be the same as how LXC handles these problems. Signed-off-by: Aleksa Sarai <[email protected]>
This has been rebased. Ping @mrunalp and the rest of @opencontainers/runc-maintainers. |
I'll test this out and get back. Thanks! |
@mrunalp any updates here? |
A docker bug causes the docker daemon to fail in creating a container when the '--userns-remap' option is used and SELinux is enforcing. Set SELinux to permisive mode so this test can run. See: opencontainers/runc#1562 (nsenter: improve namespace creation and SELinux IPC handling). Fixes runtime errors like these: OCI runtime create failed: running exec setns process for init caused exit Signed-off-by: Geoff Levand <[email protected]>
A docker bug causes the docker daemon to fail in creating a container when the '--userns-remap' option is used and SELinux is enforcing. Set SELinux to permisive mode so this test can run. See: opencontainers/runc#1562 (nsenter: improve namespace creation and SELinux IPC handling). Fixes runtime errors like these: OCI runtime create failed: running exec setns process for init caused exit Signed-off-by: Geoff Levand <[email protected]>
A docker bug causes the docker daemon to fail in creating a container when the '--userns-remap' option is used and SELinux is enforcing. Set SELinux to permisive mode so this test can run. See: opencontainers/runc#1562 (nsenter: improve namespace creation and SELinux IPC handling). Fixes runtime errors like these: OCI runtime create failed: running exec setns process for init caused exit Signed-off-by: Geoff Levand <[email protected]>
A docker bug causes the docker daemon to fail in creating a container when the '--userns-remap' option is used and SELinux is enforcing. Set SELinux to permisive mode so this test can run. See: opencontainers/runc#1562 (nsenter: improve namespace creation and SELinux IPC handling). Fixes runtime errors like these: OCI runtime create failed: running exec setns process for init caused exit Signed-off-by: Geoff Levand <[email protected]>
A docker bug causes the docker daemon to fail in creating a container when the '--userns-remap' option is used and SELinux is enforcing. Set SELinux to permisive mode so this test can run. See: opencontainers/runc#1562 (nsenter: improve namespace creation and SELinux IPC handling). Fixes runtime errors like these: OCI runtime create failed: running exec setns process for init caused exit Signed-off-by: Geoff Levand <[email protected]>
A docker bug causes the docker daemon to fail in creating a container when the '--userns-remap' option is used and SELinux is enforcing. Set SELinux to permisive mode so this test can run. See: opencontainers/runc#1562 (nsenter: improve namespace creation and SELinux IPC handling). Fixes runtime errors like these: OCI runtime create failed: running exec setns process for init caused exit Signed-off-by: Geoff Levand <[email protected]>
@mrunalp This patch might fix some of the issues we are seeing with adding UserNS Support to CRI-O and Podman. Can you review? |
@rhatdan I confirm this solves the issue we have seen. LGTM |
/cc @opencontainers/runc-maintainers |
A docker bug causes the docker daemon to fail in creating a container when the '--userns-remap' option is used and SELinux is enforcing. Set SELinux to permisive mode so this test can run. See: opencontainers/runc#1562 (nsenter: improve namespace creation and SELinux IPC handling). Fixes runtime errors like these: OCI runtime create failed: running exec setns process for init caused exit Signed-off-by: Geoff Levand <[email protected]>
A docker bug causes the docker daemon to fail in creating a container when the '--userns-remap' option is used and SELinux is enforcing. Set SELinux to permisive mode so this test can run. See: opencontainers/runc#1562 (nsenter: improve namespace creation and SELinux IPC handling). Fixes runtime errors like these: OCI runtime create failed: running exec setns process for init caused exit Signed-off-by: Geoff Levand <[email protected]>
A docker bug causes the docker daemon to fail in creating a container when the '--userns-remap' option is used and SELinux is enforcing. Set SELinux to permisive mode so this test can run. See: opencontainers/runc#1562 (nsenter: improve namespace creation and SELinux IPC handling). Fixes runtime errors like these: OCI runtime create failed: running exec setns process for init caused exit Signed-off-by: Geoff Levand <[email protected]>
could this be finally merged? |
LGTM |
Reviewing right now. |
🎉 |
A docker bug causes the docker daemon to fail in creating a container when the '--userns-remap' option is used and SELinux is enforcing. Set SELinux to permisive mode so this test can run. See: opencontainers/runc#1562 (nsenter: improve namespace creation and SELinux IPC handling). Fixes runtime errors like these: OCI runtime create failed: running exec setns process for init caused exit Signed-off-by: Geoff Levand <[email protected]>
A docker bug causes the docker daemon to fail in creating a container when the '--userns-remap' option is used and SELinux is enforcing. Set SELinux to permisive mode so this test can run. See: opencontainers/runc#1562 (nsenter: improve namespace creation and SELinux IPC handling). Fixes runtime errors like these: OCI runtime create failed: running exec setns process for init caused exit Signed-off-by: Geoff Levand <[email protected]>
A docker bug causes the docker daemon to fail in creating a container when the '--userns-remap' option is used and SELinux is enforcing. Set SELinux to permisive mode so this test can run. See: opencontainers/runc#1562 (nsenter: improve namespace creation and SELinux IPC handling). Fixes runtime errors like these: OCI runtime create failed: running exec setns process for init caused exit Signed-off-by: Geoff Levand <[email protected]>
A docker bug causes the docker daemon to fail in creating a container when the '--userns-remap' option is used and SELinux is enforcing. Set SELinux to permisive mode so this test can run. See: opencontainers/runc#1562 (nsenter: improve namespace creation and SELinux IPC handling). Fixes runtime errors like these: OCI runtime create failed: running exec setns process for init caused exit Signed-off-by: Geoff Levand <[email protected]>
A docker bug causes the docker daemon to fail in creating a container when the '--userns-remap' option is used and SELinux is enforcing. Set SELinux to permisive mode so this test can run. See: opencontainers/runc#1562 (nsenter: improve namespace creation and SELinux IPC handling). Fixes runtime errors like these: OCI runtime create failed: running exec setns process for init caused exit Signed-off-by: Geoff Levand <[email protected]>
A docker bug causes the docker daemon to fail in creating a container when the '--userns-remap' option is used and SELinux is enforcing. Set SELinux to permisive mode so this test can run. See: opencontainers/runc#1562 (nsenter: improve namespace creation and SELinux IPC handling). Fixes runtime errors like these: OCI runtime create failed: running exec setns process for init caused exit Signed-off-by: Geoff Levand <[email protected]>
A docker bug causes the docker daemon to fail in creating a container when the '--userns-remap' option is used and SELinux is enforcing. Set SELinux to permisive mode so this test can run. See: opencontainers/runc#1562 (nsenter: improve namespace creation and SELinux IPC handling). Fixes runtime errors like these: OCI runtime create failed: running exec setns process for init caused exit Signed-off-by: Geoff Levand <[email protected]>
A docker bug causes the docker daemon to fail in creating a container when the '--userns-remap' option is used and SELinux is enforcing. Set SELinux to permisive mode so this test can run. See: opencontainers/runc#1562 (nsenter: improve namespace creation and SELinux IPC handling). Fixes runtime errors like these: OCI runtime create failed: running exec setns process for init caused exit Signed-off-by: Geoff Levand <[email protected]>
nsenter: move namespace creation after userns creation
Technically, this change should not be necessary, as the kernel
documentation claims that if you call clone(flags|CLONE_NEWUSER), the
new user namespace will be the owner of all other namespaces created in
@ flags. Unfortunately this isn't always the case, due to various
additional semantics and kernel bugs.
One particular instance is SELinux, which acts very strangely towards
the IPC namespace and mqueue. If you unshare the IPC namespace before
you map a user in the user namespace, the IPC namespace's internal
kern-mount for mqueue will be labelled incorrectly and the container
won't be able to access it. The only way of solving this is to unshare
IPC after the user has been mapped and we have changed to that user.
I've also heard of this happening to the NET namespace while talking to
some LXC folks, though I haven't personally seen that issue.
This change matches our handling of user namespaces to be the same as
how LXC handles these problems.
Closes #959
Closes #975
Signed-off-by: Aleksa Sarai [email protected]