nsenter: improve namespace creation and SELinux IPC handling #1562

cyphar · 2017-08-18T08:18:46Z

nsenter: move namespace creation after userns creation

Technically, this change should not be necessary, as the kernel
documentation claims that if you call clone(flags|CLONE_NEWUSER), the
new user namespace will be the owner of all other namespaces created in
@flags. Unfortunately this isn't always the case, due to various
additional semantics and kernel bugs.

One particular instance is SELinux, which acts very strangely towards
the IPC namespace and mqueue. If you unshare the IPC namespace before
you map a user in the user namespace, the IPC namespace's internal
kern-mount for mqueue will be labelled incorrectly and the container
won't be able to access it. The only way of solving this is to unshare
IPC after the user has been mapped and we have changed to that user.
I've also heard of this happening to the NET namespace while talking to
some LXC folks, though I haven't personally seen that issue.

This change matches our handling of user namespaces to be the same as
how LXC handles these problems.

Closes #959
Closes #975
Signed-off-by: Aleksa Sarai [email protected]
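
For readers following along, the ordering described in the commit message can be sketched roughly as below. This is an illustration only, not the code in nsexec.c: split_cloneflags, unshare_in_order and the two callbacks are hypothetical names, and the callbacks stand in for whatever writes the uid/gid maps and switches to the mapped user.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/*
 * Hypothetical helper: split the requested clone flags into the user
 * namespace flag (unshared first) and everything else (deferred).
 */
static void split_cloneflags(int flags, int *userns_flag, int *leftover)
{
	assert(userns_flag != NULL && leftover != NULL);
	*userns_flag = flags & CLONE_NEWUSER;
	*leftover = flags & ~CLONE_NEWUSER;
}

/*
 * Sketch of the ordering: user namespace first, then the id mapping and
 * the switch to the mapped user, and only then the remaining namespaces.
 * map_ids and become_mapped_user are caller-supplied placeholders.
 */
static int unshare_in_order(int flags, int (*map_ids)(void),
			    int (*become_mapped_user)(void))
{
	int userns_flag, leftover;

	split_cloneflags(flags, &userns_flag, &leftover);
	if (userns_flag) {
		if (unshare(userns_flag) < 0)
			return -1;
		if (map_ids != NULL && map_ids() < 0)
			return -1;
		if (become_mapped_user != NULL && become_mapped_user() < 0)
			return -1;
	}
	/* IPC, NET, etc. are created by the already-mapped user. */
	if (leftover && unshare(leftover) < 0)
		return -1;
	return 0;
}
```

With this split, the IPC namespace (and its internal mqueue mount) is created only after the caller is the mapped user, which is the point of the change.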

crosbymichael · 2017-09-07T14:27:55Z

LGTM

rhvgoyal · 2017-09-07T19:32:38Z

libcontainer/nsenter/nsexec.c

+			 */
+			if (leftover_cloneflags) {
+				if (unshare(leftover_cloneflags) < 0)
+					bail("failed to unshare leftover namespaces");


How is unshare(CLONE_NEWUSER | CLONE_NEWIPC) different from unshare(CLONE_NEWUSER) followed by unshare(CLONE_NEWIPC)? IIUC, in both cases the new user namespace will be the owner of the IPC namespace.

While I agree that this is true on paper, in practice it isn't. There's a kernel bug (as discussed in #959) which results in SELinux labels (on /dev/mqueue) not being settable in a user namespace unless the IPC namespace was created after the user namespace was set up fully (including the uid_map and setuid stuff).

That issue really does not give exact technical details. I am looking at the kernel code, and the owner of the new IPC namespace is going to be the newly created user namespace. If that's the case, then we should be able to just do:

unshare(CLONE_NEWUSER | CLONE_NEWIPC)
set_uid_gid_map
switch to root inside container
mount_dev_mqueue_and_label.

I am wondering why the above workflow is not working and whether it is a kernel bug which needs fixing.
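
The set_uid_gid_map step in the workflow above boils down to writing lines of the form "inside outside count" to /proc/<pid>/uid_map and /proc/<pid>/gid_map. A minimal sketch, assuming a single extent per map; format_id_map and write_id_map are hypothetical helper names, not runc functions:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Format one map line "inside outside count\n"; returns its length or -1. */
static int format_id_map(char *buf, size_t len, unsigned inside,
			 unsigned outside, unsigned count)
{
	assert(buf != NULL);
	int n = snprintf(buf, len, "%u %u %u\n", inside, outside, count);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Write a single-extent map to path (for example /proc/<pid>/uid_map).
 * The kernel expects the whole map in one write() call.
 */
static int write_id_map(const char *path, unsigned inside,
			unsigned outside, unsigned count)
{
	char buf[64];
	int n = format_id_map(buf, sizeof(buf), inside, outside, count);

	if (n < 0)
		return -1;

	int fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	ssize_t written = write(fd, buf, (size_t)n);
	close(fd);
	return written == n ? 0 : -1;
}
```

For the example in this thread (container root backed by host uid 1000), the uid_map line would be "0 1000 1".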

I tried reproducing the problem on Fedora 26 and I can reproduce it. I see the following error message in the journal.

SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)

Looking at the strace output, it looks like we first tried an SELinux context mount of mqueue and that failed, I think due to the above kernel message. Then we tried a non-context mount followed by lsetxattr(), and that lsetxattr() failed.

29777 mount("mqueue", "/root/runc-testing/rootfs/dev/mqueue", "mqueue", MS_NOSUID|MS_NODEV|MS_NOEXEC, "context="system_u:object_r:svirt"... <unfinished ...>
29777 <... mount resumed> ) = -1 EINVAL (Invalid argument)

29777 mount("mqueue", "/root/runc-testing/rootfs/dev/mqueue", "mqueue", MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL <unfinished ...>
29777 <... mount resumed> ) = 0

29777 lsetxattr("/root/runc-testing/rootfs/dev/mqueue", "security.selinux", "system_u:object_r:svirt_sandbox_"..., 47, 0) = -1 EPERM (Operation not permitted)

More debugging. I think the following check in SELinux fails.

selinux_inode_setxattr() {
	if (!inode_owner_or_capable(inode)) {
		/* this is where the EPERM seen in strace comes from */
	}
}

This inode should belong to /rootfs/dev/mqueue (the one belonging to mqueuefs). IIUC, the calling thread is the one which is already inside the container and has effective fsuid=1000. But interestingly inode->i_uid is 0. And 0 is not mapped inside the container, so the kuid_has_mapping() check fails too.

So the question is, how did inode->i_uid become 0? IIUC, it is fsuid=1000 which created this mqueue directory and mqueue mount point. I would think that inode->i_uid should have been 1000 instead. What am I missing?

@rhvgoyal I'll admit I never went through debugging this particular issue, but I agree with your conclusion. It looks like the reason why it's done that way is because of the mq_open (and related mq_* syscalls) that don't go through the VFS but need access to the mqueue mountpoint -- hence the mount is created each time a new IPC namespace is created.

More importantly, it looks like we actually cannot ever set mount options if we are not in a user namespace -- mqueue is not whitelisted. So xattr is the only way to set them, and you need to pass this uid-based check. And in order for the uid-based check to pass you need to have your user namespace set up properly (namely kuid_has_mapping needs to work).

In short we have to set up CLONE_NEWUSER before everything else. Actually I think I should just do CLONE_NEWUSER first and then do the rest of the flags afterwards rather than whitelisting CLONE_NEWIPC.

@rhvgoyal Ah, you saw the same thing as me (I was debugging in parallel 😸).

> This inode should belong to /rootfs/dev/mqueue (one belonging mqueuefs). IIUC, calling thread is the one which is already inside container and has effective fsuid=1000. But interestingly inode->i_uid is 0.

This will only be true in rootless containers. In a container that uses user namespaces but is running as root, the unshare(...) will happen with an fsuid of 0 (which will carry on to mqueue). So when later we have everything mapped, we have a different fsuid (1000 in your example).

> And 0 is not mapped inside container, so kuid_has_mapping() check fails too.

Yeah. I think making kuid_has_mapping pass is the only way to be sure that things work out the way we want. But we could try something like setfsuid(rootuid) which should not affect anything else.
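
The setfsuid idea can be sketched as follows. This only illustrates the approach floated in the comment above, under the assumption that the uid passed in is the one the container's root maps to; with_fsuid, get_fsuid and unshare_ipc are hypothetical names, and whether this is enough to satisfy the SELinux check is exactly what was untested at this point in the thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/fsuid.h>
#include <sys/types.h>

/* setfsuid(-1) never changes anything and returns the current fsuid. */
static int get_fsuid(void)
{
	return setfsuid((uid_t)-1);
}

/*
 * Run fn() with the filesystem uid temporarily set to uid, then restore
 * the previous value. setfsuid() reports no errors, so the change is
 * verified by reading the fsuid back.
 */
static int with_fsuid(uid_t uid, int (*fn)(void))
{
	assert(fn != NULL);
	int old = setfsuid(uid);

	if ((uid_t)get_fsuid() != uid)
		return -1;

	int ret = fn();
	setfsuid((uid_t)old);
	return ret;
}

/* Intended callback: create the IPC namespace while the fsuid is switched. */
static int unshare_ipc(void)
{
	return unshare(CLONE_NEWIPC);
}
```

A call such as with_fsuid(rootuid, unshare_ipc) would then create the IPC namespace, and so the internal mqueue mount, with the chosen fsuid without touching the other credentials.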

Hmm... I think it ties back to the clone(CLONE_NEWUSER | CLONE_NEWIPC) call. This is called by a process with fsuid=0. The kernel creates the new IPC namespace and also creates an internal mount of mqueuefs. That in turn instantiates the root inode and assigns i_uid from the calling thread.
mqueue_get_inode() {
	inode->i_uid = current_fsuid();
	inode->i_gid = current_fsgid();
}

And that's how the root inode of mqueuefs gets i_uid=0. When the container process later tries to change the label, it fails because host uid 0 is not mapped inside the container.
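
To make the failure concrete, here is a small userspace model of the two checks discussed above. It is a simplification for illustration, not kernel code: struct id_extent, host_uid_has_mapping and owner_or_capable are made-up names that only mimic the logic of kuid_has_mapping() and inode_owner_or_capable().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One uid_map line: ids [inside, inside+count) map to [outside, outside+count). */
struct id_extent {
	unsigned inside;
	unsigned outside;
	unsigned count;
};

/* Model of kuid_has_mapping(): is the host uid covered by any extent? */
static bool host_uid_has_mapping(const struct id_extent *map, size_t n,
				 unsigned host_uid)
{
	assert(map != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (host_uid >= map[i].outside &&
		    host_uid - map[i].outside < map[i].count)
			return true;
	}
	return false;
}

/*
 * Model of inode_owner_or_capable(): the caller passes if it owns the
 * inode, or if the inode's owner is mapped in the caller's user namespace
 * and the caller holds CAP_FOWNER there.
 */
static bool owner_or_capable(const struct id_extent *map, size_t n,
			     unsigned caller_fsuid, unsigned inode_uid,
			     bool has_cap_fowner)
{
	if (caller_fsuid == inode_uid)
		return true;
	return has_cap_fowner && host_uid_has_mapping(map, n, inode_uid);
}
```

With a map of "0 1000 1", an inode owned by host uid 0 fails both branches for a caller with fsuid 1000, while an inode created with i_uid=1000 passes the ownership branch.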

So now I at least understand the problem.

And Mrunal's patch helps because unshare(CLONE_NEWIPC) is called by the container process with fsuid=1000. That means the mqueuefs root inode will get i_uid=1000 and the container process will have the privileges to change the SELinux label.

@rhvgoyal We can set our fsuid though (see my above comment). I'm working on another patch which tries to do it that way.

mrunalp · 2017-09-07T21:44:53Z

LGTM

cyphar · 2017-09-08T09:00:43Z

[do not merge yet]

I think the uid_map handling probably needs to be corrected.

mrunalp · 2017-09-20T20:23:29Z

@cyphar Any update?

cyphar · 2017-09-21T05:03:04Z

@mrunalp I've just pushed the new patch now. I have code to handle the full uid_map schema, as well as some basic code that uses setfsuid (which on paper should solve the problem) but I don't have a machine to test this on.

I'd be interested to know whether this fixes the problem. However I had some discussions with @brauner at OSS, and it looks like trying to work around these sorts of bugs is a waste of time and we should always do a CLONE_NEWUSER and mapping before doing all other unshares.

mrunalp · 2017-10-04T20:55:54Z

I'll check this out. Thanks!

crosbymichael · 2017-10-10T19:11:10Z

Can you look into the CI failure on this?

--- FAIL: TestExecInUserns (0.32s)
	utils_test.go:51: execin_test.go:571: unexpected error: container_linux.go:295: starting container process caused "process_linux.go:302: running exec setns process for init caused \"signal: segmentation fault (core dumped)\""

mrunalp · 2017-12-11T17:13:05Z

@cyphar ping

cyphar · 2017-12-20T08:10:25Z

I think I'm going to rewrite this one quite significantly, to just delay all namespace unsharing until after user namespaces are set up. This is what LXC does and I think they're right about not trusting that the kernel does the right thing in all cases here.

cyphar · 2018-01-08T01:20:47Z

I've rebased this to just delay unsharing of all namespaces. PTAL.

cyphar · 2018-01-08T02:06:19Z

Test failure is because of spec validator.

dqminh · 2018-01-25T12:51:05Z

LGTM.

I think the spec validator should be fixed now. Do you want to rebase?

cyphar · 2018-01-25T12:56:29Z

Yes, I will rebase.

/ping @mrunalp to verify that this fixes the issue originally reported.

cyphar · 2018-02-04T01:59:56Z

This has been rebased. Ping @mrunalp and the rest of @opencontainers/runc-maintainers.

mrunalp · 2018-02-07T18:13:34Z

I'll test this out and get back. Thanks!

iavael · 2018-03-07T20:18:59Z

@mrunalp any updates here?

A docker bug causes the docker daemon to fail in creating a container when the '--userns-remap' option is used and SELinux is enforcing. Set SELinux to permissive mode so this test can run. See: opencontainers/runc#1562 (nsenter: improve namespace creation and SELinux IPC handling). Fixes runtime errors like these: OCI runtime create failed: running exec setns process for init caused exit Signed-off-by: Geoff Levand <[email protected]>

rhatdan · 2018-04-13T11:18:13Z

@mrunalp This patch might fix some of the issues we are seeing with adding UserNS Support to CRI-O and Podman. Can you review?

giuseppe · 2018-04-16T10:27:56Z

@rhatdan I confirm this solves the issue we have seen.

LGTM

cyphar · 2018-04-16T11:45:12Z

/cc @opencontainers/runc-maintainers

giuseppe · 2018-04-26T20:38:21Z

could this be finally merged?

crosbymichael · 2018-04-26T20:40:10Z

LGTM

rhatdan · 2018-04-26T20:41:07Z

LGTM
This fixes an issue that is causing us to have to disable SELinux when using UserNS, not ideal in podman.

mrunalp · 2018-04-26T21:00:42Z

Reviewing right now.

mrunalp · 2018-04-26T21:11:41Z

LGTM

cyphar · 2018-04-27T11:29:07Z

🎉

This was referenced Aug 18, 2017

Fix setting SELinux label for mqueue when user namespaces are enabled #959

Closed

nsenter: set {uid,gid} explicitly around namespace creation #975

Closed

rhvgoyal reviewed Sep 7, 2017

mrunalp added this to the 1.0.0 milestone Oct 4, 2017

JOduMonT mentioned this pull request Nov 25, 2017

lsetxattr /dev/mqueue operation not permitted when using docker userns with selinux-enabled moby/moby#20798

Open

dqminh self-requested a review January 8, 2018 11:16

dqminh approved these changes Jan 25, 2018

mikebrow mentioned this pull request Jan 30, 2018

v1.0 discussion #1709

Closed

cyphar mentioned this pull request Mar 23, 2018

--userns-remap=default and --ipc=host: operation not permitted on /dev/mqueue moby/moby#36674

Closed

mrunalp merged commit 0cbfd83 into opencontainers:master Apr 26, 2018

cyphar deleted the carry-975-959-ipc-uid-namespaces branch April 27, 2018 11:29
