Fix setting SELinux label for mqueue when user namespaces are enabled #959

mrunalp · 2016-07-20T18:43:54Z

If one tries to use SELinux with user namespaces, then labeling of /dev/mqueue
fails because the IPC namespace belongs to the root in init_user_ns. This
commit fixes that by unsharing IPC namespace after we clone into a new USER
namespace so the IPC namespace is owned by the new USER namespace
as opposed to init_user_ns.

Without this fix

[root@localhost test]# oci-runtime-tool generate --tty --output=config.json --selinux-label system_u:system_r:svirt_lxc_net_t:s0:c1,c2 --mount-label system_u:object_r:svirt_sandbox_file_t:s0:c1,c2 --uidmappings 1000:0:32000 --gidmappings 1000:0:32000

[root@localhost test]# runc run 1234
rootfs_linux.go:53: mounting "/dev/mqueue" to rootfs "/test/rootfs" caused "operation not permitted"

strace output:

3802  select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
3801  <... mount resumed> )             = -1 EINVAL (Invalid argument)
3801  mount("mqueue", "/test/rootfs/dev/mqueue", "mqueue", MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL) = 0
3801  lsetxattr("/test/rootfs/dev/mqueue", "security.selinux", "system_u:object_r:svirt_sandbox_file_t:s0:c1,c2", 47, 0) = -1 EPERM (Operation not permitted)
3802  <... select resumed> )            = 0 (Timeout)
3802  select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>

cc: @rhatdan

Signed-off-by: Mrunal Patel [email protected]

cyphar · 2016-07-20T23:17:04Z

libcontainer/nsenter/nsexec.c

 {
 	struct clone_arg ca;
 	int		 child;

+	// Don't clone into NEWIPC at the same time as cloning into NEWUSER.
+	// This way we can ensure that NEWIPC namespace belongs to the root in new user namespace.
+	if (delay_ipc_unshare) {


This behaviour is guaranteed by most recent Linux kernels (when you set CLONE_NEW<namespace> as well as CLONE_NEWUSER, the user namespace is created first). However, if this is a problem on older RedHat kernels then the proper fix should use unshare for the user namespace and then use clone for the rest of the namespaces. This code already exists in my rootless container PR, but I'd be happy to port the code to #950 (where a bunch of other nsenter cleanups are happening).

@cyphar I thought that this should have worked on newer kernels, but we do need this patch. This is reproducible on the 4.4.9 kernel on Fedora. I don't mind if you port this over to #950 so we can get it in all together.

Yup. It turns out that #960 also can be fixed with some code from my rootless containers patchset too. I'm also cleaning up the netlink code to be easier to read.

But as I said, the "proper" fix is to do unshare(CLONE_NEWUSER), do all of the mapping and setgroup setup and then finally do the clone.

@cyphar Yes, I agree. I wanted this patch to be least disruptive given your modifications going on in #950. If we are overhauling it all might as well clean it up better.

Alright, I've ported my rootless container fixes to #950. PTAL: 8a454e5.

One more thing. It isn't just the order here. We also need to be root in the user namespace before unshare of IPC.
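The ordering described in this thread (new user namespace first, then the ID mappings and setgroups setup, and only then the IPC namespace) can be sketched as below. This is an illustration under assumptions, not the code from this PR or from #950: it uses unshare(2) for both steps instead of clone, maps only the caller's own UID and GID to 0, and verifies ownership with the NS_GET_USERNS ioctl, which needs Linux 4.9 or later. The names write_all, child_check, and ipc_owned_by_new_userns are hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Normally provided by <linux/nsfs.h> (Linux 4.9+); defined here as a fallback. */
#ifndef NS_GET_USERNS
#define NS_GET_USERNS _IO(0xb7, 0x1)
#endif

/* Write a whole string to an existing file with a single write(2). */
static int write_all(const char *path, const char *data)
{
	int fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ssize_t len = (ssize_t)strlen(data);
	ssize_t n = write(fd, data, (size_t)len);
	close(fd);
	return n == len ? 0 : -1;
}

/* Runs in a forked child. Exit status: 0 = IPC namespace is owned by the new
 * user namespace, 1 = owned by another one, 2 = setup failed (check skipped). */
static int child_check(void)
{
	char map[64];
	struct stat owner_st, user_st;
	uid_t uid = geteuid();
	gid_t gid = getegid();

	/* Step 1: create the user namespace on its own. */
	if (unshare(CLONE_NEWUSER) < 0)
		return 2;

	/* Step 2: setgroups and ID mappings. An unprivileged writer must deny
	 * setgroups before gid_map; the file is absent on very old kernels. */
	(void)write_all("/proc/self/setgroups", "deny");
	snprintf(map, sizeof(map), "0 %u 1\n", (unsigned int)uid);
	if (write_all("/proc/self/uid_map", map) < 0)
		return 2;
	snprintf(map, sizeof(map), "0 %u 1\n", (unsigned int)gid);
	if (write_all("/proc/self/gid_map", map) < 0)
		return 2;

	/* Step 3: only now create the IPC namespace. */
	if (unshare(CLONE_NEWIPC) < 0)
		return 2;

	/* Ask the kernel which user namespace owns the IPC namespace. */
	int ipc = open("/proc/self/ns/ipc", O_RDONLY);
	if (ipc < 0)
		return 2;
	int owner = ioctl(ipc, NS_GET_USERNS);
	close(ipc);
	if (owner < 0 || fstat(owner, &owner_st) < 0 ||
	    stat("/proc/self/ns/user", &user_st) < 0)
		return 2;
	return (owner_st.st_dev == user_st.st_dev &&
		owner_st.st_ino == user_st.st_ino) ? 0 : 1;
}

/* Returns 1 if owned by the new user namespace, 0 if not, -1 if not testable. */
int ipc_owned_by_new_userns(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(child_check());
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	switch (WEXITSTATUS(status)) {
	case 0:
		return 1;
	case 1:
		return 0;
	default:
		return -1;
	}
}
```

The work happens in a forked child so the caller's own namespaces are left untouched. A return value of -1 only means the environment did not allow the check (for example, user namespaces are disabled or blocked by seccomp).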

@cyphar Tried the latest changes on #950 for this PR. It fails:

22696 unshare(CLONE_NEWUSER) = 0
22696 open("/proc/self/uid_map", O_RDWR) = 7
22696 write(7, "0 1000 32000\n\0", 14) = -1 EPERM (Operation not permitted)
22690 <... select resumed> ) = 0 (Timeout)
22690 futex(0xc820029790, FUTEX_WAKE, 1 <unfinished ...>
22692 <... futex resumed> ) = 0
22690 <... futex resumed> ) = 1
22692 futex(0xc820029790, FUTEX_WAIT, 0, NULL <unfinished ...>
22690 select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
22696 write(2, "nsenter: failed to update /proc/"..., 70) = 70
22696 exit_group(4) = ?
22696 +++ exited with 4 +++

Yeah, I'm debugging it now. Weirdly, you only get EPERM if you're a privileged user (this same code works with rootless containers). Can we move the discussion to #950?

Depending on your SELinux setup, the order in which you join namespaces can be important. In general, user namespaces should *always* be joined and unshared first because then the other namespaces are correctly pinned and you have the right privileges within them. This is also very useful for rootless containers.

Signed-off-by: Aleksa Sarai <[email protected]>

cyphar · 2016-10-01T13:19:07Z

A variant of this patch now exists within #975. @mrunalp do you mind if we close this since I'm fairly sure you said that #975 also fixes the issue?

mrunalp · 2016-10-03T22:22:42Z

@cyphar Sure, closing this one.

cyphar · 2016-10-13T00:11:28Z

Reopening since it looks like #975 doesn't actually fix this issue.

matthewdfuller · 2017-01-30T19:38:32Z

Has this issue been solved more recently? It doesn't seem to have been updated since Oct 12, but I am still running into the following issue when using namespaces:

$ docker run alpine /bin/sh
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
0a8490d0dfd3: Pull complete
Digest: sha256:dfbd4a3a8ebca874ebd2474f044a0b33600d4523d03b0df76e5c5986cb02d7e8
Status: Downloaded newer image for alpine:latest
docker: Error response from daemon: oci runtime error: rootfs_linux.go:53: mounting "/dev/mqueue" to rootfs "/var/lib/docker/100000.100000/overlay/fd7c3b42ae55079d95125bb24eb4c6e5ca586c73963d4d5d7ffe7d1f87ccf40c/merged" caused "operation not permitted".

This is on Docker version 1.12.3, build 34a2ead, CoreOS 1235.6.0.

mrunalp · 2017-02-02T21:43:46Z

I have rebased this patch to latest.
@opencontainers/runc-maintainers PTAL. We and others are having to carry this patch as SELinux doesn't work without it.

crosbymichael · 2017-02-02T21:45:27Z

LGTM

cyphar · 2017-02-28T13:52:43Z

I reckon this code would look nicer if we merge #975 first.

We ensure that mqueue is owned by user namespace root by unsharing CLONE_NEWIPC after we become user namespace root. This allows us to apply the container SELinux label to mqueue.

Signed-off-by: Mrunal Patel <[email protected]>

drnybble · 2017-05-08T02:47:55Z

Any updates on this? This blocks usage of SELinux and user namespace remapping.

rhatdan · 2017-05-08T13:48:17Z

We don't intend to support this until RHEL 7.4, and I am not sure if the kernel will be fixed by then.

dqminh · 2017-08-17T23:16:54Z

LGTM

dqminh · 2017-08-17T23:19:55Z

ping @crosbymichael @cyphar

i think this looks alright to merge now, i also looked at #975 but will need some time to digest it again.

hqhq · 2017-08-18T01:49:39Z

libcontainer/nsenter/nsexec.c

+			if ((config.cloneflags & CLONE_NEWUSER) && (config.cloneflags & CLONE_NEWIPC)) {
+				if (unshare(CLONE_NEWIPC) < 0)
+					bail("unshare ipc failed");
+			}


Why do we have to unshare IPC this late? Can't this be done in the "runc:[1:CHILD]" process after unsharing the other namespaces? We only need to fork to actually join the PID namespace, not the user namespace, right?

I think that the reasoning is that you need to have this run after setuid(0) and setgid(0). #975 was meant to make it possible to have this section earlier, by doing setresuid and setresgid immediately after the necessary unshares.
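The idea behind the hunk above can be sketched as a small flag-splitting step: when both CLONE_NEWUSER and CLONE_NEWIPC are requested, hold CLONE_NEWIPC back from the initial clone and unshare it later, once the process is root in the user namespace. This is only a sketch of that idea, not the code in nsexec.c; struct ns_plan, plan_namespaces, and finish_namespaces are hypothetical names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>

struct ns_plan {
	int clone_now;     /* flags to pass to clone() right away */
	int unshare_later; /* flags to unshare() after setuid(0)/setgid(0) */
};

/* Decide which namespace flags must be delayed. */
struct ns_plan plan_namespaces(int cloneflags)
{
	struct ns_plan plan = { cloneflags, 0 };

	if ((cloneflags & CLONE_NEWUSER) && (cloneflags & CLONE_NEWIPC)) {
		/* Keep IPC out of the first clone so it is created later,
		 * from inside the new user namespace. */
		plan.clone_now &= ~CLONE_NEWIPC;
		plan.unshare_later = CLONE_NEWIPC;
	}
	return plan;
}

/* Call this once the process is root in the user namespace.
 * Returns 0 on success or if nothing was delayed, -1 on failure. */
int finish_namespaces(const struct ns_plan *plan)
{
	assert(plan != NULL);
	if (plan->unshare_later == 0)
		return 0;
	return unshare(plan->unshare_later);
}
```

Without CLONE_NEWUSER nothing is delayed, so the non-user-namespace path behaves as before.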

@mrunalp @dqminh Do you mind if I carry this and #975 and make a new PR that combines both?

cyphar · 2017-08-18T08:01:25Z

By the way, I still contend that this is a kernel bug:

This commit fixes that by unsharing IPC namespace after we clone into a new USER namespace so the IPC namespace is owned by the new USER namespace as opposed to init_user_ns.

Because I'm fairly sure that violates the current kABI of how clone(multiple_flags) is meant to operate when it comes to CLONE_NEWUSER (user unsharing is done first).

cyphar · 2017-08-18T08:53:57Z

#1562 is my attempt to carry this and #975 together.

dqminh · 2017-08-18T10:46:01Z

@cyphar I'd actually prefer to merge this first. #1562 and #975 have a drawback (they only support a single map line, but we do support multiple mappings in the spec). I don't think these two PRs overlap at all; whichever is merged later will need some refactoring, but I think it's not too terrible.
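For context on the single-line limitation: each line of /proc/<pid>/uid_map or gid_map has the form "<ID inside the namespace> <ID outside> <count>", and the map file can only be written once, so a multi-line mapping has to be assembled and written in one go. A minimal sketch of building such a buffer follows; struct id_map and format_id_maps are hypothetical names, not runc's code, and the second mapping in the usage note is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct id_map {
	unsigned int inside;  /* first ID inside the user namespace */
	unsigned int outside; /* first ID outside (in the parent namespace) */
	unsigned int count;   /* length of the mapped range */
};

/* Format n mappings, one per line, into buf.
 * Returns the number of bytes written (excluding the NUL), or -1 if
 * the buffer is too small. */
int format_id_maps(char *buf, size_t size, const struct id_map *maps, size_t n)
{
	size_t used = 0;

	assert(buf != NULL);
	if (size == 0)
		return -1;
	buf[0] = '\0';

	for (size_t i = 0; i < n; i++) {
		int w = snprintf(buf + used, size - used, "%u %u %u\n",
				 maps[i].inside, maps[i].outside, maps[i].count);
		if (w < 0 || (size_t)w >= size - used)
			return -1;
		used += (size_t)w;
	}
	return (int)used;
}
```

For example, the mappings {0, 1000, 32000} and {32000, 100000, 1000} format to two lines, "0 1000 32000" and "32000 100000 1000", each terminated by a newline. The caller would then pass the whole buffer to a single write(2) on the map file.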

By the way, I still contend that this is a kernel bug:

Indeed, i think so too. But i guess we have to patch where we can 😢

cyphar · 2017-08-18T15:03:31Z

@dqminh My main concern is related to @hqhq's concern about how late the unshare is done. By the time you've hit that code the process has already joined the container -- which means that now the host's IPC namespace is temporarily visible from inside the container. Though this is probably more of a theoretical attack.

I can try to make #1562 simpler if you like, by not doing the first set of setresuid/setresgids. Or I can just sit down and implement full handling of the multi-line map format.

dqminh · 2017-08-18T16:28:23Z

My main concern is related to @hqhq's concern about how late the unshare is done. By the time you've hit that code the process has already joined the container -- which means that now the host's IPC namespace is temporarily visible from inside the container. Though this is probably more of a theoretical attack.

Yah, at the point of unshare, we don't execute user's code yet so i dont see how the attack can work. Also i'm not saying that we dont need #1562 now, just that we can merge this first rather than waiting for all to land.

cyphar · 2017-08-19T08:39:52Z

Yah, at the point of unshare, we don't execute user's code yet so i dont see how the attack can work.

I'm talking about an attack from another process in the container, similar to the /proc/$pid/fd/7/.. exploit we had earlier. While our protections against CVE-2016-9962 are okay, we are still vulnerable to cases where the container process has been given CAP_SYS_PTRACE and is not using a user namespace. In that instance, this change would expose the host IPC namespace through /proc/$pid/ns/ipc inside the container's mount namespace.
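To make the exposure concrete: two /proc/<pid>/ns/ipc entries refer to the same IPC namespace when stat(2) reports the same device and inode for both. A minimal sketch of that comparison, which could be used to check whether a process is still in the host's IPC namespace, is below; same_namespace is a hypothetical helper, and reading another process's ns entries requires sufficient permissions over that process.

```c
#include <sys/stat.h>

/* Compare two namespace paths such as /proc/<pid>/ns/ipc.
 * Returns 1 if they refer to the same namespace, 0 if they differ,
 * and -1 if either path cannot be inspected. */
int same_namespace(const char *path_a, const char *path_b)
{
	struct stat a, b;

	if (stat(path_a, &a) < 0 || stat(path_b, &b) < 0)
		return -1;
	return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

For instance, same_namespace("/proc/1/ns/ipc", "/proc/self/ns/ipc") run on the host tells whether the caller shares PID 1's IPC namespace.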

crosbymichael · 2017-10-10T19:11:59Z

Closing since #1562 takes care of this. Just cleaning up the milestone some.

GordonTheTurtle added the status/0-triage label Jul 20, 2016

mrunalp mentioned this pull request Jul 20, 2016

lsetxattr /dev/mqueue operation not permitted when using docker userns with selinux-enabled moby/moby#20798

Open

cyphar reviewed Jul 20, 2016

cyphar mentioned this pull request Jul 20, 2016

Failed to join the user and pid namespaces of an existing runc container #960

Closed

mrunalp mentioned this pull request Aug 1, 2016

nsenter: major cleanups #950

Merged

2 tasks

cyphar mentioned this pull request Aug 8, 2016

nsenter: set {uid,gid} explicitly around namespace creation #975

Closed

mrunalp closed this Oct 3, 2016

mrunalp mentioned this pull request Oct 12, 2016

nsenter: guarantee correct user namespace ordering #977

Merged

2 tasks

cyphar reopened this Oct 13, 2016

euank mentioned this pull request Jan 31, 2017

app-emulation/runc: workaround userns issue coreos/coreos-overlay#2398

Merged

runcom mentioned this pull request Feb 2, 2017

Delay unshare of CLONE_NEWIPC for SELinux projectatomic/runc#5

Merged

mrunalp force-pushed the mqueue_userns_fix branch from eb839c1 to 913c9b1 Compare February 2, 2017 21:42

crosbymichael assigned cyphar Feb 2, 2017

Delay unshare of CLONE_NEWIPC for SELinux

5907671


mrunalp force-pushed the mqueue_userns_fix branch from 913c9b1 to 5907671 Compare April 17, 2017 21:23

dqminh added this to the 1.0.0 milestone Aug 17, 2017

hqhq reviewed Aug 18, 2017


cyphar mentioned this pull request Aug 18, 2017

nsenter: improve namespace creation and SELinux IPC handling #1562

Merged

crosbymichael closed this Oct 10, 2017
