Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix setting SELinux label for mqueue when user namespaces are enabled #959

Closed
wants to merge 1 commit into from

Conversation

mrunalp
Copy link
Contributor

@mrunalp mrunalp commented Jul 20, 2016

If one tries to use SELinux with user namespaces, then labeling of /dev/mqueue
fails because the IPC namespace belongs to the root in init_user_ns. This
commit fixes that by unsharing IPC namespace after we clone into a new USER
namespace so the IPC namespace is owned by the new USER namespace
as opposed to init_user_ns.

Without this fix

[root@localhost test]# oci-runtime-tool generate --tty --output=config.json --selinux-label system_u:system_r:svirt_lxc_net_t:s0:c1,c2 --mount-label system_u:object_r:svirt_sandbox_file_t:s0:c1,c2 --uidmappings 1000:0:32000 --gidmappings 1000:0:32000

[root@localhost test]# runc run 1234
rootfs_linux.go:53: mounting "/dev/mqueue" to rootfs "/test/rootfs" caused "operation not permitted"

strace output:

3802  select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
3801  <... mount resumed> )             = -1 EINVAL (Invalid argument)
3801  mount("mqueue", "/test/rootfs/dev/mqueue", "mqueue", MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL) = 0
3801  lsetxattr("/test/rootfs/dev/mqueue", "security.selinux", "system_u:object_r:svirt_sandbox_file_t:s0:c1,c2", 47, 0) = -1 EPERM (Operation not permitted)
3802  <... select resumed> )            = 0 (Timeout)
3802  select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>

cc: @rhatdan

Signed-off-by: Mrunal Patel [email protected]

{
struct clone_arg ca;
int child;

// Don't clone into NEWIPC at the same time as cloning into NEWUSER.
// This way we can ensure that NEWIPC namespace belongs to the root in new user namespace.
if (delay_ipc_unshare) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behaviour is guaranteed by most recent Linux kernels (when you set CLONE_NEW<namespace> as well as CLONE_NEWUSER, the user namespace is created first). However, if this is a problem on older RedHat kernels then the proper fix should use unshare for the user namespace and then use clone for the rest of the namespaces. This code already exists in my rootless container PR, but I'd be happy to port the code to #950 (where a bunch of other nsenter cleanups are happening).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cyphar I thought that this should have worked on newer kernels, but we do need this patch. This is reproducible on 4.4.9 kernel on Fedora. I don't mind if you port this over to #950 and we can get it in altogether.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. It turns out that #960 also can be fixed with some code from my rootless containers patchset too. I'm also cleaning up the netlink code to be easier to read.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But as I said, the "proper" fix is to do unshare(CLONE_NEWUSER), do all of the mapping and setgroup setup and then finally do the clone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cyphar Yes, I agree. I wanted this patch to be least disruptive given your modifications going on in #950. If we are overhauling it all might as well clean it up better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I've ported my rootless container fixes to #950. PTAL: 8a454e5.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing. It isn't just the order here. We also need to be root in the user namespace before unshare of IPC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cyphar Tried the latest changes on #950 for this PR. It fails:

22696 unshare(CLONE_NEWUSER)            = 0
22696 open("/proc/self/uid_map", O_RDWR) = 7
22696 write(7, "0 1000 32000\n\0", 14)  = -1 EPERM (Operation not permitted)
22690 <... select resumed> )            = 0 (Timeout)
22690 futex(0xc820029790, FUTEX_WAKE, 1 <unfinished ...>
22692 <... futex resumed> )             = 0
22690 <... futex resumed> )             = 1
22692 futex(0xc820029790, FUTEX_WAIT, 0, NULL <unfinished ...>
22690 select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
22696 write(2, "nsenter: failed to update /proc/"..., 70) = 70
22696 exit_group(4)                     = ?
22696 +++ exited with 4 +++

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm debugging it now. Weirdly, you only get EPERM if you're a privileged user (this same code works with rootless containers). Can we move the discussion to #950?

cyphar referenced this pull request Jul 21, 2016
Depending on your SELinux setup, the order in which you join namespaces
can be important. In general, user namespaces should *always* be joined
and unshared first because then the other namespaces are correctly
pinned and you have the right priviliges within them. This also is very
useful for rootless containers.

Signed-off-by: Aleksa Sarai <[email protected]>
@mrunalp mrunalp mentioned this pull request Aug 1, 2016
2 tasks
@cyphar
Copy link
Member

cyphar commented Oct 1, 2016

A variant of this patch now exists within #975. @mrunalp do you mind if we close this since I'm fairly sure you said that #975 also fixes the issue?

@mrunalp
Copy link
Contributor Author

mrunalp commented Oct 3, 2016

@cyphar Sure, closing this one.

@cyphar
Copy link
Member

cyphar commented Oct 13, 2016

Reopening since it looks like #975 doesn't actually fix this issue.

@cyphar cyphar reopened this Oct 13, 2016
@matthewdfuller
Copy link

Has this issue been solved more recently? It doesn't seem to have been updated since Oct 12, but I am still running into the following issue when using namespaces:

$ docker run alpine /bin/sh
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
0a8490d0dfd3: Pull complete
Digest: sha256:dfbd4a3a8ebca874ebd2474f044a0b33600d4523d03b0df76e5c5986cb02d7e8
Status: Downloaded newer image for alpine:latest
docker: Error response from daemon: oci runtime error: rootfs_linux.go:53: mounting "/dev/mqueue" to rootfs "/var/lib/docker/100000.100000/overlay/fd7c3b42ae55079d95125bb24eb4c6e5ca586c73963d4d5d7ffe7d1f87ccf40c/merged" caused "operation not permitted".

This is on Docker version 1.12.3, build 34a2ead, CoreOS 1235.6.0.

@mrunalp
Copy link
Contributor Author

mrunalp commented Feb 2, 2017

I have rebased this patch to latest.
@opencontainers/runc-maintainers PTAL. We and others are having to carry this patch as SELinux doesn't work without it.

@crosbymichael
Copy link
Member

crosbymichael commented Feb 2, 2017

LGTM

Approved with PullApprove

@cyphar
Copy link
Member

cyphar commented Feb 28, 2017

I reckon this code would look nicer if we merge #975 first.

We ensure that mqueue is owned by user namespace root
by unsharing CLONE_NEWIPC after we become user namespace
root. This allows us to apply the container SELinux label
to mqueue.

Signed-off-by: Mrunal Patel <[email protected]>
@mrunalp mrunalp force-pushed the mqueue_userns_fix branch from 913c9b1 to 5907671 Compare April 17, 2017 21:23
@drnybble
Copy link

drnybble commented May 8, 2017

Any updates on this? This blocks usage of SELinux and user namespace remapping.

@rhatdan
Copy link
Contributor

rhatdan commented May 8, 2017

We don't intend to support this until RHEL7.4, I am not sure if the kernel will be fixed by then.

@dqminh dqminh added this to the 1.0.0 milestone Aug 17, 2017
@dqminh
Copy link
Contributor

dqminh commented Aug 17, 2017

LGTM

Approved with PullApprove

@dqminh
Copy link
Contributor

dqminh commented Aug 17, 2017

ping @crosbymichael @cyphar

i think this looks alright to merge now, i also looked at #975 but will need some time to digest it again.

if ((config.cloneflags & CLONE_NEWUSER) && (config.cloneflags & CLONE_NEWIPC)) {
if (unshare(CLONE_NEWIPC) < 0)
bail("unshare ipc failed");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have to unshare ipc this late, Can't this be done in "runc:[1:CHILD]" process after unsharing other namespaces? We only need to fork to actually join pid namespace but not user namespace right?

Copy link
Member

@cyphar cyphar Aug 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the reasoning is that you need to have this run after setuid(0) and setgid(0). #975 was meant to make it possible to have this section earlier, by doing setresuid and setresgid immediately after the necessary unshares.

@mrunalp @dqminh Do you mind if I carry this and #975 and make a new PR that combines both?

@cyphar
Copy link
Member

cyphar commented Aug 18, 2017

By the way, I still contend that this is a kernel bug:

This commit fixes that by unsharing IPC namespace after we clone into a new USER namespace so the IPC namespace is owned by the new USER namespace as opposed to init_user_ns.

Because I'm fairly sure that violates the current kABI of how clone(multiple_flags) is meant to operate when it comes to CLONE_NEWUSER (user unsharing is done first).

@cyphar
Copy link
Member

cyphar commented Aug 18, 2017

#1562 is my attempt to carry this and #975 together.

@dqminh
Copy link
Contributor

dqminh commented Aug 18, 2017

@cyphar i prefer to merge this first actually. #1562 and #975 has drawbacks ( only support 1 single map line, but we do support multiple mappings in the spec ). I dont think these two PRs overlap at all, the one who merged latter will have to do some refactoring but i think its not too terrible.

By the way, I still contend that this is a kernel bug:

Indeed, i think so too. But i guess we have to patch where we can 😢

@cyphar
Copy link
Member

cyphar commented Aug 18, 2017

@dqminh My main concern is related to @hqhq's concern about how late the unshare is done. By the time you've hit that code the process has already joined the container -- which means that now the host's IPC namespace is temporarily visible from inside the container. Though this is probably more of a theoretical attack.

I can try to make #1562 simpler if you like, by not doing the first set of setresuid/setresgids. Or I can just sit down and implement full handling of the multi-line map format.

@dqminh
Copy link
Contributor

dqminh commented Aug 18, 2017

My main concern is related to @hqhq's concern about how late the unshare is done. By the time you've hit that code the process has already joined the container -- which means that now the host's IPC namespace is temporarily visible from inside the container. Though this is probably more of a theoretical attack.

Yah, at the point of unshare, we don't execute user's code yet so i dont see how the attack can work. Also i'm not saying that we dont need #1562 now, just that we can merge this first rather than waiting for all to land.

@cyphar
Copy link
Member

cyphar commented Aug 19, 2017

Yah, at the point of unshare, we don't execute user's code yet so i dont see how the attack can work.

I'm talking about an attack from another process in the container, similar to the /proc/$pid/fd/7/.. exploit we had earlier. While our protections against CVE-2016-9962 are okay, we are still vulnerable to cases where the container process has been given CAP_SYS_PTRACE and is not using a user namespace. In that instance, this change would expose the host IPC namespace through /proc/$pid/ns/ipc inside the container's mount namespace.

@crosbymichael
Copy link
Member

Closing since #1562 takes care of this. Just cleaning up the milestone some.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants