Fix setting SELinux label for mqueue when user namespaces are enabled #959
libcontainer/nsenter/nsexec.c
Outdated
{
	struct clone_arg ca;
	int child;

	// Don't clone into NEWIPC at the same time as cloning into NEWUSER.
	// This way we can ensure that NEWIPC namespace belongs to the root in new user namespace.
	if (delay_ipc_unshare) {
This behaviour is guaranteed by most recent Linux kernels (when you set CLONE_NEW<namespace> as well as CLONE_NEWUSER, the user namespace is created first). However, if this is a problem on older RedHat kernels then the proper fix should use unshare for the user namespace and then use clone for the rest of the namespaces. This code already exists in my rootless container PR, but I'd be happy to port the code to #950 (where a bunch of other nsenter cleanups are happening).
Yup. It turns out that #960 can also be fixed with some code from my rootless containers patchset. I'm also cleaning up the netlink code to be easier to read.
But as I said, the "proper" fix is to do unshare(CLONE_NEWUSER), do all of the mapping and setgroup setup and then finally do the clone.
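The ordering described in this comment can be sketched roughly as follows. This is a minimal illustration, not the code from #950 or the rootless containers patchset: `format_id_map`, `write_file`, and `enter_userns_first` are hypothetical names, and the sketch assumes the simplest case where the process maps only its own host UID and GID to 0 inside the new user namespace. Wider ranges, such as the `0 1000 32000` mapping seen later in this thread, generally have to be written by a privileged process in the parent namespace.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: format one "<inside> <outside> <count>\n" map line.
 * Returns the number of bytes to write (without the trailing NUL),
 * or -1 if the buffer is too small. */
int format_id_map(char *buf, size_t size, unsigned inside,
                  unsigned outside, unsigned count)
{
    assert(buf != NULL);
    int n = snprintf(buf, size, "%u %u %u\n", inside, outside, count);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Hypothetical helper: write len bytes to a /proc file.
 * Returns 0 on success, -1 on failure. */
int write_file(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, data, len);
    close(fd);
    return (n == (ssize_t)len) ? 0 : -1;
}

/* Step 1: create the user namespace on its own.
 * Step 2: deny setgroups and write single-ID uid/gid mappings.
 * The caller creates the remaining namespaces afterwards with
 * clone() or unshare(). host_uid and host_gid must be read
 * before this call. */
int enter_userns_first(unsigned host_uid, unsigned host_gid)
{
    char map[64];
    int len;

    if (unshare(CLONE_NEWUSER) < 0)
        return -1;

    /* Needed before an unprivileged gid_map write on kernels that
     * provide this file; a failure is ignored for older kernels. */
    (void)write_file("/proc/self/setgroups", "deny", 4);

    len = format_id_map(map, sizeof(map), 0, host_uid, 1);
    if (len < 0 || write_file("/proc/self/uid_map", map, (size_t)len) < 0)
        return -1;

    len = format_id_map(map, sizeof(map), 0, host_gid, 1);
    if (len < 0 || write_file("/proc/self/gid_map", map, (size_t)len) < 0)
        return -1;

    return 0;
}
```

A caller would record geteuid() and getegid() before calling enter_userns_first(), because both report the overflow ID once the process sits in a user namespace with no mapping, and would only then clone into the remaining namespaces.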
One more thing. It isn't just the order here. We also need to be root in the user namespace before unshare of IPC.
@cyphar Tried the latest changes on #950 for this PR. It fails:
22696 unshare(CLONE_NEWUSER) = 0
22696 open("/proc/self/uid_map", O_RDWR) = 7
22696 write(7, "0 1000 32000\n\0", 14) = -1 EPERM (Operation not permitted)
22690 <... select resumed> ) = 0 (Timeout)
22690 futex(0xc820029790, FUTEX_WAKE, 1 <unfinished ...>
22692 <... futex resumed> ) = 0
22690 <... futex resumed> ) = 1
22692 futex(0xc820029790, FUTEX_WAIT, 0, NULL <unfinished ...>
22690 select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
22696 write(2, "nsenter: failed to update /proc/"..., 70) = 70
22696 exit_group(4) = ?
22696 +++ exited with 4 +++
Yeah, I'm debugging it now. Weirdly, you only get EPERM if you're a privileged user (this same code works with rootless containers). Can we move the discussion to #950?
Depending on your SELinux setup, the order in which you join namespaces can be important. In general, user namespaces should *always* be joined and unshared first because then the other namespaces are correctly pinned and you have the right privileges within them. This is also very useful for rootless containers.
Signed-off-by: Aleksa Sarai <[email protected]>
@cyphar Sure, closing this one.
Reopening since it looks like #975 doesn't actually fix this issue.
Has this issue been solved more recently? It doesn't seem to have been updated since Oct 12, but I am still running into the following issue when using namespaces:
This is on Docker version 1.12.3, build 34a2ead, CoreOS 1235.6.0.
eb839c1 to 913c9b1
I have rebased this patch to latest.
I reckon this code would look nicer if we merge #975 first.
We ensure that mqueue is owned by user namespace root by unsharing CLONE_NEWIPC after we become user namespace root. This allows us to apply the container SELinux label to mqueue.
Signed-off-by: Mrunal Patel <[email protected]>
913c9b1 to 5907671
Any updates on this? This blocks usage of SELinux and user namespace remapping.
We don't intend to support this until RHEL7.4, and I am not sure if the kernel will be fixed by then.
ping @crosbymichael @cyphar I think this looks alright to merge now. I also looked at #975 but will need some time to digest it again.
if ((config.cloneflags & CLONE_NEWUSER) && (config.cloneflags & CLONE_NEWIPC)) {
	if (unshare(CLONE_NEWIPC) < 0)
		bail("unshare ipc failed");
}
Why do we have to unshare IPC this late? Can't this be done in the "runc:[1:CHILD]" process after unsharing the other namespaces? We only need to fork to actually join the pid namespace, not the user namespace, right?
I think that the reasoning is that you need to have this run after setuid(0) and setgid(0). #975 was meant to make it possible to have this section earlier, by doing setresuid and setresgid immediately after the necessary unshares.
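The ordering constraint discussed here can be sketched as below. This is only an illustration under stated assumptions, not the code in this PR or in #975: `split_ipc_unshare` and `become_root_then_unshare` are hypothetical names, and the sketch assumes the UID and GID mappings for the new user namespace have already been written so that ID 0 is valid inside it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <unistd.h>

/* Hypothetical helper: when both CLONE_NEWUSER and CLONE_NEWIPC are
 * requested, hold CLONE_NEWIPC back so it can be unshared later,
 * once the process is root inside the new user namespace.
 * Otherwise all flags stay in *early and *late is 0. */
void split_ipc_unshare(int cloneflags, int *early, int *late)
{
    assert(early != NULL && late != NULL);
    *early = cloneflags;
    *late = 0;
    if ((cloneflags & CLONE_NEWUSER) && (cloneflags & CLONE_NEWIPC)) {
        *early &= ~CLONE_NEWIPC;
        *late = CLONE_NEWIPC;
    }
}

/* Hypothetical helper: run inside the new user namespace after the
 * ID mappings exist. Becomes root there (group IDs first, then user
 * IDs) and only then unshares the held-back namespaces. */
int become_root_then_unshare(int late)
{
    if (setresgid(0, 0, 0) < 0)
        return -1;
    if (setresuid(0, 0, 0) < 0)
        return -1;
    if (late != 0 && unshare(late) < 0)
        return -1;
    return 0;
}
```

The split keeps the flag decision separate from the privileged calls, so the point at which the IPC namespace is created can move earlier or later without touching the flag logic.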
@mrunalp @dqminh Do you mind if I carry this and #975 and make a new PR that combines both?
By the way, I still contend that this is a kernel bug:
Because I'm fairly sure that violates the current kABI of how |
@cyphar I prefer to merge this first actually. #1562 and #975 have drawbacks (they only support a single map line, but we do support multiple mappings in the spec). I don't think these two PRs overlap at all; whichever is merged later will need some refactoring, but I think it's not too terrible.
Indeed, I think so too. But I guess we have to patch where we can 😢
@dqminh My main concern is related to @hqhq's concern about how late the I can try to make #1562 simpler if you like, by not doing the first set of |
Yah, at the point of unshare, we don't execute the user's code yet, so I don't see how the attack can work. Also, I'm not saying that we don't need #1562 now, just that we can merge this first rather than waiting for everything to land.
I'm talking about an attack from another process in the container, similar to the |
Closing since #1562 takes care of this. Just cleaning up the milestone some.
If one tries to use SELinux with user namespaces, then labeling of /dev/mqueue
fails because the IPC namespace belongs to the root in init_user_ns. This
commit fixes that by unsharing IPC namespace after we clone into a new USER
namespace so the IPC namespace is owned by the new USER namespace
as opposed to init_user_ns.
Without this fix
strace output:
cc: @rhatdan
Signed-off-by: Mrunal Patel [email protected]