Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsenter: improve namespace creation and SELinux IPC handling #1562

Merged
merged 1 commit into from
Apr 26, 2018
Merged

nsenter: improve namespace creation and SELinux IPC handling #1562

merged 1 commit into from
Apr 26, 2018

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented Aug 18, 2017

nsenter: move namespace creation after userns creation

Technically, this change should not be necessary, as the kernel
documentation claims that if you call clone(flags|CLONE_NEWUSER), the
new user namespace will be the owner of all other namespaces created in
@ flags. Unfortunately this isn't always the case, due to various
additional semantics and kernel bugs.

One particular instance is SELinux, which acts very strangely towards
the IPC namespace and mqueue. If you unshare the IPC namespace before
you map a user in the user namespace, the IPC namespace's internal
kern-mount for mqueue will be labelled incorrectly and the container
won't be able to access it. The only way of solving this is to unshare
IPC after the user has been mapped and we have changed to that user.
I've also heard of this happening to the NET namespace while talking to
some LXC folks, though I haven't personally seen that issue.

This change matches our handling of user namespaces to be the same as
how LXC handles these problems.

Closes #959
Closes #975
Signed-off-by: Aleksa Sarai [email protected]

@crosbymichael
Copy link
Member

crosbymichael commented Sep 7, 2017

LGTM

Approved with PullApprove

*/
if (leftover_cloneflags) {
if (unshare(leftover_cloneflags) < 0)
bail("failed to unshare leftover namespaces");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How unshare(CLONE_NEWUSER | CLONE_NEWIPC) is different from unshare(CLONE_NEWUSER)followed byunshare(CLONE_NEWIPC)`. IIUC, in both the cases new user namespace will be owner of IPC namespace.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree that that is true on paper, it isn't. There's a kernel bug (as discussed in #959) which results in SELinux labels (on /dev/mqueue) not being settable in a user namespace unless the IPC namespace was created after the user namespace was set up fully (including the uid_map and setuid stuff).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That issue really does not give exact technical details. I am looking at kernel code, and owner of new ipc namespace is going to be newly created user namespace. If that's the case, then we should be able to just do.

unshare(CLONE_NEWUSER | CLONE_NEWIPC)
set_uid_gid_map
switch to root inside container
mount_dev_mqueue_and_label.

I am wondering why above workflow is not working and if it is a kernel bug which needs fixing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried reproducing problem on fedora 26 and I can reproduce it. I see following error message in journal.

SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at strace() output. Looks like first we tried selinux context mount of mqueue and that failed. I think due to above kernel message. And the we tried a non-context mount and tried lsexattr() and that lsetxattr failed.

29777 mount("mqueue", "/root/runc-testing/rootfs/dev/mqueue", "mqueue", MS_NOSUID|MS_NODEV|MS_NOEXEC, "context="system_u:object_r:svirt"... <unfinished ...>
29777 <... mount resumed> ) = -1 EINVAL (Invalid argument)

29777 mount("mqueue", "/root/runc-testing/rootfs/dev/mqueue", "mqueue", MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL <unfinished ...>
29777 <... mount resumed> ) = 0

29777 lsetxattr("/root/runc-testing/rootfs/dev/mqueue", "security.selinux", "system_u:object_r:svirt_sandbox_"..., 47, 0) = -1 EPERM (Operation not permitted)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More debugging. I think following check in selinux fails.

selinux_inode_setxattr() {
if (!inode_owner_or_capable(inode)) {
}
}

This inode should belong to /rootfs/dev/mqueue (one belonging mqueuefs). IIUC, calling thread is the one which is already inside container and has effective fsuid=1000. But interestingly inod->i_uid is 0. And 0 is not mapped inside container, so kuid_has_mapping() check fails too.

So question is, how did inode->i_uid is 0. IIUC, it is the fsuid=1000 which created this mqueue directory and mqueue mount point. I would think that inode->i_uid should been 1000 instead? What am I missing?

Copy link
Member Author

@cyphar cyphar Sep 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhvgoyal I'll admit I never went through debugging this particular issue, but I agree with your conclusion. It looks like the reason why it's done that way is because of the mq_open (and related mq_* syscalls) that don't go through the VFS but need access to the mqueue mountpoint -- hence the mount is created each time a new IPC namespace is created.

More importantly, it looks like we actually cannot ever set mount options if we are not in a user namespace -- mqueue is not whitelisted. So xattr is the only way to set them, and you need to pass this uid-based check. And in order for the uid-based check to pass you need to have your user namespace set up properly (namely kuid_has_mapping needs to work).

In short we have to set up CLONE_NEWUSER before everything else. Actually I think I should just do CLONE_NEWUSER first and then do the rest of the flags afterwards rather than whitelisting CLONE_NEWIPC.

Copy link
Member Author

@cyphar cyphar Sep 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhvgoyal Ah, you saw the same thing as me (I was debugging in parallel 😸).

This inode should belong to /rootfs/dev/mqueue (one belonging mqueuefs). IIUC, calling thread is the one which is already inside container and has effective fsuid=1000. But interestingly inode->i_uid is 0.

This will only be true in rootless containers. In a container that uses user namespaces but is running as root, the unshare(...) will happen with an fsuid of 0 (which will carry on to mqueue). So when later we have everything mapped, we have a different fsuid (1000 in your example).

And 0 is not mapped inside container, so kuid_has_mapping() check fails too.

Yeah. I think making kuid_has_mapping pass is the only way to be sure that things work out the way we want. But we could try something like setfsuid(rootuid) which should not affect anything else.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm..., I think it ties back to clone(CLONE_NEWUSER | CLONE_NEWIPC) call. This is called by process with fsuid=0. And kernel creates new ipc mount namespace, and also creates an internal mount point of mqueuefs. That in turn instantiates root inode and assigns i_uid from the calling thread.
mqueue_get_inode() {
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
}

And that's how root inode of mqueuefs gets i_uid=0. When container process later tries to change the label, it fails as host uid 0 is not mapped inside container.

So now I atleast understand the problem.

And Mrunal's patch is helping because unshare(CLONE_NEWIPC) is called by container process with fsuid=1000. That means mqeueufs root inode will get i_uid=1000 and container process will have the privileges to change selinux label.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhvgoyal We can set our fsuid though (see my above comment). I'm working on another patch which try to do it that way.

@mrunalp
Copy link
Contributor

mrunalp commented Sep 7, 2017

LGTM

Approved with PullApprove

@cyphar
Copy link
Member Author

cyphar commented Sep 8, 2017

[do not merge yet]

I think the uid_map handling probably needs to be corrected.

@mrunalp
Copy link
Contributor

mrunalp commented Sep 20, 2017

@cyphar Any update?

@cyphar
Copy link
Member Author

cyphar commented Sep 21, 2017

@mrunalp I've just pushed the new patch now. I have code to handle the full uid_map schema, as well as some basic code that uses setfsuid (which on paper should solve the problem) but I don't have a machine to test this on.

I'd be interested to know whether this fixes the problem. However I had some discussions with @brauner at OSS, and it looks like trying to work around these sorts of bugs is a waste of time and we should always do a CLONE_NEWUSER and mapping before doing all other unshares.

@mrunalp
Copy link
Contributor

mrunalp commented Oct 4, 2017

I'll check this out. Thanks!

@mrunalp mrunalp added this to the 1.0.0 milestone Oct 4, 2017
@crosbymichael
Copy link
Member

Can you look into the ci failure on this?

--- FAIL: TestExecInUserns (0.32s)
	utils_test.go:51: execin_test.go:571: unexpected error: container_linux.go:295: starting container process caused "process_linux.go:302: running exec setns process for init caused \"signal: segmentation fault (core dumped)\""

@mrunalp
Copy link
Contributor

mrunalp commented Dec 11, 2017

@cyphar ping

@cyphar
Copy link
Member Author

cyphar commented Dec 20, 2017

I think I'm going to rewrite this one quite significantly, to just delay all namespace unsharing until after user namespaces are set up. This is what LXC does and I think they're right about not trusting that the kernel does the right thing in all cases here.

@cyphar
Copy link
Member Author

cyphar commented Jan 8, 2018

I've rebased this to just delay unsharing of all namespaces. PTAL.

@cyphar
Copy link
Member Author

cyphar commented Jan 8, 2018

Test failure is because of spec validator.

@dqminh dqminh self-requested a review January 8, 2018 11:16
@dqminh
Copy link
Contributor

dqminh commented Jan 25, 2018

LGTM.

I think spec validator should be fixed now. Do you want to rebase ?

Approved with PullApprove

@cyphar
Copy link
Member Author

cyphar commented Jan 25, 2018

Yes, I will rebase.

/ping @mrunalp to verify that this fixes the issue originally reported.

Technically, this change should not be necessary, as the kernel
documentation claims that if you call clone(flags|CLONE_NEWUSER), the
new user namespace will be the owner of all other namespaces created in
@flags. Unfortunately this isn't always the case, due to various
additional semantics and kernel bugs.

One particular instance is SELinux, which acts very strangely towards
the IPC namespace and mqueue. If you unshare the IPC namespace *before*
you map a user in the user namespace, the IPC namespace's internal
kern-mount for mqueue will be labelled incorrectly and the container
won't be able to access it. The only way of solving this is to unshare
IPC *after* the user has been mapped and we have changed to that user.
I've also heard of this happening to the NET namespace while talking to
some LXC folks, though I haven't personally seen that issue.

This change matches our handling of user namespaces to be the same as
how LXC handles these problems.

Signed-off-by: Aleksa Sarai <[email protected]>
@mikebrow mikebrow mentioned this pull request Jan 30, 2018
@cyphar
Copy link
Member Author

cyphar commented Feb 4, 2018

This has been rebased. Ping @mrunalp and the rest of @opencontainers/runc-maintainers.

@mrunalp
Copy link
Contributor

mrunalp commented Feb 7, 2018

I'll test this out and get back. Thanks!

@iavael
Copy link

iavael commented Mar 7, 2018

@mrunalp any updates here?

glevand added a commit to glevand/coreos--mantle that referenced this pull request Mar 27, 2018
A docker bug causes the docker daemon to fail in creating a container
when the '--userns-remap' option is used and SELinux is enforcing.
Set SELinux to permisive mode so this test can run.
See: opencontainers/runc#1562 (nsenter:
improve namespace creation and SELinux IPC handling).

Fixes runtime errors like these:

  OCI runtime create failed: running exec setns process for init caused exit

Signed-off-by: Geoff Levand <[email protected]>
glevand added a commit to glevand/coreos--mantle that referenced this pull request Mar 28, 2018
A docker bug causes the docker daemon to fail in creating a container
when the '--userns-remap' option is used and SELinux is enforcing.
Set SELinux to permisive mode so this test can run.
See: opencontainers/runc#1562 (nsenter:
improve namespace creation and SELinux IPC handling).

Fixes runtime errors like these:

  OCI runtime create failed: running exec setns process for init caused exit

Signed-off-by: Geoff Levand <[email protected]>
glevand added a commit to glevand/coreos--mantle that referenced this pull request Mar 29, 2018
A docker bug causes the docker daemon to fail in creating a container
when the '--userns-remap' option is used and SELinux is enforcing.
Set SELinux to permisive mode so this test can run.
See: opencontainers/runc#1562 (nsenter:
improve namespace creation and SELinux IPC handling).

Fixes runtime errors like these:

  OCI runtime create failed: running exec setns process for init caused exit

Signed-off-by: Geoff Levand <[email protected]>
glevand added a commit to glevand/coreos--mantle that referenced this pull request Mar 29, 2018
A docker bug causes the docker daemon to fail in creating a container
when the '--userns-remap' option is used and SELinux is enforcing.
Set SELinux to permisive mode so this test can run.
See: opencontainers/runc#1562 (nsenter:
improve namespace creation and SELinux IPC handling).

Fixes runtime errors like these:

  OCI runtime create failed: running exec setns process for init caused exit

Signed-off-by: Geoff Levand <[email protected]>
glevand added a commit to glevand/coreos--mantle that referenced this pull request Apr 3, 2018
A docker bug causes the docker daemon to fail in creating a container
when the '--userns-remap' option is used and SELinux is enforcing.
Set SELinux to permisive mode so this test can run.
See: opencontainers/runc#1562 (nsenter:
improve namespace creation and SELinux IPC handling).

Fixes runtime errors like these:

  OCI runtime create failed: running exec setns process for init caused exit

Signed-off-by: Geoff Levand <[email protected]>
glevand added a commit to glevand/coreos--mantle that referenced this pull request Apr 6, 2018
A docker bug causes the docker daemon to fail in creating a container
when the '--userns-remap' option is used and SELinux is enforcing.
Set SELinux to permisive mode so this test can run.
See: opencontainers/runc#1562 (nsenter:
improve namespace creation and SELinux IPC handling).

Fixes runtime errors like these:

  OCI runtime create failed: running exec setns process for init caused exit

Signed-off-by: Geoff Levand <[email protected]>
@rhatdan
Copy link
Contributor

rhatdan commented Apr 13, 2018

@mrunalp This patch might fix some of the issues we are seeing with adding UserNS Support to CRI-O and Podman. Can you review?

@giuseppe
Copy link
Member

@rhatdan I confirm this solves the issue we have seen.

LGTM

@cyphar
Copy link
Member Author

cyphar commented Apr 16, 2018

/cc @opencontainers/runc-maintainers

glevand added a commit to glevand/coreos--mantle that referenced this pull request Apr 16, 2018
A docker bug causes the docker daemon to fail in creating a container
when the '--userns-remap' option is used and SELinux is enforcing.
Set SELinux to permisive mode so this test can run.
See: opencontainers/runc#1562 (nsenter:
improve namespace creation and SELinux IPC handling).

Fixes runtime errors like these:

  OCI runtime create failed: running exec setns process for init caused exit

Signed-off-by: Geoff Levand <[email protected]>
glevand added a commit to glevand/coreos--mantle that referenced this pull request Apr 17, 2018
A docker bug causes the docker daemon to fail in creating a container
when the '--userns-remap' option is used and SELinux is enforcing.
Set SELinux to permisive mode so this test can run.
See: opencontainers/runc#1562 (nsenter:
improve namespace creation and SELinux IPC handling).

Fixes runtime errors like these:

  OCI runtime create failed: running exec setns process for init caused exit

Signed-off-by: Geoff Levand <[email protected]>
glevand added a commit to glevand/coreos--mantle that referenced this pull request Apr 20, 2018
A docker bug causes the docker daemon to fail in creating a container
when the '--userns-remap' option is used and SELinux is enforcing.
Set SELinux to permisive mode so this test can run.
See: opencontainers/runc#1562 (nsenter:
improve namespace creation and SELinux IPC handling).

Fixes runtime errors like these:

  OCI runtime create failed: running exec setns process for init caused exit

Signed-off-by: Geoff Levand <[email protected]>
@giuseppe
Copy link
Member

could this be finally merged?

@crosbymichael
Copy link
Member

crosbymichael commented Apr 26, 2018

LGTM

Approved with PullApprove

@rhatdan
Copy link
Contributor

rhatdan commented Apr 26, 2018

LGTM
This fixes an issue that is causing us to have to disable SELinux when using UserNS, not ideal in podman.

@mrunalp
Copy link
Contributor

mrunalp commented Apr 26, 2018

Reviewing right now.

@mrunalp
Copy link
Contributor

mrunalp commented Apr 26, 2018

LGTM

Approved with PullApprove

@mrunalp mrunalp merged commit 0cbfd83 into opencontainers:master Apr 26, 2018
@cyphar cyphar deleted the carry-975-959-ipc-uid-namespaces branch April 27, 2018 11:29
@cyphar
Copy link
Member Author

cyphar commented Apr 27, 2018

🎉

glevand added a commit to glevand/coreos--mantle that referenced this pull request May 1, 2018
A docker bug causes the docker daemon to fail in creating a container
when the '--userns-remap' option is used and SELinux is enforcing.
Set SELinux to permisive mode so this test can run.
See: opencontainers/runc#1562 (nsenter:
improve namespace creation and SELinux IPC handling).

Fixes runtime errors like these:

  OCI runtime create failed: running exec setns process for init caused exit

Signed-off-by: Geoff Levand <[email protected]>
glevand added a commit to glevand/coreos--mantle that referenced this pull request May 7, 2018
A docker bug causes the docker daemon to fail in creating a container
when the '--userns-remap' option is used and SELinux is enforcing.
Set SELinux to permisive mode so this test can run.
See: opencontainers/runc#1562 (nsenter:
improve namespace creation and SELinux IPC handling).

Fixes runtime errors like these:

  OCI runtime create failed: running exec setns process for init caused exit

Signed-off-by: Geoff Levand <[email protected]>
glevand added a commit to glevand/coreos--mantle that referenced this pull request May 14, 2018
A docker bug causes the docker daemon to fail in creating a container
when the '--userns-remap' option is used and SELinux is enforcing.
Set SELinux to permisive mode so this test can run.
See: opencontainers/runc#1562 (nsenter:
improve namespace creation and SELinux IPC handling).

Fixes runtime errors like these:

  OCI runtime create failed: running exec setns process for init caused exit

Signed-off-by: Geoff Levand <[email protected]>
glevand added a commit to glevand/coreos--mantle that referenced this pull request May 14, 2018
A docker bug causes the docker daemon to fail in creating a container
when the '--userns-remap' option is used and SELinux is enforcing.
Set SELinux to permisive mode so this test can run.
See: opencontainers/runc#1562 (nsenter:
improve namespace creation and SELinux IPC handling).

Fixes runtime errors like these:

  OCI runtime create failed: running exec setns process for init caused exit

Signed-off-by: Geoff Levand <[email protected]>
glevand added a commit to glevand/coreos--mantle that referenced this pull request May 29, 2018
A docker bug causes the docker daemon to fail in creating a container
when the '--userns-remap' option is used and SELinux is enforcing.
Set SELinux to permisive mode so this test can run.
See: opencontainers/runc#1562 (nsenter:
improve namespace creation and SELinux IPC handling).

Fixes runtime errors like these:

  OCI runtime create failed: running exec setns process for init caused exit

Signed-off-by: Geoff Levand <[email protected]>
glevand added a commit to glevand/coreos--mantle that referenced this pull request May 29, 2018
A docker bug causes the docker daemon to fail in creating a container
when the '--userns-remap' option is used and SELinux is enforcing.
Set SELinux to permisive mode so this test can run.
See: opencontainers/runc#1562 (nsenter:
improve namespace creation and SELinux IPC handling).

Fixes runtime errors like these:

  OCI runtime create failed: running exec setns process for init caused exit

Signed-off-by: Geoff Levand <[email protected]>
glevand added a commit to glevand/coreos--mantle that referenced this pull request Sep 10, 2018
A docker bug causes the docker daemon to fail in creating a container
when the '--userns-remap' option is used and SELinux is enforcing.
Set SELinux to permisive mode so this test can run.
See: opencontainers/runc#1562 (nsenter:
improve namespace creation and SELinux IPC handling).

Fixes runtime errors like these:

  OCI runtime create failed: running exec setns process for init caused exit

Signed-off-by: Geoff Levand <[email protected]>
glevand added a commit to glevand/coreos--mantle that referenced this pull request Sep 10, 2018
A docker bug causes the docker daemon to fail in creating a container
when the '--userns-remap' option is used and SELinux is enforcing.
Set SELinux to permisive mode so this test can run.
See: opencontainers/runc#1562 (nsenter:
improve namespace creation and SELinux IPC handling).

Fixes runtime errors like these:

  OCI runtime create failed: running exec setns process for init caused exit

Signed-off-by: Geoff Levand <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants