Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--userns-remap=default and --ipc=host: operation not permitted on /dev/mqueue #36674

Closed
flx42 opened this issue Mar 23, 2018 · 13 comments
Closed

Comments

@flx42
Copy link
Contributor

flx42 commented Mar 23, 2018

On ubuntu 16.04 with 4.13.0-37-generic and docker 18.03.0-ce, with --userns-remap=default in my configuration:

$ docker run --ipc=host --rm ubuntu:16.04
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"mqueue\\\" to rootfs \\\"/var/lib/docker/165536.165536/overlay2/ba5d5452e932a9b4e4d50417d3958dfc4d375b17fc57f49948f5f2f7a64c0495/merged\\\" at \\\"/dev/mqueue\\\" caused \\\"operation not permitted\\\"\"": unknown.

Looks like the kernel path is rather simple
https://github.com/torvalds/linux/blob/v4.13/ipc/mqueue.c#L338-L340
then
https://github.com/torvalds/linux/blob/v4.13/fs/super.c#L1023-L1027

At first I thought it was just an oversight, and that this case should be disabled, for instance like this:

$ docker run --pid=host --rm ubuntu:16.04
docker: Error response from daemon: cannot share the host PID namespace when user namespaces are enabled.

But, it doesn't seem so simple, the code has changed for 4.16, for instance:
torvalds/linux@36735a6
And with 4.16.0-rc6 (debian buster), it works:

$ uname -r
4.16.0-rc6-amd64
$ mkdir -p /tmp/mqueue
$ unshare -m -U -r -- mount -t mqueue mqueue /tmp/mqueue
$ echo $?
0

With 4.15.0 also on debian buster:

$ uname -r
4.15.0-1-amd64
$ mkdir -p /tmp/mqueue
$ unshare -m -U -r -- mount -t mqueue mqueue /tmp/mqueue
mount: /tmp/mqueue: permission denied.
$ echo $?
32

Any kernel maintainer to check if it's a kernel regression? Or is it intended and might be backported to distro kernels? Or are both behaviors valid and something changed in 4.16?
If the 4.15 behavior is the right one, we will have to disable --ipc=host with userns, similarly to --net=host and --pid=host.
@crosbymichael @cyphar?

@flx42 flx42 changed the title --userns-remap=default and --ipc=host: permission denied on /dev/mqueue --userns-remap=default and --ipc=host: operation not permitted on /dev/mqueue Mar 23, 2018
@cyphar
Copy link
Contributor

cyphar commented Mar 23, 2018

The new behaviour is what is meant to happen. Ubuntu has been well-known to break mqueue and containers quite a few times (though in this case it was actually an upstream kernel issue under certain configurations). The issue was that (if you have certain configurations in a distro kernel) then you cannot use mqueue in user namespaces because of a flaw in how the internal mqueue mount was set up (it was owned by a different user namespace). I have a patch for runc that works around the issue for the SELinux case (opencontainers/runc#1562) but I don't think it would help with the mounting case.

The patch you posted is quite interesting though, because it'll fix that long-standing issue with SELinux that caused permission errors. @mrunalp @rhatdan are probably interested in that (it'll be a kernel-side fix for opencontainers/runc#1562 which was the internal mqueue mount problem on SELinux -- if it hasn't already been backported to RHEL I would recommend it).

@flx42
Copy link
Contributor Author

flx42 commented Mar 23, 2018

Thanks @cyphar.

The issue was that (if you have certain configurations in a distro kernel) then you cannot use mqueue in user namespaces because of a flaw in how the internal mqueue mount was set up (it was owned by a different user namespace)

To clarify, I can use mqueue in a user namespace if I do unshare the IPC namespace:

$ uname -r
4.15.0-1-amd64

$ unshare -m -U -r -- mount -t mqueue mqueue /tmp/mqueue
mount: /tmp/mqueue: permission denied.
$ docker run --rm --ipc=host ubuntu:16.04
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"mqueue\\\" to rootfs \\\"/var/lib/docker/165536.165536/overlay2/20b58824702ceeea1c5ca3dfd3f36005133ed8f5b3188cfb58ed8fab7b51b457/merged\\\" at \\\"/dev/mqueue\\\" caused \\\"operation not permitted\\\"\"": unknown.

$ unshare -i -m -U -r -- mount -t mqueue mqueue /tmp/mqueue
$ docker run --rm ubuntu:16.04
$ echo $?
0

Any chance of detecting which case we have at runtime in runc/moby and provide a cleaner error message for users? I can do the patch, but not sure how to detect that.

@cyphar
Copy link
Contributor

cyphar commented Mar 23, 2018

Oh, I might've misread a part of your initial question.

To clarify, I can use mqueue in a user namespace if I do unshare the IPC namespace:

I looked at your output again, and what I would actually expect is that you don't have permission to mount mqueue if you aren't in an IPC namespace where you have privileges (which is what you initially pointed to). So the fact the change allows this now is ... odd. Sorry, I was a little mixed up in the world of SELinux permission issues that I completely missed that you didn't unshare -i in your example. 😉

Looking at the kernel patch again with (slightly) fresher eyes, I now understand why this is permitted. Effectively, if you are in an IPC namespace that already has a mount set up (which will be true in the host) then attempting to mount it will give you the already-created mount. I am actually not sure if this is intentional -- because without doing a permission check I would think that you would allow mounts of a host-side resource inside a container (which is usually very bad). But I've never used mqueue directly so I'm not sure whether they have additional ACL permissions that make this safe.

I will send a mail to Eric Biederman and Al Viro to see if they have an opinion on this.

@cyphar
Copy link
Contributor

cyphar commented Mar 23, 2018

I've posted a patch that should fix the issue to LKML (I've only compile-tested it, so if you can test it that'd be great -- I'm going to be busy until Sunday). https://lkml.org/lkml/2018/3/23/25

@thaJeztah
Copy link
Member

/cc @kolyshkin @justincormack @estesp

avagin pushed a commit to avagin/linux that referenced this issue Mar 25, 2018
This reverts commit 36735a6.

Aleksa Sarai <[email protected]> writes:
> [REGRESSION v4.16-rc6] [PATCH] mqueue: forbid unprivileged user access to internal mount
>
> Felix reported weird behaviour on 4.16.0-rc6 with regards to mqueue[1],
> which was introduced by 36735a6 ("mqueue: switch to on-demand
> creation of internal mount").
>
> Basically, the reproducer boils down to being able to mount mqueue if
> you create a new user namespace, even if you don't unshare the IPC
> namespace.
>
> Previously this was not possible, and you would get an -EPERM. The mount
> is the *host* mqueue mount, which is being cached and just returned from
> mqueue_mount(). To be honest, I'm not sure if this is safe or not (or if
> it was intentional -- since I'm not familiar with mqueue).
>
> To me it looks like there is a missing permission check. I've included a
> patch below that I've compile-tested, and should block the above case.
> Can someone please tell me if I'm missing something? Is this actually
> safe?
>
> [1]: moby/moby#36674

The issue is a lot deeper than a missing permission check.  sb->s_user_ns
was is improperly set as well.  So in addition to the filesystem being
mounted when it should not be mounted, so things are not allow that should
be.

We are practically to the release of 4.16 and there is no agreement between
Al Viro and myself on what the code should looks like to fix things properly.
So revert the code to what it was before so that we can take our time
and discuss this properly.

Fixes: 36735a6 ("mqueue: switch to on-demand creation of internal mount")
Reported-by: Felix Abecassis <[email protected]>
Reported-by: Aleksa Sarai <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
@cyphar
Copy link
Contributor

cyphar commented Mar 25, 2018

The relevant patch has been reverted upstream in torvalds/linux@cfb2f6f. My patch fixed the basic issue, but it turns out that the problem runs much deeper within VFS and so Eric and Al decided to revert the patch and they'll fix it properly in the v4.17 cycle.

This can now be closed I think.

@flx42
Copy link
Contributor Author

flx42 commented Mar 25, 2018

Shouldn't we explicitly prevent this case? Like --net=host and --pid=host with userns.

@cyphar
Copy link
Contributor

cyphar commented Mar 25, 2018

Maybe, though really those restrictions are more to reduce user confusion than anything else -- I'm not sure that blocking --ipc=host is ideal. There's nothing unsafe about running a process in a user namespace with a shared set of host namespaces (in theory).

Then again, it's very unlikely that you'll be able to enable --ipc=host and actually access any host-side IPC resources in a user namespace (because you won't have privileges for things like shm that are owned by another user).

@flx42
Copy link
Contributor Author

flx42 commented Mar 25, 2018

Same for --net=host, we could argue.

But, it seems in the current status it can't work with docker run (you can't decide to opt-out of mqueue), so I guess we do need a better error message.
@thaJeztah thoughts?

@thaJeztah
Copy link
Member

Yes, I think that blocking the other options (--net=host and --pid=host) was done primarily so that a useful error-message is printed. If there's no actual use for having --ipc=host in combination with user-namespaces, I'd be +1 for explicitly blocking it, and printing a more useful error-message.

ping @kolyshkin @estesp any thoughts?

@rhatdan
Copy link
Contributor

rhatdan commented Mar 26, 2018

Well there is nothing to state that a shared IPC between two containers is not possible if they have overrlapping UID's.

@flx42
Copy link
Contributor Author

flx42 commented Mar 26, 2018

@rhatdan this would not be affected, this is fine with user namespaces today:

$ docker run --name=nginx --detach nginx
$ docker run -ti --ipc=container:nginx ubuntu:16.04

@bsousaa
Copy link

bsousaa commented Jul 10, 2023

It looks like this issue went stale. Let me close it.

@bsousaa bsousaa closed this as not planned Won't fix, can't repro, duplicate, stale Jul 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants