-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
main: support rootless mode in userns #1688
Conversation
cc @cyphar |
ee8195d
to
9104981
Compare
utils_linux.go
Outdated
// especially when running within `unshare -m -r`. | ||
// So we use system.GetParentNSeuid() here. | ||
// | ||
// TODO(AkihiroSuda): how to support nested userns? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than checking UID, it might be better to attempt some lightweight operation that requires "real" root privilege.
e.g.
chown(tmpFile, nonZeroUID, nonZeroGID)
readdir(opendir(getpwuid(0).pw_dir)))
Any thought?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The simplest test would be to do mknod
, but I'm not sure I think this is the best idea in the world to be honest. There are plans in the future to provide the ability for previously disallowed syscalls to be allowed through the whole SECCOMP syscall emulation mechanism that's being worked on, so trying to pin down "we are in a user namespace because X doesn't work" is a bit of a stretch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func isRootless() { return os.Geteuid() != 0 || readFile("/proc/self/setgroups") == "deny") }
is fine then?
EDIT: even /proc/self/setgroups == alow
, it might still require rootless... any thought?
Does this PR address this issue: #1658 |
@frezbo no |
@@ -130,6 +130,33 @@ func RunningInUserNS() bool { | |||
return true | |||
} | |||
|
|||
// GetParentNSeuid returns the euid within the parent user namespace | |||
func GetParentNSeuid() int { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @brauner had a patch like this for Ubuntu? This code should also check /proc/self/setgroups
and ensure that it's set to deny
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As commented above, I think we need to find an alternative check to support nested userns
9104981
to
6f6720a
Compare
Updated PR
I think this PR is mergeable now. |
@AkihiroSuda I can't seem to build your branch, here;s the error:
|
@frezbo mv AkihiroSuda opencontainers |
Any thought? |
On second thought I came up with a ternary CLI flag |
I am currently on vacation unfortunately. I will do a review of this when I get back. |
rebased |
6f6720a
to
2995cbc
Compare
Any thought about this? Although |
I think a tertiary flag is okay. I will need to re-review this. |
2995cbc
to
402a379
Compare
@cyphar updated to use ternary flag |
utils_linux.go
Outdated
// So we use system.GetParentNSeuid() here. | ||
// | ||
// TODO(AkihiroSuda): how to support nested userns? | ||
return system.GetParentNSeuid() != 0, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be safer to assume "yes" everytime it is running in an userNS (i.e. doesn't have the full 0-4294967295 range available in /proc/self/uid_map)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually a good point. I think doing both (and if either is true, we assume we are in rootless mode) would work as well. The main problem with the 4294967295 check is that you can create user namespaces that have a full mapping but still require rootless tricks (and as I said above there's no real way of checking if that is true from userspace).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that because it is still theoretically possible to setup the full range and have root mapped to some other user in the host namespace? If I am not missing something, that sounds quite difficult to achieve as it requires an intermediate user namespace with at least two mappings (root in the intermediate namespace mapped to non root in the host) that still has enough privileges to setup the full range for the final userNS.
In any case, I fully agree that having both checks is safer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that because it is still theoretically possible to setup the full range and have root mapped to some other user in the host namespace?
That is also possible, but it's actually much simpler than that. Even if you have a 1-to-1 mapping of all users in a new user namespace certain operations will still fail purely because you are not in &init_user_ns
(such as mknod(2)
or most of the mounting we do) -- because capabilities are scoped to your user namespace and your host UID is irrelevant. So we'd need to apply rootless tricks even in that case.
While this is an edge-case it means that in general we have no way of checking whether we are in a user namespace or the host user namespace. The closest thing we have is ioctl(NS_GET_PARENT)
but returns -EPERM
even if there is no parent -- I would argue that it should return -ENOENT
in that case but @ebiederm might not agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the kernel the inode for the initial user namespace is hardcoded, and it seems like a reliable (and undocumented AFAICS) way to detect if we are running in the init userNS:
$ cat > userns_checker.c << EOF
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
int main ()
{
struct stat sb;
if (stat ("/proc/self/ns/user", &sb) < 0)
return -1;
printf ("is main userns? %s\n", 0xEFFFFFFDU == sb.st_ino ? "yes" : "no");
return 0;
}
EOF
$ gcc -o userns_checker userns_checker.c
$ ./userns_checker
is main userns? yes
$ unshare -r ./userns_checker
is main userns? no
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rebased PR and added the 4294967295 check.
The 0xEFFFFFFD check can be follow-up PR after discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell this is not guaranteed -- it just happens to be that I checked again and it is hardcoded as ns_alloc_inum
returns the same thing on each boot. But any kernel change or module that results in proc_alloc_inum
being called before &init_user_ns
is instantiated is going to result in that inode number changing.PROC_USER_INIT_INO
.
Not to mention that (as Eric said) special inode numbers in general are not something you should depend on (a perfect example is "detecting" chroot(2)
by the inode number of /
-- something that always breaks every time someone tries it with a new filesystem because they all use different inode numbers for /
).
Giuseppe Scrivano <[email protected]> writes:
@giuseppe commented on this pull request.
-------------------------------------------------------------------------------------------------------------------
In utils_linux.go:
> + if context != nil {
+ b, err := parseBoolOrAuto(context.GlobalString("rootless"))
+ if err != nil {
+ return false, err
+ }
+ if b != nil {
+ return *b, nil
+ }
+ // nil b stands for "auto detect"
+ }
+ // Even if os.Geteuid() == 0, it might still require rootless mode,
+ // especially when running within userns.
+ // So we use system.GetParentNSeuid() here.
+ //
+ // TODO(AkihiroSuda): how to support nested userns?
+ return system.GetParentNSeuid() != 0, nil
In the kernel the inode for the initial user namespace is hardcoded, and it seems like a reliable (and undocumented AFAICS) way to
detect if we are running in the init userNS:
$ cat > userns_checker.c << EOF
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
int main ()
{
struct stat sb;
if (stat ("/proc/self/ns/user", &sb) < 0)
return -1;
printf ("is main userns? %s\n", 0xEFFFFFFDU == sb.st_ino ? "yes" : "no");
return 0;
}
EOF
$ gcc -o userns_checker userns_checker.c
$ ./userns_checker
is main userns? yes
$ unshare -r ./userns_checker
is main userns? no
What do you think?
It is not safe to hard code inode numbers. They are an implementation
detail.
Eric
|
Signed-off-by: Akihiro Suda <[email protected]>
Running rootless containers in userns is useful for mounting filesystems (e.g. overlay) with mapped euid 0, but without actual root privilege. Usage: (Note that `unshare --mount` requires `--map-root-user`) user$ mkdir lower upper work rootfs user$ curl http://dl-cdn.alpinelinux.org/alpine/v3.7/releases/x86_64/alpine-minirootfs-3.7.0-x86_64.tar.gz | tar Cxz ./lower || ( true; echo "mknod errors were ignored" ) user$ unshare --mount --map-root-user mappedroot# runc spec --rootless mappedroot# sed -i 's/"readonly": true/"readonly": false/g' config.json mappedroot# mount -t overlay -o lowerdir=./lower,upperdir=./upper,workdir=./work overlayfs ./rootfs mappedroot# runc run foo Signed-off-by: Akihiro Suda <[email protected]>
Signed-off-by: Akihiro Suda <[email protected]>
@ebiederm How do you feel about making |
Aleksa Sarai <[email protected]> writes:
@ebiederm How do you feel about making NS_GET_PARENT return -ENOENT if the user is in the namespace and the namespace is the
init_ns?
From a basic standpoint I don't think there should be any difference
between a parent you are not allowed to observe and not having a parent
at all.
I think you should find a way to test if the functionality you want is
supported. As what you are allowed to do in a user namespace is not
necessarily fixed. Or a lsm might deny you something for reasons of
it's own.
In the long run it is going to be better to test if things you want to
do can be done, or if not do them another way.
Eric
|
Signed-off-by: Akihiro Suda <[email protected]>
Added commit c938157 for allowing |
@dqminh PTAL? |
Caused by: * opencontainers#1688 0e56164 * opencontainers#1759 dd67ab1 Signed-off-by: Akihiro Suda <[email protected]>
Caused by: * opencontainers#1688 0e56164 * opencontainers#1759 dd67ab1 Signed-off-by: Akihiro Suda <[email protected]>
/* | ||
* We assume we are in the initial user namespace if we have a full | ||
* range - 4294967295 uids starting at uid 0. | ||
*/ | ||
if a == 0 && b == 0 && c == 4294967295 { | ||
if len(uidmap) == 1 && uidmap[0].ID == 0 && uidmap[0].ParentID == 0 && uidmap[0].Count == 4294967295 { | ||
return false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this broke ARM builds in Moby;
linux_amd64_netgo:/usr/local/go/pkg/linux_amd64_netgo" -e GOARM=6 "docker-dev:master" hack/make.sh binary
04:21:31
04:21:32 Removing bundles/
04:21:32
04:21:32 ---> Making bundle: binary (in bundles/binary)
04:21:32 Building: bundles/binary-daemon/dockerd-18.06.0-ce-dev
04:22:29 # github.com/docker/docker/vendor/github.com/opencontainers/runc/libcontainer/system
04:22:29 vendor/github.com/opencontainers/runc/libcontainer/system/linux.go:119:89: constant 4294967295 overflows int
The --rootless flag was introduced in opencontainers/runc#1688. In most cases runc itself can detect the appropriate value, but it is considered to be there are some corner cases. Signed-off-by: Akihiro Suda <[email protected]>
The --rootless flag was introduced in opencontainers/runc#1688. In most cases runc itself can detect the appropriate value, but it is considered to be there are some corner cases. Signed-off-by: Akihiro Suda <[email protected]>
Running rootless containers in userns is useful for mounting
filesystems (e.g. overlay) with mapped euid 0, but without actual root
privilege.
Usage: (Note that
unshare --mount
requires--map-root-user
)Note: unprivileged overlay is only supported in Ubuntu and a few distros (http://kernel.ubuntu.com/git/ubuntu/ubuntu-artful.git/commit/fs/overlayfs?h=Ubuntu-4.13.0-25.29&id=0a414bdc3d01f3b61ed86cfe3ce8b63a9240eba7)
Signed-off-by: Akihiro Suda [email protected]