main: support rootless mode in userns #1688

AkihiroSuda · 2018-01-13T07:54:45Z

Running rootless containers in userns is useful for mounting
filesystems (e.g. overlay) with mapped euid 0, but without actual root
privilege.

Usage: (Note that unshare --mount requires --map-root-user)

  user$ mkdir lower upper work rootfs
  user$ curl http://dl-cdn.alpinelinux.org/alpine/v3.7/releases/x86_64/alpine-minirootfs-3.7.0-x86_64.tar.gz | tar Cxz ./lower || ( true; echo "mknod errors were ignored" )
  user$ unshare --mount --map-root-user
  mappedroot# runc spec --rootless
  mappedroot# sed -i 's/"readonly": true/"readonly": false/g' config.json
  mappedroot# mount -t overlay -o lowerdir=./lower,upperdir=./upper,workdir=./work overlayfs ./rootfs
  mappedroot# runc run foo

Note: unprivileged overlay is only supported in Ubuntu and a few distros (http://kernel.ubuntu.com/git/ubuntu/ubuntu-artful.git/commit/fs/overlayfs?h=Ubuntu-4.13.0-25.29&id=0a414bdc3d01f3b61ed86cfe3ce8b63a9240eba7)

Signed-off-by: Akihiro Suda [email protected]

AkihiroSuda · 2018-01-13T07:54:57Z

cc @cyphar

AkihiroSuda · 2018-01-13T09:07:37Z

utils_linux.go

+	// especially when running within `unshare -m -r`.
+	// So we use system.GetParentNSeuid() here.
+	//
+	// TODO(AkihiroSuda): how to support nested userns?


Rather than checking UID, it might be better to attempt some lightweight operation that requires "real" root privilege.

e.g.

~~chown(tmpFile, nonZeroUID, nonZeroGID)~~

~~readdir(opendir(getpwuid(0).pw_dir))~~

Any thought?

The simplest test would be to do mknod, but to be honest I'm not sure this is the best idea. There are plans to allow previously disallowed syscalls through the seccomp syscall emulation mechanism that's being worked on, so trying to pin down "we are in a user namespace because X doesn't work" is a bit of a stretch.

~~func isRootless() { return os.Geteuid() != 0 || readFile("/proc/self/setgroups") == "deny" } is fine then?~~
EDIT: even if /proc/self/setgroups is "allow", rootless mode might still be required... any thoughts?

frezbo · 2018-01-13T10:53:03Z

Does this PR address this issue: #1658?

AkihiroSuda · 2018-01-13T12:18:44Z

@frezbo no

cyphar · 2018-01-13T13:44:55Z

libcontainer/system/linux.go

@@ -130,6 +130,33 @@ func RunningInUserNS() bool {
 	return true
 }

+// GetParentNSeuid returns the euid within the parent user namespace
+func GetParentNSeuid() int {


I think @brauner had a patch like this for Ubuntu? This code should also check /proc/self/setgroups and ensure that it's set to deny.

As commented above, I think we need to find an alternative check to support nested userns
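The /proc/self/setgroups check suggested above could look roughly like the following. This is a minimal sketch, not the code in this PR: the helper name `setgroupsIsDeny` is hypothetical, and it only covers the string comparison, leaving the decision of what to do with the result to the caller.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// setgroupsIsDeny is a hypothetical helper: it reports whether the content
// of a /proc/PID/setgroups file is "deny" (the kernel writes either
// "allow" or "deny", followed by a newline).
func setgroupsIsDeny(content string) bool {
	return strings.TrimSpace(content) == "deny"
}

func main() {
	data, err := os.ReadFile("/proc/self/setgroups")
	if err != nil {
		// The file is missing on non-Linux systems and on old kernels.
		fmt.Println("could not read /proc/self/setgroups:", err)
		return
	}
	fmt.Println("setgroups denied:", setgroupsIsDeny(string(data)))
}
```

As noted in the EDIT above, a value of "allow" does not prove that rootless mode is unnecessary, so a check like this can only be one input among several.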

AkihiroSuda · 2018-01-16T09:13:53Z

Updated PR:

* Parser commit eb671b is shared with "rootless: optional support for generating config with subuid map" #1692 (bea27f1)
* Add global bool flag --force-rootless to support nested userns.

I think this PR is mergeable now.

frezbo · 2018-01-16T12:49:56Z

@AkihiroSuda I can't seem to build your branch, here's the error:

go build -buildmode=pie  -ldflags "-X main.gitCommit="6f6720a754f7cc69274762586d678b85aecfbeba" -X main.version=1.0.0-rc4+dev " -tags "seccomp" -o runc .
# github.com/AkihiroSuda/runc
./signals.go:137:28: cannot use ws (type "github.com/AkihiroSuda/runc/vendor/golang.org/x/sys/unix".WaitStatus) as type "github.com/opencontainers/runc/vendor/golang.org/x/sys/unix".WaitStatus in argument to utils.ExitStatus
./utils_linux.go:230:9: undefined: system.GetParentNSeuid
./utils_linux.go:239:7: cannot use spec (type *"github.com/AkihiroSuda/runc/vendor/github.com/opencontainers/runtime-spec/specs-go".Spec) as type *"github.com/opencontainers/runc/vendor/github.com/opencontainers/runtime-spec/specs-go".Spec in field value
make: *** [Makefile:32: runc] Error 2
runc (aws:default)(git:unshare-m-r)$

AkihiroSuda · 2018-01-16T13:19:53Z

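@frezbo mv AkihiroSuda opencontainers (the checkout has to live under github.com/opencontainers in GOPATH, otherwise the vendored package paths do not match)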

AkihiroSuda · 2018-01-23T04:56:16Z

Any thought?
This is the (only) blocker for rootless containerd: containerd/containerd#2006

AkihiroSuda · 2018-01-30T03:29:27Z

On second thought I came up with a ternary CLI flag --rootless=auto/true/false rather than boolean --force-rootless=true/false. But I'm not sure such a ternary flag is acceptable.

cyphar · 2018-02-23T09:29:21Z

I am currently on vacation unfortunately. I will do a review of this when I get back.

AkihiroSuda · 2018-03-19T06:19:58Z

rebased

AkihiroSuda · 2018-03-23T04:27:26Z

On second thought I came up with a ternary CLI flag --rootless=auto/true/false rather than boolean --force-rootless=true/false. But I'm not sure such a ternary flag is acceptable.

Any thought about this?

Although isRootless() auto-detection might not be perfect, I think this PR can be merged once we agree on the CLI flag.
The auto-detection logic can be improved later.

cc @ehotinger (genuinetools/img#69 (comment))

cyphar · 2018-03-23T10:30:37Z

I think a ternary flag is okay. I will need to re-review this.

AkihiroSuda · 2018-03-26T06:20:37Z

@cyphar updated to use ternary flag
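For readers following along, a ternary flag of this kind can be parsed into a `*bool`, where nil stands for "auto detect". The function name `parseBoolOrAuto` appears in the diff quoted later in this thread, but the body below is only a sketch under assumptions: in particular, treating the empty string as "auto" and accepting every spelling that `strconv.ParseBool` accepts are guesses, not confirmed behaviour of the patch.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBoolOrAuto maps "auto" (and, as an assumption, "") to nil and any
// other value to a parsed boolean. A nil result means "auto detect".
func parseBoolOrAuto(s string) (*bool, error) {
	if s == "" || strings.ToLower(s) == "auto" {
		return nil, nil
	}
	b, err := strconv.ParseBool(s)
	if err != nil {
		return nil, err
	}
	return &b, nil
}

func main() {
	for _, v := range []string{"auto", "true", "false", "maybe"} {
		b, err := parseBoolOrAuto(v)
		switch {
		case err != nil:
			fmt.Printf("--rootless=%s: error: %v\n", v, err)
		case b == nil:
			fmt.Printf("--rootless=%s: auto detect\n", v)
		default:
			fmt.Printf("--rootless=%s: forced to %v\n", v, *b)
		}
	}
}
```

Returning a pointer keeps the three states distinct without introducing a separate enum type, at the cost of a nil check at the call site.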

cyphar · 2018-05-08T06:08:53Z

LGTM, I've played around with this and it works pretty well. As you noted in your comment there isn't really a nice way of detecting whether we are in a user namespace in a non-destructive way (unfortunately) -- so this will have to do for now.

/cc @opencontainers/runc-maintainers

giuseppe · 2018-05-08T12:57:10Z

utils_linux.go

+	// So we use system.GetParentNSeuid() here.
+	//
+	// TODO(AkihiroSuda): how to support nested userns?
+	return system.GetParentNSeuid() != 0, nil


Would it be safer to assume "yes" every time it is running in a userNS (i.e. it doesn't have the full 0-4294967295 range available in /proc/self/uid_map)?

This is actually a good point. I think doing both (and if either is true, we assume we are in rootless mode) would work as well. The main problem with the 4294967295 check is that you can create user namespaces that have a full mapping but still require rootless tricks (and as I said above there's no real way of checking if that is true from userspace).

Is that because it is still theoretically possible to set up the full range and have root mapped to some other user in the host namespace? If I am not missing something, that sounds quite difficult to achieve, as it requires an intermediate user namespace with at least two mappings (root in the intermediate namespace mapped to non-root in the host) that still has enough privileges to set up the full range for the final userNS.

In any case, I fully agree that having both checks is safer.

Is that because it is still theoretically possible to set up the full range and have root mapped to some other user in the host namespace?

That is also possible, but it's actually much simpler than that. Even if you have a 1-to-1 mapping of all users in a new user namespace certain operations will still fail purely because you are not in &init_user_ns (such as mknod(2) or most of the mounting we do) -- because capabilities are scoped to your user namespace and your host UID is irrelevant. So we'd need to apply rootless tricks even in that case.

While this is an edge case, it means that in general we have no way of checking whether we are in a user namespace or the host user namespace. The closest thing we have is ioctl(NS_GET_PARENT), but it returns -EPERM even if there is no parent -- I would argue that it should return -ENOENT in that case but @ebiederm might not agree.

In the kernel the inode for the initial user namespace is hardcoded, and it seems like a reliable (and undocumented AFAICS) way to detect if we are running in the init userNS:

  $ cat > userns_checker.c << EOF
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <unistd.h>
  #include <stdio.h>
  #include <string.h>

  int
  main ()
  {
    struct stat sb;
    if (stat ("/proc/self/ns/user", &sb) < 0)
      return -1;
    printf ("is main userns? %s\n", 0xEFFFFFFDU == sb.st_ino ? "yes" : "no");
    return 0;
  }
  EOF
  $ gcc -o userns_checker userns_checker.c
  $ ./userns_checker
  is main userns? yes
  $ unshare -r ./userns_checker
  is main userns? no

What do you think?

Rebased PR and added the 4294967295 check.
The 0xEFFFFFFD check can be a follow-up PR after discussion.
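The 4294967295 check discussed above can be sketched as follows. This is an illustration under assumptions rather than the merged code: the field names ID, ParentID and Count follow the diff quoted later in this thread, while `idMap`, `parseIDMap` and `hasFullRange` are hypothetical names. The fields are int64 so that the constant 4294967295 also fits on 32-bit platforms.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// idMap is one line of /proc/PID/uid_map: the first ID inside the
// namespace, the first ID in the parent namespace, and the range length.
type idMap struct {
	ID, ParentID, Count int64
}

// parseIDMap parses the whitespace-separated three-column uid_map format.
func parseIDMap(content string) ([]idMap, error) {
	var maps []idMap
	for _, line := range strings.Split(content, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) != 3 {
			return nil, fmt.Errorf("malformed uid_map line %q", line)
		}
		var v [3]int64
		for i, f := range fields {
			n, err := strconv.ParseInt(f, 10, 64)
			if err != nil {
				return nil, fmt.Errorf("malformed uid_map line %q: %w", line, err)
			}
			v[i] = n
		}
		maps = append(maps, idMap{ID: v[0], ParentID: v[1], Count: v[2]})
	}
	return maps, nil
}

// hasFullRange reports whether the map is the single identity mapping of
// 4294967295 IDs starting at 0, as seen in the initial user namespace.
func hasFullRange(maps []idMap) bool {
	return len(maps) == 1 && maps[0].ID == 0 &&
		maps[0].ParentID == 0 && maps[0].Count == 4294967295
}

func main() {
	data, err := os.ReadFile("/proc/self/uid_map")
	if err != nil {
		fmt.Println("could not read /proc/self/uid_map:", err)
		return
	}
	maps, err := parseIDMap(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("full range:", hasFullRange(maps))
}
```

As pointed out earlier in the thread, a full-range mapping does not prove that the process is in the initial user namespace, so this only narrows down the obvious cases.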

As far as I can tell this is not guaranteed -- it just happens to be that ns_alloc_inum returns the same thing on each boot. But any kernel change or module that results in proc_alloc_inum being called before &init_user_ns is instantiated is going to result in that inode number changing. I checked again and it is hardcoded as PROC_USER_INIT_INO.

Not to mention that (as Eric said) special inode numbers in general are not something you should depend on (a perfect example is "detecting" chroot(2) by the inode number of / -- something that always breaks every time someone tries it with a new filesystem because they all use different inode numbers for /).

ebiederm · 2018-05-10T02:42:55Z

Giuseppe Scrivano <[email protected]> writes:

> @giuseppe commented on this pull request.
>
> In utils_linux.go:
>
> +	if context != nil {
> +		b, err := parseBoolOrAuto(context.GlobalString("rootless"))
> +		if err != nil {
> +			return false, err
> +		}
> +		if b != nil {
> +			return *b, nil
> +		}
> +		// nil b stands for "auto detect"
> +	}
> +	// Even if os.Geteuid() == 0, it might still require rootless mode,
> +	// especially when running within userns.
> +	// So we use system.GetParentNSeuid() here.
> +	//
> +	// TODO(AkihiroSuda): how to support nested userns?
> +	return system.GetParentNSeuid() != 0, nil
>
> In the kernel the inode for the initial user namespace is hardcoded, and it seems like a reliable (and undocumented AFAICS) way to detect if we are running in the init userNS:
>
>   $ cat > userns_checker.c << EOF
>   #include <sys/types.h>
>   #include <sys/stat.h>
>   #include <unistd.h>
>   #include <stdio.h>
>   #include <string.h>
>
>   int
>   main ()
>   {
>     struct stat sb;
>     if (stat ("/proc/self/ns/user", &sb) < 0)
>       return -1;
>     printf ("is main userns? %s\n", 0xEFFFFFFDU == sb.st_ino ? "yes" : "no");
>     return 0;
>   }
>   EOF
>   $ gcc -o userns_checker userns_checker.c
>   $ ./userns_checker
>   is main userns? yes
>   $ unshare -r ./userns_checker
>   is main userns? no
>
> What do you think?

It is not safe to hard code inode numbers. They are an implementation detail.

Eric


cyphar · 2018-05-10T03:33:34Z

@ebiederm How do you feel about making NS_GET_PARENT return -ENOENT if the user is in the namespace and the namespace is the init_ns?

cyphar · 2018-05-10T03:35:08Z

LGTM on the updated patchset.

ebiederm · 2018-05-10T03:45:33Z

Aleksa Sarai <[email protected]> writes:

@ebiederm How do you feel about making NS_GET_PARENT return -ENOENT if the user is in the namespace and the namespace is the init_ns?

From a basic standpoint I don't think there should be any difference between a parent you are not allowed to observe and not having a parent at all.

I think you should find a way to test if the functionality you want is supported, as what you are allowed to do in a user namespace is not necessarily fixed. Or an LSM might deny you something for reasons of its own.

In the long run it is going to be better to test if things you want to do can be done, or if not do them another way.

Eric
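The "test if it can be done, otherwise do it another way" approach suggested above can be sketched generically. This is not code from runc: `tryThenFallback`, `deniedOp` and `okOp` are hypothetical names, and the two stand-in operations merely simulate a call such as mknod(2) or a mount being refused.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// tryThenFallback runs the privileged operation first. Only if it is
// refused with EPERM or EACCES does it run the fallback; any other error
// is returned unchanged. usedFallback tells the caller which path ran.
func tryThenFallback(privileged, fallback func() error) (usedFallback bool, err error) {
	err = privileged()
	if err == nil {
		return false, nil
	}
	if errors.Is(err, syscall.EPERM) || errors.Is(err, syscall.EACCES) {
		return true, fallback()
	}
	return false, err
}

// deniedOp stands in for an operation the kernel or an LSM refuses.
func deniedOp() error { return fmt.Errorf("mknod: %w", syscall.EPERM) }

// okOp stands in for an operation that succeeds.
func okOp() error { return nil }

func main() {
	used, err := tryThenFallback(deniedOp, okOp)
	fmt.Println("used fallback:", used, "error:", err)
}
```

The design choice here is that no attempt is made to infer why the operation was refused; the caller only learns that the alternative path had to be taken.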


AkihiroSuda · 2018-05-24T06:00:49Z

Added commit c938157 for allowing setgroups.
(Already reviewed by @cyphar in genuinetools/img#96 (comment) )

AkihiroSuda · 2018-05-28T17:51:30Z

@dqminh PTAL?

cyphar · 2018-05-29T19:00:43Z

LGTM again.

crosbymichael · 2018-05-29T19:41:12Z

LGTM

Caused by: * opencontainers#1688 0e56164 * opencontainers#1759 dd67ab1 Signed-off-by: Akihiro Suda <[email protected]>

thaJeztah · 2018-06-14T14:33:04Z

libcontainer/system/linux.go

 	/*
 	 * We assume we are in the initial user namespace if we have a full
 	 * range - 4294967295 uids starting at uid 0.
 	 */
-	if a == 0 && b == 0 && c == 4294967295 {
+	if len(uidmap) == 1 && uidmap[0].ID == 0 && uidmap[0].ParentID == 0 && uidmap[0].Count == 4294967295 {
 		return false


Looks like this broke ARM builds in Moby:

  linux_amd64_netgo:/usr/local/go/pkg/linux_amd64_netgo" -e GOARM=6 "docker-dev:master" hack/make.sh binary
  04:21:31
  04:21:32 Removing bundles/
  04:21:32
  04:21:32 ---> Making bundle: binary (in bundles/binary)
  04:21:32 Building: bundles/binary-daemon/dockerd-18.06.0-ce-dev
  04:22:29 # github.com/docker/docker/vendor/github.com/opencontainers/runc/libcontainer/system
  04:22:29 vendor/github.com/opencontainers/runc/libcontainer/system/linux.go:119:89: constant 4294967295 overflows int

The --rootless flag was introduced in opencontainers/runc#1688. In most cases runc itself can detect the appropriate value, but there are considered to be some corner cases.

Signed-off-by: Akihiro Suda <[email protected]>

AkihiroSuda force-pushed the unshare-m-r branch 3 times, most recently from ee8195d to 9104981 Compare January 13, 2018 09:01


cyphar reviewed Jan 13, 2018


This was referenced Jan 15, 2018

experimental support for rootless mode containerd/containerd#2006

Closed

Support for arbitrary bind mount directories on read-only FS for rootless containers #1671

Open

olifre mentioned this pull request Jan 15, 2018

Overlay could be implemented portably, unprivileged apptainer/singularity#1207

Closed

AkihiroSuda force-pushed the unshare-m-r branch from 9104981 to 6f6720a Compare January 16, 2018 09:12

AkihiroSuda mentioned this pull request Jan 23, 2018

why does buildkitd need root privilege? moby/buildkit#252

Closed

cyphar added the rootless-containers label Jan 24, 2018

cyphar self-assigned this Feb 4, 2018

cyphar assigned cyphar and unassigned cyphar Feb 23, 2018

cwbeitel mentioned this pull request Feb 25, 2018

Investigate img and kaniko for building containers kubeflow/kubeflow#289

Closed

AkihiroSuda force-pushed the unshare-m-r branch from 6f6720a to 2995cbc Compare March 19, 2018 06:20

AkihiroSuda force-pushed the unshare-m-r branch from 2995cbc to 402a379 Compare March 26, 2018 06:20

giuseppe reviewed May 8, 2018


AkihiroSuda added 3 commits May 10, 2018 12:16

libcontainer: add parser for /etc/sub{u,g}id and /proc/PID/{u,g}id_map

9c7d8bc

Signed-off-by: Akihiro Suda <[email protected]>

main: add condition to isRootless()

cdb7f23

Signed-off-by: Akihiro Suda <[email protected]>

AkihiroSuda force-pushed the unshare-m-r branch from 402a379 to cdb7f23 Compare May 10, 2018 03:25

libcontainer: remove extra CAP_SETGID check for SetgroupAttr

c938157

Signed-off-by: Akihiro Suda <[email protected]>

AkihiroSuda mentioned this pull request May 24, 2018

setgroups broken genuinetools/img#96

Closed

crosbymichael merged commit 0e56164 into opencontainers:master May 29, 2018

AkihiroSuda added a commit to AkihiroSuda/runc that referenced this pull request May 30, 2018

Fix merge conflict

63bb0fe

Caused by: * opencontainers#1688 0e56164 * opencontainers#1759 dd67ab1 Signed-off-by: Akihiro Suda <[email protected]>

AkihiroSuda mentioned this pull request May 30, 2018

Fix merge conflict #1808

Merged

AkihiroSuda added a commit to AkihiroSuda/runc that referenced this pull request May 30, 2018

Fix merge conflict

1e44f9e

Caused by: * opencontainers#1688 0e56164 * opencontainers#1759 dd67ab1 Signed-off-by: Akihiro Suda <[email protected]>

astj mentioned this pull request Jun 14, 2018

Build fails with GOOS=linux GOARCH=386 due to int overflow #1818

Closed

thaJeztah reviewed Jun 14, 2018


tiborvass mentioned this pull request Jun 14, 2018

libcontainer: fix compilation on GOARCH=arm GOARM=6 (32 bits) #1819

Merged

AkihiroSuda mentioned this pull request Jul 2, 2018

rootless: fix Docker-in-LXD regression #1833

Closed

This was referenced Jul 2, 2018

add support for --rootless containerd/go-runc#43

Merged

Don't always enable rootless mode in userns #1837

Closed

tianon mentioned this pull request Sep 12, 2018

hard coded constants outside 'int' range on 32-bit platforms #1885

Closed

thaJeztah mentioned this pull request Mar 22, 2021

libcontainer/system: move userns utilities, remove GetParentNSeuid, UIDMapInUserNS #2850

Merged
