Defense in Depth - User Namespaces #228

apyrgio · 2022-10-17T15:39:50Z

Parent issue: #221

User namespaces are very important, since they ensure that:

Root within the container maps to the parent user outside the container.
Users within the container map to non-existing users outside the container.

By ensuring that the user within the container (dangerzone, UID 1000) maps to a non-existing user outside the container, we make an attacker's job significantly harder. The current situation is:

On Linux, we don't use user namespaces fully, since we run containers with --userns keep-id, which makes the dangerzone user within the container have the same UID as the user outside the container.
On Windows/macOS, Docker Desktop doesn't support user namespaces (see docker/for-win#6897, "WSL 2: userns-remap not working / daemon.json not picked up", and docker/for-mac#3280, "Daemon fails to start with "userns-remap" enabled", respectively).

Linux

Decide on a UID mapping (1000 inside the container, x > 1000 outside the container) before starting the container.
Create temporary directories for container I/O, owned by x > 1000 outside the container.
Copy in the source files to the temporary directory for the first container (will also fix Permission denied: container can't write to /dangerzone #157)
Run podman and specify the mapping for the container.
Copy out the converted files.

Windows/MacOS

Test Podman Desktop and check if it uses user namespaces.

apyrgio · 2022-11-03

Linux User Namespaces

References:

Linux User Namespaces were introduced in Linux Kernel 3.8. They look similar to
PID namespaces, where PID 1 inside the namespace is mapped to a different PID
outside the namespace. However, they are trickier than that, as they are also a
namespace for user capabilities. Due to their sensitive nature, several OSes
kept them disabled for years after their inclusion, until they reached a stable
status.

Let's demystify them:

User namespaces are more than just namespaces for UIDs and GIDs. They are also a
namespace for user capabilities (see capabilities(7)), i.e., what makes a user
root. We won't touch on this subject here.

All users (unless restricted by system configuration) can create a user
namespace. User namespaces are basically a mapping between UIDs/GIDs inside the
namespace, and UIDs/GIDs outside the namespace:

<UID in namespace>  <UID in parent namespace>   <range>
<UID in namespace>  <UID in parent namespace>   <range>
...

Examples:

# UID 0 in the namespace maps to UID 0 in the parent namespace, UID 1000 in the
# namespace maps to UID 1000 in the parent namespace, and that's all.
0       0       1
1000    1000    1

# UID 0 in the namespace maps to UID 0 in the parent namespace, UID 1 in
# namespace maps to UID 1 in the parent namespace, and so forth up until UID
# 999 -> 999.
0       0       1000

# UID 0 in the namespace maps to UID 100000 in the parent namespace, UID 1 in
# namespace maps to UID 100001 in the parent namespace, and so forth up until
# UID 65535 -> 165535. This is a pretty typical configuration.
0       100000  65536

# UID 1001 in the namespace maps to UID 1000 in the parent namespace, and that's
# all.
1001    1000    1
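
The lookup that these examples describe can be sketched in a few lines of Python. This is only an illustration of how a (start inside, start outside, range) triple is read; the function names `parse_id_map` and `to_parent_id` are made up for this sketch and are not part of any tool mentioned here.

```python
from typing import List, Optional, Tuple

# (first ID in the namespace, first ID in the parent namespace, range length)
IdMap = List[Tuple[int, int, int]]


def parse_id_map(text: str) -> IdMap:
    """Parse text in the uid_map/gid_map format into integer triples."""
    entries = []
    for line in text.splitlines():
        # Tolerate the explanatory comments used in the examples above.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        inside, outside, count = (int(field) for field in line.split())
        entries.append((inside, outside, count))
    return entries


def to_parent_id(id_map: IdMap, inside_id: int) -> Optional[int]:
    """Return the parent-namespace ID for inside_id, or None if it is unmapped."""
    for inside, outside, count in id_map:
        if inside <= inside_id < inside + count:
            return outside + (inside_id - inside)
    return None


typical = parse_id_map("0 100000 65536")
print(to_parent_id(typical, 0))      # 100000
print(to_parent_id(typical, 1000))   # 101000
print(to_parent_id(parse_id_map("1001 1000 1"), 1000))  # None (unmapped)
```

An unmapped ID (the `None` case) is what shows up as the overflow UID from inside the namespace.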

This mapping is available through /proc/self/{u,g}id_map. For the root
namespace, this mapping is a dummy one (all UIDs in the namespace map to the
same UIDs in the parent namespace), but for the created namespace, the mapping
is empty by default:

$ cat /proc/self/{u,g}id_map
0          0 4294967295
0          0 4294967295
$ unshare -U cat /proc/self/{u,g}id_map

For user namespaces with empty mappings, we need to keep a few things in mind:

The Linux Kernel has an overflow UID (/proc/sys/kernel/overflowuid), which
by default is nobody/65534. If a user namespace has no mapping, all IDs in
that namespace will show up as nobody.
Processes in that namespace inherit the UID of the user that started them in
the parent namespace, even if they show up as nobody. This means that they
can see the files that a user in the parent namespace can see.
Until a mapping exists, processes within that namespace cannot perform any
UID action (e.g., chown), even though they have a UID of their own, because
the kernel cannot translate it to a UID in the parent namespace.

This mapping is writable only by processes with sufficient rights, and only
once (see user_namespaces(7)).

I think that the simplest mapping that can exist is just assigning a container
UID to the user's UID in the parent namespace. Anything more than that
essentially requires root permissions.

Once a mapping exists, then:

There can be a UID 0 process in that namespace.
Any UID/GID action that processes perform will be translated by the Linux
Kernel, e.g., for fs permissions.

Rootless Podman and Linux User Namespaces

References:

Let's see how rootless Podman deals with user namespaces.

When Podman creates a new user namespace, it needs to assign a UID mapping to
it. Since it's rootless though, it cannot easily do so, because it doesn't
have the necessary capabilities. That's where the new{u,g}idmap binaries come
into play. They are privileged binaries, installed either as setuid root or
with file capabilities (check with ls -l $(which newuidmap) or
getcap $(which newuidmap), respectively), which consult the /etc/sub{u,g}id
files (writable only by root) and assign the mapping. These files have
a different format than /proc/self/uid_map:

<username/UID>:<start of subordinate UIDs>:<count>
<username/UID>:<start of subordinate UIDs>:<count>
...

Basically, they define the range of host UIDs (subordinate UIDs) that a user has
at their disposal, when creating a container. A range like user:100000:65536
means that the user can specify a UID mapping in the container like 0 100000 65536.
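
As a rough illustration of that format, the following sketch collects the subordinate ranges that belong to one user. The helper name `subordinate_ranges` and the sample file contents are assumptions made for this example; the function takes the file contents as a string so that it does not depend on a particular host.

```python
from typing import List, Tuple


def subordinate_ranges(subid_text: str, username: str, uid: int) -> List[Tuple[int, int]]:
    """Return (start, count) pairs for lines owned by username or by its numeric UID."""
    ranges = []
    for line in subid_text.splitlines():
        fields = line.strip().split(":")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        owner, start, count = fields
        if owner in (username, str(uid)):
            ranges.append((int(start), int(count)))
    return ranges


# Hypothetical file contents; a user may own more than one line.
example = "user:100000:65536\nother:165536:65536\n1000:300000:1000\n"
print(subordinate_ranges(example, "user", 1000))
# [(100000, 65536), (300000, 1000)]
```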

If there are no /etc/sub{u,g}id files, then the default mapping is:

$ podman unshare cat /proc/self/uid_map
0       1000          1

That is, the root in the container maps to the user outside the container, which
is the most the Linux Kernel allows without extra privileges. If these files do
exist (e.g., with an entry user:100000:65536), the default mapping is:

$ podman unshare cat /proc/self/uid_map
0       1000        1
1       100000      65536

Essentially, the root user in the container maps to the user outside the
container, and every other UID in the container maps to UIDs >= 100000 in the
host. Also note that rootless Podman creates a single user namespace per user,
so these mappings are shared between all of that user's rootless containers.

Podman has several options to control the mapping (see
https://docs.podman.io/en/latest/markdown/podman-run.1.html#userns-mode). Let's
see some in action:

# --userns="" (or no --userns passed)
$ podman run -it --rm docker.io/library/alpine:edge cat /proc/self/uid_map
0       1000          1
1     100000      65536

# --userns keep-id
$ podman run -it --rm --userns keep-id docker.io/library/alpine:edge cat /proc/self/uid_map
0           1           1000
1000        0           1
1001        1001        64536

In the first case, we see that the container root maps to the user who started
the container, and all UIDs after that map to the subordinate UIDs of that user
in /etc/subuid.

In the second case, we notice something weird. The root of the
container appears to map to host UID 1, and UID 1000 within the container to
host UID 0. This is not the case of course. Podman uses intermediate UIDs when
it performs its own mapping. In practice, the second column stops being "host
UID" and becomes "Nth subordinate UID". So if /etc/subuid contains
user:100000:65536, the above can be translated to:

# --userns keep-id (translated)
$ podman run -it --rm --userns keep-id docker.io/library/alpine:edge cat /proc/self/uid_map
0           100000      1000   # root in the container maps to the 1st subordinate UID (100000), up to 100999
1000        1000        1      # 1000 in the container maps to the user in the host (1000)
1001        101000      64536  # 1001 in the container maps to the 1001st subordinate UID (101000), up to 165535

To make translation easier, one can check the UID mapping from the parent
namespace, where they'll get the proper values.
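
The translation above can also be done mechanically, by composing the container's mapping with the mapping of the intermediate (podman unshare) namespace. The sketch below assumes host UID 1000 and a single user:100000:65536 entry, as in the examples; `podman_unshare_map` and `compose` are illustrative names, not Podman APIs.

```python
from typing import List, Tuple

Entry = Tuple[int, int, int]  # (start inside, start outside, count)


def podman_unshare_map(host_uid: int, sub_ranges: List[Tuple[int, int]]) -> List[Entry]:
    """Intermediate ID 0 is the host user; IDs 1.. follow the subordinate ranges in order."""
    entries = [(0, host_uid, 1)]
    next_id = 1
    for start, count in sub_ranges:
        entries.append((next_id, start, count))
        next_id += count
    return entries


def compose(inner: List[Entry], outer: List[Entry]) -> List[Entry]:
    """Translate the second column of inner through outer, splitting ranges if needed."""
    result = []
    for inside, mid, count in inner:
        while count > 0:
            for o_in, o_out, o_count in outer:
                if o_in <= mid < o_in + o_count:
                    # Consume as much of this range as the outer entry covers.
                    step = min(count, o_in + o_count - mid)
                    result.append((inside, o_out + (mid - o_in), step))
                    break
            else:
                raise ValueError(f"intermediate ID {mid} is not mapped")
            inside, mid, count = inside + step, mid + step, count - step
    return result


keep_id = [(0, 1, 1000), (1000, 0, 1), (1001, 1001, 64536)]
outer = podman_unshare_map(1000, [(100000, 65536)])
for entry in compose(keep_id, outer):
    print(*entry)
# 0 100000 1000
# 1000 1000 1
# 1001 101000 64536
```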

In the above examples, we see that either the root or the user within the
container maps to the user outside the container (1000). We can circumvent this
with --uidmap 0:1:65536 --gidmap 0:1:65536, which maps the root of the
container to the 1st subordinate UID (e.g., 100000), and the rest of the UIDs follow
suit. Alternatively, users can pass --userns nomap, but it's only present in
recent versions.

Problems with insufficient UID/GID mappings will occur either when pulling an
OCI image, or when creating a copy of a layer when attempting to run a
container from an image.

apyrgio · 2022-11-09T15:42:25Z

Dangerzone and Linux User Namespaces

Now that we've seen how Linux User Namespaces work, and how Podman handles them, let's see how Dangerzone should handle them.

Requirements

We'll start with some requirements and how we can cover them for Dangerzone:

1. The user IDs within the Dangerzone container should not map to any user in the host

The reason is that we don't want a container escape to have any effect on the host. The escaped user should effectively be treated as nobody.

The best way to achieve this is to use --userns nomap. This will map all the UIDs in the container to the subordinate UIDs in the host (so root -> 100000, dangerzone -> 101000). This is not available in older Podman versions though, so we need to mimic what it does in our code.

Podman's implementation can be found here: https://github.com/containers/podman/blob/67c533b85a80fd40228bedbca89a61912ca8a9a5/pkg/util/utils.go#L404. Basically, what Podman does is:

Read /etc/sub{u,g}id and get the ID ranges (subordinate UID, count). Remember that there can be more than one line for the same user.
Iterate these ranges and create a mapping that starts with UID 0 in the container -> 1st subordinate UID in the host, until it reaches the max number of allowed subordinate UIDs.
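
A simplified reading of these two steps, expressed in host UIDs, might look like the following. It is a sketch of the idea only, not a port of Podman's code; the function name `nomap_style_mapping` and the optional `max_ids` cap are assumptions made for this example.

```python
from typing import List, Optional, Tuple


def nomap_style_mapping(
    sub_ranges: List[Tuple[int, int]], max_ids: Optional[int] = None
) -> List[Tuple[int, int, int]]:
    """Map container IDs 0, 1, 2, ... onto the subordinate ranges, in order.

    Returns (container start, host start, count) triples. The host user's own
    UID never appears in the result, which is the point of the exercise.
    """
    mapping = []
    container_id = 0
    for start, count in sub_ranges:
        if max_ids is not None:
            count = min(count, max_ids - container_id)
        if count <= 0:
            break
        mapping.append((container_id, start, count))
        container_id += count
    return mapping


print(nomap_style_mapping([(100000, 65536)]))
# [(0, 100000, 65536)]
print(nomap_style_mapping([(100000, 1000), (200000, 500)]))
# [(0, 100000, 1000), (1000, 200000, 500)]
```

With the single-range result, container UID 1000 (dangerzone) lands on host UID 101000, matching the numbers given above.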

2. The files/folders mounted to the Dangerzone container should be accessible by UID/GID 1000 (`dangerzone`) within this container

We will take advantage of two facts:

The root in a user namespace can act on behalf of every UID in that namespace.
podman unshare maps the root of the user namespace to the user in the host.

This way, we can chown directories to the dangerzone user in the container, without being root in the host.

Note that the containers and the folders that are used in each step are:

Pre-conversion step:
- A temporary dir that will hold the artifacts for the whole conversion (e.g., tmp/)
First container:
- File to get converted (e.g., ~/input_file)
- Directory that will hold the pixel data of the conversion (tmp/pixels/)
Second container:
- Directory that holds the pixel data of the previous conversion (tmp/pixels/)
- Directory that holds the final PDF (tmp/safe/)
Post-conversion step:
- Copy the converted file (tmp/safe/safe-output-compressed.pdf) to the destination that the user chose (e.g. ~/output_file)

Proposed Implementation

Create the temporary directory (e.g., tmp/) for the conversion process, and the necessary subdirectories, as usual.
Copy the file to be converted in the temporary directory.
Run podman unshare chown 1001:1001 tmp/*.
- This means that these files will be owned by the 1001st subordinate UID in the host.
- This UID will be UID 1000 in the actual container that will do the conversion process.
- From this point on, the user outside the container will not be able to affect the chown'ed files and dirs, unless they use podman unshare.
Get the number of subordinate UIDs using podman info.
- We must not read /etc/sub{u,g}id, because it may differ from the user namespace that Podman has already created (e.g., because the user changed it and forgot to run podman system migrate).
Run the rest of the Dangerzone containers with the following changes:
- Ditch --userns keep-id. We don't want this as it maps the user in the container to the user in the host.
- Use --uidmap 0:1:<num of sub UIDs> --gidmap 0:1:<num of sub GIDs>:
  - This means that root in the container will map to the 1st subordinate UID in the host, and dangerzone in the container will map to the 1001st subordinate UID in the host
- Mount the file to be converted in the container from the temporary directory (e.g., tmp/input_file), instead of its original path.
  - Also fixes Permission denied: container can't write to /dangerzone #157.
Copy the converted file to the destination that the user chose, as usual.
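
The flag construction in the steps above could be sketched as follows. Since the layout of podman info output is not shown in this thread, this sketch instead counts the subordinate IDs from uid_map text such as the podman unshare cat /proc/self/uid_map output shown earlier; that substitution, and the helper names `count_subordinate_ids` and `userns_args`, are assumptions of the example.

```python
from typing import List


def count_subordinate_ids(id_map_text: str) -> int:
    """Sum the sizes of all entries except the one that maps ID 0 (the host user)."""
    total = 0
    for line in id_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, _outside, count = (int(field) for field in fields)
        if inside != 0:
            total += count
    return total


def userns_args(num_sub_uids: int, num_sub_gids: int) -> List[str]:
    """Build the --uidmap/--gidmap arguments proposed above (container 0 -> 1st sub ID)."""
    return [
        "--uidmap", f"0:1:{num_sub_uids}",
        "--gidmap", f"0:1:{num_sub_gids}",
    ]


uid_map = "0       1000        1\n1       100000      65536\n"
n = count_subordinate_ids(uid_map)
print(userns_args(n, n))
# ['--uidmap', '0:1:65536', '--gidmap', '0:1:65536']
```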

Implementation Details

An interesting side-effect of user namespaces is that we can mount tmpfs within that user namespace, which is not possible for the regular user in the host. This means that we can run podman unshare mount -t tmpfs tmpfs tmp/ in Step 1 and ensure that the sensitive file will never be written to the disk, during the conversion process at least.

When we run our Dangerzone environments through dev_scripts/env.py, we use the Podman flag `--userns keep-id`. This option maps the UID in the host to the *same* UID in the container. This way, the container can access mounted files from the host. This works because the user within the container has UID 1000, and the user in the host *typically* has UID 1000 as well. This setup can break though if the user in the host has a different UID. For instance, the UID of the GitHub Actions user that runs our CI command is 1001.

To fix this, we need to always map the host user UID (whatever that is) to container UID 1000. We can achieve this with the following mapping:

1000:0:1         # Map container UID 1000 to subordinate UID 0
                 # (sub UID 0 = owner of the user ns = host user UID)
0:1:1000         # Map container UIDs 0-999 to subordinate UIDs 1-1000
1001:1001:64536  # Map container UIDs 1001-65536 to subordinate UIDs 1001-65536

Refs #228
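
The mapping in this commit message generalizes to any container UID. The sketch below rebuilds the three triples and turns them into repeated --uidmap flags; the helper names `host_user_as` and `as_podman_flags`, and the default of 65536 subordinate IDs, are assumptions for illustration and are not taken from dev_scripts/env.py.

```python
from typing import List


def host_user_as(container_uid: int, num_sub_ids: int = 65536) -> List[str]:
    """Build container:intermediate:count triples that pin the host user to container_uid."""
    if not 0 < container_uid <= num_sub_ids:
        raise ValueError("container_uid must be between 1 and num_sub_ids")
    mapping = [
        f"{container_uid}:0:1",   # intermediate ID 0 is the host user
        f"0:1:{container_uid}",   # container IDs below container_uid use the first sub IDs
    ]
    remaining = num_sub_ids - container_uid
    if remaining > 0:
        mapping.append(f"{container_uid + 1}:{container_uid + 1}:{remaining}")
    return mapping


def as_podman_flags(mapping: List[str], flag: str = "--uidmap") -> List[str]:
    """Repeat the flag once per triple."""
    args = []
    for triple in mapping:
        args += [flag, triple]
    return args


print(host_user_as(1000))
# ['1000:0:1', '0:1:1000', '1001:1001:64536']
print(as_podman_flags(host_user_as(1000)))
# ['--uidmap', '1000:0:1', '--uidmap', '0:1:1000', '--uidmap', '1001:1001:64536']
```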

apyrgio · 2024-06-10T16:59:46Z

We can close this issue once we merge #590, since gVisor will run rootless, and the host user will not be mapped to the inner container. As a bonus, we will remove the --userns keep-id flag from the outer container, and make sure to use --userns nomap on platforms that have Podman >= 4.1.

This wraps the existing container image inside a gVisor-based sandbox. gVisor is an open-source OCI-compliant container runtime. It is a userspace reimplementation of the Linux kernel in a memory-safe language. It works by creating a sandboxed environment in which regular Linux applications run, but their system calls are intercepted by gVisor. gVisor then redirects these system calls and reinterprets them in its own kernel. This means the host Linux kernel is isolated from the sandboxed application, thereby providing protection against Linux container escape attacks. It also uses `seccomp-bpf` to provide a secondary layer of defense against container escapes. Even if its userspace kernel gets compromised, attackers would have to additionally have a Linux container escape vector, and that exploit would have to fit within the restricted `seccomp-bpf` rules that gVisor adds on itself.

Fixes #126
Fixes #224
Fixes #225
Fixes #228


apyrgio changed the title ~~Defense in Depth - user namespaces~~ Defense in Depth - User Namespaces Oct 17, 2022

apyrgio added the security label Oct 17, 2022

apyrgio mentioned this issue Oct 17, 2022

Defense in Depth #221

Open

8 tasks

apyrgio added this to the 0.4.0 milestone Oct 19, 2022

eloquence modified the milestones: 0.4.0, 0.5.0 Nov 9, 2022

apyrgio modified the milestones: 0.5.0, 0.5.1 Jun 13, 2023

apyrgio mentioned this issue Feb 29, 2024

Sandbox all document processing in gVisor #590

Merged

apyrgio closed this as completed in f03bc71 Jun 12, 2024
