Binary execution across Linux mount-namespaces
fxe is a small, pure-Rust Linux program that demonstrates how to execute binaries across mount-namespaces.
This technique suits several use cases: it allows shipping minimal containers with specialized binaries and then running those binaries in namespaces where they are not available.
For example, a bare-minimal ContainerLinux OS can be augmented with a mount-foo container to mount foo volumes directly on the host.
This program is provided for illustrative purposes only; it is not meant to be run as-is in production.
As the name suggests, the core functionality of fxe is built around fexecve(3). Its manpage gives this short description:
fexecve() performs the same task as execve(), with the difference that the file to be executed is specified
via a file descriptor rather than via a pathname.
This allows fxe to get a handle to a binary available inside its container (i.e. mount-namespace), move to a different target namespace, and execute the binary there.
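The three steps above (open a handle, switch namespace, execute through the handle) can be sketched as follows. This is a minimal, dependency-free illustration and not the code shipped in this repository: the function names `exec_across`, `to_cstrings` and `to_ptr_array` are hypothetical, the libc bindings are declared by hand, and the environment passed to the new program is assumed to be empty.

```rust
use std::ffi::CString;
use std::fs::File;
use std::io;
use std::os::raw::{c_char, c_int};
use std::os::unix::io::AsRawFd;
use std::ptr;

// Mount-namespace flag for setns(2), value taken from <sched.h>.
const CLONE_NEWNS: c_int = 0x0002_0000;

// Hand-written libc bindings, to keep the sketch free of external crates.
// (On the Rust 2024 edition this block must be written `unsafe extern "C"`.)
extern "C" {
    fn setns(fd: c_int, nstype: c_int) -> c_int;
    fn fexecve(fd: c_int, argv: *const *const c_char, envp: *const *const c_char) -> c_int;
}

/// Converts arguments into NUL-terminated C strings (hypothetical helper).
fn to_cstrings(args: &[&str]) -> io::Result<Vec<CString>> {
    args.iter()
        .map(|a| CString::new(*a).map_err(|e| io::Error::new(io::ErrorKind::InvalidInput, e)))
        .collect()
}

/// Builds a NULL-terminated pointer array borrowing from `strings`.
fn to_ptr_array(strings: &[CString]) -> Vec<*const c_char> {
    strings
        .iter()
        .map(|s| s.as_ptr())
        .chain(std::iter::once(ptr::null()))
        .collect()
}

/// Opens `binary` in the current namespace, joins the mount namespace behind
/// `target_ns`, then executes the binary via its descriptor.
/// Returns only on failure, with the error that stopped the sequence.
fn exec_across(binary: &str, target_ns: &str, args: &[&str]) -> io::Error {
    // 1. Grab both handles while the paths are still reachable.
    //    File::open sets close-on-exec, which is fine for a binary but
    //    makes fexecve(3) fail for interpreter scripts.
    let bin = match File::open(binary) {
        Ok(f) => f,
        Err(e) => return e,
    };
    let ns = match File::open(target_ns) {
        Ok(f) => f,
        Err(e) => return e,
    };
    let argv = match to_cstrings(args) {
        Ok(v) => v,
        Err(e) => return e,
    };
    let argv_ptrs = to_ptr_array(&argv);
    // Assumption: the executed program does not need any environment.
    let envp_ptrs: Vec<*const c_char> = vec![ptr::null()];

    // 2. Switch mount namespace (the process must be single-threaded).
    if unsafe { setns(ns.as_raw_fd(), CLONE_NEWNS) } != 0 {
        return io::Error::last_os_error();
    }

    // 3. Execute through the descriptor; on success this never returns.
    unsafe { fexecve(bin.as_raw_fd(), argv_ptrs.as_ptr(), envp_ptrs.as_ptr()) };
    io::Error::last_os_error()
}
```

A call such as `exec_across("/bin/busybox", "/proc/1/ns/mnt", &["modinfo", "crc16"])` would mirror the demo, assuming the busybox binary lives at that path inside the container.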
This repository contains a demo program which runs modinfo crc16 using the busybox container.
The directory containing kernel modules is not available inside the container, so the process changes its mount-namespace to the target one (e.g. the host) and runs the modinfo binary there.
A pre-built binary is available as a Docker image at quay.io/lucab/fxe.
To try it, simply run make run:
$ make run
docker run --privileged --pid=host quay.io/lucab/fxe:latest /fxe /proc/1/ns/mnt
filename: /lib/modules/4.11.0-1-amd64/kernel/lib/crc16.ko
description: CRC16 calculations
license: GPL
depends:
intree: Y
vermagic: 4.11.0-1-amd64 SMP mod_unload modversions
This uses /proc/1/ns/mnt as the host mount-namespace target. Other targets can be used, as long as they are bind-mounted inside the container.
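When experimenting with other targets, it can help to check whether a given handle really points at a different mount namespace than the current one. The sketch below relies on the fact (documented in namespaces(7)) that two handles refer to the same namespace when their device ID and inode number match. The helper names `namespace_id` and `is_foreign_mount_ns` are hypothetical and not part of fxe.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Identity of a namespace as the (device ID, inode number) of its handle.
fn namespace_id<P: AsRef<Path>>(handle: P) -> io::Result<(u64, u64)> {
    // metadata() follows the /proc/<pid>/ns/* link, and it also works on a
    // namespace file that has been bind-mounted somewhere else.
    let meta = fs::metadata(handle)?;
    Ok((meta.dev(), meta.ino()))
}

/// Returns true when `target` refers to a mount namespace other than the
/// one the calling process currently lives in.
fn is_foreign_mount_ns<P: AsRef<Path>>(target: P) -> io::Result<bool> {
    Ok(namespace_id("/proc/self/ns/mnt")? != namespace_id(target)?)
}
```

For instance, `is_foreign_mount_ns("/proc/1/ns/mnt")` would be expected to return true inside a container started with --pid=host, since PID 1 is then the host's init process.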
The --privileged flag is a shortcut to add CAP_SYS_ADMIN and CAP_SYS_CHROOT (required by setns(2)) and to prevent the default seccomp filter from blocking the call. Both can be allowed with finer-grained settings (this is left as an exercise).
The --pid=host flag is required for proper fexecve() execution. It can be changed to any arbitrary target; here it is set to host only for demonstration purposes.
Due to how setns(2) and fexecve(3) are implemented on Linux, there are some conditions imposed on the running environment:
- setns: CAP_SYS_ADMIN and CAP_SYS_CHROOT are required
- setns: the target mount-namespace must be available as a file descriptor
- setns: to be allowed to change mount-namespace, the process must be single-threaded
- fexecve: /proc must be available
- fexecve: source and target processes must be running in the same PID-namespace
- fexecve: resources required by scripts and dynamically linked binaries must be available in the target
See notes in both manpages for further details and explanations.
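Some of these conditions can be checked up front by reading /proc/self/status. The following is a rough sketch under stated assumptions rather than part of fxe: the helper names are hypothetical, only the effective capability set is inspected, and the PID-namespace and target-resource conditions are not covered. The capability bit numbers (18 for CAP_SYS_CHROOT, 21 for CAP_SYS_ADMIN) come from the kernel's capability header.

```rust
use std::fs;
use std::io;

// Capability bit numbers from <linux/capability.h>.
const CAP_SYS_CHROOT: u32 = 18;
const CAP_SYS_ADMIN: u32 = 21;

/// Extracts the value of `key` (e.g. "Threads") from /proc/<pid>/status text.
fn status_field<'a>(status: &'a str, key: &str) -> Option<&'a str> {
    status.lines().find_map(|line| {
        let (k, v) = line.split_once(':')?;
        if k == key { Some(v.trim()) } else { None }
    })
}

/// Number of threads in the process, from the "Threads" field.
fn thread_count(status: &str) -> Option<u32> {
    status_field(status, "Threads")?.parse().ok()
}

/// Whether capability bit `cap` is set in the effective set ("CapEff").
fn has_capability(status: &str, cap: u32) -> Option<bool> {
    let mask = u64::from_str_radix(status_field(status, "CapEff")?, 16).ok()?;
    Some(mask & (1u64 << cap) != 0)
}

/// Lists the setns-related conditions that `status` does not satisfy.
fn check_status(status: &str) -> Vec<String> {
    let mut problems = Vec::new();
    if thread_count(status) != Some(1) {
        problems.push("process is not single-threaded".to_string());
    }
    let required = [("CAP_SYS_ADMIN", CAP_SYS_ADMIN), ("CAP_SYS_CHROOT", CAP_SYS_CHROOT)];
    for &(name, bit) in required.iter() {
        if has_capability(status, bit) != Some(true) {
            problems.push(format!("{} is not in the effective set", name));
        }
    }
    problems
}

/// Runs the checks on the current process; an Err here usually means that
/// /proc is not available, which would also break fexecve.
fn preflight() -> io::Result<Vec<String>> {
    let status = fs::read_to_string("/proc/self/status")?;
    Ok(check_status(&status))
}
```

Calling `preflight()` before attempting setns gives a readable list of unmet conditions instead of a bare EPERM or EINVAL from the syscall.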
The demo in this repository can be quickly built via make.
Pre-requisites are:
- make
- a stable rustc/cargo toolchain for the x86_64-unknown-linux-musl target (available via rustup)
- docker run available to the current user
This currently depends on a pending PR to nix.