RunCVM (Run Container Virtual Machine) is an experimental open-source Docker container runtime for Linux, created by Struan Bartlett at NewsNow Labs, that makes launching standard containerised workloads and system workloads (e.g. Systemd, Docker, even OpenWrt) in VMs as easy as launching a container.
Install:
curl -s -o - https://raw.githubusercontent.com/newsnowlabs/runcvm/main/runcvm-scripts/runcvm-install-runtime.sh | sudo sh
Now launch an nginx VM listening on port 8080:
docker run --runtime=runcvm --name nginx1 --rm -p 8080:80 nginx
Launch a MariaDB VM, with 2 cpus and 2G memory, listening on port 3306:
docker run --runtime=runcvm --name mariadb1 --rm -p 3306:3306 --cpus 2 --memory 2G --env=MARIADB_ALLOW_EMPTY_ROOT_PASSWORD=1 mariadb
Launch a vanilla ubuntu VM, with interactive terminal:
docker run --runtime=runcvm --name ubuntu1 --rm -it ubuntu
Gain another interactive console on ubuntu1
:
docker exec -it ubuntu1 bash
Launch a VM with 1G memory and a 1G ext4-formatted backing file mounted at /var/lib/docker
and stored in the underlying container's filesystem:
docker run -it --runtime=runcvm --memory=1G --env=RUNCVM_DISKS=/disks/docker,/var/lib/docker,ext4,1G <docker-image>
Launch a VM with 2G memory and a 5G ext4-formatted backing file mounted at /var/lib/docker
and stored in a dedicated Docker volume on the host:
docker run -it --runtime=runcvm --memory=2G --mount=type=volume,src=runcvm-disks,dst=/disks --env=RUNCVM_DISKS=/disks/docker,/var/lib/docker,ext4,5G <docker-image>
Launch a 3-node Docker Swarm on a network with 9000 MTU and, on the swarm, an http global service:
git clone https://github.com/newsnowlabs/runcvm.git && \
cd runcvm/tests/00-http-docker-swarm && \
NODES=3 MTU=9000 ./test
Docker+Sysbox runtime demo - Launch Ubuntu running Systemd and Docker with Sysbox runtime; then within it run an Alpine Sysbox container; and, within that install dockerd and run a container from the 'hello-world' image:
cat <<EOF | docker build --tag=ubuntu-docker-sysbox -
FROM ubuntu:jammy
RUN apt update && apt -y install apt-utils kmod wget iproute2 systemd \
ca-certificates curl gnupg udev dbus && \
curl -fsSL https://get.docker.com | bash
RUN wget -O /tmp/sysbox.deb \
https://downloads.nestybox.com/sysbox/releases/v0.6.2/sysbox-ce_0.6.2-0.linux_amd64.deb && \
apt -y install /tmp/sysbox.deb
ENTRYPOINT ["/lib/systemd/systemd"]
ENV RUNCVM_DISKS='/disks/docker,/var/lib/docker,ext4,1G;/disks/sysbox,/var/lib/sysbox,ext4,1G'
VOLUME /disks
EOF
docker run -d --runtime=runcvm -m 2g --name=ubuntu-docker-sysbox ubuntu-docker-sysbox
docker exec ubuntu-docker-sysbox bash -c "docker run --rm --runtime=sysbox-runc alpine ash -x -c 'apk add docker; dockerd &>/dev/null & sleep 5; docker run --rm hello-world'"
docker rm -fv ubuntu-docker-sysbox
Nested RunCVM demo - Launch Ubuntu running Systemd and Docker with RunCVM runtime installed; then within it run an Alpine RunCVM Container/VM; and, within that install dockerd and, within that, run a container from the 'hello-world' image:
cat <<EOF | docker build --tag=ubuntu-docker-runcvm -
FROM ubuntu:jammy
RUN apt update && apt -y install apt-utils kmod wget iproute2 systemd \
ca-certificates curl gnupg udev dbus && \
curl -fsSL https://get.docker.com | bash
COPY --from=newsnowlabs/runcvm:latest /opt /opt/
RUN rm -f /etc/init.d/docker && \
bash /opt/runcvm/scripts/runcvm-install-runtime.sh --no-dockerd && \
echo kvm_intel >>/etc/modules
ENTRYPOINT ["/lib/systemd/systemd"]
ENV RUNCVM_DISKS='/disks/docker,/var/lib/docker,ext4,1G'
VOLUME /disks
EOF
docker run -d --runtime=runcvm -m 2g --name=ubuntu-docker-runcvm ubuntu-docker-runcvm
docker exec ubuntu-docker-runcvm bash -c "docker run --rm --runtime=runcvm alpine ash -x -c 'apk add docker; dockerd &>/dev/null & sleep 5; docker run --rm hello-world'"
docker rm -fv ubuntu-docker-runcvm
Docker+GVisor runtime demo - Launch Ubuntu running Systemd and Docker with GVisor runtime; then within it run the 'hello-world' image in a GVisor container:
cat <<EOF | docker build --tag=ubuntu-docker-gvisor -
FROM ubuntu:jammy
RUN apt update && apt -y install apt-utils kmod wget iproute2 systemd \
ca-certificates curl gnupg udev dbus jq && \
curl -fsSL https://get.docker.com | bash
RUN curl -fsSL https://gvisor.dev/archive.key | gpg --dearmor -o /usr/share/keyrings/gvisor-archive-keyring.gpg
RUN echo "deb [arch=amd64 signed-by=/usr/share/keyrings/gvisor-archive-keyring.gpg] https://storage.googleapis.com/gvisor/releases release main" >/etc/apt/sources.list.d/gvisor.list && \
apt update && \
apt-get install -y runsc
RUN [ ! -f /etc/docker/daemon.json ] && echo '{}' > /etc/docker/daemon.json; cat /etc/docker/daemon.json | jq '.runtimes.runsc.path="/usr/bin/runsc"' | tee /etc/docker/daemon.json
ENTRYPOINT ["/lib/systemd/systemd"]
ENV RUNCVM_DISKS='/disks/docker,/var/lib/docker,ext4,1G'
VOLUME /disks
EOF
docker run -d --runtime=runsc -m 2g --name=ubuntu-docker-gvisor ubuntu-docker-gvisor
docker exec ubuntu-docker-gvisor bash -c "docker run --rm --runtime=runsc hello-world"
docker rm -fv ubuntu-docker-gvisor
Launch OpenWrt - with port forward to LuCI web UI on port 10080:
docker import --change='ENTRYPOINT ["/sbin/init"]' https://archive.openwrt.org/releases/23.05.2/targets/x86/generic/openwrt-23.05.2-x86-generic-rootfs.tar.gz openwrt-23.05.2 && \
docker network create --subnet 172.128.0.0/24 runcvm-openwrt && \
echo -e "config interface 'loopback'\n\toption device 'lo'\n\toption proto 'static'\n\toption ipaddr '127.0.0.1'\n\toption netmask '255.0.0.0'\n\nconfig device\n\toption name 'br-lan'\n\toption type 'bridge'\n\tlist ports 'eth0'\n\nconfig interface 'lan'\n\toption device 'br-lan'\n\toption proto 'static'\n\toption ipaddr '172.128.0.5'\n\toption netmask '255.255.255.0'\n\toption gateway '172.128.0.1'\n" >/tmp/runcvm-openwrt-network && \
docker run -it --rm --runtime=runcvm --name=openwrt --network=runcvm-openwrt --ip=172.128.0.5 -v /tmp/runcvm-openwrt-network:/etc/config/network -p 10080:80 openwrt-23.05.2
RunCVM was born out of difficulties experienced using the Docker and Podman CLIs to launch Kata Containers v2, and a belief that launching containerised workloads in VMs using Docker needn't be so complicated.
Motivations included: efforts to re-add OCI CLI commands for docker/podman to Kata v2 to support Docker & Podman; other Kata issues #3358, #1123, #1133, #3038; #5321; #6861; Podman issues #8579 and #17070; and Kubernetes issue #40114; though please note, since authoring RunCVM some of these issues may have been resolved.
Like Kata, RunCVM aims to be a secure container runtime with lightweight virtual machines that feel and perform like containers, but provide stronger workload isolation using hardware virtualisation technology.
However, while Kata aims to launch standard container images inside a restricted-privileges namespace inside a VM running a single fixed and heavily customised kernel and Linux distribution optimised for this purpose, RunCVM intentionally aims to launch container or VM images as the VM's root filesystem using stock or bespoke Linux kernels, the upshot being RunCVM's can run VM workloads that Kata's security and kernel model would explicitly prevent.
For example:
- RunCVM can launch system images expecting to interface directly with hardware, like OpenWRT
- RunCVM can launch VMs nested inside a RunCVM VM - i.e. an 'inner' RunCVM Container/VM guest can be launched by Docker running within an 'outer' RunCVM Container/VM guest (assuming the host supports nested VMs) - in this sense, RunCVM is 'reentrant'.
RunCVM features:
- Compatible with
docker run
(with experimental support forpodman run
). - Uses a lightweight 'wrapper-runtime' technology that subverts the behaviour of the standard container runtime
runc
to cause a VM to be launched within the container (making its code footprint and external dependencies extremely small, and its internals extremely simple and easy to understand and tailor for specific purposes). - Highly portable among Linux distributions and development platforms providing KVM. Can even be installed on GitHub Codespaces!
- Written, using off-the-shelf open-source components, almost entirely in shell script for simplicity, portability and ease of development.
RunCVM makes some trade-offs in return for this simplicity. See the full list of features and limitations.
- Introduction
- Quick start
- Motivation
- Licence
- Project aims
- Project ambitions
- Applications for RunCVM
- How RunCVM works
- System requirements
- Installation
- Upgrading
- Features and Limitations
- RunCVM vs Kata comparison
- Kernel selection
- Option reference
- Advanced usage
- Developing
- Building
- Testing
- Contributing
- Support
- Uninstallation
- Legals
RunCVM is free and open-source, licensed under the Apache Licence, Version 2.0. See the LICENSE file for details.
- Run any standard container workload in a VM using
docker run
with no need to customise images or the command line (except adding--runtime=runcvm
) - Run unusual container workloads, like
dockerd
andsystemd
that will not run in standard container runtimes - Maintain a similar experience within a RunCVM VM as within a container: process table, network interfaces, stdio, exit code handling should be broadly similar to maximise compatibility
- Container start/stop/kill semantics respected, where possible providing clean VM shutdown on stop
- VM console accessible as one would expect using
docker run -it
,docker start -ai
anddocker attach
(and so on), generally good support for otherdocker container
subcommands - Efficient container startup, by using virtiofs to serve a container's filesystem directly to a VM (instead of unpacking an image into a backing file)
- Improved security compared to the standard container runtime, and as much security as possible without compromising the simplicity of the implementation
- Command-line and image-embedded options for customising the a container's VM specifications, devices, kernel
- Intelligent kernel selection, according to the distribution used in the image being launched
- No external dependencies, except for Docker/Podman and relevant Linux kernel modules (
kvm
andtun
) - Support multiple Docker network interfaces attached to a created (but not yet running) container using
docker run --network=<network>
anddocker network connect
(excluding IPv6)
- Support for booting VM with a file-backed disk root fs generated from the container image, instead of only virtiofs root
- Support running foreign-architecture VMs by using QEMU dynamic CPU emulation for the entire VM (instead of the approach used by https://github.com/multiarch/qemu-user-static which uses dynamic CPU emulation for each individual binary)
- Support for QEMU microvm or potentially Amazon Firecracker
- More natural console support with independent stdout and stderr channels for
docker run -it
- Improve VM boot time and other behaviours using custom kernel
- Support for specific hardware e.g. graphics display served via VNC
The main applications for RunCVM are:
- Running and testing applications that:
- don't work with (or require enhanced privileges to work with) standard container runtimes (e.g.
systemd
,dockerd
, Docker swarm services, Kubernetes) - require a running kernel, or a kernel version or modules not available on the host
- require specific hardware that can be emulated e.g. disks, graphics displays
- don't work with (or require enhanced privileges to work with) standard container runtimes (e.g.
- Running existing container workloads with increased security
- Testing container workloads that are already intended to launch in VM environments, such as on fly.io
- Developing any of the above applications in Dockside (see RunCVM and Dockside)
RunCVM's 'wrapper' runtime, runcvm-runtime
, receives container create commands triggered by docker
run
/create
commands, modifies the configuration of the requested container in such a way that the created container will launch a VM that boots from the container's filesystem, and then passes the request on to the standard container runtime (runc
) to actually create and start the container.
For a deep dive into RunCVM's internals, see the section on Developing RunCVM.
RunCVM should run on any amd64 (x86_64) hardware (or VM) running Linux Kernel >= 5.10, and that supports KVM and Docker. So if your host can already run KVM VMs and Docker then it should run RunCVM.
RunCVM has no other host dependencies, apart from Docker (or experimentally, Podman) and the kvm
and tun
kernel modules. RunCVM comes packaged with all binaries and libraries it needs to run (including its own QEMU binary).
RunCVM is tested on Debian Bullseye and GitHub Codespaces.
For RunCVM to support Docker DNS within Container/VMs, the following condition on /proc/sys/net/ipv4/conf/
must be met:
- the max of
all/rp_filter
and<bridge>/rp_filter
should be 0 ('No Source Validation') or 2 (Loose mode as defined in RFC3704 Loose Reverse Path) (where<bridge>
is any bridge underpinning a Docker network to which RunCVM Container/VMs will be attached)
This means that:
- if
all/rp_filter
will be set to 0, then<bridge>/rp_filter
must be set to 0 or 2 (or, if<bridge>
is not yet or might not yet have been created, thendefault/rp_filter
must be set to 0 or 2) - if
all/rp_filter
will be set to 1, then<bridge>/rp_filter
must be set to 2 (or, if<bridge>
is not yet or might not yet have been created, thendefault/rp_filter
must be set to 2) - if
all/rp_filter
will be set to 2, then no further action is needed
At time of writing:
- the Debian default is
0
; - the Ubuntu default is
2
; - the Google Cloud Debian image has default
1
andrp_filter
settings in/etc/sysctl.d/60-gce-network-security.conf
must be modified or overridden to support RunCVM.
We recommend all/rp_filter
be set to 2, as this is the simplest change and provides a good balance of security.
Run:
curl -s -o - https://raw.githubusercontent.com/newsnowlabs/runcvm/main/runcvm-scripts/runcvm-install-runtime.sh | sudo sh
This will:
- Install the RunCVM software package to
/opt/runcvm
(installation elsewhere is currently unsupported) - For Docker support:
- Enable the RunCVM runtime, by patching
/etc/docker/daemon.json
to addruncvm
to theruntimes
property - Restart
dockerd
, if it can be detected how for your system (e.g.systemctl restart docker
) - Verify that RunCVM is recognised via
docker info
- Enable the RunCVM runtime, by patching
- For Podman support (experimental)
- Display instructions on patching
/etc/containers/containers.conf
- Display instructions on patching
- Check your system network device
rp_filter
settings, and amend them if necessary
Following installation, launch a basic test RunCVM Container/VM:
docker run --runtime=runcvm --rm -it hello-world
Create an image that will allow instances to have VMX capability:
gcloud compute images create debian-12-vmx --source-image-project=debian-cloud --source-image-family=debian-12 --licenses="https://compute.googleapis.com/compute/v1/projects/vm-options/global/licenses/enable-vmx"
Now launch a VM, install Docker and RunCVM:
cat >/tmp/startup-script.sh <<EOF
#!/bin/bash
apt update && apt -y install apt-utils kmod wget iproute2 systemd \
ca-certificates curl gnupg udev dbus jq && \
mkdir -p /etc/docker && echo '{"userland-proxy": false}' >/etc/docker/daemon.json && \
curl -fsSL https://get.docker.com | bash && \
curl -s -o - https://raw.githubusercontent.com/newsnowlabs/runcvm/main/runcvm-scripts/runcvm-install-runtime.sh | sudo REPO=newsnowlabs/runcvm:latest sh
EOF
gcloud compute instances create runcvm-vmx-test --zone=us-central1-a --machine-type=n2-highmem-2 --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default --metadata-from-file=startup-script=/tmp/startup-script.sh --no-restart-on-failure --maintenance-policy=TERMINATE --provisioning-model=SPOT --instance-termination-action=STOP --no-service-account --no-scopes --create-disk=auto-delete=yes,boot=yes,image=debian-12-vmx,mode=rw,size=50,type=pd-ssd --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any
To upgrade, follow this procedure:
- Stop all RunCVM containers.
- Run
/opt/runcvm/scripts/runcvm-install-runtime.sh
(or rerun the installation command - which runs the same script) - Start any RunCVM containers.
In the below summary of RunCVM's current main features and limitations, [+] is used to indicate an area of compatibility with standard container runtimes and [-] is used indicate a feature of standard container runtimes that is unsupported.
N.B.
docker run
anddocker exec
options not listed below are unsupported and their effect, if used, is unspecified.
docker run
- Mounts
- [+]
--mount
(or-v
) is supported for volume mounts, tmpfs mounts, and host file and directory bind-mounts (thedst
mount path/disks
is reserved) - [-] Bind-mounting host sockets or devices, and
--device
is unsupported
- [+]
- Networking
- [+] The default bridge network is supported
- [+] Custom/user-defined networks specified using
--network
are supported, including Docker DNS resolution of container names and respect for custom network MTU - [+] Multiple network interfaces - when attached via
docker run --network
ordocker network connect
(but only to a created and not yet running container) - are supported (includingscope=overlay
networks and those with multiple subnets) - [+]
--publish
(or-p
) is supported - [+]
--dns
,--dns-option
,--dns-search
are supported - [+]
--ip
is supported - [+]
--hostname
(or-h
) is supported - [-]
docker network connect
on a running container is not supported - [-]
--network=host
and--network=container:name|id
are not supported - [-] IPv6 is not supported
- Execution environment
- [+]
--user
(or-u
) is supported - [?]
--workdir
(or-w
) is supported - [+]
--env
(or-e
),--env-file
is supported - [+]
--entrypoint
is supported - [+]
--init
- is supported (but runs RunCVM's own VM init process rather than Docker's default,tini
)
- [+]
- stdio/Terminals
- [+]
--detach
(or-d
) is supported - [+]
--interactive
(or-i
) is supported - [+]
--tty
(or-t
) is supported (but to enter CTRL-T one must press CTRL-T twice) - [+]
--attach
(or-a
) is supported - [+] Stdout and Stderr output should be broadly similar to running the same workload in a standard
runc
container - [-] Stdout and Stderr are not independently multiplexed so
docker run --runtime=runcvm debian bash -c 'echo stdout; echo stderr >&2' >/tmp/stdout 2>/tmp/stderr
does not produce the expected result - [-] Stdout and Stderr sent very soon after VM launch might be corrupted due to serial console issues
- [-] Stdout and Stderr sent immediately before VM shutdown might not always be fully flushed
- [+]
- Resource allocation and limits
- [+]
--cpus
is supported to specify number of VM CPUs - [+]
--memory
(or-m
) is supported to specify VM memory - [-] Other container resource limit options such as (
--cpu-*
), block IO (--blkio-*
), kernel memory (--kernel-memory
) are unsupported or untested
- [+]
- Exit code
- [+] Returning the entrypoint's exit code is supported, but it currently requires application support
- [-] To return an exit code, your entrypoint may either write its exit code to
/.runcvm/exit-code
(supported exit codes 0-255) or call/opt/runcvm/sbin/qemu-exit <code>
(supported exit codes 0-127). Automatic handling of exit codes from the entrypoint will be provided in a later version.
- Disk performance
- [+] No mountpoints are required for basic operation for most applications. Volume or disk mountpoints may be needed for running
dockerd
or to improve disk performance - [-]
dockerd
mileage will vary unless a volume or disk is mounted over/var/lib/docker
- [+] No mountpoints are required for basic operation for most applications. Volume or disk mountpoints may be needed for running
- Mounts
docker exec
- [+]
--user
(or-u
),--workdir
(or-w
),--env
(or-e
),--env-file
,--detach
(or-d
),--interactive
(or-i
) and--tty
(or-t
) are all supported - [+] Stdout and Stderr are independently multiplexed so
docker exec <container> bash -c 'echo stdout; echo stderr >&2' >/tmp/stdout 2>/tmp/stderr
does produce the expected result
- [+]
- Security
- The RunCVM software package at
/opt/runcvm
is mounted read-only within RunCVM containers. Container applications cannot compromise RunCVM, but they can execute binaries from within the RunCVM package. The set of binaries available to the VM may be reduced to a minimum in a later version.
- The RunCVM software package at
- Kernels
- [+] Use any kernel, either one pre-packaged with RunCVM or roll your own
- [+] RunCVM will try to select an appropriate kernel to use based on examination of
/etc/os-release
within the image being launched.
This table provides a high-level comparison of RunCVM and Kata across various features like kernels, networking/DNS, memory allocation, namespace handling, method of operation, and performance characteristics:
Feature | RunCVM | Kata |
---|---|---|
Methodology | Boots VM from distribution kernels with container's filesystem directly mounted as root filesystem, using virtiofs. VM setup code and kernel modules are bind-mounted into the container. VM's PID1 runs setup code to reproduce the container's networking environment within the VM before executing the container's original entrypoint. | Boots VM from custom kernel with custom root disk image, mounts the virtiofsd-shared host container filesystem to a target folder and executes the container's entrypoint within a restricted namespace having chrooted to that folder. |
Privileges/restrictions | Container code has full root access to VM and its devices. It may run anything that runs in a VM, mounting filesystems, installing kernel modules, accessing devices. RunCVM helper processes are visible to ps etc. |
Runs container code inside a VM namespace with restricted privileges. Use of mounts, kernel modules is restricted. Kata helper processes (like kata-agent and chronyd) are invisible to ps . |
Kernels | Launches stock Alpine, Debian, Ubuntu kernels. Kernel /lib/modules automatically mounted within VM. Install any needed modules without host reconfiguration. |
Launches custom kernels. Kernel modules aren't mounted and need host reconfiguration to be installed. |
Networking/DNS | Docker container networking + internal/external DNS out-of-the-box. No support for docker network connect/disconnect |
DNS issues presented: with custom network, external ping works, but DNS lookups fail both for internal docker hosts and external hosts.1 |
Memory | VM assigned and reports total memory as per --memory <mem> |
VM total memory reported by free appears unrelated to --memory <mem> specified 2 |
CPUs | VM assigned and reports CPUs as per --cpus <cpus> |
CPUs must be hardcoded in Kata host config |
Performance | Custom kernel optimisations may deliver improved startup (~3.2s) or operational performance (~15%) | |
virtiofsd | Runs virtiofsd in container namespace |
Unknown |
When creating a container, RunCVM will examine the image being launched to try to determine a suitable kernel to boot the VM with. Its process is as follows:
- If
--env=RUNCVM_KERNEL=<dist>[/<version>]
specified, use the indicated kernel - Otherwise, identify distro from
/etc/os-release
- If one is found in the appropriate distro-specific location in the image, select an in-image kernel. The locations are:
- Debian:
/vmlinuz
and/initrd.img
- Ubuntu:
/boot/vmlinuz
and/boot/initrd.img
- Alpine:
/boot/vmlinuz-virt
/boot/initramfs-virt
- Debian:
- Otherwise, if found in the RunCVM package, select the latest kernel compatible with the distro
- Finally, use the Debian kernel from the RunCVM package
- If one is found in the appropriate distro-specific location in the image, select an in-image kernel. The locations are:
RunCVM options are specified either via standard docker run
options or via --env=<RUNCVM_KEY>=<VALUE>
options on the docker run
command line. The following env options are user-configurable:
Specify with which RunCVM kernel (from /opt/runcvm/kernels
) to boot the VM. Values must be of the form <dist>/<version>
, where <dist>
is a directory under /opt/runcvm/kernels
and <version>
is a subdirectory (or symlink to a subdirectory) under that. If <version>
is omitted, latest
will be assumed. Here is an example command that will list available values of <dist>/<version>
on your installation.
$ find /opt/runcvm/kernels/ -maxdepth 2 | sed 's!^/opt/runcvm/kernels/!!; /^$/d'
debian
debian/latest
debian/5.10.0-16-amd64
alpine
alpine/latest
alpine/5.15.59-0-virt
ubuntu
ubuntu/latest
ubuntu/5.15.0-43-generic
ol
ol/5.14.0-70.22.1.0.1.el9_0.x86_64
ol/latest
Example:
docker run --rm --runtime=runcvm --env=RUNCVM_KERNEL=ol hello-world
Any custom kernel command line options e.g. apparmor=0
or systemd.unified_cgroup_hierarchy=0
.
Automatically create, format, prepopulate and mount backing files as virtual disks on the VM.
Each <diskN>
should be a comma-separated list of values of the form: <src>,<dst>,<filesystem>[,<size>]
.
<src>
is the path within the container where the virtual disk backing file should be located. This may be in the container's overlayfs or within a volume (mounted using--mount=type=volume
).<dst>
is both (a) the path within the VM where the virtual disk should be mounted; and (b) the location of the directory with which contents the disk should be prepopulated.<filesystem>
is the filesystem with which the backing disk should be formatted when first created.<size>
is the size of the backing file (intruncate
format), and must be specified if<src>
does not exist.
When first created, the backing file will be created as a sparse file to the specified <size>
and formatted with the specified <filesystem>
using mke2fs
and prepopulated with any files preexisting at <dst>
.
When RunCVM creates a Container/VM, fstab entries will be drafted. After the VM boots, the fstab entries will be mounted. Typically, the first disk will be mounted as /dev/vda
, the second as /dev/vdb
, and so on.
docker run -it --runtime=runcvm --env=RUNCVM_DISKS=/disk1,/home,ext4,5G <docker-image>
In this example, RunCVM will check for existence of a file at /disk1
within <docker-image>
, and if not found create a 5G backing file (in the container's filesystem, typically overlay2) with an ext4 filesystem prepopulated with any preexisting contents of /home
, then add the disk to /etc/fstab
and mount it within the VM at /home
.
docker run -it --runtime=runcvm --mount=type=volume,src=runcvm-disks,dst=/disks --env='RUNCVM_DISKS=/disks/disk1,/home,ext4,5G;/disks/disk2,/opt,ext4,2G' <docker-image>
This example behaves similarly, except that the runcvm-disks
persistent Docker volume is first mounted at /disks
within the container's filesystem, and therefore the backing files at /disks/disk1
and /disks/disk2
(mounted in the VM at /home
and /opt
respectively) are stored in the persistent volume (typically stored in /var/lib/docker
on the host, bypassing overlay2).
N.B.
/disks
and any paths below it are reserved mountpoints. Unlike other mountpoints, these are NOT mounted into the VM but only into the container, and are therefore suitable for use for mounting VM disks from bscking files that cannot be accessed within the VM's filesystem.
Select a specific QEMU display. Currently only curses
is supported, but others may trivially be added by customising the build.
By default, virtiofsd
is not launched with -o modcaps=+sys_admin
(and containers are not granted CAP_SYS_ADMIN
). Use this option if you need to change this.
If a RunCVM kernel (as opposed to an in-image kernel) is chosen to launch a VM, by default that kernel's modules will be mounted at /lib/modules/<version>
in the VM. If this variables is set, that kernel's modules will instead be mounted over /lib/modules
.
Enable kernel logging (sets kernel console=ttyS0
).
By default BIOS console output is hidden. Enable it with this option.
Enable debug logging for the runtime (the portion of RunCVM directly invoked by docker run
, docker exec
etc).
Debug logs are written to files in /tmp
.
Enable breakpoints (falling to bash shell) during the RunCVM Container/VM boot process.
<values>
must be a comma-separated list of: prenet
, postnet
, preqemu
.
[EXPERIMENTAL] Enable use of preallocated hugetlb memory backend, which can improve performance in some scenarios.
Configures cgroupfs mountpoints in the VM, which may be needed to run applications like Docker if systemd is not running. Acceptable values are:
none
/systemd
- do nothing; leave to the application or to systemd (if running)1
/cgroup1
- mount only cgroup v1 filesystems supported by the running kernel to subdirectories of/sys/fs/cgroup
2
/cgroup2
- mount only cgroup v2 filesystem to/sys/fs/cgroup
hybrid
/mixed
- mount cgroup v1 filesystems and mount cgroup v2 filesystem to/sys/fs/cgroup/unified
Please note that if RUNCVM_CGROUPFS
is left undefined or set to an empty string, then RunCVM selects an appropriate
default behaviour according to these rules:
- If specified entrypoint (or, if a symlink, its target) matches the regex
/systemd$
then assume a default value ofnone
; - Else, assume a default value of
hybrid
.
These rules work well in the cases of running Docker in (a) stock Alpine/Debian/Ubuntu distributions in which Docker has been installed but Systemd is not running; and (b) distributions in which Systemd is running. Of course you should set RUNCVM_CGROUPFS
if you need to override the default behaviour.
Please also note that in the case your distribution is running Systemd you may instead set --env=RUNCVM_KERNEL_APPEND='systemd.unified_cgroup_hierarchy=<boolean>'
(where <boolean>
is 0
or 1
) to request Systemd to create either hybrid or cgroup2-only cgroup filesystem(s) itself.
If running Docker within a VM, it is recommended that you mount a disk backing file at /var/lib/docker
to allow dockerd
to use the preferred overlay filesystem and avoid it opting to use the extremely sub-performant vfs
storage driver.
e.g. To launch a VM with a 1G ext4-formatted backing file, stored in the underlying container's overlay filesystem, and mounted at /var/lib/docker
, run:
docker run -it --runtime=runcvm --env=RUNCVM_DISKS=/disks/docker,/var/lib/docker,ext4,1G <docker-image>
To launch a VM with a 5G ext4-formatted backing file, stored in a dedicated Docker volume on the host, and mounted at /var/lib/docker
, run:
docker run -it --runtime=runcvm --mount=type=volume,src=runcvm-disks,dst=/disks --env=RUNCVM_DISKS=/disks/docker,/var/lib/docker,ext4,5G <docker-image>
In both cases, RunCVM will check for existence of a file /disks/docker
and, if not found, will create the disk backing file of the given size and format as an ext4 filesystem. It will add the disk to /etc/fstab
.
For full documentation of RUNCVM_DISKS
, see above.
Doing this is not recommended, but if running Docker within a VM, you can enable dockerd
to use the overlay filesystem (at the cost of security) by launching with --env=RUNCVM_SYS_ADMIN=1
. e.g.
docker run --runtime=runcvm --mount=type=volume,src=mydocker1,dst=/var/lib/docker --env=RUNCVM_SYS_ADMIN=1 <docker-image>
N.B. This option adds
CAP_SYS_ADMIN
capabilities to the container and then launchesvirtiofsd
with-o modcaps=+sys_admin
.
The following deep dive should help explain the inner workings of RunCVM, and which files to modify to implement fixes, improvements and extensions.
RunCVM's 'wrapper' runtime, runcvm-runtime
, intercepts container create
and exec
commands and their specifications in JSON format (config.json
and process.json
respectively) that are normally provided (by docker
run
/create
and docker exec
respectively) to a standard container runtime like runc
.
The JSON file is parsed to retrieve properties of the command, and is modified to allow RunCVM to piggyback by overriding the originally intended behaviour with new behaviour.
The modifications to create
are designed to make the created container launch a VM that boots off the container's filesystem, served using virtiofsd
.
The modifications to exec
are designed to run commands within the VM instead of the container.
In more detail, the RunCVM runtime create
process:
- Modifies the
config.json
file to:- Modify the container's entrypoint, to prepend
runcvm-ctr-entrypoint
to the container's original entrypoint and if an--init
argument was detected, remove any init process and set the container env varRUNCVM_INIT
to1
- Set the container env var
RUNCVM_UIDGID
to<uid>:<gid>:<additionalGids>
as intended for the container, then resets both the<uid>
and<gid>
to0
. - Set the container env var
RUNCVM_CPUS
to the intended--cpus
count so it can be passed to the VM - Extract and delete all requested tmpfs mounts (these will be independently mounted by the VM).
- Add a bind mount from
/
to/vm
that will recursively mount the following preceding mounts:- A bind mount from
/opt/runcvm
on the host to/opt/runcvm
in the container. - A tmpfs mounted at
/.runcvm
- A bind mount from
- Add a tmpfs at
/run
in the container only. - Map all requested bind mounts from their original mountpoint
<mnt>
to/vm/<mnt>
(except where<mnt>
is at or below/disks
). - Determine a suitable VM launch kernel by looking for one inside the container's image, choosing a stock RunCVM kernel matching the image, or according to an env var argument.
- Add a bind mount to
/vm/lib/modules/<version>
for the kernel's modules - Set container env vars
RUNCVM_KERNEL_PATH
,RUNCVM_KERNEL_INITRAMFS_PATH
andRUNCVM_KERNEL_ROOT
- Add a bind mount to
- Add device mounts for
/dev/kvm
and/dev/net/tun
. - Set the seccomp profile to 'unconfined'.
- Set
/dev/shm
to the size desired for the VM's memory and set container env var accordingly. - Add necessary capabilities, if not already present (
NET_ADMIN
,NET_RAW
,MKNOD
,AUDIT_WRITE
). - Only if requested by
--env=SYS_ADMIN=1
, add theSYS_ADMIN
capability.
- Modify the container's entrypoint, to prepend
- Executes the standard container runtime
runc
with the modifiedconfig.json
.
The runcvm-ctr-entrypoint
:
- Is always launched as PID1 within the standard Docker container.
- Saves the container's originally-intended entrypoint and command line, environment variables and network configuration to files inside
/.runcvm
. - Creates a bridge (acting as a hub) for each container network interface, to join that interface to a VM tap network interface.
- Launches
virtiofsd
to serve the container's root filesystem. - Configures
/etc/resolv.conf
in the container. - Adds container firewall rules, launches
dnsmasq
and modifies/vm/etc/resolv.conf
to proxy DNS requests from the VM to Docker's DNS. - Execs RunCVM's own
runcvm-init
init process to superviseruncvm-ctr-qemu
to launch the VM.
The runcvm-init
process:
- Is RunCVM's custom init process, that takes over as PID1 within the container, supervising
runcvm-ctr-qemu
to launch the VM. - Waits for a TERM signal. On receiving one, it spawns
runcvm-ctr-shutdown
, which cycles through a number of methods to try to shut down the VM cleanly. - Waits for its child (QEMU) to exit. When it does, execs
runcvm-ctr-exit
to retrieve any saved exit code (written by the application to/.runcvm/exit-code
) and exit with this code.
The runcvm-ctr-qemu
script:
- Prepares disk backing files as specified by
--env=RUNCVM_DISKS=<disks>
- Prepares network configuration as saved from the container (modifying the MAC address of each container interface)
- Launches QEMU with the required kernel, network interfaces, disks, display, and with a root filesystem mounted via virtiofs from the container and with
runcvm-vm-init
as the VM's init process.
The runcvm-vm-init
process:
- Runs as PID1 within the VM.
- Retrieves the container configuration - network, environment, disk and tmpfs mounts - saved by
runcvm-ctr-entrypoint
to/.runcvm
, and reproduces it within the VM - Launches the container's pre-existing entrypoint, in one of two ways.
- If
RUNCVM_INIT
is1
(i.e. the container was originally intended to be launched with Docker's own init process) then it configures and execs busyboxinit
, which becomes the VM's PID1, to supervisedropbear
, runruncvm-vm-start
andpoweroff
the VM if signalled to do so. - Else, it backgrounds
dropbear
, then execs (viaruncvm-init
, purely to create a controlling tty)runcvm-vm-start
, which runs as the VM's PID1.
- If
The runcvm-vm-start
script:
- Restores the container's originally-intended environment variables,
<uid>
,<gid>
,<additionalGids>
and<cwd>
, and execs that entrypoint.
The RunCVM runtime exec
process:
- Modifies the
process.json
file to:- Retrieve the intended
<uid>
,<gid>
,<additionalGids>
,<terminal>
and<cwd>
for the command, as well as indicating the existence of a HOME environment variable. - Resets both the
<uid>
and<gid>
to0
and the<cwd>
to/
. - Prepend
runcvm-ctr-exec '<uid>:<gid>:<additionalGids>' '<cwd>' '<hasHome>' '<terminal>'
to the originally intended command.
- Retrieve the intended
- Executes the standard container runtime
runc
with the modifiedprocess.json
.
The runcvm-ctr-exec
script:
- Uses the Dropbear
dbclient
SSH client to execute the intended command, with the intended arguments within the VM, via theruncvm-vm-exec
process, propagate the returned stdout and stderr and return the command's exit code.
Building RunCVM requires Docker. To build RunCVM, first clone the repo, then run the build script, as follows:
cd runcvm
./build/build.sh
The build script creates a Docker image named newsnowlabs/runcvm:latest
.
Now follow the main installation instructions to install your built RunCVM from the Docker image.
Test RunCVM using nested RunCVM. You can do this using a Docker image capable of installing RunCVM, or an image built with a version of RunCVM preinstalled.
Build a suitable image as follows:
cat <<EOF | docker build --tag=ubuntu-docker-runcvm -
FROM ubuntu:jammy
# Install needed packages and create and configure 'runcvm' user account
RUN apt update && \
apt -y install \
apt-utils kmod wget iproute2 systemd \
ca-certificates curl gnupg udev dbus sudo psmisc && \
curl -fsSL https://get.docker.com | bash && \
echo kvm_intel >>/etc/modules && \
useradd --create-home --shell /bin/bash --groups sudo,docker runcvm && \
echo runcvm:runcvm | chpasswd && \
echo 'runcvm ALL=(ALL) NOPASSWD: ALL' >/etc/sudoers.d/runcvm
WORKDIR /home/runcvm
ENTRYPOINT ["/lib/systemd/systemd"]
VOLUME /disks
# Mount formatted backing files at:
# - /var/lib/docker for speed and overlay2 support
# - /opt/runcvm to avoid nested virtiofs, which works, but can't be great for speed
ENV RUNCVM_DISKS='/disks/docker,/var/lib/docker,ext4,2G;/disks/runcvm,/opt/runcvm,ext4,2G'
# # Uncomment this block to preinstall RunCVM from the specified image
#
# COPY --from=newsnowlabs/runcvm:latest /opt /opt/
# RUN rm -f /etc/init.d/docker && \
# bash /opt/runcvm/scripts/runcvm-install-runtime.sh --no-dockerd
EOF
(Uncomment the final block to build an image with RunCVM preinstalled, or leave the block commented to test RunCVM installation).
To launch, run:
docker run -d --runtime=runcvm -m 2g --name=ubuntu-docker-runcvm ubuntu-docker-runcvm
Optionally modify this
docker run
command by:
- adding
--rm
- to automatically remove the container after systemd shutdown- removing
-d
and adding--env=RUNCVM_KERNEL_DEBUG=1
- to see kernel and systemd boot logs- removing
-d
and adding-it
- to provide a console
Then docker exec -it -u runcvm ubuntu-docker-runcvm bash
to obtain a command prompt and perform testing.
Run docker rm -fv ubuntu-docker-runcvm
to clean up after testing.
Support launching images: If you encounter any Docker image that launches in a standard container runtime that does not launch in RunCVM, or launches but with unexpected behaviour, please raise an issue titled Launch failure for image <image>
or Unexpected behaviour for image <image>
and include log excerpts and an explanation of the failure, or expected and unexpected behaviour.
For all other issues: please still raise an issue
You can also reach out to us on the NewsNow Labs Slack Workspace.
We are typically available to respond to queries Monday-Friday, 9am-5pm UK time, and will be happy to help.
If you would like to contribute a feature suggestion or code, please raise an issue or submit a pull request.
Shut down any RunCVM containers.
Then run sudo rm -f /opt/runcvm
.
RunCVM and Dockside are designed to work together in two alternative ways.
- Dockside can be used to launch devtainers (development environments) in RunCVM VMs, allowing you to provision containerised online IDEs for developing applications like
dockerd
, Docker swarm,systemd
, applications that require a running kernel, or kernel modules not available on the host, or specific hardware e.g. a graphics display. Follow the instructions for adding a runtime to your Dockside profiles. - Dockside can itself be launched inside a RunCVM VM with its own
dockerd
to provide increased security and compartmentalisation from a host. e.g.
docker run --rm -it --runtime=runcvm --memory=2g --name=docksidevm -p 443:443 -p 80:80 --mount=type=volume,src=dockside-data,dst=/data --mount=type=volume,src=dockside-disks,dst=/disks --env=RUNCVM_DISKS=/disks/disk1,/var/lib/docker,ext4,5G newsnowlabs/dockside --run-dockerd --ssl-builtin
This project (known as "RunCVM"), comprising the files in this Git repository (but excluding files containing a conflicting copyright notice and licence), is copyright 2023 NewsNow Publishing Limited, Struan Bartlett, and contributors.
RunCVM is an open-source project licensed under the Apache License, Version 2.0 (the "License"); you may not use RunCVM or its constituent files except in compliance with the License.
You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
N.B. In order to run, RunCVM relies upon other third-party open-source software dependencies that are separate to and independent from RunCVM and published under their own independent licences.
RunCVM Docker images made available at https://hub.docker.com/repository/docker/newsnowlabs/runcvm are distributions designed to run RunCVM that comprise: (a) the RunCVM project source and/or object code; and (b) third-party dependencies that RunCVM needs to run; and which are each distributed under the terms of their respective licences.
Footnotes
-
docker network create --scope=local testnet >/dev/null && docker run --name=test --rm --runtime=kata --network=testnet --entrypoint=/bin/ash alpine -c 'for n in test google.com 8.8.8.8; do echo "ping $n ..."; ping -q -c 8 -i 0.5 $n; done'; docker network rm testnet >/dev/null
succeeds onrunc
andruncvm
but at time of writing (2023-12-31) the DNS lookups needed fail onkata
.
↩$ docker network create --scope=local testnet >/dev/null && docker run --name=test --rm -it --runtime=kata --network=testnet --entrypoint=/bin/ash alpine -c 'for n in test google.com 8.8.8.8; do echo "ping $n ..."; ping -q -c 8 -i 0.5 $n; done'; docker network rm testnet >/dev/null ping test ... ping: bad address 'test' ping google.com ... ping: bad address 'google.com' ping 8.8.8.8 ... PING 8.8.8.8 (8.8.8.8): 56 data bytes --- 8.8.8.8 ping statistics --- 8 packets transmitted, 8 packets received, 0% packet loss round-trip min/avg/max = 0.911/1.716/3.123 ms $ docker network create --scope=local testnet >/dev/null && docker run --name=test --rm -it --runtime=runcvm --network=testnet --entrypoint=/bin/ash alpine -c 'for n in test google.com 8.8.8.8; do echo "ping $n ..."; ping -q -c 8 -i 0.5 $n; done'; docker network rm testnet >/dev/null ping test ... PING test (172.25.8.2): 56 data bytes --- test ping statistics --- 8 packets transmitted, 8 packets received, 0% packet loss round-trip min/avg/max = 0.033/0.085/0.137 ms ping google.com ... PING google.com (172.217.16.238): 56 data bytes --- google.com ping statistics --- 8 packets transmitted, 8 packets received, 0% packet loss round-trip min/avg/max = 8.221/8.398/9.017 ms ping 8.8.8.8 ... PING 8.8.8.8 (8.8.8.8): 56 data bytes --- 8.8.8.8 ping statistics --- 8 packets transmitted, 8 packets received, 0% packet loss round-trip min/avg/max = 1.074/1.491/1.801 ms
-
docker run --rm -it --runtime=kata --entrypoint=/bin/ash -m 500m alpine -c 'free -h; df -h /dev/shm'
↩$ docker run --rm --runtime=kata --name=test -m 2g --env=RUNCVM_KERNEL_DEBUG=1 -it alpine ash -c 'free -h' total used free shared buff/cache available Mem: 3.9G 94.4M 3.8G 0 3.7M 3.8G Swap: 0 0 0 $ docker run --rm --runtime=kata --name=test -m 3g --env=RUNCVM_KERNEL_DEBUG=1 -it alpine ash -c 'free -h' total used free shared buff/cache available Mem: 4.9G 107.0M 4.8G 0 3.9M 4.8G Swap: 0 0 0 $ docker run --rm --runtime=kata --name=test -m 0g --env=RUNCVM_KERNEL_DEBUG=1 -it alpine ash -c 'free -h' total used free shared buff/cache available Mem: 1.9G 58.8M 1.9G 0 3.4M 1.9G Swap: 0 0 0