dmz: use overlayfs to write-protect /proc/self/exe if possible #4448
Conversation
Force-pushed a16141f to 581d923.
FWIW, I did some tests and this is about the same performance as

logrus.Debugf("using overlayfs for /proc/self/exe sealing")
return overlayFile, nil
}
logrus.Debugf("could not use overlayfs for /proc/self/exe sealing (%v) -- falling back to standard memfd copy", err)
Needs code comments comparing overlayfs with the standard memfd copy (and also with bind-mount).
I added a short paragraph or two. Let me know if you think it needs more information.
Force-pushed 581d923 to c23ab45.
I suggest using the benchmark I added in #4432; it's closer to runc init, so to speak.
Have yet to take a closer look, but my biggest concern is that adding a new mount will trigger systemd, which is subscribed to mount/umount events and re-reads /proc/self/mountinfo to sync the internal state of mount units. With many quick runc exec invocations this degrades system performance a lot.
Or is this done in a mount namespace which the host systemd can't see?
libcontainer/dmz/overlayfs_linux.go (outdated)
// The only reasonable option would be to hash both files and compare them,
// but this would require fully reading both files which would produce a
// similar performance overhead to memfd cloning.
Would kcmp(2) work here?
Unfortunately no, the two files are completely different from the kernel's perspective (if they were the same we could just check the inode number and device).
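For illustration only (this is not code from the PR), the cheap check mentioned above compares device and inode numbers, so it can only tell whether two paths refer to the same kernel object; it cannot prove that two distinct files have identical contents, which is why neither it nor kcmp(2) helps here. A hedged Go sketch:

```go
// Hedged sketch, not from this PR: the "same file?" check referenced above.
// It compares st_dev/st_ino, so it can only confirm that two paths point at
// the same inode -- it cannot prove two distinct files have equal contents.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func sameInode(a, b string) (bool, error) {
	sa, err := os.Stat(a)
	if err != nil {
		return false, err
	}
	sb, err := os.Stat(b)
	if err != nil {
		return false, err
	}
	stA, okA := sa.Sys().(*syscall.Stat_t)
	stB, okB := sb.Sys().(*syscall.Stat_t)
	if !okA || !okB {
		return false, fmt.Errorf("stat did not return a *syscall.Stat_t")
	}
	return stA.Dev == stB.Dev && stA.Ino == stB.Ino, nil
}

func main() {
	same, err := sameInode("/proc/self/exe", os.Args[0])
	fmt.Println(same, err)
}
```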
@kolyshkin The mount is created with
I have a feeling that benchmarking that way would give unrealistically good performance numbers for this approach, and comparing a full container creation/shutdown with the actual
Using #4432, here are the performance numbers for my machine: baseline (no
Force-pushed fb01a0c to d3ea5aa.
// effectively no performance overhead (it is on par with both
// MS_BIND+MS_RDONLY and no binary cloning at all) while memfd copying adds
// around ~60% overhead during container startup.
overlayFile, err := sealedOverlayfs("/proc/self/exe", tmpDir)
This doesn't take the kernel locks that affected clusters with churn when using the bind-mount logic before, right? I mean, that was causing this issue: #2532
That particular issue was because of systemd's mount tracking, which doesn't apply here (the mounts are internal to the process -- systemd will never see them).
However, to be honest I expected it to have an impact, but it doesn't have a noticeable one. Looking into it, the main reason is that CLONE_NEWNS actually requires taking namespace_lock() to make a copy of the existing mount table, but the anonymous allocation of a private mount namespace doesn't, and so there is no contention on global locks. So there is no locking issue.
It's a bit hard to compare the performance of bind-mount because the code was removed in 1.2.x, so I'd need to write a new version, but if you compare the performance with 1.1.14 this patch is ~5% faster (using the hyperfine benchmark). I can try to benchmark against a synthetic version of bindfd on 1.2.x.
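To make the approach concrete, here is a minimal sketch of the idea under discussion. It is not runc's actual sealedOverlayfs implementation; the helper name, the dummy-lowerdir trick, and the cleanup handling are illustrative assumptions, and it assumes a golang.org/x/sys/unix version new enough to have the Fsconfig* helpers plus CAP_SYS_ADMIN (or a kernel that allows overlayfs mounts in a userns). The overlay has only lowerdirs and no upperdir, so it is inherently read-only, and the fsmount(2) fd is never attached to the mount tree, so no mount namespace (and no systemd) ever sees it.

```go
// Hedged sketch of an fsopen(2)-based "sealed" /proc/self/exe, not runc's code.
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

func sealedExe() (*os.File, error) {
	exe, err := os.Readlink("/proc/self/exe")
	if err != nil {
		return nil, err
	}
	binDir, binName := filepath.Dir(exe), filepath.Base(exe)

	// An empty extra lowerdir lets us mount overlayfs without an upperdir,
	// keeping the whole overlay read-only. (Cleanup omitted for brevity.)
	dummy, err := os.MkdirTemp("", "dummy-lower")
	if err != nil {
		return nil, err
	}

	fsfd, err := unix.Fsopen("overlay", unix.FSOPEN_CLOEXEC)
	if err != nil {
		return nil, fmt.Errorf("fsopen overlay: %w", err)
	}
	defer unix.Close(fsfd)

	if err := unix.FsconfigSetString(fsfd, "lowerdir", dummy+":"+binDir); err != nil {
		return nil, err
	}
	if err := unix.FsconfigCreate(fsfd); err != nil {
		return nil, err
	}

	// Detached (anonymous) mount: it exists only as this file descriptor and
	// is never attached to any mount namespace's mount tree.
	mntfd, err := unix.Fsmount(fsfd, unix.FSMOUNT_CLOEXEC, unix.MOUNT_ATTR_RDONLY)
	if err != nil {
		return nil, err
	}
	defer unix.Close(mntfd)

	fd, err := unix.Openat(mntfd, binName, unix.O_RDONLY|unix.O_CLOEXEC, 0)
	if err != nil {
		return nil, err
	}
	return os.NewFile(uintptr(fd), "sealed:"+exe), nil
}

func main() {
	f, err := sealedExe()
	if err != nil {
		fmt.Println("would fall back to memfd copy:", err)
		return
	}
	defer f.Close()
	fmt.Println("sealed /proc/self/exe fd:", f.Fd())
}
```

The design point this illustrates is the one above: a container with CAP_SYS_ADMIN could clear MS_RDONLY on a bind mount, but an overlay with no upperdir has no writable path to clear.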
Nah, I don't think we need it in 1.2 just to compare. IMHO it is more interesting to create a way to repro #2532 and make sure this doesn't cause similar consequences.
We probably can't reproduce that exact issue, but we could have a daemon in our tests that tracks whether there are any new mounts created during testing (which is what that issue is really about) and error out if runc ever creates mounts on the host.
Unfortunately, there is still no notification API for mounts (we discussed designing one based on fanotify at LSF/MM this year, but that won't be in a kernel we can use in CI for a while), so we would need to either loop over /proc/self/mountinfo or preferably use listmount(2) to just get a list of mount IDs once CI supports it (Linux 6.8). We might not be able to catch races though, and I can't see an obvious way of avoiding that (mount propagation could be used to copy the mount elsewhere, but MS_SLAVE wouldn't stop the umount from propagating). We might need to run runc several hundred times to try to catch a race.
A slightly crazy idea would be to use bpftrace to detect it, but that would be far too brittle for CI.
Thanks. Yeah, I think CI for this sounds complicated, at least for now. I'll check if I can repro locally and then give this a try; that sounds like enough for now. I'll update if I manage to do it :)
> Unfortunately, there is still no notification API for mounts

The one we have (and which systemd is using) is epoll on /proc/self/mountinfo.
Caveats:
- still have to read the file (or use listmount(2)) once an event is received;
- easy to miss a short-lived mount.
Ah, right! In that case we can just listen for any epoll event and error out if we see one. If we assume the system is not doing other mounts on the host during tests (a reasonable assumption) we should not see any mount events at all during the test.
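A rough sketch of such a CI watchdog (my illustration, not something in this PR): the kernel reports mount-table changes on /proc/self/mountinfo as EPOLLPRI/EPOLLERR events, which is the mechanism systemd relies on. Assumes golang.org/x/sys/unix; as noted above, a short-lived mount+umount can still slip through between events.

```go
// Hedged sketch of a mount-table watchdog for CI, not part of this PR.
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	fd, err := unix.Open("/proc/self/mountinfo", unix.O_RDONLY|unix.O_CLOEXEC, 0)
	if err != nil {
		log.Fatal(err)
	}
	epfd, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC)
	if err != nil {
		log.Fatal(err)
	}
	// Mount-table changes surface as priority/error events on this file.
	ev := unix.EpollEvent{Events: unix.EPOLLPRI | unix.EPOLLERR, Fd: int32(fd)}
	if err := unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, fd, &ev); err != nil {
		log.Fatal(err)
	}

	events := make([]unix.EpollEvent, 1)
	for {
		n, err := unix.EpollWait(epfd, events, -1)
		if err == unix.EINTR {
			continue
		}
		if err != nil {
			log.Fatal(err)
		}
		if n > 0 {
			// A test harness would fail the run here instead of logging.
			log.Println("mount table changed in this mount namespace")
		}
	}
}
```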
A completely different idea that I want to experiment with is using landlock (although not all kernels support it, it has been supported for a while now). Landlock does work with magic-links too, which is great. I'm not sure if it would be easy to integrate for this protection, though. But if we can make it work, it would probably be a very lightweight protection (when the kernel supports it).
I'm not sure landlock can protect against every possible way of accessing
LGTM. This is nice 👍🏻
LGTM, nice idea! The code LGTM and the tests I did seem to work just fine too :)
I did some stress-tests for possible regressions on #2532. This is what I used to test:
- configure containerd to use systemd cgroups
- start a k8s cluster
- create a pod with 100 replicas and a readinessProbe that execs into the container every 1 second
I did this with runc 1.1.14, 1.1.15 and runc main with this PR applied.
With runc 1.1.15 and runc main, only containerd/kubelet showed high CPU usage (20-40%). After the containers were created, udisksd was not using significant CPU (it uses more during creation, but we are doing mounts for the new containers then), and I don't see any mount-related entries in "journalctl -f".
With runc 1.1.14, the udisksd process is constantly taking ~40% CPU, with peaks of ~55% (even after the containers are running), kubelet/containerd take 20-40% CPU, and I see entries like the following in the journalctl -f output, even after all containers are running:
Oct 18 14:15:50 lindsay systemd[1]: run-containerd-runc-k8s.io-b1deab2f119eccb4447dd9afbca48c615f7a24246e76840725cb3b7e895b6c5e-runc.zSElNB.mount: Deactivated successfully.
Therefore, I don't see any regressions here for that issue. This is great :)
pod.yaml and containerd config used for testing:
pod.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
  namespace: default
spec:
  replicas: 100
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - image: registry.k8s.io/hpa-example
        imagePullPolicy: Always
        livenessProbe:
          exec:
            command:
            - curl
            - http://localhost:80
          failureThreshold: 1
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 10
        name: php-apache
        ports:
        - containerPort: 80
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - curl
            - http://localhost:80
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 30m
containerd config:
version = 3
root = '/var/lib/containerd-rata'
state = '/run/containerd-rata'
temp = ''
plugin_dir = ''
disabled_plugins = []
required_plugins = []
oom_score = 0
imports = []
[grpc]
address = '/run/containerd-rata/containerd.sock'
tcp_address = ''
tcp_tls_ca = ''
tcp_tls_cert = ''
tcp_tls_key = ''
uid = 0
gid = 0
max_recv_message_size = 16777216
max_send_message_size = 16777216
[ttrpc]
address = ''
uid = 0
gid = 0
[debug]
address = ''
uid = 0
gid = 0
level = ''
format = ''
[metrics]
address = ''
grpc_histogram = false
[plugins]
[plugins.'io.containerd.cri.v1.images']
snapshotter = 'overlayfs'
disable_snapshot_annotations = true
discard_unpacked_layers = false
max_concurrent_downloads = 3
image_pull_progress_timeout = '5m0s'
image_pull_with_sync_fs = false
stats_collect_period = 10
[plugins.'io.containerd.cri.v1.images'.pinned_images]
sandbox = 'registry.k8s.io/pause:3.10'
[plugins.'io.containerd.cri.v1.images'.registry]
config_path = ''
[plugins.'io.containerd.cri.v1.images'.image_decryption]
key_model = 'node'
[plugins.'io.containerd.cri.v1.runtime']
enable_selinux = false
selinux_category_range = 1024
max_container_log_line_size = 16384
disable_apparmor = false
restrict_oom_score_adj = false
disable_proc_mount = false
unset_seccomp_profile = ''
tolerate_missing_hugetlb_controller = true
disable_hugetlb_controller = true
device_ownership_from_security_context = false
ignore_image_defined_volumes = false
netns_mounts_under_state_dir = false
enable_unprivileged_ports = true
enable_unprivileged_icmp = true
enable_cdi = true
cdi_spec_dirs = ['/etc/cdi', '/var/run/cdi']
drain_exec_sync_io_timeout = '0s'
ignore_deprecation_warnings = []
[plugins.'io.containerd.cri.v1.runtime'.containerd]
default_runtime_name = 'runc'
ignore_blockio_not_enabled_errors = false
ignore_rdt_not_enabled_errors = false
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes]
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc]
runtime_type = 'io.containerd.runc.v2'
runtime_path = ''
pod_annotations = []
container_annotations = []
privileged_without_host_devices = false
privileged_without_host_devices_all_devices_allowed = false
base_runtime_spec = ''
cni_conf_dir = ''
cni_max_conf_num = 0
snapshotter = ''
sandboxer = 'podsandbox'
io_type = ''
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc.options]
BinaryName = ''
CriuImagePath = ''
CriuWorkPath = ''
IoGid = 0
IoUid = 0
NoNewKeyring = false
Root = ''
ShimCgroup = ''
SystemdCgroup = true
[plugins.'io.containerd.cri.v1.runtime'.cni]
bin_dir = '/opt/cni/bin'
conf_dir = '/etc/cni/net.d'
max_conf_num = 1
setup_serially = false
conf_template = ''
ip_pref = ''
use_internal_loopback = false
[plugins.'io.containerd.gc.v1.scheduler']
pause_threshold = 0.02
deletion_threshold = 0
mutation_threshold = 100
schedule_delay = '0s'
startup_delay = '100ms'
[plugins.'io.containerd.grpc.v1.cri']
disable_tcp_service = true
stream_server_address = '127.0.0.1'
stream_server_port = '0'
stream_idle_timeout = '4h0m0s'
enable_tls_streaming = false
[plugins.'io.containerd.grpc.v1.cri'.x509_key_pair_streaming]
tls_cert_file = ''
tls_key_file = ''
[plugins.'io.containerd.image-verifier.v1.bindir']
bin_dir = '/opt/containerd/image-verifier/bin'
max_verifiers = 10
per_verifier_timeout = '10s'
[plugins.'io.containerd.internal.v1.opt']
path = '/opt/containerd'
[plugins.'io.containerd.internal.v1.tracing']
[plugins.'io.containerd.metadata.v1.bolt']
content_sharing_policy = 'shared'
[plugins.'io.containerd.monitor.container.v1.restart']
interval = '10s'
[plugins.'io.containerd.monitor.task.v1.cgroups']
no_prometheus = false
[plugins.'io.containerd.nri.v1.nri']
disable = false
socket_path = '/var/run/nri/nri.sock'
plugin_path = '/opt/nri/plugins'
plugin_config_path = '/etc/nri/conf.d'
plugin_registration_timeout = '5s'
plugin_request_timeout = '2s'
disable_connections = false
[plugins.'io.containerd.runtime.v2.task']
platforms = ['linux/amd64']
[plugins.'io.containerd.service.v1.diff-service']
default = ['walking']
sync_fs = false
[plugins.'io.containerd.service.v1.tasks-service']
blockio_config_file = ''
rdt_config_file = ''
[plugins.'io.containerd.shim.v1.manager']
env = []
[plugins.'io.containerd.snapshotter.v1.blockfile']
root_path = ''
scratch_file = ''
fs_type = ''
mount_options = []
recreate_scratch = false
[plugins.'io.containerd.snapshotter.v1.btrfs']
root_path = ''
[plugins.'io.containerd.snapshotter.v1.devmapper']
root_path = ''
pool_name = ''
base_image_size = ''
async_remove = false
discard_blocks = false
fs_type = ''
fs_options = ''
[plugins.'io.containerd.snapshotter.v1.native']
root_path = ''
[plugins.'io.containerd.snapshotter.v1.overlayfs']
root_path = ''
upperdir_label = false
sync_remove = false
slow_chown = false
mount_options = []
[plugins.'io.containerd.tracing.processor.v1.otlp']
[plugins.'io.containerd.transfer.v1.local']
max_concurrent_downloads = 3
max_concurrent_uploaded_layers = 3
config_path = ''
[cgroup]
path = ''
[timeouts]
'io.containerd.timeout.bolt.open' = '0s'
'io.containerd.timeout.metrics.shimstats' = '2s'
'io.containerd.timeout.shim.cleanup' = '5s'
'io.containerd.timeout.shim.load' = '5s'
'io.containerd.timeout.shim.shutdown' = '3s'
'io.containerd.timeout.task.state' = '2s'
[stream_processors]
[stream_processors.'io.containerd.ocicrypt.decoder.v1.tar']
accepts = ['application/vnd.oci.image.layer.v1.tar+encrypted']
returns = 'application/vnd.oci.image.layer.v1.tar'
path = 'ctd-decoder'
args = ['--decryption-keys-path', '/etc/containerd/ocicrypt/keys']
env = ['OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf']
[stream_processors.'io.containerd.ocicrypt.decoder.v1.tar.gzip']
accepts = ['application/vnd.oci.image.layer.v1.tar+gzip+encrypted']
returns = 'application/vnd.oci.image.layer.v1.tar+gzip'
path = 'ctd-decoder'
args = ['--decryption-keys-path', '/etc/containerd/ocicrypt/keys']
env = ['OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf']
Needs a rebase.
Btw, this approach can also work for rootless containers on new enough kernels (
Force-pushed d3ea5aa to a2ac644.
# use a temporary overlayfs instead of making a memfd clone of
# /proc/self/exe.
[[ "$output" = *"runc-dmz: using overlayfs for sealed /proc/self/exe"* ]]
fi
This doesn't seem to work when running as UID 0 in a userns with a vanilla kernel 5.1-5.10 that lacks support for mounting overlayfs in a userns.
Right, but does anyone run the whole test suite inside a userns (maybe rootlesskit?)? I can change this test to check for that, but runc itself will gracefully fall back to doing memfds in that case.
Force-pushed cf0baa2 to c098b3e.
Signed-off-by: Aleksa Sarai <[email protected]>
Force-pushed c098b3e to 2e6fc77.
Commit b999376 ("nsenter: cloned_binary: remove bindfd logic entirely") removed the read-only bind-mount logic from our cloned binary code because it wasn't really safe: a container with CAP_SYS_ADMIN could remove the MS_RDONLY bit and get write access to /proc/self/exe (even with user namespaces this could've been an issue because it's not clear if the flags are locked).

However, copying a binary does seem to have a minor performance impact. The only way to have no performance impact would be for the kernel to block these write attempts, but barring that we could try to reduce the overhead by coming up with a mount that cannot have its read-only bits cleared.

The "simplest" solution is to create a temporary overlayfs using fsopen(2) which uses the directory where runc exists as a lowerdir, ensuring that the container cannot access the underlying file -- and we don't have to do any copies.

While fsopen(2) is not free because mount namespace cloning is usually expensive (and so it seems like the difference would be marginal), some basic performance testing seems to indicate there is a ~60% improvement doing it this way and that it has effectively no overhead even when compared to just using /proc/self/exe directly:

% hyperfine --warmup 50 \
>     "./runc-noclone run -b bundle ctr" \
>     "./runc-overlayfs run -b bundle ctr" \
>     "./runc-memfd run -b bundle ctr"
Benchmark 1: ./runc-noclone run -b bundle ctr
  Time (mean ± σ):     13.7 ms ± 0.9 ms    [User: 6.0 ms, System: 10.9 ms]
  Range (min … max):   11.3 ms … 16.1 ms    184 runs
Benchmark 2: ./runc-overlayfs run -b bundle ctr
  Time (mean ± σ):     13.9 ms ± 0.9 ms    [User: 6.2 ms, System: 10.8 ms]
  Range (min … max):   11.8 ms … 16.0 ms    180 runs
Benchmark 3: ./runc-memfd run -b bundle ctr
  Time (mean ± σ):     22.6 ms ± 1.3 ms    [User: 5.7 ms, System: 20.7 ms]
  Range (min … max):   19.9 ms … 26.5 ms    114 runs
Summary
  ./runc-noclone run -b bundle ctr ran
    1.01 ± 0.09 times faster than ./runc-overlayfs run -b bundle ctr
    1.65 ± 0.15 times faster than ./runc-memfd run -b bundle ctr

Signed-off-by: Aleksa Sarai <[email protected]>
Force-pushed 2e6fc77 to 515f09f.
@@ -396,7 +402,7 @@ function requires() {
 			fi
 			;;
 		root)
-			if [ $EUID -ne 0 ]; then
+			if [ $EUID -ne 0 ] || in_userns; then
Hmm, this change probably results in many tests being skipped which were not skipped before.
We don't run the suite in user namespaces, though? But yeah this change wasn't strictly necessary so we can revert it if it's causing issues.