libct/nsenter: namespace the bindfd shuffle #3599
Conversation
Looks good overall, and it's amazing that it works! Left some nits.
LGTM, and thanks!
@AkihiroSuda @cyphar PTAL
libcontainer/nsenter/cloned_binary.c

```c
/*
 * The kernel refuses to bind-mount from the magic symlink when the
 * process has been cloned (or unshared) into a new mount namespace,
 * at least on Linux 4.4.
 */
char linkbuf[PATH_MAX + 1] = { 0 };
ssize_t linkpathlen = readlink("/proc/self/exe", linkbuf, sizeof(linkbuf));
```
If we cannot use `/proc/self/exe` (or `/proc/$runc_pid/exe`) then we can't really do this. It's not safe to be operating on random paths that you get from resolving `/proc/self/exe` (and, more importantly, this won't be doable once we start work on porting runc to libpathrs). (Obviously this is running "on the host", but I just don't feel comfortable with resolving magic-links because they aren't symlinks at all, and any similarity to symlinks is a very leaky illusion.)
However, I'm not sure it's actually true that you cannot use the magic-link as a mount source -- based on a quick test with `unshare` (Linux 5.19.10) it works without issue, and I'm not sure what would've caused it to fail in older kernel versions (`nd_jump_link` really doesn't care about what you're doing with the path, and the mount infrastructure doesn't care about the source IIRC). You probably cannot use it as a mount *target*, but that shouldn't be an issue. What problem did you run into when using it as a mount source?
On both Linux 4.4.0-1122-aws and Linux 5.15.0-33-generic (Ubuntu), any attempt to bind-mount `/proc/self/exe` in an unshared mount namespace yields `EINVAL`.
```
ubuntu@ubuntu2204:~$ sudo strace -f -e trace=mount ./runc run --bundle bundle foobar
strace: Process 62293 attached
strace: Process 62292 attached
strace: Process 62294 attached
strace: Process 62295 attached
strace: Process 62296 attached
strace: Process 62298 attached
[pid 62298] mount("none", "/", NULL, MS_REC|MS_SLAVE, NULL) = 0
[pid 62298] mount("/proc/self/exe", "/run/runc/foobar/runc.G52pod", 0x563c345027e2, MS_BIND, 0x563c345027e2) = -1 EINVAL (Invalid argument)
[pid 62298] +++ exited with 22 +++
[...]
```
`mount(8)` canonicalizes the source path with `realpath(3)`, which is why it appeared to work in your test.
```
ubuntu@ubuntu2204:~$ sudo unshare --mount
root@ubuntu2204:/home/ubuntu# strace -e trace=readlink,mount mount --bind /proc/self/exe selfexe
readlink("/proc", 0x7fff7a3b6420, 1023) = -1 EINVAL (Invalid argument)
readlink("/proc/self", "62403", 1023) = 5
readlink("/proc/62403", 0x7fff7a3b6420, 1023) = -1 EINVAL (Invalid argument)
readlink("/proc/62403/exe", "/usr/bin/mount", 1023) = 14
readlink("/usr", 0x7fff7a3b6420, 1023) = -1 EINVAL (Invalid argument)
readlink("/usr/bin", 0x7fff7a3b6420, 1023) = -1 EINVAL (Invalid argument)
readlink("/usr/bin/mount", 0x7fff7a3b6420, 1023) = -1 EINVAL (Invalid argument)
readlink("/home/ubuntu/selfexe", 0x7fff7a3b6420, 1023) = -1 EINVAL (Invalid argument)
mount("/usr/bin/mount", "/home/ubuntu/selfexe", 0x563f8a351840, MS_BIND, NULL) = 0
+++ exited with 0 +++
```
Minimal reproducer:
```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <sys/mount.h>

int main(int argc, char **argv)
{
	if (argc > 1) {
		puts("Unsharing mount namespace...");
		if (unshare(CLONE_NEWNS) < 0) {
			perror("unshare");
			exit(EXIT_FAILURE);
		}
	}
	if (mount("/proc/self/exe", "selfexe", "", MS_BIND, "") < 0) {
		perror("mount");
		exit(EXIT_FAILURE);
	}
	int ret = system("/bin/sh");
	if (ret < 0) {
		perror("system");
		exit(EXIT_FAILURE);
	}
	return ret;
}
```
```
$ gcc main.c && touch selfexe && sudo ./a.out unshare
Unsharing mount namespace...
mount: Invalid argument
$ sudo ./a.out
# exit
$ sudo umount selfexe
```
Indirecting the mount through `/proc/self/fd/<n>` also yields `EINVAL`, regardless of whether the file descriptor to `/proc/self/exe` was opened before or after the mount namespace was unshared, with or without `O_PATH`. My best guess of what's happening is that `/proc/self/exe` resolves to a `struct path` value which, being magic, has its `mnt->mnt_ns` set to the parent mount namespace, which trips the `check_mnt()` validation here. If my interpretation is correct, there is no way to successfully bind-mount a process's executable across namespaces without going through path-resolution machinery in the process's current mount namespace.
Would you feel more comfortable if there was more assurance that the path resolved to the same file as `/proc/self/exe`, @cyphar? Here's what I'm thinking:

- `open(O_PATH)` both `/proc/self/exe` and `realpath("/proc/self/exe")`
- Check that the file descriptors refer to the same file by `fstat`'ing both and testing whether they have the same `st_dev` and `st_ino` values
- Bind-mount the `/proc/self/fd/<n>` for the realpath fd
@cyphar WDYT? ^^^

If `readlink("/proc/self/exe")` is considered to be unsafe, my naive approach would be to use `os.Args[0]`, with or without additional checks using `fstat`, `kcmp` with `KCMP_FILE`, etc.
Yes, you were both quite right -- I had forgotten about the annoying way `mount(8)` handles paths, and that (of course) since we are creating a mountns you would run afoul of the `check_mnt` checks for sources.

The future libpathrs issue is still a problem, but I think we need to drop this bindfd code before then anyway, so we can cross that bridge if we ever get to it.

The current stat-based approach is okay. I'm still not the biggest fan of all of this, but fixing it completely requires the safe procfs stuff I'm working on with libpathrs, and current runc is also vulnerable to the same issues.
Needs rebase
This issue is also affecting us; if it can be fixed we will be very grateful.
This should help with docker/for-linux#679, as my (currently untested) recollection is that systemd doesn't log mounts outside the root namespace.
This PR has languished for far too long without any feedback from the maintainer who requested changes, even after I made changes which directly address the reason that changes were requested. @crosbymichael could you please take a look?
Any update? We are strongly looking forward to this.
```c
	 * save an unnecessary context switch as we'd just be
	 * waiting for the child process to exit anyway.
	 */
	CLONE_NEWNS | CLONE_VFORK, (void *)args);
```
In my tests, significant sys CPU usage by runc processes is observed, which seems to come from creating and removing a new mountns each time. Especially on removal, fierce lock contention appeared on a global seqlock, `mount_lock`, which is used to protect changes to the vfsmount hash. The related kernel stack is as follows:

```
lock_mount_hash
mntput_no_expire
mntput
namespace_unlock
drop_collected_mounts
put_mnt_ns
free_nsproxy
switch_task_namespaces
exit_task_namespaces
do_exit
```
@thaJeztah this may be a serious problem that needs to be taken into consideration; the overall impact may be even higher.
Unfortunately the only alternative is to remove bindfd (which I would like to do!), but that will probably have to wait until runc 1.3 since it could cause regressions. The cost being described is the cost of creating and destroying a new mount namespace, and there really isn't any way to avoid the lock contention -- though it should be noted that the previous setup would've also caused contention on `mount_lock` (it was creating a mount, after all).
@cyphar what about, instead of creating/destroying the mountns every time an exec is executed, creating one long-running mountns once containerd starts and passing the handle of the mountns to runc when it is forked? The benefits of this approach are:

- no need to create/destroy the mountns every time
- the handle of the mountns is optional, which is better for backward compatibility
I encountered a similar issue, and this PR did not solve my problem; on the contrary, from a black-box perspective, it has exacerbated it. My scenario is deploying 100 pods on a single node, each with one container.

Before the change, due to the high CPU usage of systemd, a large number of pods failed to deploy, though there was a chance that pods could successfully restart. After this PR was merged, the CPU usage of systemd did decrease, but the hang phenomenon still exists: a large number of pods still fail to deploy, and retries also fail. Not using `try_bindfd` but using memfd instead works very well: all pods deploy successfully on the first attempt, even when deploying 170 pods at the same time.
> I think `try_bindfd` should die and be replaced with just the memfd code.
@cyphar ya, you have a point, thanks for clarifying; maybe the best option at this moment is to avoid exec probes for massive numbers of pods.
> @cyphar ya, you have your point, thanks for clarify, maybe the best option at this moment is to avoid exec probe for massive pods
For services migrating from virtual machines to containers, changing the probe from exec mode to non-exec mode would be a substantial amount of work.
@113xiaoji yes, we have the same problem here. I have tried another option: set up a new mountns and move containerd and kubelet into it. You can refer to this article: https://docs.openshift.com/container-platform/4.12/scalability_and_performance/optimization/optimizing-cpu-usage.html
@cyphar Our situation is rather urgent and we cannot wait for the 1.3.0 release, so we are considering shipping our own version, in which we would eliminate the `try_bindfd`-related logic to fit our business scenario. We would like to hear your thoughts on this. As for memory overhead, could you detail how it works? Does it only consume the host's memory, or does it also consume the container's memory? Does each invocation of runc init result in memory consumption? Besides memory overhead, do you foresee any other serious issues that we should be aware of?
It contributes to the container's memory accounting limit, but only while the runc binary is actually running -- the runc process is generally very short-lived (it only exists when it needs to do container operations, once a container process is spawned it disappears -- at least, this is how it works when you use it in the context of Docker/cri-o/containerd/Kubernetes). Each invocation of runc causes a ~10MB (the size of the runc binary) memory allocation in whatever container's cgroup is associated with that runc invocation.
The patch to remove `try_bindfd` is very trivial; you can see the patch in #3509 (I discovered some kernel bugs in `memfd_create` when I was working on this -- I was going to send it this morning...).

The thing is, creating a container involves doing many mounts already, so I'm still a little surprised that this one mount is causing so many performance issues. I suspect that the host bind-mount would be okay if it weren't for systemd ingesting every mount event on the system, but adding the short-lived namespace might be triggering a different contention issue.
I think `try_bindfd` should die and be replaced with just the memfd code (which eventually will no longer be needed when I finish the magic-link hardening work to block these re-opening attacks -- though we will need to keep it around for a long while...).
However, looking at this PR again, it probably makes sense to carry this (even if we end up dropping bindfd in runc 1.3.0). The extra CPU cost of creating a new mount namespace for such a trivial task kind of sucks, but them's the breaks (you could even argue this makes it clearer that we should drop bindfd sooner).
Sadly, the proposal I had a while ago for removing the cost of memfds doesn't work (I figured this out a while ago, but it's probably a good idea to write it down somewhere) -- you cannot bind-mount memfds at all because they are all in the `MNT_NS_INTERNAL` namespace.
...from nsexec.c so they can be used in cloned_binary.c. Signed-off-by: Cory Snider <[email protected]>
...so the compiler can warn about mismatches between the format string and varargs. Signed-off-by: Cory Snider <[email protected]>
Modify receive_fd() and send_fd() so they can be more readily reused in cloned_binary.c. Change receive_fd() to have a single responsibility: receiving and returning a single file descriptor over a UNIX domain socket. Make send_fd() useable in precarious execution contexts such as a clone(CLONE_VFORK|CLONE_VM) "thread" where allocating heap memory or calling exit() would be dangerous. Signed-off-by: Cory Snider <[email protected]>
Signed-off-by: Cory Snider <[email protected]>
Processes can watch /proc/self/mounts or /mountinfo, and the kernel will notify them whenever the namespace's mount table is modified. The notified process still needs to read and parse the mountinfo to determine what changed once notified. Many such processes, including udisksd and systemd < v248, make no attempt to rate-limit their mountinfo notifications. This tends to not be a problem on many systems, where mount tables are small and mounting and unmounting is uncommon. Every runC exec which successfully uses the try_bindfd container-escape mitigation performs two mount()s and one umount() in the host's mount namespace, causing any mount-watching processes to wake up and parse the mountinfo file three times in a row. Consequently, using 'exec' health checks on containers has a larger-than-expected impact on system load when such mount-watching daemons are running. Furthermore, the size of the mount table in the host's mount namespace tends to be proportional to the number of OCI containers as a unique mount is required for the rootfs of each container. Therefore, on systems with mount-watching processes, the system load increases *quadratically* with the number of running containers which use health checks! Prevent runC from incidentally modifying the host's mount namespace for container-escape mitigations by setting up the mitigation in a temporary mount namespace. Signed-off-by: Cory Snider <[email protected]>
Cannot agree more.
> Processes can watch `/proc/self/mounts` or `/mountinfo`, and the kernel will notify them whenever the namespace's mount table is modified. The notified process still needs to read and parse the mountinfo to determine what changed once notified. Many such processes, including `udisksd` and systemd < v248, make no attempt to rate-limit their mountinfo notifications. This tends to not be a problem on many systems, where mount tables are small and mounting and unmounting is uncommon. Every runC exec which successfully uses the `try_bindfd` container-escape mitigation performs two `mount()`s and one `umount()` in the host's mount namespace, causing any mount-watching processes to wake up and parse the mountinfo file three times in a row. Consequently, using 'exec' health checks on containers has a larger-than-expected impact on system load when such mount-watching daemons are running. Furthermore, the size of the mount table in the host's mount namespace tends to be proportional to the number of OCI containers, as a unique mount is required for the rootfs of each container. Therefore, on systems with mount-watching processes, the system load increases *quadratically* with the number of running containers which use health checks!
>
> Prevent runC from incidentally modifying the host's mount namespace for container-escape mitigations by setting up the mitigation in a private mount namespace.