Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libct/nsenter: namespace the bindfd shuffle #3599

Merged
merged 5 commits into from
Jul 4, 2023

Conversation

corhere
Copy link
Contributor

@corhere corhere commented Sep 9, 2022

Processes can watch /proc/self/mounts or /mountinfo, and the kernel will notify them whenever the namespace's mount table is modified. The notified process still needs to read and parse the mountinfo to determine what changed once notified. Many such processes, including udisksd and systemd < v248, make no attempt to rate-limit their mountinfo notifications. This tends to not be a problem on many systems, where mount tables are small and mounting and unmounting is uncommon. Every runC exec which successfully uses the try_bindfd container-escape mitigation performs two mount()s and one umount() in the host's mount namespace, causing any mount-watching processes to wake up and parse the mountinfo file three times in a row. Consequently, using 'exec' health checks on containers has a larger-than-expected impact on system load when such mount-watching daemons are running. Furthermore, the size of the mount table in the host's mount namespace tends to be proportional to the number of OCI containers as a unique mount is required for the rootfs of each container. Therefore, on systems with mount-watching processes, the system load increases quadratically with the number of running containers which use health checks!

Prevent runC from incidentally modifying the host's mount namespace for container-escape mitigations by setting up the mitigation in a private mount namespace.

@corhere corhere changed the title libct/nsenter: namespace the bundfd shuffle libct/nsenter: namespace the bindfd shuffle Sep 9, 2022
@corhere corhere force-pushed the bindfd-in-private branch 2 times, most recently from be35225 to 2e73e34 Compare September 9, 2022 23:03
@kolyshkin
Copy link
Contributor

Looks good overall, and it's amazing that it works! Left some nits.

@kolyshkin kolyshkin added this to the 1.2.0 milestone Sep 10, 2022
@corhere corhere force-pushed the bindfd-in-private branch 2 times, most recently from 703a133 to 6621d55 Compare September 12, 2022 22:31
AkihiroSuda
AkihiroSuda previously approved these changes Sep 12, 2022
kolyshkin
kolyshkin previously approved these changes Sep 21, 2022
Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, and thanks!

@kolyshkin
Copy link
Contributor

@AkihiroSuda @cyphar PTAL

Comment on lines 425 to 433
/*
* The kernel refuses to bind-mount from the magic symlink when the
* process has been cloned (or unshared) into a new mount namespace,
* at least on Linux 4.4.
*/
char linkbuf[PATH_MAX + 1] = { 0 };
ssize_t linkpathlen = readlink("/proc/self/exe", linkbuf, sizeof(linkbuf));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we cannot use /proc/self/exe (or /proc/$runc_pid/exe) then we can't really do this. It's not safe to be operating on random paths that you get from resolving /proc/self/exe (and more importantly, this won't be doable once we start work on porting runc to libpathrs). (Obviously this is running "on the host" but I just don't feel comfortable with resolving magiclinks because they aren't symlinks at all and any similarity to symlinks is a very leaky illusion.)

However, I'm not sure it's actually true that you cannot use the magiclink as a mount source -- based on a quick test with unshare (Linux 5.19.10) it works without issue and I'm not sure what would've caused it to fail in older kernel versions (nd_jump_link really doesn't care about what you're doing with the path and the mount infrastructure doesn't care about the source IIRC). You probably cannot use it as a mount target but that shouldn't be an issue. What problem did you run into when using it as a mount source?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On both Linux 4.4.0-1122-aws and Linux 5.15.0-33-generic (Ubuntu) any attempts to bind-mount /proc/self/exe in an unshared mount namespace yield EINVAL.

ubuntu@ubuntu2204:~$ sudo strace -f -e trace=mount ./runc run --bundle bundle foobar
strace: Process 62293 attached
strace: Process 62292 attached
strace: Process 62294 attached
strace: Process 62295 attached
strace: Process 62296 attached
strace: Process 62298 attached
[pid 62298] mount("none", "/", NULL, MS_REC|MS_SLAVE, NULL) = 0
[pid 62298] mount("/proc/self/exe", "/run/runc/foobar/runc.G52pod", 0x563c345027e2, MS_BIND, 0x563c345027e2) = -1 EINVAL (Invalid argument)
[pid 62298] +++ exited with 22 +++
[...]

mount(8) canonicalizes the source path with realpath(3) which is why it appeared to work in your test.

ubuntu@ubuntu2204:~$ sudo unshare --mount
root@ubuntu2204:/home/ubuntu# strace -e trace=readlink,mount mount --bind /proc/self/exe selfexe
readlink("/proc", 0x7fff7a3b6420, 1023) = -1 EINVAL (Invalid argument)
readlink("/proc/self", "62403", 1023)   = 5
readlink("/proc/62403", 0x7fff7a3b6420, 1023) = -1 EINVAL (Invalid argument)
readlink("/proc/62403/exe", "/usr/bin/mount", 1023) = 14
readlink("/usr", 0x7fff7a3b6420, 1023)  = -1 EINVAL (Invalid argument)
readlink("/usr/bin", 0x7fff7a3b6420, 1023) = -1 EINVAL (Invalid argument)
readlink("/usr/bin/mount", 0x7fff7a3b6420, 1023) = -1 EINVAL (Invalid argument)
readlink("/home/ubuntu/selfexe", 0x7fff7a3b6420, 1023) = -1 EINVAL (Invalid argument)
mount("/usr/bin/mount", "/home/ubuntu/selfexe", 0x563f8a351840, MS_BIND, NULL) = 0
+++ exited with 0 +++
Minimal reproducer:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <sys/mount.h>

int main(int argc, char** argv) {
	if (argc > 1) {
		puts("Unsharing mount namespace...");
		if (unshare(CLONE_NEWNS) < 0) {
			perror("unshare");
			exit(EXIT_FAILURE);
		}
	}
	if (mount("/proc/self/exe", "selfexe", "", MS_BIND, "") < 0) {
		perror("mount");
		exit(EXIT_FAILURE);
	}
	int ret = system("/bin/sh");
	if (ret < 0) {
		perror("system");
		exit(EXIT_FAILURE);
	}
	return ret;
}
$ gcc main.c && touch selfexe && sudo ./a.out unshare
Unsharing mount namespace...
mount: Invalid argument
$ sudo ./a.out
# exit
$ sudo umount selfexe

Indirecting the mount through /proc/self/fd/<n> also yields EINVAL, regardless of whether the file descriptor to /proc/self/exe was opened before or after the mount namespace was unshared, with or without O_PATH. My best guess of what's happening is that /proc/self/exe resolves to a struct path value which, being magic, has its mnt->mnt_ns set to the parent mount namespace, which trips the check_mnt() validation here. If my interpretation is correct, there is no way to successfully bind-mount a process' executable across namespaces without going through path resolution machinery in the process' current mount namespace.

Would you feel more comfortable if there was more assurance that the path resolved to the same file as /proc/self/exe @cyphar? Here's what I'm thinking:

  1. open(O_PATH) both /proc/self/exe and realpath("/proc/self/exe")
  2. Check that the file descriptors refer to the same file by fstat'ing both and testing whether they have the same st_dev and st_ino values
  3. Bind-mount the /proc/self/fd/<n> for the realpath fd

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cyphar WDYT? ^^^

If readlink(/proc/self/exe) is considered to be unsafe, my naive approach would be to use os.Args[0], with or without additional checks using fstat, kcmp with KCMP_FILE etc.

Copy link
Member

@cyphar cyphar Jul 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you were both quite right -- I had forgotten about the annoying way mount(8) handles paths and that (of course) since we are creating a mountns you would run afoul of the check_mnt checks for sources.

The future libpathrs issue is still a problem but I think we need to drop this bindfd code before then anyway, so we can cross that bridge if we ever get to it.

The current stat-based approach is okay. I still am not the biggest fan of all of this, but to fix it completely requires the safe procfs stuff I'm working on with libpathrs, and current runc is also vulnerable to the same issues.

libcontainer/nsenter/cloned_binary.c Outdated Show resolved Hide resolved
libcontainer/nsenter/cloned_binary.c Outdated Show resolved Hide resolved
libcontainer/nsenter/cloned_binary.c Outdated Show resolved Hide resolved
libcontainer/nsenter/cloned_binary.c Show resolved Hide resolved
@AkihiroSuda
Copy link
Member

Needs rebase

@adamqqqplay
Copy link

This issue is also affecting us, if it can be fixed we will be very grateful.

@neersighted
Copy link
Contributor

neersighted commented May 30, 2023

This should help with docker/for-linux#679, as my (currently untested) recollection is that systemd doesn't log mounts outside the root namespace.

@corhere
Copy link
Contributor Author

corhere commented May 30, 2023

This PR has languished for far too long without any feedback from the maintainer who requested changes, even after I made changes which directly address the reason that changes were requested. @crosbymichael could you please take a look?

@LastNight1997
Copy link

anything update? we are strongly looking forward to this.

* save an unnecessary context switch as we'd just be
* waiting for the child process to exit anyway.
*/
CLONE_NEWNS | CLONE_VFORK, (void *)args);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my tests, significant sys cpu usage by runc processes is observed, which seems come out of creating and removing a new mountns each time. Especially on removing, fierce lock competition appeared on a global seqlock, mount_lock, which is used to protect changes to the vfsmount hash. Related kernel stack is as follows.

lock_mount_hash
mntput_no_expire
mntput
namespace_unlock
drop_collected_mounts
put_mnt_ns
free_nsproxy
switch_task_namespaces
exit_task_namespaces
do_exit

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thaJeztah this maybe a serious problem need to take into consideration. the overall impact maybe even higher.

Copy link
Member

@cyphar cyphar Jul 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately the only alternative is to remove bindfd (which I would like to do!) but will probably have to wait until runc 1.3 since it could cause regressions. The cost being described is the cost of creating and destroying a new mount namespace and there really isn't any way to avoid the lock contention -- though it should be noted that the previous setup would've also caused contention on mount_lock (it was creating a mount, after all).

Copy link

@xujihui1985 xujihui1985 Jul 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cyphar what about instead of create/destroy the mountns everytime exec executed, create one long running mountns once containerd started and paas the handle of mountns to runc when it was forked, the benefit of this approach is

  1. no need to create/destroy mountns everytime
  2. the handle of mountns is optional, it's better for backward compatibility

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I encountered a similar issue, and this PR did not solve my problem; on the contrary, from a black-box perspective, it has exacerbated my problem. My scenario is to deploy 100 pods on a single node, each with a container.

Before the modification, due to the high CPU usage of systemd, a large number of pods failed to deploy. However, there is a chance that the pods can successfully restart. After the merger of this PR, the CPU usage of systemd indeed decreased. But the hang up phenomenon still exists.

A large number of pods still fail to deploy, and retries also fail. Not using try_bindfd but using memfd works very well, all pods are deployed successfully for the first time. Even deploying 170 pods at the same time will still succeed.

I think try_bindfd should die and be replaced with just the memfd code.

Copy link

@xujihui1985 xujihui1985 Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cyphar ya, you have your point, thanks for clarify, maybe the best option at this moment is to avoid exec probe for massive pods

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cyphar ya, you have your point, thanks for clarify, maybe the best option at this moment is to avoid exec probe for massive pods
For services migrating from virtual machines to containers, changing the probe from exec mode to non-exec mode would be a substantial amount of work.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@113xiaoji yes, we have the same problem here, I have tried another option to setup a new mountns and "move the containerd and kubelet into the mountns", you can refer to this article https://docs.openshift.com/container-platform/4.12/scalability_and_performance/optimization/optimizing-cpu-usage.html

Copy link

@113xiaoji 113xiaoji Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cyphar If our situation is rather urgent and we cannot wait for the 1.3.0 version, we are considering releasing our own version, in which we will eliminate the try_bindfd related logic to fit our business scenario. We would like to hear your thoughts on this. As for memory overhead, could you detail how it works? Does it only consume the host's memory, or does it also consume the container's memory? Does each invocation of runc init result in memory consumption? Besides memory overhead, do you foresee any other serious issues that we should be aware of?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It contributes to the container's memory accounting limit, but only while the runc binary is actually running -- the runc process is generally very short-lived (it only exists when it needs to do container operations, once a container process is spawned it disappears -- at least, this is how it works when you use it in the context of Docker/cri-o/containerd/Kubernetes). Each invocation of runc causes a ~10MB (the size of the runc binary) memory allocation in whatever container's cgroup is associated with that runc invocation.

The patch to remove try_bindfd is very trivial, you can see the patch in #3509 (I discovered some kernel bugs in memfd_create when I was working on this, I was going to send it this morning...)

The thing is, creating a container involves doing many mounts already, so I'm still a little surprised that this one mount is causing so many performance issues. I suspect that the the host bind-mount would be okay if it weren't for systemd ingesting every mount event on the system, but adding the short-lived namespace might be triggering a different contention issue.

Copy link
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think try_bindfd should die and be replaced with just the memfd code (which eventually will no longer be needed when I finish working on the magic-link hardening work to block these re-opening attacks -- though we will need to keep it around for a long while...).

However, looking at this PR again, it probably makes sense to carry this (even if we end up dropping bindfd in runc 1.3.0). The extra CPU cost of creating a new mount namespace for such a trivial task kind of sucks, but them's the breaks (you could even argue this makes it clearer that we should drop bindfd sooner).

Sadly the proposal I had a while ago for removing the cost of memfds doesn't work (I figured out this out a while ago but it's probably a good idea to write that down somewhere) -- you cannot bind-mount memfds at all because they are all in the MNT_NS_INTERNAL namespace.

...from nsexec.c so they can be used in cloned_binary.c.

Signed-off-by: Cory Snider <[email protected]>
...so the compiler can warn about mismatches between the format string
and varargs.

Signed-off-by: Cory Snider <[email protected]>
Modify receive_fd() and send_fd() so they can be more readily reused in
cloned_binary.c. Change receive_fd() to have a single responsibility:
receiving and returning a single file descriptor over a UNIX domain
socket. Make send_fd() useable in precarious execution contexts such as
a clone(CLONE_VFORK|CLONE_VM) "thread" where allocating heap memory or
calling exit() would be dangerous.

Signed-off-by: Cory Snider <[email protected]>
Processes can watch /proc/self/mounts or /mountinfo, and the kernel
will notify them whenever the namespace's mount table is modified. The
notified process still needs to read and parse the mountinfo to
determine what changed once notified. Many such processes, including
udisksd and SystemD < v248, make no attempt to rate-limit their
mountinfo notifications. This tends to not be a problem on many systems,
where mount tables are small and mounting and unmounting is uncommon.
Every runC exec which successfully uses the try_bindfd container-escape
mitigation performs two mount()s and one umount() in the host's mount
namespace, causing any mount-watching processes to wake up and parse the
mountinfo file three times in a row. Consequently, using 'exec' health
checks on containers has a larger-than-expected impact on system load
when such mount-watching daemons are running. Furthermore, the size of
the mount table in the host's mount namespace tends to be proportional
to the number of OCI containers as a unique mount is required for the
rootfs of each container. Therefore, on systems with mount-watching
processes, the system load increases *quadratically* with the number of
running containers which use health checks!

Prevent runC from incidentally modifying the host's mount namespace for
container-escape mitigations by setting up the mitigation in a temporary
mount namespace.

Signed-off-by: Cory Snider <[email protected]>
@cyphar cyphar closed this in 35eff7c Jul 4, 2023
@cyphar cyphar merged commit 35eff7c into opencontainers:main Jul 4, 2023
@corhere corhere deleted the bindfd-in-private branch July 5, 2023 00:36
@rata rata mentioned this pull request Jul 5, 2023
5 tasks
@113xiaoji
Copy link

try_bindfd

I think try_bindfd should die and be replaced with just the memfd code (which eventually will no longer be needed when I finish working on the magic-link hardening work to block these re-opening attacks -- though we will need to keep it around for a long while...).

However, looking at this PR again, it probably makes sense to carry this (even if we end up dropping bindfd in runc 1.3.0). The extra CPU cost of creating a new mount namespace for such a trivial task kind of sucks, but them's the breaks (you could even argue this makes it clearer that we should drop bindfd sooner).

Sadly the proposal I had a while ago for removing the cost of memfds doesn't work (I figured out this out a while ago but it's probably a good idea to write that down somewhere) -- you cannot bind-mount memfds at all because they are all in the MNT_NS_INTERNAL namespace.

cannot agree more

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

too many mount/umount syscalls