
Nested container can't start #168

Open
easeway opened this issue May 11, 2022 · 7 comments


easeway commented May 11, 2022

1. Issue or feature description

On an AWS EKS g4dn.xlarge node, inside a privileged container requesting a GPU resource, a nested container fails with the error:

mount "proc" to "/proc": Operation not permitted

2. Steps to reproduce the issue

  • Create an EKS cluster with g4dn.xlarge nodes and the proper k8s labels on the nodes;
  • Create a privileged Pod (a container image like ubuntu:22.04 works) that claims a GPU resource;
  • Inside the Pod, install an OCI runtime (e.g. apt-get install runc);
  • Prepare a minimal rootfs;
  • Create an OCI spec that creates all new namespaces: user, ipc, mount, net, uts, cgroup, etc.;
  • Add a "proc" mount to "/proc";
  • Run a container using that OCI spec.

To reproduce this issue, using unshare and mount -N may be simpler than writing a full OCI spec.
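For example, a rough sketch of that simpler repro from a shell inside the privileged, GPU-attached Pod (the exact unshare flags are my assumption; the point is attempting a fresh proc mount from new user and mount namespaces):

    # Enter new user, mount and PID namespaces, then try to mount a fresh proc.
    unshare --user --map-root-user --mount --pid --fork \
        sh -c 'mount -t proc proc /proc'
    # Expected to fail with EPERM (Operation not permitted) while the
    # NVIDIA-created overmounts under /proc/driver/nvidia are present.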

3. Root cause

The cause of the `mount "proc" to "/proc": Operation not permitted` error is that the NVIDIA container runtime creates the following mountpoints in the outer container:

  • /proc/driver/nvidia/gpus/BUS/...
  • /proc/driver/nvidia

After unmounting these mountpoints, the nested container starts without issue.
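A sketch of that workaround, run inside the outer container before starting the nested one (the paths under /proc/driver/nvidia/gpus/ vary by GPU bus ID, so they are discovered from mountinfo rather than hard-coded):

    # Find the NVIDIA-created overmounts under /proc (field 5 of mountinfo is
    # the mount point) and unmount them, deepest paths first.
    awk '$5 ~ /^\/proc\/driver\/nvidia/ {print $5}' /proc/self/mountinfo \
        | sort -r \
        | while read -r mnt; do umount "$mnt"; done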

4. Thoughts

I'm not sure why the NVIDIA container runtime creates mountpoints under "/proc". Based on observation, without these mountpoints the files such as /proc/driver/nvidia/gpus/... and /proc/driver/nvidia are still visible and accessible to the Pod. Is this for isolation, in case there are multiple GPU devices on the system, so that the Pod only sees the devices allocated to it?

We also experimented on GKE, which doesn't have this issue; we don't see the mountpoints under /proc there.


elezar commented May 11, 2022

The NVIDIA Container CLI ensures that only the proc paths for the requested devices are mounted into the container. The /proc/driver/nvidia/params file is also updated to ensure that tools such as nvidia-smi don't create device nodes for devices that were not requested.
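For reference, from inside a container started via the toolkit the overmounts and the adjusted params file can be inspected like this (a diagnostic sketch; the exact paths and values depend on the driver version and the devices requested):

    # Show the proc entries mounted over the host's /proc/driver/nvidia tree.
    grep ' /proc/driver/nvidia' /proc/self/mountinfo
    # The overmounted params file is expected to disable device-node creation,
    # e.g. ModifyDeviceFiles: 0 rather than the host's default.
    grep ModifyDeviceFiles /proc/driver/nvidia/params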

Since you mention GKE: did you install the NVIDIA Container Runtime there, or are you launching a pod using their device plugin?


easeway commented May 11, 2022

Thanks @elezar for the explanation!

Regarding GKE, we followed https://cloud.google.com/kubernetes-engine/docs/how-to/gpus; we didn't dig deeper into what's configured on the VMs, and we didn't do anything specific to them.


elezar commented May 12, 2022

@easeway the default GKE installation does not use the NVIDIA Container Toolkit, which would explain the different experience there. We are working on aligning things better across the cloud providers, including better support for nested containers.


easeway commented May 12, 2022

@elezar Thanks! I'm looking forward to it!

@bsilver8192

For reference, I ran into this same problem while trying to use Bazel's linux-sandbox. Unfortunately I don't have a solution, but here's some info about what's happening that might help.

I think the problem is that the kernel enforces this (from mount(2)):

       EINVAL In an unprivileged mount namespace (i.e., a mount namespace
              owned by a user namespace that was created by an unprivileged
              user), a bind mount operation (MS_BIND) was attempted without
              specifying (MS_REC), which would have revealed the filesystem
              tree underneath one of the submounts of the directory being
              bound.

Even though this isn't technically a bind mount, it has the same effect, so I can see how it makes sense to enforce the restriction. I can't find any documentation about it though.
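The quoted EINVAL case can be shown on an ordinary host with a quick sketch (assuming /proc has at least one submount, e.g. binfmt_misc, and that /mnt exists and is empty):

    # In a new unprivileged user+mount namespace, a non-recursive bind of /proc
    # fails because it would reveal what its locked submounts hide, while a
    # recursive bind succeeds.
    unshare --user --map-root-user --mount sh -c '
        mount --bind  /proc /mnt    # fails with EINVAL
        mount --rbind /proc /mnt    # succeeds: submounts come along
    '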

opencontainers/runc#1658 (comment) (and other discussion in that bug) is the best discussion I could find of the history behind this limitation. That discussion links to some people saying that a fresh mount (which this project and Bazel both attempt) works, but that does not seem to be true with the kernel versions I tried.

@Ryang20718

Made a PR to submit a patch to Bazel, @bsilver8192:

bazelbuild/bazel#18069

@bsilver8192

I didn't post it here because it didn't work in the end, but I attempted the same approach as @Ryang20718 in bazelbuild/bazel#17574 and concluded it was fundamentally broken and wouldn't work (sorry for the duplicate work). bazelbuild/bazel#17574 (comment) has some of my thoughts on workarounds.

elezar transferred this issue from NVIDIA/nvidia-docker on Nov 27, 2023