
Nested container can't start #168

Open
easeway opened this issue May 11, 2022 · 7 comments


easeway commented May 11, 2022

1. Issue or feature description

On an AWS EKS g4dn.xlarge node, inside a privileged container requesting a GPU resource, a nested container fails with the error:

mount "proc" to "/proc": Operation not permitted

2. Steps to reproduce the issue

  • Create an EKS cluster with g4dn.xlarge nodes and the proper k8s labels on the nodes;
  • Create a privileged Pod (a container image like ubuntu:22.04 works) that claims a GPU resource;
  • Inside the Pod, install an OCI runtime (e.g. apt-get install runc);
  • Prepare a minimal rootfs;
  • Create an OCI spec that creates all new namespaces: user, ipc, mount, net, uts, cgroup, etc.;
  • Add a "proc" mount to "/proc";
  • Run a container using that OCI spec.

To reproduce this issue, using unshare and mount -N may be simpler than writing a full OCI spec.
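For example, a rough sketch of that simpler repro from a shell inside the privileged, GPU-attached Pod (the exact unshare flags are my assumption; the point is attempting a fresh proc mount from new user and mount namespaces):

    # Enter new user, mount and PID namespaces, then try to mount a fresh proc.
    unshare --user --map-root-user --mount --pid --fork \
        sh -c 'mount -t proc proc /proc'
    # Expected to fail with EPERM (Operation not permitted) while the
    # NVIDIA-created overmounts under /proc/driver/nvidia are present.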

3. Root cause

The cause of the `mount "proc" to "/proc": Operation not permitted` error is that the NVIDIA container runtime creates the following mountpoints in the outer container:

  • /proc/driver/nvidia/gpus/BUS/...
  • /proc/driver/nvidia

After unmounting these mountpoints, the nested container starts without issue.
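A sketch of that workaround, run inside the outer container before starting the nested one (the paths under /proc/driver/nvidia/gpus/ vary by GPU bus ID, so they are discovered from mountinfo rather than hard-coded):

    # Find the NVIDIA-created overmounts under /proc (field 5 of mountinfo is
    # the mount point) and unmount them, deepest paths first.
    awk '$5 ~ /^\/proc\/driver\/nvidia/ {print $5}' /proc/self/mountinfo \
        | sort -r \
        | while read -r mnt; do umount "$mnt"; done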

4. Thoughts

I'm not sure why the NVIDIA container runtime creates mountpoints under "/proc". Based on observation, without these mountpoints the files such as /proc/driver/nvidia/gpus/... and /proc/driver/nvidia are still visible and accessible to the Pod. Is this for isolation, in case there are multiple GPU devices on the system, so that the Pod only sees the devices allocated to it?

We also experimented on GKE, which doesn't have this issue; we don't see the mountpoints under /proc there.


elezar commented May 11, 2022

The NVIDIA Container CLI ensures that only the proc paths for the requested devices are mounted into the container. The /proc/driver/nvidia/params file is also updated to ensure that tools such as nvidia-smi don't create device nodes for devices that were not requested.
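For reference, from inside a container started via the toolkit the overmounts and the adjusted params file can be inspected like this (a diagnostic sketch; the exact paths and values depend on the driver version and the devices requested):

    # Show the proc entries mounted over the host's /proc/driver/nvidia tree.
    grep ' /proc/driver/nvidia' /proc/self/mountinfo
    # The overmounted params file is expected to disable device-node creation,
    # e.g. ModifyDeviceFiles: 0 rather than the host's default.
    grep ModifyDeviceFiles /proc/driver/nvidia/params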

Since you mention GKE: did you install the NVIDIA Container Runtime there, or are you launching a pod using their device plugin?


easeway commented May 11, 2022

Thanks @elezar for the explanation!

Regarding GKE, we followed https://cloud.google.com/kubernetes-engine/docs/how-to/gpus; we didn't dig deeper into what's configured on the VMs, and we didn't do anything specific to them.


elezar commented May 12, 2022

@easeway the default GKE installation does not use the NVIDIA Container Toolkit, which would explain the different experience there. We are working on aligning things better across the cloud providers, including better support for nested containers.


easeway commented May 12, 2022

@elezar Thanks! I'm looking forward to it!

@bsilver8192

For reference, I ran into this same problem while trying to use Bazel's linux-sandbox. Unfortunately I don't have a solution, but here's some info about what's happening that might help.

I think the problem is that the kernel enforces this (from mount(2)):

       EINVAL In an unprivileged mount namespace (i.e., a mount namespace
              owned by a user namespace that was created by an unprivileged
              user), a bind mount operation (MS_BIND) was attempted without
              specifying (MS_REC), which would have revealed the filesystem
              tree underneath one of the submounts of the directory being
              bound.

Even though this isn't technically a bind mount, it has the same effect, so I can see how it makes sense to enforce the restriction. I can't find any documentation about it though.
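The quoted EINVAL case can be shown on an ordinary host with a quick sketch (assuming /proc has at least one submount, e.g. binfmt_misc, and that /mnt exists and is empty):

    # In a new unprivileged user+mount namespace, a non-recursive bind of /proc
    # fails because it would reveal what its locked submounts hide, while a
    # recursive bind succeeds.
    unshare --user --map-root-user --mount sh -c '
        mount --bind  /proc /mnt    # fails with EINVAL
        mount --rbind /proc /mnt    # succeeds: submounts come along
    '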

opencontainers/runc#1658 (comment) (and other discussion in that bug) is the best discussion I could find of the history behind this limitation. That discussion links to some people saying that a fresh mount (which this project and Bazel both attempt) works, but that does not seem to be true with the kernel versions I tried.

@Ryang20718

Made a PR to submit a patch to Bazel, @bsilver8192:

bazelbuild/bazel#18069

@bsilver8192

I didn't post it here because it didn't work in the end, but I attempted the same approach as @Ryang20718 in bazelbuild/bazel#17574 and concluded it was fundamentally broken and wouldn't work (sorry for the duplicate work). bazelbuild/bazel#17574 (comment) has some of my thoughts on workarounds.

elezar transferred this issue from NVIDIA/nvidia-docker on Nov 27, 2023