
Use Recursive Mount Option when attempting to mount proc in linux sandbox #18069

Conversation


Ryang20718 commented Apr 13, 2023

Currently on Bazel 6.0.0

Problem: Because the NVIDIA runtime mounts over /proc, running Bazel inside a Docker container hits

src/main/tools/linux-sandbox-pid1.cc:441: "mount": Operation not permitted

We can see that there are nested proc mounts:

unshare --mount --map-root-user --pid --fork
# mount | grep proc
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /proc/driver/nvidia type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=555,inode64)
proc on /proc/driver/nvidia/gpus/0000:b3:00.0 type proc (ro,nosuid,nodev,noexec,relatime)

While I know this is an NVIDIA problem and limited to local execution, it would be nice to be able to use linux-sandbox within a Docker container with access to the NVIDIA runtime.

Proposal:
Add a recursive bind option when mounting proc.
Given that proc is mounted read-only, I don't think the recursive mount would be an issue?
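
For illustration, here is a minimal sketch (not the actual diff in this PR) of what the recursive option could look like, assuming the sandbox currently sets up /proc with the plain mount("/proc", "/proc", "proc", MS_NODEV | MS_NOEXEC | MS_NOSUID, nullptr) call quoted later in this thread; the MountProcRecursively name and the perror-based error handling are hypothetical:

#include <sys/mount.h>
#include <cstdio>

// Sketch only: instead of mounting a fresh procfs instance, recursively bind
// the existing /proc (including submounts such as /proc/driver/nvidia/...)
// into the sandbox's mount namespace.
static int MountProcRecursively() {
  if (mount("/proc", "/proc", nullptr, MS_BIND | MS_REC, nullptr) < 0) {
    std::perror("recursive bind mount of /proc");
    return -1;
  }
  return 0;
}

One consequence of a bind mount, discussed further down in this thread, is that the bound /proc keeps reflecting the PID namespace it was originally mounted in, rather than the sandbox's new PID namespace.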


google-cla bot commented Apr 13, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Ryang20718 force-pushed the user/ryang/add_recursive_bind_mount_linux_sandbox branch from 6ee5f0f to 9d0dcbc on April 13, 2023 00:17
Ryang20718 force-pushed the user/ryang/add_recursive_bind_mount_linux_sandbox branch from 9d0dcbc to 055f7ad on April 13, 2023 00:22
Ryang20718 marked this pull request as ready for review on April 13, 2023 00:22
github-actions bot added the awaiting-review label on Apr 13, 2023
sgowroji added the team-Local-Exec and awaiting-user-response labels and removed the awaiting-review label on Apr 13, 2023
@sgowroji
Member

Hi @Ryang20718, can you please fix the above Buildkite failures? Thanks!

meisterT requested a review from larsrc-google on April 13, 2023 10:15
@larsrc-google
Contributor

I'm a bit skittish about such a global change for a very specific case. Before making this change for everyone, could you check if you can use some variation of --sandbox_add_mount_pair to work around this?

@Ryang20718
Author

Ryang20718 commented Apr 13, 2023

I'm a bit skittish about such a global change for a very specific case. Before making this change for everyone, could you check if you can use some variation of --sandbox_add_mount_pair to work around this?

Fully understand the skittish concerns 😓

I had originally tried --sandbox_add_mount_pair=/proc/driver:/proc/driver in an attempt to let linux-sandbox bypass these nested proc mounts. However, it didn't work: linux-sandbox still attempts to mount /proc, hits the nested mounts, and fails, so the sandbox is never used. Based on the implementation, /proc/driver would still be visible from linux-sandbox's perspective, so the mount pair doesn't help:

tmpfs on /proc/driver/nvidia type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=555,inode64)
proc on /proc/driver/nvidia/gpus/0000:b3:00.0 type proc (ro,nosuid,nodev,noexec,relatime)

@Ryang20718
Author

Ryang20718 commented Apr 14, 2023

Hi @Ryang20718, can you please fix the above Buildkite failures? Thanks!

The failing test is due to the PID seen under /proc/self not matching the PID of the exec'd cat stat process.
/proc/self is a symbolic link, and the nested mounts under /proc cause the two PIDs to differ. Hence the failed test.

Wondering if this test behaviour is necessary for some behaviour of Bazel, or if it is a test that was written when the /proc mount was first added to the code? @sgowroji

If this behaviour isn't necessary, I can remove the last check of this test.
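
For illustration, a hypothetical stand-alone check (not the actual Bazel test) of the property the failing assertion relies on, namely that inside the sandbox the PID in /proc/self/stat equals getpid():

#include <unistd.h>
#include <cstdio>

// With a freshly mounted procfs, the first field of /proc/self/stat equals
// getpid(); with the host's /proc recursively bind-mounted into the sandbox,
// /proc/self resolves against the host PID namespace and the values disagree.
int main() {
  long proc_self_pid = -1;
  if (FILE* f = std::fopen("/proc/self/stat", "r")) {
    if (std::fscanf(f, "%ld", &proc_self_pid) != 1) proc_self_pid = -1;
    std::fclose(f);
  }
  std::printf("getpid()=%ld /proc/self/stat pid=%ld match=%s\n",
              static_cast<long>(getpid()), proc_self_pid,
              proc_self_pid == static_cast<long>(getpid()) ? "yes" : "no");
  return 0;
}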

sgowroji added the awaiting-review label and removed the awaiting-user-response label on Apr 15, 2023
@larsrc-google
Contributor

I think that test is a very good canary in this goldmine. There are many places that use /proc/self or /proc/$PID or similar. If those no longer agree, really strange things are going to happen! Not to mention other things under /proc that might change.

@larsrc-google
Contributor

Is the failure to mount with --sandbox_add_mount_pair=/proc/driver:/proc/driver (or probably better --sandbox_add_mount_pair=/proc/driver/nvidia:/proc/driver/nvidia) possibly due to a question of mount ordering? If so, we could probably figure out a way to make the ordering more flexible.

@Ryang20718
Author

Ryang20718 commented Apr 17, 2023

Is the failure to mount with --sandbox_add_mount_pair=/proc/driver:/proc/driver (or probably better --sandbox_add_mount_pair=/proc/driver/nvidia:/proc/driver/nvidia) possibly due to a question of mount ordering? If so, we could probably figure out a way to make the ordering more flexible.

Both /proc/driver/nvidia:/proc/driver/nvidia and /proc/driver:/proc/driver cause Bazel to fail to use linux-sandbox when it attempts to mount /proc. I believe the underlying reason is that /proc is already mounted in the parent namespace; when we call clone, a new mount namespace is created for the child, and when the child then attempts to mount /proc over the nested mounts, it fails with permission denied. (A standalone reproduction sketch follows the ls output below.)
@larsrc-google

# ls -ld /proc/driver
dr-xr-xr-x 8 root root 0 Apr 17 18:26 /proc/driver
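
For reference, a minimal stand-alone reproduction sketch (not Bazel code) of that permission failure, mirroring the unshare example at the top of this PR; a full reproduction may also need the uid/gid mapping that unshare --map-root-user sets up, which is omitted here:

#include <sched.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
  // New user, mount, and PID namespaces, roughly `unshare --mount --pid --fork`
  // plus an unprivileged user namespace.
  if (unshare(CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWPID) < 0) {
    std::perror("unshare");
    return EXIT_FAILURE;
  }
  pid_t child = fork();  // The new PID namespace only applies to children.
  if (child < 0) {
    std::perror("fork");
    return EXIT_FAILURE;
  }
  if (child == 0) {
    // Fails with EPERM when the parent's /proc has submounts (such as
    // /proc/driver/nvidia/...) that a fresh procfs instance would hide.
    if (mount("/proc", "/proc", "proc",
              MS_NODEV | MS_NOEXEC | MS_NOSUID, nullptr) < 0) {
      std::perror("mount /proc");
      _exit(EXIT_FAILURE);
    }
    _exit(EXIT_SUCCESS);
  }
  int status = 0;
  waitpid(child, &status, 0);
  return WIFEXITED(status) ? WEXITSTATUS(status) : EXIT_FAILURE;
}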

@bsilver8192
Contributor

I had the same idea, ran into the same problem, and concluded there is no way to make it work. See #17574 (comment) for some further ideas.

@Ryang20718
Author

Oh interesting, bind-mounting the volume into the Docker container under a different directory fixes the permission issue!
Thanks @bsilver8192

Ryang20718 closed this on Apr 19, 2023
Ryang20718 reopened this on Apr 19, 2023
@Ryang20718
Author

Ryang20718 commented Apr 19, 2023

@bsilver8192 I was testing the mount fix with a privileged Docker container by volume-mounting /proc somewhere else, e.g. -v /proc:/proc2, or by first creating /tmp/proc2 and then mounting proc onto /tmp/proc2. This does fix the failing mount (mount("/proc", "/proc", "proc", MS_NODEV | MS_NOEXEC | MS_NOSUID, nullptr) < 0).

Confirming that the linux-sandbox mount works:

$(bazel info install_base)/linux-sandbox -D /bin/true
...
1681924823.215671162: src/main/tools/linux-sandbox-pid1.cc:538: wait returned pid=2, status=0x00
1681924823.215691776: src/main/tools/linux-sandbox-pid1.cc:556: child exited normally with code 0
1681924823.216466569: src/main/tools/linux-sandbox.cc:233: child exited normally with code 0

However, we still hit issues with the following unless specifying --network=host. Wondering if you had to work around that at all?

import ray

ray.init(num_gpus=1)

...
Unable to connect to GCS at 172.17.0.2:63842. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
  • processwrapper-sandbox + Docker container w/ NVIDIA runtime works (without specifying --network=host)
  • processwrapper-sandbox + Docker container w/ -v /proc:/host_proc && NVIDIA runtime works (without specifying --network=host)
  • linux-sandbox w/o Docker works
  • linux-sandbox + Docker container w/ -v /proc:/host_proc && NVIDIA runtime fails unless specifying a host network, so there must be something I'm missing?

Running the same code outside the container w/ linux-sandbox works just fine, so I don't believe it's a linux-sandbox issue; rather, it's some combination of the NVIDIA Docker runtime + linux-sandbox.

Ryang20718 closed this on Apr 19, 2023
@bsilver8192
Contributor

I have --sandbox_default_allow_network=false turned on deliberately, so I don't expect the network to work.
