libpod: do not chmod bind mounts #23032

giuseppe · 2024-06-18T15:07:35Z

When the new mount API is available, the OCI runtime doesn't require each parent directory of a bind mount to be accessible. Instead, the mount source is opened in the initial user namespace and passed down to the container init process.

This requires that the kernel supports the new mount API and that the OCI runtime uses it.
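As a rough illustration of the "opened in the initial user namespace and passed down" idea, here is a minimal sketch that asks the kernel for a detached copy of a mount tree with open_tree(2). It is not the code used by Podman or by any OCI runtime. The helper name `openDetachedTree` is hypothetical, and the syscall number 428 and the `OPEN_TREE_CLONE` value are hard-coded assumptions for common architectures such as x86_64 and arm64, since the standard `syscall` package does not export them.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

const (
	// Assumption: open_tree(2) is syscall 428 on common architectures.
	sysOpenTree = 428
	// OPEN_TREE_CLONE from <linux/mount.h>: return a detached copy of the mount.
	openTreeClone = 0x1
)

// openDetachedTree (hypothetical helper) returns a file descriptor that refers
// to a detached clone of the mount tree at path. The caller only needs access
// to path at the time of the call; whoever receives the descriptor later does
// not have to traverse the parent directories again.
func openDetachedTree(path string) (int, error) {
	p, err := syscall.BytePtrFromString(path)
	if err != nil {
		return -1, err
	}
	dirfd := -100 // AT_FDCWD: resolve relative paths against the working directory
	// OPEN_TREE_CLOEXEC has the same value as O_CLOEXEC.
	flags := openTreeClone | syscall.O_CLOEXEC
	fd, _, errno := syscall.Syscall(sysOpenTree, uintptr(dirfd),
		uintptr(unsafe.Pointer(p)), uintptr(flags))
	if errno != 0 {
		return -1, fmt.Errorf("open_tree %q: %w", path, errno)
	}
	return int(fd), nil
}

func main() {
	fd, err := openDetachedTree("/tmp")
	if err != nil {
		// Expected without CAP_SYS_ADMIN or on kernels older than 5.2.
		fmt.Fprintln(os.Stderr, "new mount API not usable here:", err)
		return
	}
	defer syscall.Close(fd)
	fmt.Println("detached mount fd:", fd)
}
```

The call needs CAP_SYS_ADMIN in the caller's user namespace, so running it unprivileged is expected to print the error branch. How a runtime hands the descriptor to the container init process is outside the scope of this sketch.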

Closes: #23028

Does this PR introduce a user-facing change?

Podman now requires the new kernel mount API (available since Linux 5.2) to configure containers in a new user namespace where the current user is not part of the mapping.

openshift-ci · 2024-06-18T15:07:44Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe

Needs approval from an approver in each of these files:

~~OWNERS~~ [giuseppe]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Luap99

LGTM, but just to be sure do we know if runc makes use of the new API by default?

Luap99 · 2024-06-18T15:58:51Z

Looks like it might be needed after all.

giuseppe · 2024-06-18T20:31:14Z

I was overoptimistic :)

I've pushed a new version that really deals only with the bind mounts, and this works well with runc too.

I've marked the PR as a Draft for now; I want to take a closer look at what we can do with the remaining occurrences.

Luap99 · 2024-06-19T08:11:45Z

So thinking about this more, I assume the OCI runtimes try to use the new mount API by default but then silently fall back to the old one if the new one doesn't work? Doesn't this mean that we break the userns containers on old distros or even just in limited environments (e.g. are there any seccomp profiles blocking the mount API but not the old mount syscall)? As such, would it make sense to check in podman whether the syscall exists and works first, and only do the chmod if it does not?
The alternative would be to document that we only work on kernels with the new mount API (it has been around for a while, so maybe this is not a problem in practice).
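The check suggested in this comment could look roughly like the sketch below. It is only an illustration, not code from Podman: the function names are hypothetical, the syscall number 428 for open_tree(2) is a hard-coded assumption for common architectures (x86_64, arm64), and treating EPERM as "blocked by a seccomp profile" is a simplification.

```go
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// Assumption: open_tree(2) is syscall 428 on common architectures. The
// standard syscall package does not export this number.
const sysOpenTree = 428

// mountAPIUsable interprets the errno of a harmless open_tree(2) call.
// ENOSYS means the kernel does not implement the syscall. EPERM is treated
// as "blocked", for example by a seccomp profile. Any other errno means the
// kernel dispatched the call.
func mountAPIUsable(errno syscall.Errno) bool {
	return errno != syscall.ENOSYS && errno != syscall.EPERM
}

// probeOpenTree calls open_tree(-1, "", 0). With an invalid directory fd and
// an empty path the call is not expected to succeed, so only the errno is of
// interest.
func probeOpenTree() syscall.Errno {
	// Cannot fail for a string without NUL bytes; EINVAL is a placeholder.
	empty, err := syscall.BytePtrFromString("")
	if err != nil {
		return syscall.EINVAL
	}
	fd, _, errno := syscall.Syscall(sysOpenTree, ^uintptr(0),
		uintptr(unsafe.Pointer(empty)), 0)
	if errno == 0 {
		// Not expected, but do not leak the descriptor if it happens.
		syscall.Close(int(fd))
	}
	return errno
}

func main() {
	errno := probeOpenTree()
	fmt.Printf("open_tree probe: errno=%d usable=%v\n", int(errno), mountAPIUsable(errno))
}
```

A probe like this only tells whether the calling process can reach the syscall. It says nothing about whether the configured OCI runtime actually uses the new API, which is the other half of the concern raised here.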

giuseppe · 2024-06-19T22:01:05Z

> So thinking about this more, I assume the OCI runtimes try to use the new mount API by default but then silently fall back to the old one if the new one doesn't work? Doesn't this mean that we break the userns containers on old distros or even just in limited environments (e.g. are there any seccomp profiles blocking the mount API but not the old mount syscall)? As such, would it make sense to check in podman whether the syscall exists and works first, and only do the chmod if it does not? The alternative would be to document that we only work on kernels with the new mount API (it has been around for a while, so maybe this is not a problem in practice).

I originally added the makeAccessible logic exactly for this reason, since the mount API was not available everywhere. fsopen(2) was added in Linux 5.2, which was released on 7 July 2019. IMO, at this point we can safely assume the new API is available everywhere we care about. I wouldn't mind keeping the fallback, since we already have the code, but it is quite invasive as we chmod different directories.

Still marked as a Draft as I need to polish the commits; I pushed early only to trigger the CI.
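To make the "quite invasive" point concrete, the sketch below lists every directory such a fallback would have to touch for a single bind mount source, and which search (execute) bits each one may be missing. It is a read-only illustration under assumed names (`parentChain`, `missingSearchBits`) and an example path; it is not the makeAccessible implementation.

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
)

// parentChain returns path followed by each of its parents, stopping before
// the filesystem root (or before "." for relative paths).
func parentChain(path string) []string {
	var dirs []string
	for p := filepath.Clean(path); p != "/" && p != "."; p = filepath.Dir(p) {
		dirs = append(dirs, p)
	}
	return dirs
}

// missingSearchBits reports which of the user/group/other execute bits are
// absent from mode. A fallback that opens up directories for an unmapped user
// would have to add exactly these bits.
func missingSearchBits(mode fs.FileMode) fs.FileMode {
	return 0o111 &^ mode.Perm()
}

func main() {
	// Hypothetical bind mount source below a private home directory.
	for _, dir := range parentChain("/home/user/.local/share/volume") {
		fmt.Println("would need to be searchable:", dir)
	}
	// A directory with mode 0700 lacks the group and other search bits.
	fmt.Printf("bits to add for mode 0700: %04o\n", uint32(missingSearchBits(0o700)))
}
```

Every entry in that chain is a directory the user may have deliberately kept private, which is the problem reported in #23028.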

Luap99 · 2024-06-20T08:28:47Z

I am fine with not having the fallback, given the API has been out for that long, as long as we document in the release notes that we require the new mount API (kernel 5.2).

mheon · 2024-06-20T18:41:26Z

LGTM

Luap99

I am very unsure whether this can even work like that. CI will not catch issues with parallel or long-running containers, so I wouldn't trust it too much.

I think I have to do some manual testing to check.

libpod/container_internal.go

Luap99 · 2024-06-21T10:14:52Z

libpod/oci_conmon_linux.go

+				if err := unix.Mount("", parentDir, "", unix.MS_PRIVATE, ""); err != nil {
+					return 0, fmt.Errorf("making intermediate parent directory private for container %s: %w", ctr.ID(), err)
+				}


If we make this mount private, doesn't this mean that users can no longer receive additional mounts in their mounted volumes (assuming they specified the shared or slave option for the volume)?

For example, assume the container has /tmp/test mounted and then someone on the host mounts /tmp/test/mnt: is that mount no longer forwarded? Or am I missing something? I have not tested this yet.

No, this only prevents mounts below parentDir from being propagated; everything else keeps its original propagation.

The PR was still marked as a Draft because I was not happy with the hack above, and it was still propagating a new mount to the parent mount namespace. I've submitted a new version that doesn't require the double mount.
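One way to verify propagation claims like the ones in this thread is to look at the optional fields of /proc/self/mountinfo, where the kernel reports `shared:N` and `master:N` tags per mount. The sketch below parses a single mountinfo line; the function name and the sample lines are made up for illustration, and a mount without any tag is a private one (or unbindable, which shows up as its own tag).

```go
package main

import (
	"fmt"
	"strings"
)

// propagation extracts the mount point and the optional propagation tags
// (for example "shared:7" or "master:1") from one /proc/self/mountinfo line.
// ok is false when the line does not follow the documented layout.
func propagation(line string) (mountPoint string, tags []string, ok bool) {
	fields := strings.Fields(line)
	// Fixed fields: mount ID, parent ID, major:minor, root, mount point,
	// mount options. Optional fields follow until a lone "-" separator.
	if len(fields) < 7 {
		return "", nil, false
	}
	for _, f := range fields[6:] {
		if f == "-" {
			return fields[4], tags, true
		}
		tags = append(tags, f)
	}
	return "", nil, false // separator missing: malformed line
}

func main() {
	// Made-up sample lines in mountinfo format.
	lines := []string{
		"36 35 98:0 / /tmp/test rw,noatime shared:7 master:1 - ext4 /dev/sda1 rw",
		"40 35 0:33 / /run/intermediate rw - tmpfs tmpfs rw",
	}
	for _, l := range lines {
		mp, tags, ok := propagation(l)
		if !ok {
			continue
		}
		if len(tags) == 0 {
			fmt.Printf("%s: private\n", mp)
		} else {
			fmt.Printf("%s: %s\n", mp, strings.Join(tags, ","))
		}
	}
}
```

Running the same parser over the real /proc/self/mountinfo before and after the container starts would show whether the user's volume kept its `shared` or `master` tag while only the intermediate directory became private.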

Luap99 · 2024-06-21T10:17:22Z

libpod/oci_conmon_linux.go

+				if err := unix.Mount("", parentDir, "", unix.MS_PRIVATE, ""); err != nil {
+					return 0, fmt.Errorf("making intermediate parent directory private for container %s: %w", ctr.ID(), err)
+				}
+				if err := unix.Mount(ctr.state.Mountpoint, rootPath, "", unix.MS_BIND, ""); err != nil {


AFAIK, if a user removes the target of a bind mount, the mount is just dropped.
This means this special tmpdir can never be removed without breaking the running containers, which seems like very surprising behaviour.
And nothing prevents systemd-tmpfiles from removing it, other than the timestamp update when a new container is started, which doesn't really help if you only have a single long-running container.

That is not what I observe when deleting the directory manually or when inspecting the kernel code.

From what I can see, removing a dentry (through rmdir for directories or unlink for files) causes any mount on that dentry to be lazily unmounted (through detach_mounts() in the kernel), so the mount stays referenced as long as something is using it. So it is fine if the mount point gets removed.

I've not marked the directory as sticky (which would make systemd-tmpfiles ignore it), so it will eventually be cleaned up if it goes unused for too long; we just make sure this won't happen while we are setting up the mount.

    # podman image mount fedora
    /var/lib/containers/storage/overlay/1169780961bbebe13753267b1b2c1e720531fe8fb95ffcfc9a51996c4af3f743/merged
    # mkdir /tmp/fedora
    # mount --bind /var/lib/containers/storage/overlay/1169780961bbebe13753267b1b2c1e720531fe8fb95ffcfc9a51996c4af3f743/merged /tmp/fedora/
    # unshare -m
    # cat /proc/self/mountinfo | grep /tmp^C
    # pivot_root . .
    # cd /
    # ls
    afs  boot  etc   lib    media  opt   root  sbin  sys  usr
    bin  dev   home  lib64  mnt    proc  run   srv   tmp  var

from another terminal:

    # umount /tmp/fedora
    # rmdir /tmp/fedora
    # podman image umount fedora

but the previous mount point still works:

    # echo hello
    hello
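The timestamp refresh mentioned in this thread (keeping an age-based cleaner such as systemd-tmpfiles from removing the directory while a mount is being set up) can be sketched as follows. The helper name `touchDir` and the temporary directory prefix are assumptions for illustration; this is not the code from the PR.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// touchDir sets the access and modification times of dir to the current time,
// so that a cleaner which removes entries by age sees it as recently used.
func touchDir(dir string) error {
	now := time.Now()
	if err := os.Chtimes(dir, now, now); err != nil {
		return fmt.Errorf("refreshing timestamps of %s: %w", dir, err)
	}
	return nil
}

func main() {
	// Stand-in for the intermediate mount directory.
	dir, err := os.MkdirTemp("", "intermediate-mount-")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer os.RemoveAll(dir)

	if err := touchDir(dir); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	st, err := os.Stat(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("mtime is now:", st.ModTime().Format(time.RFC3339))
}
```

As discussed above, this only protects the window in which the mount is created. Once the container is running, the lazily detached mount no longer depends on the directory being present.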

if the current user is not mapped into the new user namespace, use an intermediate mount to allow the mount point to be accessible instead of opening up all the parent directories for the mountpoint.

Closes: containers#23028

Signed-off-by: Giuseppe Scrivano <[email protected]>

rhatdan · 2024-06-22T13:42:46Z

/lgtm

giuseppe added the "No New Tests" label (allow PR to proceed without adding regression tests) Jun 18, 2024

openshift-ci bot added the release-note-none label Jun 18, 2024

openshift-ci bot added the "approved" label (indicates a PR has been approved by an approver from all required OWNERS files) Jun 18, 2024

giuseppe mentioned this pull request Jun 18, 2024

podman tries to make user-executable directory world-executable (rootless) #23028

Closed

Luap99 reviewed Jun 18, 2024

giuseppe force-pushed the drop-make-accessible branch from 688ce4d to e14ae9a Compare June 18, 2024 20:29

giuseppe marked this pull request as draft June 18, 2024 20:29

openshift-ci bot added the "do-not-merge/work-in-progress" label (indicates that a PR should not merge because it is a work in progress) Jun 18, 2024

giuseppe force-pushed the drop-make-accessible branch from e14ae9a to a0f4a10 Compare June 18, 2024 21:11

giuseppe force-pushed the drop-make-accessible branch 3 times, most recently from 6cddb7b to da69dfc Compare June 19, 2024 22:00

giuseppe force-pushed the drop-make-accessible branch from da69dfc to aec1da4 Compare June 20, 2024 08:15

giuseppe force-pushed the drop-make-accessible branch 4 times, most recently from 304d223 to db64e46 Compare June 20, 2024 13:42

giuseppe marked this pull request as ready for review June 21, 2024 07:56

openshift-ci bot added the "release-note" label and removed the "do-not-merge/work-in-progress" and "release-note-none" labels Jun 21, 2024

giuseppe marked this pull request as draft June 21, 2024 09:18

openshift-ci bot added the "do-not-merge/work-in-progress" label Jun 21, 2024

Luap99 requested changes Jun 21, 2024

giuseppe force-pushed the drop-make-accessible branch 2 times, most recently from bdc8183 to 61fb607 Compare June 21, 2024 13:44

giuseppe marked this pull request as ready for review June 21, 2024 14:13

openshift-ci bot removed the "do-not-merge/work-in-progress" label Jun 21, 2024

giuseppe added 4 commits June 21, 2024 18:01

libpod: unlock the thread if possible

094bc67

Signed-off-by: Giuseppe Scrivano <[email protected]>

libpod: avoid chowning the rundir to root in the userns

08a8429

so it is possible to remove the code to make the entire directory world accessible. Signed-off-by: Giuseppe Scrivano <[email protected]>

giuseppe force-pushed the drop-make-accessible branch from 61fb607 to 49eb5af Compare June 21, 2024 16:01

openshift-ci bot assigned rhatdan Jun 22, 2024

openshift-ci bot added the "lgtm" label (indicates that a PR is ready to be merged) Jun 22, 2024

openshift-merge-bot bot merged commit 25bc426 into containers:main Jun 22, 2024
90 checks passed

Luap99 mentioned this pull request Jun 25, 2024

test/system: debug file leaks #23100

Draft

eyezak mentioned this pull request Sep 5, 2024

Rootful podman with --userns=auto fails to run a container, regression in 5.2.0+. #23877

Closed

stale-locking-app bot added the "locked - please file new issue/PR" label (assist humans wanting to comment on an old issue or PR with locked comments) Sep 21, 2024

stale-locking-app bot locked as resolved and limited conversation to collaborators Sep 21, 2024
