Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libpod: do not chmod bind mounts #23032

Merged
merged 4 commits into from
Jun 22, 2024

Conversation

giuseppe
Copy link
Member

@giuseppe giuseppe commented Jun 18, 2024

with the new mount API is available, the OCI runtime doesn't require that each parent directory for a bind mount must be accessible. Instead it is opened in the initial user namespace and passed down to the container init process.

This requires that the kernel supports the new mount API and that the OCI runtime uses it.

Closes: #23028

Does this PR introduce a user-facing change?

Now Podman requires the new kernel mount API (available since Linux 5.2) to configure containers in a new user namespace where the current user is not part of the mapping

@giuseppe giuseppe added the No New Tests Allow PR to proceed without adding regression tests label Jun 18, 2024
Copy link
Contributor

openshift-ci bot commented Jun 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 18, 2024
Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but just to be sure do we know if runc makes use of the new API by default?

@Luap99
Copy link
Member

Luap99 commented Jun 18, 2024

Looks like it might be needed after all.

@giuseppe giuseppe force-pushed the drop-make-accessible branch from 688ce4d to e14ae9a Compare June 18, 2024 20:29
@giuseppe giuseppe marked this pull request as draft June 18, 2024 20:29
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 18, 2024
@giuseppe
Copy link
Member Author

I was over optimistic :)

I've pushed a new version that really deals only with the bind mounts, and this works well with runc too.

I've marked the PR as a Draft for now, I want to check better what we can do with the remaining occurrences

@giuseppe giuseppe force-pushed the drop-make-accessible branch from e14ae9a to a0f4a10 Compare June 18, 2024 21:11
@Luap99
Copy link
Member

Luap99 commented Jun 19, 2024

So thinking about this more I assume the oci runtimes try to use the new mount API by default but then silently fall back to the old one if the new one doesn't work? Doesn't this mean that we break the userns containers on old distros or even just limited envs (i.e. are there any seccomp profiles blocking the mount API but not the old mount syscall)? As such would it make sense to maybe check in podman if the syscall exists/works first only only do the chmod if it does not?
The alternative would be we have to document that we only work on kernels with the new mount API (it has been around for a while so maybe this is not a problem in practise).

@giuseppe giuseppe force-pushed the drop-make-accessible branch 3 times, most recently from 6cddb7b to da69dfc Compare June 19, 2024 22:00
@giuseppe
Copy link
Member Author

So thinking about this more I assume the oci runtimes try to use the new mount API by default but then silently fall back to the old one if the new one doesn't work? Doesn't this mean that we break the userns containers on old distros or even just limited envs (i.e. are there any seccomp profiles blocking the mount API but not the old mount syscall)? As such would it make sense to maybe check in podman if the syscall exists/works first only only do the chmod if it does not? The alternative would be we have to document that we only work on kernels with the new mount API (it has been around for a while so maybe this is not a problem in practise).

I've originally added the makeAccessible logic exactly for this reason, since the mount API was not available everywhere. fsopen(2) was added to Linux 5.2, that was released on 7 July 2019. IMO, at this point we can safely assume the new API is available everywhere we care. I'd not mind keeping the fallback, since we already have the code, but it is quite invasive as we chown different directories.

Still marked as a Draft as I need to polish the commits, pushed early only to trigger the CI

@giuseppe giuseppe force-pushed the drop-make-accessible branch from da69dfc to aec1da4 Compare June 20, 2024 08:15
@Luap99
Copy link
Member

Luap99 commented Jun 20, 2024

I am fine with not having the fallback if it is out for that long as long as we document this in the release note that we require the new mount API (kernel 5.2).

@giuseppe giuseppe force-pushed the drop-make-accessible branch 4 times, most recently from 304d223 to db64e46 Compare June 20, 2024 13:42
@mheon
Copy link
Member

mheon commented Jun 20, 2024

LGTM

@giuseppe giuseppe marked this pull request as ready for review June 21, 2024 07:56
@openshift-ci openshift-ci bot added release-note and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none labels Jun 21, 2024
@giuseppe giuseppe marked this pull request as draft June 21, 2024 09:18
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 21, 2024
Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very unsure if this can even work like that. CI will not catch parallel or long running container issues so I wouldn't trust it to much.

I think I have to so some manual testing to check.

Comment on lines 70 to 95
if err := unix.Mount("", parentDir, "", unix.MS_PRIVATE, ""); err != nil {
return 0, fmt.Errorf("making intermediate parent directory private for container %s: %w", ctr.ID(), err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make this mount private doesn't this mean a users can no longer receive additional mounts in their mounted volumes (assuming they specified the shared,slave option for the volume)

i.e. assume the container has /tmp/test mounted and then on the host someone mounts /tmp/test/mnt then it no longer gets forwarded? Or am I missing something? I have not tested this yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, this prevents only mounts below parentDir to not be propagated, everything else maintains their original propagation.

The PR was still marked Draft because I was not happy with the hack above, and it was still propagating a new mount to the parent mount namespace. I've submitted a new version that doesn't require the double mount

if err := unix.Mount("", parentDir, "", unix.MS_PRIVATE, ""); err != nil {
return 0, fmt.Errorf("making intermediate parent directory private for container %s: %w", ctr.ID(), err)
}
if err := unix.Mount(ctr.state.Mountpoint, rootPath, "", unix.MS_BIND, ""); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK if a users removes the target of a bind mount the mount is just dropped.
This means this special tmpdir can never be removed without breaking the running containers which seems very surprising behaviour.
And you never prevent systemd-tmpfiles from removing it besides updating the timestamp when you start a new one which doesn't really help if you have only a really long container running.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is not what I am observing by deleting the directory manually or inspecting the kernel code.

From what I can see, the removal of a dentry (either through rmdir for directories or unlink for files), causes any mount on that dentry to be lazily unmounted (through detach_mounts() in the kernel), so the mount is still referenced as long as there is something using the mount. So it is fine if the mount gets removed.

I've not marked the directory as sticky (in this way systemd-tmpfiles would just ignore it), so it will eventually be cleaned up if not needed for too long, but just make sure this won't happen while we are setting up the mount.

# podman image mount fedora
/var/lib/containers/storage/overlay/1169780961bbebe13753267b1b2c1e720531fe8fb95ffcfc9a51996c4af3f743/merged
# mkdir /tmp/fedora
# mount --bind /var/lib/containers/storage/overlay/1169780961bbebe13753267b1b2c1e720531fe8fb95ffcfc9a51996c4af3f743/merged /tmp/fedora/

# unshare -m 
# cat /proc/self/mountinfo  | grep /tmp^C
# pivot_root . .
# cd /
# ls
afs  boot  etc   lib    media  opt   root  sbin  sys  usr
bin  dev   home  lib64  mnt    proc  run   srv   tmp  var

from another terminal:

# umount /tmp/fedora
# rmdir /tmp/fedora
# podman image umount fedora

but the previous mount point still works:

# echo hello
hello

@giuseppe giuseppe force-pushed the drop-make-accessible branch 2 times, most recently from bdc8183 to 61fb607 Compare June 21, 2024 13:44
@giuseppe giuseppe marked this pull request as ready for review June 21, 2024 14:13
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 21, 2024
giuseppe added 4 commits June 21, 2024 18:01
Signed-off-by: Giuseppe Scrivano <[email protected]>
with the new mount API is available, the OCI runtime doesn't require
that each parent directory for a bind mount must be accessible.
Instead it is opened in the initial user namespace and passed down to
the container init process.

This requires that the kernel supports the new mount API and that the
OCI runtime uses it.

Signed-off-by: Giuseppe Scrivano <[email protected]>
so it is possible to remove the code to make the entire directory
world accessible.

Signed-off-by: Giuseppe Scrivano <[email protected]>
if the current user is not mapped into the new user namespace, use an
intermediate mount to allow the mount point to be accessible instead
of opening up all the parent directories for the mountpoint.

Closes: containers#23028

Signed-off-by: Giuseppe Scrivano <[email protected]>
@giuseppe giuseppe force-pushed the drop-make-accessible branch from 61fb607 to 49eb5af Compare June 21, 2024 16:01
@rhatdan
Copy link
Member

rhatdan commented Jun 22, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 22, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 25bc426 into containers:main Jun 22, 2024
90 checks passed
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 21, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Sep 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. No New Tests Allow PR to proceed without adding regression tests release-note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

podman tries to make user-executable directory world-executable (rootless)
4 participants