libpod: do not chmod bind mounts #23032
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: giuseppe.
LGTM, but just to be sure: do we know if runc makes use of the new API by default?
Looks like it might be needed after all.
Force-pushed from 688ce4d to e14ae9a.
I was over-optimistic :) I've pushed a new version that really deals only with the bind mounts, and this works well with runc too. I've marked the PR as a Draft for now; I want to check better what we can do with the remaining occurrences.
Force-pushed from e14ae9a to a0f4a10.
So, thinking about this more: I assume the OCI runtimes try to use the new mount API by default but then silently fall back to the old one if the new one doesn't work? Doesn't this mean that we break userns containers on old distros, or even just in limited environments (e.g. are there any seccomp profiles blocking the new mount API but not the old mount syscall)? As such, would it make sense to check in podman whether the syscall exists/works first, and only do the chmod if it does not?
Force-pushed from 6cddb7b to da69dfc.
I've originally added the … Still marked as a Draft as I need to polish the commits; pushed early only to trigger the CI.
Force-pushed from da69dfc to aec1da4.
I am fine with not having the fallback if it has been out for that long, as long as we document in the release notes that we require the new mount API (kernel 5.2).
Force-pushed from 304d223 to db64e46.
LGTM
I am very unsure if this can even work like that. CI will not catch parallel or long-running container issues, so I wouldn't trust it too much.
I think I have to do some manual testing to check.
libpod/oci_conmon_linux.go (outdated)
if err := unix.Mount("", parentDir, "", unix.MS_PRIVATE, ""); err != nil {
	return 0, fmt.Errorf("making intermediate parent directory private for container %s: %w", ctr.ID(), err)
}
If we make this mount private, doesn't this mean a user can no longer receive additional mounts in their mounted volumes (assuming they specified the shared or slave option for the volume)?
I.e., assume the container has /tmp/test mounted, and then on the host someone mounts /tmp/test/mnt; does it no longer get forwarded? Or am I missing something? I have not tested this yet.
No, this only prevents mounts below parentDir from being propagated; everything else keeps its original propagation.
The PR was still marked Draft because I was not happy with the hack above, and it was still propagating a new mount to the parent mount namespace. I've submitted a new version that doesn't require the double mount.
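The intermediate-mount idea being discussed can be sketched as follows. This is a simplified illustration under stated assumptions — the function name and paths are hypothetical, not libpod's real layout, and it needs CAP_SYS_ADMIN to actually succeed:

```go
package main

import (
	"fmt"
	"syscall"
)

// bindIntoPrivateParent sketches the intermediate private mount:
// turn parentDir into a mount point, mark it private so the bind
// below does not propagate outward, then bind the container rootfs
// onto rootPath inside it. Hypothetical helper, requires CAP_SYS_ADMIN.
func bindIntoPrivateParent(parentDir, mountpoint, rootPath string) error {
	// A directory can only have its propagation changed once it is a
	// mount point itself, so bind it onto itself first.
	if err := syscall.Mount(parentDir, parentDir, "", syscall.MS_BIND, ""); err != nil {
		return fmt.Errorf("bind %s onto itself: %w", parentDir, err)
	}
	// MS_PRIVATE here affects only the mount at parentDir (and hence
	// mounts placed below it); every other mount keeps its original
	// propagation, which is why shared/slave volumes elsewhere are
	// unaffected -- the point made in the comment above.
	if err := syscall.Mount("", parentDir, "", syscall.MS_PRIVATE, ""); err != nil {
		return fmt.Errorf("making intermediate parent directory private: %w", err)
	}
	// Finally expose the container rootfs under the private parent.
	if err := syscall.Mount(mountpoint, rootPath, "", syscall.MS_BIND, ""); err != nil {
		return fmt.Errorf("bind %s to %s: %w", mountpoint, rootPath, err)
	}
	return nil
}

func main() {
	// With bogus paths or without privileges this simply errors out.
	if err := bindIntoPrivateParent("/tmp/parent", "/tmp/src", "/tmp/parent/root"); err != nil {
		fmt.Println("mount failed (expected without CAP_SYS_ADMIN):", err)
	}
}
```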
if err := unix.Mount("", parentDir, "", unix.MS_PRIVATE, ""); err != nil {
	return 0, fmt.Errorf("making intermediate parent directory private for container %s: %w", ctr.ID(), err)
}
if err := unix.Mount(ctr.state.Mountpoint, rootPath, "", unix.MS_BIND, ""); err != nil {
AFAIK if a user removes the target of a bind mount, the mount is just dropped.
This means this special tmpdir can never be removed without breaking the running containers, which seems like very surprising behaviour.
And you do nothing to prevent systemd-tmpfiles from removing it besides updating the timestamp when you start a new container, which doesn't really help if you only have one really long-running container.
That is not what I am observing by deleting the directory manually or inspecting the kernel code.
From what I can see, the removal of a dentry (either through rmdir for directories or unlink for files) causes any mount on that dentry to be lazily unmounted (through detach_mounts() in the kernel), so the mount is still referenced as long as something is using it. So it is fine if the mount point gets removed.
I've not marked the directory as sticky (that way systemd-tmpfiles would just ignore it), so it will eventually be cleaned up if unused for too long; we just need to make sure this won't happen while we are setting up the mount.
# podman image mount fedora
/var/lib/containers/storage/overlay/1169780961bbebe13753267b1b2c1e720531fe8fb95ffcfc9a51996c4af3f743/merged
# mkdir /tmp/fedora
# mount --bind /var/lib/containers/storage/overlay/1169780961bbebe13753267b1b2c1e720531fe8fb95ffcfc9a51996c4af3f743/merged /tmp/fedora/
# unshare -m
# cat /proc/self/mountinfo | grep /tmp^C
# pivot_root . .
# cd /
# ls
afs boot etc lib media opt root sbin sys usr
bin dev home lib64 mnt proc run srv tmp var
from another terminal:
# umount /tmp/fedora
# rmdir /tmp/fedora
# podman image umount fedora
but the previous mount point still works:
# echo hello
hello
Force-pushed from bdc8183 to 61fb607.
Signed-off-by: Giuseppe Scrivano <[email protected]>
when the new mount API is available, the OCI runtime doesn't require each parent directory of a bind mount to be accessible. Instead, the mount is opened in the initial user namespace and the file descriptor is passed down to the container init process. This requires that the kernel supports the new mount API and that the OCI runtime uses it. Signed-off-by: Giuseppe Scrivano <[email protected]>
so it is possible to remove the code to make the entire directory world accessible. Signed-off-by: Giuseppe Scrivano <[email protected]>
if the current user is not mapped into the new user namespace, use an intermediate mount to make the mount point accessible, instead of opening up all the parent directories of the mountpoint. Closes: containers#23028 Signed-off-by: Giuseppe Scrivano <[email protected]>
Force-pushed from 61fb607 to 49eb5af.
/lgtm
when the new mount API is available, the OCI runtime doesn't require each parent directory of a bind mount to be accessible. Instead, the mount is opened in the initial user namespace and the file descriptor is passed down to the container init process.
This requires that the kernel supports the new mount API and that the OCI runtime uses it.
Closes: #23028
Does this PR introduce a user-facing change?