-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rootless: make /sys/fs/cgroup/* read-only #1869
Conversation
e4d76aa
to
6de6a84
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a small comment, tested locally and it works for me
{ | ||
Source: "/sys", | ||
Destination: "/sys", | ||
Type: "none", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be "bind"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the only place where I see it used. In the runtime-specs example bind
is used as the type for a bind mount (https://github.com/opencontainers/runtime-spec/blob/master/config.md#example-linux), both Docker and Podman generate "bind".
Anyway since it is only in the example and not in the specs, and you are just moving this code snippet, it is probably fine to leave it as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type: "none"
is entirely valid. There was a bug previously in runc
when we didn't handle this properly, but the type field (when it comes to mount(2)
) doesn't affect bindmounts -- what matters is that the bind
option is set. Docker set bind
presumably by accident, and Podman probably copied the mistake (to be clear -- it doesn't matter either way, but it does lead to confusion).
But this code is used for more than example -- it's used by at least a few projects to convert a configuration to the rootless version of said configuration.
libcontainer/specconv/example.go
Outdated
@@ -162,6 +162,10 @@ func ToRootless(spec *specs.Spec) { | |||
var namespaces []specs.LinuxNamespace | |||
|
|||
// Remove networkns from the spec. | |||
// | |||
// TODO: removal of networkns should be optional, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI. When someone puts TODOs in PRs, it makes reviewers nervous.
todo when?
by who?
what is the issue now?
what breaks while this isn't done?
6de6a84
to
6450010
Compare
addressed comments |
Thanks, LGTM |
libcontainer/specconv/example.go
Outdated
var mounts []specs.Mount | ||
for _, mount := range spec.Mounts { | ||
// Ignore all mounts that are under /sys. | ||
if strings.HasPrefix(mount.Destination, "/sys") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is always fine -- if someone wants to mount a fake filesystem over parts of /sys
then I don't see why we would block it (let alone ignore it).
Ultimately this is used during config generation so it's not the end of the world, but it is something to keep in mind.
libcontainer/specconv/example.go
Outdated
// Without this hack, `df` fails with "df: /sys/firmware/efi/efivars" | ||
// because the efivars entry exists in mtab (because /sys is rbind-mounted) | ||
// but /sys/firmware is masked. | ||
if strings.HasPrefix(masked, "/sys") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is fine for rootless mode -- if you unmask paths you could be leaking information to containers (so while they might not be able to write to the masked paths, they are able to read them). I mean, you probably can't read most of /sys
that is masked anyway, but I'm not sure it's a great idea.
The default rootless spec bind-mounts /sys with "rbind,ro" (because mounting sysfs requires netns to be unshared), however, it does not make subfilesystems mounted under /sys read-only. So, files under /sys/fs/cgroup were unexpectedly writable when they are chmod/chowned via privileged helpers such as pam_cgfs. This patch fixes the issue by mounting /sys/fs/cgroup/* as read-only explicitly. Signed-off-by: Akihiro Suda <[email protected]>
6450010
to
28dc2e6
Compare
addressed comments, although it became more complicated |
@giuseppe latest revision still LGTY? |
@AkihiroSuda yes, latest version LGTM |
@cyphar PTAL |
ping @cyphar |
On second thought maybe we should rather suggest users not mounting Rootless BuildKit already removed For Rootless Moby, I'm going to close this PR but I may change my mind later. |
The default rootless spec bind-mounts /sys with "rbind,ro" (because mounting sysfs
requires netns to be unshared), however, it does not make subfilesystems mounted
under /sys read-only.
So, files under /sys/fs/cgroup were unexpectedly writable when they are chmod/chowned
via privileged helpers such as pam_cgfs.
This patch fixes the issue by mounting /sys/fs/cgroup/* as read-only explicitly.
Signed-off-by: Akihiro Suda [email protected]