userns: prevent /sys/kernel/* paths in the container #2899
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: giuseppe
Force-pushed: 3d0b90a to 507c640
libpod/oci_linux.go (outdated)
	return errors.Wrapf(err, "cannot make /sys slave")
}

paths := []string{
Could we make this more future-proof? Remove all submounts of /sys? Or at least anything beginning with /sys/kernel?
fixed now
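One way to avoid a hardcoded list, in the spirit of the suggestion above, is to read the mount table and pick out everything below a prefix. The Go sketch below is not the code from this PR: `mountsUnder` and the `sample` table are hypothetical, and the only assumption is the documented layout of /proc/self/mountinfo, where the fifth field is the mount point.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// mountsUnder returns the mount points found in mountinfo (text in the
// format of /proc/self/mountinfo) that equal prefix or sit below it.
// Deeper paths come first so nested mounts can be unmounted before
// their parents. Hypothetical helper, not part of libpod.
func mountsUnder(mountinfo, prefix string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(mountinfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 5 {
			continue // blank or malformed line
		}
		mp := fields[4] // fifth field is the mount point
		if mp == prefix || strings.HasPrefix(mp, prefix+"/") {
			out = append(out, mp)
		}
	}
	sort.Slice(out, func(i, j int) bool {
		di, dj := strings.Count(out[i], "/"), strings.Count(out[j], "/")
		if di != dj {
			return di > dj // deepest first
		}
		return out[i] < out[j]
	})
	return out
}

// sample is made-up mountinfo content used only for illustration.
const sample = `
22 28 0:21 / /sys rw,nosuid,nodev,noexec,relatime shared:2 - sysfs sysfs rw
23 22 0:7 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:3 - securityfs securityfs rw
24 22 0:8 / /sys/kernel/debug rw,relatime shared:4 - debugfs debugfs rw
25 24 0:11 / /sys/kernel/debug/tracing rw,relatime shared:5 - tracefs tracefs rw
26 22 0:25 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:6 - tmpfs tmpfs ro
`

func main() {
	for _, mp := range mountsUnder(sample, "/sys/kernel") {
		fmt.Println(mp)
	}
}
```

Matching on `prefix+"/"` rather than on the bare prefix keeps an unrelated path such as /sys/kernelfoo out of the result.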
@@ -31,4 +31,12 @@ echo $rand | 0 | $rand
done < <(parse_table "$tests")
}

@test "podman run - uidmapping has no /sys/kernel mounts" {
run_podman $expected_rc run --uidmapping 0:100:10000 $IMAGE mount | grep /sys/kernel
I think you need --net=host here to make the current behaviour happen.
I've added another test with --net=host, so that we can test both cases
Force-pushed: 507c640 to 0a5df19
LGTM
@giuseppe rootless tests aren't happy
Code LGTM (very elegant)
Force-pushed: 0a5df19 to 2ea8946
Ah, this wouldn't work with rootless: an unprivileged user cannot umount these paths. @rhatdan, is it an issue for buildah?
if err = unix.Unshare(unix.CLONE_NEWNS); err != nil {
	return err
}
defer unix.Setns(int(fd.Fd()), unix.CLONE_NEWNS)
Is this defer needed? It seems to be a duplicate of line 112.
good catch! Fixed
I think we don't need the one above.
@giuseppe Yes.
When we run in a user namespace, there are cases where we do not have enough privileges to mount a fresh sysfs on /sys. To circumvent this limitation, we rbind /sys from the host. This also carries into the container some mounts we probably don't want. We are also limited by the kernel to using rbind instead of bind, as allowing a bind would uncover paths that were not previously visible.

This is a slimmed-down version of the intermediate mount namespace logic we had before, where we only set /sys to slave, so the umounts done to the storage by the cleanup process are propagated back to the host. We also don't set up any new directory, so there is no additional cleanup to do.

Signed-off-by: Giuseppe Scrivano <[email protected]>
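The sequence described in the commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in libpod/oci_linux.go: `mountOps`, `realOps`, `recordingOps`, and `dropSysKernelMounts` are hypothetical names, the standard `syscall` package stands in for `golang.org/x/sys/unix`, and the mount flags are assumed rather than copied from the PR. It is Linux-only.

```go
package main

import (
	"fmt"
	"syscall"
)

// mountOps abstracts the privileged calls so the ordering can be
// exercised without root. Hypothetical type for this sketch.
type mountOps struct {
	unshare func(flags int) error
	mount   func(source, target, fstype string, flags uintptr, data string) error
	unmount func(target string, flags int) error
}

// realOps wires in the actual system calls. A real caller would need
// runtime.LockOSThread() first, because unshare(2) only affects the
// calling thread, and would later rejoin the original namespace.
func realOps() mountOps {
	return mountOps{unshare: syscall.Unshare, mount: syscall.Mount, unmount: syscall.Unmount}
}

// recordingOps logs each call instead of performing it.
func recordingOps(log *[]string) mountOps {
	return mountOps{
		unshare: func(int) error { *log = append(*log, "unshare"); return nil },
		mount: func(_, target, _ string, _ uintptr, _ string) error {
			*log = append(*log, "mount "+target)
			return nil
		},
		unmount: func(target string, _ int) error {
			*log = append(*log, "umount "+target)
			return nil
		},
	}
}

// dropSysKernelMounts enters a new mount namespace, marks /sys as a
// slave mount so changes made here do not propagate to the host, and
// then unmounts the given submounts.
func dropSysKernelMounts(ops mountOps, submounts []string) error {
	if err := ops.unshare(syscall.CLONE_NEWNS); err != nil {
		return fmt.Errorf("cannot create mount namespace: %w", err)
	}
	// Flags are an assumption for this sketch.
	if err := ops.mount("", "/sys", "", syscall.MS_SLAVE|syscall.MS_REC, ""); err != nil {
		return fmt.Errorf("cannot make /sys slave: %w", err)
	}
	for _, m := range submounts {
		if err := ops.unmount(m, 0); err != nil {
			return fmt.Errorf("cannot umount %s: %w", m, err)
		}
	}
	return nil
}

func main() {
	var log []string
	err := dropSysKernelMounts(recordingOps(&log), []string{"/sys/kernel/debug"})
	fmt.Println(log, err)
}
```

`main` uses the recording variant so the example is safe to run unprivileged; swapping in `realOps()` would require sufficient privileges, which is exactly what the rootless case discussed in this thread lacks.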
Force-pushed: 2ea8946 to b780088
We might need to mask these in rootless mode. I am not sure if @TomSweeneyRedHat has tried buildah bud --isolation chroot inside of rootless podman yet. We really need this for containers running as root anyway, especially since without SELinux running you can potentially modify the kernel. In rootless mode this would obviously be blocked.
I think there is no way for an unprivileged user to drop those mounts. It cannot mount a new sysfs, and it cannot umount anything that would reveal what is under the mount point. On the other hand, is it really a security issue? The unprivileged user won't have any new privilege that it didn't have on the host.
Right, but I am worried about information disclosure. The information in /sys/kernel/debug might be useful to a hacked container. If we cannot umount these paths, we should mount a tmpfs over them.
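Masking, as opposed to unmounting, is usually expressed through the `linux.maskedPaths` list of the OCI runtime configuration, which asks the runtime to hide each listed path from the container. The sketch below models only that fragment with hypothetical types and helpers (`specFragment`, `addMaskedPath`, `render`). It uses /sys/kernel/debug from the comment above as the example path; which path the PR ends up masking is not shown in this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// linuxSection and specFragment model just the part of an OCI runtime
// config needed here. Hypothetical types for this sketch.
type linuxSection struct {
	MaskedPaths []string `json:"maskedPaths,omitempty"`
}

type specFragment struct {
	Linux linuxSection `json:"linux"`
}

// addMaskedPath appends path to the masked list unless already present.
func addMaskedPath(s *specFragment, path string) {
	for _, p := range s.Linux.MaskedPaths {
		if p == path {
			return
		}
	}
	s.Linux.MaskedPaths = append(s.Linux.MaskedPaths, path)
}

// render serializes the fragment as compact JSON.
func render(s specFragment) (string, error) {
	b, err := json.Marshal(s)
	return string(b), err
}

func main() {
	var s specFragment
	addMaskedPath(&s, "/sys/kernel/debug") // example path taken from the discussion
	out, err := render(s)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Because masking is applied by the runtime while it sets up the container, it does not depend on the caller being able to umount anything on the host side, which is the limitation raised for rootless mode.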
we can mask it, although
Signed-off-by: Giuseppe Scrivano <[email protected]>
Force-pushed: 1b60e4b to 1367c65
Signed-off-by: Giuseppe Scrivano <[email protected]>
Force-pushed: 1367c65 to 2c9c40d
I've added a patch to mask
Well, this looks good.
LGTM
Should I drop the second patch, or is it still a valid additional defense?
LGTM. I think we keep the masked path; defense in depth is good.
Holding /lgtm until tests go green
/lgtm
This commit seems to have broken the new unprivileged-access BATS test:
@giuseppe any suggestions on how to fix it?
I think it is related to the new tests, which use a mapping that is not correct for the unprivileged-access tests.
I am surprised how that could happen though, directories under
Not any more. Immediately after this command,
Furthermore, I am now stuck in a system where I can no longer remove that directory:
(podman rm -a, podman ps -a, podman images -a, all show nothing). master @ c1e2b58. I don't have time this evening to do a full investigation but wanted to leave you with something you can check out tomorrow morning.
How do directories under
They must have mode
Sorry for the late reply. I can no longer reproduce this on my system, and OSP10 is down so I can't get a new virt for testing. The only difference on my system is that I'm still stuck with the undeletable directory mentioned above, so instead of