-
Notifications
You must be signed in to change notification settings - Fork 198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support compose with default docker/kube privileges #1329
Comments
symlinks wouldn't work obviously. Can't bind mount either since that's still In the end I think it'd probably be better to sprint towards getting usable userns in kube. |
That's 6 months out, no matter what, which means from then until now we can't build CoreOS on OpenShift. That seems like a problem - I'm still not clear why bwrap and other containerization is necessary when you are already inside a container? I assume you're trying to prevent the content that will be running as root from escaping and gaining access to some of the contents of the container running the container? We already operate in that mode for docker builds, and I'm not sure it's a bigger security concern |
Fundamentally we have our source rootfs (container) and we're generating a new target rootfs which is distinct from that source. Changing that assumption would be...ugly. It starts to look like this. In this source/target separation then (which is the same scenario necessary to run the same User namespaces is the general solution to this, it enables recursive containerization.
Yeah, it's slightly useful for host builds but not a big deal. However there are two cases where it's a lot more useful:
At a high level I'd say we're establishing some of the requirements to do RPM-style builds in Kube/OpenShift which is going to be a very useful thing in general. Where rpm-ostree adds additional complexity is that unlike |
The bwrap code has some "opinionated" setup around e.g. /etc and such that I'd like to centralize even more. However, the dracut case of taking the host's `/etc` is the unusual standout. Let's split out a base constructor. Prep for coreos#1329
This is also prep for running inside (unprivileged) Kube containers as they exist today: coreos#1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What FUSE is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified (which we support today) that repository gets thrown away anyways. So there's no real value to using FUSE here. Let's hence make it so that in this scenario we don't. Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work.
This was useful for me to directly invoke bwrap against the target rootfs. Part of working on: coreos#1329
In prep for cleaning up how we manage mounts. If we're looking at a real existing `/var`, then it must have the directory. The only case where we don't is in pkglayering, so move the special case there. Prep for coreos#1329
User namespaces denies overmounting existing targets as it could reveal things intended to be hidden. In the recursive container case, we need to just use e.g. `--bind /dev /dev` rather than trying to create a new instances. Further, in the selftest case we can't overmount what already exists, so skip doing that entirely. Prep for rpm-ostree-in-Kube: coreos#1329
This should all work with preceeding commits, let's start testing it. Closes: coreos#1329
PR in #1332 |
The bwrap code has some "opinionated" setup around e.g. /etc and such that I'd like to centralize even more. However, the dracut case of taking the host's `/etc` is the unusual standout. Let's split out a base constructor. Prep for coreos#1329
This was useful for me to directly invoke bwrap against the target rootfs. Part of working on: coreos#1329
In prep for cleaning up how we manage mounts. If we're looking at a real existing `/var`, then it must have the directory. The only case where we don't is in pkglayering, so move the special case there. Prep for coreos#1329
This is also prep for running inside (unprivileged) Kube containers as they exist today: coreos#1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What FUSE is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified (which we support today) that repository gets thrown away anyways. So there's no real value to using FUSE here. Let's hence make it so that in this scenario we don't. Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work.
User namespaces denies overmounting existing targets as it could reveal things intended to be hidden. In the recursive container case, we need to just use e.g. `--bind /dev /dev` rather than trying to create a new instances. Further, in the selftest case we can't overmount what already exists, so skip doing that entirely. Prep for rpm-ostree-in-Kube: coreos#1329
This should all work with preceeding commits, let's start testing it. Closes: coreos#1329
Now that we've split out the "base" constructor for the initramfs special case, the "mutable" disposition tells us how to handle binding `/etc` and `/var` in addition to `/usr`. Drop the `RPMOSTREE_BWRAP_IMMUTABLE` case as it is now only used in the script case where we handle `/var`. Next, user namespaces denies overmounting existing targets as it could reveal things intended to be hidden. In the recursive container case, we need to just use e.g. `--bind /dev /dev` rather than trying to create a new instances. Further, in the selftest case we can't overmount what already exists, so skip doing that entirely. Closes: coreos#1329
In pulling together what the CI process arc looks like, the concrete requirements for this should fall out. If the cost is high enough from your investigations that you think it needs to be scheduled in, we can manage that as part of the epic that tracks what the CI process looks like. |
Colin, how much is important to get a new pid namespace and a new I gave it a quick try (bwrap under bwrap under docker), from the bwrap build directory:
There is an issue for Moby to not have masked paths on |
Without a pid namespace (and without a seccomp policy in place) one can easily |
Isn't relaxing seccomp required in any case when a new namespace must be created inside of a Docker container? Same for a new pid namespace + procfs mount: that is not permitted in an unprivileged Docker container as In any case, I was just wondering if the recursive bwrap could help with anything as it can be used for launching rpm-ostree which in turn uses bwrap internally. |
Maybe the path we should be looking at is doing a Custom builder specifically for the OpenShift casse? |
Custom builders are effectively dead, so not really a practical option. The problem is that building this content shouldn't require privileges, and i don't really want to encourage anyone to think that privileged container image builders are necessary or a good idea. When user namespaces and cri-o lands in openshift, and we start dropping privileges for those containers, I don't really want there to be this one special case. Maybe I'm being unreasonable, but in a properly confined container that is running as uid 0, there's nothing special about what rpm-ostree is doing that requires isolation other than some trickery. You are already confined when you run. |
Mounting proc is in order to prevent what? Or allow what? Inside of a containerized runner, we're not defending the container from the content (we will already have ways of defending that that will grow over time). We're just trying to ensure the content doesn't escape the container. Malicious content in a build container is not our threat model here, because all user code that ends up in the final image has to be considered malicious. |
I forgot to comment here too, but the basic thing is that we're generating a root filesystem which is not the same as the source, and we want to run scripts in the target, and it really works best if those scripts are run as containers. One of the biggest "scripts" is dracut, which generates the initramfs which is is another little filesystem generated from our target. We could in theory teach dracut how to deal with a src != target scenario but it'd be quite nontrivial, and a big risky change to an under-tested but critical component of the OS. Mix in the major changes here with the fundamental thing that down the line, it should be quite possible to make recursive containerization can be made to work inside Kubernetes, I'd rather limp forward with the story that "use --privileged for now". See containers/bubblewrap#269 for some discussion on recursive containers. |
This is prep for running inside (unprivileged) Kube containers as they exist today: coreos#1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What rofiles-fuse here is doing here is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified, that repository gets thrown away anyways. So there's no real value to using FUSE here. Also since nothing is cached, disable the devino cache. Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work.
This is prep for running inside (unprivileged) Kube containers as they exist today: coreos#1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What rofiles-fuse here is doing here is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified, that repository gets thrown away anyways. So there's no real value to using FUSE here. Also since nothing is cached, disable the devino cache. Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work.
This is prep for running inside (unprivileged) Kube containers as they exist today: coreos#1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What rofiles-fuse here is doing here is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified, that repository gets thrown away anyways. So there's no real value to using FUSE here. Also since nothing is cached, disable the devino cache. Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work.
This is prep for running inside (unprivileged) Kube containers as they exist today: #1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What rofiles-fuse here is doing here is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified, that repository gets thrown away anyways. So there's no real value to using FUSE here. Also since nothing is cached, disable the devino cache. Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work. Closes: #1591 Approved by: jlebon
This is prep for running inside (unprivileged) Kube containers as they exist today: coreos#1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What rofiles-fuse here is doing here is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified, that repository gets thrown away anyways. So there's no real value to using FUSE here. Also since nothing is cached, disable the devino cache. Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work.
This is prep for running inside (unprivileged) Kube containers as they exist today: coreos#1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What rofiles-fuse here is doing here is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified, that repository gets thrown away anyways. So there's no real value to using FUSE here. Also since nothing is cached, disable the devino cache. Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work.
This is prep for running inside (unprivileged) Kube containers as they exist today: coreos#1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What rofiles-fuse here is doing here is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified, that repository gets thrown away anyways. So there's no real value to using FUSE here. Also since nothing is cached, disable the devino cache. We also make use of --force-copy-zerosized that just landed in libostree: ostreedev/ostree#1752 Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work.
This is prep for running inside (unprivileged) Kube containers as they exist today: coreos#1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What rofiles-fuse here is doing here is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified, that repository gets thrown away anyways. So there's no real value to using FUSE here. Also since nothing is cached, disable the devino cache. We also make use of --force-copy-zerosized that just landed in libostree: ostreedev/ostree#1752 Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work.
This is prep for running inside (unprivileged) Kube containers as they exist today: coreos#1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What rofiles-fuse here is doing here is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified, that repository gets thrown away anyways. So there's no real value to using FUSE here. Also since nothing is cached, disable the devino cache. We also make use of --force-copy-zerosized that just landed in libostree: ostreedev/ostree#1752 Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work.
This is prep for running inside (unprivileged) Kube containers as they exist today: #1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What rofiles-fuse here is doing here is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified, that repository gets thrown away anyways. So there's no real value to using FUSE here. Also since nothing is cached, disable the devino cache. We also make use of --force-copy-zerosized that just landed in libostree: ostreedev/ostree#1752 Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work. Closes: #1591 Approved by: jlebon
This is prep for running inside (unprivileged) Kube containers as they exist today: #1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What rofiles-fuse here is doing here is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified, that repository gets thrown away anyways. So there's no real value to using FUSE here. Also since nothing is cached, disable the devino cache. We also make use of --force-copy-zerosized that just landed in libostree: ostreedev/ostree#1752 Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work. Closes: #1591 Approved by: jlebon
BTW, there's a PR open to implement userns in Kube: kubernetes/kubernetes#64005. |
This might be a little different than the issue being discussed here, but I am trying to make a custom tree using rootless podman/buildah with --privileged flag. It seems everything goes fine until it downloads files from the repository (which I assume means bwrap is able to create a container). But after that it fails with Operation not permitted. Am I doing something wrong? |
We just merged some code in coreos-assembler to make it work unprivileged: coreos/coreos-assembler#190. You might be able to use coreos-assembler (running in a rootless podman invocation) to do a compose (it uses an unprivileged VM). |
@dustymabe Thanks for the advice. I did take a look at the coreos-assembler. From what I can see is that you guys are launching a qemu-kvm instance if the container is unprivileged right? Two thing I would like to ask.
BTW I am using centos 7.5 and the custom tree is for centos 7.5 |
right
yeah. this isn't something I've tried. I'm not sure if there's a technical reason why it's not working, or if it is a bug in rootless containers, which are pretty new
If you have control of the hypervisor you could enable nested virt, but that is a big IF.
yeah. I don't know if I recommend this, it would probably be really slow. We are calling the qemu-kvm binary in coreos-assembler, so i'm not sure if it would fall back to non hardware assisted virt (emulation), but you could try. |
@dustymabe Thanks for your help. I think I will stick to using nested virt approach. This will take some effort, but I think it will be worth the time I will save later. |
While some initial work landed on this, I think we're unlikely to spend much more time on it. I'd prefer to focus on the coreos-assembler unpriv path: #1329 (comment) |
Currently we require recursive containerization support for compose; the ability to run bwrap.
In theory, we should be able to fall back on just
chroot()
, which is a capability is exposed.Unfortunately that isn't going to give us privileges to e.g. mount
/proc
. What we'd need to do is inside the target, make those symlinks to the outer container's filesystems.The text was updated successfully, but these errors were encountered: