runtime based privileged container translation #1213
Comments
Why not use |
@AkihiroSuda runtimeclass works at the sandbox level. We want something at the container level. As @dadux explained in his use case, only one container in a pod is created in privileged mode, and the other containers do not have extra privileges. We can still make it (kata's own translation) work by introducing a container annotation, but that is not the purpose of this issue. The purpose of this issue is to let containerd/cri still handle privileged container spec translation, instead of introducing kata's own translation and bypassing cri/containerd altogether. |
I thought Kata was creating a VM per a pod, not per a container - the situation has changed? |
No, it is not changed, still one vm per pod. |
If a container (w/o access to the host devices) got compromised, is the attacker able to see the host devices in the pod VM? |
Yes. For instance, if a kernel CVE allows a container escape, an attacker could then access any of the host devices mounted in the Kata virtual machine: the root fs, /dev/dm-X from other containers, etc. |
Then I think that mode should be pod-level (VM-level), not container-level. |
@AkihiroSuda If we can drop WithPrivilegedDevices for kata containers, it is safe to keep privileged mode at the container level, because no host devices are passed to guest implicitly. |
/cc @Random-Liu |
To clarify the point (restating what @bergwolf is already saying):
Privileged today
While privileged isn't a very popular option, there are some scenarios where it is useful, as it makes adjustments to the container beyond just devices/caps: see current handling, as an example. There isn't a way to express these in a container spec today, so onward with privileged.
Desired change
For non-host-based runtimes (i.e., Kata Containers), passing in devices doesn't make sense. However, access to the guest kernel and sysfs for these specific privileged use cases (e.g., dind) is useful while still limiting exposure to the host. We'd like to see device addition, WithPrivilegedDevices, be configurable so we can drop it in the case of Kata Containers. With this:
Alternative - just create new annotation in Kata
We had considered adding an annotation for Kata, allowing end users to specify a kata-privileged mode. Unfortunately, annotations are consumed at the pod level. The security profile should be applied at container granularity (as it is done today). |
I prefer option 1.1 or option 2. |
1.1 seems to make sense, though 1.3 is a decent short term method as well that allows usage if you aren't using a runtimeClass aware orchestrator? (I'm really not sure how widespread this would be). For Option 2: is this feasible at container granularity? |
Yeah, the annotation can contain container name, that is how experimental seccomp is designed today. Once we have enough implementation, we can eventually consolidate different implementations and graduate it to something similar to option 1.1. |
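A rough sketch of what "the annotation can contain the container name" could look like, modeled on the per-container seccomp key `container.seccomp.security.alpha.kubernetes.io/<container_name>`. The annotation prefix and the function name below are hypothetical and chosen only for illustration; they are not an existing Kata, containerd, or Kubernetes API.

```go
package main

import (
	"fmt"
	"strings"
)

// privilegedPrefix is a HYPOTHETICAL annotation prefix, modeled on the
// per-container seccomp annotation. The container name follows the slash.
const privilegedPrefix = "container.privileged.example.io/"

// privilegedContainers scans pod-level annotations and returns the names of
// the containers whose annotation value is exactly "true".
func privilegedContainers(podAnnotations map[string]string) map[string]bool {
	out := make(map[string]bool)
	for key, value := range podAnnotations {
		if !strings.HasPrefix(key, privilegedPrefix) {
			continue // not one of our per-container keys
		}
		name := strings.TrimPrefix(key, privilegedPrefix)
		if name != "" && value == "true" {
			out[name] = true
		}
	}
	return out
}

func main() {
	annotations := map[string]string{
		privilegedPrefix + "dind": "true",
		privilegedPrefix + "app":  "false",
		"unrelated/key":           "true",
	}
	fmt.Println(privilegedContainers(annotations))
}
```

The annotation still lives on the pod, but because each key embeds a container name, the runtime can apply the setting at container granularity.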
Sorry - can you point to existing example for option 2? @Random-Liu |
@Random-Liu - +1 to |
Hm, I can't find an official doc. It's like this https://kubesec.io/basics/metadata-annotations-seccomp-security-alpha-kubernetes-io-pod/ |
+1 to While changes to runtimeclass might take some time, we might consider As for |
Agreed with short term 1.3 and 1.1 medium term @bergwolf. Wdyt @Random-Liu ? |
For option 1.3, could this be a possible configuration format?
[plugins]
  [plugins.cri]
    [plugins.cri.containerd]
      [plugins.cri.containerd.default_runtime]
        runtime_type = "io.containerd.runtime.v1.linux"
        ---> privileged_all_host_devices = true
      [plugins.cri.containerd.runtimes.kata]
        runtime_type = "io.containerd.kata.v2"
        ---> privileged_all_host_devices = false
        [plugins.cri.containerd.runtimes.kata.options]
          ConfigPath = "/opt/kata/share/defaults/kata-containers/configuration.toml" |
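To make the intent of the proposed format concrete, here is a minimal Go sketch of how a per-runtime flag could be looked up by runtime handler. The struct and function names are assumptions made for illustration, and the field simply mirrors the proposed `privileged_all_host_devices` key; none of this is taken from the containerd/cri source.

```go
package main

import "fmt"

// RuntimeConfig mirrors one runtime entry in the proposed config.
// PrivilegedAllHostDevices corresponds to the proposed (not yet existing)
// privileged_all_host_devices key.
type RuntimeConfig struct {
	Type                     string
	PrivilegedAllHostDevices bool
}

// ContainerdConfig holds the default runtime plus named runtimes.
type ContainerdConfig struct {
	DefaultRuntime RuntimeConfig
	Runtimes       map[string]RuntimeConfig
}

// runtimeFor returns the config for a handler; an empty handler selects the
// default runtime, and an unknown handler is an error.
func (c ContainerdConfig) runtimeFor(handler string) (RuntimeConfig, error) {
	if handler == "" {
		return c.DefaultRuntime, nil
	}
	rt, ok := c.Runtimes[handler]
	if !ok {
		return RuntimeConfig{}, fmt.Errorf("no runtime configured for handler %q", handler)
	}
	return rt, nil
}

// exposeHostDevices reports whether host devices should be added to the
// container spec: only for privileged containers on a runtime that opts in.
func exposeHostDevices(c ContainerdConfig, handler string, privileged bool) (bool, error) {
	rt, err := c.runtimeFor(handler)
	if err != nil {
		return false, err
	}
	return privileged && rt.PrivilegedAllHostDevices, nil
}

func main() {
	cfg := ContainerdConfig{
		DefaultRuntime: RuntimeConfig{Type: "io.containerd.runtime.v1.linux", PrivilegedAllHostDevices: true},
		Runtimes: map[string]RuntimeConfig{
			"kata": {Type: "io.containerd.kata.v2", PrivilegedAllHostDevices: false},
		},
	}
	for _, handler := range []string{"", "kata"} {
		expose, err := exposeHostDevices(cfg, handler, true)
		fmt.Printf("handler=%q expose=%v err=%v\n", handler, expose, err)
	}
}
```

With this shape, a privileged container on the default runtime keeps today's behaviour, while the same container on the kata handler gets no host devices.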
@awprice Yes, I was thinking something similar. We can discuss if we want to let users configure ALL privileged container properties, or just the host device one. |
As we discussed, eventually this should come from RuntimeClass (option 1.1), but I'm OK with starting from containerd config (option 1.3), which is the traditional way of adding new features right now. :P Let's start from #1213 (comment), and extend/redesign it in the future if more behavior differences are needed. However, I still want to put some ideas about option 1.1 here for future Kubernetes API design. Also ref kubernetes/kubernetes#44503. Today's
In the future, if we add a new privileged behavior for a specific runtime, I believe we'll also have a well-defined "With" option, e.g. So if, for a specific runtime, we want to change the privileged behavior, we just need to add/drop policies from that list. I think in Kubernetes, we should:
I like this idea, because:
|
@Random-Liu +1 to adding this functionality to Kubernetes, as it's definitely a more robust design and gives more fine-grained control. I'm happy to make a start on option 1.3 now, as it is much-needed functionality for @dadux and me. |
With 1.3 in mind, how about defining a privileged policy for the runtime at the CRI level instead of individual flags and later introduce this in the RuntimeClass itself with 1.1, something like:
I really like the idea of individual container specific primitives in the pod spec itself, as this would allow a finer grained control and something that we have wanted for quite some time. |
@amshinde I thought about that, but found that it doesn't have much benefit at the containerd level, because users can't use the newly introduced primitives. I doubt that there will be many different privileged behaviors, so it seems to be overkill to define that policy in containerd. If it turns out that we do need many different privileged behaviors, we can introduce the policy list at that time. :) But I still think it makes sense to do that in the Kubernetes API, because it makes the API more self-contained, and all the new primitives are useful to users. And it makes it possible to define other shortcut/policy besides |
@Random-Liu Thoughts on using the configuration format that @amshinde has proposed for option 1.3? I feel like it is much cleaner vs having a large amount of booleans. |
@awprice I think @Random-Liu has expressed that it seems to be overkill. I agree with him that a single boolean is sufficient for now. In the long term, we can make it configurable via the pod spec instead of adding primitives in containerd that are not visible to users. |
I'll mark this 1.3 for now, for the short-term solution proposed in #1213 (comment). |
Fixes containerd#1213 Signed-off-by: Alex Price <[email protected]>
This commit adds a flag to the runtime config that allows overloading of the default privileged behaviour. When the flag is enabled on a runtime, host devices won't be appended to the runtime spec if the container is run as privileged. By default the flag is false to maintain the current behaviour of privileged. Fixes containerd#1213 Signed-off-by: Alex Price <[email protected]>
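The behaviour described in the commit message can be sketched as a list of spec options from which the host-device option is dropped when the flag is set. This is a toy model under stated assumptions: the `Spec` type, the option functions, and the `skipHostDevices` parameter are hypothetical stand-ins and do not reproduce the real OCI spec or the `customopts` package.

```go
package main

import "fmt"

// Spec is a toy stand-in for a runtime spec; the real spec is far richer.
type Spec struct {
	AllCapabilities bool
	WritableSysfs   bool
	Devices         []string
}

// SpecOpt mutates a Spec, in the style of functional "With" options.
type SpecOpt func(*Spec)

func withAllCapabilities(s *Spec) { s.AllCapabilities = true }
func withWritableSysfs(s *Spec)   { s.WritableSysfs = true }

// withHostDevices plays the role of the host-device option: it appends every
// host device to the spec.
func withHostDevices(hostDevices []string) SpecOpt {
	return func(s *Spec) { s.Devices = append(s.Devices, hostDevices...) }
}

// privilegedOpts assembles the options for a privileged container. When
// skipHostDevices is false (the default), behaviour is unchanged; when true,
// only the host-device option is left out.
func privilegedOpts(skipHostDevices bool, hostDevices []string) []SpecOpt {
	opts := []SpecOpt{withAllCapabilities, withWritableSysfs}
	if !skipHostDevices {
		opts = append(opts, withHostDevices(hostDevices))
	}
	return opts
}

// buildSpec applies the options in order to an empty Spec.
func buildSpec(opts []SpecOpt) Spec {
	var s Spec
	for _, opt := range opts {
		opt(&s)
	}
	return s
}

func main() {
	host := []string{"/dev/sda", "/dev/dm-0"}
	fmt.Printf("default: %+v\n", buildSpec(privilegedOpts(false, host)))
	fmt.Printf("flag on: %+v\n", buildSpec(privilegedOpts(true, host)))
}
```

The other privileged adjustments still apply in both cases, which is the point of the change: a VM-based runtime keeps the privileged container semantics inside the guest without inheriting the host's devices.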
Right now, when a container is specified as privileged, containerd/cri exposes all host devices to the container (https://github.com/containerd/cri/blob/master/pkg/server/container_create.go#L389). While the model works perfectly for runc, for VM-based container runtimes such as kata, many host devices do not make sense to access in the guest.
Can we add a runtime-based policy here so that a runtime can specify that no extra host devices are added to the generated container spec (e.g. not appending customopts.WithPrivilegedDevices)?
Related kata issue: kata-containers/runtime#1568
As mentioned in the kata issue comment, it is possible for kata to have its own privileged container translation, bypassing cri/containerd altogether. But I still think the best way is to fix it in containerd/cri.