Skip to content
This repository has been archived by the owner on Mar 9, 2022. It is now read-only.

runtime based privileged container translation #1213

Closed
bergwolf opened this issue Jul 30, 2019 · 28 comments · Fixed by #1225
Closed

runtime based privileged container translation #1213

bergwolf opened this issue Jul 30, 2019 · 28 comments · Fixed by #1225
Milestone

Comments

@bergwolf
Copy link

bergwolf commented Jul 30, 2019

Right now when a container is specified as privileged, containerd/cri would expose all host devices to the container (https://github.com/containerd/cri/blob/master/pkg/server/container_create.go#L389). While the model works perfectly for runc, for vm based container such as kata, -- a lot of host devices doesn't make sense to be accessed in the guest.

Can we add runtime based policy here so that it is possible for a runtime to specify that no extra host devices to be added to the generated container spec (e.g. not appending customopts.WithPrivilegedDevices)?

Related kata issue: kata-containers/runtime#1568

As mentioned in the kata issue comment, it is possible for kata to have its own privileged container translation bypassing cri/containerd all together. But I still think the best way is to fix it in containerd/cri.

@AkihiroSuda
Copy link
Member

Why not use RuntimeClass with kata-runtime --privileged discussed in kata-containers/runtime#1568 ?

@bergwolf
Copy link
Author

bergwolf commented Jul 30, 2019

@AkihiroSuda runtimeclass works at sandbox level. We would want something at the container level. As @dadux explained in his use case, only one container in a pod is created in privileged mode, and other containers do not have extra privileges. We can still make it (kata's own translation) work by introducing a container annotation but that is not the purpose of this issue.

The purpose of this issue is to let containerd/cri still handle privileged container spec translation, instead of introducing kata's own translation and bypassing cri/containerd all together.

@AkihiroSuda
Copy link
Member

I thought Kata was creating a VM per a pod, not per a container - the situation has changed?

@bergwolf
Copy link
Author

No, it is not changed, still one vm per pod.

@AkihiroSuda
Copy link
Member

If a container (w/o access to the host devices) got compromised, is the attacker able to see the host devices in the pod VM?

@dadux
Copy link

dadux commented Jul 30, 2019

Yes, for instance If there's any bad kernel CVE allowing a container escape, an attacker could then access any of the host devices mounted in the kata virtual machines - root fs, /dev/dm-X from other containers, etc...
Hence why we're trying to find a solution for a pseudo privileged mode, not mounting all devices in the vm containers.

@AkihiroSuda
Copy link
Member

Then I think that mode should be pod-level (VM-level), not container-level.

@bergwolf
Copy link
Author

@AkihiroSuda If we can drop WithPrivilegedDevices for kata containers, it is safe to keep privileged mode at the container level, because no host devices are passed to guest implicitly.

@egernst
Copy link
Contributor

egernst commented Jul 30, 2019

/cc @Random-Liu

@egernst
Copy link
Contributor

egernst commented Jul 30, 2019

To clarify the point (restate what @bergwolf is already saying):

Privileged today

While privileged isn't a very popular option, there are some scenarios where it is useful, as it will make adjustments to the container aside from just devices/caps: see current handling, as an example. There isn't a way to express these in a container spec today, so onward with privileged.

Desired change

For non-host based runtimes (ie, Kata-Containers), passing in devices doesn't make sense. However, access to the guest kernel and sysfs for these specific privileged use cases (ex, dind) are useful while still limiting exposure to the host. We'd like to see device addition, WithPrivilegedDevices, be configurable so we can drop this in the case of Kata Containers. With this:

  • no devices are implicitly added from host -> guest in case of Kata
  • caps / FS access (in the guest) remain the same for the privileged container (and only privileged container).
  • other containers in the pod remain same (default permissions/caps)

Alternative - just create new annotation in Kata

We had considered adding an annotation for Kata, allowing end users to specify using a kata-privileged mode. Unfortunately, annotations are consumed at the pod level. Security profile should be applied at container granularity (as it is done today).

@Random-Liu
Copy link
Member

Random-Liu commented Jul 30, 2019

  • Option 1: Overload Privileged in different runtimes - for different runtimes Privileged means different behaviors. If this is the case, containerd needs to know what is the expected behavior for a specific runtime, there are 3 ways to achieve that:
    • Option 1.1: RuntimeClass. The behavior is defined in RuntimeClass as detailed Privileged Policy, e.g. DropAllCaps, AllHostDevices etc.
    • Option 1.2: Runtime. The Privileged field is passed into the runtime directly as an OCI annotation, and the runtime/shim decides how to handle it itself.
    • Option 1.3: Containerd/CRI plugin. Making the behavior configurable or hard coded in the CRI plugin.
  • Option 2: Disallow Privileged for non-linux container runtimes, use/add other fields. We keep the privileged semantic as it is. For other runtimes, if users need more privilege, they should define each security fields directly, e.g. Caps, Seccomp etc., instead of specifying Privileged. To improve UX, an annotation can be defined, and an admission webhook can auto convert the annotation and set each security field.

I prefer option 1.1 or option 2.

@egernst
Copy link
Contributor

egernst commented Jul 30, 2019

1.1 seems to make sense, though 1.3 is a decent short term method as well that allows usage if you aren't using a runtimeClass aware orchestrator? (I'm really not sure how widespread this would be).

For Option 2: is this feasible at container granularity?

@Random-Liu
Copy link
Member

Random-Liu commented Jul 30, 2019

For Option 2: is this feasible at container granularity?

Yeah, the annotation can contain container name, that is how experimental seccomp is designed today. Once we have enough implementation, we can eventually consolidate different implementations and graduate it to something similar to option 1.1.

@egernst
Copy link
Contributor

egernst commented Jul 30, 2019

Sorry - can you point to existing example for option 2? @Random-Liu

@jterry75
Copy link
Contributor

@Random-Liu - +1 to Option 1.1.

@Random-Liu
Copy link
Member

Sorry - can you point to existing example for option 2?

Hm, I can't find an official doc. It's like this https://kubesec.io/basics/metadata-annotations-seccomp-security-alpha-kubernetes-io-pod/

@bergwolf
Copy link
Author

+1 to option 1.1. runtimeclass is a perfect place for runtime specific properties.

While changes to runtimeclass might take sometime, we might consider option 1.3 as a short term workaround, e.g., adding PrivilegedPolicy to type Runtime struct that would allow users to configure runtime privileged policy in cri plugin config.

As for option 2, I'd love to see the method working for both kubernetes and moby, but I'm not sure how moby can handle it given its limited command line options.

@egernst
Copy link
Contributor

egernst commented Jul 31, 2019

Agreed with short term 1.3 and 1.1 medium term @bergwolf.

Wdyt @Random-Liu ?

@awprice
Copy link
Contributor

awprice commented Jul 31, 2019

For option 1.3, could this be a possible configuration format?

[plugins]
  [plugins.cri]
   [plugins.cri.containerd]
    [plugins.cri.containerd.default_runtime]
      runtime_type = "io.containerd.runtime.v1.linux"
--->  privileged_all_host_devices = true
    [plugins.cri.containerd.runtimes.kata]
      runtime_type = "io.containerd.kata.v2"
--->  privileged_all_host_devices = false
      [plugins.cri.containerd.runtimes.kata.options]
        ConfigPath = "/opt/kata/share/defaults/kata-containers/configuration.toml"

@bergwolf
Copy link
Author

@awprice Yes, I was thinking something similar. We can discuss if we want to let users configure ALL privileged container properties, or just the host device one.

@Random-Liu
Copy link
Member

Random-Liu commented Jul 31, 2019

As we discussed, eventually this should come from RuntimeClass (option 1.1), but I'm OK with starting from containerd config (option 1.3), which is the traditional way of adding new features right now. :P

Let's start from #1213 (comment), and extend/redesign it in the future if more behavior differences are needed.

However, I still want to put some ideas about option 1.1 here for future Kubernetes API design. Also ref kubernetes/kubernetes#44503.

Today's privileged is a shortcut of a group of smaller policies. Thanks to containerd's good client design, those policies are clearly listed as OCI spec options, and each of the option can be easily converted to a policy name:

  • customopts.WithPrivilegedDevices -> PrivilegedDevices or AllHostDevices
  • WithAllCapabilities -> AllCapabilities
  • WithMaskedPaths(nil) -> UnmaskedProcMount
  • WithReadonlyPaths(nil) -> UnmaskedProcMount
  • WithWriteableSysfs -> WritableSysfs
  • WithWriteableCgroupfs -> WritableCgroupfs
  • WithSeccompUnconfined -> SeccompUnconfined
  • WithSelinuxLabel("") -> NoSelinux or NoLSM
  • WithApparmorProfile("") -> NoApparmor or NoLSM

In the future, if we add a new privileged behavior for a specific runtime, I believe we'll also have a well-defined "With" option, e.g. WithAllSandboxDevices for kata.

So Default Privileged == [AllHostDevices, AllCapabilities, UnmaskedProcMount, WritableSysfs, WritableCgroupfs, SeccompUnconfined, NoLSM].

If for a specific runtime, we want to change the privileged behavior, we just need to add/drop policies from that list.

I think in Kubernetes, we should:

  1. Add missing primitives in the pod spec: PrivilegedDevices: true/false, SysfsMountType: writable/readonly, CgroupfsType: writable/readonly. (ProcMount is a very good example of this effort)
  2. Define privileged behavior in the Kubernetes api (RuntimeClass or PodSecurityPolicy) based on those primitives.

I like this idea, because:

  1. It makes privileged behavior clear to users;
  2. All the new primitives added give users more control to their workload, and make it possible to give finer grained privilege to their workload, instead of just privileged.

@awprice
Copy link
Contributor

awprice commented Jul 31, 2019

@Random-Liu +1 to adding this functionality to Kubernetes, as it's definitely a more rebust design and gives more fine grained control.

I'm happy to make a start on option 1.3 now, as it is much needed functionality for @dadux and I.

@amshinde
Copy link
Contributor

With 1.3 in mind, how about defining a privileged policy for the runtime at the CRI level instead of individual flags and later introduce this in the RuntimeClass itself with 1.1, something like:

[plugins.cri.containerd.runtimes.kata]
      runtime_type = "io.containerd.kata.v2"
      privileged = [AllCapabilities, UnmaskedProcMount, WritableSysfs, WritableCgroupfs, NoLSM]

I really like the idea of individual container specific primitives in the pod spec itself, as this would allow a finer grained control and something that we have wanted for quite some time.

@Random-Liu
Copy link
Member

Random-Liu commented Jul 31, 2019

@amshinde I thought about that, but found that doesn't have too much benefit at the containerd level, because users can't use the newly introduced primitives.

I doubt that there will be many different privileged behaviors, so it seems to be an overkill to define that policy in containerd. If it turns out that we do need many different privileged behaviors, we can introduce the policy list at that time. :)

But I still think it makes sense to do that in the Kubernetes api, because it makes the api more self-contained, and all the new primitives are useful to users. And it makes it possible to define other shortcut/policy besides Privileged, e.g. we can define a generic CSI security spec for all CSI plugins based on all those primitives. /cc @jingxu

@Random-Liu
Copy link
Member

Random-Liu commented Jul 31, 2019

I'm happy to make a start on option 1.3 now, as it is much needed functionality for @dadux and I.

SGTM @awprice

@awprice
Copy link
Contributor

awprice commented Jul 31, 2019

@Random-Liu Thoughts on using the configuration format that @amshinde has proposed for option 1.3? I feel like it is much cleaner vs having a large amount of booleans.

@bergwolf
Copy link
Author

bergwolf commented Aug 3, 2019

@awprice I think @Random-Liu has expressed that it seems to be an overkill. I agree with him that a single boolean is sufficient for now. In the long term, we can make it configurable via pod spec instead of adding primitives in containerd that are not visible to users.

@Random-Liu
Copy link
Member

I'll mark this 1.3 for now.

For short term solution proposed in #1213 (comment).

@Random-Liu Random-Liu added this to the v1.3 milestone Aug 5, 2019
awprice added a commit to awprice/cri that referenced this issue Aug 7, 2019
awprice added a commit to awprice/cri that referenced this issue Aug 7, 2019
This commit adds a flag to the runtime config that allows overloading of the default
privileged behaviour. When the flag is enabled on a runtime, host devices won't
be appended to the runtime spec if the container is run as privileged.

By default the flag is false to maintain the current behaviour of privileged.

Fixes containerd#1213

Signed-off-by: Alex Price <[email protected]>
awprice added a commit to awprice/cri that referenced this issue Aug 8, 2019
This commit adds a flag to the runtime config that allows overloading of the default
privileged behaviour. When the flag is enabled on a runtime, host devices won't
be appended to the runtime spec if the container is run as privileged.

By default the flag is false to maintain the current behaviour of privileged.

Fixes containerd#1213

Signed-off-by: Alex Price <[email protected]>
awprice added a commit to awprice/cri that referenced this issue Aug 8, 2019
This commit adds a flag to the runtime config that allows overloading of the default
privileged behaviour. When the flag is enabled on a runtime, host devices won't
be appended to the runtime spec if the container is run as privileged.

By default the flag is false to maintain the current behaviour of privileged.

Fixes containerd#1213

Signed-off-by: Alex Price <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants