-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for no_new_privs
via allowPrivilegeEscalation
#639
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,141 @@ | ||
# No New Privileges | ||
|
||
- [Description](#description) | ||
* [Interactions with other Linux primitives](#interactions-with-other-linux-primitives) | ||
- [Current Implementations](#current-implementations) | ||
* [Support in Docker](#support-in-docker) | ||
* [Support in rkt](#support-in-rkt) | ||
* [Support in OCI runtimes](#support-in-oci-runtimes) | ||
- [Existing SecurityContext objects](#existing-securitycontext-objects) | ||
- [Changes of SecurityContext objects](#changes-of-securitycontext-objects) | ||
- [Pod Security Policy changes](#pod-security-policy-changes) | ||
|
||
|
||
## Description | ||
|
||
In Linux, the `execve` system call can grant more privileges to a newly-created | ||
process than its parent process. Considering security issues, since Linux kernel | ||
v3.5, there is a new flag named `no_new_privs` added to prevent those new | ||
privileges from being granted to the processes. | ||
|
||
[`no_new_privs`](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt) | ||
is inherited across `fork`, `clone` and `execve` and can not be unset. With | ||
`no_new_privs` set, `execve` promises not to grant the privilege to do anything | ||
that could not have been done without the `execve` call. | ||
|
||
For more details about `no_new_privs`, please check the | ||
[Linux kernel documention](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt). | ||
|
||
This is different from `NOSUID` in that `no_new_privs`can give permission to | ||
the container process to further restrict child processes with seccomp. This | ||
permission goes only one-way in that the container process can not grant more | ||
permissions, only further restrict. | ||
|
||
### Interactions with other Linux primitives | ||
|
||
- suid binaries: will break when `no_new_privs` is enabled | ||
- seccomp2 as a non root user: requires `no_new_privs` | ||
- seccomp2 with dropped `CAP_SYS_ADMIN`: requires `no_new_privs` | ||
- ambient capabilities: requires `no_new_privs` | ||
- selinux transitions: bugs that were fixed documented [here](https://github.com/moby/moby/issues/23981#issuecomment-233121969) | ||
|
||
|
||
## Current Implementations | ||
|
||
### Support in Docker | ||
|
||
Since Docker 1.11, a user can specify `--security-opt` to enable `no_new_privs` | ||
while creating containers, for example | ||
`docker run --security-opt=no_new_privs busybox`. | ||
|
||
Docker provides via their Go api an object named `ContainerCreateConfig` to | ||
configure container creation parameters. In this object, there is a string | ||
array `HostConfig.SecurityOpt` to specify the security options. Client can | ||
utilize this field to specify the arguments for security options while | ||
creating new containers. | ||
|
||
This field did not scale well for the Docker client, so it's suggested that | ||
Kubernetes does not follow that design. | ||
|
||
This is not on by default in Docker. | ||
|
||
More details of the Docker implementation can be read | ||
[here](https://github.com/moby/moby/pull/20727) as well as the original | ||
discussion [here](https://github.com/moby/moby/issues/20329). | ||
|
||
### Support in rkt | ||
|
||
Since rkt v1.26.0, the `NoNewPrivileges` option has been enabled in rkt. | ||
|
||
More details of the rkt implementation can be read | ||
[here](https://github.com/rkt/rkt/pull/2677). | ||
|
||
### Support in OCI runtimes | ||
|
||
Since version 0.3.0 of the OCI runtime specification, a user can specify the | ||
`noNewPrivs` boolean flag in the configuration file. | ||
|
||
More details of the OCI implementation can be read | ||
[here](https://github.com/opencontainers/runtime-spec/pull/290). | ||
|
||
## Existing SecurityContext objects | ||
|
||
Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext` | ||
for `PodSpec`. `SecurityContext` objects define the related security options | ||
for Kubernetes containers, e.g. selinux options. | ||
|
||
To support "no new privileges" options in Kubernetes, it is proposed to make | ||
the following changes: | ||
|
||
## Changes of SecurityContext objects | ||
|
||
Add a new `*bool` type field named `allowPrivilegeEscalation` to the `SecurityContext` | ||
definition. | ||
|
||
By default, ie when `allowPrivilegeEscalation=nil`, we will set `no_new_privs=true` | ||
with the following exceptions: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Which cases are not covered by these exceptions? It seems like they cover all the cases (uid = 0, uid != 0). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if uid=0 then no_new_privs is true, if uid!=0 the no_new_privs is not set, does that tanble at the bottom clear it up for you... confused as to what you mean There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Perhaps I got confused because there are 2 fields named
In other words, if All other cases are exceptions and Is this correct? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes except if the container is run as privileged or adds CAP_SYS_ADMIN It is the first case in the table... when allowPrivilegeEscalation=nil |
||
|
||
- when a container is `privileged` | ||
- when `CAP_SYS_ADMIN` is added to a container | ||
- when a container is not run as root, uid `0` (to prevent breaking suid | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a bit unfortunate. We want users to not run as uid There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah they could explicitly set to false I guess but that behavior feels I thought about adding another option like "allowSetuid" I think you had mentioned it in the other thread as well, maybe there is a way to find one config toggle to cover everything that will also allow a sane default There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wish we knew how many people use suid binaries lol because I'd like to just have that be explicit, and not only set it when it's There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed. Maybe we can have a less secure default, but a more secure option that can be added using a single config change? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. could it be a kubelet config? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So I feel like you are combining "am I allowed to do this" and "what is the default " all into allowPrivilegeEscalation and I think that is going to cause a lot of confusion for users and add a lot of complexity to one Boolean causing us to potentially shoot ourselves in the foot trying to get a certain behavior. I suggest the following if we absolutely need a gate for "am I allowed to set the securityContext" then we should have a defaultAllowPrivilegeEscalation field AND a allowPrivilegeEscalation field, of course these could be named differently but I don't think we should bundle all this behavior into one field There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have no problem with that if folks like it. We've done similar things with the annotation fields that were added. Gives control at both levels which is the only thing I care about. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm ok with defaultAllow* and allow* (two fields, one to control the default, and one to describe whether it is allowed). Naming wise, the double allow is confusing - I don't have a strong opinion, but we should ensure users don't get confused about which field does what. Since "privileged" as defined in PSP as "user is allowed to request privileged", we can probably be consistent and use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. updated, thanks for your patience! I will be sure to include docs that make it explicit which field does what :) but makes sense to be consistent with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Don't have to thank me, you're the one who has to suffer through writing this :) |
||
binaries) | ||
|
||
The API will reject as invalid `privileged=true` and | ||
`allowPrivilegeEscalation=false`, as well as `capAdd=CAP_SYS_ADMIN` and | ||
`allowPrivilegeEscalation=false.` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jessfraz I don't see that this was implemented. Do we still need this check or we don't? |
||
|
||
When `allowPrivilegeEscalation` is set to `false` it will enable `no_new_privs` | ||
for that container. | ||
|
||
`allowPrivilegeEscalation` in `SecurityContext` provides container level | ||
control of the `no_new_privs` flag and can override the default in both directions | ||
of the `allowPrivilegeEscalation` setting. | ||
|
||
This requires changes to the Docker, rkt, and CRI runtime integrations so that | ||
kubelet will add the specific `no_new_privs` option. | ||
|
||
## Pod Security Policy changes | ||
|
||
The default can be set via a new `*bool` type field named `defaultAllowPrivilegeEscalation` | ||
in a Pod Security Policy. | ||
This would allow users to set `defaultAllowPrivilegeEscalation=false`, overriding the | ||
default `nil` behavior of `no_new_privs=false` for containers | ||
whose uids are not 0. | ||
|
||
This would also keep the behavior of setting the security context as | ||
`allowPrivilegeEscalation=true` | ||
for privileged containers and those with `capAdd=CAP_SYS_ADMIN`. | ||
|
||
To recap, below is a table defining the default behavior at the pod security | ||
policy level and what can be set as a default with a pod security policy. | ||
|
||
| allowPrivilegeEscalation setting | uid = 0 or unset | uid != 0 | privileged/CAP_SYS_ADMIN | | ||
|----------------------------------|--------------------|--------------------|--------------------------| | ||
| nil | no_new_privs=true | no_new_privs=false | no_new_privs=false | | ||
| false | no_new_privs=true | no_new_privs=true | no_new_privs=false | | ||
| true | no_new_privs=false | no_new_privs=false | no_new_privs=false | | ||
|
||
A new `bool` field named `allowPrivilegeEscalation` will be added to the Pod | ||
Security Policy as well to gate whether or not a user is allowed to set the | ||
security context to `allowPrivilegeEscalation=true`. This field will default to | ||
false. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW why we didn't implement
allowPrivilegeEscalation
forPodSecurityContext
?