Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for no_new_privs via allowPrivilegeEscalation #639

Merged
merged 2 commits into from
May 25, 2017
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
141 changes: 141 additions & 0 deletions contributors/design-proposals/no-new-privs.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,141 @@
# No New Privileges

- [Description](#description)
* [Interactions with other Linux primitives](#interactions-with-other-linux-primitives)
- [Current Implementations](#current-implementations)
* [Support in Docker](#support-in-docker)
* [Support in rkt](#support-in-rkt)
* [Support in OCI runtimes](#support-in-oci-runtimes)
- [Existing SecurityContext objects](#existing-securitycontext-objects)
- [Changes of SecurityContext objects](#changes-of-securitycontext-objects)
- [Pod Security Policy changes](#pod-security-policy-changes)


## Description

In Linux, the `execve` system call can grant more privileges to a newly-created
process than its parent process. Considering security issues, since Linux kernel
v3.5, there is a new flag named `no_new_privs` added to prevent those new
privileges from being granted to the processes.

[`no_new_privs`](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt)
is inherited across `fork`, `clone` and `execve` and can not be unset. With
`no_new_privs` set, `execve` promises not to grant the privilege to do anything
that could not have been done without the `execve` call.

For more details about `no_new_privs`, please check the
[Linux kernel documention](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt).

This is different from `NOSUID` in that `no_new_privs`can give permission to
the container process to further restrict child processes with seccomp. This
permission goes only one-way in that the container process can not grant more
permissions, only further restrict.

### Interactions with other Linux primitives

- suid binaries: will break when `no_new_privs` is enabled
- seccomp2 as a non root user: requires `no_new_privs`
- seccomp2 with dropped `CAP_SYS_ADMIN`: requires `no_new_privs`
- ambient capabilities: requires `no_new_privs`
- selinux transitions: bugs that were fixed documented [here](https://github.com/moby/moby/issues/23981#issuecomment-233121969)


## Current Implementations

### Support in Docker

Since Docker 1.11, a user can specify `--security-opt` to enable `no_new_privs`
while creating containers, for example
`docker run --security-opt=no_new_privs busybox`.

Docker provides via their Go api an object named `ContainerCreateConfig` to
configure container creation parameters. In this object, there is a string
array `HostConfig.SecurityOpt` to specify the security options. Client can
utilize this field to specify the arguments for security options while
creating new containers.

This field did not scale well for the Docker client, so it's suggested that
Kubernetes does not follow that design.

This is not on by default in Docker.

More details of the Docker implementation can be read
[here](https://github.com/moby/moby/pull/20727) as well as the original
discussion [here](https://github.com/moby/moby/issues/20329).

### Support in rkt

Since rkt v1.26.0, the `NoNewPrivileges` option has been enabled in rkt.

More details of the rkt implementation can be read
[here](https://github.com/rkt/rkt/pull/2677).

### Support in OCI runtimes

Since version 0.3.0 of the OCI runtime specification, a user can specify the
`noNewPrivs` boolean flag in the configuration file.

More details of the OCI implementation can be read
[here](https://github.com/opencontainers/runtime-spec/pull/290).

## Existing SecurityContext objects

Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext`
for `PodSpec`. `SecurityContext` objects define the related security options
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW why we didn't implement allowPrivilegeEscalation for PodSecurityContext?

for Kubernetes containers, e.g. selinux options.

To support "no new privileges" options in Kubernetes, it is proposed to make
the following changes:

## Changes of SecurityContext objects

Add a new `*bool` type field named `allowPrivilegeEscalation` to the `SecurityContext`
definition.

By default, ie when `allowPrivilegeEscalation=nil`, we will set `no_new_privs=true`
with the following exceptions:
Copy link
Contributor

@php-coder php-coder May 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which cases are not covered by these exceptions? It seems like they cover all the cases (uid = 0, uid != 0).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if uid=0 then no_new_privs is true, if uid!=0 the no_new_privs is not set, does that tanble at the bottom clear it up for you... confused as to what you mean

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confused as to what you mean

Perhaps I got confused because there are 2 fields named allowPrivilegeEscalation and table bellow explains the 2nd in the PSP while here we're talking about SC.

By default, ie when allowPrivilegeEscalation=nil, we will set no_new_privs=true with the following exceptions:

In other words, if allowPrivilegeEscalation=nil, we will set no_new_privs=true when SecurityContext.RunAsUser=nil or SecurityContext.RunAsUser=0.

All other cases are exceptions and no_new_privs will be set to false.

Is this correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes except if the container is run as privileged or adds CAP_SYS_ADMIN

It is the first case in the table... when allowPrivilegeEscalation=nil


- when a container is `privileged`
- when `CAP_SYS_ADMIN` is added to a container
- when a container is not run as root, uid `0` (to prevent breaking suid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit unfortunate. We want users to not run as uid 0 ideally. Will cluster admins have an option to break setuid binaries if they want to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah they could explicitly set to false I guess but that behavior feels
weird since false is a default...
Maybe a Boolean for this is wrong

I thought about adding another option like "allowSetuid" I think you had mentioned it in the other thread as well, maybe there is a way to find one config toggle to cover everything that will also allow a sane default

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish we knew how many people use suid binaries lol because I'd like to just have that be explicit, and not only set it when it's 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Maybe we can have a less secure default, but a more secure option that can be added using a single config change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could it be a kubelet config?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I feel like you are combining "am I allowed to do this" and "what is the default " all into allowPrivilegeEscalation and I think that is going to cause a lot of confusion for users and add a lot of complexity to one Boolean causing us to potentially shoot ourselves in the foot trying to get a certain behavior. I suggest the following if we absolutely need a gate for "am I allowed to set the securityContext" then we should have a defaultAllowPrivilegeEscalation field AND a allowPrivilegeEscalation field, of course these could be named differently but I don't think we should bundle all this behavior into one field

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no problem with that if folks like it. We've done similar things with the annotation fields that were added. Gives control at both levels which is the only thing I care about.

Copy link
Contributor

@smarterclayton smarterclayton May 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with defaultAllow* and allow* (two fields, one to control the default, and one to describe whether it is allowed). Naming wise, the double allow is confusing - I don't have a strong opinion, but we should ensure users don't get confused about which field does what. Since "privileged" as defined in PSP as "user is allowed to request privileged", we can probably be consistent and use allowPrivilegeEscalation in the same sense.

Copy link
Contributor Author

@jessfraz jessfraz May 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, thanks for your patience! I will be sure to include docs that make it explicit which field does what :) but makes sense to be consistent with privileged

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't have to thank me, you're the one who has to suffer through writing this :)

binaries)

The API will reject as invalid `privileged=true` and
`allowPrivilegeEscalation=false`, as well as `capAdd=CAP_SYS_ADMIN` and
`allowPrivilegeEscalation=false.`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jessfraz I don't see that this was implemented. Do we still need this check or we don't?


When `allowPrivilegeEscalation` is set to `false` it will enable `no_new_privs`
for that container.

`allowPrivilegeEscalation` in `SecurityContext` provides container level
control of the `no_new_privs` flag and can override the default in both directions
of the `allowPrivilegeEscalation` setting.

This requires changes to the Docker, rkt, and CRI runtime integrations so that
kubelet will add the specific `no_new_privs` option.

## Pod Security Policy changes

The default can be set via a new `*bool` type field named `defaultAllowPrivilegeEscalation`
in a Pod Security Policy.
This would allow users to set `defaultAllowPrivilegeEscalation=false`, overriding the
default `nil` behavior of `no_new_privs=false` for containers
whose uids are not 0.

This would also keep the behavior of setting the security context as
`allowPrivilegeEscalation=true`
for privileged containers and those with `capAdd=CAP_SYS_ADMIN`.

To recap, below is a table defining the default behavior at the pod security
policy level and what can be set as a default with a pod security policy.

| allowPrivilegeEscalation setting | uid = 0 or unset | uid != 0 | privileged/CAP_SYS_ADMIN |
|----------------------------------|--------------------|--------------------|--------------------------|
| nil | no_new_privs=true | no_new_privs=false | no_new_privs=false |
| false | no_new_privs=true | no_new_privs=true | no_new_privs=false |
| true | no_new_privs=false | no_new_privs=false | no_new_privs=false |

A new `bool` field named `allowPrivilegeEscalation` will be added to the Pod
Security Policy as well to gate whether or not a user is allowed to set the
security context to `allowPrivilegeEscalation=true`. This field will default to
false.