Commit

Merge pull request #5 from giuseppe/userns-followup
userns KEP followup
rata authored Feb 3, 2022
2 parents 72618fc + 384f677 commit 6b4562b
Showing 1 changed file with 24 additions and 87 deletions.
111 changes: 24 additions & 87 deletions keps/sig-node/127-user-namespaces/README.md
@@ -22,18 +22,13 @@
- [Phases](#phases)
- [Phase 1: pods "without" volumes](#phase-1-pods-without-volumes)
- [Phase 2: pods with volumes](#phase-2-pods-with-volumes)
- [Phase 3: pod to pod isolation](#phase-3-pod-to-pod-isolation)
- [Phase 3: TBD](#phase-3-tbd)
- [Summary of the Proposed Changes](#summary-of-the-proposed-changes)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [pod.spec.useHostUsers graduation](#podspecusehostusers-graduation)
- [Alpha](#alpha)
- [Beta](#beta)
- [GA](#ga)
- [pod.spec.securityContext.userns.pod2podIsolation graduation](#podspecsecuritycontextusernspod2podisolation-graduation)
- [Alpha](#alpha-1)
- [Beta](#beta-1)
- [GA](#ga-1)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -71,14 +66,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*

## Summary

This KEP adds a new `hostUsers` field to `pod.Spec` to allow to enable/disable
using user namespaces for pods. Furthermore, it allows increased pod to pod
isolation by means of `pod.spec.securityContext.userns.pod2podIsolation` field.

It allows users to place pods in different user namespaces increasing the
pod-to-pod and pod-to-host isolation. This extra isolation increases the cluster
security as it protects the host and other pods from malicious or compromised
processes inside containers that are able to break into the host.
This KEP adds support for using user namespaces in pods.

## Motivation

@@ -149,6 +137,9 @@ Here we use UIDs, but the same applies for GIDs.

## Proposal

This KEP adds a new `hostUsers` field to `pod.Spec` that enables or disables
the use of user namespaces for pods.

This proposal aims to support running pods inside user namespaces. This will
improve the pod to node isolation (phase 1 and 2) and pod to pod isolation
(phase 3) we currently have.
@@ -173,7 +164,7 @@ kernel module with `CAP_SYS_MODULE`.
#### Story 3

As a cluster admin, I want to allow users to run their container as root
without that process having root privileged on the host, so I can mitigate the
without that process having root privileges on the host, so I can mitigate the
impact of a compromised container.

#### Story 4
@@ -185,7 +176,7 @@ host files).

#### Story 5

As a cluster admin, I want to use different host UIDs/GIDs for pods running in
As a cluster admin, I want to use different host UIDs/GIDs for pods running on
the same node (whenever kernel/kube features allow it), so I can mitigate the
impact a compromised pod can have on other pods and the node itself.

@@ -199,29 +190,30 @@ impact a compromised pod can have on other pods and the node itself.

## Design Details

Note: Names are preliminary yet, I'm using field names to simplify explanations.

### Pod.spec changes

The following changes will be done to the pod.spec:

- `pod.spec.useHostUsers`: bool.
- `pod.spec.hostUsers`: bool.
If true or not present, uses the host user namespace (as today)
If false, a new userns is created for the pod.
This field will be used for phase 1, 2 and 3.

- `pod.spec.securityContext.userns.pod2podIsolation`: Enum
If enabled, we will make the userns mappings be non-overlapping as much as possible.
This field will be used in phase 3.
By default it is set to `true`.
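
The semantics of `pod.spec.hostUsers` described above ("true or not present" versus "false") can be sketched in Go. This is only an illustration, not the KEP's implementation: `PodSpec` here is a hypothetical, trimmed-down type and `usesHostUserNamespace` is a made-up helper name. The pointer type is an assumption used to distinguish an unset field from an explicit `false`.

```go
package main

import "fmt"

// PodSpec is a hypothetical, trimmed-down stand-in for the real pod spec.
// Only the field discussed in this section is modelled.
type PodSpec struct {
	// HostUsers mirrors the proposed pod.spec.hostUsers field.
	// A nil pointer means the field is not present.
	HostUsers *bool
}

// usesHostUserNamespace (hypothetical helper) reports whether the pod keeps
// using the host user namespace: true when the field is unset or true,
// false only when the field is explicitly set to false.
func usesHostUserNamespace(spec PodSpec) bool {
	return spec.HostUsers == nil || *spec.HostUsers
}

func main() {
	yes, no := true, false
	specs := []PodSpec{
		{},                // field not present
		{HostUsers: &yes}, // explicit true
		{HostUsers: &no},  // explicit false: a new userns is created
	}
	for _, spec := range specs {
		fmt.Println(usesHostUserNamespace(spec))
	}
}
```

Only the explicit `false` case opts the pod into a new user namespace, so existing pods that never set the field keep today's behaviour.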

### Phases

We propose to divide the work into 3 phases. Each phase makes this work with
either more isolation or more workloads. When no support is yet added to handle
some workload, a clear error will be shown.

PLEASE note that only phase 1 is targeted for alpha. Also note that the missing
details (CRI changes, changes needed in container runtimes, etc.) will be added
in follow-up PRs.

Please note the last sub-section here is a table with the summary of the changes
proposed on each phase.
proposed on each phase. That table is out of date (it comes from the initial
proposal and does not reflect all the feedback and adjustments we discussed) but
can still be useful as a general overview.


#### Phase 1: pods "without" volumes

@@ -267,60 +259,11 @@ listed vulnerabilities (as the host is protected from the container). It is also
a trivial next-step to take, given that we have phase 1 implemented: just return
the same mapping if the pod has other volumes.

#### Phase 3: pod to pod isolation

This phase will provide more isolation between pods that use volumes (as in
phase 2) and requires another opt-in field:
`pod.spec.securityContext.pod2podIsolation`.

This phase will try to not share the same mapping for all pods with volumes, as
phase 2 does, but to achieve it some trade off needs to be made. This phase
builds on the work of the previous phases and more details will be defined while
the other phases evolve.

Here are some ideas so far:

One idea is to give different mappings to pods in different k8s namespaces or
that use a different service account. This needs to be explored in further
detail, but will probably impose limits to which workloads can run this (we need
to expose a shorter mapping, less than 65535).

Another idea is to use id mapped mounts. This probably needs changes to the
OCI runtime-spec, only works with certain filesystems and kernels that may take
too long for some users to get (like managed services). Giuseppe started to
experiment in crun with this
[here](https://github.com/containers/crun/pull/780).

The value for `pod.spec.securityContext.pod2podIsolation` will be an enum, to
select different strategies and allow room for future improvements.

It is being considered having a value that is "auto" for this fields, that
will select the best strategy that your node supports. However, as different
strategies will change the effective UID a container uses, if we add such an
option the documentation will be VERY clear about the implications and
automatizations will be provided whenever possible (we have some ideas on this
front).

Another improvement suggested by @ddebroy to do here is:
* Pods using also only [local ephemeral CSI volumes][csi-ephemeral-vol], as
they share the same lifecycle of the pod, can be moved to use non-overlapping
mappings.

This change can probably be done under the hood without the user noticing, to
achieve more pod 2 pod isolation, and might not need the user to use
`pod.spec.securityContext.pod2podIsolation`. However, some changes for the CSI
vol to use the effective UID/GID might be needed and not trivial. @ddebroy has
[kindly offered to help][csi-help] with this improvement

[csi-ephemeral-vol]: https://kubernetes-csi.github.io/docs/ephemeral-local-volumes.html#overview
[csi-help]: https://github.com/kubernetes/enhancements/pull/3065/files#r762046107

If this phase turns out to be a lot of work, it will be left out as future work
for other KEPs.
#### Phase 3: TBD

### Summary of the Proposed Changes

[This table](https://docs.google.com/presentation/d/1z4oiZ7v4DjWpZQI2kbFbI8Q6botFaA07KJYaKA-vZpg/edit#slide=id.gfd10976c8b_1_41) gives you a quick overview of each phase.
[This table](https://docs.google.com/presentation/d/1z4oiZ7v4DjWpZQI2kbFbI8Q6botFaA07KJYaKA-vZpg/edit#slide=id.gfd10976c8b_1_41) gives you a quick overview of each phase (note it is outdated, but still useful for a general overview).


### Test Plan
@@ -347,23 +290,17 @@ TBD

### Graduation Criteria

Graduation for each pod.spec field we introduce will be separate.

#### pod.spec.useHostUsers graduation

##### Alpha
- Phase 1 implemented

##### Beta

##### GA

#### pod.spec.securityContext.userns.pod2podIsolation graduation

##### Alpha

##### Beta

##### GA
- Make plans on whether, when, and how to enable by default
- Should we reconsider making the mappings smaller by default?
- Should we allow any way for users to ask for "more" IDs mapped? If yes, how many more and how?
- Should we allow the user to ask for specific mappings?

### Upgrade / Downgrade Strategy
