Showing 1 changed file with 336 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,336 @@ | ||
--- | ||
title: Skip SELinux relabeling of volumes | ||
authors: | ||
- "@jsafrane" | ||
owning-sig: sig-storage | ||
participating-sigs: | ||
- sig-auth | ||
- sig-node | ||
reviewers: | ||
- "@msau42" | ||
- "@liggit" | ||
- "@tallclair" | ||
approvers: | ||
- "@saad-ali" | ||
editor: TBD | ||
creation-date: 2020-02-18 | ||
last-updated: 2020-02-18 | ||
status: provisional | ||
see-also: | ||
- /keps/sig-storage/20200120-skip-permission-change.md | ||
replaces: | ||
superseded-by: | ||
|
||
--- | ||
|
||
# Skip SELinux relabeling of volumes | ||
|
||
## Table of Contents | ||
|
||
<!-- toc --> | ||
- [Release Signoff Checklist](#release-signoff-checklist) | ||
- [Summary](#summary) | ||
- [Motivation](#motivation) | ||
- [SELinux intro](#selinux-intro) | ||
- [SELinux context assignment](#selinux-context-assignment) | ||
- [Volumes](#volumes) | ||
- [Goals](#goals) | ||
- [Non-Goals](#non-goals) | ||
- [Proposal](#proposal) | ||
- [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) | ||
- [<code>mount -o context</code>](#mount--o-context) | ||
- [New Kubernetes behavior](#new-kubernetes-behavior) | ||
- [Shared volumes](#shared-volumes) | ||
- [User Stories [optional]](#user-stories-optional) | ||
- [Story 1](#story-1) | ||
- [Story 2](#story-2) | ||
- [Story 3](#story-3) | ||
- [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional-1) | ||
- [Risks and Mitigations](#risks-and-mitigations) | ||
- [Design Details](#design-details) | ||
- [Test Plan](#test-plan) | ||
- [Graduation Criteria](#graduation-criteria) | ||
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) | ||
- [Version Skew Strategy](#version-skew-strategy) | ||
- [Implementation History](#implementation-history) | ||
- [Drawbacks [optional]](#drawbacks-optional) | ||
- [Alternatives [optional]](#alternatives-optional) | ||
- [<code>FSGroupChangePolicy</code> approach](#fsgroupchangepolicy-approach) | ||
- [Change container runtime](#change-container-runtime) | ||
<!-- /toc --> | ||
|
||
## Release Signoff Checklist | ||
|
||
- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR) | ||
- [ ] KEP approvers have set the KEP status to `implementable` | ||
- [ ] Design details are appropriately documented | ||
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input | ||
- [ ] Graduation criteria is in place | ||
- [ ] "Implementation History" section is up-to-date for milestone | ||
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] | ||
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes | ||
|
||
## Summary | ||
|
||
This KEP aims to speed up how volumes (including persistent volumes) are made available to Pods on systems with SELinux in enforcing mode. | ||
Currently, all files on a volume are recursively relabeled before a container can start, which is slow for large volumes. | ||
|
||
## Motivation | ||
|
||
### SELinux intro | ||
On Linux machines with SELinux in enforcing mode, SELinux tries to prevent users who have escaped from a container from accessing the host OS and other containers running on the host. | ||
It does so by running each container with a unique *SELinux context* (such as `system_u:system_r:container_t:s0:c309,c383`; shortened to `s0:c309,c383` in the rest of this text) and labeling all content on all of the container's volumes with the matching label (`s0:c309,c383`). | ||
Only a process with the context `s0:c309,c383` can access files with the label `s0:c309,c383`, even if the process runs as root. | ||
Therefore a rogue user cannot access potentially secret data of other containers, because the volumes of each container have a different label. | ||
|
||
See [SELinux documentation](https://selinuxproject.org/page/NB_MLS) for more details. | ||
|
||
### SELinux context assignment | ||
In Kubernetes, the SELinux context of a pod is assigned in two ways: | ||
1. Either it is set by the user in the PodSpec or Container: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/. | ||
1. If it is not set in the Pod/Container, the container runtime allocates a new unique SELinux context and assigns it to the pod (container) by itself. | ||
|
||
### Volumes | ||
Currently Kubernetes *knows* which volume plugins support SELinux (i.e. support extended attributes on the filesystem the plugin provides). | ||
If SELinux is supported for a volume, kubelet passes the volume to the container runtime with the ":Z" option ("private unshared"). | ||
The container runtime then **recursively relabels** all files on the volume to either the label set in PodSpec/Container or the random value allocated by the container runtime itself. | ||
|
||
**This relabeling has to traverse the whole volume and can be slow for volumes with a large number of files.** | ||
|
||
### Goals | ||
|
||
Optionally (chosen by user), do not recursively relabel content of the volumes. | ||
|
||
### Non-Goals | ||
|
||
Change container runtimes / CRI. | ||
|
||
## Proposal | ||
|
||
Offer an option in `Pod.Spec.PodSecurityContext` to *mount* volumes with the right labels instead of recursively relabeling them: | ||
|
||
```go | ||
type SELinuxRelabelPolicy string | ||
|
||
const ( | ||
	Mount         SELinuxRelabelPolicy = "Mount" | ||
	AlwaysRelabel SELinuxRelabelPolicy = "Always" | ||
) | ||
|
||
type PodSecurityContext struct { | ||
	// SELinuxRelabelPolicy ← new field | ||
	// Defines behavior of changing SELinux labels of the volume before being exposed inside Pod. | ||
	// Valid values are "Mount" and "Always". If not specified, "Always" is used. | ||
	// "Always" policy recursively changes SELinux labels on all files on all volumes used by the Pod. | ||
	// "Mount" tries to mount volumes used by the Pod with the right context and skips recursive | ||
	// relabeling. | ||
	// +optional | ||
	SELinuxRelabelPolicy *SELinuxRelabelPolicy | ||
	// For context: | ||
	// fsGroupChangePolicy defines behavior of changing ownership and permission of the volume | ||
	// before being exposed inside Pod. This field will only apply to | ||
	// volume types which support fsGroup based ownership (and permissions). | ||
	// It will have no effect on ephemeral volume types such as: secret, configmaps | ||
	// and emptydir. | ||
	// Valid values are "OnRootMismatch" and "Always". If not specified defaults to "Always". | ||
	// +optional | ||
	FSGroupChangePolicy *PodFSGroupChangePolicy `json:"fsGroupChangePolicy,omitempty" protobuf:"bytes,9,opt,name=fsGroupChangePolicy"` | ||
	// ... (remaining fields omitted) | ||
} | ||
``` | ||
|
||
See https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20200120-skip-permission-change.md for a similar API for fsGroup ownership change. | ||
This KEP should follow the API provided for fsGroup closely; however, the implementation differs (mount vs. recursive `chown`). | ||
|
||
### Implementation Details/Notes/Constraints [optional] | ||
|
||
#### `mount -o context` | ||
A Linux kernel with SELinux compiled in allows `mount -o context=s0:c309,c383 <what> <where>` to mount a volume and present all files on the volume as having the given SELinux label. | ||
This works only for the first mount of the volume! | ||
It does not work for bind mounts or any subsequent mount of the same volume. | ||
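
As a minimal sketch of how a caller could add the `context=` option before the first mount, the Go snippet below builds the option list and the argument vector for `mount`. The helper names `withSELinuxContext` and `mountArgs` are hypothetical and are not kubelet APIs. The label is quoted because a full label contains a comma, which would otherwise be read as a separator between mount options.

```go
package main

import "strings"

// withSELinuxContext returns a copy of opts with context="<label>" appended.
// Hypothetical helper for illustration only; it is not a kubelet function.
// If label is empty or opts already contains a context= option, the copy is
// returned unchanged so the option is never passed twice.
func withSELinuxContext(opts []string, label string) []string {
	out := append([]string(nil), opts...)
	if label == "" {
		return out
	}
	for _, o := range out {
		if strings.HasPrefix(o, "context=") {
			return out
		}
	}
	return append(out, `context="`+label+`"`)
}

// mountArgs builds the arguments for `mount [-o opts] <what> <where>`.
func mountArgs(what, where string, opts []string) []string {
	args := []string{}
	if len(opts) > 0 {
		args = append(args, "-o", strings.Join(opts, ","))
	}
	return append(args, what, where)
}
```

For example, `mountArgs("/dev/sdb", "/mnt/vol", withSELinuxContext([]string{"rw"}, label))` yields the arguments for a single mount that carries both `rw` and the context option.
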
|
||
### New Kubernetes behavior | ||
|
||
* When kubelet *knows* the SELinux context of a pod / container to run (i.e. Pod/Container contains `SELinuxOptions`) and `SELinuxRelabelPolicy` is `Mount`, it tries to mount all volumes for the Pod with the given SELinux label using `mount -o context=XYZ`. | ||
Kubelet makes sure that the option is passed to the first mount in all in-tree volume plugins (incl. ephemeral volumes like Secrets). | ||
Kubelet passes it as a mount option to all CSI calls for the given volume. | ||
|
||
After the volume is mounted, kubelet checks that the root of the volume has the expected SELinux label, i.e. that the volume was mounted correctly. | ||
* If the volume root has the expected label, kubelet passes the volume to the container runtime without any ":z" or ":Z" options - no relabeling is necessary. | ||
* If the volume root has an unexpected label, for example when the CSI driver did not apply `-o context` correctly or the volume was already mounted: | ||
* If the volume supports SELinux (i.e. has the `seclabel` mount option in `/proc/mounts`), kubelet passes ":Z" to the container runtime for the volume. This is the current kubelet behavior. | ||
* If the volume does not support SELinux, kubelet does not pass any ":Z" option to the container runtime. | ||
|
||
* Nothing changes if kubelet does not know the SELinux context of a pod (`SELinuxOptions` is empty): kubelet passes ":Z" to the container runtime as today and lets the container runtime choose a random label and relabel the volumes. | ||
* Nothing changes if the pod's `SELinuxRelabelPolicy` is `Always`: kubelet passes ":Z" to the container runtime as today. | ||
* Kubernetes validation checks that the `SELinuxRelabelPolicy` field is used in a pod only when `SELinuxOptions` is set and the SELinux label is known. | ||
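
The decision described in the bullets above can be sketched as a small Go function. This is an illustration under stated assumptions, not kubelet code: the names `relabelPolicy`, `volumeState`, `supportsSELinux` and `needsRelabel` are hypothetical, and SELinux support is detected by looking for the `seclabel` option as it appears in the options column of `/proc/mounts`.

```go
package main

import "strings"

// relabelPolicy mirrors the two policy values proposed in this KEP.
// The type and constant names are hypothetical.
type relabelPolicy string

const (
	policyMount  relabelPolicy = "Mount"
	policyAlways relabelPolicy = "Always"
)

// volumeState is what kubelet would know after the volume has been mounted.
type volumeState struct {
	rootLabel    string // SELinux label observed on the volume root
	mountOptions string // options column of /proc/mounts, e.g. "rw,seclabel,relatime"
}

// supportsSELinux reports whether the mount options contain "seclabel".
func supportsSELinux(mountOptions string) bool {
	for _, o := range strings.Split(mountOptions, ",") {
		if o == "seclabel" {
			return true
		}
	}
	return false
}

// needsRelabel reports whether the volume should be handed to the container
// runtime with ":Z". podLabel is empty when the pod has no SELinuxOptions.
func needsRelabel(policy relabelPolicy, podLabel string, v volumeState) bool {
	supported := supportsSELinux(v.mountOptions)
	if podLabel == "" || policy != policyMount {
		// Unknown context or "Always": today's behavior.
		return supported
	}
	if v.rootLabel == podLabel {
		// Mounted with the right context: skip relabeling.
		return false
	}
	// Mount with -o context did not take effect: fall back to relabeling
	// when the volume supports SELinux at all.
	return supported
}
```

Keeping the fallback identical to the `Always` path means a CSI driver that ignores the mount option degrades to the current behavior instead of leaving the pod without access.
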
### Shared volumes | ||
If a single volume that supports SELinux labels is being shared by multiple pods, each of them must have the same SELinux context. | ||
Currently, a running pod with context `A` will lose access to all files on a volume if a pod with context `B` starts and uses the same volume, because the container runtime relabels the volume for pod `B`. | ||
This behavior changes with this KEP: kubelet mounts the volume with `-o context=A` for the first pod. | ||
It tries to do the same for the second pod with `-o context=B`; however, the volume has already been mounted and the kernel won't change the label of the volume. So pod `A` can access the volume, while pod `B` cannot. | ||
|
||
We do not consider this a bug in the design: in both cases only one pod has access to the volume, and this KEP only changes which one. | ||
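
The change in which pod keeps access can be illustrated with a toy model in Go. It only encodes the behavior described in this section (the context of the first mount sticks); it does not reflect how the kernel implements it, and `mountTable` and its methods are hypothetical names.

```go
package main

// mountTable is a toy model: volume ID -> SELinux context that is in effect.
type mountTable map[string]string

// mount records the context for the first mount of a volume only and returns
// the context that is in effect afterwards. Later mounts cannot change it.
func (m mountTable) mount(volume, context string) string {
	if existing, ok := m[volume]; ok {
		return existing
	}
	m[volume] = context
	return context
}

// canAccess reports whether a pod running with podContext can use the volume.
func (m mountTable) canAccess(volume, podContext string) bool {
	effective, ok := m[volume]
	return ok && effective == podContext
}
```

With this model, mounting `vol` for pod `A` and then for pod `B` leaves context `A` in effect, so only pod `A` can access the volume.
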
### User Stories [optional] | ||
#### Story 1 | ||
User does not configure anything special in their pods: | ||
```yaml | ||
apiVersion: v1 | ||
kind: Pod | ||
metadata: | ||
name: testpod | ||
spec: | ||
containers: | ||
- image: nginx | ||
name: nginx | ||
volumeMounts: | ||
- name: vol | ||
mountPath: /mnt/test | ||
volumes: | ||
- name: vol | ||
persistentVolumeClaim: | ||
claimName: myclaim | ||
``` | ||
No change from current Kubernetes behavior: | ||
1. Kubelet does not see any `SELinuxRelabelPolicy` configured in the pod and thus mounts the `myclaim` PVC as usual; if the underlying volume supports SELinux, kubelet passes it to the container runtime with ":Z". | ||
Kubelet also passes the implicit Secret volume containing the token with ":Z". | ||
2. The container runtime allocates a new unique SELinux label for the pod and recursively relabels all volumes passed with ":Z" to this label. | ||
#### Story 2 | ||
User (or something else, e.g. an admission webhook) configures SELinux label for a pod. | ||
```yaml | ||
apiVersion: v1 | ||
kind: Pod | ||
metadata: | ||
name: testpod | ||
spec: | ||
securityContext: | ||
seLinuxOptions: | ||
level: s0:c10,c0 | ||
containers: | ||
- image: nginx | ||
name: nginx | ||
volumeMounts: | ||
- name: vol | ||
mountPath: /mnt/test | ||
volumes: | ||
- name: vol | ||
persistentVolumeClaim: | ||
claimName: myclaim | ||
``` | ||
No change from current Kubernetes behavior. | ||
1. Kubelet does not see any `SELinuxRelabelPolicy` configured in the pod and thus mounts the `myclaim` PVC as usual; if the underlying volume supports SELinux, kubelet passes it to the container runtime with ":Z". | ||
Kubelet also passes the implicit Secret volume containing the token with ":Z". | ||
2. The container runtime uses the SELinux label "s0:c10,c0", as instructed by Kubernetes. It recursively relabels all volumes passed with ":Z" to this label. | ||
#### Story 3 | ||
User (or something else, e.g. an admission webhook) configures SELinux label for a pod. | ||
User chooses `SELinuxRelabelPolicy: "Mount"`, because they expect a potentially large volume to be used by the pod. | ||
```yaml | ||
apiVersion: v1 | ||
kind: Pod | ||
metadata: | ||
name: testpod | ||
spec: | ||
securityContext: | ||
seLinuxOptions: | ||
level: s0:c10,c0 | ||
seLinuxRelabelPolicy: Mount | ||
containers: | ||
- image: nginx | ||
name: nginx | ||
volumeMounts: | ||
- name: vol | ||
mountPath: /mnt/test | ||
volumes: | ||
- name: vol | ||
persistentVolumeClaim: | ||
claimName: myclaim | ||
``` | ||
In this case, kubelet tries to mount all of the pod's volumes with the `-o context=s0:c10,c0` mount option. | ||
If it succeeds, it passes the volume to the container runtime without ":Z" and the container runtime does not relabel the volume. | ||
See [New Kubernetes behavior](#new-kubernetes-behavior) for error cases. | ||
### Implementation Details/Notes/Constraints [optional] | ||
### Risks and Mitigations | ||
## Design Details | ||
### Test Plan | ||
* Unit tests: | ||
* API validation (all permutations of missing / present PodSecurityContext.SELinuxOptions & SELinuxRelabelPolicy & container.SecurityContext.SELinuxOptions) | ||
* Passing mount options from kubelet to volume plugins. | ||
* E2e tests: | ||
* Check no recursive `chcon` is done on a volume when not needed. | ||
* Check recursive `chcon` is done on a volume when needed (with a matrix of SELinuxOptions / SELinuxRelabelPolicy). | ||
* Prepare e2e job that runs with SELinux in Enforcing mode! | ||
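
The API validation unit tests listed above could exercise a check along the following lines. This is a sketch with simplified stand-in types, not the real Kubernetes API or validation code. It makes one loud assumption that the KEP does not spell out: the SELinux label counts as "known" when the pod sets `SELinuxOptions`, or when every container sets its own.

```go
package main

import "fmt"

// seLinuxOptions and podSecurity are simplified stand-ins for the API types;
// only the fields relevant to this check are modeled.
type seLinuxOptions struct {
	Level string
}

type podSecurity struct {
	SELinuxOptions       *seLinuxOptions
	SELinuxRelabelPolicy *string
}

// validateRelabelPolicy rejects policy values other than "Mount" and "Always"
// and rejects any policy when the SELinux label is not known up front.
// containers holds the per-container SELinuxOptions (nil when unset).
func validateRelabelPolicy(pod podSecurity, containers []*seLinuxOptions) error {
	if pod.SELinuxRelabelPolicy == nil {
		return nil // field unset: nothing to validate
	}
	switch p := *pod.SELinuxRelabelPolicy; p {
	case "Mount", "Always":
	default:
		return fmt.Errorf("unsupported seLinuxRelabelPolicy %q", p)
	}
	if pod.SELinuxOptions != nil {
		return nil // pod-level label is known
	}
	// Assumption: without pod-level options, each container needs its own.
	if len(containers) == 0 {
		return fmt.Errorf("seLinuxRelabelPolicy requires seLinuxOptions")
	}
	for i, c := range containers {
		if c == nil {
			return fmt.Errorf("seLinuxRelabelPolicy requires seLinuxOptions: container %d has none", i)
		}
	}
	return nil
}
```

A table-driven test would then enumerate the permutations of the three inputs and compare the returned error against the expectation for each row.
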
### Graduation Criteria | ||
* Alpha: | ||
* All tests defined above pass, and the feature is gated by the feature gate `SELinuxRelabelPolicy`, which defaults to `false`. | ||
* Documentation exists. | ||
* Beta: with discussions in SIG-Storage regarding success of deployments. A metric will be added to report the time taken to relabel a volume. Feature gate `SELinuxRelabelPolicy` is `true`. | ||
* GA: all known issues fixed. | ||
### Upgrade / Downgrade Strategy | ||
`SELinuxRelabelPolicy` becomes "invisible" or is dropped in a downgraded cluster. Container runtimes will get ":Z" on volumes and will do the slow recursive relabeling as they do today. | ||
### Version Skew Strategy | ||
## Implementation History | ||
* 1.19: Alpha | ||
## Drawbacks [optional] | ||
* This KEP changes the behavior of volumes shared by multiple pods, where each of them has a different SELinux label. See [Shared Volumes](#shared-volumes) for details. | ||
* The API is slightly different from that of `FSGroupChangePolicy`, which may create confusion. | ||
## Alternatives [optional] | ||
### `FSGroupChangePolicy` approach | ||
The same approach & API as in `FSGroupChangePolicy` can be used. | ||
This is a viable option. | ||
If kubelet knows the SELinux context that should be applied to a volume and a hypothetical `SELinuxChangePolicy` is `OnRootMismatch`, it would check the context of only the top-level directory of the volume and recursively `chcon` all files only when the top-level directory does not match. | ||
This could be done together with recursive change for `fsGroup`. | ||
Kubelet would not use ":Z" when passing the volume to container runtime. | ||
With `SELinuxChangePolicy: Always`, usual ":Z" is passed to container runtime and it relabels all volumes recursively. | ||
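
A minimal sketch of the `OnRootMismatch` idea, using an in-memory stand-in for the volume so that it runs without SELinux. The names `memVolume` and `relabelOnRootMismatch` are hypothetical; a real implementation would walk the filesystem and read and write the `security.selinux` extended attribute instead of a map. Relabeling the root last is a choice made in this sketch so that an interrupted run is repeated the next time, not something the KEP specifies.

```go
package main

import (
	"fmt"
	"strings"
)

// memVolume is an in-memory stand-in for a volume: path -> SELinux label.
type memVolume map[string]string

// relabelOnRootMismatch relabels root and everything below it, but only when
// the label of root itself differs from want. It returns the number of
// relabeled paths.
func relabelOnRootMismatch(v memVolume, root, want string) (int, error) {
	got, ok := v[root]
	if !ok {
		return 0, fmt.Errorf("volume root %q not found", root)
	}
	if got == want {
		// Root matches: the rest of the volume is assumed to match too.
		return 0, nil
	}
	changed := 0
	prefix := strings.TrimSuffix(root, "/") + "/"
	for p := range v {
		if p != root && strings.HasPrefix(p, prefix) {
			v[p] = want
			changed++
		}
	}
	// Root goes last so that a partial run is detected as a mismatch later.
	v[root] = want
	return changed + 1, nil
}
```

The sketch also shows the trade-off of this alternative: once the root matches, a stale label deeper in the tree is not noticed.
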
### Change container runtime | ||
We considered implementing something like `SELinuxChangePolicy: OnRootMismatch` in the container runtime. | ||
It would do the same as `PodFSGroupChangePolicy: OnRootMismatch` in the [fsGroup KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20200120-skip-permission-change.md), but in the container runtime. | ||
This approach cannot work because of `SubPath`. | ||
If a Pod uses a volume with SubPath, the container runtime gets only a subdirectory of the volume. | ||
It could check only the top level of this subdirectory and recursively change the SELinux context there. However, this could leave different subdirectories of the volume with different SELinux labels, so checking only the top-level directory does not work. | ||
With the solution implemented in kubelet, we can always check the top-level directory of the whole volume and change the context on the whole volume too. | ||