diff --git a/keps/prod-readiness/sig-node/3857.yaml b/keps/prod-readiness/sig-node/3857.yaml new file mode 100644 index 000000000000..e55da9d37de8 --- /dev/null +++ b/keps/prod-readiness/sig-node/3857.yaml @@ -0,0 +1,3 @@ +kep-number: 3857 +alpha: + approver: "@johnbelamaric" diff --git a/keps/sig-node/3857-rro-mounts/README.md b/keps/sig-node/3857-rro-mounts/README.md new file mode 100644 index 000000000000..eb6abd298150 --- /dev/null +++ b/keps/sig-node/3857-rro-mounts/README.md @@ -0,0 +1,1275 @@ + + + +# KEP-3857: Recursive read-only (RRO) mounts + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Core API](#core-api) + - [CRI API](#cri-api) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- 
[Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + + + +Utilize runc's "rro" bind mount option (https://github.com/opencontainers/runc/pull/3272) +to make read-only bind mounts literally read-only. 
+ +The "rro" bind mount option is implemented by calling [`mount_setattr(2)`](https://man7.org/linux/man-pages/man2/mount_setattr.2.html) +with `MOUNT_ATTR_RDONLY` and `AT_RECURSIVE`. + +Requires kernel >= 5.12, with one of the following OCI runtimes: +- runc >= 1.1 +- crun >= 1.4 + +## Motivation + + + +The current `readOnly` volumes are not recursively read-only, which may result in compromise of data; +e.g., even if `/mnt` is mounted as read-only, its submounts such as `/mnt/usbstorage` are not read-only. + +### Goals + + +Support recursive read-only mounts for kernel >= 5.12. + +### Non-Goals + + +Support recursive read-only mounts for old runc and old kernel releases. + +## Proposal + + + +### User Stories (Optional) + + + +#### Story 1 + +A user wants to mount `/mnt`, including its submounts such as `/mnt/usbstorage`, as read-only. + +### Notes/Constraints/Caveats (Optional) + + +Constraints: needs kernel >= 5.12 and runc >= 1.1 (or crun >= 1.4). + +### Risks and Mitigations + + + +- Increased API surface but still not secure-by-default, for the sake of compatibility. + - Mitigation: None + +- False sense of security when not implemented + - Mitigation: `VolumeMountStatus` indicating actual RRO setting + +## Design Details + + + + +### Core API +Add `RecursiveReadOnly: (Disabled|IfPossible|Enabled)` to the [`VolumeMount`](https://github.com/kubernetes/kubernetes/blob/v1.26.1/pkg/apis/core/types.go#L1854-L1880) struct. 
+ +A pod manifest will look like this: +```yaml +spec: + volumes: + - name: foo + hostPath: + path: /mnt + type: Directory + containers: + - volumeMounts: + - mountPath: /mnt + name: foo + mountPropagation: None + readOnly: true + # NEW + recursiveReadOnly: IfPossible +``` + +See the comment lines in the diff below for the constraints of the `VolumeMount` options: +```diff +diff --git a/pkg/apis/core/types.go b/pkg/apis/core/types.go +index e40b8bfa104..09c88222c2d 100644 +--- a/pkg/apis/core/types.go ++++ b/pkg/apis/core/types.go +@@ -1914,6 +1914,31 @@ type VolumeMount struct { + // Optional: Defaults to false (read-write). + // +optional + ReadOnly bool ++ // RecursiveReadOnly specifies recursive-readonly mode. ++ // ++ // 1. If ReadOnly is false, RecursiveReadOnly must be unspecified. ++ // 2. If ReadOnly is true: ++ // 2.1. If RecursiveReadOnly is unspecified: ++ // 2.1.1. if it belongs to a Pod being created, it is initialized to Disabled. ++ // 2.1.2 if it belongs to a PodSpec under Deployment, Job, etc., it remains unspecified ++ // (and will be set to Disabled eventually, when the Pod is created). ++ // 2.2. If RecursiveReadOnly is set to Disabled, the mount is not made recursively read-only. ++ // 2.3. If RecursiveReadOnly is set to IfPossible, the mount is made recursively read-only, ++ // if it is supported by the runtime. ++ // If it is not supported by the runtime, the mount is not made recursively read-only. ++ // MountPropagation must be None or unspecified (which defaults to None). ++ // 2.4. If RecursiveReadOnly is set to Enabled, the mount is made recursively read-only. ++ // If it is not supported by the runtime, the Pod will be terminated by kubelet, ++ // and an error will be generated to indicate the reason. ++ // MountPropagation must be None or unspecified (which defaults to None). ++ // 2.5. If RecursiveReadOnly is set to unknown value, it will result in an error. 
++ // ++ // When this property is recognized by kubelet and kube-apiserver, ++ // VolumeMountStatus.RecursiveReadOnly will be set to either Disabled or Enabled. ++ // ++ // +featureGate=RecursiveReadOnlyMounts ++ // +optional ++ RecursiveReadOnly *RecursiveReadOnlyMode + // Required. If the path is not an absolute path (e.g. some/path) it + // will be prepended with the appropriate root prefix for the operating + // system. On Linux this is '/', on Windows this is 'C:\'. +@@ -1926,6 +1951,8 @@ type VolumeMount struct { + // to container and the other way around. + // When not set, MountPropagationNone is used. + // This field is beta in 1.10. ++ // When RecursiveReadOnly is set to IfPossible or to Enabled, MountPropagation must be None or unspecified ++ // (which defaults to None). + // +optional + MountPropagation *MountPropagationMode + // Expanded path within the volume from which the container's volume should be mounted. +@@ -1961,6 +1988,18 @@ const ( + MountPropagationBidirectional MountPropagationMode = "Bidirectional" + ) + ++// RecursiveReadOnlyMode describes recursive-readonly mode. ++type RecursiveReadOnlyMode string ++ ++const ( ++ // RecursiveReadOnlyDisabled disables recursive-readonly mode. ++ RecursiveReadOnlyDisabled RecursiveReadOnlyMode = "Disabled" ++ // RecursiveReadOnlyIfPossible enables recursive-readonly mode if possible. ++ RecursiveReadOnlyIfPossible RecursiveReadOnlyMode = "IfPossible" ++ // RecursiveReadOnlyEnabled enables recursive-readonly mode, or raise an error. ++ RecursiveReadOnlyEnabled RecursiveReadOnlyMode = "Enabled" ++) ++ + // VolumeDevice describes a mapping of a raw block device within a container. + type VolumeDevice struct { + // name must match the name of a persistentVolumeClaim in the pod +@@ -2591,6 +2630,10 @@ type ContainerStatus struct { + // +featureGate=InPlacePodVerticalScaling + // +optional + Resources *ResourceRequirements ++ // Status of volume mounts. 
++ // +listType=atomic ++ // +optional ++ VolumeMounts []VolumeMountStatus + } + + // PodPhase is a label for the condition of a pod at the current time. +@@ -2664,6 +2707,21 @@ const ( + PodResizeStatusInfeasible PodResizeStatus = "Infeasible" + ) + ++// VolumeMountStatus shows status of volume mounts. ++type VolumeMountStatus struct { ++ // Name corresponds to the name of the original VolumeMount. ++ Name string ++ // ReadOnly corresponds to the original VolumeMount. ++ // +optional ++ ReadOnly bool ++ // RecursiveReadOnly must be set to Disabled, Enabled, or unspecified (for non-readonly mounts). ++ // An IfPossible value in the original VolumeMount must be translated to Disabled or Enabled, ++ // depending on the mount result. ++ // +featureGate=RecursiveReadOnlyMounts ++ // +optional ++ RecursiveReadOnly *RecursiveReadOnlyMode ++} ++ + // RestartPolicy describes how the container should be restarted. + // Only one of the following restart policies may be specified. + // If none of the following policies is specified, the default one +@@ -4591,6 +4649,24 @@ type NodeDaemonEndpoints struct { + KubeletEndpoint DaemonEndpoint + } + ++// RuntimeClassFeatures is a set of runtime features. ++type RuntimeClassFeatures struct { ++ // RecursiveReadOnlyMounts is set to true if the runtime class supports RecursiveReadOnlyMounts. ++ // +optional ++ RecursiveReadOnlyMounts *bool ++} ++ ++// RuntimeClass is a set of runtime class information. ++type RuntimeClass struct { ++ // Runtime class name. ++ // Empty for the default runtime class. ++ // +optional ++ Name string ++ // Supported features. ++ // +optional ++ Features *RuntimeClassFeatures ++} ++ + // NodeSystemInfo is a set of ids/uuids to uniquely identify the node. + type NodeSystemInfo struct { + // MachineID reported by the node. For unique machine identification +@@ -4701,6 +4777,9 @@ type NodeStatus struct { + // Status of the config assigned to the node via the dynamic Kubelet config feature. 
+ // +optional + Config *NodeConfigStatus ++ // The available runtime classes. ++ // +optional ++ RuntimeClasses []RuntimeClass + } + + // UniqueVolumeName defines the name of attached volume +``` + +### CRI API + +Add `bool recursive_read_only` to the [`Mount`](https://github.com/kubernetes/cri-api/blob/v0.26.1/pkg/apis/runtime/v1/api.proto#L212-L224) message. +CRI implementations will also expose the availability of the feature via the `RuntimeHandlerFeatures` message. + +As kubelet can inspect the availability of the feature via the `RuntimeHandlerFeatures` message, +there is no concept of "IfPossible" in the CRI API; +kubelet translates an "IfPossible" value in the Core API into true or false in the CRI API + +The `RuntimeHandlerFeatures` message is also propagated to the `NodeSystemInfo` struct of the Core API. + +Diff: +```diff +diff --git a/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto b/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto +index e16688d8386..194d591c27f 100644 +--- a/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto ++++ b/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto +@@ -235,6 +235,15 @@ message Mount { + repeated IDMapping uidMappings = 6; + // GidMappings specifies the runtime GID mappings for the mount. + repeated IDMapping gidMappings = 7; ++ // If set to true, the mount is made recursive read-only. ++ // In this CRI API, recursive_read_only is a plain true/false boolean, although its equivalent ++ // in the Kubernetes core API is a quaternary that can be nil, "Enabled", "IfPossible", or "Disabled". ++ // kubelet translates that quaternary value in the core API into a boolean in this CRI API. ++ // Remarks: ++ // - nil is just treated as false ++ // - when set to true, readonly must be explicitly set to true, and propagation must be PRIVATE (0). ++ // - (readonly == false && recursive_read_only == false) does not make the mount read-only. 
++ bool recursive_read_only = 8; + } + + // IDMapping describes host to container ID mappings for a pod sandbox. +@@ -1524,6 +1533,22 @@ message StatusRequest { + bool verbose = 1; + } + ++message RuntimeHandlerFeatures { ++ // recursive_read_only_mounts is set to true if the runtime handler supports ++ // recursive read-only mounts. ++ // For runc-compatible runtimes, availability of this feature can be detected by checking whether ++ // the Linux kernel version is >= 5.12, and, `runc features | jq .mountOptions` contains "rro". ++ bool recursive_read_only_mounts = 1; ++} ++ ++message RuntimeHandler { ++ // Name must be unique in StatusResponse. ++ // An empty string denotes the default handler. ++ string name = 1; ++ // Supported features. ++ RuntimeHandlerFeatures features = 2; ++} ++ + message StatusResponse { + // Status of the Runtime. + RuntimeStatus status = 1; +@@ -1532,6 +1557,8 @@ message StatusResponse { + // debug, e.g. plugins used by the container runtime. + // It should only be returned non-empty when Verbose is true. + map info = 2; ++ // Runtime handlers. ++ repeated RuntimeHandler runtime_handlers = 3; + } + + message ImageFsInfoRequest {} +diff --git a/staging/src/k8s.io/cri-api/pkg/errors/errors.go b/staging/src/k8s.io/cri-api/pkg/errors/errors.go +index a4538669122..c8e4a18dec5 100644 +--- a/staging/src/k8s.io/cri-api/pkg/errors/errors.go ++++ b/staging/src/k8s.io/cri-api/pkg/errors/errors.go +@@ -29,6 +29,9 @@ var ( + + // ErrSignatureValidationFailed - Unable to validate the image signature on the PullImage RPC call. 
+ ErrSignatureValidationFailed = errors.New("SignatureValidationFailed") ++ ++ // ErrRROUnsupported - Unable to enforce recursive readonly mounts ++ ErrRROUnsupported = errors.New("RROUnsupported") + ) + + // IsNotFound returns a boolean indicating whether the error +``` + +### Test Plan + + + +[X] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + + + +The existing tests will continue to pass. +New tests have to be added to cover the proposed feature. + +##### Unit tests + + + + + +- kubelet unit tests: will take a CRI status and populate the `VolumeMountStatus`. +- [CRI test](https://github.com/kubernetes-sigs/cri-tools): + will be similar to [e2e tests](#e2e-tests) below but without using Kubernetes Core API. + +##### Integration tests + + + + + + + +See [e2e tests](#e2e-tests) below. + +##### e2e tests + + + + + +- run a pod in each RecursiveReadOnly mode and verify that the status comes back correctly +- run RecursiveReadOnly="Enabled" on a runtime that does not support it and ensure the error +- run RecursiveReadOnly="Enabled", and verify that the mount is actually recursively read-only +- run RecursiveReadOnly="Disabled", and verify that the mount is actually not recursively read-only + +### Graduation Criteria + + + +#### Alpha +- Feature implemented behind a feature flag +- Unit tests and CRI tests will pass + +#### Beta +- e2e tests pass with containerd, CRI-O, and cri-dockerd + +#### GA +- (Will be revisited during beta) + +### Upgrade / Downgrade Strategy + + + +Upgrade: No action is needed. Existing readonly mounts will remain non-recursively readonly. + +Downgrade: +- On downgrading kube-apiserver, the `[]volumeMounts.recursiveReadOnly` property will be lost + and will not be propagated to kubelet. 
+ If the mode was set to non-`Disabled`, this will result in producing writable mounts. + It is the user's responsibility to use the correct version of kube-apiserver + when they need non-`Disabled` mode. + +- On downgrading kubelet, the `[]volumeMounts.recursiveReadOnly` properties will be lost, + and the `[]containerStatuses.[]volumeMount.recursiveReadOnly` status will not be updated. + It is the user's responsibility to use the correct version of kubelet when they need to check + `[]containerStatuses.[]volumeMount.recursiveReadOnly`. + +- On downgrading the CRI or OCI runtime, if the `RecursiveReadOnly` mode is set to `Enabled`, + kubelet will raise an error. + `IfPossible` will just be treated as `Disabled`. + +### Version Skew Strategy + + + +- It is the user's responsibility to use the correct version of kube-apiserver + when they need non-`Disabled` mode. Otherwise the mode will not be propagated to kubelet. + +- It is the user's responsibility to use the correct version of kube-apiserver and kubelet when they need to check + `[]containerStatuses.[]volumeMount.recursiveReadOnly`. + Otherwise the property may have an inconsistent value. + +- CRI and OCI runtimes have to be updated before kubelet, otherwise kubelet will not know whether they + support the feature or not, and it will assume that they do not support the feature. + +- If only some of the nodes support the feature, `Disabled` and `IfPossible` will continue to work on all the nodes, + but `Enabled` will fail on a node that does not support the feature. + kube-scheduler is not aware of this, so it is the user's responsibility to set `nodeSelector`, `nodeAffinity`, + etc. to avoid scheduling a pod with `Enabled` to a node that does not support the feature. + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? 
+ + + +- [X] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: `RecursiveReadOnlyMounts` + - Components depending on the feature gate: kube-apiserver, kubelet + + +###### Does enabling the feature change any default behavior? + + + +No + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +Yes, by unsetting `RecursiveReadOnly=Enabled`. + +Components can be downgraded too, but it should be noted that `VolumeMountStatus` +may still show an inconsistent state after kubelet has been downgraded. +The pod manifest has to be recreated to get a consistent state in this case. + +###### What happens if we reenable the feature if it was previously rolled back? + +It works. +The behavior is the same as a fresh rollout, as long as the user has recreated the pod manifests. +(See "Can the feature be disabled once ..." section above) + +###### Are there any tests for feature enablement/disablement? + + + +Unit tests will run with and without the feature gate. + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +A rollout may fail when at least one of the following components is too old: + +| Component | `recursiveReadOnly` value that will cause an error | +|----------------|----------------------------------------------------| +| kube-apiserver | any value | +| kubelet | any value | +| CRI runtime | `Enabled` | +| OCI runtime | `Enabled` | +| kernel | `Enabled` | + +For example, an error will be returned like this if kube-apiserver is too old: +```console +$ kubectl apply -f rro.yaml +Error from server (BadRequest): error when creating "rro.yaml": Pod in version "v1" cannot be handled as a Pod: +strict decoding error: unknown field "spec.containers[0].volumeMounts[0].recursiveReadOnly" +``` + +No impact on already running workloads. + +###### What specific metrics should inform a rollback? 
+ + + +Look for an event indicating that RRO is not supported by the runtime. +```console +$ kubectl get events -o json -w +... +{ + ... + "kind": "Event", + "message": "Error: RRONotSupported", + ... +} +... +``` + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + +(Will be revisited during beta) + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + +No + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +The feature is in use if the following `jq` command prints a non-zero number: + +```bash +kubectl get pods -A -o json | jq '[.items[].spec.containers[].volumeMounts[]? | select(.recursiveReadOnly)] | length' +``` + +###### How can someone using this feature know that it is working for their instance? + + + +- [X] API .status + - Condition name: `volumeMountStatus.recursiveReadOnly` + + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +- `recursiveReadOnly=Enabled`: + 100% of pods that were scheduled into a node must run with recursive read-only mounts, + or 100% of them must fail to run. + +- `recursiveReadOnly=IfPossible`: + 100% of pods that were scheduled into a node must run, with or without recursive read-only mounts. + +- `recursiveReadOnly=Disabled`, or unset: + 100% of pods that were scheduled into a node must run without recursive read-only mounts. + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [X] Metrics + - Metric name: Event + - [Optional] Aggregation method: `kubectl get events -o json -w` + - Components exposing the metric: kubelet -> kube-apiserver + +If `recursiveReadOnly` is set to `Enabled` but it is not supported, kubelet will raise an event like this: + +```console +$ kubectl get events -o json -w +... +{ + ... 
+ "kind": "Event", + "message": "Error: RRONotSupported", + ... +} +... +``` + +If the OCI runtime claims that it supports recursive read-only mounts but it actually fails to mount them, +the pod will enter CrashLoopBackOff. +The error from the OCI runtime can be inspected by running: +``` +kubectl get pod -o json foo | jq '.status.containerStatuses[0].lastState.terminated.message' +``` + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + +Potentially, kube-scheduler could be extended to avoid scheduling a pod with `recursiveReadOnly: Enabled` +to a node running an old kernel. + +In this way, the Event metric described above would not happen, and users would instead see `Pending` pods +as an error metric. + +However, this is not planned to be implemented in kube-scheduler, as it seems like overengineering. +Users may use `nodeSelector`, `nodeAffinity`, etc. to work around this. + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +Specific versions of the CRI runtime, the OCI runtime, and the Linux kernel + +### Scalability + + + +A pod with `recursiveReadOnly: Enabled` may be rejected by kubelet with a probability of $$B/A$$, +where $$A$$ is the number of all the nodes that may potentially accept the pod, +and $$B$$ is the number of the nodes that may potentially accept the pod but do not support RRO. +This may affect scalability. + +To evaluate this risk, users may run +`kubectl get nodes -o json | jq '[.items[].status.nodeInfo.runtimeClasses[].Features]'` +to see how many nodes support `RecursiveReadOnlyMounts: true`. + +###### Will enabling / using this feature result in any new API calls? + + +No + +###### Will enabling / using this feature result in introducing new API types? + + +No + +###### Will enabling / using this feature result in any new calls to the cloud provider? 
+ + +No + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + +About a dozen bytes + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + +No + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + +No + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + +No + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +A pod cannot be created, just as with any other pod. + +###### What are other known failure modes? + + +None + +###### What steps should be taken if SLOs are not being met to determine the problem? + +- Make sure that the node is running Linux kernel v5.12 or later. +- Make sure that `runc features | jq .mountOptions` contains "rro". Otherwise update runc. +- Make sure that `crictl info` (with the latest crictl) + reports that `RecursiveReadOnlyMounts` is supported. + Otherwise update the CRI runtime, and make sure that no relevant error is printed in + the CRI runtime's log. +- Make sure that `kubectl get nodes -o json | jq '[.items[].status.nodeInfo.runtimeClasses[].Features]'` + (with the latest kubectl and control planes) + reports that `RecursiveReadOnlyMounts` is supported. + Otherwise update the CRI runtime, and make sure that no relevant error is printed in + kubelet's log. + +## Implementation History + + + +## Drawbacks + + +See "Alternatives" below. + +## Alternatives + + + +Plan B is to keep the Kubernetes Core API and the CRI API completely unmodified, +and just let the CRI runtime treat "readonly" as "recursive readonly". + +This would be much easier to implement and adopt; however, a small portion of users may find this to be a breaking change. 
+ +In fact, containerd once adopted Plan B (https://github.com/containerd/containerd/pull/9713) in its main branch +(not in any GA release), but it is now being reverted in favor of this KEP (https://github.com/containerd/containerd/pull/9747). + +## Infrastructure Needed (Optional) + + + +runc >= 1.1 && kernel >= 5.12 diff --git a/keps/sig-node/3857-rro-mounts/kep.yaml b/keps/sig-node/3857-rro-mounts/kep.yaml new file mode 100644 index 000000000000..69c75ff86c6d --- /dev/null +++ b/keps/sig-node/3857-rro-mounts/kep.yaml @@ -0,0 +1,47 @@ +title: Recursive read-only mounts +kep-number: 3857 +authors: + - "@AkihiroSuda" +owning-sig: sig-node +#participating-sigs: +# - sig-aaa +# - sig-bbb +status: implementable +creation-date: 2023-02-09 +reviewers: + - "@thockin" + - "@SergeyKanzhelev" +approvers: + - "@johnbelamaric" + +#see-also: +# - "/keps/sig-aaa/1234-we-heard-you-like-keps" +# - "/keps/sig-bbb/2345-everyone-gets-a-kep" +#replaces: +# - "/keps/sig-ccc/3456-replaced-kep" +# +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.30" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.30" +# beta: "v1.XX" +# stable: "v1.XX" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: RecursiveReadOnlyMounts + components: + - kube-apiserver,kubelet +disable-supported: true + +# The following PRR answers are required at beta release +#metrics: +# - my_feature_metric