diff --git a/keps/prod-readiness/sig-node/3386.yaml b/keps/prod-readiness/sig-node/3386.yaml new file mode 100644 index 00000000000..61ef1f54a2c --- /dev/null +++ b/keps/prod-readiness/sig-node/3386.yaml @@ -0,0 +1,3 @@ +kep-number: 3386 +alpha: + approver: "@deads2k" diff --git a/keps/sig-node/3386-kubelet-evented-pleg/README.md b/keps/sig-node/3386-kubelet-evented-pleg/README.md new file mode 100644 index 00000000000..29c44401fde --- /dev/null +++ b/keps/sig-node/3386-kubelet-evented-pleg/README.md @@ -0,0 +1,546 @@ +# KEP-3386: Kubelet Evented PLEG for Better Performance + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Acknowledgements](#acknowledgements) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories](#user-stories) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Feature Gate](#feature-gate) + - [Runtime Service Changes](#runtime-service-changes) + - [Events Filter](#events-filter) + - [Kubelet Changes](#kubelet-changes) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation 
History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Acknowledgements + +This proposal is heavily based on [this community
enhancement][1], as the problem was never addressed. The purpose of this document is to modernize the proposal: both in the sense of process--updating the doc to meet the new KEP guidelines, as well as in the sense of implementation--updating the proposal to be about changing the CRI instead of the now-dropped dockershim. + +A lot of credit goes to the authors of the previous proposal. + +[1]: https://github.com/kubernetes/community/blob/4026287dc3a2d16762353b62ca2fe4b80682960a/contributors/design-proposals/node/pod-lifecycle-event-generator.md#leverage-upstream-container-events + +## Summary + +The purpose of this KEP is to outline changes to the Kubelet and Container Runtime Interface (CRI) that move the way the Kubelet tracks changes to pod state to a List/Watch model, which polls less frequently and thereby reduces overhead. Specifically, the Kubelet will listen for [gRPC server streaming](https://grpc.io/docs/what-is-grpc/core-concepts/#server-streaming-rpc) events from the CRI implementation and use them to generate pod lifecycle events. + +The overarching goal of this effort is to reduce the Kubelet and CRI implementation's steady-state CPU usage. + +## Motivation + +In Kubernetes, Kubelet is a per-node daemon that manages the pods on the node, driving the pod states to match their pod specifications (specs). To achieve this, Kubelet needs to react to changes in both (1) pod specs and (2) the container states. For the former, Kubelet watches the pod spec changes from multiple sources; for the latter, Kubelet polls the container runtime [periodically](https://github.com/kubernetes/kubernetes/blob/release-1.24/pkg/kubelet/kubelet.go#L162) for the latest states of all containers; the current hardcoded default polling period is 1s. + +Polling incurs non-negligible overhead as the number of pods/containers increases, and is exacerbated by Kubelet's parallelism -- one worker (goroutine) per pod, which queries the container runtime individually.
These periodic, concurrent, large numbers of requests cause high CPU usage spikes (even when there is no spec/state change), poor performance, and reliability problems due to an overwhelmed container runtime. Ultimately, it limits Kubelet's scalability. + +### Goals + +- Reduce unnecessary work during inactivity (no spec/state changes) + - In other words, reduce the steady-state CPU usage of the Kubelet and CRI implementation by reducing frequent polling of the container statuses. + +### Non-Goals + +- Eliminate polling altogether. + - This proposal does not advocate completely removing the polling. We cannot solely rely on the upstream container events due to the possibility of missing events. PLEG should relist at a reduced frequency to ensure no events are missed. +- Addressing container image relisting via CRI events is out of scope for this enhancement at this point in time. + +## Proposal + +This proposal aims to replace the periodic polling with a pod lifecycle event watcher. Currently, the Kubelet calls into two CRI calls of the form `List*`: [ListContainers](https://github.com/kubernetes/kubernetes/blob/6efd6582df2011f1ec8c146ef711b3348ae07d60/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L78) and [ListPodSandbox](https://github.com/kubernetes/kubernetes/blob/6efd6582df2011f1ec8c146ef711b3348ae07d60/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L60). Each of these is used to populate the Kubelet's perspective +of the state of the node. + +As the number of pods on a node increases, the amount of time the Kubelet and CRI implementation spend generating and reading these lists increases linearly. What is needed is a way for the Kubelet to be notified when a container changes state in a way the Kubelet did not trigger.
+ +There should only be two such cases, and in normal operation, only one would happen frequently: + +- The first, and most clear, case of a container changing state without the Kubelet triggering that state change is when a container stops. Containers can exit gracefully, or be OOM killed, and the Kubelet would not know. + - We will also introduce events for when the container is created as well as when it is started. This will help us reduce the relisting that takes place while the kubelet waits for the container to start. + - Although the kubelet initiates container deletion, for the sake of additional validation we are also introducing an event from the runtime to denote it. +- The second, and less likely, case is when another entity changes the state of the node. + - For container-related events (such as a container being created, started, stopped or killed), this can appear as a user calling [crictl](https://github.com/kubernetes-sigs/cri-tools/blob/master/docs/crictl.md) manually, or even using the runtime directly. + +The Kubelet currently covers each of these cases quite easily: by listing all of the resources on the node, it will have an accurate picture after the amount of time of its [poll interval](https://github.com/kubernetes/kubernetes/blob/release-1.24/pkg/kubelet/kubelet.go#L162). For each of these cases, a new CRI-based events API can be made, using [gRPC server streaming](https://grpc.io/docs/what-is-grpc/core-concepts/#server-streaming-rpc). This way, the entity closest to the activity of the containers and pods (the CRI implementation) can be responsible for informing the Kubelet of their behavior directly. + +### User Stories + +- As a cluster administrator I want to enable the `Evented PLEG` feature of the kubelet for better performance with as little infrastructure overhead as possible.
+ +### Notes/Constraints/Caveats (Optional) + + + +### Risks and Mitigations + +- PLEG is very core to the container status handling in the kubelet. Hence, any miscalculation there would result in unpredictable behaviour not just for the node but for the entire cluster. + - To reduce the risk of regression, this feature initially will be available only as an opt-in. + - Users can disable this feature to make the kubelet use the existing relisting-based PLEG. +- Another risk is that the CRI implementation could have a buggy event-emitting system, and miss pod lifecycle events. + - A mitigation is a `kube_pod_missed_events` metric, which the Kubelet could report when a lifecycle event is registered that wasn't triggered by an event, but rather by changes of state between lists. + +## Design Details + +### Feature Gate +This feature can only be enabled using the feature gate `EventedPLEG`. + +### Runtime Service Changes + +A new RPC will be introduced in the [CRI Runtime Service](https://github.com/kubernetes/kubernetes/blob/6efd6582df2011f1ec8c146ef711b3348ae07d60/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L34): + +```protobuf= + // GetContainerEvents gets container events from the CRI runtime + rpc GetContainerEvents(GetEventsRequest) returns (stream ContainerEventResponse) {} +``` + +```protobuf= +message ContainerEventResponse { + // ID of the container + string container_id = 1; + + // Type of the container event + ContainerEventType container_event_type = 2; + + // Creation timestamp of this event + int64 created_at = 3; + + // ID of the sandbox container + string sandbox_id = 4; +} +``` + +```protobuf= +enum ContainerEventType { + // Container created + CONTAINER_CREATED_EVENT = 0; + + // Container started + CONTAINER_STARTED_EVENT = 1; + + // Container stopped + CONTAINER_STOPPED_EVENT = 2; + + // Container deleted + CONTAINER_DELETED_EVENT = 3; +} +``` +### Events Filter +Events can be filtered to retrieve only a subset of events: + +```protobuf=
+message GetEventsRequest { + // Optional to filter a list of events. + GetEventsFilter filter = 1; +} +``` + +```protobuf= +// GetEventsFilter is used to filter a list of events. +// All those fields are combined with 'AND' +message GetEventsFilter { + // ID of the container, sandbox. + string id = 1; + // LabelSelector to select matches. + // Only api.MatchLabels is supported for now and the requirements + // are ANDed. MatchExpressions is not supported yet. + map<string, string> label_selector = 2; +} +``` + +### Kubelet Changes + +Kubelet generates [PodLifecycleEvent](https://github.com/kubernetes/kubernetes/blob/release-1.24/pkg/kubelet/pleg/pleg.go#L41) using [relisting](https://github.com/kubernetes/kubernetes/blob/050f930f8968874855eb215f0c0f0877bcdaa0e8/pkg/kubelet/pleg/generic.go#L150). These `PodLifecycleEvents` get [used](https://github.com/kubernetes/kubernetes/blob/050f930f8968874855eb215f0c0f0877bcdaa0e8/pkg/kubelet/kubelet.go#L2060) in the kubelet's sync loop to infer the state of the container, e.g. to determine if the [container has died](https://github.com/kubernetes/kubernetes/blob/050f930f8968874855eb215f0c0f0877bcdaa0e8/pkg/kubelet/kubelet.go#L2118). + + + The idea behind this enhancement is that the kubelet will receive the [CRI events](#runtime-service-changes) mentioned above from the CRI runtime and generate the corresponding `PodLifecycleEvent`. This will reduce the kubelet's dependency on relisting to generate `PodLifecycleEvents`, and each such event will be immediately available within the sync loop instead of waiting for relisting to finish. The kubelet will still do relisting, but with a reduced frequency. + +### Test Plan + +[X] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement.
+ +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- `kubernetes/kubernetes/tree/master/pkg/kubelet` : `15-Jun-2022` - `64.5%` +##### Integration tests + + + +- Ensure the `PodLifecycleEvent` is generated by the kubelet when the CRI events are received. +- Verify the Pod status is updated correctly when the CRI events are received. + +##### e2e tests + + + +- Existing Pod Lifecycle tests must continue to pass even after the relisting period is increased. + + +### Graduation Criteria +#### Alpha + +- Feature implemented behind a feature flag +- Existing `node e2e` tests around pod lifecycle must pass + +### Upgrade / Downgrade Strategy + +N/A + +### Version Skew Strategy + +N/A. + +Since this feature alters only the way the kubelet determines the container statuses, this section is irrelevant to this feature. + +## Production Readiness Review Questionnaire + + +### Feature Enablement and Rollback + +###### How can this feature be enabled / disabled in a live cluster? + +- [X] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: EventedPLEG + - Components depending on the feature gate: kubelet +- [X] The CRI runtime must enable/disable this feature as well for it to work properly. + +###### Does enabling the feature change any default behavior? + +This feature does not introduce any user-facing changes, although users should notice improved kubelet performance, which should result in reduced overhead for both the kubelet and the CRI runtime after enabling this feature. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + +Yes; the kubelet needs to be restarted to disable this feature. + +###### What happens if we reenable the feature if it was previously rolled back? + +If reenabled, the kubelet will again start updating container statuses using CRI events instead of relisting. Every time this feature is enabled or disabled, the kubelet will need to be restarted.
Hence, the kubelet will start from a clean state. + +###### Are there any tests for feature enablement/disablement? + +Yes, unit tests for the feature when enabled and disabled will be implemented in both the kubelet and the CRI runtime. + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +This feature relies on the CRI runtime events to determine the container statuses. If the CRI runtime is not upgraded to a version which emits those CRI events before enabling this feature, the kubelet will not be able to determine the container statuses immediately. However, we aren't getting rid of the existing relisting altogether, so the kubelet should eventually reconcile the container statuses using relisting, albeit more infrequently due to the [increased relisting period](https://github.com/kubernetes/kubernetes/blob/release-1.24/pkg/kubelet/kubelet.go#L162) that comes with this feature. + +###### What specific metrics should inform a rollback? + + + +If users observe inconsistency in the container statuses reported by the kubelet and the CRI runtime (e.g. using a tool like `crictl`) after enabling this feature, they should consider rolling back the feature. +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + +N/A for the alpha release, but we will add the tests for the beta release. +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + +No. + + +### Monitoring Requirements + + +- Add a metric `kube_pod_missed_events` that describes when a pod changed state between relisting periods without a corresponding event. + - This is to catch situations where a CRI implementation is buggy and is not properly emitting events. + +###### How can an operator determine if the feature is in use by workloads? + + + +This feature is not directly going to be used by the workloads.
This is an optimization for the kubelet to determine the container statuses. + +However, users can use existing pod lifecycle-related pod metrics such as `kube_pod_start_time` or `kube_pod_completion_time` and compare them with the timestamps reported in the CRI runtime (e.g. `CRI-O` or `containerd`) logs. The time difference must always be less than the relisting period. + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [X] Other (treat as last resort) + - Details: In the kubelet logs, look for the `PodLifecycleEvent` getting generated from the received CRI runtime event. This is a good indicator that the feature is working. + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +- The time between a pod status change and the Kubelet reporting that change must decrease on average from the current polling interval of 1 second. +- The number listed in the `kube_pod_missed_events` metric should remain low (ideally zero or at least near-zero). + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [X] Metrics + - Metric name: `kube_pod_start_time` + - Aggregation method: Compare against the start time reported in the CRI runtime logs. + - Components exposing the metric: Kubelet + - Metric name: `kube_pod_completion_time` + - Aggregation method: Compare against the container exit time reported in the CRI runtime logs. + - Components exposing the metric: Kubelet +- [X] Other (treat as last resort) + - Details: Admins can also look for the `PodLifecycleEvent` getting generated from the received CRI runtime event in the kubelet logs. This is a good indicator that the feature is working. +###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+ +Kubelet already has metrics for the pod status update times (e.g. `kube_pod_start_time` and `kube_pod_completion_time`), but there is no standard metric emitted by the various CRI runtime implementations for the pod status update times. It would be ideal if we had a standard metric for the container statuses emitted by all the CRI implementations. + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + +- CRI Runtime + - CRI runtimes that are capable of emitting CRI events must be installed and running. + - Impact of its outage on the feature: Kubelet will detect the outage and fall back on the current default relisting period to make sure the pod statuses are updated in time. + - Impact of its degraded performance or high-error rates on the feature: + - Any instability with the CRI runtime events stream that results in an error can be detected by the kubelet. Such an error will result in the kubelet falling back to the current default relisting period to make sure the pod statuses are updated in time. + - If the instability only degrades performance but does not result in an error, then the kubelet will not fall back to the current default relisting period and will continue to use the CRI runtime events stream. This will result in the kubelet updating the pod statuses via either the CRI runtime events or the increased relisting period, whichever happens first. + - Without a stable stream of CRI events this feature will suffer, and the kubelet will fall back to relisting with the current default relisting period. + - Kubelet should emit a metric `kube_pod_missed_events` when it detects pods changing state between relist periods not caught by an event. +### Scalability +###### Will enabling / using this feature result in any new API calls? + +No. + +###### Will enabling / using this feature result in introducing new API types? + +No.
+ +###### Will enabling / using this feature result in any new calls to the cloud provider? + +No. + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + +No. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + +No. + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + +No. + +### Troubleshooting + +###### How does this feature react if the API server and/or etcd is unavailable? + +Since this is a kubelet-specific feature, it is unaffected by the unavailability of the API server and/or etcd. + +###### What are other known failure modes? + +- Incorrect container statuses + - Detection: If the user notices that the container statuses reported by the kubelet are not consistent with the container statuses reported by the CRI runtime (e.g. using `crictl`), then this feature is failing. + - Mitigations: They will have to disable this feature and open an issue for further investigation. + - Diagnostics: CRI runtime logs (such as those of `cri-o` or `containerd`) may not be consistent with the kubelet logs on container statuses. +- Missed events + - Detection: If there's a bug in the CRI implementation, it may miss events or not send them correctly. The kubelet will see this when the statuses are listed, and should emit the `kube_pod_missed_events` metric to quantify it. + - Mitigations: The feature could be disabled, or the relist frequency could be increased, until the CRI implementation is fixed. + - Diagnostics: An increasing value of the `kube_pod_missed_events` metric coming from the Kubelet. + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +Disabling this feature in the kubelet will revert to the existing relisting PLEG.
+ +## Implementation History + +- PR for required CRI changes - https://github.com/kubernetes/kubernetes/pull/110165 + +## Drawbacks + +This KEP introduces changes to the [kubelet PLEG](https://github.com/kubernetes/kubernetes/tree/master/pkg/kubelet/pleg), which is very core to the kubelet operation. + +## Alternatives + +The Kubelet PLEG can be made to utilize the events from cadvisor as well. But we are trying to reduce the kubelet's dependency on cadvisor so that option is not viable. This is also discussed in the older [enhancement](https://github.com/kubernetes/community/blob/4026287dc3a2d16762353b62ca2fe4b80682960a/contributors/design-proposals/node/pod-lifecycle-event-generator.md#leverage-upstream-container-events) in detail. + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-node/3386-kubelet-evented-pleg/kep.yaml b/keps/sig-node/3386-kubelet-evented-pleg/kep.yaml new file mode 100644 index 00000000000..19de4e43489 --- /dev/null +++ b/keps/sig-node/3386-kubelet-evented-pleg/kep.yaml @@ -0,0 +1,39 @@ +title: Kubelet Evented PLEG for Better Performance +kep-number: 3386 +authors: + - "@haircommander" + - "@harche" +owning-sig: sig-node +status: implementable +creation-date: 2022-06-13 +reviewers: + - "@mrunalp" + - "@SergeyKanzhelev" +approvers: + - "@derekwaynecarr" +prr-approvers: + - "@deads2k" + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.26" + +# The milestone at which this feature was, or is targeted to be, at each stage. 
+milestone: + alpha: "v1.26" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: EventedPLEG + components: + - kubelet +disable-supported: true + +# The following PRR answers are required at beta release +metrics: + - "N/A"