diff --git a/keps/NNNN-kep-template/README.md b/keps/NNNN-kep-template/README.md index 7ca80a43f16..8657cb9003f 100644 --- a/keps/NNNN-kep-template/README.md +++ b/keps/NNNN-kep-template/README.md @@ -93,6 +93,13 @@ tags, and then generate with `hack/update-toc.sh`. - [Graduation Criteria](#graduation-criteria) - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature enablement and rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) - [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) - [Alternatives](#alternatives) @@ -115,11 +122,15 @@ Check these off as they are completed for the Release Team to track. These checklist items _must_ be updated for the enhancement to be released. --> -- [ ] Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) -- [ ] KEP approvers have approved the KEP status as `implementable` -- [ ] Design details are appropriately documented -- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input -- [ ] Graduation criteria is in place +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input +- [ ] (R) Graduation criteria is in place +- [ ] (R) Production readiness review completed +- [ ] Production readiness review approved - [ ] "Implementation History" section is up-to-date for milestone - [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] - [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes @@ -332,6 +343,205 @@ enhancement: CRI or CNI may require updating that component before the kubelet. --> +## Production Readiness Review Questionnaire + + + +### Feature enablement and rollback + +_This section must be completed when targeting alpha to a release._ + +* **How can this feature be enabled / disabled in a live cluster?** + - [ ] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: + - Components depending on the feature gate: + - [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). + +* **Can the feature be disabled once it has been enabled (i.e. can we rollback + the enablement)?** + Also set `rollback-supported` to `true` or `false` in `kep.yaml`. + Describe the consequences on existing workloads (e.g. if this is runtime + feature, can it break the existing applications?). + +* **What happens if we reenable the feature if it was previously rolled back?** + +* **Are there any tests for feature enablement/disablement?** + The e2e framework does not currently support enabling and disabling feature + gates. However, unit tests in each component dealing with managing data created + with and without the feature are necessary. At the very least, think about + conversion tests if API types are being modified. + +### Rollout, Upgrade and Rollback Planning + +_This section must be completed when targeting beta graduation to a release._ + +* **How can a rollout fail? Can it impact already running workloads?** + Try to be as paranoid as possible - e.g. what if some components will restart + in the middle of rollout? + +* **What specific metrics should inform a rollback?** + +* **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?** + Describe manual testing that was done and the outcomes. + Longer term, we may want to require automated upgrade/rollback tests, but we + are missing a bunch of machinery and tooling and do that now. + +### Monitoring requirements + +_This section must be completed when targeting beta graduation to a release._ + +* **How can an operator determine if the feature is in use by workloads?** + Ideally, this should be a metrics. Operations against Kubernetes API (e.g. + checking if there are objects with field X set) may be last resort. Avoid + logs or events for this purpose. + +* **What are the SLIs (Service Level Indicators) an operator can use to + determine the health of the service?** + - [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: + - [ ] Other (treat as last resort) + - Details: + +* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** + At the high-level this usually will be in the form of "high percentile of SLI + per day <= X". It's impossible to provide a comprehensive guidance, but at the very + high level (they needs more precise definitions) those may be things like: + - per-day percentage of API calls finishing with 5XX errors <= 1% + - 99% percentile over day of absolute value from (job creation time minus expected + job creation time) for cron job <= 10% + - 99,9% of /health requests per day finish with 200 code + +* **Are there any missing metrics that would be useful to have to improve + observability if this feature?** + Describe the metrics themselves and the reason they weren't added (e.g. cost, + implementation difficulties, etc.). + +### Dependencies + +_This section must be completed when targeting beta graduation to a release._ + +* **Does this feature depend on any specific services running in the cluster?** + Think about both cluster-level services (e.g. metrics-server) as well + as node-level agents (e.g. specific version of CRI). Focus on external or + optional services that are needed. For example, if this feature depends on + a cloud provider API, or upon an external software-defined storage or network + control plane. + + For each of the fill in the following, thinking both about running user workloads + and creating new ones, as well as about cluster-level services (e.g. DNS): + - [Dependency name] + - Usage description: + - Impact of its outage on the feature: + - Impact of its degraded performance or high error rates on the feature: + + +### Scalability + +_For alpha, this section is encouraged: reviewers should consider these questions +and attempt to answer them._ + +_For beta, this section is required: reviewers must answer these questions._ + +_For GA, this section is required: approvers should be able to confirms the +previous answers based on experience in the field._ + +* **Will enabling / using this feature result in any new API calls?** + Describe them, providing: + - API call type (e.g. PATCH pods) + - estimated throughput + - originating component(s) (e.g. Kubelet, Feature-X-controller) + focusing mostly on: + - components listing and/or watching resources they didn't before + - API calls that may be triggered by changes of some Kubernetes resources + (e.g. update of object X triggers new updates of object Y) + - periodic API calls to reconcile state (e.g. periodic fetching state, + heartbeats, leader election, etc.) + +* **Will enabling / using this feature result in introducing new API types?** + Describe them providing: + - API type + - Supported number of objects per cluster + - Supported number of objects per namespace (for namespace-scoped objects) + +* **Will enabling / using this feature result in any new calls to cloud + provider?** + +* **Will enabling / using this feature result in increasing size or count + of the existing API objects?** + Describe them providing: + - API type(s): + - Estimated increase in size: (e.g. new annotation of size 32B) + - Estimated amount of new objects: (e.g. new Object X for every existing Pod) + +* **Will enabling / using this feature result in increasing time taken by any + operations covered by [existing SLIs/SLOs][]?** + Think about adding additional work or introducing new steps in between + (e.g. need to do X to start a container), etc. Please describe the details. + +* **Will enabling / using this feature result in non-negligible increase of + resource usage (CPU, RAM, disk, IO, ...) in any components?** + Things to keep in mind include: additional in-memory state, additional + non-trivial computations, excessive access to disks (including increased log + volume), significant amount of data send and/or received over network, etc. + This through this both in small and large cases, again with respect to the + [supported limits][]. + +### Troubleshooting + +Troubleshooting section serves the `Playbook` role as of now. We may consider +splitting it into a dedicated `Playbook` document (potentially with some monitoring +details). For now we leave it here though. + +_This section must be completed when targeting beta graduation to a release._ + +* **How does this feature react if the API server and/or etcd is unavailable?** + +* **What are other known failure modes?** + For each of them fill in the following information by copying the below template: + - [Failure mode brief description] + - Detection: How can it be detected via metrics? Stated another way: + how can an operator troubleshoot without loogging into a master or worker node? + - Mitigations: What can be done to stop the bleeding, especially for already + running user workloads? + - Diagnostics: What are the useful log messages and their required logging + levels that could help debugging the issue? + Not required until feature graduated to Beta. + - Testing: Are there any tests for failure mode? If not describe why. + +* **What steps should be taken if SLOs are not being met to determine the problem?** + +[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md +[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos + ## Implementation History