From 3e93937142c5221e70df1368ef131ff9782f4040 Mon Sep 17 00:00:00 2001 From: Eddie Zaneski Date: Wed, 2 Jun 2021 14:37:35 -0600 Subject: [PATCH] 2775: kepctl create Signed-off-by: Eddie Zaneski --- .../README.md | 579 ++++++++++++++++++ .../kep.yaml | 24 + 2 files changed, 603 insertions(+) create mode 100644 keps/sig-cli/2775-kubectl-delete-interactivity-delay/README.md create mode 100644 keps/sig-cli/2775-kubectl-delete-interactivity-delay/kep.yaml diff --git a/keps/sig-cli/2775-kubectl-delete-interactivity-delay/README.md b/keps/sig-cli/2775-kubectl-delete-interactivity-delay/README.md new file mode 100644 index 00000000000..2400a09853f --- /dev/null +++ b/keps/sig-cli/2775-kubectl-delete-interactivity-delay/README.md @@ -0,0 +1,579 @@ +# KEP-2775: kubectl delete interactivity and delay + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +When a namespace is deleted it also deletes all of the resources under it. The deletion runs without further confirmation, and can be devastating if accidentally run against the wrong namespace (e.g. thanks to hasty tab completion use). + +``` +kubectl delete namespace prod-backup +``` + +When all namespaces are deleted essentially all resources are deleted. This deletion is trivial to do with the `--all` flag, and it also runs without further confirmation. It can effectively wipe out a whole cluster. + +``` +kubectl delete namespace --all +``` + +There are certainly things cluster operators should be doing to help prevent this user error (like locking down permissions) but it doesn't matter how many mistakes have to pile up to create a perfect storm of a bad thing because we're allowing a bad thing to happen without a confirmation gate. + +This KEP proposes two protections when deleting resources with kubectl. + +## Motivation + +Over the years we've heard stories from users who've accidentally deleted resources in their clusters - often all of their resources. This trend seems to be rising lately as newer folks venture into the Kubernetes/DevOps/Infra world. Experienced users are also not accident proof and have shared tragic tales. + +### Goals + +* Mitigate the potential for accidental imperative deletes + +### Non-Goals + +* Server side solutions + +## Proposal + +1. `kubectl delete [--all | --all-namespaces]` will warn about the destructive action that will be performed and artificially delay for x seconds allowing users a chance to abort. + +2. Add a new `--interactive | -i` flag to `kubectl delete` that will require confirmation before deleting resources imperatively (i.e. not `-f`). This flag will be false by default. If this flag is explicitly set to false it will not prompt or artificially delay. + +These two changes introduce significant protections for users and do not break backwards compatibility. Scripts will see an x second delay unless they explicitly set `-i=false`. + +This KEP is specifically targeting imperative deletes to speed up rollout. Future KEPs may expand the scope of interactive confirmation. + +``` +kubectl delete deployment nginx --interactive +``` + +### User Stories (Optional) + +#### Story 1 + +> Creating basic safeguards is not just about junior users, I have deleted massive amounts of infrastructure several times because I was in the wrong kubectl context. + +https://groups.google.com/g/kubernetes-dev/c/y4Q20V3dyOk/m/8xdQ8TM_BgAJ + +> We had this run by a developer by mistake today and it wiped all resources in the cluster, with the exception of the `kube-system`, `kube-public` and `default` namespaces, and the resources within. + +https://github.com/kubernetes/kubectl/issues/696 + +> One of the members of our technical staff with admin privileges tried to delete all objects in a namespace, in an event of hurry, he ran, something like `kubectl delete namespace -n some-namespace --all` which caused all namespaces and their objects deleted to a very granular level, apart from objects in kube-* namespaces. + +As a user I should be warned before accidentally deleting all of the resources in my cluster. + +### Notes/Constraints/Caveats (Optional) + +Several contributors have raised concerns about making breaking changes here. Our goal is to strike a balance between not breaking existing scripts and protecting users from making mistakes. + +The major restriction is around users running kubectl in CI/CD pipelines. We do not want to break their scripts. + +### Risks and Mitigations + +CI/CD jobs that upgrade their version of kubectl will see an x second slow down if they are deleting with `--all | --all-namespaces` without setting `-i=false`. Some CI/CD platforms set `CI=true` in their default environment. We could potentially skip the delay if this is set. + +Do: +* https://circleci.com/docs/2.0/env-vars/#built-in-environment-variables +* https://docs.travis-ci.com/user/environment-variables/#default-environment-variables +* https://docs.semaphoreci.com/ci-cd-environment/environment-variables/#ci +* https://codefresh.io/docs/docs/codefresh-yaml/variables/#system-provided-variables +* https://docs.gitlab.com/ee/ci/variables/predefined_variables.html +* https://docs.github.com/en/actions/reference/environment-variables +* https://docs.cloudbees.com/docs/cloudbees-codeship/latest/basic-builds-and-configuration/set-environment-variables +* https://docs.drone.io/pipeline/environment/reference/ci/ +* https://devcenter.wercker.com/administration/environment-variables/available-env-vars/ +* https://buildkite.com/docs/pipelines/environment-variables#bk-env-vars-ci + +Do Not: +* https://docs.microsoft.com/en-us/azure/devops/pipelines/build/variables?view=azure-devops&tabs=yaml +* https://cloud.google.com/build/docs/configuring-builds/substitute-variable-values +* https://docs.aws.amazon.com/codebuild/latest/userguide/build-env-ref-env-vars.html +* https://www.jetbrains.com/help/teamcity/predefined-build-parameters.html#Server+Build+Properties +* https://wiki.jenkins.io/display/JENKINS/Building+a+software+project#Buildingasoftwareproject-belowJenkinsSetEnvironmentVariables +* https://jenkins-x.io/docs/resources/guides/using-jx/pipelines/envvars/ +* https://confluence.atlassian.com/bamboo/bamboo-variables-289277087.html +* https://docs.cloudbees.com/docs/cloudbees-codeship/latest/pro-builds-and-configuration/environment-variables#_default_environment_variables (not on pro version?) +* https://docs.gocd.org/current/faq/dev_use_current_revision_in_build.html#standard-gocd-environment-variables + +There is concern that these protections are not enough as they require users to opt-in for confirmation. This KEP is viewed as a starting point to introduce some form of protection that users may start to use. A way for users to set this as the default in the future will be in a future KEP. + +## Design Details + +If `--all` or `--all-namespaces` flags are passed in for a delete operation a warning will be printed and an artificial delay will be imposed. We need to determine what the correct length is for this. Initial thoughts were between 5 and 10 seconds. User testing will be done to gauge how long is needed to read the warning message. If `CI=true` is set in the environment this delay can be skipped. + +The warning message will be printed to stderr. We need to decide if we print a generic warning message or detail out the resources that will be deleted. The latter will be more complicated (especially calculating the resources deleted under a namespace) but more rewarding to the user. Deleting a namespace with `--dry-run=server` does not return any resources in the namespace. + +> Warning! You are deleting resources with `--all`. This will delete all of the (pods) in the (cluster or current namespace). Proceeding in x seconds. ctrl+c to interrupt + +> Warning! You are deleting the following resources: +> Name: Group: Kind: +> my-pod core/v1 Pod +> Proceeding in x seconds. ctrl+c to interrupt + +If the `--interactive` flag is passed for imperative deletes we will prompt the user for confirmation. As above we need to determine the prompt -- a generic message or explicitly lay out the resources. No (`N`) will be the default response if enter is pressed without input. + +``` +kubectl delete namespace prod -i +Are you sure? (y/N) +``` + +``` +kubectl delete namespace prod -i +Are you sure you want to delete namespace prod? (y/N) +``` + +Setting `--interactive=false` will intentionally declare that a user does not want to be prompted or delayed. I believe [Cobra/Viper can determine if a flag has been explicitly set](https://github.com/spf13/cobra/issues/453#issuecomment-303548563). + +### Test Plan + +Unit tests, e2e, and integration tests will be implemented. We already have these suites in place for kubectl. + +### Graduation Criteria + +TBD + + + +### Upgrade / Downgrade Strategy + +n/a + +### Version Skew Strategy + +n/a + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + +###### How can this feature be enabled / disabled in a live cluster? + +- [x] Other + - Describe the mechanism: The `-i` flag is opt-in from the client. The delay is opt-out from the client. + - Will enabling / disabling the feature require downtime of the control + plane? n/a + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). n/a + +###### Does enabling the feature change any default behavior? + +An x second delay will be introduced client-side when deleting with `--all | --all-namespaces`. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + +n/a + +###### What happens if we reenable the feature if it was previously rolled back? + +n/a + +###### Are there any tests for feature enablement/disablement? + +n/a + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +###### Will enabling / using this feature result in introducing new API types? + + + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + +The delay introduced in this KEP will cause CI/CD jobs to take x seconds longer unless authors add `-i=false`. As stated above we can potentially mitigate this by skipping if `CI=true` is set. + +## Alternatives + +We considered showing the confirmation prompt if a TTY was detected. After further research it was determined that we would run into issues with TTY detection due to platforms and pipelines spoofing a TTY in an attempt to match the output of developer's terminals (e.g. color). + +https://groups.google.com/g/kubernetes-dev/c/y4Q20V3dyOk/m/LOe7Id1DBgAJ +https://groups.google.com/g/kubernetes-dev/c/y4Q20V3dyOk/m/W0y2fD-NAAAJ + +During RFC discussion several ideas were shared. One of them was a locking mechanism for resources that would prevent deletion. This is considered orthogonal to this KEP because users don't intend to delete the resources they are accidentally deleting - not that there are things that should never be deleted. + +A similar mechanism is preset in OpenKruise. + +https://openkruise.io/en-us/docs/deletion_protection.html + +Resource locking would make a great future KEP. + +https://groups.google.com/g/kubernetes-dev/c/y4Q20V3dyOk/m/zTLxKK5ABgAJ diff --git a/keps/sig-cli/2775-kubectl-delete-interactivity-delay/kep.yaml b/keps/sig-cli/2775-kubectl-delete-interactivity-delay/kep.yaml new file mode 100644 index 00000000000..9e378f42693 --- /dev/null +++ b/keps/sig-cli/2775-kubectl-delete-interactivity-delay/kep.yaml @@ -0,0 +1,24 @@ +id: "" +prnumber: "" +name: 2775-kubectl-delete-interactivity-delay +title: kubectl delete interactivity and delay +kep-number: "2775" +authors: + - "@eddiezane" +owning-sig: sig-cli +participating-sigs: [sig-cli] +reviewers: [] +approvers: [] +prr-approvers: [] +creation-date: "2021-06-02" +last-updated: v1.22 +status: provisional +stage: "" +latest-milestone: "" +milestone: + alpha: "" + beta: "" + stable: "" +feature-gates: [] +disable-supported: false +metrics: []