Need clarification: KEP stages and transitions #2960

Closed · thockin opened this issue Sep 8, 2021 · 16 comments
Labels: area/enhancements, sig/architecture, sig/release

@thockin (Member) commented Sep 8, 2021

From Slack (summarized):

Is the KEP freeze "you must have all KEP changes merged" or "you must have all intended-to-change KEPs in the sheet"?

For example: I have some KEPs which we WANT to move from Alpha->Beta or Beta->GA, but the code to do those things is not done yet and MIGHT slip. I don't want to change KEP stage text until we know whether it is in or not, and that won't be until well after the enhancement freeze.

That's not an exception - that's the norm, I think.

I think the only sane interpretation is that the KEP deadline indicates which KEPs are being tracked, but every KEP should be considered "at risk" until stage changes, and when stage changes, that should be authoritative. What I DON'T want is to have to remember to revert changes to a KEP if the code misses the release. That will fail and leave ambiguous results.

Do we need a 2-phase commit - a distinction between target and actual?

We need clarity on the enhancement freeze - what EXACTLY is the deadline for? What sorts of KEP changes are allowed after the enhancement freeze?

Rey said "Enhancements Freeze indicates which KEPs are being tracked ... What the release team looks for in Enhancements Freeze is for the KEP to be marked as "implementable", has the correct stage for the release, ... and an approved PRR"

This says to me that if I want to move to GA in 1.23, I update the KEP to say GA for stage. If the code doesn't make it, we have a KEP that is ambiguous (or flat-out lying). Expecting people to come back and revert the stage is unrealistic.

IMO - stage should be a trailing indicator. Updating stage should be on the success path, not on the failure path.

Some SIGs/reviewers are resetting the status to implemented/implementable for each stage. I don't think that was the intention.

Between stage, milestone, and status there's overlap and ambiguity. For example, let's look at a new KEP.

It starts as:

stage: alpha
status: provisional
alpha: 
beta: 
ga: 

We iterate on that and it ultimately gets approved:

stage: alpha
status: implementable
alpha: v1.X
beta: v1.Y
ga: v1.Z

Development work happens and we release v1.X. Given the above fields, I can assume that this KEP is alpha in v1.X. But if, for some reason, the code missed the boat, someone has to go back and edit the KEP to change alpha, beta, and ga. That is prone to failure, and if it is forgotten, people might draw the wrong conclusion.

I think we should instead do something like:

stage: prealpha
status: provisional
alpha: 
beta: 
ga: 

which gets approved to:

stage: prealpha
status: implementable
alpha: v1.X
beta: v1.Y
ga: v1.Z

When we release v1.X, we can update this to:

stage: alpha
status: implementable
alpha: v1.X
beta: v1.Y
ga: v1.Z

This means that someone has to touch the KEP - the same as before, but the failure mode is better. People won't accidentally assume it is ready to use. It would stay in this state until all the code for beta is merged, at which point it becomes:

stage: beta
status: implementable
alpha: v1.X
beta: v1.Y
ga: v1.Z

Obviously, we could model this in other ways. E.g. we could have a per-stage status:

status:
  alpha: implemented
  beta: implementable
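
To make that variant concrete, here's a minimal sketch of how CI could validate it. The flat per-stage status map and the ordering rule are assumptions for illustration, not the current kep.yaml schema:

```python
# Hypothetical per-stage status map; NOT the current kep.yaml schema.
VALID_STATUSES = {"provisional", "implementable", "implemented"}
STAGE_ORDER = ("alpha", "beta", "stable")

def validate_per_stage_status(status_by_stage):
    """Reject out-of-order graduation, e.g. beta implemented before alpha."""
    errors = []
    earlier_unimplemented = False
    for stage in STAGE_ORDER:
        status = status_by_stage.get(stage)
        if status is None:
            continue
        if status not in VALID_STATUSES:
            errors.append(f"{stage}: unknown status {status!r}")
        elif status == "implemented" and earlier_unimplemented:
            errors.append(f"{stage}: implemented, but an earlier stage is not")
        if status != "implemented":
            earlier_unimplemented = True
    return errors

# The example from the comment above: alpha done, beta still in progress.
assert validate_per_stage_status({"alpha": "implemented", "beta": "implementable"}) == []
```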
k8s-ci-robot added the needs-sig label Sep 8, 2021
@thockin (Member, Author) commented Sep 8, 2021

/sig release

k8s-ci-robot added the sig/release label and removed the needs-sig label Sep 8, 2021
@justaugustus (Member) commented:

cc: @kubernetes/sig-release @kubernetes/enhancements
/area enhancements

k8s-ci-robot added the area/enhancements label Sep 8, 2021
justaugustus added this to the keps-beta milestone Sep 8, 2021
@johnbelamaric (Member) commented:

Thanks, Tim. Here's another case where the CI makes too many assumptions about the various fields being set correctly at the correct times:

#563 (comment)

We had intended to have kepctl manage this metadata in a consistent way, but that has stalled. We need a better solution in the interim.

@thockin (Member, Author) commented Sep 8, 2021

If we can define the state machine, we can code it up as CI: "Error: can't transition directly from state FOO to state BAR".
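
A minimal sketch of what that check might look like, assuming the hypothetical `prealpha` ladder from the issue description (this is illustrative, not existing kepctl/CI tooling):

```python
# Hypothetical stage ladder from the proposal above. CI would diff the old
# and new kep.yaml and reject any change that skips a rung of the ladder.
STAGES = ["prealpha", "alpha", "beta", "stable"]

def check_transition(old_stage, new_stage):
    old_i, new_i = STAGES.index(old_stage), STAGES.index(new_stage)
    if new_i not in (old_i, old_i + 1):
        raise ValueError(
            f"Error: can't transition directly from stage {old_stage!r} "
            f"to stage {new_stage!r}"
        )

check_transition("prealpha", "alpha")  # OK: one step forward
check_transition("alpha", "stable")    # raises: skips beta
```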

@wojtek-t (Member) commented Sep 9, 2021

Copying my comment from the Slack channel:

I think that what we're missing is basically a "two-phase commit".
In the ideal world, the way I see it is:
- you update the KEP for feature freeze with your intention
- then you do the actual implementation
- once all the implementation is done (including metrics, tests, etc.), you mark the KEP with your "actual intent"
- and only then are you allowed to send a PR changing the value of a feature-gate

That would give us a way to ensure that the code actually graduates only when it should...
and at the third step you're allowed to update the KEP if needed (e.g. something that wasn't predicted was discovered).

Which means that the state machine I'm proposing is (well, "state machine" seems like too big a word for it, as transitions can only go between adjacent states):

provisional -> intended for Alpha -> code-complete for Alpha [-> graduated Alpha ??] -> intended for Beta -> code-complete for Beta [-> graduated beta ??] -> intended for Stable -> code-complete for Stable -> implemented

In the ideal world, how I would see it is:

  • we have feature-freeze as it is now [for putting KEPs into the "intended for X" state]
  • then there is code-freeze [but PRs touching the feature-gates file are not allowed at that point]
  • then there is graduation-freeze [where you mark your KEP as code-complete for X, and approvers check that everything that should happen really happened: not just code, but also tests, metrics, ...]
  • based on that, you're then allowed to bump your feature-gate

I admit that adds more burden on the approvers, but given that code approvers are not necessarily the same people as KEP approvers, in the current maturity state of the project I think it would actually be worth it.
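
For illustration only, a tiny sketch of the final gate in this two-phase scheme; the `code_complete_for` field is a made-up stand-in for however "code-complete for X" gets recorded:

```python
# Hypothetical two-phase-commit check: a PR promoting a feature gate is
# allowed only once the owning KEP is marked code-complete for that stage.
def gate_promotion_allowed(kep, target_stage):
    # "code_complete_for" would be set at graduation-freeze, after approvers
    # verify that code, tests, and metrics all actually landed.
    return kep.get("code_complete_for") == target_stage

kep = {"name": "my-feature", "code_complete_for": "alpha"}
assert gate_promotion_allowed(kep, "alpha")      # gate may move to alpha
assert not gate_promotion_allowed(kep, "beta")   # beta not code-complete yet
```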

@kikisdeliveryservice (Member) commented Sep 22, 2021

Q: is this a speculative issue? Because when I was on the release team, we did go through the KEPs post-code-freeze to ensure all PRs intended to merge did merge, and if not, we asked the authors to update the KEPs.

While I think there is more granular control we could have over the KEPs, I worry that forcing people to update KEPs multiple times within a release, after freeze, would create a new set of missed updates: authors who previously had no problem keeping their KEPs up to date could end up creating more issues than currently exist.

Thinking of something that is lower friction but provides the verification that @thockin rightly wants about the state: could we ask SIGs to somehow verify the state of their KEPs post-code-freeze? I'd be inclined to do this via the release team for one release, see how it works out, gather feedback, and then memorialize it in the actual KEP template.

Further, I think a good point is made that until the release is cut, all the KEP statuses are tentative. That is something we should communicate more clearly to outside observers, even outside of the KEP itself - perhaps via docs in the enhancements repo and on the KEP website that is being worked on, something like: "All KEPs targeted at the current in-progress release $RELEASENUMBER are tentative. Please wait until the release is cut on $DATE for the final list and status of KEPs in this release."

@kikisdeliveryservice (Member) commented:

As a final note, "implemented" is generally used for KEPs that are completely finished, post-GA. I think the goal should be that at the end of the release, only the verifiably "finished" KEPs (for their stage) have the release and graduation in their kep.yaml. IMO we should be removing/updating inaccurate KEPs as opposed to adding a hurdle for accurate KEPs.

I'd like to give this a bit more thought, but thanks for raising it.

kikisdeliveryservice self-assigned this Sep 22, 2021
kikisdeliveryservice added the sig/architecture label Sep 22, 2021
@wojtek-t (Member) commented:

> Q: is this a speculative issue? Because when I was on the release team, we did go through the KEPs post-code-freeze to ensure all PRs intended to merge did merge, and if not, we asked the authors to update the KEPs.

This isn't a speculative question - this is a real thing that happened.
There are multiple aspects to this:

  • you were going through PRs that linked the enhancement issue, right? There is no guarantee that all of them were actually linked
  • you were looking at PRs that were expected to merge; the issue I found was also the opposite: things that shouldn't have merged were merged - see [WIP] Promote KEP-1672 to GA #2938 (comment)

@spiffxp (Member) commented Sep 23, 2021

FWIW, the logic itself is pretty easy to change; the tangly bits are in the data, since it's mostly hand-edited YAML that has historically not been validated against much.

So for example, if we decide to change `required = !(stage != implementable)` to `required = stage == implementable || stage == implemented`, then we need to update... *checks notes* ...about 71 KEPs to have a PRR file.

We could use latest-milestone to avoid checking this for KEPs earlier than v1.21, but this is yet another field that's not common and easily falls out of date. Tracking down what the value should actually be and/or requiring people to update it is toilsome. Stuff like this starts to make me think this issue is a subset of #2988.
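
A sketch of the revised predicate with such a latest-milestone cutoff (illustrative only; it assumes well-formed "v1.N" milestone strings):

```python
# Sketch: PRR file required for implementable AND implemented KEPs, but
# only for KEPs whose latest-milestone is v1.21 or newer; older KEPs
# predate the PRR process and are grandfathered.
def prr_required(stage, latest_milestone):
    if stage not in ("implementable", "implemented"):
        return False
    if not latest_milestone:
        return False  # missing field: can't tell (this is the tangly bit)
    minor = int(latest_milestone.lstrip("v").split(".")[1])
    return minor >= 21

assert prr_required("implemented", "v1.22")
assert not prr_required("implemented", "v1.20")
assert not prr_required("provisional", "v1.22")
```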

@spiffxp (Member) commented Oct 8, 2021

FWIW, the following KEPs lack a latest-milestone field, which is the blocker to switching the PRR check to trigger on stage == implemented:

  • keps/provider-aws/2313-aws-k8s-tester/kep.yaml
  • keps/provider-aws/2314-custom-endpoints-support-for-aws-cloud-provider/kep.yaml
  • keps/provider-aws/2315-aws-loadbalancer-prefix/kep.yaml
  • keps/provider-aws/629-alb-ingress-controller/kep.yaml
  • keps/provider-aws/630-ebs-csi-driver/kep.yaml
  • keps/sig-api-machinery/1101-immutable-fields/kep.yaml
  • keps/sig-api-machinery/1152-less-object-serializations/kep.yaml
  • keps/sig-api-machinery/1623-standardize-conditions/kep.yaml
  • keps/sig-api-machinery/1872-manifest-based-admission-webhooks/kep.yaml
  • keps/sig-api-machinery/2332-pruning-for-custom-resources/kep.yaml
  • keps/sig-api-machinery/2333-legacyflags-kflag/kep.yaml
  • keps/sig-api-machinery/2335-vanilla-crd-openapi-subset-structural-schemas/kep.yaml
  • keps/sig-api-machinery/2337-k8s.io-group-protection/kep.yaml
  • keps/sig-api-machinery/2340-Consistent-reads-from-cache/kep.yaml
  • keps/sig-api-machinery/2341-enabling-clients-to-tell-if-resource-endpoints-serve-the-same-set-of-objects/kep.yaml
  • keps/sig-api-machinery/2342-exposing-hashed-storage-versions-via-the-discovery-API/kep.yaml
  • keps/sig-api-machinery/2343-automated-storage-version-migration-with-storage-version-hash/kep.yaml
  • keps/sig-api-machinery/492-admission-webhooks/kep.yaml
  • keps/sig-api-machinery/575-crd-defaulting/kep.yaml
  • keps/sig-api-machinery/576-dry-run/kep.yaml
  • keps/sig-api-machinery/598-crd-conversion-webhook/kep.yaml
  • keps/sig-api-machinery/956-watch-bookmark/kep.yaml
  • keps/sig-apps/2360-optional-service-environment-variables/kep.yaml
  • keps/sig-apps/706-portable-service-definitions/kep.yaml
  • keps/sig-architecture/0000-kep-process/kep.yaml
  • keps/sig-architecture/1194-prod-readiness/kep.yaml
  • keps/sig-architecture/1618-conformance-profiles/kep.yaml
  • keps/sig-architecture/1659-standard-topology-labels/kep.yaml
  • keps/sig-architecture/617-improve-kep-implementation/kep.yaml
  • keps/sig-architecture/960-conformance-behaviors/kep.yaml
  • keps/sig-auth/1513-certificate-signing-request/kep.yaml
  • keps/sig-auth/1687-hierarchical-namespaces-subproject/kep.yaml
  • keps/sig-auth/279-limit-node-access/kep.yaml
  • keps/sig-auth/600-dynamic-audit-configuration/kep.yaml
  • keps/sig-auth/789-harden-default-discover-bindings/kep.yaml
  • keps/sig-autoscaling/117-hpa-metrics-specificity/kep.yaml
  • keps/sig-autoscaling/853-configurable-hpa-scale-velocity/kep.yaml
  • keps/sig-cli/2229-kubectl-xdg-base-dir/kep.yaml
  • keps/sig-cli/2377-Kustomize/kep.yaml
  • keps/sig-cli/2379-kubectl-plugins/kep.yaml
  • keps/sig-cli/2380-data-driven-commands-for-kubectl/kep.yaml
  • keps/sig-cli/2381-future-of-kubectl-cp/kep.yaml
  • keps/sig-cli/2384-kustomize-file-processing-integration/kep.yaml
  • keps/sig-cli/2386-kustomize-subcommand-integration/kep.yaml
  • keps/sig-cli/491-kubectl-diff/kep.yaml
  • keps/sig-cloud-provider/1179-building-without-in-tree-providers/kep.yaml
  • keps/sig-cloud-provider/2133-out-of-tree-credential-provider/kep.yaml
  • keps/sig-cloud-provider/2392-cloud-controller-manager/kep.yaml
  • keps/sig-cloud-provider/2394-support-out-of-tree-AWS-cloud-provider/kep.yaml
  • keps/sig-cloud-provider/668-out-of-tree-gce/kep.yaml
  • keps/sig-cloud-provider/669-out-of-tree-openstack/kep.yaml
  • keps/sig-cloud-provider/670-out-of-tree-vsphere/kep.yaml
  • keps/sig-cloud-provider/671-out-of-tree-ibm/kep.yaml
  • keps/sig-cloud-provider/azure/2328-ccm-instance-metadata/kep.yaml
  • keps/sig-cloud-provider/providers/0000-cloud-provider-template/kep.yaml
  • keps/sig-cloud-provider/providers/2530-alibaba-cloud/kep.yaml
  • keps/sig-cluster-lifecycle/addons/2492-Addons-via-Operators/kep.yaml
  • keps/sig-cluster-lifecycle/addons/2494-Manifest-Bundle/kep.yaml
  • keps/sig-cluster-lifecycle/clusterapi/2495-Kubernetes-Cluster-Management-API/kep.yaml
  • keps/sig-cluster-lifecycle/etcdadm/2496-etcdadm/kep.yaml
  • keps/sig-cluster-lifecycle/image-builder/2497-Kubernetes-Image-Builder/kep.yaml
  • keps/sig-cluster-lifecycle/kubeadm/1177-kubeadm-with-kustomize/kep.yaml
  • keps/sig-cluster-lifecycle/kubeadm/2498-Kubeadm-Config-versioning/kep.yaml
  • keps/sig-cluster-lifecycle/kubeadm/2500-kubeadm-join-control-plane-workflow/kep.yaml
  • keps/sig-cluster-lifecycle/kubeadm/2501-kubeadm-phases-to-beta/kep.yaml
  • keps/sig-cluster-lifecycle/kubeadm/2505-Kubeadm-operator/kep.yaml
  • keps/sig-cluster-lifecycle/kubeadm/378-bootstrap-checkpointing/kep.yaml
  • keps/sig-contributor-experience/0000-community-forum/kep.yaml
  • keps/sig-contributor-experience/1553-issue-triage/kep.yaml
  • keps/sig-contributor-experience/2225-contributor-site/kep.yaml
  • keps/sig-docs/1326-third-party-content-in-docs/kep.yaml
  • keps/sig-instrumentation/1013-metrics-watch-api/kep.yaml
  • keps/sig-instrumentation/1206-metrics-overhaul/kep.yaml
  • keps/sig-network/1024-nodelocal-cache-dns/kep.yaml
  • keps/sig-network/2447-Make-kube-proxy-service-abstraction-optional/kep.yaml
  • keps/sig-network/2448-Remove-kube-proxy-automatic-clean-up-logic/kep.yaml
  • keps/sig-network/2449-move-externalDNS-out-of-kubernetes-incubator/kep.yaml
  • keps/sig-network/2450-Remove-knowledge-of-pod-cluster-CIDR-from-iptables-rules/kep.yaml
  • keps/sig-network/265-ipvs-based-load-balancing/kep.yaml
  • keps/sig-network/2829-gateway-api-to-k8s-io/kep.yaml
  • keps/sig-network/427-coredns/kep.yaml
  • keps/sig-network/566-coredns-default/kep.yaml
  • keps/sig-network/580-pod-readiness-gates/kep.yaml
  • keps/sig-network/980-load-balancer-finalizers/kep.yaml
  • keps/sig-node/135-seccomp/kep.yaml
  • keps/sig-node/1547-building-kubelet-without-docker/kep.yaml
  • keps/sig-node/166-taint-based-eviction/kep.yaml
  • keps/sig-node/495-pod-pid-namespace/kep.yaml
  • keps/sig-node/589-efficient-node-heartbeats/kep.yaml
  • keps/sig-node/688-pod-overhead/kep.yaml
  • keps/sig-node/753-sidecar-containers/kep.yaml
  • keps/sig-node/757-pid-limiting/kep.yaml
  • keps/sig-node/950-liveness-probe-holdoff/kep.yaml
  • keps/sig-release/1731-publishing-packages/kep.yaml
  • keps/sig-release/1732-artifact-management/kep.yaml
  • keps/sig-release/1733-release-notes/kep.yaml
  • keps/sig-scheduling/2372-node-labels-quota/kep.yaml
  • keps/sig-scheduling/382-taint-node-by-condition/kep.yaml
  • keps/sig-scheduling/548-schedule-daemonset-pods/kep.yaml
  • keps/sig-scheduling/583-coscheduling/kep.yaml
  • keps/sig-scheduling/986-resource-quota-scope-selectors/kep.yaml
  • keps/sig-storage/121-local-persistent-volumes/kep.yaml
  • keps/sig-storage/1979-object-storage-support/kep.yaml
  • keps/sig-storage/2451-service-account-token-volumes/kep.yaml
  • keps/sig-storage/351-raw-block-support/kep.yaml
  • keps/sig-storage/554-maximum-volume-count/kep.yaml
  • keps/sig-storage/557-csi-topology/kep.yaml
  • keps/sig-storage/559-volume-subpath-expansion/kep.yaml
  • keps/sig-storage/565-csi-block-support/kep.yaml
  • keps/sig-storage/603-csi-pod-info/kep.yaml
  • keps/sig-storage/770-csi-skip-attach/kep.yaml
  • keps/sig-storage/989-extend-datasource/kep.yaml
  • keps/sig-windows/1043-windows-security-context/kep.yaml
  • keps/sig-windows/116-windows-node-support/kep.yaml
  • keps/sig-windows/689-windows-gmsa/kep.yaml

Less toilsome logic we could try instead of "latest-milestone < 1.21":

  • "last-updated" < whatever date 1.21 was
  • hardcode this list as an exemption list, and ignore the validation error for any of these KEPs
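
The exemption-list option could be a sketch as simple as this (the file name and format are made up):

```python
# Hypothetical exemption list: one kep.yaml path per line, for KEPs that
# predate the latest-milestone field; validation errors on them are ignored.
def load_exemptions(path="prr_exemptions.txt"):
    with open(path) as f:
        return {line.strip() for line in f
                if line.strip() and not line.startswith("#")}

def prr_check_applies(kep_path, exemptions):
    return kep_path not in exemptions
```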

@spiffxp (Member) commented Oct 8, 2021

#2672 opted to mark everything as implementable and give a dummy "0.0" milestone; we could also add that milestone to everything currently marked as implemented...

@spiffxp (Member) commented Oct 8, 2021

I've opened #3007 to add the missing metadata and enable the PRR check for implemented KEPs.

justaugustus removed their assignment Dec 2, 2021
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Mar 2, 2022
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Apr 1, 2022
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor) commented:

@k8s-triage-robot: Closing this issue.

In response to the triage robot's /close command above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

thockin moved this to "meta" in KEPs I am tracking Dec 20, 2022
thockin removed the lifecycle/rotten label Dec 21, 2022