KEP-3960: Introducing Sleep Action for PreStop Hook #3961

AxeZhan · 2023-04-22T09:19:16Z

One-line PR description: This KEP proposes the addition of a new sleep action for the PreStop lifecycle hook in Kubernetes, allowing containers to pause for a specified duration before termination.

Issue link: Introducing Sleep Action for PreStop Hook #3960

Other comments: issue in k/k

bart0sh · 2023-04-25T10:32:46Z

/assign @SergeyKanzhelev @mrunalp

SergeyKanzhelev

/lgtm

jeremyrickard

👋 Giving this a PRR review as a shadow this time around, have a few questions to clarify.

jeremyrickard · 2023-06-14T04:33:59Z

keps/sig-node/3960-pod-lifecycle-sleep-action/README.md

+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+The feature can be disabled in Alpha and Beta versions by restarting kube-apiserver with the feature-gate off. In terms of Stable versions, users can choose to opt-out by not setting the sleep field.


If users have set the sleep field and then the feature is disabled, it will just be ignored and the old behavior (i.e. no sleep) would apply?

then feature disabled

IIUC, you mean the feature is enabled first, the sleep field is set, and then the feature is disabled via a restart? If that is the case, we want the field to be ignored and the old behavior to apply.

If users have set the sleep field and then the feature is disabled, it will just be ignored and the old behavior (i.e. no sleep) would apply?

Yes, in that case, the prestop hook will not take effect.

For most API changes, the feature gate controls the admission of new uses to the system, and not the actual implementation logic.

In other words, if you enable the feature, use the feature, then disable the feature - it keeps working, but no NEW uses of the feature work. This is not universally true; we actually do not have a consistent rule here.

jeremyrickard · 2023-06-14T04:39:09Z

keps/sig-node/3960-pod-lifecycle-sleep-action/README.md

+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+The change is opt-in, it doesn't impact already running workloads. But problems with the updated validation logic may cause crashes in the apiserver.


How would the user determine this is the cause of crashes in the apiserver? Will there be any tests that help prevent this from making it into the release?

I think "crash" is misleading here, let me update it later. If a pod with sleepAction was created and the feature is then disabled, and this pod is recreated/updated by a user, the pod's YAML won't pass validation.
In this case, an error will point out the offending field, instead of a "crash in the apiserver".

It's a hard rule that previously accepted objects must not later fail validation. When it comes to actual API review, we will ensure that :)

It's a hard rule that previously accepted objects must not later fail validation. When it comes to actual API review, we will ensure that :)

Then I think, we can safely say that this feature will not impact already running workloads?
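The rule quoted above (a previously accepted object must not later fail validation) can be illustrated with a minimal sketch. It assumes a heavily simplified pod that models only the preStop sleep; `Pod`, `validate_sleep_field`, and the `gate_enabled` parameter are hypothetical names for illustration, not the actual kube-apiserver code. Whether a brand-new use is rejected here or silently dropped before validation is a separate design choice that the thread discusses further down.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pod:
    # Hypothetical, heavily simplified pod: only the preStop sleep is modelled.
    pre_stop_sleep_seconds: Optional[int] = None


def validate_sleep_field(new: Pod, old: Optional[Pod], gate_enabled: bool) -> List[str]:
    """Validate a create (old is None) or an update (old is the stored pod)."""
    if new.pre_stop_sleep_seconds is None:
        return []  # field unused: nothing to check

    # An object that was accepted while the gate was on must keep validating
    # after the gate is turned off, so existing use is always tolerated.
    already_stored = old is not None and old.pre_stop_sleep_seconds is not None
    if gate_enabled or already_stored:
        return []

    # Assumption: a new use with the gate off is reported as an error here.
    return ["preStop sleep is not available while the feature gate is disabled"]
```

With this shape, updating a stored pod that already carries the sleep field returns no errors even when the gate is off, which is the property the reviewers ask for.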

wojtek-t

A few more comments about PRR on top of those previously added by @jeremyrickard

wojtek-t · 2023-06-14T11:17:17Z

keps/sig-node/3960-pod-lifecycle-sleep-action/README.md

+
+### Version Skew Strategy
+
+N/A


Disagree - it actually matters. It's definitely possible [even more, it will happen for sure at least for a moment] that the FG will be enabled only in one component and not the other.

Ideally, I would like to see a matrix of:

kube-apiserver enabled/disabled

kubelet enabled/disabled
with a description of what exactly we expect in each case

Updated, but I'm not sure what will happen in the scenario where only the kube-apiserver enables this feature: will the create request pass validation and be successfully processed, or will it be rejected?

@wojtek-t this goes to the discussion about alpha/beta that I started and have not followed up on :)

Should the kubelet even have the gate? We are not consistent.

IOW:

(1) The feature gate only exists to prevent use of the field in the API. Once accepted on a pod, the feature is on for that pod. Kubelet treats the field like it is GA.

(2) The feature gate prevents new use of the field in the API and nullifies the effect on existing uses. If gate is disabled, Kubelet will see the field and ignore it.

What do you think makes more sense?

The semantic that I believe works best is close to (2) [although I didn't fully understand "and nullifies ..." part]
I believe what we should target is:

If FG is disabled:
(a) any attempt to set the field for an object doesn't succeed - the field is silently dropped - the "dropFields" strategy: https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/core/pod/strategy.go#LL89C10-L89C31
(b) if the field was set while the FG was enabled, it stays set
[the above is what kube-apiserver is doing]

(c) if the FG is disabled, even if Kubelet (or in general any other component) observes the field it ignores its existence

[for (c) there are exceptions - the nice example appeared in Sidecar KEP and computing resources in scheduler, but I personally treat them as exceptions - by default I consider the above the desired semantic]
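The semantics (a), (b), and (c) above can be sketched as follows. This is a minimal illustration under stated assumptions, not the real strategy code linked above: `Pod`, `drop_disabled_fields`, and `kubelet_pre_stop_sleep` are hypothetical names, and the pod is reduced to a single optional sleep duration.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pod:
    # Hypothetical, heavily simplified pod: only the preStop sleep is modelled.
    pre_stop_sleep_seconds: Optional[int] = None


def drop_disabled_fields(new: Pod, old: Optional[Pod], gate_enabled: bool) -> Pod:
    """API-server side. (a) drop new uses when the gate is off; (b) keep existing uses."""
    result = copy.copy(new)
    in_use = old is not None and old.pre_stop_sleep_seconds is not None
    if not gate_enabled and not in_use:
        result.pre_stop_sleep_seconds = None  # silently dropped, no error returned
    return result


def kubelet_pre_stop_sleep(pod: Pod, gate_enabled: bool) -> int:
    """Kubelet side. (c) ignore the field when the gate is off; return seconds to sleep."""
    if not gate_enabled or pod.pre_stop_sleep_seconds is None:
        return 0
    return pod.pre_stop_sleep_seconds
```

The two functions take separate `gate_enabled` arguments on purpose: each component reads its own copy of the gate, which is why the version-skew cases matter.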

@thockin - do you have any concerns about this?

For this specific field, that seems somewhat acceptable because it only matters exactly once (when the pod is deleted), although it is pretty surprising that the API says it is on but it's really not.

But imagine a field which is used in real-time on a long-lived resource like Service or Deployment. The object was admitted, the feature was used, then the gate was disabled, the API still says the feature is on, but suddenly it stopped working. Worse - it could stop working on some components and not others (e.g. disabled on some kube-proxy instances and not others).

That makes testing MUCH more complicated.

Now imagine an enum sort of field or a loosening of validation, where a new value was allowed by the gate, and then the gate is disabled, and the value is no longer allowable. Do we fall back on some next-closest value without updating the API? That seems terrible.

As for the real-time example - I would actually say that the "stops working" (or "partially stops working") is still what I want. If I'm disabling a FG, I'm doing it on purpose, so I actually want this feature to really stop working, as this is presumably causing some troubles.
I definitely agree that the current support for it is poor, but I still think it's what I want.
And eventually the FeatureGate KEP will address the issue [you register a hook for disabling FG that is clearing this field, or doing whatever else is better in a given case]

Re enum example - this is a harder one. The only ones that I've seen were effectively "yet another type of optional behavior", so disabling meant "you no longer have this, so you don't get anything". Which is kind of the same case as the above.
If it won't be an "optional behavior", but rather "some behavior is required" that would become super tricky and that may justify an exception. But I guess it depends on the specific case.

So I guess the summary of my answer is: if I'm requesting disabling a FG, I actually expect the feature to become disabled. And I acknowledge the fact that if someone was using it, they may get affected/broken, but I'm disabling the FG for a reason.

@deads2k @johnbelamaric - FYI - this is an interesting discussion

@jpbetz also :)

So, in this KEP, should the kubelet have the gate?

If yes:

- Only the apiserver enables the FG: validation will pass, but the kubelet will ignore this field when the pod is terminating?
- Only the kubelet enables the FG: new pods will fail validation, but existing pods will execute the sleepAction.

If no:

- The FG only controls validation; once a container with sleepAction is set, it will always sleep before terminating, regardless of whether the FG is still enabled.

Am I understanding this correctly?

This is correct.

And what I'm saying is that we should go with the first option.
I know it has drawbacks, but I think the gains from it outweigh those drawbacks. It seems that Tim is far from being convinced on the generic case, but he seems to be ok for this particular case, so let's go for it here.
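The matrix requested earlier in this thread, under the first option (the kubelet also checks the gate), can be enumerated with a short sketch. The function names and the wording of each outcome are illustrative assumptions; "not admitted" is left deliberately neutral about whether the apiserver rejects the field or drops it.

```python
from itertools import product
from typing import Dict, Tuple


def expected_behavior(apiserver_gate: bool, kubelet_gate: bool) -> str:
    """Describe the expected outcome for one combination of gate settings."""
    admission = (
        "new uses of the sleep field are admitted"
        if apiserver_gate
        else "new uses of the sleep field are not admitted"
    )
    runtime = (
        "kubelet sleeps before terminating containers that have the field set"
        if kubelet_gate
        else "kubelet ignores the field"
    )
    return f"{admission}; {runtime}"


def skew_matrix() -> Dict[Tuple[bool, bool], str]:
    """Build the 2x2 matrix keyed by (apiserver gate, kubelet gate)."""
    return {
        (api, kubelet): expected_behavior(api, kubelet)
        for api, kubelet in product((True, False), repeat=2)
    }


if __name__ == "__main__":
    for (api, kubelet), outcome in skew_matrix().items():
        print(f"apiserver={api!s:<5} kubelet={kubelet!s:<5} -> {outcome}")
```

The two mixed rows are the skew cases raised in the thread: a gate enabled only on the apiserver admits a field that the kubelet then ignores, while a gate enabled only on the kubelet still honors fields that were stored earlier.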

thockin · 2023-06-15T16:18:59Z

I remain LGTM on the design, not sure if this will make the deadline or not, but don't block it on me. IMO the KEP is substantially well understood and should proceed to implementation, even if we argue a bit more about testing.

mrunalp · 2023-06-15T18:16:27Z

/approve

wojtek-t · 2023-06-15T19:19:21Z

I had two more comments, but given they are quite minor I don't want to block my approval given the deadline.
So I'm approving and holding the PR - please fix them and then anyone can re-lgtm.

/approve PRR
/hold

k8s-ci-robot · 2023-06-15T19:19:36Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AxeZhan, mrunalp, SergeyKanzhelev, thockin, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~keps/prod-readiness/OWNERS~~ [wojtek-t]
~~keps/sig-node/OWNERS~~ [mrunalp]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

SergeyKanzhelev · 2023-06-16T00:29:23Z

/unhold

let's address as a follow up, only 30 minutes left

SergeyKanzhelev · 2023-06-16T00:29:37Z

/lgtm

kep 3960

5415316

k8s-ci-robot added the cncf-cla: yes and kind/kep labels Apr 22, 2023

k8s-ci-robot requested review from dchen1107 and derekwaynecarr April 22, 2023 09:19

k8s-ci-robot added the sig/node and size/L labels Apr 22, 2023

AxeZhan mentioned this pull request Apr 22, 2023

Introducing Sleep Action for PreStop Hook #3960

Open

12 tasks

aojea reviewed Apr 24, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/README.md (outdated)

aojea reviewed Apr 24, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/README.md (outdated)

aojea reviewed Apr 24, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/README.md (outdated)

aojea reviewed Apr 24, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/README.md

aojea reviewed Apr 24, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/README.md

aojea reviewed Apr 24, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/kep.yaml

aojea reviewed Apr 24, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/kep.yaml (outdated)

wojtek-t reviewed Apr 24, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/README.md

remove Integration tests

38bdda7

k8s-ci-robot assigned mrunalp and SergeyKanzhelev Apr 25, 2023

update user story

703df90

charles-chenzz reviewed Jun 2, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/README.md

charles-chenzz reviewed Jun 2, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/README.md

charles-chenzz reviewed Jun 2, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/kep.yaml

thockin reviewed Jun 2, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/kep.yaml (outdated)

thockin reviewed Jun 2, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/README.md (outdated)

thockin reviewed Jun 2, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/README.md

SergeyKanzhelev reviewed Jun 2, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/README.md (outdated)

AxeZhan force-pushed the sleepAction branch from 7dd2f0f to d0d3567 Compare June 6, 2023 06:49

thockin reviewed Jun 12, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/kep.yaml (outdated)

SergeyKanzhelev approved these changes Jun 13, 2023

k8s-ci-robot added the lgtm label Jun 13, 2023

jeremyrickard reviewed Jun 14, 2023

no deprecations

4cd8ae0

k8s-ci-robot removed the lgtm label Jun 14, 2023

elaborate prr

c54cb31

wojtek-t reviewed Jun 14, 2023

update prr

2f53fd3

AxeZhan force-pushed the sleepAction branch from 1db4e58 to 2f53fd3 Compare June 14, 2023 15:43

mrunalp reviewed Jun 14, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/README.md (outdated)

mrunalp reviewed Jun 14, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/README.md (outdated)

fix nits

153d611

wojtek-t reviewed Jun 15, 2023

keps/sig-node/3960-pod-lifecycle-sleep-action/README.md

keps/sig-node/3960-pod-lifecycle-sleep-action/README.md

k8s-ci-robot added the do-not-merge/hold label Jun 15, 2023

k8s-ci-robot added the approved label Jun 15, 2023

k8s-ci-robot removed the do-not-merge/hold label Jun 16, 2023

k8s-ci-robot added the lgtm label Jun 16, 2023

k8s-ci-robot merged commit 6ce7493 into kubernetes:master Jun 16, 2023

k8s-ci-robot added this to the v1.28 milestone Jun 16, 2023

This was referenced Jun 16, 2023

It will not reject it - it will silently drop it #4100

Closed

kep3960: update status to "implementable" #4101

Merged
