-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make StatefulSet restart pods with phase Succeeded #120398
Make StatefulSet restart pods with phase Succeeded #120398
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also like to have an integration test, if feasible.
// | ||
// Also note that it is impossible for StatefulSet pods to ever truly complete, | ||
// as validation allows only restartPolicy=Always. | ||
if isFinished(replicas[i]) { | ||
ssc.recorder.Eventf(set, v1.EventTypeWarning, "RecreatingFailedPod", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to issue a dedicated event RecreatingSucceededPod
. This is what we do for daemon set in the analogous case: https://github.com/kubernetes/kubernetes/pull/117073/files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this level of granularity in event type tbh. Maybe we could have event like RecreatingPod
(both here and elsewhere - happy to fix), and add the more detailed reason only in description?
Or:
event type - what happened
reason/message - why it happened
kind of like scheduler only has generic event type for failed scheduling, but gives detailed information in the reason. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially, but I'm concerned about backwards compatibility as this would mean renaming an existing event type. I would suggest decoupling this to a dedicated discussion / PR. @alculquicondor wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I would do it all over again, I would call it: RecreatingTerminatedPod. But since it says Failed, I prefer we keep the existing one and add a new one for Succeeded, like we did for DaemonSet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on discussion on sig-apps slack:
- we need to be careful with renaming events of type warning (so,
RecreateFailedPod
stays as it is), - just recreating a pod should be emitted as a normal event, not warning.
@alculquicondor DaemonSet events are already different - I might be missing something, but I haven't found any 'recreate' events, only FailedDaemonPod
and SucceededDaemonPod
informing of deleting the pod. So I'm not sure if we need to name the new event RecreateSucceededPod
, as there's already little consistency. I think we can use RecreateTerminatedPod
here, and it'll be much clearer to the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean about Daemonset is that there used to be only one event FailedDaemonPod
and we added SucceededDaemonPod
. So for StatefulSet we should do something similar and add RecreateSucceededPod
. A normal event makes sense.
t.Error(err) | ||
} | ||
if isCreated(pods[0]) { | ||
t.Error("StatefulSet did not recreate failed Pod") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t.Error("StatefulSet did not recreate failed Pod") | |
t.Error("StatefulSet did not recreate succeeded Pod") |
ffce67e
to
452989e
Compare
// Also note that it is impossible for StatefulSet pods to ever truly complete, | ||
// as validation allows only restartPolicy=Always. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove this last paragraph. It is confusing, as we can see that it will actually complete in case of VM preemption or node pressure.
if err != nil { | ||
t.Error(err) | ||
} | ||
pods[0].Status.Phase = v1.PodSucceeded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the only difference with RecreatesFailedPod? Can we deduplicate by adding a function that returns a function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's leave it for a follow up, given the cherry-pick deadline.
452989e
to
d7264d0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM label has been added. Git tree hash: a0b3260f4d222dff23d857bb72b8bff0c446eb00
|
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, aleksandra-malinowska, janetkuo The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
What type of PR is this?
/kind bug
/sig apps
What this PR does / why we need it:
After the changes in #115331, pod phase determination changed. It is now possible for StatefulSet pods to get in
Succeeded
pod phase. This can happen when a pod is evicted or a node is stopped and the pod container exists with 0. StatefulSet should restart such pods. It is not possible for a StatefulSet pod to ever truly complete, as validation enforcesrestartPolicy=Always
.Which issue(s) this PR fixes:
Part of #118310
Does this PR introduce a user-facing change?
/cc @mimowo