OCPBUGS-35343: make shutdown-delay-duration configurable #1685

Open
wants to merge 1 commit into
base: master

tkashem
Contributor

@tkashem tkashem commented May 21, 2024

No description provided.

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels May 21, 2024
@openshift-ci-robot

@tkashem: This pull request references Jira Issue OCPBUGS-30860, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @geliu2016

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label May 21, 2024
@tkashem
Contributor Author

tkashem commented May 21, 2024

/cc @p0lyn0mial @vrutkovs

Contributor

openshift-ci bot commented May 21, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tkashem

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 21, 2024
@tkashem
Contributor Author

tkashem commented May 21, 2024

/retest-required

Contributor

openshift-ci bot commented May 21, 2024

@tkashem: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/unit | 681dd36 | link | true | /test unit |
| ci/prow/e2e-metal-single-node-live-iso | 681dd36 | link | false | /test e2e-metal-single-node-live-iso |
| ci/prow/k8s-e2e-gcp | 681dd36 | link | true | /test k8s-e2e-gcp |
| ci/prow/e2e-aws-operator-disruptive-single-node | 681dd36 | link | false | /test e2e-aws-operator-disruptive-single-node |
| ci/prow/e2e-aws-ovn-single-node | 681dd36 | link | false | /test e2e-aws-ovn-single-node |
| ci/prow/e2e-gcp-operator-single-node | 681dd36 | link | false | /test e2e-gcp-operator-single-node |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@@ -97,6 +106,7 @@ func (r *renderOpts) AddFlags(fs *pflag.FlagSet) {
fs.StringVar(&r.clusterConfigFile, "cluster-config-file", r.clusterConfigFile, "Openshift Cluster API Config file.")
fs.StringVar(&r.clusterAuthFile, "cluster-auth-file", r.clusterAuthFile, "Openshift Cluster Authentication API Config file.")
fs.StringVar(&r.infraConfigFile, "infra-config-file", "", "File containing infrastructure.config.openshift.io manifest.")
+fs.DurationVar(&r.shutdownDelayDuration, "shutdown-delay-duration", r.shutdownDelayDuration, "shutdown-delay-duration argument for the bootstrap kube-apiserver.")
Contributor

Contributor Author

bootkube already has an override -

{{- if .ShutdownDelayDuration}}
shutdown-delay-duration:
- {{ .ShutdownDelayDuration }}

Contributor

oh, cool!
should we also set shutdown-send-retry-after ?

@@ -192,7 +206,7 @@ func (r *renderOpts) Run() error {
BindAddress: "0.0.0.0:6443",
BindNetwork: "tcp4",
TerminationGracePeriodSeconds: 135, // bit more than 70s (minimal termination period) + 60s (apiserver graceful termination)
-ShutdownDelayDuration: "", // do not override
+ShutdownDelayDuration: r.shutdownDelayDuration.String(),
Contributor

BTW: do we want to expose it as a command-line arg? If yes, then we have to coordinate updating it with terminationGracePeriodSeconds; otherwise the kubelet might terminate the kas process before graceful termination completes.

// by default, we are giving the load balancer 20s to remove the
// bootstrap kube-apiserver from its pool after TERM signal is
// sent to the kube-apiserver on the bootstrap node.
shutdownDelayDuration: 20 * time.Second,
Contributor

What is the default value for shutdownDelayDuration?
It looks like it is 0. Does that mean the bootstrap API server didn't give the LBs any time to react?

Member

iiuc it's 0s for SNO and 129s for AWS:

case infra.Status.ControlPlaneTopology == configv1.SingleReplicaTopologyMode:
// reduce the shutdown delay to 0 to reach the maximum downtime for SNO
observedShutdownDelayDuration = "0s"
case infra.Spec.PlatformSpec.Type == configv1.AWSPlatformType:
// AWS has a known issue: https://bugzilla.redhat.com/show_bug.cgi?id=1943804
// We need to extend the shutdown-delay-duration so that an NLB has a chance to notice and remove unhealthy instance.
// Once the mentioned issue is resolved this code must be removed and default values applied
//
// Note this is the official number we got from AWS
observedShutdownDelayDuration = "129s"
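The selection logic quoted above can be sketched as a standalone function. The `topology` and `platform` types, their constants, and the `fallback` parameter are hypothetical simplifications standing in for the `configv1` types. The value used when neither case matches is not shown in the quoted snippet, so it is left to the caller here.

```go
package main

import "fmt"

// topology and platform are simplified, hypothetical stand-ins for the
// configv1 types referenced in the quoted snippet.
type topology string
type platform string

const (
	singleReplica   topology = "SingleReplica"
	highlyAvailable topology = "HighlyAvailable"
	aws             platform = "AWS"
	gcp             platform = "GCP"
)

// observedShutdownDelay mirrors the order of the quoted switch: the
// single-replica check comes first, then AWS, then the caller's fallback.
func observedShutdownDelay(t topology, p platform, fallback string) string {
	switch {
	case t == singleReplica:
		return "0s"
	case p == aws:
		return "129s"
	default:
		return fallback
	}
}

func main() {
	// The fallback value is a placeholder; the real default is not quoted above.
	const fallback = "<default>"
	fmt.Println(observedShutdownDelay(singleReplica, aws, fallback))
	fmt.Println(observedShutdownDelay(highlyAvailable, aws, fallback))
	fmt.Println(observedShutdownDelay(highlyAvailable, gcp, fallback))
}
```

Because the single-replica case is evaluated first, a single-node cluster on AWS gets 0s rather than 129s in this sketch.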

@vrutkovs
Member

/retitle OCPBUGS-35343: make shutdown-delay-duration configurable

@openshift-ci openshift-ci bot changed the title OCPBUGS-30860: make shutdown-delay-duration configurable OCPBUGS-35343: make shutdown-delay-duration configurable Jun 12, 2024
@openshift-ci-robot openshift-ci-robot removed the jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. label Jun 12, 2024
@openshift-ci-robot

@tkashem: This pull request references Jira Issue OCPBUGS-35343, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci openshift-ci bot requested a review from wangke19 June 12, 2024 12:39
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 2, 2024
@openshift-merge-robot
Contributor

PR needs rebase.

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 2, 2024
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 2, 2025
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.