Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New option to set a shutdown grace period #5855

Merged
merged 1 commit into from
Jan 27, 2021

Conversation

bcarlsson
Copy link
Contributor

@bcarlsson bcarlsson commented Jul 6, 2020

What this PR does / why we need it:

When running the ingress controller behind a load balancer in AWS, we need to have a grace period for the shutdown process in case of a scale down event. If not, the pod might be terminated before it has been removed from the load balancer target group.

This PR gives the user an option to set a grace period for when the shutdown phase has been triggered and the actual shutdown begins.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Which issue/s this PR fixes

fixes #4726

How Has This Been Tested?

It has been deployed in a running Kubernetes cluster and verified by monitoring the /healthz endpoint.

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I've read the CONTRIBUTION guide
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please log a ticket with the Linux Foundation Helpdesk: https://support.linuxfoundation.org/
  • Should you encounter any issues with the Linux Foundation Helpdesk, send a message to the backup e-mail support address at: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jul 6, 2020
@k8s-ci-robot
Copy link
Contributor

Welcome @bcarlsson!

It looks like this is your first PR to kubernetes/ingress-nginx 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/ingress-nginx has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Hi @bcarlsson. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 6, 2020
@aledbf
Copy link
Member

aledbf commented Jul 6, 2020

@bcarlsson please add e2e tests.
How this change interacts with the wait-shutdown preStop ?

@aledbf
Copy link
Member

aledbf commented Jul 6, 2020

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 6, 2020
@aledbf
Copy link
Member

aledbf commented Jul 6, 2020

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 6, 2020
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Jul 6, 2020

@bcarlsson: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-ingress-nginx-e2e-1-15 806423e51132815971513683bd485e77fe6d5741 link /test pull-ingress-nginx-e2e-1-15

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@bcarlsson
Copy link
Contributor Author

bcarlsson commented Jul 7, 2020

@bcarlsson please add e2e tests.
How this change interacts with the wait-shutdown preStop ?

/wait-shutdown preStop will trigger /healthz endpoint to return not ok while /livez remains ok during the shutdown grace period.
We'll look into e2e tests.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 28, 2020
@bcarlsson
Copy link
Contributor Author

@aledbf I've now added e2e tests to verify both endpoints during the shutdown grace period.

@bcarlsson
Copy link
Contributor Author

/assign @bowei

@aledbf aledbf assigned aledbf and unassigned bowei Sep 4, 2020
@aledbf
Copy link
Member

aledbf commented Sep 4, 2020

@bcarlsson please don't assign PRs manually

@bcarlsson
Copy link
Contributor Author

@aledbf ok sorry, tried to follow an earlier comment by the robot.

@codecov-io
Copy link

codecov-io commented Oct 7, 2020

Codecov Report

Merging #5855 (54b13bd) into master (2f08e87) will increase coverage by 0.07%.
The diff coverage is 64.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5855      +/-   ##
==========================================
+ Coverage   55.75%   55.82%   +0.07%     
==========================================
  Files          91       94       +3     
  Lines        6451     6579     +128     
==========================================
+ Hits         3597     3673      +76     
- Misses       2411     2447      +36     
- Partials      443      459      +16     
Impacted Files Coverage Δ
internal/ingress/controller/nginx.go 28.71% <0.00%> (-0.15%) ⬇️
internal/ingress/controller/template/configmap.go 75.84% <ø> (ø)
internal/ingress/sslcert.go 0.00% <0.00%> (ø)
internal/ingress/types.go 0.00% <0.00%> (ø)
internal/ingress/types_equals.go 14.15% <0.00%> (-0.18%) ⬇️
...ternal/ingress/annotations/globalratelimit/main.go 64.70% <64.70%> (ø)
internal/ingress/controller/controller.go 46.68% <71.42%> (+0.07%) ⬆️
internal/ingress/controller/template/template.go 76.82% <79.59%> (+0.19%) ⬆️
internal/net/ipnet.go 92.00% <81.81%> (+6.28%) ⬆️
cmd/nginx/flags.go 81.63% <100.00%> (+0.18%) ⬆️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 032fccb...54b13bd. Read the comment docs.

@bcarlsson
Copy link
Contributor Author

@aledbf as I wrote in an earlier comment, we do no longer see the point of a new /livez endpoint since the pod would already be in a shutting down state.
I've reverted the /livez commits and added a test for only the shutdown grace period.

@aledbf
Copy link
Member

aledbf commented Oct 7, 2020

@bcarlsson please squash the commits

@bcarlsson
Copy link
Contributor Author

@aledbf I've squashed the commits.

@jeroenj
Copy link

jeroenj commented Dec 10, 2020

It looks like this is good to go? (except for the PR title)

We've been eagerly waiting to get this merged in as on clusters that scale heavily we get some failed requests when the ingress-nginx pods are shutting down but the (GKE) load balancers are still sending traffic to it.

@Zhann
Copy link

Zhann commented Jan 25, 2021

What's the state of this? Anything blocking?

@aledbf
Copy link
Member

aledbf commented Jan 25, 2021

This change is missing the update of the helm chart values to show how to configure the endpoint and also an e2e test using it.

@bcarlsson
Copy link
Contributor Author

@aledbf I apologize for any confusion. As I wrote in a comment earlier on September 14th, we realized that a new /livez endpoint did not add any value since the pod would already be in a "shutting down" state and would be terminated after TerminationGracePeriodSeconds had elapsed.

As commented on Oct 7, I did revert all changes for a new /livez endpoint but kept the feature to set a grace period before the actual shutdown would begin.

Is a update of the helm chart needed for this as there are no changes to a /livez endpoint anymore?

I know that only the last part of the PR description reflects the actual change now. Do you want me to update the title and description?

@jeroenj
Copy link

jeroenj commented Jan 26, 2021

I know that only the last part of the PR description reflects the actual change now. Do you want me to update the title and description?

Personally I'd update the description (and title) to reflect the actual change. That said I do think it's gook to keep the initial description around for historic reference as to how this started out and how it came to the current state. That's something we do internally too.

@aledbf
Copy link
Member

aledbf commented Jan 26, 2021

Do you want me to update the title and description?

Yes, please.

@bcarlsson bcarlsson changed the title New livez endpoint added with option to set shutdown grace period New option to set a shutdown grace period Jan 27, 2021
@bcarlsson
Copy link
Contributor Author

Do you want me to update the title and description?

Yes, please.

I've now updated the title and description.

@aledbf
Copy link
Member

aledbf commented Jan 27, 2021

/test pull-ingress-nginx-e2e-1-19
/test pull-ingress-nginx-e2e-1-20

@aledbf
Copy link
Member

aledbf commented Jan 27, 2021

/lgtm
/approve

@aledbf
Copy link
Member

aledbf commented Jan 27, 2021

@bcarlsson thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 27, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf, bcarlsson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 27, 2021
@aledbf
Copy link
Member

aledbf commented Jan 27, 2021

@bcarlsson please remove the "Old description for reference" section. There is no code related to that description. Even if that was the initial intention, is confusing.

@bcarlsson
Copy link
Contributor Author

@bcarlsson please remove the "Old description for reference" section. There is no code related to that description. Even if that was the initial intention, is confusing.

Done

@aledbf
Copy link
Member

aledbf commented Jan 27, 2021

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 27, 2021
@k8s-ci-robot k8s-ci-robot merged commit 54b6843 into kubernetes:master Jan 27, 2021
@kubernetes kubernetes deleted a comment from aledbf May 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Configurable sleep before nginx shutdown for upstream LB
10 participants