Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor health checks and wait until NGINX process ends #4487

Merged
merged 1 commit into from
Sep 1, 2019

Conversation

aledbf
Copy link
Member

@aledbf aledbf commented Aug 23, 2019

What this PR does / why we need it:

Removes the go pprof endpoint from the healthz port (10254).

kubectl port-forward -n ingress-nginx pod/nginx-ingress-controller-5dc5df8559-96gwl 10255
curl localhost:10255/debug/pprof/

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #4006 #4052

Special notes for your reviewer:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 23, 2019
@aledbf aledbf added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 23, 2019
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 23, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 23, 2019
@codecov-io
Copy link

codecov-io commented Aug 23, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@c85450c). Click here to learn what that means.
The diff coverage is 25.8%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #4487   +/-   ##
=========================================
  Coverage          ?   59.14%           
=========================================
  Files             ?       89           
  Lines             ?     6780           
  Branches          ?        0           
=========================================
  Hits              ?     4010           
  Misses            ?     2337           
  Partials          ?      433
Impacted Files Coverage Δ
internal/ingress/controller/config/config.go 98.58% <ø> (ø)
cmd/nginx/main.go 6.53% <0%> (ø)
internal/ingress/controller/controller.go 47.48% <0%> (ø)
internal/ingress/controller/nginx.go 29.44% <100%> (ø)
internal/ingress/controller/checker.go 35.71% <33.33%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c85450c...c2935ca. Read the comment docs.

@aledbf aledbf added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 23, 2019
@aledbf
Copy link
Member Author

aledbf commented Aug 28, 2019

Shutdown works like this


------------------------------
[ingress-nginx] [Serial] Shutdown ingress controller 
  should shutdown after waiting 60 seconds for pending connections to be closed
  /home/aledbf/go/src/k8s.io/ingress-nginx/test/e2e/gracefulshutdown/shutdown.go:70
[BeforeEach] [ingress-nginx] [Serial] Shutdown ingress controller
  /home/aledbf/go/src/k8s.io/ingress-nginx/test/e2e/framework/framework.go:77
STEP: Creating a kubernetes client
STEP: Building a namespace api object
Aug 28 05:11:08.801: INFO: Created namespace: e2e-tests-shutdown-ingress-controller-1566969068794777621-tdx8s
STEP: Starting new ingress controller
[BeforeEach] [ingress-nginx] [Serial] Shutdown ingress controller
  /home/aledbf/go/src/k8s.io/ingress-nginx/test/e2e/gracefulshutdown/shutdown.go:36
[It] should shutdown after waiting 60 seconds for pending connections to be closed
  /home/aledbf/go/src/k8s.io/ingress-nginx/test/e2e/gracefulshutdown/shutdown.go:70
Aug 28 05:12:07.619: INFO: waiting for request completion after shutdown
Aug 28 05:12:17.619: INFO: waiting for request completion after shutdown
Aug 28 05:12:27.619: INFO: waiting for request completion after shutdown
Aug 28 05:12:37.619: INFO: waiting for request completion after shutdown
Aug 28 05:12:47.619: INFO: waiting for request completion after shutdown
Aug 28 05:12:57.619: INFO: waiting for request completion after shutdown
[AfterEach] [ingress-nginx] [Serial] Shutdown ingress controller
  /home/aledbf/go/src/k8s.io/ingress-nginx/test/e2e/framework/framework.go:78
STEP: Waiting for test namespace to no longer exist
[AfterEach] [ingress-nginx] [Serial] Shutdown ingress controller
  /home/aledbf/go/src/k8s.io/ingress-nginx/test/e2e/gracefulshutdown/shutdown.go:42

• [SLOW TEST:113.831 seconds]
[ingress-nginx] [Serial] Shutdown ingress controller
/home/aledbf/go/src/k8s.io/ingress-nginx/test/e2e/framework/framework.go:150
  should shutdown after waiting 60 seconds for pending connections to be closed
  /home/aledbf/go/src/k8s.io/ingress-nginx/test/e2e/gracefulshutdown/shutdown.go:70
------------------------------


-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       dev
  Build:         git-259c47c2e
  Repository:    https://github.com/aledbf/ingress-nginx
  nginx version: openresty/1.15.8.1

-------------------------------------------------------------------------------

W0828 05:11:37.635850       6 flags.go:223] SSL certificate chain completion is disabled (--enable-ssl-chain-completion=false)
W0828 05:11:37.635914       6 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0828 05:11:37.636080       6 main.go:181] Creating API client for https://10.96.0.1:443
I0828 05:11:37.640680       6 main.go:225] Running in Kubernetes cluster version v1.14 (v1.14.1) - git (clean) commit b7394102d6ef778017f2ca4046abbaa23b88c290 - platform linux/amd64
I0828 05:11:37.723867       6 main.go:100] SSL fake certificate created /etc/ingress-controller/ssl/default-fake-certificate.pem
I0828 05:11:37.742976       6 nginx.go:262] Starting NGINX Ingress controller
I0828 05:11:37.748501       6 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"e2e-tests-shutdown-ingress-controller-1566969068794777621-tdx8s", Name:"nginx-configuration", UID:"3ede591b-c952-11e9-ab0f-0242ac110004", APIVersion:"v1", ResourceVersion:"1353", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap e2e-tests-shutdown-ingress-controller-1566969068794777621-tdx8s/nginx-configuration
I0828 05:11:37.753844       6 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"e2e-tests-shutdown-ingress-controller-1566969068794777621-tdx8s", Name:"tcp-services", UID:"3edf50bc-c952-11e9-ab0f-0242ac110004", APIVersion:"v1", ResourceVersion:"1247", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap e2e-tests-shutdown-ingress-controller-1566969068794777621-tdx8s/tcp-services
I0828 05:11:37.753866       6 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"e2e-tests-shutdown-ingress-controller-1566969068794777621-tdx8s", Name:"udp-services", UID:"3edfdeef-c952-11e9-ab0f-0242ac110004", APIVersion:"v1", ResourceVersion:"1248", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap e2e-tests-shutdown-ingress-controller-1566969068794777621-tdx8s/udp-services
I0828 05:11:38.943747       6 nginx.go:306] Starting NGINX process
I0828 05:11:38.943757       6 leaderelection.go:235] attempting to acquire leader lease  e2e-tests-shutdown-ingress-controller-1566969068794777621-tdx8s/ingress-controller-leader-nginx...
I0828 05:11:38.944899       6 controller.go:133] Configuration changes detected, backend reload required.
I0828 05:11:38.949695       6 status.go:86] new leader elected: nginx-ingress-controller-5cbbf5f4c9-r6rp2
I0828 05:11:39.059791       6 controller.go:149] Backend successfully reloaded.
I0828 05:11:39.059832       6 controller.go:158] Initial sync, sleeping for 1 second.
I0828 05:11:44.256050       6 event.go:258] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"e2e-tests-shutdown-ingress-controller-1566969068794777621-tdx8s", Name:"shutdown", UID:"53d35650-c952-11e9-ab0f-0242ac110004", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"1453", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress e2e-tests-shutdown-ingress-controller-1566969068794777621-tdx8s/shutdown
I0828 05:11:46.108429       6 controller.go:133] Configuration changes detected, backend reload required.
I0828 05:11:46.270493       6 controller.go:149] Backend successfully reloaded.
I0828 05:11:57.756666       6 main.go:152] Received SIGTERM, shutting down
I0828 05:11:57.756697       6 nginx.go:389] Shutting down controller queues
I0828 05:11:57.756713       6 status.go:117] updating status of Ingress rules (remove)
I0828 05:11:57.763763       6 status.go:136] removing address from ingress status ([])
I0828 05:11:57.763815       6 nginx.go:405] Stopping NGINX process
2019/08/28 05:11:57 [notice] 140#140: signal process started

I0828 05:12:21.485432       6 leaderelection.go:245] successfully acquired lease e2e-tests-shutdown-ingress-controller-1566969068794777621-tdx8s/ingress-controller-leader-nginx
I0828 05:12:21.485450       6 status.go:86] new leader elected: nginx-ingress-controller-6ff7f8bb8b-kf9nb
E0828 05:12:21.485636       6 queue.go:78] queue has been shutdown, failed to enqueue: &ObjectMeta{Name:sync status,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,ManagedFields:[],}
10.36.0.0 - [10.36.0.0] - - [28/Aug/2019:05:13:02 +0000] "GET /sleep/70 HTTP/1.1" 200 31 "-" "Go-http-client/1.1" 148 70.008 [e2e-tests-shutdown-ingress-controller-1566969068794777621-tdx8s-slowecho-80] [] 10.36.0.2:80 31 70.007 200 7a7012e8c4d0f63d23b1cf7189c0ee64
I0828 05:13:02.794068       6 nginx.go:418] NGINX process has stopped
I0828 05:13:02.794090       6 main.go:160] Handled quit, awaiting Pod deletion
I0828 05:13:12.794276       6 main.go:163] Exiting with 0

@aledbf
Copy link
Member Author

aledbf commented Aug 28, 2019

Changes required:

  • update the configmap setting worker-shutdown-timeout to define the wait time (NGINX side) to terminate worker processes
  • increase terminationGracePeriodSeconds value to at least the worker-shutdown-timeout value
  • add lifecycle preStop command /wait-shutdown

With these changes, after a stop is triggered:

  • ingress controller calls the hook
  • sends a sigterm to the ingress controller process
  • stops processing events
  • sends the quit signal to nginx master process
  • waits until pending connections are terminated or killed by the timeout
  • exit with code 0

@aledbf aledbf force-pushed the health-check branch 3 times, most recently from f6ac5c9 to cb7009e Compare August 28, 2019 14:57
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 28, 2019
@aledbf aledbf force-pushed the health-check branch 2 times, most recently from 12ca913 to 624af90 Compare August 28, 2019 16:40
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 28, 2019
@aledbf aledbf force-pushed the health-check branch 3 times, most recently from ad21cbc to 71d9a0b Compare September 1, 2019 18:21
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 1, 2019
@aledbf aledbf force-pushed the health-check branch 2 times, most recently from 9f2bbb3 to 087f47c Compare September 1, 2019 18:50
@aledbf
Copy link
Member Author

aledbf commented Sep 1, 2019

/retest

// StatusSocket defines the location of the unix socket used by NGINX for the status server
var StatusSocket = "/tmp/nginx-status-server.sock"
// StatusPort port used by NGINX for the status server
var StatusPort = 10256
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason you're switching to tcp?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason you're switching to tcp?

If the node is smaller than should be or during spikes of CPU, the request fails (socket). I cannot reproduce this consistently but when I can is only related to CPU utilization.

After this is merged I will add an e2e test so show how this can be reproduced (using stress)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aledbf did this e2e test ever get added? I'm running into a similar situation when CPU spikes and I'm wondering if the switch away from a socket actually resolves it

@ElvinEfendi
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 1, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf, ElvinEfendi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fred
Copy link

fred commented May 26, 2020

Question for @aledbf : Was there any reason why not to use QUIT (graceful shutdown) instead of TERM (fast shutdown) ?

Referring to this line: https://github.com/aledbf/ingress-nginx/blob/master/cmd/waitshutdown/main.go#L29

@rittneje
Copy link

rittneje commented Jan 2, 2022

@fred I know your question is quite old, but for anyone who is wondering, it sends SIGTERM to nginx-ingress-controller, which in turn sends SIQUIT to nginx.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants