
Add lifecycle hooks and readiness checks for smoother autoscaling #670

Merged · 4 commits · Apr 20, 2018

Conversation

@tcnghia (Contributor) commented Apr 16, 2018:

Fixes #429

Proposed Changes

(1) Adding readinessProbes and PreStop hooks and
(2) Gracefully shutting down queue HTTP server.

I tested this change on a 24-node cluster, with 500 clients gradually ramping up over 100 seconds, generating about 170K requests.

Before: 4K 503 errors out of 170K requests.
After: 0 errors out of 170K requests.
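For illustration, a minimal sketch of change (1) as it might appear in the revision's pod spec, using the corev1 types from the diffs below; the admin port number and paths are assumptions based on the review threads, not the exact diff:

package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// queueContainer sketches the queue-proxy sidecar with the two
// additions: a readiness probe on the admin port, and a PreStop hook
// that asks the proxy to drain before the pod is torn down.
func queueContainer() corev1.Container {
	return corev1.Container{
		Name: "queue-proxy",
		ReadinessProbe: &corev1.Probe{
			Handler: corev1.Handler{
				HTTPGet: &corev1.HTTPGetAction{
					Port: intstr.FromInt(8022), // assumed admin port
					Path: "/health",
				},
			},
		},
		Lifecycle: &corev1.Lifecycle{
			PreStop: &corev1.Handler{
				HTTPGet: &corev1.HTTPGetAction{
					Port: intstr.FromInt(8022),
					Path: "/quitquitquit",
				},
			},
		},
	}
}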

@tcnghia tcnghia requested a review from josephburnett April 16, 2018 23:10
@google-prow-robot google-prow-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 16, 2018
@tcnghia tcnghia requested a review from mattmoor April 16, 2018 23:12
@tcnghia (Contributor, Author) commented Apr 17, 2018:

/retest

@mattmoor (Member) left a comment:

thanks. I am super excited to see some of this instability put to rest :)

mutex sync.RWMutex
}

// isAlive() returns true iff a PreStop hook has not been called.
Member:

I think this is clearer: // isAlive() returns true until a PreStop hook has been called.

func (h *healthServer) kill() {
h.mutex.Lock()
h.alive = false
defer h.mutex.Unlock()
Member:

nit: Generally when defer is used it's right after the Lock(). If you leave it here, you may as well drop the defer.

Contributor (Author):

Done
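For reference, a minimal self-contained sketch of the pattern the reviewer suggests, with the defer placed immediately after the Lock:

package main

import "sync"

type healthServer struct {
	alive bool
	mutex sync.RWMutex
}

// kill marks the server as no longer alive. Deferring the Unlock right
// after the Lock keeps the critical section correct even if more code
// is added between them later.
func (h *healthServer) kill() {
	h.mutex.Lock()
	defer h.mutex.Unlock()
	h.alive = false
}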

// the same time the pod is marked for removal.
func (h *healthServer) quitHandler(w http.ResponseWriter, r *http.Request) {
h.kill()
time.Sleep(quitSleepSecs * time.Second)
Member:

  1. You never write back a response?
  2. Can you explain the ordering for the kill / delay? Why does this work? Why do you hold the connection open for 20 seconds after you start responding to health checks with non-200?

Contributor (Author):

Done
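A sketch of one way to address both points, building on the healthServer sketch above (net/http, time, and io imports elided; the constant value and response text are assumptions). The ordering matters: failing readiness first takes the pod out of rotation, and the sleep then holds the PreStop hook open so in-flight requests can drain before termination proceeds.

const quitSleepSecs = 20 // assumed value; the diff shows only the constant name

// quitHandler serves the PreStop hook. kill() makes subsequent
// readiness probes fail, so the endpoints controller removes the pod
// from load balancers; sleeping before responding delays termination
// until that removal has propagated and in-flight requests have drained.
func (h *healthServer) quitHandler(w http.ResponseWriter, r *http.Request) {
	h.kill()
	time.Sleep(quitSleepSecs * time.Second)
	io.WriteString(w, "alive: false")
}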

defer h.mutex.Unlock()
}

// healthHandler is used for readinessProbe of queue-proxy.
Member:

readiness vs. liveness?

Contributor (Author):

Done


// Add a SIGTERM handler to gracefully shutdown the servers during
// pod termination.
var sigTermChan = make(chan os.Signal)
Member:

sigTermChan := make(...)

Contributor (Author):

Done
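A self-contained sketch of the SIGTERM path, using the := form from the nit and Go's http.Server.Shutdown for the graceful drain described in change (2); the port and timeout are assumptions:

package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	server := &http.Server{Addr: ":8012"} // assumed public queue-proxy port

	go server.ListenAndServe()

	// Buffered so the signal is never dropped if we are slow to receive.
	sigTermChan := make(chan os.Signal, 1)
	signal.Notify(sigTermChan, syscall.SIGTERM)
	<-sigTermChan

	// Stop accepting new connections and wait for in-flight requests
	// to finish, up to the timeout, before the process exits.
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	server.Shutdown(ctx)
}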

// Queue proxy admin port and paths provide health check and
// lifecycle hooks.
requestQueueAdminPortName string = "queueadm-port"
requestQueueAdminPort = 8022
Member:

Can we make this public and use this constant from //cmd/ela-queue/main.go? Same for 8012 above.

Contributor (Author):

Done
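A sketch of the exported form the reviewer asks for, so //cmd/ela-queue/main.go can import the same values (exact names assumed, though a later hunk below does show RequestQueuePortName and RequestQueuePort exported):

const (
	// RequestQueuePortName and RequestQueuePort expose the public
	// queue-proxy serving port. Name assumed.
	RequestQueuePortName string = "queue-port"
	RequestQueuePort            = 8012
	// RequestQueueAdminPortName and RequestQueueAdminPort expose the
	// admin port that serves health checks and lifecycle hooks.
	RequestQueueAdminPortName string = "queueadm-port"
	RequestQueueAdminPort            = 8022
)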

@@ -59,13 +87,41 @@ func MakeElaPodSpec(u *v1alpha1.Revision, queueSidecarImage string) *corev1.PodS
},
},
Ports: []corev1.ContainerPort{
// TOOD: HTTPS connections from the Cloud LB require
// TODO: HTTPS connections from the Cloud LB require
Member:

we no longer use nginx, so perhaps this comment should simply be removed?

Contributor (Author):

Done

elaContainer.Lifecycle = &corev1.Lifecycle{}
}
if elaContainer.Lifecycle.PreStop == nil {
elaContainer.Lifecycle.PreStop = &corev1.Handler{
Member:

Can you explain what this is doing in a detailed comment?

Member:

Your other comments are great, thanks!

Contributor (Author):

Done
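A sketch of the hunk above with the kind of detailed comment the reviewer asks for (the path and port constant are assumptions):

// The PreStop hook points the user container at the queue proxy's
// drain endpoint. When the pod is marked for deletion, the hook fires
// before SIGTERM: the queue proxy flips its readiness to false and
// waits while the endpoints controller removes the pod from load
// balancers and in-flight requests complete, so no traffic is dropped.
if elaContainer.Lifecycle == nil {
	elaContainer.Lifecycle = &corev1.Lifecycle{}
}
if elaContainer.Lifecycle.PreStop == nil {
	elaContainer.Lifecycle.PreStop = &corev1.Handler{
		HTTPGet: &corev1.HTTPGetAction{
			Port: intstr.FromInt(RequestQueueAdminPort), // assumed constant
			Path: "/quitquitquit",                       // assumed path
		},
	}
}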

if elaContainer.Lifecycle == nil {
elaContainer.Lifecycle = &corev1.Lifecycle{}
}
if elaContainer.Lifecycle.PreStop == nil {
Member:

Here and above, we should simply disallow user-provided lifecycle hooks.

cc @evankanderson

Contributor (Author):

Done

if p.Handler.HTTPGet == nil {
return false
}
return p.Handler.HTTPGet.Path != ""
Member:

Should we also check that it's on the expected port?

I'm wondering if we should simply disallow users from specifying port in the readiness check, since the port will be implied by our runtime contract.

Contributor (Author):

I think we should disallow setting the port here. I'm putting TODOs to follow up with webhook changes.

Member:

Open issues for each please :)
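A sketch of the stricter check discussed here — require an HTTPGet probe with a path and reject a user-specified port, since the runtime contract implies the port (function name hypothetical):

// isValidReadinessProbe reports whether a user-supplied probe is an
// HTTPGet probe with a path and no explicit port.
func isValidReadinessProbe(p *corev1.Probe) bool {
	if p == nil || p.Handler.HTTPGet == nil {
		return false
	}
	// Disallow an explicit port: the zero IntOrString means "unset",
	// and the runtime contract supplies the real port.
	if p.Handler.HTTPGet.Port != (intstr.IntOrString{}) {
		return false
	}
	return p.Handler.HTTPGet.Path != ""
}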

@mattmoor mattmoor self-assigned this Apr 17, 2018
@mattmoor (Member):

@tcnghia can you also see what this does to issue #348?

@tcnghia (Contributor, Author) commented Apr 17, 2018:

@mattmoor the change didn't help #348, and I do suspect that it is Istio since the endpoint we curl is also the health check endpoint.

(1) Adding readinessProbes and PreStop hooks, and
(2) Gracefully shutting down queue HTTP server.
@tcnghia (Contributor, Author) commented Apr 17, 2018:

/retest


Name: RequestQueuePortName,
ContainerPort: int32(RequestQueuePort),
},
// Provides health checks and life cycle hooks.
Member:

nit: lifecycle

Contributor (Author):

Done

@mattmoor (Member):

/lgtm

@google-prow-robot google-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 18, 2018
@mattmoor (Member):

/hold for comments

@mattmoor (Member):

/hold

@google-prow-robot google-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 18, 2018
@google-prow-robot google-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 18, 2018
@mattmoor (Member):

/lgtm

@google-prow-robot google-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 18, 2018
@tcnghia (Contributor, Author) commented Apr 18, 2018:

@josephburnett can you please give feedback/approval? thanks

@tcnghia (Contributor, Author) commented Apr 18, 2018:

We have an open issue (#379) to make sure sidecar injection works properly, since otherwise Pod creation would fail. That would insulate us in case our changes here (to the Istio injector configuration) break sidecar injection.

@mattmoor (Member):

/approve
/lgtm

@google-prow-robot google-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 18, 2018
@mattmoor (Member):

/hold

@mattmoor (Member):

Sorry I thought that was asking me :)

@google-prow-robot google-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 19, 2018
@mattmoor (Member):

I'm still good with this, deferring to @josephburnett for final approval.

Thanks for the improvements to availability :)

io.WriteString(w, "alive: false")
}

// Sets up /health and /quitquitquit endpoints.
Contributor:

Does this mean that any IP on the Internet can send a bunch of requests to /quitquitquit and bring down a revision?

Contributor (Author):

No, this is not exposed like :8012. I believe only a local prober can access it.
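A sketch of that separation — the admin mux with /health and /quitquitquit is served on its own port, distinct from the public :8012 server, so only the kubelet's prober and the PreStop hook reach it (port assumed; handler names from the hunks above):

adminMux := http.NewServeMux()
adminMux.HandleFunc("/health", h.healthHandler)
adminMux.HandleFunc("/quitquitquit", h.quitHandler)
adminServer := &http.Server{
	Addr:    ":8022", // assumed admin port, not exposed via the service
	Handler: adminMux,
}
go adminServer.ListenAndServe()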

@@ -143,15 +154,92 @@ func statReporter() {
}
}

func isProbe(r *http.Request) bool {
// Since K8s 1.8, prober requests have
// User-Agent = "kube-probe/{major-version}.{minor-version}".
Contributor:

Can this User-Agent be spoofed? Can any IP on the Internet send a bunch of requests with this User-Agent and overwhelm pods without causing autoscaling to kick in?

Contributor (Author):

I think this is a valid concern. I'll open an issue to find out if we can avoid this.
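For reference, a completion sketch of the isProbe hunk above using a User-Agent prefix match; as this thread notes, the header can be spoofed, so this is a traffic-classification heuristic rather than a security boundary:

import (
	"net/http"
	"strings"
)

func isProbe(r *http.Request) bool {
	// Since K8s 1.8, prober requests have
	// User-Agent = "kube-probe/{major-version}.{minor-version}".
	return strings.HasPrefix(r.Header.Get("User-Agent"), "kube-probe/")
}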

@josephburnett (Contributor) left a comment:

Just a few questions. Looks good!

@mattmoor (Member):

/approve
/lgtm

@google-prow-robot google-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 20, 2018
@google-prow-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mattmoor, tcnghia

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tcnghia tcnghia removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 20, 2018
@tcnghia (Contributor, Author) commented Apr 20, 2018:

/retest

@google-prow-robot google-prow-robot merged commit 50a596f into knative:master Apr 20, 2018
markusthoemmes referenced this pull request in openshift/knative-serving Apr 3, 2019
* Reduce autoscaling-related 503s by:
(1) Adding readinessProbes and PreStop hooks, and
(2) Gracefully shutting down queue HTTP server.

* Address PR feedback.

* Exclude K8s prober requests from autoscaling consideration.

* Clarifying why we added flag.Parse().
nak3 pushed a commit to nak3/serving that referenced this pull request Mar 24, 2021
Fixes knative/pkg#2026

The actual issue is that the test context expires between individual
stages run by the upgrade framework. This fix passes an external logger
that survives the stages.
markusthoemmes pushed a commit to markusthoemmes/knative-serving that referenced this pull request Apr 7, 2021
Fixes knative/pkg#2026

The actual issue is that the test context expires between individual
stages run by the upgrade framework. This fix passes an external logger
that survives the stages.
nak3 pushed a commit to nak3/serving that referenced this pull request Apr 15, 2021
* Fix race condition with Prober logger in upgrade tests (knative#670)

Fixes knative/pkg#2026

The actual issue is that the test context expires between individual
stages run by the upgrade framework. This fix passes an external logger
that survives the stages.

* Only use exec probe at startup time (knative#10741)

* Only use exec probe at startup time

Now that StartupProbe is available, we can avoid spawning the exec
probe other than at startup time. After startup, this directly uses
the same endpoint in the QP that the exec probe hits, as the target
of an HTTP readiness probe.

Following on from this I think we may want to rework quite a bit of how
our readiness probe stuff works (e.g. it'd be nice to keep the probes on
the user container so failures are on the right object, and we currently
ignore probes ~entirely after startup if periodSeconds>0), but this is a
minimal change that should be entirely backwards-compatible and saves
quite a few cpu cycles.

* Use ProgressDeadline as failure timeout for startup probe

- Also just drop the exec probe entirely for periodSeconds > 1, since
  these can just use the readiness probe now. (Easier than figuring out
  how to square ProgressDeadline with a custom period.)

* See if flag is what's making upgrades unhappy

* reorganize comments

* Default PeriodSeconds of the readiness probe to 1 if unset (knative#10992)

Co-authored-by: Martin Gencur <[email protected]>
Co-authored-by: Julian Friedman <[email protected]>
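A sketch of the probe split that commit describes — an exec probe only at startup via StartupProbe, then a plain HTTP readiness probe against the queue proxy afterward (the command, port, and threshold values are assumptions, not the actual diff):

container := corev1.Container{
	Name: "queue-proxy",
	// The exec probe runs only until the container first passes.
	StartupProbe: &corev1.Probe{
		Handler: corev1.Handler{
			Exec: &corev1.ExecAction{
				Command: []string{"/ko-app/queue", "-probe-period", "0"}, // hypothetical command
			},
		},
		PeriodSeconds:    1,
		FailureThreshold: 600, // derived from ProgressDeadline, per the commit
	},
	// After startup, probe the same queue-proxy endpoint over HTTP.
	ReadinessProbe: &corev1.Probe{
		Handler: corev1.Handler{
			HTTPGet: &corev1.HTTPGetAction{
				Port: intstr.FromInt(8022),
				Path: "/health",
			},
		},
		PeriodSeconds: 1, // the follow-up commit defaults this to 1 if unset
	},
}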
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.