remove shutdown-manager liveness probe (#4967)
The probe can currently cause problems when it fails
by causing the shutdown-manager container to be restarted
by itself, which then results in the envoy container
getting stuck in a "DRAINING" state indefinitely.

Not having the probe is less bad overall because envoy pods
are less likely to get stuck in "DRAINING", and the
worst case without it is that shutdown-manager is truly
unresponsive during a pod termination, in which case
the envoy container will simply terminate without first
draining active connections.

Updates #4851.

Signed-off-by: Steve Kriss <[email protected]>
skriss authored Jan 10, 2023
1 parent e8072f7 commit de4c25c
Showing 9 changed files with 10 additions and 61 deletions.
9 changes: 9 additions & 0 deletions changelogs/unreleased/4967-skriss-minor.md
@@ -0,0 +1,9 @@
+## shutdown-manager sidecar container liveness probe removed
+
+The liveness probe has been removed from the Envoy pods' shutdown-manager sidecar container.
+This change is to mitigate a problem where when the liveness probe fails, the shutdown-manager container is restarted by itself.
+This ultimately has the unintended effect of causing the envoy container to be stuck indefinitely in a "DRAINING" state and not serving traffic.
+
+Overall, not having the liveness probe on the shutdown-manager container is less bad because envoy pods are less likely to get stuck in "DRAINING" indefinitely.
+In the worst case, during termination of an Envoy pod (due to upgrade, scaling, etc.), shutdown-manager is truly unresponsive, in which case the envoy container will simply terminate without first draining active connections.
+If appropriate (i.e. during an upgrade), a new Envoy pod will then be created and re-added to the set of ready Envoys to load balance traffic to.
6 changes: 0 additions & 6 deletions examples/contour/03-envoy.yaml
@@ -38,12 +38,6 @@ spec:
 - /bin/contour
 - envoy
 - shutdown
-livenessProbe:
-  httpGet:
-    path: /healthz
-    port: 8090
-  initialDelaySeconds: 3
-  periodSeconds: 10
 name: shutdown-manager
 volumeMounts:
 - name: envoy-admin
6 changes: 0 additions & 6 deletions examples/deployment/03-envoy-deployment.yaml
@@ -51,12 +51,6 @@ spec:
 - /bin/contour
 - envoy
 - shutdown
-livenessProbe:
-  httpGet:
-    path: /healthz
-    port: 8090
-  initialDelaySeconds: 3
-  periodSeconds: 10
 name: shutdown-manager
 volumeMounts:
 - name: envoy-admin
6 changes: 0 additions & 6 deletions examples/render/contour-deployment.yaml
@@ -7362,12 +7362,6 @@ spec:
 - /bin/contour
 - envoy
 - shutdown
-livenessProbe:
-  httpGet:
-    path: /healthz
-    port: 8090
-  initialDelaySeconds: 3
-  periodSeconds: 10
 name: shutdown-manager
 volumeMounts:
 - name: envoy-admin
6 changes: 0 additions & 6 deletions examples/render/contour-gateway.yaml
@@ -7355,12 +7355,6 @@ spec:
 - /bin/contour
 - envoy
 - shutdown
-livenessProbe:
-  httpGet:
-    path: /healthz
-    port: 8090
-  initialDelaySeconds: 3
-  periodSeconds: 10
 name: shutdown-manager
 volumeMounts:
 - name: envoy-admin
6 changes: 0 additions & 6 deletions examples/render/contour.yaml
@@ -7349,12 +7349,6 @@ spec:
 - /bin/contour
 - envoy
 - shutdown
-livenessProbe:
-  httpGet:
-    path: /healthz
-    port: 8090
-  initialDelaySeconds: 3
-  periodSeconds: 10
 name: shutdown-manager
 volumeMounts:
 - name: envoy-admin
7 changes: 0 additions & 7 deletions internal/provisioner/equality/equality_test.go
@@ -105,13 +105,6 @@ func TestDaemonSetConfigChanged(t *testing.T) {
 description: "if probe values are set to default values",
 mutate: func(ds *appsv1.DaemonSet) {
 	for i, c := range ds.Spec.Template.Spec.Containers {
-		if c.Name == dataplane.ShutdownContainerName {
-			ds.Spec.Template.Spec.Containers[i].LivenessProbe.ProbeHandler.HTTPGet.Scheme = "HTTP"
-			ds.Spec.Template.Spec.Containers[i].LivenessProbe.TimeoutSeconds = int32(1)
-			ds.Spec.Template.Spec.Containers[i].LivenessProbe.PeriodSeconds = int32(10)
-			ds.Spec.Template.Spec.Containers[i].LivenessProbe.SuccessThreshold = int32(1)
-			ds.Spec.Template.Spec.Containers[i].LivenessProbe.FailureThreshold = int32(3)
-		}
 		if c.Name == dataplane.EnvoyContainerName {
 			ds.Spec.Template.Spec.Containers[i].ReadinessProbe.TimeoutSeconds = int32(1)
 			// ReadinessProbe InitialDelaySeconds and PeriodSeconds are not set as defaults,
14 changes: 0 additions & 14 deletions internal/provisioner/objects/dataplane/dataplane.go
@@ -148,20 +148,6 @@ func desiredContainers(contour *model.Contour, contourImage, envoyImage string)
 	"envoy",
 	"shutdown-manager",
 },
-LivenessProbe: &corev1.Probe{
-	FailureThreshold: int32(3),
-	ProbeHandler: corev1.ProbeHandler{
-		HTTPGet: &corev1.HTTPGetAction{
-			Scheme: corev1.URISchemeHTTP,
-			Path: "/healthz",
-			Port: intstr.IntOrString{IntVal: int32(8090)},
-		},
-	},
-	InitialDelaySeconds: int32(3),
-	PeriodSeconds: int32(10),
-	SuccessThreshold: int32(1),
-	TimeoutSeconds: int32(1),
-},
 Lifecycle: &corev1.Lifecycle{
 	PreStop: &corev1.LifecycleHandler{
 		Exec: &corev1.ExecAction{
11 changes: 1 addition & 10 deletions site/content/docs/main/redeploy-envoy.md
@@ -13,10 +13,7 @@ When implementing this roll out, the following steps should be taken:

 Contour implements an `envoy` sub-command named `shutdown-manager` whose job is to manage a single Envoy instances lifecycle for Kubernetes.
 The `shutdown-manager` runs as a new container alongside the Envoy container in the same pod.
-It exposes two HTTP endpoints which are used for `livenessProbe` as well as to handle the Kubernetes `preStop` event hook.
-
-- **livenessProbe**: This is used to validate the shutdown manager is still running properly. If requests to `/healthz` fail, the container will be restarted.
-- **preStop**: This is used to keep the Envoy container running while waiting for connections to drain. The `/shutdown` endpoint blocks until the connections are drained.
+It uses a Kubernetes `preStop` event hook to keep the Envoy container running while waiting for connections to drain. The `/shutdown` endpoint blocks until the connections are drained.

 ```yaml
 - name: shutdown-manager
@@ -34,12 +31,6 @@ It exposes two HTTP endpoints which are used for `livenessProbe` as well as to h
 - /bin/contour
 - envoy
 - shutdown
-livenessProbe:
-  httpGet:
-    path: /healthz
-    port: 8090
-  initialDelaySeconds: 3
-  periodSeconds: 10
 ```
 The Envoy container also has some configuration to implement the shutdown manager.

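The updated `redeploy-envoy.md` text says the `/shutdown` endpoint blocks until connections are drained. A minimal Go sketch of that blocking pattern follows, using only the standard library. It is not Contour's implementation: `shutdownHandler`, `simulateDrain`, the polling interval, and the in-memory connection counter are all hypothetical, chosen only to illustrate the idea.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// shutdownHandler (hypothetical) returns a handler that blocks until
// activeConns reports zero, or until the client cancels the request.
func shutdownHandler(activeConns func() int64, poll time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ticker := time.NewTicker(poll)
		defer ticker.Stop()
		for activeConns() > 0 {
			select {
			case <-r.Context().Done():
				return // caller gave up before the drain finished
			case <-ticker.C:
				// poll interval elapsed; re-check the connection count
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}

// simulateDrain (hypothetical) starts with `initial` open connections, drops
// them to zero after drainAfterMs milliseconds, and returns the handler's
// status code plus how long the request blocked, in milliseconds.
func simulateDrain(initial int64, drainAfterMs int) (int, int64) {
	conns := initial
	start := time.Now()
	go func() {
		time.Sleep(time.Duration(drainAfterMs) * time.Millisecond)
		atomic.StoreInt64(&conns, 0)
	}()
	h := shutdownHandler(func() int64 { return atomic.LoadInt64(&conns) }, 5*time.Millisecond)
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodPost, "/shutdown", nil))
	return rec.Code, time.Since(start).Milliseconds()
}

func main() {
	status, waitedMs := simulateDrain(2, 50)
	fmt.Printf("status=%d blocked_for_ms=%d\n", status, waitedMs)
}
```

Blocking inside the handler is what lets a `preStop` hook hold a pod open: the hook does not complete until the request returns, or until the pod's termination grace period expires.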