
shutdown-manager issues #4851

Closed
skriss opened this issue Nov 15, 2022 · 16 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@skriss
Member

skriss commented Nov 15, 2022

xref. #3192
xref. #4322
xref. #4812

We've encountered a number of interrelated issues with the graceful Envoy shutdown workflow that I want to capture in one place.

The first issue goes something like the following:

  1. Something triggers Kubernetes to terminate/restart the shutdown-manager sidecar container in an Envoy pod. We've seen definite evidence of its liveness probes failing due to timeouts, which could be caused by the container not having any resource requests and getting CPU-throttled.
  2. When the shutdown-manager container is terminated, its preStop hook is triggered, which runs the contour envoy shutdown command. This command tells the associated Envoy to start draining all Listeners, including the Listener that is the target of the Envoy container's readiness probe. So now, Envoy is draining HTTP/S connections as well as reporting unready.
  3. The shutdown-manager's preStop hook eventually returns once enough connections have been closed, and the shutdown-manager container restarts successfully.
  4. The Envoy container is then stuck indefinitely in a "draining" state: it fails its readiness probe but never restarts, because there is no liveness probe or anything else to trigger a restart (see the diagnostic sketch after this list).
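
For context, a rough way to confirm this state on an affected pod is sketched below. This is illustration only: the namespace, container name, readiness port, admin socket path, and the availability of curl in the envoy image are assumptions based on Contour's example deployment and may differ in yours.

POD=envoy-abcde   # replace with an affected envoy pod name

# Readiness endpoint used by the envoy container's readiness probe;
# anything other than 200 means the probe is failing.
kubectl -n projectcontour exec "$POD" -c envoy -- \
  curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8002/ready

# Envoy's own view of its state via the admin interface; a pod stuck in
# this condition reports "state": "DRAINING".
kubectl -n projectcontour exec "$POD" -c envoy -- \
  curl -s --unix-socket /admin/admin.sock http://localhost/server_info | grep '"state"'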

A few thoughts about this issue:

  • adding resource requests to the shutdown-manager (and really, all containers) may help minimize the occurrence of this issue since they should ensure the container(s) get the CPU they need to avoid being restarted
  • adding a liveness probe to the Envoy container may help avoid it getting permanently stuck by restarting it when it's not healthy, as long as its implementation plays nicely with the whole graceful shutdown workflow
  • a restart of the shutdown-manager sidecar should probably not be triggering a drain of Envoy listeners. That should only be triggered by the Envoy container itself terminating/restarting.
  • GHSA-mjp8-x484-pm3r needs to be kept in mind as we potentially modify this workflow

Secondly, #4322 describes issues with draining nodes due to the emptyDir that is used to allow the shutdown-manager and envoy containers to communicate (via UDS for the envoy admin interface, and by file to communicate when the Listener drain is done). I won't repeat that discussion here, just linking it for reference.

@skriss skriss added kind/bug Categorizes issue or PR as related to a bug. lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. labels Nov 15, 2022
@skriss
Member Author

skriss commented Nov 16, 2022

A few more thoughts:

  • it seems like the best solution here would be to build our own Envoy image that includes the Contour binary, and then use the contour envoy shutdown command as a preStop hook for the Envoy container (which would trigger Envoy to drain connections and wait until they fall below the threshold). This would eliminate the sidecar, the emptyDir volume, and any need for communicating across containers.
  • However, I'm not 100% sure we're ready to move to building and publishing our own Envoy image yet. In the near term, I think adding (a) resource requests for the shutdown-manager (and probably envoy) containers, and (b) a liveness probe to the envoy container to prevent it from getting stuck in an unhealthy state indefinitely, should largely mitigate at least the first issue. It does not, however, do anything to address the second issue re: using emptyDir local storage.

cc @sunjayBhatia, interested in your thoughts here when you get a chance.

@sunjayBhatia
Member

sunjayBhatia commented Nov 16, 2022

Yeah, I agree that idea is the best solution for functionality. I don't think it's too hard on the face of it to implement, but it has some implications for downstream users, which means we would need very good communication in our release notes and upgrade documentation.

stepping through what would need to happen to do that:

  • building the Envoy image
    • we could at first build our own Envoy image using the existing upstream one as a base and just add the contour binary
    • leaves the option in the future of building Envoy with some custom extensions etc. if we ever wanted to go that way
    • also gives us an opportunity to switch to distributing a distroless based image so our default offering is less susceptible to CVE scans (can do this anyways regardless tbh)
    • however any downstream consumer would now need to do this as well if they want to properly drain connections, which maybe is worth a major version change?
  • changes to Envoy DaemonSet/Deployment YAML
    • the changes themselves are straightforward, but would again need coordination with consumers of this YAML to ensure they can accommodate the change, and that it's paired with a specially built Envoy image
    • could do the image change and the YAML change in a stepped fashion to make upgrades easier?
  • we could eventually get rid of the shutdown-manager code and associated listener etc. in Envoy's xDS config to handle that flow, which would be nice to have

we could also probably come up with a nice, gradual release process to reach this final state that is minimally disruptive; that will of course take time, though interested users could probably do a jump-upgrade if needed

@sunjayBhatia
Member

Yeah it seems the only other option is to add a liveness probe to the envoy container.

One other quick idea I just had for getting a pod to "recover" after the shutdown-manager is restarted would be to have the shutdown-manager ensure on startup that Envoy is in the right state. This won't work, because you can't recover an Envoy from "draining": once it goes into that state it won't allow Listener updates/modifications (see the Attention block here: https://www.envoyproxy.io/docs/envoy/latest/operations/admin#post--drain_listeners).

We could also have the shutdown-manager check on startup whether Envoy is draining (using https://www.envoyproxy.io/docs/envoy/latest/api-v3/admin/v3/server_info.proto#envoy-v3-api-enum-admin-v3-serverinfo-state), wait for the drain to finish, and then make Envoy exit itself (https://www.envoyproxy.io/docs/envoy/latest/operations/admin#post--quitquitquit). But again, this seems worse than using proper pod lifecycle hooks, and it doesn't solve the emptyDir issue.
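
A rough sketch of that startup check, purely for illustration (not something Contour ships, and ultimately not the direction taken; the admin socket path and the zero-connection threshold are assumptions based on the example deployment):

#!/usr/bin/env bash
# Hypothetical: if Envoy is already DRAINING when shutdown-manager starts,
# wait for active connections to drop and then ask Envoy to exit via
# /quitquitquit so the kubelet restarts the containers in a clean state.
set -euo pipefail

ADMIN_SOCKET=/admin/admin.sock

admin_get() {
  curl -s --unix-socket "${ADMIN_SOCKET}" "http://localhost$1"
}

if admin_get /server_info | grep -q '"state": *"DRAINING"'; then
  # Mirror what `contour envoy shutdown` does: wait for active downstream
  # connections to drain, then tell Envoy to exit.
  while true; do
    active=$(admin_get "/stats?filter=downstream_cx_active" |
      awk -F': ' '{sum += $2} END {print sum + 0}')
    [ "${active}" -le 0 ] && break
    sleep 5
  done
  curl -s -X POST --unix-socket "${ADMIN_SOCKET}" http://localhost/quitquitquit
fi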

@sunjayBhatia
Member

Tbh I don't know if we've tried to move to this, but this endpoint could be useful, rather than setting the healthcheck to fail and waiting for existing connections to finish: https://www.envoyproxy.io/docs/envoy/latest/operations/admin#post--drain_listeners
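
For illustration, calling that endpoint over Contour's admin socket would look something like the line below (the socket path is an assumption from the example deployment; the graceful query parameter is from the Envoy docs linked above):

# Hypothetical: start a graceful listener drain directly, instead of failing
# the healthcheck and waiting for the readiness probe to flip first.
curl -s -X POST --unix-socket /admin/admin.sock \
  "http://localhost/drain_listeners?graceful"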

@sunjayBhatia
Member

Opened this: #4852

@skriss
Member Author

skriss commented Nov 17, 2022

A few more notes so I don't forget:

  • Even if we do move to a single container that contains both the envoy and contour binaries, we're still using an emptyDir volume for coordination between the init container that writes out a bootstrap config and the main envoy container, so #4322 (using emptyDir in the Envoy daemonset/deployment / draining concerns / connection failures) would not be solved.
  • Another failure mode when the shutdown-manager container restarts is that, while running its preStop hook, it writes an ok file to the emptyDir shared between shutdown-manager and envoy. The next time the envoy container terminates, the /shutdown handler sees that this ok file exists and returns immediately, allowing envoy to terminate right away instead of waiting for connections to drain.
  • If we add a liveness probe to the envoy container to avoid it getting stuck, we need to make sure it won't restart the container in less than 300 seconds (i.e. the terminationGracePeriodSeconds for the pod), because that's how long a graceful drain can take when things go normally. So the liveness probe would only restart envoy after it has been stuck draining for more than 5 minutes without the pod terminating first; that's a pretty slow recovery, but better than nothing. We also need to keep in mind that there may not be an ok file if the shutdown-manager has not actually run, so we need to ensure the preStop hook doesn't hang there (xref #4812, "Why Envoy container doesn't have a liveliness probe by default?"). A sketch of what such a probe check could look like follows this list.
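
The probe command itself could be as simple as the hedged sketch below (an illustration, not the shipped config): it reports unhealthy only when Envoy says it is DRAINING, and the kubelet-side periodSeconds and failureThreshold would need to multiply out to more than the 300-second terminationGracePeriodSeconds so a normal graceful shutdown never trips it.

#!/usr/bin/env bash
# Hypothetical envoy liveness check (sketch only). An unreachable admin
# interface is treated as healthy so the probe targets only the "stuck
# draining" case and doesn't fire during startup; the socket path is an
# assumption based on the example deployment.
ADMIN_SOCKET=/admin/admin.sock

state=$(curl -s --max-time 2 --unix-socket "${ADMIN_SOCKET}" http://localhost/server_info |
  grep -o '"state": *"[A-Z_]*"' | cut -d'"' -f4 || true)

if [ "${state}" = "DRAINING" ]; then
  echo "envoy is draining; reporting unhealthy so the kubelet can restart it"
  exit 1
fi
exit 0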

@skriss
Member Author

skriss commented Nov 18, 2022

main...skriss:contour:exp-sdm-requests-and-probes removes the liveness probe from the shutdown-manager container, adds a liveness probe to the envoy container, and adds resource requests to both containers. This should mitigate the issue by resulting in fewer shutdown-manager restarts, and enabling envoy to recover if/when it does get stuck in a "draining" state.

main...skriss:contour:exp-sdm-in-envoy gets rid of the shutdown-manager sidecar, creates a single Docker image with both contour and envoy binaries, and runs the contour envoy shutdown command as a preStop hook in the consolidated envoy container. This approach avoids most of the issues related to coordinating between multiple containers in the pod. Note that, because we're still using an emptyDir volume for the bootstrap config, this does not entirely solve #4322, but does get us a step closer.

@skriss skriss added this to the 1.24.0 milestone Nov 29, 2022
@skriss skriss added this to Contour Nov 29, 2022
@skriss skriss removed the kind/bug Categorizes issue or PR as related to a bug. label Nov 29, 2022
@skriss skriss moved this to In Progress in Contour Nov 29, 2022
@skriss skriss added kind/bug Categorizes issue or PR as related to a bug. and removed lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. labels Nov 29, 2022
@skriss skriss self-assigned this Nov 29, 2022
@skriss
Member Author

skriss commented Jan 9, 2023

Coming back to this, here's what I propose doing for the upcoming 1.24 release:

  1. Remove the shutdown-manager's liveness probe. This will make it less likely that the shutdown-manager gets restarted by itself, and even if it does become unresponsive for any reason, the worst thing that happens is that the Pod shuts down without first gracefully draining connections, which is not as bad as the Pod getting permanently stuck in an unresponsive state.
  2. Add documentation to the installation instructions covering that resource requests should be added to all the containers, and providing some reasonable starting values (with the caveat that they will need to be tuned based on usage). I'd rather not add default values to the example YAML since those could cause problems for some users.
  3. Add a page to https://projectcontour.io/docs/v1.23.2/troubleshooting describing the overall issue in detail along with mitigations

I'd also like to hold off on adding a liveness probe to the Envoy container for now, since getting it wrong has the potential to cause more issues. I do think it's something we should do, though, maybe for the next release, once we have a chance to fully tune all the params.

skriss added a commit to skriss/contour that referenced this issue Jan 9, 2023
The probe can currently cause problems when it fails
by causing the shutdown-manager container to be restarted
by itself, which then results in the envoy container
getting stuck in a "DRAINING" state indefinitely.

Not having the probe is less bad overall because envoy pods
are less likely to get stuck in "DRAINING", and the
worst case without it is that shutdown-manager is truly
unresponsive during a pod termination, in which case
the envoy container will terminate without first draining
active connections.

Updates projectcontour#4851.

Signed-off-by: Steve Kriss <[email protected]>
skriss added a commit to skriss/contour that referenced this issue Jan 9, 2023
The probe can currently cause problems when it fails
by causing the shutdown-manager container to be restarted
by itself, which then results in the envoy container
getting stuck in a "DRAINING" state indefinitely.

Not having the probe is less bad overall because envoy pods
are less likely to get stuck in "DRAINING", and the
worst case without it is that shutdown-manager is truly
unresponsive during a pod termination, in which case
the envoy container will simply terminate without first
draining active connections.

Updates projectcontour#4851.

Signed-off-by: Steve Kriss <[email protected]>
skriss added a commit to skriss/contour that referenced this issue Jan 9, 2023
skriss added a commit to skriss/contour that referenced this issue Jan 9, 2023
Also recommends setting resource requests
on containers.

Updates projectcontour#4851.

Signed-off-by: Steve Kriss <[email protected]>
skriss added a commit that referenced this issue Jan 10, 2023
The probe can currently cause problems when it fails
by causing the shutdown-manager container to be restarted
by itself, which then results in the envoy container
getting stuck in a "DRAINING" state indefinitely.

Not having the probe is less bad overall because envoy pods
are less likely to get stuck in "DRAINING", and the
worst case without it is that shutdown-manager is truly
unresponsive during a pod termination, in which case
the envoy container will simply terminate without first
draining active connections.

Updates #4851.

Signed-off-by: Steve Kriss <[email protected]>
skriss added a commit that referenced this issue Jan 10, 2023
Also recommends setting resource requests
on containers.

Updates #4851.

Signed-off-by: Steve Kriss <[email protected]>
@skriss
Member Author

skriss commented Jan 10, 2023

Moving this to 1.25.0 for any remaining follow-ups.

@skriss skriss modified the milestones: 1.24.0, 1.25.0 Jan 10, 2023
@skriss skriss moved this from In Progress to Todo in Contour Jan 10, 2023
@rajatvig
Member

Here's what we have done to solve the shutdown-manager issues and avoid the emptyDir issues altogether.

Create a custom Envoy image with contour and custom start/shutdown scripts.

FROM ghcr.io/projectcontour/contour:v1.23.1 as contour

FROM envoyproxy/envoy:v1.24.1

LABEL MAINTAINER Runtime <[email protected]>

# The contour binary is needed for `contour bootstrap` (start.sh) and
# `contour envoy shutdown` (shutdown.sh, used as the preStop hook).
COPY --from=contour /bin/contour /usr/local/bin/contour
COPY start.sh /usr/local/bin/start.sh
COPY shutdown.sh /usr/local/bin/shutdown.sh

RUN chmod +x /usr/local/bin/start.sh && chmod +x /usr/local/bin/shutdown.sh

ENTRYPOINT ["/usr/local/bin/start.sh"]

The scripts are

  1. start.sh
#!/usr/bin/env bash

set -eou pipefail

mkdir -p /tmp/config
mkdir -p /tmp/admin

echo "bootstrapping envoy"

contour bootstrap \
  /tmp/config/envoy.json \
  --admin-address="/tmp/admin/admin.sock" \
  --xds-address=contour \
  --xds-port=8001 \
  --resources-dir=/tmp/config/resources \
  --envoy-cafile=/certs/ca.crt \
  --envoy-cert-file=/certs/tls.crt \
  --envoy-key-file=/certs/tls.key

echo "bootstrap succeeded"

envoy "$@"
  2. shutdown.sh
#!/usr/bin/env bash

set -eou pipefail

if ! pidof envoy &>/dev/null; then
  exit 0
fi

echo "starting shutdown process"

contour envoy shutdown \
  --drain-delay=30s \
  --check-delay=10s \
  --admin-address=/tmp/admin/admin.sock \
  --ready-file=/tmp/admin/ok

echo "envoy is shutdown"

The rest is binding the preStop hook to call the shutdown.sh script. The wait works well: it gives the LoadBalancer and the kubelet time to realise the pod is terminating, while still allowing Envoy to process requests until that happens.

@skriss
Member Author

skriss commented Jan 24, 2023

Thanks @rajatvig, that is great information.

yangyy93 pushed a commit to projectsesame/contour that referenced this issue Feb 16, 2023
The probe can currently cause problems when it fails
by causing the shutdown-manager container to be restarted
by itself, which then results in the envoy container
getting stuck in a "DRAINING" state indefinitely.

Not having the probe is less bad overall because envoy pods
are less likely to get stuck in "DRAINING", and the
worst case without it is that shutdown-manager is truly
unresponsive during a pod termination, in which case
the envoy container will simply terminate without first
draining active connections.

Updates projectcontour#4851.

Signed-off-by: Steve Kriss <[email protected]>
Signed-off-by: yy <[email protected]>
yangyy93 pushed a commit to projectsesame/contour that referenced this issue Feb 16, 2023
Also recommends setting resource requests
on containers.

Updates projectcontour#4851.

Signed-off-by: Steve Kriss <[email protected]>
Signed-off-by: yy <[email protected]>
yangyy93 pushed a commit to projectsesame/contour that referenced this issue Feb 16, 2023
The probe can currently cause problems when it fails
by causing the shutdown-manager container to be restarted
by itself, which then results in the envoy container
getting stuck in a "DRAINING" state indefinitely.

Not having the probe is less bad overall because envoy pods
are less likely to get stuck in "DRAINING", and the
worst case without it is that shutdown-manager is truly
unresponsive during a pod termination, in which case
the envoy container will simply terminate without first
draining active connections.

Updates projectcontour#4851.

Signed-off-by: Steve Kriss <[email protected]>
Signed-off-by: yy <[email protected]>
yangyy93 pushed a commit to projectsesame/contour that referenced this issue Feb 16, 2023
Also recommends setting resource requests
on containers.

Updates projectcontour#4851.

Signed-off-by: Steve Kriss <[email protected]>
Signed-off-by: yy <[email protected]>
vmw-yingy pushed a commit to vmw-yingy/contour that referenced this issue Feb 28, 2023
The probe can currently cause problems when it fails
by causing the shutdown-manager container to be restarted
by itself, which then results in the envoy container
getting stuck in a "DRAINING" state indefinitely.

Not having the probe is less bad overall because envoy pods
are less likely to get stuck in "DRAINING", and the
worst case without it is that shutdown-manager is truly
unresponsive during a pod termination, in which case
the envoy container will simply terminate without first
draining active connections.

Updates projectcontour#4851.

Signed-off-by: Steve Kriss <[email protected]>
vmw-yingy pushed a commit to vmw-yingy/contour that referenced this issue Feb 28, 2023
Also recommends setting resource requests
on containers.

Updates projectcontour#4851.

Signed-off-by: Steve Kriss <[email protected]>
@skriss skriss modified the milestones: 1.25.0, 1.26.0 Apr 7, 2023
@skriss skriss added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label May 3, 2023
@skriss skriss removed this from the 1.26.0 milestone May 3, 2023
@github-actions

github-actions bot commented Jul 3, 2023

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

  • Mark this Issue as fresh by commenting
  • Close this Issue
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 3, 2023
@sunjayBhatia
Member

An additional issue, encountered when running Gateway API conformance tests: when an Envoy pod is created and then deleted before the shutdown-manager becomes ready, the preStop HTTP call to the shutdown-manager fails, which leaves the pod in the Terminating state until the graceful cleanup timeout expires.

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 11, 2023
@github-actions

github-actions bot commented Oct 1, 2023

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

  • Mark this Issue as fresh by commenting
  • Close this Issue
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 1, 2023

github-actions bot commented Nov 4, 2023

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

  • Mark this Issue as fresh by commenting
  • Close this Issue
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions github-actions bot closed this as not planned Nov 4, 2023
@github-project-automation github-project-automation bot moved this from Todo to Done in Contour Nov 4, 2023
@a5r0n

a5r0n commented Jan 23, 2024

/reopen ?
