shutdown-manager issues #4851
A few more thoughts:
cc @sunjayBhatia, interested in your thoughts here when you get a chance.
Yeah, I agree that idea is the best solution for functionality. I don't think it's too hard on the face of it to implement, but it has some implications for downstream users, which means we'd need very good communication in our release notes and upgrade documentation stepping through what would need to happen.
We could also probably come up with a gradual release process to this final state that is minimally disruptive, but of course that will take time, though interested users could probably do a jump-upgrade if needed.
Yeah, it seems the only other option is to add a liveness probe to the envoy container. One other quick idea I just had for getting a pod to "recover" after the shutdown-manager is restarted would be to have the shutdown-manager ensure on startup that Envoy is in the right state. That won't fully work, because you can't recover an Envoy from "draining": once it goes into that state it won't allow Listener updates/modifications. We could also have the shutdown-manager check on startup whether Envoy is draining (using https://www.envoyproxy.io/docs/envoy/latest/api-v3/admin/v3/server_info.proto#envoy-v3-api-enum-admin-v3-serverinfo-state), wait for the drain to finish, and itself make Envoy exit (https://www.envoyproxy.io/docs/envoy/latest/operations/admin#post--quitquitquit). But again, this seems worse than using proper pod lifecycle hooks, and it doesn't solve the emptyDir issue.
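To make that idea concrete, here's a minimal sketch of what such a startup check could look like. This is hypothetical, not anything Contour ships, and it assumes the Envoy admin interface is reachable at 127.0.0.1:9001 (adjust for your bootstrap configuration; some setups use a Unix domain socket instead):
#!/usr/bin/env bash
# Hypothetical shutdown-manager startup check (sketch only): if Envoy is
# already DRAINING, wait for the drain and then ask Envoy to exit.
# The admin address is an assumption.
set -euo pipefail
ADMIN="http://127.0.0.1:9001"
if curl -fsS "${ADMIN}/server_info" | grep -q '"DRAINING"'; then
  echo "envoy is already draining; waiting before asking it to exit"
  sleep 30  # placeholder for a real "open connections below threshold" check
  curl -fsS -X POST "${ADMIN}/quitquitquit"
fi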
To be honest, I don't know if we've tried to move to this, but this endpoint could be useful: https://www.envoyproxy.io/docs/envoy/latest/operations/admin#post--drain_listeners (rather than setting the healthcheck to fail and waiting for existing connections to finish).
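For reference, invoking that endpoint is just a POST to the admin interface; a sketch, again assuming the admin interface is reachable at 127.0.0.1:9001:
# Sketch only: drain all listeners gracefully via the admin interface instead
# of failing the healthcheck and polling connection counts.
ADMIN="http://127.0.0.1:9001"
curl -fsS -X POST "${ADMIN}/drain_listeners?graceful"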
Opened this: #4852
A few more notes so I don't forget:
main...skriss:contour:exp-sdm-requests-and-probes removes the liveness probe from the shutdown-manager container, adds a liveness probe to the envoy container, and adds resource requests to both containers. This should mitigate the issue by resulting in fewer shutdown-manager restarts, and by enabling envoy to recover if/when it does get stuck in a "draining" state. main...skriss:contour:exp-sdm-in-envoy gets rid of the shutdown-manager sidecar, creates a single Docker image with both the contour and envoy binaries, and runs the shutdown handling from within the envoy container itself.
Coming back to this, here's what I propose doing for the upcoming 1.24 release:
I'd also like to hold off on adding a liveness probe to the Envoy container for now, since getting it wrong has the potential to cause more issues, but I do think it's something we should do, maybe for the next release, once we have a chance to fully tune all the params.
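For illustration, here's the kind of check such a probe might run, written as an exec-style script. This is a hypothetical sketch rather than anything Contour ships, and the admin address is an assumption:
#!/usr/bin/env bash
# Hypothetical liveness check for the envoy container: fail only when Envoy
# reports DRAINING, so kubelet restarts the container if it gets stuck.
# A real probe would also need a delay/threshold generous enough not to fire
# during a legitimate drain at pod termination; that's the tuning mentioned above.
set -euo pipefail
ADMIN="http://127.0.0.1:9001"
if curl -fsS "${ADMIN}/server_info" | grep -q '"DRAINING"'; then
  exit 1
fi
exit 0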
The probe can currently cause problems: when it fails, the shutdown-manager container is restarted by itself, which then results in the envoy container getting stuck in a "DRAINING" state indefinitely. Not having the probe is less bad overall, because envoy pods are less likely to get stuck in "DRAINING", and the worst case without it is that shutdown-manager is truly unresponsive during a pod termination, in which case the envoy container will terminate without first draining active connections. Updates projectcontour#4851. Signed-off-by: Steve Kriss <[email protected]>
Also recommends setting resource requests on containers. Updates projectcontour#4851. Signed-off-by: Steve Kriss <[email protected]>
Moving this to 1.25.0 for any remaining follow-ups.
Here's what we have done to solve the shutdown-manager issues and avoid them entirely: we give the Envoy container a custom entrypoint script and a custom preStop script, so no shutdown-manager sidecar is involved. The scripts are:
#!/usr/bin/env bash
# Entrypoint wrapper for the Envoy container: generate the bootstrap config
# with `contour bootstrap`, then start Envoy with whatever args were passed in.
set -euo pipefail

mkdir -p /tmp/config
mkdir -p /tmp/admin

echo "bootstrapping envoy"
contour bootstrap \
  /tmp/config/envoy.json \
  --admin-address="/tmp/admin/admin.sock" \
  --xds-address=contour \
  --xds-port=8001 \
  --resources-dir=/tmp/config/resources \
  --envoy-cafile=/certs/ca.crt \
  --envoy-cert-file=/certs/tls.crt \
  --envoy-key-file=/certs/tls.key
echo "bootstrap succeeded"

envoy "$@"
#!/usr/bin/env bash
# preStop script for the Envoy container: if Envoy is running, trigger a
# graceful drain via `contour envoy shutdown` over the admin socket and wait
# for it to complete before the container stops.
set -euo pipefail

if ! pidof envoy &>/dev/null; then
  exit 0
fi

echo "starting shutdown process"
contour envoy shutdown \
  --drain-delay=30s \
  --check-delay=10s \
  --admin-address=/tmp/admin/admin.sock \
  --ready-file=/tmp/admin/ok
echo "envoy is shutdown"

The rest is binding these scripts into the Envoy container as its entrypoint and preStop hook.
Thanks @rajatvig, that is great information.
The Contour project currently lacks enough contributors to adequately respond to all Issues. This bot triages Issues according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack
An additional issue encountered when running the Gateway API conformance tests: when an Envoy pod is created and then deleted before the shutdown-manager becomes ready, the pre-stop HTTP call to the shutdown-manager fails, which leaves the pod in the Terminating state until the graceful cleanup timeout is reached.
/reopen ?
xref. #3192
xref. #4322
xref. #4812
We've encountered a number of interrelated issues with the graceful Envoy shutdown workflow that I want to capture in one place.
The first issue goes something like the following: the shutdown-manager container's liveness probe fails, kubelet restarts just that container, and as part of that restart its preStop hook runs the contour envoy shutdown command. This command tells the associated Envoy to start draining all Listeners, including the Listener that is the target of the Envoy container's readiness probe. So now, Envoy is draining HTTP/S connections as well as reporting unready.
A few thoughts about this issue:
Secondly, #4322 describes issues with draining nodes due to the emptyDir that is used to allow the shutdown-manager and envoy containers to communicate (via UDS for the envoy admin interface, and by file to communicate when the Listener drain is done). I won't repeat that discussion here, just linking it for reference.