-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve non-graceful restarts #10075
Comments
Yes, it should be. I went with first option, although I didn't know this issue existed:
I expect we shouldn't be leaking Unix sockets any longer, but I didn't check the scenario above. |
In #31869 I asked for SIGTERM to exit after the short timeout passed, but the result of that is that we still don't wait for |
Description
Non-graceful restart (SIGTERM or SIGINT) of a Teleport Node has the following functionality:
TeleportExit
event to any spawned goroutines/subprocessesstorage
,localAuth
, andkeygen
Although the
TeleportExit
event is broadcasted, the process doesn't wait for the event to be handled and for spawned goroutine to exit, so they only have the few microseconds afforded by step 2 to handle the event.Most cleanup will still be handled by standard process termination, e.g. process's file descriptors are all closed. However, unix sockets are not cleaned up without an explicit call to
l.Close
orsyscall.Unlink
, since the abstract socket name is not tied to the process directly. This failed cleanup can be seen with both X11 forwarding and Agent forwarding, as their sockets remain unusable in the/tmp
directory. They will get cleaned up after a system reboot, but without one the 1000 unix sockets used for X11 forwarding could become exhausted due to this issue.Reproduction Steps
As minimally and precisely as possible, describe step-by-step how to reproduce the problem.
tsh ssh -o "ForwardAgent yes"
printenv SSH_AGENT_SOCK
ctrl-C
the Teleport node.Suggestion
It in unclear to me what functionality is intended, as the current implementation has a code path through every sub-routine for non-graceful shutdown which will never be reached. These paths use a
Close
function rather than aShutdown
function. example1 example2.It looks like these code paths are intended to take short cuts to a faster, less graceful shutdown, but instead the process exits before any of these are even run. We should decide on what the intended functionality is and clarify/update the logic.
Options:
TeleportExit
event to be consumed by subscribers. Users can still SIGKILL to skip this.TeleportExit
event broadcast in non-graceful restart which currently only serves as a source of confusion / false security that cleanup is being handled.The first option would be my suggestion as we already have all the code paths, we just need to add a timeout context, but again it's unclear if this is the intended functionality.
The text was updated successfully, but these errors were encountered: