Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve non-graceful restarts #10075

Closed
Joerger opened this issue Feb 1, 2022 · 3 comments · Fixed by #39203
Closed

Improve non-graceful restarts #10075

Joerger opened this issue Feb 1, 2022 · 3 comments · Fixed by #39203
Labels

Comments

@Joerger
Copy link
Contributor

Joerger commented Feb 1, 2022

Description

Non-graceful restart (SIGTERM or SIGINT) of a Teleport Node has the following functionality:

  1. Broadcast a TeleportExit event to any spawned goroutines/subprocesses
  2. Close process storage, localAuth, and keygen
  3. Process exits immediately

Although the TeleportExit event is broadcasted, the process doesn't wait for the event to be handled and for spawned goroutine to exit, so they only have the few microseconds afforded by step 2 to handle the event.

Most cleanup will still be handled by standard process termination, e.g. process's file descriptors are all closed. However, unix sockets are not cleaned up without an explicit call to l.Close or syscall.Unlink, since the abstract socket name is not tied to the process directly. This failed cleanup can be seen with both X11 forwarding and Agent forwarding, as their sockets remain unusable in the /tmp directory. They will get cleaned up after a system reboot, but without one the 1000 unix sockets used for X11 forwarding could become exhausted due to this issue.

Reproduction Steps

As minimally and precisely as possible, describe step-by-step how to reproduce the problem.

  1. Start a Teleport node
  2. Connect with tsh ssh -o "ForwardAgent yes"
  3. Get the SSH_AGENT_SOCK - printenv SSH_AGENT_SOCK
  4. ctrl-C the Teleport node.
  5. The socket is still open and needs to be manually removed/unlinked.

Suggestion

It in unclear to me what functionality is intended, as the current implementation has a code path through every sub-routine for non-graceful shutdown which will never be reached. These paths use a Close function rather than a Shutdown function. example1 example2.

It looks like these code paths are intended to take short cuts to a faster, less graceful shutdown, but instead the process exits before any of these are even run. We should decide on what the intended functionality is and clarify/update the logic.

Options:

  1. Add a short timeout during which the process will wait for the TeleportExit event to be consumed by subscribers. Users can still SIGKILL to skip this.
  2. Remove the TeleportExit event broadcast in non-graceful restart which currently only serves as a source of confusion / false security that cleanup is being handled.
  3. Add more resources to Step 2, including opened unix sockets.

The first option would be my suggestion as we already have all the code paths, we just need to add a timeout context, but again it's unclear if this is the intended functionality.

@zmb3
Copy link
Collaborator

zmb3 commented Mar 11, 2024

@Tener was this fixed by #31869?

@Tener
Copy link
Contributor

Tener commented Mar 11, 2024

@Tener was this fixed by #31869?

Yes, it should be. I went with first option, although I didn't know this issue existed:

Add a short timeout during which the process will wait for the TeleportExit event to be consumed by subscribers. Users can still SIGKILL to skip this.

I expect we shouldn't be leaking Unix sockets any longer, but I didn't check the scenario above.

@espadolini
Copy link
Contributor

In #31869 I asked for SIGTERM to exit after the short timeout passed, but the result of that is that we still don't wait for Shutdown to return, if we hit the timeout. I've created #39203 to add a little bit of extra wait while we wait for Shutdown to return - which should mean that everything is cleaned up as it should.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants