
SIGHUP of proxy results in TLS errors until original process exits. #6945

Closed
Tracked by #8745
jdconti opened this issue May 19, 2021 · 7 comments · Fixed by #11802
Labels: bug, c-ju Internal Customer Reference, regression

Comments


jdconti commented May 19, 2021

Description

What happened: To avoid interrupting sessions traversing the proxy, we run systemctl reload teleport regularly, and we have observed client connectivity/TLS issues until the process in graceful shutdown is terminated.

[PROXY:SER] ERRO "proxy2021/05/17 19:05:24 http: TLS handshake error from 192.168.200.1:54962: cache is closed\n" utils/cli.go:304

What you expected to happen: Clients should only hit the newly forked teleport process and not the process in graceful shutdown which is waiting for client sessions to terminate.

Reproduction Steps

As minimally and precisely as possible, describe step-by-step how to reproduce the problem.

  1. Run systemctl reload teleport on the proxy.
  2. Run curl https://proxy:3080/webapi/ping and observe internal TLS error messages (or see the tsh login output in the client section below), along with "cache is closed" errors in the proxy logs.
  3. Terminate the "old" teleport process; the intermittent TLS errors cease.

Server Details

  • Teleport version (run teleport version): Tested on 6.1.2 and 6.1.5
  • Server OS (e.g. from /etc/os-release):
NAME="Red Hat Enterprise Linux Server"
VERSION="7.7 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.7"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.7 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.7:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.7
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.7"
  • Where are you running Teleport? (e.g. AWS, GCP, Dedicated Hardware): Dedicated hardware

Client Details

  • Tsh version (tsh version): 6.1.2 and 6.1.5
  • Computer OS (e.g. Linux, macOS, Windows): Linux
  • Browser version (for UI-related issues): All browser versions result in SSL protocol errors.
  • Installed via (e.g. apt, yum, brew, website download): yum
  • Additional details:
$ tsh login --proxy=proxy.example.com --auth=azure_ad
ERROR: Get "https://proxy.example.com:3080/webapi/ping/azure_ad": remote error: tls: internal error

Debug Logs

Unfortunately I don't have debug logs from the recent occurrences but will reproduce this today/tomorrow and update with some logs if necessary.

@jdconti jdconti added the bug label May 19, 2021

jdconti commented May 21, 2021

I've been unable to reproduce this... closing for now.

@jdconti jdconti closed this as completed May 21, 2021

jdconti commented Jul 21, 2021

Alright, I've just hit this again...

@jdconti jdconti reopened this Jul 21, 2021
@russjones russjones added the c-ju Internal Customer Reference label Sep 11, 2021

jdconti commented Jan 31, 2022

The new failure mode on 8.1.1 is that an initial connect() attempt is made and the client then hangs (reproducible with curl/openssl). This results in ERROR: Post "https://proxy:3080/v1/webapi/oidc/login/console": net/http: TLS handshake timeout from tsh.


espadolini commented Mar 21, 2022

@jdconti do you have process logs around the time the issue happened? I'd like to know if it's caused by some general restart-related weirdness that we've fixed or we're fixing (#10706, #11074, #11022) or if it's caused by something new.


jdconti commented Mar 22, 2022

@espadolini I don't have the process logs for the latest issue on 8.1.1 but it should be easily reproducible by performing a SIGHUP on a proxy (running proxy and ssh services). If you're unable to reproduce this let me know and I can give you precise steps and/or reproduce in our dev environment.

russjones commented:

@espadolini Were you able to reproduce this? I wonder if 8.3.5 fixes it; if not, we can get more information on how to reproduce from @jdconti.

espadolini commented:

Sorry about not following up!

I've been able to reproduce the issue (on current master, at least). It's caused by the proxy shutdown task shutting components down serially, so the old proxy keeps accepting connections (especially on the ALPN forwarder) while half of its internal services have already stopped.

I'm currently figuring out what breaks for existing connections if we close all listening sockets immediately. I believe that's ultimately the only correct solution, as we shouldn't mix and match services between the old and new proxy (even if that somehow worked, which it currently doesn't).
