
SIGHUP of proxy results in TLS errors until original process exits. #6945

Closed
Tracked by #8745
jdconti opened this issue May 19, 2021 · 7 comments · Fixed by #11802
Labels: bug, c-ju Internal Customer Reference, regression

Comments


jdconti commented May 19, 2021

Description

What happened: To avoid interrupting sessions traversing the proxy, we run systemctl reload teleport regularly, and we have observed client connectivity/TLS issues until the process in graceful shutdown is terminated.

[PROXY:SER] ERRO "proxy2021/05/17 19:05:24 http: TLS handshake error from 192.168.200.1:54962: cache is closed\n" utils/cli.go:304

What you expected to happen: Clients should only hit the newly forked teleport process and not the process in graceful shutdown which is waiting for client sessions to terminate.

Reproduction Steps

As minimally and precisely as possible, describe step-by-step how to reproduce the problem.

  1. Run systemctl reload teleport on the proxy.
  2. Run curl https://proxy:3080/webapi/ping and observe internal TLS error messages (or see the tsh login output in the client section below), along with "cache is closed" errors in the proxy logs.
  3. Terminate the "old" teleport process; the intermittent TLS errors cease.

Server Details

  • Teleport version (run teleport version): Tested on 6.1.2 and 6.1.5
  • Server OS (e.g. from /etc/os-release):
NAME="Red Hat Enterprise Linux Server"
VERSION="7.7 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.7"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.7 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.7:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.7
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.7"
  • Where are you running Teleport? (e.g. AWS, GCP, Dedicated Hardware): Dedicated hardware

Client Details

  • Tsh version (tsh version): 6.1.2 and 6.1.5
  • Computer OS (e.g. Linux, macOS, Windows): Linux
  • Browser version (for UI-related issues): All browser versions result in SSL protocol errors.
  • Installed via (e.g. apt, yum, brew, website download): yum
  • Additional details:
$ tsh login --proxy=proxy.example.com --auth=azure_ad
ERROR: Get "https://proxy.example.com:3080/webapi/ping/azure_ad": remote error: tls: internal error

Debug Logs

Unfortunately I don't have debug logs from the recent occurrences but will reproduce this today/tomorrow and update with some logs if necessary.

@jdconti jdconti added the bug label May 19, 2021

jdconti commented May 21, 2021

I've been unable to reproduce this... closing for now.

@jdconti jdconti closed this as completed May 21, 2021

jdconti commented Jul 21, 2021

Alright, I've just hit this again...

@jdconti jdconti reopened this Jul 21, 2021
@russjones russjones added the c-ju Internal Customer Reference label Sep 11, 2021

jdconti commented Jan 31, 2022

The new failure mode on 8.1.1 is that an initial connect() attempt is made and the client then hangs (reproducible with curl/openssl). This results in ERROR: Post "https://proxy:3080/v1/webapi/oidc/login/console": net/http: TLS handshake timeout from tsh.


espadolini commented Mar 21, 2022

@jdconti do you have process logs around the time the issue happened? I'd like to know if it's caused by some general restart-related weirdness that we've fixed or we're fixing (#10706, #11074, #11022) or if it's caused by something new.


jdconti commented Mar 22, 2022

@espadolini I don't have the process logs for the latest issue on 8.1.1 but it should be easily reproducible by performing a SIGHUP on a proxy (running proxy and ssh services). If you're unable to reproduce this let me know and I can give you precise steps and/or reproduce in our dev environment.

russjones commented:

@espadolini Were you able to reproduce this? I wonder if 8.3.5 fixes it; if not, we can get more information on how to reproduce from @jdconti.

espadolini commented:

Sorry about not following up!

I've been able to reproduce the issue (on current master, at least). It's caused by the proxy shutdown task shutting components down serially, so the old proxy keeps accepting connections (especially on the ALPN forwarder) while half of its internal services have already stopped.

I'm currently figuring out what breaks for existing connections if we close all listening sockets immediately. I believe that's ultimately the only correct solution, as we shouldn't mix and match services between the old and new proxy (even if that somehow worked, which it currently doesn't).
