Stale DynamoDB Auth Service Items #10019

Closed
corkrean opened this issue Jan 28, 2022 · 3 comments · Fixed by #11229
Labels
c-di, c-ex, c-fwh, c-jm, c-nt, c-pg, c-sr (Internal Customer Reference); regression

Comments

corkrean (Contributor) commented Jan 28, 2022

A customer needs to frequently restart the auth/proxy pods deployed by the teleport-cluster Helm chart. This leaves stale auth service items in DynamoDB that typically persist for up to an hour. All issues are resolved once the stale DynamoDB auth service items are deleted manually or expire.
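For reference, a minimal sketch of how the stale entries could be located in the backing DynamoDB table before removing them by hand. The table name is a placeholder, and the HashKey/FullPath layout with an "authservers" path prefix is an assumption about the DynamoDB backend schema, so verify it with aws dynamodb describe-table before deleting anything:

# Placeholder table name; key layout and "authservers" prefix are assumptions.
aws dynamodb scan \
  --table-name teleport-backend \
  --filter-expression "contains(FullPath, :p)" \
  --expression-attribute-values '{":p": {"S": "authservers"}}' \
  --projection-expression "FullPath, Expires"

Any item whose FullPath matches an auth server that is no longer running is a candidate for the manual deletion described above (aws dynamodb delete-item with that item's key).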

The following logs are observed while the issue is occurring:

Error from client:

ERROR REPORT:
Original Error: *trace.TraceErr connection error: desc = "transport: authentication handshake failed: read tcp 192.168.248.193:65080->172.31.150.10:443: i/o timeout"
Stack Trace:
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/vendor/github.com/gravitational/teleport/api/client/client.go:2340 github.com/gravitational/teleport/api/client.(*Client).GetResources
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/vendor/github.com/gravitational/teleport/api/client/client.go:1095 github.com/gravitational/teleport/api/client.(*Client).GetDatabaseServers
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/lib/auth/clt.go:1589 github.com/gravitational/teleport/lib/auth.(*Client).GetDatabaseServers
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/lib/client/client.go:617 github.com/gravitational/teleport/lib/client.(*ProxyClient).GetDatabaseServers
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/lib/client/api.go:1925 github.com/gravitational/teleport/lib/client.(*TeleportClient).ListDatabaseServers
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/lib/client/api.go:1930 github.com/gravitational/teleport/lib/client.(*TeleportClient).ListDatabases
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/tool/tsh/db.go:50 main.onListDatabases.func1
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/lib/client/api.go:504 github.com/gravitational/teleport/lib/client.RetryWithRelogin
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/tool/tsh/db.go:49 main.onListDatabases
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/tool/tsh/tsh.go:659 main.Run
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/tool/tsh/tsh.go:285 main.main
        /var/folders/ys/8czjjsys38x504kj8172pd_m0000gp/T/drone-c5rOl7CA9nUzyqDD/home/drone/build-9935-1643075385-toolchains/go/src/runtime/proc.go:255 runtime.main
        /var/folders/ys/8czjjsys38x504kj8172pd_m0000gp/T/drone-c5rOl7CA9nUzyqDD/home/drone/build-9935-1643075385-toolchains/go/src/runtime/asm_amd64.s:1581 runtime.goexit
User Message: connection error: desc = "transport: authentication handshake failed: read tcp 192.168.248.193:65080->172.31.150.10:443: i/o timeout"

Error from kube-agent:

teleport-kube-agent-9bc9d67f9-8gg5g teleport 2022-01-27T14:49:18Z WARN [DB:SERVIC] Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:900 target:redacted.com:443 reversetunnel/agent.go:384

gz#4346

gz#3996

pschisa (Contributor) commented Jan 28, 2022

Seems similar to #8793

@corkrean corkrean added the c-jm Internal Customer Reference label Mar 4, 2022
russjones (Contributor) commented
@corkrean Can you check whether DynamoDB is being throttled or whether there is enough capacity?
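One way to check this (not from the thread; the table name below is a placeholder) is to look at the table's DynamoDB CloudWatch throttle metrics and its provisioned capacity:

# Placeholder table name and time window.
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ReadThrottleEvents \
  --dimensions Name=TableName,Value=teleport-backend \
  --start-time 2022-01-27T00:00:00Z --end-time 2022-01-28T00:00:00Z \
  --period 300 --statistics Sum

aws dynamodb describe-table --table-name teleport-backend \
  --query "Table.{Billing:BillingModeSummary,Throughput:ProvisionedThroughput}"

Repeating the first command with WriteThrottleEvents covers writes; non-zero sums would indicate throttling.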

deusxanima (Contributor) commented
@russjones,

I was able to repro this today while working another issue with a customer. This customer has a similar setup; the only differences are that they're using multiplexing on the proxies and EC2 ASGs for the auth instances instead of containers. The backend is DynamoDB in both cases.

Repro steps were as follows:

  1. Launch auth/proxy instance in EC2/container behind an LB and connect to dynamo.
  2. Connect resources to auth/proxy and confirm reachability.
  3. Launch second auth/proxy behind same LB and connect to dynamo.
  4. Check resource connectivity (this should be fine).
  5. Stop service on one of the auth/proxy nodes.
  6. Try to connect to resources and observe instability for up to 10 minutes (or until the auth entry expires on the backend in DynamoDB).

Trying to connect to SSH nodes leads to EOF messages (see logs below), and other resources such as databases show the errors @corkrean posted above. As soon as I manually remove the auth entry in DynamoDB, everything stabilizes instantly and I can connect to everything again.

Log snippets are listed below (slightly different from the ones above). I'm also happy to share my repro environment if a live look would be helpful.

Logs from the remaining auth/proxy instance:

Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [NODE]      newProxySubsys({default 1743b34b-927a-4f6a-ae09-7649739f54c0 0 ip-172-31-72-82-ec2-internal}). regular/proxy.go:198
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY]     Subsystem request: proxySubsys(cluster=default/ip-172-31-72-82-ec2-internal, host=1743b34b-927a-4f6a-ae09-7649739f54c0, port=0). id:210 idle:2m0s local:127.0.0.1:443 login:root remote:127.0.0.1:56764 teleportUser:[email protected] regular/sshserver.go:1577
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [SUBSYSTEM] Starting subsystem trace.fields:map[dst:127.0.0.1:443 src:127.0.0.1:56764] regular/proxy.go:227
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [SUBSYSTEM] proxy connecting to host=1743b34b-927a-4f6a-ae09-7649739f54c0 port=0, exact port=false, strategy=UNAMBIGUOUS_MATCH trace.fields:map[dst:127.0.0.1:443 src:127.0.0.1:56764] regular/proxy.go:358
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY:SER] Dialing from: "216.63.184.33:56217" to: "@local-node". trace.fields:map[cluster:ip-172-31-72-82-ec2-internal] reversetunnel/localsite.go:198
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY:SER] Tunnel dialing to 1743b34b-927a-4f6a-ae09-7649739f54c0.ip-172-31-72-82-ec2-internal. trace.fields:map[cluster:ip-172-31-72-82-ec2-internal] reversetunnel/localsite.go:282
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY:SER] Connecting to 54.174.162.54:53450 through tunnel. trace.fields:map[cluster:ip-172-31-72-82-ec2-internal] reversetunnel/localsite.go:473
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY:SER] Succeeded dialing from: "216.63.184.33:56217" to: "@local-node". trace.fields:map[cluster:ip-172-31-72-82-ec2-internal] reversetunnel/localsite.go:204
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY]     Handling request subsystem, want reply true. id:211 idle:2m0s local:127.0.0.1:443 login:root remote:127.0.0.1:56764 teleportUser:Aharic regular/sshserver.go:1397
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY]     Subsystem request: proxySites(). id:211 idle:2m0s local:127.0.0.1:443 login:root remote:127.0.0.1:56764 teleportUser:Aharic regular/sshserver.go:1577
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [NODE]      proxysites.start(ServerContext(127.0.0.1:56764->127.0.0.1:443, user=root, id=211)) regular/sites.go:54
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [NODE]      Subsystem proxySites() finished with result: <nil>. regular/sshserver.go:1587
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY]     Close session request: <nil>. id:211 idle:2m0s local:127.0.0.1:443 login:root remote:127.0.0.1:56764 teleportUser:Aharic regular/sshserver.go:1359
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 WARN [PROXY]     Lock watcher subscription was closed. error:[
                                                          ERROR REPORT:
                                                          Original Error: *errors.errorString watcher closed
                                                          Stack Trace:
                                                                  /go/src/github.com/gravitational/teleport/lib/services/fanout.go:366 github.com/gravitational/teleport/lib/services.(*fanoutWatcher).Error
                                                                  /go/src/github.com/gravitational/teleport/lib/srv/monitor.go:245 github.com/gravitational/teleport/lib/srv.(*Monitor).start
                                                                  /opt/go/src/runtime/asm_amd64.s:1581 runtime.goexit
                                                          User Message: watcher closed] id:211 idle:2m0s local:127.0.0.1:443 login:root remote:127.0.0.1:56764 teleportUser:Aharic srv/monitor.go:245
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY]     Releasing associated resources - context has been closed. id:211 idle:2m0s local:127.0.0.1:443 login:root remote:127.0.0.1:56764 teleportUser:Aharic srv/monitor.go:253
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [KEEPALIVE] Starting keep-alive loop with with interval 30s and max count 2. srv/keepalive.go:65
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [CLIENT]    Found clusters: [{"name":"ip-172-31-72-82-ec2-internal","lastconnected":"2022-03-11T15:14:13.303982502-05:00","status":"online"}] client/client.go:110
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 WARN [ALPN:PROX] Failed to handle client connection. error:[
                                                          ERROR REPORT:
                                                          Original Error: *net.OpError dial tcp 172.31.87.241:3025: connect: connection refused
                                                          Stack Trace:
                                                                  /go/src/github.com/gravitational/teleport/lib/srv/alpnproxy/auth/auth_proxy.go:120 github.com/gravitational/teleport/lib/srv/alpnproxy/auth.(*AuthProxyDialerService).dialLocalAuthServer
                                                                  /go/src/github.com/gravitational/teleport/lib/srv/alpnproxy/auth/auth_proxy.go:100 github.com/gravitational/teleport/lib/srv/alpnproxy/auth.(*AuthProxyDialerService).dialAuthServer
                                                                  /go/src/github.com/gravitational/teleport/lib/srv/alpnproxy/auth/auth_proxy.go:67 github.com/gravitational/teleport/lib/srv/alpnproxy/auth.(*AuthProxyDialerService).HandleConnection
                                                                  /go/src/github.com/gravitational/teleport/lib/srv/alpnproxy/proxy.go:177 github.com/gravitational/teleport/lib/srv/alpnproxy.(*HandlerDecs).handle
                                                                  /go/src/github.com/gravitational/teleport/lib/srv/alpnproxy/proxy.go:322 github.com/gravitational/teleport/lib/srv/alpnproxy.(*Proxy).handleConn
                                                                  /go/src/github.com/gravitational/teleport/lib/srv/alpnproxy/proxy.go:268 github.com/gravitational/teleport/lib/srv/alpnproxy.(*Proxy).Serve.func1
                                                                  /opt/go/src/runtime/asm_amd64.s:1581 runtime.goexit
                                                          User Message: dial tcp 172.31.87.241:3025: connect: connection refused] alpnproxy/proxy.go:276
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [WEBSOCKET] Terminating websocket ping loop. web/terminal.go:467
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [SSH:PROXY] Closed connection 127.0.0.1:56764. sshutils/server.go:471
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY]     Releasing associated resources - context has been closed. id:210 idle:2m0s local:127.0.0.1:443 login:root remote:127.0.0.1:56764 teleportUser:Aharic srv/monitor.go:253

Error seen in the Web UI when attempting to connect:

connection error: desc = "transport: authentication handshake failed: EOF"

Logs from localhost when running tsh login during this time:

❯ tsh login --proxy proxy.example.com
Welcome to the Root Cluster
Press [ENTER] to continue.
If browser window does not open automatically, open it by clicking on the link:
 http://127.0.0.1:56349/8942d876-08d2-5678-1234-bc38397e93s6
ERROR: Get "https://teleport.cluster.local/v2/authorities/host?load_keys=false": EOF

@deusxanima deusxanima added c-nt Internal Customer Reference c-gs Internal Customer Reference c-sr Internal Customer Reference c-fw Internal Customer Reference c-fwh Internal Customer Reference c-ex Internal Customer Reference and removed c-gs Internal Customer Reference c-fw Internal Customer Reference labels Mar 11, 2022
@pschisa pschisa added c-di Internal Customer Reference c-pg Internal Customer Reference labels Mar 15, 2022
probakowski added a commit that referenced this issue Mar 18, 2022
Currently we use a random auth server from the list, but if it's unavailable (for example, it was restarted but there is still an entry in the cache, DynamoDB backend, etc.) we return an error. This change tries all servers (in random order) and uses the first one that is available.

Closes #10019
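As an illustration of the approach described in the commit message, here is a minimal, hypothetical Go sketch (not the actual Teleport code; the function name and addresses are made up): shuffle the candidate auth servers and return the first one that can be dialed, rather than failing on a single randomly chosen entry that may be stale.

// Minimal sketch only, assuming plain TCP dialing; the real implementation
// lives in the Teleport codebase and differs in detail.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"net"
	"time"
)

// dialFirstAvailable tries every address in random order and returns the
// first connection that can be established within the timeout.
func dialFirstAvailable(addrs []string, timeout time.Duration) (net.Conn, error) {
	shuffled := append([]string(nil), addrs...)
	rand.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})

	var lastErr error
	for _, addr := range shuffled {
		conn, err := net.DialTimeout("tcp", addr, timeout)
		if err == nil {
			return conn, nil // first reachable auth server wins
		}
		lastErr = err // remember the failure and keep trying the rest
	}
	if lastErr == nil {
		lastErr = errors.New("no auth servers to dial")
	}
	return nil, fmt.Errorf("all auth servers failed: %w", lastErr)
}

func main() {
	// Hypothetical addresses; a stale DynamoDB entry would behave like the
	// first one here: still listed on the backend but no longer reachable.
	addrs := []string{"172.31.87.241:3025", "172.31.72.82:3025"}
	conn, err := dialFirstAvailable(addrs, 5*time.Second)
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}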
probakowski added further commits that referenced this issue Mar 31, 2022, each with the same message and cherry picked from commit 35a9bbc.