Stale DynamoDB Auth Service Items #10019

Closed
corkrean opened this issue Jan 28, 2022 · 3 comments · Fixed by #11229
Labels
c-di, c-ex, c-fwh, c-jm, c-nt, c-pg, c-sr (Internal Customer Reference); regression

Comments

corkrean (Contributor) commented Jan 28, 2022

A customer needs to frequently restart the auth/proxy pods deployed by the teleport-cluster Helm chart. This leaves stale auth service items in DynamoDB that typically persist for up to an hour. All issues are resolved once the stale DynamoDB auth service items are deleted manually or expire.
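For reference, a minimal sketch of how the stale entries could be located in the backing DynamoDB table before removing them by hand. The table name is a placeholder, and the HashKey/FullPath layout with an "authservers" path prefix is an assumption about the DynamoDB backend schema, so verify it with aws dynamodb describe-table before deleting anything:

# Placeholder table name; key layout and "authservers" prefix are assumptions.
aws dynamodb scan \
  --table-name teleport-backend \
  --filter-expression "contains(FullPath, :p)" \
  --expression-attribute-values '{":p": {"S": "authservers"}}' \
  --projection-expression "FullPath, Expires"

Any item whose FullPath matches an auth server that is no longer running is a candidate for the manual deletion described above (aws dynamodb delete-item with that item's key).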

The following logs are observed while the issue is occurring:

Error from client:

ERROR REPORT:
Original Error: *trace.TraceErr connection error: desc = "transport: authentication handshake failed: read tcp 192.168.248.193:65080->172.31.150.10:443: i/o timeout"
Stack Trace:
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/vendor/github.com/gravitational/teleport/api/client/client.go:2340 github.com/gravitational/teleport/api/client.(*Client).GetResources
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/vendor/github.com/gravitational/teleport/api/client/client.go:1095 github.com/gravitational/teleport/api/client.(*Client).GetDatabaseServers
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/lib/auth/clt.go:1589 github.com/gravitational/teleport/lib/auth.(*Client).GetDatabaseServers
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/lib/client/client.go:617 github.com/gravitational/teleport/lib/client.(*ProxyClient).GetDatabaseServers
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/lib/client/api.go:1925 github.com/gravitational/teleport/lib/client.(*TeleportClient).ListDatabaseServers
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/lib/client/api.go:1930 github.com/gravitational/teleport/lib/client.(*TeleportClient).ListDatabases
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/tool/tsh/db.go:50 main.onListDatabases.func1
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/lib/client/api.go:504 github.com/gravitational/teleport/lib/client.RetryWithRelogin
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/tool/tsh/db.go:49 main.onListDatabases
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/tool/tsh/tsh.go:659 main.Run
        /tmp/build-darwin-amd64/go/src/github.com/gravitational/teleport/tool/tsh/tsh.go:285 main.main
        /var/folders/ys/8czjjsys38x504kj8172pd_m0000gp/T/drone-c5rOl7CA9nUzyqDD/home/drone/build-9935-1643075385-toolchains/go/src/runtime/proc.go:255 runtime.main
        /var/folders/ys/8czjjsys38x504kj8172pd_m0000gp/T/drone-c5rOl7CA9nUzyqDD/home/drone/build-9935-1643075385-toolchains/go/src/runtime/asm_amd64.s:1581 runtime.goexit
User Message: connection error: desc = "transport: authentication handshake failed: read tcp 192.168.248.193:65080->172.31.150.10:443: i/o timeout"

Error from kube-agent:

teleport-kube-agent-9bc9d67f9-8gg5g teleport 2022-01-27T14:49:18Z WARN [DB:SERVIC] Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:900 target:redacted.com:443 reversetunnel/agent.go:384

gz#4346

gz#3996

pschisa (Contributor) commented Jan 28, 2022

Seems similar to #8793

@corkrean corkrean added the c-jm Internal Customer Reference label Mar 4, 2022
russjones (Contributor) commented
@corkrean Can you check whether DynamoDB is being throttled or whether there is enough capacity?
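One way to check this (not from the thread; the table name below is a placeholder) is to look at the table's DynamoDB CloudWatch throttle metrics and its provisioned capacity:

# Placeholder table name and time window.
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ReadThrottleEvents \
  --dimensions Name=TableName,Value=teleport-backend \
  --start-time 2022-01-27T00:00:00Z --end-time 2022-01-28T00:00:00Z \
  --period 300 --statistics Sum

aws dynamodb describe-table --table-name teleport-backend \
  --query "Table.{Billing:BillingModeSummary,Throughput:ProvisionedThroughput}"

Repeating the first command with WriteThrottleEvents covers writes; non-zero sums would indicate throttling.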

deusxanima (Contributor) commented
@russjones,

I was able to repro this today while working another issue with a customer. This customer has a similar setup; the only differences are that they're using multiplexing on the proxies and EC2 ASGs for the auth instances instead of containers. The backend is DynamoDB in both cases.

Repro steps were as follows:

  1. Launch auth/proxy instance in EC2/container behind an LB and connect to dynamo.
  2. Connect resources to auth/proxy and confirm reachability.
  3. Launch second auth/proxy behind same LB and connect to dynamo.
  4. Check resource connectivity (this should be fine).
  5. Stop service on one of the auth/proxy nodes.
  6. Try to connect to resources and observe instability for up to 10 minutes (or until the auth entry expires on the backend in DynamoDB).

Trying to connect to SSH nodes leads to EOF messages (see logs below), and other resources such as databases show the errors @corkrean posted above. As soon as I manually remove the auth entry in DynamoDB, everything stabilizes instantly and I can connect to everything again.

Log snippets are listed below (slightly different from the ones above). I'm also happy to share my repro environment if a live look would be helpful.

Logs from the remaining auth/proxy instance:

Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [NODE]      newProxySubsys({default 1743b34b-927a-4f6a-ae09-7649739f54c0 0 ip-172-31-72-82-ec2-internal}). regular/proxy.go:198
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY]     Subsystem request: proxySubsys(cluster=default/ip-172-31-72-82-ec2-internal, host=1743b34b-927a-4f6a-ae09-7649739f54c0, port=0). id:210 idle:2m0s local:127.0.0.1:443 login:root remote:127.0.0.1:56764 teleportUser:[email protected] regular/sshserver.go:1577
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [SUBSYSTEM] Starting subsystem trace.fields:map[dst:127.0.0.1:443 src:127.0.0.1:56764] regular/proxy.go:227
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [SUBSYSTEM] proxy connecting to host=1743b34b-927a-4f6a-ae09-7649739f54c0 port=0, exact port=false, strategy=UNAMBIGUOUS_MATCH trace.fields:map[dst:127.0.0.1:443 src:127.0.0.1:56764] regular/proxy.go:358
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY:SER] Dialing from: "216.63.184.33:56217" to: "@local-node". trace.fields:map[cluster:ip-172-31-72-82-ec2-internal] reversetunnel/localsite.go:198
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY:SER] Tunnel dialing to 1743b34b-927a-4f6a-ae09-7649739f54c0.ip-172-31-72-82-ec2-internal. trace.fields:map[cluster:ip-172-31-72-82-ec2-internal] reversetunnel/localsite.go:282
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY:SER] Connecting to 54.174.162.54:53450 through tunnel. trace.fields:map[cluster:ip-172-31-72-82-ec2-internal] reversetunnel/localsite.go:473
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY:SER] Succeeded dialing from: "216.63.184.33:56217" to: "@local-node". trace.fields:map[cluster:ip-172-31-72-82-ec2-internal] reversetunnel/localsite.go:204
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY]     Handling request subsystem, want reply true. id:211 idle:2m0s local:127.0.0.1:443 login:root remote:127.0.0.1:56764 teleportUser:Aharic regular/sshserver.go:1397
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY]     Subsystem request: proxySites(). id:211 idle:2m0s local:127.0.0.1:443 login:root remote:127.0.0.1:56764 teleportUser:Aharic regular/sshserver.go:1577
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [NODE]      proxysites.start(ServerContext(127.0.0.1:56764->127.0.0.1:443, user=root, id=211)) regular/sites.go:54
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [NODE]      Subsystem proxySites() finished with result: <nil>. regular/sshserver.go:1587
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY]     Close session request: <nil>. id:211 idle:2m0s local:127.0.0.1:443 login:root remote:127.0.0.1:56764 teleportUser:Aharic regular/sshserver.go:1359
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 WARN [PROXY]     Lock watcher subscription was closed. error:[
                                                          ERROR REPORT:
                                                          Original Error: *errors.errorString watcher closed
                                                          Stack Trace:
                                                                  /go/src/github.com/gravitational/teleport/lib/services/fanout.go:366 github.com/gravitational/teleport/lib/services.(*fanoutWatcher).Error
                                                                  /go/src/github.com/gravitational/teleport/lib/srv/monitor.go:245 github.com/gravitational/teleport/lib/srv.(*Monitor).start
                                                                  /opt/go/src/runtime/asm_amd64.s:1581 runtime.goexit
                                                          User Message: watcher closed] id:211 idle:2m0s local:127.0.0.1:443 login:root remote:127.0.0.1:56764 teleportUser:Aharic srv/monitor.go:245
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY]     Releasing associated resources - context has been closed. id:211 idle:2m0s local:127.0.0.1:443 login:root remote:127.0.0.1:56764 teleportUser:Aharic srv/monitor.go:253
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [KEEPALIVE] Starting keep-alive loop with with interval 30s and max count 2. srv/keepalive.go:65
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [CLIENT]    Found clusters: [{"name":"ip-172-31-72-82-ec2-internal","lastconnected":"2022-03-11T15:14:13.303982502-05:00","status":"online"}] client/client.go:110
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 WARN [ALPN:PROX] Failed to handle client connection. error:[
                                                          ERROR REPORT:
                                                          Original Error: *net.OpError dial tcp 172.31.87.241:3025: connect: connection refused
                                                          Stack Trace:
                                                                  /go/src/github.com/gravitational/teleport/lib/srv/alpnproxy/auth/auth_proxy.go:120 github.com/gravitational/teleport/lib/srv/alpnproxy/auth.(*AuthProxyDialerService).dialLocalAuthServer
                                                                  /go/src/github.com/gravitational/teleport/lib/srv/alpnproxy/auth/auth_proxy.go:100 github.com/gravitational/teleport/lib/srv/alpnproxy/auth.(*AuthProxyDialerService).dialAuthServer
                                                                  /go/src/github.com/gravitational/teleport/lib/srv/alpnproxy/auth/auth_proxy.go:67 github.com/gravitational/teleport/lib/srv/alpnproxy/auth.(*AuthProxyDialerService).HandleConnection
                                                                  /go/src/github.com/gravitational/teleport/lib/srv/alpnproxy/proxy.go:177 github.com/gravitational/teleport/lib/srv/alpnproxy.(*HandlerDecs).handle
                                                                  /go/src/github.com/gravitational/teleport/lib/srv/alpnproxy/proxy.go:322 github.com/gravitational/teleport/lib/srv/alpnproxy.(*Proxy).handleConn
                                                                  /go/src/github.com/gravitational/teleport/lib/srv/alpnproxy/proxy.go:268 github.com/gravitational/teleport/lib/srv/alpnproxy.(*Proxy).Serve.func1
                                                                  /opt/go/src/runtime/asm_amd64.s:1581 runtime.goexit
                                                          User Message: dial tcp 172.31.87.241:3025: connect: connection refused] alpnproxy/proxy.go:276
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [WEBSOCKET] Terminating websocket ping loop. web/terminal.go:467
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [SSH:PROXY] Closed connection 127.0.0.1:56764. sshutils/server.go:471
Mar 11 15:14:13 rc-node-1 /usr/local/bin/teleport[29301]: 2022-03-11T15:14:13-05:00 DEBU [PROXY]     Releasing associated resources - context has been closed. id:210 idle:2m0s local:127.0.0.1:443 login:root remote:127.0.0.1:56764 teleportUser:Aharic srv/monitor.go:253

Error seen in the Web UI when attempting to connect:

connection error: desc = "transport: authentication handshake failed: EOF"

Logs from localhost when running tsh login during this time:

❯ tsh login --proxy proxy.example.com
Welcome to the Root Cluster
Press [ENTER] to continue.
If browser window does not open automatically, open it by clicking on the link:
 http://127.0.0.1:56349/8942d876-08d2-5678-1234-bc38397e93s6
ERROR: Get "https://teleport.cluster.local/v2/authorities/host?load_keys=false": EOF

@deusxanima deusxanima added c-nt Internal Customer Reference c-gs Internal Customer Reference c-sr Internal Customer Reference c-fw Internal Customer Reference c-fwh Internal Customer Reference c-ex Internal Customer Reference and removed c-gs Internal Customer Reference c-fw Internal Customer Reference labels Mar 11, 2022
@pschisa pschisa added c-di Internal Customer Reference c-pg Internal Customer Reference labels Mar 15, 2022
probakowski added a commit that referenced this issue Mar 18, 2022
Currently we use a random auth server from the list, but if it's unavailable (for example, it was restarted but there is still an entry in the cache, DynamoDB backend, etc.) we return an error. This change tries all servers (in random order) and uses the first one that is available.

Closes #10019
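As an illustration of the approach described in the commit message, here is a minimal, hypothetical Go sketch (not the actual Teleport code; the function name and addresses are made up): shuffle the candidate auth servers and return the first one that can be dialed, rather than failing on a single randomly chosen entry that may be stale.

// Minimal sketch only, assuming plain TCP dialing; the real implementation
// lives in the Teleport codebase and differs in detail.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"net"
	"time"
)

// dialFirstAvailable tries every address in random order and returns the
// first connection that can be established within the timeout.
func dialFirstAvailable(addrs []string, timeout time.Duration) (net.Conn, error) {
	shuffled := append([]string(nil), addrs...)
	rand.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})

	var lastErr error
	for _, addr := range shuffled {
		conn, err := net.DialTimeout("tcp", addr, timeout)
		if err == nil {
			return conn, nil // first reachable auth server wins
		}
		lastErr = err // remember the failure and keep trying the rest
	}
	if lastErr == nil {
		lastErr = errors.New("no auth servers to dial")
	}
	return nil, fmt.Errorf("all auth servers failed: %w", lastErr)
}

func main() {
	// Hypothetical addresses; a stale DynamoDB entry would behave like the
	// first one here: still listed on the backend but no longer reachable.
	addrs := []string{"172.31.87.241:3025", "172.31.72.82:3025"}
	conn, err := dialFirstAvailable(addrs, 5*time.Second)
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}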
probakowski added further commits that referenced this issue Mar 31, 2022, each with the same message and cherry picked from commit 35a9bbc.