Stale DynamoDB Auth Service Items #10019
Comments
Seems similar to #8793.
@corkrean Can you check if DynamoDB is being throttled or if there is enough capacity?
I was able to repro this today while working on another issue with a customer. This customer has a similar setup; the only difference is that they're using multiplexing on the proxies and EC2 ASGs for auth instances instead of containers. The backend is DynamoDB in both cases. Repro steps were as follows:
Trying to connect to SSH nodes leads to EOF messages (see logs below), and other resources like databases show the errors @corkrean linked above. As soon as I manually remove the stale auth entry in DynamoDB, everything stabilizes instantly and I can connect to everything again. Log snippets are listed below (slightly different from the ones linked above). I'm also happy to share my repro environment if it would be helpful to take a live look. Logs from the remaining auth/proxy instance:
Error seen in the Web UI when attempting to connect:
Logs from localhost when running
Currently we use a random auth server from the list, but if it's unavailable (for example, it was restarted but there's still an entry in the cache, DynamoDB backend, etc.) we return an error. This change tries all servers (in random order) and uses the first one that is available. Closes #10019
Currently we use a random auth server from the list, but if it's unavailable (for example, it was restarted but there's still an entry in the cache, DynamoDB backend, etc.) we return an error. This change tries all servers (in random order) and uses the first one that is available. Closes #10019 (cherry picked from commit 35a9bbc)
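A minimal sketch of the behavior described in the change above, assuming nothing about the actual Teleport implementation: the list of candidate auth server addresses is shuffled and each address is tried until a dial succeeds, so a stale entry left behind in the DynamoDB-backed cache no longer causes an immediate failure. All package and function names here are illustrative.

```go
// Illustrative only: not the actual Teleport code.
package authdial

import (
	"errors"
	"fmt"
	"math/rand"
	"net"
	"time"
)

// dialFirstAvailable shuffles the configured auth server addresses and
// returns the first connection that can be established. The previous
// behavior picked one random address and returned an error if that single
// server was unreachable (e.g. a restarted pod whose entry still exists
// in the DynamoDB backend).
func dialFirstAvailable(addrs []string, timeout time.Duration) (net.Conn, error) {
	if len(addrs) == 0 {
		return nil, errors.New("no auth servers configured")
	}
	// Shuffle a copy so the caller's slice is left untouched.
	shuffled := append([]string(nil), addrs...)
	rand.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	var lastErr error
	for _, addr := range shuffled {
		conn, err := net.DialTimeout("tcp", addr, timeout)
		if err == nil {
			return conn, nil
		}
		// A stale entry fails here; keep trying the remaining servers.
		lastErr = fmt.Errorf("dial %s: %w", addr, err)
	}
	return nil, lastErr
}
```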
A customer needs to frequently restart auth/proxy pods that are deployed by the teleport-cluster Helm chart. This results in stale DynamoDB auth service items that persist for a period typically not exceeding one hour. All issues are resolved when the stale auth service items are deleted from DynamoDB manually or when they expire. The following logs are observed while the issue is occurring:
Error from client:
Error from kube-agent:
gz#4346
gz#3996
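For reference, the manual workaround mentioned in the issue description (deleting a stale auth service item) might look roughly like the sketch below, using the AWS SDK for Go. The table name, the key attribute names (HashKey/FullPath), and the authservers key prefix are assumptions for illustration only; verify them against the actual backend table schema before deleting anything.

```go
// Hypothetical sketch: table name, attribute names, and key prefix are
// assumptions and must be checked against the real DynamoDB backend table.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

func main() {
	svc := dynamodb.New(session.Must(session.NewSession()))

	// List backend items whose key path looks like an auth server entry.
	out, err := svc.Scan(&dynamodb.ScanInput{
		TableName:        aws.String("teleport-backend"), // assumed table name
		FilterExpression: aws.String("begins_with(FullPath, :prefix)"),
		ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
			":prefix": {S: aws.String("teleport/authservers/")}, // assumed prefix
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, item := range out.Items {
		fmt.Println(aws.StringValue(item["FullPath"].S))
	}

	// Delete one stale entry after confirming it belongs to an auth
	// instance that no longer exists. Key attribute names are assumptions.
	if _, err := svc.DeleteItem(&dynamodb.DeleteItemInput{
		TableName: aws.String("teleport-backend"),
		Key: map[string]*dynamodb.AttributeValue{
			"HashKey":  {S: aws.String("teleport")},
			"FullPath": {S: aws.String("teleport/authservers/<stale-auth-id>")},
		},
	}); err != nil {
		log.Fatal(err)
	}
}
```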