x/crypto/ssh: Dial hangs in kexLoop indefinitely - ignoring ClientConfig.Timeout #51926
Comments
CC @FiloSottile
I have noticed that this seems to happen only on subsequent or concurrent connections. I have never been able to reproduce it on the first connection, so it may be related to #27140.
The underlying SSH connections are kept open and are reused across several SSH sessions. This is due to upstream issues in which concurrent/parallel SSH connections may lead to instability. golang/go#51926 golang/go#27140 Signed-off-by: Paulo Gomes <[email protected]>
Ensuring that the session's StdoutPipe is serviced quickly seems to resolve the problem, as mentioned in the crypto/ssh comments:
I ran into the issue that
diff --git a/ssh/client.go b/ssh/client.go
index 6fd1994..a1300eb 100644
--- a/ssh/client.go
+++ b/ssh/client.go
@@ -174,6 +174,14 @@ func Dial(network, addr string, config *ClientConfig) (*Client, error) {
 	if err != nil {
 		return nil, err
 	}
+	if config.Timeout > 0 {
+		if err := conn.SetDeadline(time.Now().Add(config.Timeout)); err != nil {
+			return nil, err
+		}
+		defer func() {
+			conn.SetDeadline(time.Time{})
+		}()
+	}
 	c, chans, reqs, err := NewClientConn(conn, addr, config)
 	if err != nil {
 		return nil, err
@@ -231,7 +239,7 @@ type ClientConfig struct {
 	// any of the CertAlgoXxxx and KeyAlgoXxxx constants.
 	HostKeyAlgorithms []string
 
-	// Timeout is the maximum amount of time for the TCP connection to establish.
+	// Timeout is the maximum amount of time to establish the connection.
 	//
 	// A Timeout of zero means no timeout.
 	Timeout time.Duration
is a reasonable way to address it? It fixes the issue for me with the hangs I was seeing. If there is a chance that upstream is interested I could do a PR with a proper test etc. but
This commit adds a thin wrapper around the real ssh.Dial() that additionally sets a deadline on the underlying connection. It is needed because an ssh.Dial() can happen right after the reboot command is issued. The net.Dial() itself is successful, but then during the ssh session setup the TCP connection ends because of the reboot. The golang "ssh" package has no concept of "ssh -o ServerAliveInterval=10" or similar, so the code will just hang in a read forever. This was observed running the spread "cerberus" tests on ubuntu 23.04. Note that half of the function is just a copy of golang.org/x/crypto/ssh/client.go:func Dial() and only the conn.SetDeadline() bits are new. See also e.g. golang/go#51926 for various bug reports about the golang "ssh" package and hangs.
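The deadline approach described in the patch and the commit message above can be sketched roughly as follows. This is only an illustration under stated assumptions: `dialWithDeadline`, `handshakeFunc` and `runDemo` are hypothetical names, and the `handshake` callback stands in for `ssh.NewClientConn` so that the sketch runs with the standard library alone. The demo server accepts the TCP connection and then stays silent, which mimics a peer that never completes the SSH handshake.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"time"
)

// handshakeFunc is a hypothetical stand-in for ssh.NewClientConn: any
// blocking protocol setup that reads from and writes to conn.
type handshakeFunc func(conn net.Conn) error

// dialWithDeadline (hypothetical helper) bounds the TCP connect by timeout
// and then bounds the handshake by a fresh deadline of the same length.
func dialWithDeadline(network, addr string, timeout time.Duration, handshake handshakeFunc) (net.Conn, error) {
	conn, err := net.DialTimeout(network, addr, timeout)
	if err != nil {
		return nil, err
	}
	if timeout > 0 {
		// From here on, every Read and Write on conn fails once the
		// deadline passes, including those issued inside handshake.
		if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
			conn.Close()
			return nil, err
		}
	}
	if err := handshake(conn); err != nil {
		conn.Close()
		return nil, err
	}
	// Clear the deadline so that later traffic is not cut off.
	if err := conn.SetDeadline(time.Time{}); err != nil {
		conn.Close()
		return nil, err
	}
	return conn, nil
}

// runDemo starts a local server that accepts a connection and never
// writes, then checks that the handshake fails with a deadline error.
func runDemo() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()

	stop := make(chan struct{})
	defer close(stop)
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		<-stop // hold the connection open without sending anything
		conn.Close()
	}()

	// A handshake that waits for the first byte from the server.
	blockingRead := func(conn net.Conn) error {
		_, err := conn.Read(make([]byte, 1))
		return err
	}

	start := time.Now()
	_, err = dialWithDeadline("tcp", ln.Addr().String(), 200*time.Millisecond, blockingRead)
	if !errors.Is(err, os.ErrDeadlineExceeded) {
		return fmt.Errorf("expected a deadline error, got %v", err)
	}
	if elapsed := time.Since(start); elapsed > 2*time.Second {
		return fmt.Errorf("handshake took too long to fail: %v", elapsed)
	}
	return nil
}

func main() {
	if err := runDemo(); err != nil {
		fmt.Println("demo failed:", err)
		os.Exit(1)
	}
	fmt.Println("handshake timed out as expected")
}
```

Because the deadline is set after the TCP connect has returned, the connect and the handshake each get their own budget, so the whole call can take up to roughly twice the timeout in the worst case.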
`ssh.Dial()` took in a context that was used to establish the TCP connection; however, that context doesn't cover the ssh handshake, which can easily block indefinitely. This approximates context support for ssh.NewClientConn() by having a goroutine listen for context cancellation and close the connection. We can then check for ctx.Err() and return that (i.e. if the context was canceled). Note that there is a `Timeout` field in `ssh.ClientConfig` but that also only covers the TCP connection. See golang/go#51926 Fixes: #53
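The context-based workaround described above might look roughly like the sketch below. It is not the code from that commit: `dialContext`, `handshakeFunc` and `runDemo` are hypothetical names, and the `handshake` callback again stands in for `ssh.NewClientConn` so that the example needs only the standard library. A watcher goroutine closes the connection when the context ends, which unblocks the pending read inside the handshake.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"os"
	"time"
)

// handshakeFunc is a hypothetical stand-in for ssh.NewClientConn.
type handshakeFunc func(conn net.Conn) error

// dialContext (hypothetical helper) applies ctx to the TCP connect and
// approximates context support for the handshake by closing conn when
// ctx is canceled or times out.
func dialContext(ctx context.Context, network, addr string, handshake handshakeFunc) (net.Conn, error) {
	var d net.Dialer
	conn, err := d.DialContext(ctx, network, addr)
	if err != nil {
		return nil, err
	}

	done := make(chan struct{})
	watcherExited := make(chan struct{})
	go func() {
		defer close(watcherExited)
		select {
		case <-ctx.Done():
			conn.Close() // unblocks any Read or Write inside handshake
		case <-done:
		}
	}()

	err = handshake(conn)
	close(done)
	<-watcherExited // the watcher can no longer touch conn after this

	// Prefer the context error: a canceled context usually surfaces in
	// the handshake as a less helpful "use of closed connection" error.
	if ctxErr := ctx.Err(); ctxErr != nil {
		conn.Close()
		return nil, ctxErr
	}
	if err != nil {
		conn.Close()
		return nil, err
	}
	return conn, nil
}

// runDemo connects to a local server that never writes and checks that
// the call returns the context error once the context times out.
func runDemo() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()

	stop := make(chan struct{})
	defer close(stop)
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		<-stop // hold the connection open without sending anything
		conn.Close()
	}()

	// A handshake that waits for the first byte from the server.
	blockingRead := func(conn net.Conn) error {
		_, err := conn.Read(make([]byte, 1))
		return err
	}

	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()

	_, err = dialContext(ctx, "tcp", ln.Addr().String(), blockingRead)
	if !errors.Is(err, context.DeadlineExceeded) {
		return fmt.Errorf("expected context.DeadlineExceeded, got %v", err)
	}
	return nil
}

func main() {
	if err := runDemo(); err != nil {
		fmt.Println("demo failed:", err)
		os.Exit(1)
	}
	fmt.Println("handshake was canceled by the context as expected")
}
```

Waiting for the watcher goroutine to exit before returning keeps it from closing a connection that has already been handed back to the caller.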
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?
Yes, as this is library related using version:
I can confirm the issue also happens with previous versions:

What operating system and processor architecture are you using (go env)?
go env Output

What did you do?
The application implements a golang ssh transport that hangs indefinitely at ssh.Dial every so often.
The current timeout is set to 30 seconds, which ssh.Dial does not uphold (https://github.com/fluxcd/source-controller/blob/main/pkg/git/libgit2/managed/ssh.go#L251-L255 https://github.com/fluxcd/source-controller/blob/main/pkg/git/libgit2/managed/init.go#L30).
This is a low concurrency (2-4 parallel workers) application which creates multiple ssh connections to execute simple git operations.
The ssh.Dial uses the ssh.ClientConfig as below:
Actual code can be seen at:
https://github.com/fluxcd/source-controller/blob/main/pkg/git/libgit2/managed/ssh.go#L166

What did you expect to see?
The ssh.Dial operation should return an error if the Dial operation took longer than the pre-configured timeout.

What did you see instead?
The goroutine hangs indefinitely.
pprof shows the culprit being: