-
-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added dial/handshake timeout for non-responsive connections #273
base: master
Are you sure you want to change the base?
Conversation
Change ipc solaris to use SyscallConn which enables deadlines/close/abort to work
Hey Nick. I'll have to look over this a bit to fully understand what you're doing. Thanks for this. |
@@ -50,6 +51,10 @@ type dialer struct { | |||
// Dial implements the Dialer Dial method | |||
func (d *dialer) Dial() (transport.Pipe, error) { | |||
|
|||
// TODO: It might be good to pass a context here to abort the dial after |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that that would be a better approach.
@@ -1,3 +1,4 @@ | |||
//go:build solaris && cgo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my vscode editor added this
// This change was necessary to support SetDeadline and Close | ||
// which aborts pending reads/writes. | ||
// The prior code was calling c.File().Fd(). It was not closing | ||
// the file. Additionally the docs say Fd() causes deadlines to not work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I originally wrote this, the Control method didn't exist. I need to review the docs to understand what is happening here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here was an attempt to call Close on the connection without this change. Close blocked forever instead of aborting the read.
1 @ 0x43ce36 0x44d0be 0x44d095 0x4654c5 0x4b27cd 0x50ea18 0x519ae5 0x528177 0x4696c1
# 0x4654c4 internal/poll.runtime_Semacquire+0x24 c:/bsros/toolchain/3.4.1/lib/go/src/runtime/sema.go:67
# 0x4b27cc internal/poll.(*FD).Close+0x6c c:/bsros/toolchain/3.4.1/lib/go/src/internal/poll/fd_unix.go:116
# 0x50ea17 net.(*netFD).Close+0x37 c:/bsros/toolchain/3.4.1/lib/go/src/net/fd_posix.go:37
# 0x519ae4 net.(*conn).Close+0x44 c:/bsros/toolchain/3.4.1/lib/go/src/net/net.go:207
@@ -252,6 +273,8 @@ func (h *connHandshaker) Close() { | |||
h.closed = true | |||
h.cv.Broadcast() | |||
for conn := range h.workq { | |||
// This does not do anything because conn.open is | |||
// false and conn.Close is a no-op |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this comment is necessarily true. But I need to review ... it may be that I've missed something here. The intent is to be able to abort the handshaker via Close, and that should close any connections that might not be fully established yet. This code was "recently" (in terms of release numbers, but not necessarily dates) refactored to better deal with stalled handshakes, and I may have messed this up.
@@ -166,10 +171,26 @@ type connHeader struct { | |||
// As a side effect, the peer's protocol number is stored in the conn. | |||
// Also, various properties are initialized. | |||
func (p *conn) handshake() error { | |||
|
|||
// Rational for a timeout mechanism (which may need to move elsewhere) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Rational for a timeout mechanism (which may need to move elsewhere) | |
// Rationale for a timeout mechanism (which may need to move elsewhere) |
// if the socket remains open. Closing the dialer does not work | ||
// because the handshake is not complete (check for p.open). | ||
|
||
// TODO: Timeout should be configurable however that results in a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be configurable. The handshake should go quickly, and 5 seconds is more than enough time to deal with even the crappiest networks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HOWEVER, I don't think we want to keep this deadline in place once the handshake is completed. Because we could have large BULK transfers (think a 4 GB message, which is perfectly legal) that might take quite a while to send. Or the connection might be backed up due to other large messages in the pipe ahead of it. Therefore we need to ensure that the timeout does not persist once the handshake is complete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. That is why I defer SetDeadline to 0. That should disable the current deadline per docs "The deadline applies to all future and pending I/O". I figure the deadline set during handshake is done right after dialer connection creation which is prior to the conn being available for transport usage. After the dial/handshake completes other deadlines could be set on it specific to how mangos wants to use the connection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not completely certain that the changes requested here are sufficient -- I need to spend more time thinking about them -- but I believe that at a minimum we need to prevent the connection deadline from outliving the handshake.
// if the socket remains open. Closing the dialer does not work | ||
// because the handshake is not complete (check for p.open). | ||
|
||
// TODO: Timeout should be configurable however that results in a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HOWEVER, I don't think we want to keep this deadline in place once the handshake is completed. Because we could have large BULK transfers (think a 4 GB message, which is perfectly legal) that might take quite a while to send. Or the connection might be backed up due to other large messages in the pipe ahead of it. Therefore we need to ensure that the timeout does not persist once the handshake is complete.
Thanks for taking a look. Mangos has been great. |
Also change ipc solaris to use SyscallConn which enables deadlines/close/abort to work