Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added dial/handshake timeout for non-responsive connections #273

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

nmezei
Copy link

@nmezei nmezei commented Feb 1, 2023

Also change ipc solaris to use SyscallConn which enables deadlines/close/abort to work

mezei added 2 commits February 1, 2023 17:07
Change ipc solaris to use SyscallConn which enables deadlines/close/abort to work
@gdamore
Copy link
Contributor

gdamore commented Feb 1, 2023

Hey Nick. I'll have to look over this a bit to fully understand what you're doing. Thanks for this.

@@ -50,6 +51,10 @@ type dialer struct {
// Dial implements the Dialer Dial method
func (d *dialer) Dial() (transport.Pipe, error) {

// TODO: It might be good to pass a context here to abort the dial after
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that that would be a better approach.

@@ -1,3 +1,4 @@
//go:build solaris && cgo
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my vscode editor added this

// This change was necessary to support SetDeadline and Close
// which aborts pending reads/writes.
// The prior code was calling c.File().Fd(). It was not closing
// the file. Additionally the docs say Fd() causes deadlines to not work.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I originally wrote this, the Control method didn't exist. I need to review the docs to understand what is happening here.

Copy link
Author

@nmezei nmezei Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here was an attempt to call Close on the connection without this change. Close blocked forever instead of aborting the read.

1 @ 0x43ce36 0x44d0be 0x44d095 0x4654c5 0x4b27cd 0x50ea18 0x519ae5 0x528177 0x4696c1
#       0x4654c4        internal/poll.runtime_Semacquire+0x24                           c:/bsros/toolchain/3.4.1/lib/go/src/runtime/sema.go:67
#       0x4b27cc        internal/poll.(*FD).Close+0x6c                                  c:/bsros/toolchain/3.4.1/lib/go/src/internal/poll/fd_unix.go:116
#       0x50ea17        net.(*netFD).Close+0x37                                         c:/bsros/toolchain/3.4.1/lib/go/src/net/fd_posix.go:37
#       0x519ae4        net.(*conn).Close+0x44                                          c:/bsros/toolchain/3.4.1/lib/go/src/net/net.go:207

@@ -252,6 +273,8 @@ func (h *connHandshaker) Close() {
h.closed = true
h.cv.Broadcast()
for conn := range h.workq {
// This does not do anything because conn.open is
// false and conn.Close is a no-op
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this comment is necessarily true. But I need to review ... it may be that I've missed something here. The intent is to be able to abort the handshaker via Close, and that should close any connections that might not be fully established yet. This code was "recently" (in terms of release numbers, but not necessarily dates) refactored to better deal with stalled handshakes, and I may have messed this up.

@@ -166,10 +171,26 @@ type connHeader struct {
// As a side effect, the peer's protocol number is stored in the conn.
// Also, various properties are initialized.
func (p *conn) handshake() error {

// Rational for a timeout mechanism (which may need to move elsewhere)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Rational for a timeout mechanism (which may need to move elsewhere)
// Rationale for a timeout mechanism (which may need to move elsewhere)

// if the socket remains open. Closing the dialer does not work
// because the handshake is not complete (check for p.open).

// TODO: Timeout should be configurable however that results in a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be configurable. The handshake should go quickly, and 5 seconds is more than enough time to deal with even the crappiest networks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HOWEVER, I don't think we want to keep this deadline in place once the handshake is completed. Because we could have large BULK transfers (think a 4 GB message, which is perfectly legal) that might take quite a while to send. Or the connection might be backed up due to other large messages in the pipe ahead of it. Therefore we need to ensure that the timeout does not persist once the handshake is complete.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. That is why I defer SetDeadline to 0. That should disable the current deadline per docs "The deadline applies to all future and pending I/O". I figure the deadline set during handshake is done right after dialer connection creation which is prior to the conn being available for transport usage. After the dial/handshake completes other deadlines could be set on it specific to how mangos wants to use the connection.

Copy link
Contributor

@gdamore gdamore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not completely certain that the changes requested here are sufficient -- I need to spend more time thinking about them -- but I believe that at a minimum we need to prevent the connection deadline from outliving the handshake.

// if the socket remains open. Closing the dialer does not work
// because the handshake is not complete (check for p.open).

// TODO: Timeout should be configurable however that results in a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HOWEVER, I don't think we want to keep this deadline in place once the handshake is completed. Because we could have large BULK transfers (think a 4 GB message, which is perfectly legal) that might take quite a while to send. Or the connection might be backed up due to other large messages in the pipe ahead of it. Therefore we need to ensure that the timeout does not persist once the handshake is complete.

@nmezei
Copy link
Author

nmezei commented Feb 2, 2023

Thanks for taking a look. Mangos has been great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants