Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client: backoff before reconnecting if an HTTP2 server preface was not received #1648

Merged
merged 11 commits into from
Dec 1, 2017

Conversation

MakMukhi
Copy link
Contributor

@MakMukhi MakMukhi commented Nov 3, 2017

fixes #1444

@dfawley dfawley changed the title Client must backoff before retrying if an HTTP2 connection wasn't established( server preface wasn't received). client: backoff before reconnecting if an HTTP2 server preface was not received Nov 3, 2017
@dfawley dfawley self-requested a review November 6, 2017 22:07
@MakMukhi MakMukhi assigned dfawley and unassigned MakMukhi Nov 7, 2017
@MakMukhi MakMukhi force-pushed the backoff_please branch 3 times, most recently from 9a60fe7 to 31151d2 Compare November 9, 2017 18:49
clientconn.go Outdated
@@ -114,6 +115,16 @@ func UseCompressor(name string) CallOption {
})
}

// WithWaitForServerSettings makes the Dial wait until the client receives initial settings (server preface)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

With...Settings blocks until the initial settings frame is received from the
server before assigning RPCs to the connection.

Also, how about WithWaitForHandshake? I think I'll be renaming the ServerOption to HandshakeTimeout to match Java/C.

clientconn.go Outdated
// ready is closed and becomes nil when a new transport is up or failed
// due to timeout.
ready chan struct{}
transport transport.ClientTransport

// The reason this addrConn is torn down.
tearDownErr error

retries int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please give this a more specific name; I'm implementing a feature called "retry" too. 😄

clientconn.go Outdated
// backoffDeadline is the time until which resetTransport needs to
// wait before increasing retries count.
backoffDeadline time.Time
// connectDeadline is the time by which dial and tls handshake should
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and also the server preface, right?

How about "connectDeadline is the time by which all connection negotiation must complete"?

clientconn.go Outdated
// The created transport must receive initial settings frame from the server.
// In case that doesnt happen, transportMonitor will kill the newly created
// transport after connectDeadline has expired.
// In case there was an error on the tranpsort before the settings frame was
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*transport

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

clientconn.go Outdated
// transport after connectDeadline has expired.
// In case there was an error on the tranpsort before the settings frame was
// received, resetTransport resumes connecting to backends after the one that
// is previously connected to. In case end of the list is reached, resetTransport
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is->was

clientconn.go Outdated
newTransport, err := transport.NewClientTransport(connectCtx, ac.cc.ctx, sinfo, copts, func() {
close(done)
ac.mu.Lock()
if !ac.backoffDeadline.IsZero() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be faster (and is simpler, code-wise) to just clear everything unconditionally.

clientconn.go Outdated
ac.mu.Unlock()
// Block until we receive a goaway or an error occurs.
select {
case <-t.GoAway():
case <-t.Error():
case <-cdeadline:
timer.Stop()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timer fired; stopping it is unnecessary.

clientconn.go Outdated
timer = nil
// No server preface received until deadline.
// Kill the connection.
grpclog.Errorf("grpc: addrConn.transportMonitor didn't get server preface after waiting. Closing the new transport now.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning?

bootstrapAddrs []resolver.Address
}

// BootstrapWithAddrs adds resolved addresses to the resolver so that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"InitialAddrs"?

vet.sh Outdated
@@ -64,7 +64,7 @@ trap cleanup EXIT
git ls-files "*.go" | xargs sed -i 's:"golang.org/x/net/context":"context":'
set +o pipefail
# TODO: Stop filtering pb.go files once golang/protobuf#214 is fixed.
go tool vet -all . 2>&1 | grep -vF '.pb.go:' | tee /dev/stderr | (! read)
go tool vet -all . 2>&1 | grep -vE 'cancel (function|var)' | grep -vF '.pb.go:' | tee /dev/stderr | (! read)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are you filtering out with this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a successful dial we can't call cancel on the context used for it's dialing because of Go 1.6 issue. But Go 1.9 vet catches that and throws an error with a message that this grep captures.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to do this only for the file it was in, IIRC. Can you make this grep for 'clientconn.go:.*cancel ........ ?

clientconn.go Outdated
// It returns true if a connection was established.
func (ac *addrConn) createTransport(connectRetryNum int, backoffDeadline, connectDeadline time.Time, addrs []resolver.Address, prevAddr resolver.Address, copts transport.ConnectOptions) (bool, error) {
startFrom := 0
for idx, addr := range addrs {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put this inside a if prevAddr != resolver.Address{} for clarity?

Could prevAddr be an int instead to avoid the need for this search? If the address list is changed during a connection attempt, which is pretty unlikely, we could just restart from the top of the list and reset the backoff - what do you think?

// ac.tearDown(...) has been invoked.
ac.mu.Unlock()
return errConnClosing
if ac.state != connectivity.Shutdown {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be TransientFailure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the state was Shutdown we don't want to put it to TransientFailure

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. If we are here the only possibilities are Connecting or Shutdown.

clientconn.go Outdated
case <-done:
case <-connectCtx.Done():
// Didn't receive server preface, must kill this new
// transport now.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Unwrap please.

clientconn.go Outdated
case <-connectCtx.Done():
// Didn't receive server preface, must kill this new
// transport now.
grpclog.Warningf("grpc: addrConn.createTransport will close the newly established transport since no server preface was received.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about: grpc: addrConn.createTransport failed to receive server preface before deadline ?

clientconn.go Outdated
// transport now.
grpclog.Warningf("grpc: addrConn.createTransport will close the newly established transport since no server preface was received.")
newTr.Close()
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not return false?

clientconn.go Outdated
ac.cc.handleSubConnStateChange(ac.acbw, ac.state)
if ac.transport != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete these lines; ac.transport must be nil.

select {
case <-sent:
default:
t.Fatalf("Dial returned before server settings were sent")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this still a little racy? We write the settings immediately after closing sent. I think we should sleep for a little bit after closing sent, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dial waits until the settings are sent and the channel is closed before they are.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see.. So it is still possible it returned before the settings were sent in this case, because we close sent before sending, but it's unlikely because of the sleep before sending.

vet.sh Outdated
@@ -64,7 +64,7 @@ trap cleanup EXIT
git ls-files "*.go" | xargs sed -i 's:"golang.org/x/net/context":"context":'
set +o pipefail
# TODO: Stop filtering pb.go files once golang/protobuf#214 is fixed.
go tool vet -all . 2>&1 | grep -vF '.pb.go:' | tee /dev/stderr | (! read)
go tool vet -all . 2>&1 | grep -vE 'cancel (function|var)' | grep -vF '.pb.go:' | tee /dev/stderr | (! read)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to do this only for the file it was in, IIRC. Can you make this grep for 'clientconn.go:.*cancel ........ ?

Copy link
Member

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 nit and 1 minor cleanup, otherwise LGTM.

// ac.tearDown(...) has been invoked.
ac.mu.Unlock()
return errConnClosing
if ac.state != connectivity.Shutdown {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. If we are here the only possibilities are Connecting or Shutdown.

select {
case <-sent:
default:
t.Fatalf("Dial returned before server settings were sent")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see.. So it is still possible it returned before the settings were sent in this case, because we close sent before sending, but it's unlikely because of the sleep before sending.

clientconn.go Outdated
state connectivity.State
mu sync.Mutex
curAddr resolver.Address
reconnectIdx int // The index in adder list to start reconnecting from.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "addrs"

clientconn.go Outdated
case <-cdeadline:
var isConnected bool
ac.mu.Lock()
if ac.backoffDeadline.IsZero() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isConnected := ac.backoffDeadline.IsZero() // Client received server preface.

@MakMukhi MakMukhi merged commit ddbb27e into grpc:master Dec 1, 2017
@dfawley dfawley added the Type: Behavior Change Behavior changes not categorized as bugs label Jan 2, 2018
@dfawley dfawley added this to the 1.9 Release milestone Jan 2, 2018
@MakMukhi MakMukhi deleted the backoff_please branch May 4, 2018 01:53
dfawley added a commit to dfawley/grpc that referenced this pull request Oct 25, 2018
When security is disabled, not waiting for the HTTP/2 handshake can lead to
DoS-style behavior.  For details, see:
grpc/grpc-go#954.  This requirement will incur an
extra half-RTT latency before the first RPC can be sent under plaintext, but
this is negligible and unencrypted connections are rarer than secure ones.

Under TLS, the server will effectively send its part of the HTTP/2 handshake
along with its final TLS "server finished" message, which the client must wait
for before transmitting any data securely.  This means virtually no extra
latency is incurred by this requirement.

Go had attempted to separate "connection ready" with "connection successful"
(Issue: grpc/grpc-go#1444 PR:
grpc/grpc-go#1648).  However, this is confusing to
users and introduces an arbitrary distinction between these two events.  It has
led to several bugs in our reconnection logic (e.g.s
grpc/grpc-go#2380,
grpc/grpc-go#2391,
grpc/grpc-go#2392), due to the complexity, and it makes
custom transports (grpc/proposal#103) more difficult
for users to implement.

We are aware of some use cases (in particular,
https://github.com/soheilhy/cmux) expecting the behavior of transmitting an RPC
before the HTTP/2 handshake is completed.  Before making behavior changes to
implement this, we will reach out to our users to the best of our abilities.
@lock lock bot locked as resolved and limited conversation to collaborators Jan 18, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Type: Behavior Change Behavior changes not categorized as bugs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Client connection phase should optionally wait for SETTINGS frame and set deadlines
3 participants