-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rpc accept loop: added backoff on logging #4974
Changes from 1 commit
22a4bcd
591cdb0
bbe531c
1a34550
44a6d9d
33b9f9d
3ee692c
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -84,6 +84,7 @@ type RPCContext struct { | |
// listen is used to listen for incoming RPC connections | ||
func (r *rpcHandler) listen(ctx context.Context) { | ||
defer close(r.listenerCh) | ||
var tempDelay time.Duration | ||
for { | ||
select { | ||
case <-ctx.Done(): | ||
|
@@ -105,9 +106,21 @@ func (r *rpcHandler) listen(ctx context.Context) { | |
default: | ||
} | ||
|
||
r.logger.Error("failed to accept RPC conn", "error", err) | ||
if ne, ok := err.(net.Error); ok && ne.Temporary() { | ||
if tempDelay == 0 { | ||
tempDelay = 5 * time.Millisecond | ||
} else { | ||
tempDelay *= 2 | ||
} | ||
if max := 1 * time.Second; tempDelay > max { | ||
tempDelay = max | ||
} | ||
r.logger.Error("failed to accept RPC conn", "error", err, "delay", tempDelay) | ||
time.Sleep(tempDelay) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Technically we could also exit if the context is canceled, but with a max timeout of 1s it's not a huge deal. select {
case <-ctx.Done():
case <-time.After(tempDelay):
} There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. good call, will do. |
||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So the open question is: What to do for non-Temporary() errors? We handle Temporary above fine, but we tight loop on permanent errors which seems bad. I think we exit, but that requires trusting There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looking through the @schmichael Would it be possible/good idea to restart the listener? If we error at net.Listen then its pretty safe to say we're hosed and should shutdown. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Possible? Yes, although it's not necessary for this PR or to fix the underlying issue. Ideally on permanent errors we don't spin at all and wait for a SIGHUP to hopefully change things and create new, valid listeners. In reality I'm not sure anything SIGHUP changes can impact permanent errors. They seem to mostly deal with conditions that are programming errors like trying to Accept on a non-socket FD, invalid FD, negative listen backlog, etc. The only permanent error I even think is possible to encounter in Nomad is Since as far as I know permanent errors are unreachable, I'm not too worried about how we handle them. It should never matter, but we should be conservative in our approach in case it does. |
||
continue | ||
} | ||
tempDelay = 0 | ||
|
||
go r.handleConn(ctx, conn, &RPCContext{Conn: conn}) | ||
metrics.IncrCounter([]string{"nomad", "rpc", "accept_conn"}, 1) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to be a more descriptive name.
acceptLoopDelay
or something along those lines