-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rpc accept loop: added backoff on logging #4974
Conversation
…case there is a fast fail loop (NMD-1173)
nomad/rpc.go
Outdated
tempDelay = max | ||
} | ||
r.logger.Error("failed to accept RPC conn", "error", err, "delay", tempDelay) | ||
time.Sleep(tempDelay) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically we could also exit if the context is canceled, but with a max timeout of 1s it's not a huge deal.
select {
case <-ctx.Done():
case <-time.After(tempDelay):
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call, will do.
nomad/rpc.go
Outdated
} | ||
r.logger.Error("failed to accept RPC conn", "error", err, "delay", tempDelay) | ||
time.Sleep(tempDelay) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the open question is: What to do for non-Temporary() errors?
We handle Temporary above fine, but we tight loop on permanent errors which seems bad. I think we exit, but that requires trusting net.Error#Temporary
a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking through the net
code. It looks like a non temporary failure is fatal in all the cases I'll traced.
@schmichael Would it be possible/good idea to restart the listener? If we error at net.Listen then its pretty safe to say we're hosed and should shutdown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@schmichael Would it be possible/good idea to restart the listener? If we error at net.Listen then its pretty safe to say we're hosed and should shutdown.
Possible? Yes, although it's not necessary for this PR or to fix the underlying issue.
Good idea? In theory yes, in practice I don't think it matters.
Ideally on permanent errors we don't spin at all and wait for a SIGHUP to hopefully change things and create new, valid listeners. In reality I'm not sure anything SIGHUP changes can impact permanent errors. They seem to mostly deal with conditions that are programming errors like trying to Accept on a non-socket FD, invalid FD, negative listen backlog, etc.
The only permanent error I even think is possible to encounter in Nomad is EINVAL
because we've closed the listening socket on shutdown. We already check for the shutdown condition, so we'll never need to rely on the error check in that case.
Since as far as I know permanent errors are unreachable, I'm not too worried about how we handle them. It should never matter, but we should be conservative in our approach in case it does.
…t() errors as well (with a higher delay cap)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job @cgbaker!
I've added a few comments to define some consts to get rid of the magic numbers.
nomad/rpc.go
Outdated
case <-ctx.Done(): | ||
return | ||
default: | ||
if tempDelay == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this code is just complex enough to warrant it's own r.handleAcceptErr(err)
func. It looks to me we're safe to stick the tempDelay into the struct (although I think it should be renamed to be more descriptive), and do the sleep in the new func.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nickethier , the return on r.shutdown
and the continue (if we keep it) have to stay inside if the if err
block, so only the sleep logic can go into handleAcceptErr
. however, the r.acceptLoopDelay
has to stay outside of the if block to reset the counter if err == nil
, and i would be worried about the logic for acceptLoopDelay
living in two places.
for example:
https://gist.github.com/cgbaker/ab90f40bcbdd97c28c057e768f39bfc9
nomad/rpc.go
Outdated
@@ -84,6 +84,7 @@ type RPCContext struct { | |||
// listen is used to listen for incoming RPC connections | |||
func (r *rpcHandler) listen(ctx context.Context) { | |||
defer close(r.listenerCh) | |||
var tempDelay time.Duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to be a more descriptive name. acceptLoopDelay
or something along those lines
nomad/rpc.go
Outdated
return | ||
default: | ||
if tempDelay == 0 { | ||
tempDelay = 5 * time.Millisecond |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets use a const, ex:
baseAcceptLoopDelay = 5 * time.Millisecond
nomad/rpc.go
Outdated
} | ||
|
||
r.logger.Error("failed to accept RPC conn", "error", err) | ||
maxDelay := 5 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const:
maxAcceptLoopDelay = 5 * time.Second
nomad/rpc.go
Outdated
r.logger.Error("failed to accept RPC conn", "error", err) | ||
maxDelay := 5 * time.Second | ||
if ne, ok := err.(net.Error); ok && ne.Temporary() { | ||
maxDelay = 1 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const
maxAcceptLoopTemporaryDelay = 5 * time.Second
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic looks great. Sorry for the long road to get here. 😅
My comments are just Go idioms and style, so fixup and merge as you see fit! No blockers.
nomad/rpc.go
Outdated
|
||
go r.handleConn(ctx, conn, &RPCContext{Conn: conn}) | ||
metrics.IncrCounter([]string{"nomad", "rpc", "accept_conn"}, 1) | ||
} | ||
} | ||
|
||
// Sleep to avoid spamming the log, with a maximum delay according to whether or not the error is temporary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go has an idiom of starting method comments with the name of the method. No big deal.
// Sleep to avoid spamming the log, with a maximum delay according to whether or not the error is temporary | |
// handleAcceptErr sleeps to avoid spamming the log, with a maximum delay | |
// according to whether or not the error is temporary. |
nomad/rpc.go
Outdated
|
||
go r.handleConn(ctx, conn, &RPCContext{Conn: conn}) | ||
metrics.IncrCounter([]string{"nomad", "rpc", "accept_conn"}, 1) | ||
} | ||
} | ||
|
||
// Sleep to avoid spamming the log, with a maximum delay according to whether or not the error is temporary | ||
func (r *rpcHandler) handleAcceptErr(err error, ctx context.Context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go has an idiom of making ctx
the first argument in method calls.
nomad/rpc.go
Outdated
func (r *rpcHandler) handleAcceptErr(err error, ctx context.Context) { | ||
const baseAcceptLoopDelay = 5 * time.Millisecond | ||
const maxAcceptLoopDelay = 5 * time.Second | ||
const maxAcceptLoopDelayTemporaryError = 1 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can shorten the names since they're locals: baseDelay, maxDelay, maxDelayTempError
No need to change.
nomad/rpc.go
Outdated
gologger *golog.Logger | ||
logger log.Logger | ||
gologger *golog.Logger | ||
acceptLoopDelay time.Duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Define this within listen()
and pass it to the error handler method unless you put it here for testing.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
see NMD-1173 or #3686 (comment)
adds throttling to error logging on the Accept() loop in the RPC listener, to prevent from overloading the disk in case some problem (out of file handles, out of memory) causes a tight loop
this change is also needed in memberlist