Collect driver stats when driver plugins are restarted #5948
Conversation
The driver plugin stub client must call `grpcutils.HandleGrpcErr` to handle plugin shutdown, as the other stub functions do. This ensures that TaskStats returns `ErrPluginShutdown` when the plugin shuts down.
Previously, if the stats channel was closed, we retried the Stats call; but if that retry failed, we entered a backoff loop without ever calling Stats again. Here, we use a utility function for the driverHandle.Stats call that retries as one would expect. I aimed to preserve the logging formats but made small improvements where I saw fit.
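For illustration, here is a minimal, self-contained sketch of the kind of translation `HandleGrpcErr` is described as doing here. The helper body, the context plumbing, and the specific gRPC codes below are my assumptions, not Nomad's actual implementation:

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errPluginShutdown stands in for the base structs' ErrPluginShutdown.
var errPluginShutdown = errors.New("plugin is shut down")

// handleGrpcErr sketches the translation described above: if the plugin's
// done context has fired, or the gRPC status says the server went away,
// surface errPluginShutdown instead of a raw transport error so callers
// like TaskStats can react to a restart.
func handleGrpcErr(err error, pluginDoneCtx context.Context) error {
	if err == nil {
		return nil
	}
	if pluginDoneCtx.Err() != nil {
		return errPluginShutdown
	}
	if s, ok := status.FromError(err); ok && s.Code() == codes.Unavailable {
		return errPluginShutdown
	}
	return err
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the plugin's done context firing on shutdown
	err := handleGrpcErr(status.Error(codes.Unavailable, "transport is closing"), ctx)
	fmt.Println(errors.Is(err, errPluginShutdown)) // true
}
```

With a translation like this in place, the stats hook can distinguish a restarting plugin from a genuine stats failure.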
```go
// on the stop channel is the correct behavior
if err != bstructs.ErrPluginShutdown {
	h.logger.Debug("error fetching stats of task", "error", err)
	goto RETRY
```
In my rewrite, I flipped this conditional logic. The comment doesn't match the conditional. I assume the comment intends to say that if the plugin is shut down, we want to log at debug level rather than at the warn level used in L130. If so, the check here should be `err == bstructs.ErrPluginShutdown`.
I think the comment is correct - it's saying that if the err is `ErrPluginShutdown` we should not be logging, because that logging is unnecessary. I am not sure about retrying here vs. one layer up, by calling TaskStats again once the plugin restarts, though.
Not an expert in this code though, so @nickethier should chime in.
In the old code, if `err != bstructs.ErrPluginShutdown` we log at debug level in L121, but if it is `ErrPluginShutdown`, then we log at warn level in L130. I'm having a hard time reconciling that with the comment.
The old logic for logging is in:
nomad/client/allocrunner/taskrunner/stats_hook.go
Lines 120 to 130 in 8da0b91
```go
if err != bstructs.ErrPluginShutdown {
	h.logger.Debug("error fetching stats of task", "error", err)
	goto RETRY
}

// check if the error is terminal otherwise it's likely a
// transport error and we should retry
re, ok = err.(*structs.RecoverableError)
if ok && re.IsUnrecoverable() {
	return
}

h.logger.Warn("stats collection for task failed", "error", err)
```
```go
limit := time.Second * 5
backoff := 1 << (2 * uint64(retry)) * time.Second
if backoff > limit || retry > 5 {
```
I added the retry check mostly to avoid dealing with integer overflow. If the retry count is large enough, the shift wraps around and backoff becomes a negative value! You can see the effect in https://play.golang.org/p/JYvHeAsSTUk .
The backoff expression produces the following values for each retry count:
| retry | backoff |
|------:|--------:|
| 0 | 1s |
| 1 | 4s |
| 2 | 16s |
| 3 | 1m4s |
| 4 | 4m16s |
| 5 | 17m4s |
| 6 | 1h8m16s |
| 7 | 4h33m4s |
| 8 | 18h12m16s |
| 9 | 72h49m4s |
| 10 | 291h16m16s |
| 11 | 1165h5m4s |
| 12 | 4660h20m16s |
| 13 | 18641h21m4s |
| 14 | 74565h24m16s |
| 15 | 298261h37m4s |
| 16 | 1193046h28m16s |
| 17 | -351909h41m29.709551616s |
| 18 | -1407638h45m58.838206464s |
| 19 | -506459h29m21.64327424s |
Here, I avoided adding a cap on the total number of retries, to minimize changes compared to the old code.
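To make the wraparound concrete, here is a small standalone reproduction (my own program using the same expression as the quoted snippet, not the hook's actual code) that prints both the raw value and what the guard settles on:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	limit := time.Second * 5

	for retry := 0; retry < 20; retry++ {
		// Same expression as the quoted snippet: effectively 4^retry seconds.
		raw := 1 << (2 * uint64(retry)) * time.Second

		// Around retry 17 the multiplication overflows int64 nanoseconds and
		// raw wraps negative, so a plain `raw > limit` check stops capping it.
		capped := raw
		if capped > limit || retry > 5 {
			capped = limit
		}
		fmt.Printf("retry=%2d raw=%-28v capped=%v\n", retry, raw, capped)
	}
}
```

Without the extra `retry > 5` guard, the negative values at retry 17 and above would slip past the `backoff > limit` comparison and be handed to the sleep.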
LGTM, one small comment about error handling.
```go
// check if the error is terminal otherwise it's likely a
// transport error and we should retry
if re, ok := err.(*structs.RecoverableError); ok && re.IsUnrecoverable() {
	return nil, err
```
This error gets swallowed. Is it useful enough to log, even at debug level? When `callStatsWithRetry` is called and returns an error, the caller just returns.
Yes, we should! I logged them as errors, just like the other case where we retry.
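For context, a self-contained sketch of the retry shape being discussed in this thread. The names and signatures below (`callWithRetry`, the interface check) are assumptions for illustration, not the PR's actual `callStatsWithRetry`:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

// unrecoverable stands in for a *structs.RecoverableError that reports
// IsUnrecoverable() == true.
type unrecoverable struct{ error }

func (u unrecoverable) IsUnrecoverable() bool { return true }

// callWithRetry sketches the wrapper discussed above: transient errors are
// logged and retried with a capped backoff, while terminal errors are logged
// before being returned so they are not silently swallowed by the caller.
func callWithRetry(ctx context.Context, call func() error) error {
	limit := time.Second * 5
	for retry := 0; ; retry++ {
		err := call()
		if err == nil {
			return nil
		}

		var term interface{ IsUnrecoverable() bool }
		if errors.As(err, &term) && term.IsUnrecoverable() {
			log.Printf("stats collection for task failed: %v", err) // log before returning
			return err
		}

		log.Printf("retrying stats collection: %v", err)
		backoff := 1 << (2 * uint64(retry)) * time.Second
		if backoff > limit || retry > 5 {
			backoff = limit
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
	}
}

func main() {
	err := callWithRetry(context.Background(), func() error {
		return unrecoverable{errors.New("boom")}
	})
	fmt.Println("returned:", err)
}
```

The point of interest for this thread is the log line just before the early return on the unrecoverable path.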
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Fixes #5936.
This fixes two issues in handling stats when a driver plugin is restarted.
Interestingly, there are two retry layers for driver handle calls: once in LazyHandle, which special-cases plugin shutdown, and once again in the stats hook. Due to the bugs above, both layers fail to recover. The LazyHandle logic is here:
nomad/client/allocrunner/taskrunner/lazy_handle.go
Lines 141 to 149 in 8da0b91
I considered, but decided against, making a more substantial refactoring in LazyHandle. LazyHandle tracks its own handle with its own locking. We could probably simplify that logic to always call tr.getDriverHandle, which is a simple synchronized field lookup - the refresh logic is probably unnecessary. I punted on that for another PR, as I want to keep this one less risky.
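To illustrate the simplification being floated here, a sketch with stub types and assumed names (not Nomad's actual TaskRunner or handle API): every call re-reads the current handle under the runner's lock, so there is nothing to refresh when the plugin restarts.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Stub stand-ins so the sketch compiles on its own.
type Usage struct{ CPUPercent float64 }

type DriverHandle struct{}

func (h *DriverHandle) Stats(ctx context.Context, interval time.Duration) (<-chan *Usage, error) {
	ch := make(chan *Usage)
	close(ch) // a real handle would stream usage samples here
	return ch, nil
}

// TaskRunner mimics the synchronized field lookup mentioned above: after a
// plugin restart, the runner swaps in the new handle under its lock.
type TaskRunner struct {
	mu     sync.Mutex
	handle *DriverHandle
}

func (tr *TaskRunner) getDriverHandle() *DriverHandle {
	tr.mu.Lock()
	defer tr.mu.Unlock()
	return tr.handle
}

// stats always looks up the current handle, so a restarted plugin's new
// handle is picked up on the next call without any refresh bookkeeping.
func stats(ctx context.Context, tr *TaskRunner, interval time.Duration) (<-chan *Usage, error) {
	h := tr.getDriverHandle()
	if h == nil {
		return nil, errors.New("task has no driver handle")
	}
	return h.Stats(ctx, interval)
}

func main() {
	tr := &TaskRunner{handle: &DriverHandle{}}
	if _, err := stats(context.Background(), tr, time.Second); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("stats call succeeded")
}
```

Whether that holds up against LazyHandle's actual responsibilities is exactly the question punted to the follow-up PR.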