
Collect driver stats when driver plugins are restarted #5948

Merged: 3 commits into master from b-stats-recover-plugin-shutdown on Jul 17, 2019

Conversation

@notnoop (Contributor) commented on Jul 11, 2019

Fixes #5936.

This fixes two issues in handling stats when a driver plugin is restarted:

  • a bug where we don't properly detect that the driver plugin was shut down during a TaskStats call, so we don't retry properly;
  • a bug in StatsHook where the retry logic never retries the Stats call, so it loops and then waits until the context is done.

Interestingly, there are two retry layers for driver handle calls: once in LazyHandle, which special-cases plugin shutdown, and once again in the hook. Due to the bugs above, both layers fail to recover. The LazyHandle logic is:

// TRY is a label defined earlier in the surrounding function (not shown here).
out, err := h.Stats(ctx, interval)
if err == bstructs.ErrPluginShutdown && first {
    first = false
    h, err = l.refreshHandle()
    if err == nil {
        goto TRY
    }
}

I considered, but decided against, making a more substantial refactoring in LazyHandle. LazyHandle tracks its own handles with its own locking. We could probably simplify the logic to always call tr.getDriverHandle, which is a simple synchronized field lookup; the refresh logic is probably unnecessary. I punted on that to another PR, as I want to keep this PR less risky.

Mahmood Ali added 2 commits on July 11, 2019 at 13:57

The driver plugin stub client must call `grpcutils.HandleGrpcErr` to handle plugin shutdown, as other functions do. This ensures that TaskStats returns `ErrPluginShutdown` when the plugin is shut down.
Previously, if the channel was closed, we retried the Stats call. But if that call failed, we went into a backoff loop without ever calling Stats again.

Here, we use a utility function for the driverHandle.Stats call that retries as one would expect.

I aimed to preserve the logging formats but made small improvements as I saw fit.
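
For illustration, here is a minimal sketch of the retry shape the second commit describes: the Stats call itself lives inside the retry loop, plugin-shutdown errors are treated as expected, and we back off between attempts. All identifiers here (statsretry, errPluginShutdown, statsFn, fetchWithRetry) are hypothetical stand-ins, not the actual Nomad code.

package statsretry

import (
    "context"
    "errors"
    "log"
    "time"
)

// errPluginShutdown is a stand-in for bstructs.ErrPluginShutdown.
var errPluginShutdown = errors.New("plugin shutdown")

// statsFn is a stand-in for a call to driverHandle.Stats.
type statsFn func(ctx context.Context) (usage string, err error)

// fetchWithRetry keeps calling fetch until it succeeds or the context is
// cancelled, backing off between attempts. The point of the fix described
// above is that the call sits inside the loop, so a single failure never
// leaves us backing off forever without invoking Stats again.
func fetchWithRetry(ctx context.Context, fetch statsFn) (string, error) {
    backoff := time.Second
    const limit = 5 * time.Second

    for {
        usage, err := fetch(ctx)
        if err == nil {
            return usage, nil
        }
        if errors.Is(err, errPluginShutdown) {
            // Plugin restarts are expected; log quietly and try again.
            log.Printf("DEBUG: plugin shut down during stats call: %v", err)
        } else {
            log.Printf("WARN: error fetching stats: %v", err)
        }

        select {
        case <-ctx.Done():
            return "", ctx.Err()
        case <-time.After(backoff):
        }
        if backoff *= 2; backoff > limit {
            backoff = limit
        }
    }
}
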
// on the stop channel is the correct behavior
if err != bstructs.ErrPluginShutdown {
    h.logger.Debug("error fetching stats of task", "error", err)
    goto RETRY

@notnoop (Contributor, author) commented:

In my rewrite, I flipped this conditional logic. The comment doesn't match the conditional. I assume the comment intends to say that if the plugin is shut down, we want to log at debug level rather than at error level (done in L130). If so, the check here should be err == bstructs.ErrPluginShutdown.

Another contributor replied:

I think the comment is correct: it's saying that if the err is ErrPluginShutdown we should not be logging, because that logging is unnecessary. I am not sure about retrying here vs. one layer up, calling TaskStats once the plugin restarts, though.

Not an expert in this code though, so @nickethier should chime in.

@notnoop (Contributor, author) replied:

In the old code, if err != bstructs.ErrPluginShutdown we log at debug level in L121, but if it is ErrPluginShutdown, we log at Warn level in L130. I'm having a hard time reconciling that with the comment.

The old logging logic is:

if err != bstructs.ErrPluginShutdown {
    h.logger.Debug("error fetching stats of task", "error", err)
    goto RETRY
}
// check if the error is terminal otherwise it's likely a
// transport error and we should retry
re, ok = err.(*structs.RecoverableError)
if ok && re.IsUnrecoverable() {
    return
}
h.logger.Warn("stats collection for task failed", "error", err)
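
To make the intended control flow concrete, here is a small self-contained sketch of the classification the flipped check implies: plugin shutdown is expected and retried quietly, unrecoverable errors stop collection, and anything else is warned about and retried. The types and names below are stand-ins for the real Nomad ones, not the actual diff.

package statshook

import (
    "errors"
    "log"
)

// Stand-ins for bstructs.ErrPluginShutdown and structs.RecoverableError.
var errPluginShutdown = errors.New("plugin shutdown")

type recoverableError struct {
    err         error
    recoverable bool
}

func (r *recoverableError) Error() string         { return r.err.Error() }
func (r *recoverableError) IsUnrecoverable() bool { return !r.recoverable }

// classify reports whether stats collection should retry after err,
// mirroring the flipped conditional discussed above.
func classify(err error) (retry bool) {
    if errors.Is(err, errPluginShutdown) {
        // Expected during plugin restarts: log at debug level and retry.
        log.Printf("DEBUG: error fetching stats of task: %v", err)
        return true
    }
    var re *recoverableError
    if errors.As(err, &re) && re.IsUnrecoverable() {
        // Terminal error: stop collecting.
        return false
    }
    // Likely a transport error: warn and retry.
    log.Printf("WARN: stats collection for task failed: %v", err)
    return true
}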


limit := time.Second * 5
backoff := 1 << (2 * uint64(retry)) * time.Second
if backoff > limit || retry > 5 {

@notnoop (Contributor, author) commented:

I added the retry check mostly to avoid dealing with integer overflow. If the retry count is large enough, the shift wraps around and backoff becomes negative! You can see the effect in https://play.golang.org/p/JYvHeAsSTUk .

The backoff computation produces the following values for successive retry counts:

retry 	 backoff
0 	 1s
1 	 4s
2 	 16s
3 	 1m4s
4 	 4m16s
5 	 17m4s
6 	 1h8m16s
7 	 4h33m4s
8 	 18h12m16s
9 	 72h49m4s
10 	 291h16m16s
11 	 1165h5m4s
12 	 4660h20m16s
13 	 18641h21m4s
14 	 74565h24m16s
15 	 298261h37m4s
16 	 1193046h28m16s
17 	 -351909h41m29.709551616s
18 	 -1407638h45m58.838206464s
19 	 -506459h29m21.64327424s

Here, I avoided adding a cap on the number of retries, to minimize changes compared to the old code.
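
The overflow is easy to reproduce. Below is a standalone sketch in the spirit of the playground link above: it uses the same backoff expression and shows why the extra retry > 5 guard matters once the raw value goes negative. This is an illustration, not the hook's actual code.

package main

import (
    "fmt"
    "time"
)

func main() {
    const limit = 5 * time.Second

    for retry := uint64(0); retry < 20; retry++ {
        // Same expression as the hook: 1 << (2*retry) seconds.
        backoff := 1 << (2 * retry) * time.Second

        // Without the retry > 5 guard, the shift eventually overflows
        // int64 nanoseconds and the raw value goes negative (retry >= 17
        // in the table above), so a negative backoff would slip past a
        // plain backoff > limit check.
        capped := backoff
        if capped > limit || retry > 5 {
            capped = limit
        }
        fmt.Printf("retry=%2d raw=%v capped=%v\n", retry, backoff, capped)
    }
}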

@nickethier (Member) left a comment:

LGTM, with a small comment about error handling.

// check if the error is terminal otherwise it's likely a
// transport error and we should retry
if re, ok := err.(*structs.RecoverableError); ok && re.IsUnrecoverable() {
    return nil, err

@nickethier (Member) commented:

This error gets swallowed. Is it useful enough to log, even at debug level? When callStatsWithRetry is called and returns an error, the caller just returns.

@notnoop (Contributor, author) replied:

Yes, we should! I logged it as an error, just like the other case where we retry.

@notnoop merged commit b352af9 into master on Jul 17, 2019
@notnoop deleted the b-stats-recover-plugin-shutdown branch on July 18, 2019 at 03:37
notnoop pushed a commit that referenced this pull request on Jul 18, 2019
Successfully merging this pull request may close these issues:

Plugin API, TaskStats not called after RecoverTask after plugin restart (#5936)