Recurring device driver stats failure causes zero backoff (and a lot of wasted CPU). #18199
Comments
Hi @stswidwinski! Yeah, this looks pretty straightforwardly like a bug. Quick testing shows it overflows at only 24 iterations, and we should be using an approach like this:

```go
backoff := time.Duration(0)
attempt := 0
for {
	// Only recompute while under the cap, so the shift amount can
	// never grow without bound.
	if backoff < statsBackoffLimit {
		backoff = (1 << (2 * uint64(attempt))) * statsBackoffBaseline
		if backoff > statsBackoffLimit {
			backoff = statsBackoffLimit
		}
		attempt++
	}
	// ... attempt the operation, waiting for backoff between retries ...
}
```

This caps the backoff and avoids the overflow. And you're right that there's a good number of places where we're doing this incorrectly. I can get a patch up for this shortly. Thanks!
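As a rough sanity check on that iteration count (my arithmetic, not from the thread): time.Duration is an int64 count of nanoseconds, so the product (1 << (2 * attempt)) * baseline leaves the int64 range once 2*attempt + log2(baseline in nanoseconds) reaches about 63. A 1-second baseline (~2^30 ns) wraps around attempt 17, while a baseline in the tens of microseconds (~2^15 ns) wraps at attempt 24; the exact count depends on the baseline at the call site tested. And once 2*attempt reaches 64, Go's non-constant left shift yields exactly 0, which is the zero backoff in the issue title.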
We use capped exponential backoff in several places in the code when handling failures. The code we've copy-and-pasted all over has a check to see if the backoff is greater than the limit, but this check happens after the bitshift and we always increment the number of attempts. This causes an overflow with a fairly small number of failures (ex. at one place I tested it occurs after only 24 iterations), resulting in a negative backoff which then never recovers. The backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC handler or an external API such as Vault.

Note this doesn't occur in places where we cap the number of iterations so the loop breaks (usually to return an error), so long as the number of iterations is reasonable.

Introduce a check on the cap before the bitshift to avoid overflow in all places this can occur.

Fixes: #18199
Patch is here: #18200, waiting for review. We'll get that into the next version of Nomad (with backports to 1.6.x, 1.5.x, and 1.4.x).
Thank you! I proposed a somewhat more modular option to help us avoid making a similar mistake in the future (PR linked in your PR). Hope that helps!
We use capped exponential backoff in several places in the code when handling failures. The code we've copy-and-pasted all over has a check to see if the backoff is greater than the limit, but this check happens after the bitshift and we always increment the number of attempts. This causes an overflow with a fairly small number of failures (ex. at one place I tested it occurs after only 24 iterations), resulting in a negative backoff which then never recovers. The backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC handler or an external API such as Vault.

Note this doesn't occur in places where we cap the number of iterations so the loop breaks (usually to return an error), so long as the number of iterations is reasonable.

Introduce a helper with a check on the cap before the bitshift to avoid overflow in all places this can occur.

Fixes: #18199

Co-authored-by: stswidwinski <[email protected]>
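The helper itself isn't quoted in this thread. As a sketch of one overflow-proof shape such a helper can take (the name and signature here are hypothetical, not necessarily what the patch uses), compare by shifting the limit down instead of shifting the baseline up:

```go
package helper

import "time"

// Cap returns min(baseline << (2*attempt), limit) without ever overflowing
// time.Duration. Hypothetical sketch; assumes baseline and limit are
// positive and that limit is the desired ceiling.
func Cap(attempt uint64, baseline, limit time.Duration) time.Duration {
	if attempt > 31 {
		// 2*attempt would shift past the 63 value bits of an int64.
		return limit
	}
	shift := 2 * attempt
	// baseline<<shift exceeds limit exactly when baseline > limit>>shift,
	// and the right shift can never overflow.
	if baseline > limit>>shift {
		return limit
	}
	return baseline << shift
}
```

Each retry loop then computes backoff := Cap(attempt, baseline, limit) and increments attempt, with no shift-dependent overflow possible.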
Nomad version
tip.
Operating system and Environment details
Linux.
Issue
Repeated failure to obtain device stats causes the backoff to become 0 and burn CPU; we've seen the CPU entirely pegged because of this. The backoff logic at https://github.com/hashicorp/nomad/blob/main/client/devicemanager/instance.go#L497-L504 can easily overflow. Here are the logs of repeated failures and the computed backoff:
Sadly, this logic is copy-pasted in multiple places across the codebase. Some examples:

- nomad/nomad/worker.go, lines 876 to 883 in 65d562b
- nomad/nomad/state/state_store.go, lines 282 to 286 in 65d562b

So repeated failures in many different places can cause a CPU to spin.
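For illustration, here is a standalone sketch of that shift-then-cap pattern with made-up baseline and limit values (the real constants differ per call site). With these numbers the computed duration caps correctly at first, wraps negative around attempt 22, and collapses to exactly 0 once the shift count reaches 64:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative constants; not the values used in Nomad itself.
	baseline := 1 * time.Millisecond
	limit := 1 * time.Minute

	for attempt := 0; attempt < 36; attempt++ {
		// The copy-pasted pattern: shift first, cap afterwards.
		backoff := (1 << (2 * uint64(attempt))) * baseline
		if backoff > limit {
			backoff = limit
		}
		// Once the multiplication overflows int64, backoff goes
		// negative and the cap above can no longer catch it; from
		// attempt 32 on, the shift count is >= 64 and backoff is 0.
		fmt.Printf("attempt %2d: backoff %v\n", attempt, backoff)
	}
}
```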
Reproduction steps
Compile any plugin which always fails to return stats. Run the Nomad client with said plugin and give it some time. It will end up producing bad data and will eventually burn CPU.
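For a sense of what such a plugin looks like, here is a sketch of just the stats hook, assuming the Stats signature from Nomad's plugins/device package (a real plugin also needs Fingerprint, Reserve, and the base plugin methods, all omitted here):

```go
package main

import (
	"context"
	"errors"
	"time"

	"github.com/hashicorp/nomad/plugins/device"
)

// failingDevice is a hypothetical device plugin whose stats collection
// always fails, which drives the client's stats backoff loop forever.
type failingDevice struct{}

func (d *failingDevice) Stats(ctx context.Context, interval time.Duration) (<-chan *device.StatsResponse, error) {
	// Every stats request errors out immediately, so the device manager
	// retries under its (supposedly capped exponential) backoff.
	return nil, errors.New("stats are never available")
}
```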