Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix multiple overflow errors in exponential backoff #18200

Merged
merged 8 commits into from
Aug 15, 2023
Merged

Conversation

tgross
Copy link
Member

@tgross tgross commented Aug 14, 2023

We use capped exponential backoff in several places in the code when handling failures. The code we've copy-and-pasted all over has a check to see if the backoff is greater than the limit, but this check happens after the bitshift and we always increment the number of attempts. This causes an overflow with a fairly small number of failures (ex. at one place I tested it occurs after only 24 iterations), resulting in a negative backoff which then never recovers. The backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC handler or an external API such as Vault. Note this doesn't occur in places where we cap the number of iterations so the loop breaks (usually to return an error), so long as the number of iterations is reasonable.

Introduce a check on the cap before the bitshift to avoid overflow in all places this can occur.

Fixes: #18199

We use capped exponential backoff in several places in the code when handling
failures. The code we've copy-and-pasted all over has a check to see if the
backoff is greater than the limit, but this check happens after the bitshift and
we always increment the number of attempts. This causes an overflow with a
fairly small number of failures (ex. at one place I tested it occurs after only
24 iterations), resulting in a negative backoff which then never recovers. The
backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC
handler or an external API such as Vault. Note this doesn't occur in places
where we cap the number of iterations so the loop breaks (usually to return an
error), so long as the number of iterations is reasonable.

Introduce a check on the cap before the bitshift to avoid overflow in all places
this can occur.

Fixes: #18199
Copy link
Member

@gulducat gulducat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple bits (heh) for consideration, which may be a result of my eyes glazing over reading these almost-but-not-quite-identical implementations.

nomad/state/state_store.go Outdated Show resolved Hide resolved
nomad/worker.go Outdated Show resolved Hide resolved
@stswidwinski
Copy link
Contributor

I thought I'd propose a little bit more structured approach: #18201. Perhaps we can use that across the callsites to avoid future issues resulting from repeated logic?

@tgross
Copy link
Member Author

tgross commented Aug 15, 2023

I've pulled in the proposed changes from #18201

Copy link
Member

@gulducat gulducat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay for reusable helper!

I have two concerns in the docker driver, one small and one large.
Other than that, LGTM!

drivers/docker/driver.go Outdated Show resolved Hide resolved
drivers/docker/driver.go Show resolved Hide resolved
drivers/docker/driver.go Show resolved Hide resolved
@tgross tgross merged commit f00bff0 into main Aug 15, 2023
@tgross tgross deleted the backoff-overflow branch August 15, 2023 18:38
tgross added a commit that referenced this pull request Aug 15, 2023
We use capped exponential backoff in several places in the code when handling
failures. The code we've copy-and-pasted all over has a check to see if the
backoff is greater than the limit, but this check happens after the bitshift and
we always increment the number of attempts. This causes an overflow with a
fairly small number of failures (ex. at one place I tested it occurs after only
24 iterations), resulting in a negative backoff which then never recovers. The
backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC
handler or an external API such as Vault. Note this doesn't occur in places
where we cap the number of iterations so the loop breaks (usually to return an
error), so long as the number of iterations is reasonable.

Introduce a helper with a check on the cap before the bitshift to avoid overflow in all 
places this can occur.

Fixes: #18199
Co-authored-by: stswidwinski <[email protected]>
tgross added a commit that referenced this pull request Aug 15, 2023
We use capped exponential backoff in several places in the code when handling
failures. The code we've copy-and-pasted all over has a check to see if the
backoff is greater than the limit, but this check happens after the bitshift and
we always increment the number of attempts. This causes an overflow with a
fairly small number of failures (ex. at one place I tested it occurs after only
24 iterations), resulting in a negative backoff which then never recovers. The
backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC
handler or an external API such as Vault. Note this doesn't occur in places
where we cap the number of iterations so the loop breaks (usually to return an
error), so long as the number of iterations is reasonable.

Introduce a helper with a check on the cap before the bitshift to avoid overflow in all 
places this can occur.

Fixes: #18199
Co-authored-by: stswidwinski <[email protected]>
tgross added a commit that referenced this pull request Aug 15, 2023
We use capped exponential backoff in several places in the code when handling
failures. The code we've copy-and-pasted all over has a check to see if the
backoff is greater than the limit, but this check happens after the bitshift and
we always increment the number of attempts. This causes an overflow with a
fairly small number of failures (ex. at one place I tested it occurs after only
24 iterations), resulting in a negative backoff which then never recovers. The
backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC
handler or an external API such as Vault. Note this doesn't occur in places
where we cap the number of iterations so the loop breaks (usually to return an
error), so long as the number of iterations is reasonable.

Introduce a helper with a check on the cap before the bitshift to avoid overflow in all
places this can occur.

Fixes: #18199
Co-authored-by: stswidwinski <[email protected]>
@tgross
Copy link
Member Author

tgross commented Aug 15, 2023

Backported to 1.6.x, 1.5.x, and 1.4.x

lhaig added a commit to lhaig/nomad that referenced this pull request Dec 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

Successfully merging this pull request may close these issues.

Recurring device driver stats failure causes zero backoff (and a lot of wasted CPU).
3 participants