-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add retry mechanism for put call #18137
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. Added some questions and think it would be good to have some more tests to cover functions such as the calculateDelay
function.
api/retry.go
Outdated
for attempt := uint64(0); attempt < rc.MaxRetries; attempt++ { | ||
iDelay = rc.calculateDelay(attempt) | ||
|
||
t := time.NewTimer(iDelay) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also will need a defer t.Stop()
here, otherwise we could leak timers if the context is closed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing that might make sense is to instantiate the timer outside the loop and then call Reset
on it when needed. Our helper.SafeTimer
would be nice here but we don't want to pull that into this package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, reuse the time, will do!
api/retry.go
Outdated
return 0 | ||
} | ||
|
||
new := rc.DelayBase << (attempt - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename this variable as it collides with the builtin function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
api/retry.go
Outdated
type retryOptions struct { | ||
MaxRetries uint64 // Optional, defaults to 3 | ||
MaxBetweenCalls time.Duration // Optional, defaults to 0, meaning no time cap | ||
MaxToLastCall time.Duration // Optional, defaults to 0, meaning no time cap | ||
FixedDelay time.Duration // Optional, defaults to 0, meaning Delay is exponential, starting at 1sec | ||
DelayBase time.Duration // Optional, defaults to 1sec | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The struct isn't exported but the parameters are, do we want to alter the parameters? This also makes me wonder if this is something we would want to make available to external users of the API package?
Could we also add some added descriptions on the parameters? I had to read the code to understand MaxBetweenCalls the first time.
return wm, err | ||
} | ||
|
||
func isCallRetriable(statusCode int) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing as this function is available to the whole API package, I wonder if it needs a more descriptive name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was the best I could come up with jejej any ideas?
api/retry.go
Outdated
var err error | ||
var wm *WriteMeta | ||
|
||
iDelay := time.Duration(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iDelay
for initial delay? If so, we use it for all delays, so maybe delayDuration
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was for iteration delay, but lets change it for attempt delay :)
api/retry.go
Outdated
func newRetryClient(c client, opts retryOptions) *retryClient { | ||
rc := &retryClient{ | ||
c: c, | ||
retryOptions: retryOptions{ | ||
MaxRetries: 3, | ||
DelayBase: time.Second, | ||
}, | ||
} | ||
|
||
if opts.DelayBase != 0 { | ||
rc.DelayBase = opts.DelayBase | ||
} | ||
|
||
if opts.MaxRetries != 0 { | ||
rc.MaxRetries = opts.MaxRetries | ||
} | ||
|
||
if opts.MaxBetweenCalls != 0 { | ||
rc.MaxBetweenCalls = opts.MaxBetweenCalls | ||
} | ||
|
||
if opts.MaxToLastCall != 0 { | ||
rc.MaxToLastCall = opts.MaxToLastCall | ||
} | ||
|
||
if opts.FixedDelay != 0 { | ||
rc.FixedDelay = opts.FixedDelay | ||
} | ||
|
||
return rc | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
retryClient
isn't exported, so it's only available for Nomad-internal uses. I think that's probably fine for now (until we're happy with our own uses of it). But from the standpoint of API design, maybe it would make sense to have all this functionality in the standard Client
by adding it to the Config
struct as an unexported field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im not sure I know what you mean by adding as a Config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if we added the retryOptions
to api.Config
then we wouldn't need to have a special retryClient
struct, and the standard api.Client
could use the retry options whenever it's been configured that way.
type Config struct {
// ... there's a bunch of existing field here
retry retryOptions
}
And then later if we want to, we could make the retry options public for all api
package consumers (ex. could the Terraform provider use this?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about this last night and there's no real reason to hold this PR up for this though. The new code you've added isn't exported, so it's not part of the public API and will be safe to update later on without breaking backwards compatibility. Let's ship and do any changes like this later if we want.
api/retry.go
Outdated
for attempt := uint64(0); attempt < rc.MaxRetries; attempt++ { | ||
iDelay = rc.calculateDelay(attempt) | ||
|
||
t := time.NewTimer(iDelay) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing that might make sense is to instantiate the timer outside the loop and then call Reset
on it when needed. Our helper.SafeTimer
would be nice here but we don't want to pull that into this package.
return statusCode > http.StatusInternalServerError && | ||
statusCode < http.StatusNetworkAuthenticationRequired || | ||
statusCode == http.StatusTooManyRequests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately there are still a bunch of places in the HTTP server where we return 500 when we should be returning 400. 😊 But I don't think we can fix all of those in this PR. So long as we're not exporting this for now I think that's fine.
api/retry.go
Outdated
type retryOptions struct { | ||
MaxRetries uint64 // Optional, defaults to 3 | ||
MaxBetweenCalls time.Duration // Optional, defaults to 0, meaning no time cap | ||
MaxToLastCall time.Duration // Optional, defaults to 0, meaning no time cap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we really need this if we're passing context.Context
to retryPut
. The caller can cancel the work however they see fit there, and it would be easy to misconfigure this to be shorter than any context timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of the lock lease on this, how it doesn't make sense to retry a call after the TTL passed, the servir will expire the lock by then anyways.
api/retry.go
Outdated
return 0 | ||
} | ||
|
||
newDelay := c.config.retryOptions.delayBase << (attempt - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this will overflow quite quickly similarly to what is shown here: #18200
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. @Juanadelacuesta note that we don't want to pull in the helper
into this api
package, so we'll need to duplicate the logic we added in #18200 here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we set a limit to the max number of attempts?
api/retry.go
Outdated
// Ensure that a big attempt number or a big delayBase number will not cause | ||
// a negative delay by overflowing the delay increase. | ||
if math.MaxInt64/c.config.retryOptions.delayBase.Nanoseconds() < (1 << attempt) { | ||
return defaultMaxBackoffDelay | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work to protect against overflow in the bitshift, because it does the bitshift in order to do the check. Without a low cap on the number of attempts it'd be easy to cause this to overflow. See https://go.dev/play/p/DztXnyHL9hk for the results.
Instead, we should be tracking the current delay in the caller and returning early if that delay exceeds the max backoff (or default max backoff, if no max backoff is set).
…hat wont cause the delay to overflow
* func: add retry mechanism for put call * style: remove debg print * fix: move timer outside of main loop, add descriptions * func: make the retryPut a method of the API client directly * func: add verification to avoid negative delays * func: change the delay capping mechanism to checking for an attempt that wont cause the delay to overflow
* func: add retry mechanism for put call * style: remove debg print * fix: move timer outside of main loop, add descriptions * func: make the retryPut a method of the API client directly * func: add verification to avoid negative delays * func: change the delay capping mechanism to checking for an attempt that wont cause the delay to overflow
* func: add retry mechanism for put call * style: remove debg print * fix: move timer outside of main loop, add descriptions * func: make the retryPut a method of the API client directly * func: add verification to avoid negative delays * func: change the delay capping mechanism to checking for an attempt that wont cause the delay to overflow
The SDK for the locking could benefit from a retry mechanism for calls made to the nomad server, this PR introduces this mechanism on top of the put call already present on the api.