async retries timeout race condition #528
Labels
priority: p2
type: bug
Problem
Async retries use an asyncio.wait_for line to break out early when the deadline is hit. The problem is that, in practice, retries typically wrap gapic grpc calls, which implement their own timeout logic. This means there is a natural race condition when the timeout is reached, where one of two things can happen:
- the wait_for line in the wrapper notices the timeout first, and a RetryError is raised
- the DeadlineExceeded error from the wrapped call is passed to the wrapper first. It will be wrapped in a RetryError if DeadlineExceeded is in the list of retryable errors, otherwise it will be raised directly through the wrapper.
On top of the race condition, there are a couple of other issues with wait_for:
- wait_for creates a new task, which can add significant performance overhead
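A minimal, self-contained sketch of the race, assuming both sides enforce the same deadline (fake_gapic_call, retry_wrapped, and the local DeadlineExceeded class are stand-ins, not the real google-api-core or gapic code):

```python
import asyncio


class DeadlineExceeded(Exception):
    """Stand-in for the error a gapic grpc call raises when its own deadline expires."""


async def fake_gapic_call(rpc_timeout: float) -> str:
    # Simulates a gapic call that enforces its own per-call deadline internally.
    await asyncio.sleep(rpc_timeout)
    raise DeadlineExceeded(f"deadline of {rpc_timeout}s exceeded")


async def retry_wrapped(overall_timeout: float, rpc_timeout: float) -> str:
    # Simplified picture of the current wrapper: the deadline is enforced twice,
    # once by wait_for and once by the wrapped call, so whichever fires first
    # decides which error surfaces.
    try:
        return await asyncio.wait_for(fake_gapic_call(rpc_timeout), overall_timeout)
    except asyncio.TimeoutError:
        return "wait_for fired first -> wrapper raises RetryError"
    except DeadlineExceeded:
        return "the call's own deadline fired first -> DeadlineExceeded reaches the wrapper"


async def main() -> None:
    # In production both deadlines are derived from the same configured timeout,
    # so which side fires first depends on scheduling and network latency.
    # Nudging one side shows each outcome:
    print(await retry_wrapped(overall_timeout=0.05, rpc_timeout=0.10))
    print(await retry_wrapped(overall_timeout=0.10, rpc_timeout=0.05))


asyncio.run(main())
```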
Solutions:
We can add a new interrupt_on_timeout argument to the async retry function, and only enable wait_for when the flag is True. To avoid breaking changes, we can have the flag default to True. But personally, I think we should consider just removing the wait_for entirely.
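A rough sketch of how the flag could gate wait_for on a simplified retry loop (this is a made-up loop, not the google-api-core implementation; only the interrupt_on_timeout name comes from the proposal above, and plain TimeoutError stands in for RetryError):

```python
import asyncio
import random
from collections.abc import Awaitable, Callable


async def retry_target(
    target: Callable[[], Awaitable[object]],
    predicate: Callable[[Exception], bool],
    timeout: float,
    interrupt_on_timeout: bool = True,  # proposed flag; True preserves today's behaviour
) -> object:
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    last_exc: Exception | None = None
    while loop.time() < deadline:
        try:
            if interrupt_on_timeout:
                # Current behaviour: wait_for spawns a task and can cancel the
                # attempt mid-call when the remaining budget runs out.
                return await asyncio.wait_for(target(), deadline - loop.time())
            # Proposed behaviour: trust the wrapped call's own timeout logic and
            # only check the deadline between attempts (the while condition).
            return await target()
        except asyncio.TimeoutError as exc:
            raise TimeoutError("retry deadline exceeded") from exc
        except Exception as exc:
            if not predicate(exc):
                raise
            last_exc = exc
            await asyncio.sleep(random.uniform(0, 0.1))  # placeholder jittered backoff
    raise TimeoutError("retry deadline exceeded") from last_exc
```

With the default of True nothing changes for existing callers; passing False gives the "remove wait_for" behaviour without a breaking change.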
Streaming Retries:
Streaming retries add an extra complication: on top of the option to break the stream mid-yield, there is the added possibility of checking for expired timeouts after each yield in the stream. We could add another flag to control these timeout options.
But adding any extra checks in between each yield can significantly harm performance, especially if we add a wait_for to interrupt mid-yield, like async_retry currently does. For this reason, I suggest that we keep these timeout options out of streaming retries for now, and only keep the async_retry wait_for for compatibility reasons.
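For reference, the "check after each yield" option would look something like the sketch below (a hypothetical helper, not anything that exists in the library); even this cheap form adds a clock read per item and can only notice an expired deadline after the current item has been produced:

```python
import asyncio
from collections.abc import AsyncIterable, AsyncIterator


async def stream_with_deadline(
    stream: AsyncIterable[object],
    timeout: float,
) -> AsyncIterator[object]:
    # Enforce the deadline only between yields: a slow item is never
    # interrupted mid-flight, at the cost of overrunning the deadline
    # by up to one item.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    async for item in stream:
        yield item
        if loop.time() >= deadline:
            raise TimeoutError("stream deadline exceeded between yields")
```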
In general, I think the retry wrappers should expect the wrapped function to handle timeout interruptions on its own, and the wrappers should only handle timeouts between attempts.