async retries timeout race condition #528
Labels
priority: p2
type: bug
Problem
Async retries use an asyncio.wait_for line to break out early when the deadline is hit. The problem is that, in practice, retries typically wrap gapic grpc calls, which implement their own timeout logic. This means there is a natural race condition when the timeout is reached, where one of two things can happen:
- the wait_for line in the wrapper notices the timeout first, and a RetryError is raised
- the DeadlineExceeded error from the wrapped call is passed to the wrapper first. It will be wrapped in a RetryError if DeadlineExceeded is in the list of retryable errors, otherwise it will be raised directly through the wrapper.
On top of the race condition, there are a couple of other issues with wait_for:
- wait_for creates a new task, which can add significant performance overhead
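A minimal, self-contained sketch of the race, assuming both sides enforce the same deadline (fake_gapic_call, retry_wrapped, and the local DeadlineExceeded class are stand-ins, not the real google-api-core or gapic code):

```python
import asyncio


class DeadlineExceeded(Exception):
    """Stand-in for the error a gapic grpc call raises when its own deadline expires."""


async def fake_gapic_call(rpc_timeout: float) -> str:
    # Simulates a gapic call that enforces its own per-call deadline internally.
    await asyncio.sleep(rpc_timeout)
    raise DeadlineExceeded(f"deadline of {rpc_timeout}s exceeded")


async def retry_wrapped(overall_timeout: float, rpc_timeout: float) -> str:
    # Simplified picture of the current wrapper: the deadline is enforced twice,
    # once by wait_for and once by the wrapped call, so whichever fires first
    # decides which error surfaces.
    try:
        return await asyncio.wait_for(fake_gapic_call(rpc_timeout), overall_timeout)
    except asyncio.TimeoutError:
        return "wait_for fired first -> wrapper raises RetryError"
    except DeadlineExceeded:
        return "the call's own deadline fired first -> DeadlineExceeded reaches the wrapper"


async def main() -> None:
    # In production both deadlines are derived from the same configured timeout,
    # so which side fires first depends on scheduling and network latency.
    # Nudging one side shows each outcome:
    print(await retry_wrapped(overall_timeout=0.05, rpc_timeout=0.10))
    print(await retry_wrapped(overall_timeout=0.10, rpc_timeout=0.05))


asyncio.run(main())
```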
Solutions:
We can add a new interrupt_on_timeout argument to the async retry function, and only enable wait_for when the flag is True. To avoid breaking changes, we can have the flag default to True. But personally, I think we should consider just removing the wait_for entirely.
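A rough sketch of how the flag could gate wait_for on a simplified retry loop (this is a made-up loop, not the google-api-core implementation; only the interrupt_on_timeout name comes from the proposal above, and plain TimeoutError stands in for RetryError):

```python
import asyncio
import random
from collections.abc import Awaitable, Callable


async def retry_target(
    target: Callable[[], Awaitable[object]],
    predicate: Callable[[Exception], bool],
    timeout: float,
    interrupt_on_timeout: bool = True,  # proposed flag; True preserves today's behaviour
) -> object:
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    last_exc: Exception | None = None
    while loop.time() < deadline:
        try:
            if interrupt_on_timeout:
                # Current behaviour: wait_for spawns a task and can cancel the
                # attempt mid-call when the remaining budget runs out.
                return await asyncio.wait_for(target(), deadline - loop.time())
            # Proposed behaviour: trust the wrapped call's own timeout logic and
            # only check the deadline between attempts (the while condition).
            return await target()
        except asyncio.TimeoutError as exc:
            raise TimeoutError("retry deadline exceeded") from exc
        except Exception as exc:
            if not predicate(exc):
                raise
            last_exc = exc
            await asyncio.sleep(random.uniform(0, 0.1))  # placeholder jittered backoff
    raise TimeoutError("retry deadline exceeded") from last_exc
```

With the default of True nothing changes for existing callers; passing False gives the "remove wait_for" behaviour without a breaking change.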
Streaming Retries:
Streaming retries add an extra complication: on top of the option to break the stream mid-yield, there is the added possibility of checking for expired timeouts after each yield in the stream. We could add another flag to control these timeout options.
But adding any extra checks in between each yield can significantly harm performance, especially if we add a wait_for to interrupt mid-yield, like async_retry currently does. For this reason, I suggest that we keep these timeout options out of streaming retries for now, and only keep the async_retry wait_for for compatibility reasons.
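For reference, the "check after each yield" option would look something like the sketch below (a hypothetical helper, not anything that exists in the library); even this cheap form adds a clock read per item and can only notice an expired deadline after the current item has been produced:

```python
import asyncio
from collections.abc import AsyncIterable, AsyncIterator


async def stream_with_deadline(
    stream: AsyncIterable[object],
    timeout: float,
) -> AsyncIterator[object]:
    # Enforce the deadline only between yields: a slow item is never
    # interrupted mid-flight, at the cost of overrunning the deadline
    # by up to one item.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    async for item in stream:
        yield item
        if loop.time() >= deadline:
            raise TimeoutError("stream deadline exceeded between yields")
```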
In general, I think the retry wrappers should expect the wrapped function to handle timeout interruptions on its own, and the wrappers should only handle timeouts between attempts.