GrpcRetryer now respects DEADLINE_EXCEEDED as non-retryable #654
Conversation
temporal-serviceclient/src/main/java/io/temporal/internal/retryer/GrpcAsyncRetryer.java (review thread resolved, outdated)
temporal-serviceclient/src/main/java/io/temporal/internal/retryer/GrpcRetryerUtils.java (review thread resolved)
 * @param currentTimeMillis current timestamp
 * @return true if we are out of attempts or out of time to retry
 */
static boolean ranOutOfRetries(
Should we name it positively, like canRetry or isRetryAllowed?
I understand where you are coming from, but from a readability perspective "ran out of retries" describes much better what happened and what we actually check.
canRetry or isRetryAllowed could, for example, be checking whether an exception is retryable, so those names are not very clear.
Renaming it to notRanOutOfRetries and inverting the result doesn't look like a good idea.
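(For readers outside the diff, a minimal standalone sketch of the kind of check being discussed; the parameters and body are illustrative assumptions, not the actual GrpcRetryerUtils signature.)

```java
import java.time.Duration;

class RetryBudgetSketch {
  /**
   * True when either the attempt budget or the time budget is exhausted.
   * Hypothetical parameters; the real ranOutOfRetries may differ.
   */
  static boolean ranOutOfRetries(
      int attempt,
      int maximumAttempts,
      long startTimeMillis,
      Duration expiration,
      long currentTimeMillis) {
    boolean outOfAttempts = maximumAttempts > 0 && attempt >= maximumAttempts;
    boolean outOfTime =
        expiration != null && currentTimeMillis - startTimeMillis >= expiration.toMillis();
    return outOfAttempts || outOfTime;
  }
}
```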
  Thread.currentThread().interrupt();
  throw new CancellationException();
} catch (StatusRuntimeException e) {
  lastException = e;
Do deadline exceeded exceptions have any cause? If not, should we init the cause and put the previous error there? That way we'd have context about what caused the deadline exceeded error in the first place.
I feel a little bit uneasy about it.
First of all, StatusRuntimeException actually doesn't have a cause at all. It doesn't expose any constructor accepting a cause; I think that's because it's meant to be serializable.
Second, I feel that such usage of cause can be very misleading for users. The previous exception didn't really cause DeadlineExceeded; it was caused by a deadline timeout.
I understand where you are coming from, but to do that we should probably create a new class of exceptions. Let's catch up on that and do it in a separate PR if we really want it.
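For the record, if we do go the separate-PR route, the new exception class could look roughly like this (a purely hypothetical class, not part of this change): keep the DEADLINE_EXCEEDED status while attaching the last failure seen before the deadline as the cause, instead of pretending that failure produced the deadline expiration itself.

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

/**
 * Hypothetical wrapper (not in this PR): keeps the DEADLINE_EXCEEDED status
 * and carries the last error observed before the deadline expired as its cause.
 */
class DeadlineExceededAfterFailuresException extends RuntimeException {
  private final Status status;

  DeadlineExceededAfterFailuresException(
      StatusRuntimeException deadlineExceeded, Throwable lastFailureBeforeDeadline) {
    super(deadlineExceeded.getMessage(), lastFailureBeforeDeadline);
    this.status = deadlineExceeded.getStatus();
  }

  Status getStatus() {
    return status;
  }
}
```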
...ceclient/src/main/java/io/temporal/serviceclient/DefaultServiceOperationRpcRetryOptions.java (review thread resolved, outdated)
public static final Duration RETRY_SERVICE_OPERATION_INITIAL_INTERVAL = Duration.ofMillis(20);
public static final Duration RETRY_SERVICE_OPERATION_EXPIRATION_INTERVAL = Duration.ofMinutes(1);
public static final Duration RETRY_SERVICE_OPERATION_MAXIMUM_INTERVAL;
public static final double RETRY_SERVICE_OPERATION_BACKOFF = 1.2;
We've recently reduced the backoff in Go to 2x. It would probably make sense to do the same here.
Thanks for pointing it out! I don't want to put too many meaningful changes and parameter tuning into this PR, because it's already heavy on refactoring. I will make this change in a follow-up PR.
import java.time.Duration;

public class DefaultServiceOperationRpcRetryOptions {
  public static final Duration RETRY_SERVICE_OPERATION_INITIAL_INTERVAL = Duration.ofMillis(20);
The initial interval has recently been increased to 200ms in Go to avoid spamming the server too much.
Thanks for pointing it out! I don't want to put too many meaningful changes and parameter tuning into this PR, because it's already heavy on refactoring. I will make this change in a follow-up PR.
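To make the tuning discussion concrete, here is a small standalone sketch of how these constants drive the retry backoff (plain Java, not the SDK's RpcRetryOptions builder; the maximum-interval value is an assumption since the snippet above doesn't show its initializer). Switching to the Go-side values would only mean changing 20ms to 200ms and 1.2 to 2.0.

```java
import java.time.Duration;

class BackoffSketch {
  // Mirrors the defaults shown in this PR; 200ms / 2.0 are the Go-side values
  // mentioned in the review comments above.
  static final Duration INITIAL_INTERVAL = Duration.ofMillis(20);
  static final Duration MAXIMUM_INTERVAL = Duration.ofSeconds(6); // assumed cap for this sketch
  static final double BACKOFF_COEFFICIENT = 1.2;

  /** Sleep interval before the given (1-based) retry attempt, capped at the maximum. */
  static Duration intervalForAttempt(int attempt) {
    double millis = INITIAL_INTERVAL.toMillis() * Math.pow(BACKOFF_COEFFICIENT, attempt - 1);
    return Duration.ofMillis(Math.min((long) millis, MAXIMUM_INTERVAL.toMillis()));
  }

  public static void main(String[] args) {
    for (int attempt = 1; attempt <= 5; attempt++) {
      System.out.println("attempt " + attempt + " -> " + intervalForAttempt(attempt));
    }
  }
}
```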
.addDoNotRetry(Status.Code.PERMISSION_DENIED, null)
.addDoNotRetry(Status.Code.UNAUTHENTICATED, null)
.addDoNotRetry(Status.Code.UNIMPLEMENTED, null)
.addDoNotRetry(Status.Code.INTERNAL, QueryFailedFailure.class);
I believe Internal should be retryable. See https://www.notion.so/temporalio/Retryable-gRPC-error-codes-e1324307c4a745839b6c4a84a27373b8
We may need to check with @alexshtin to confirm.
Thanks for pointing it out! I don't want to put too many meaningful changes and parameter tuning into this PR, because it's already heavy on refactoring. I will make this change in a follow-up PR.
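As background for readers, a simplified sketch of how a do-not-retry list of (status code, failure type) pairs like the one above is typically consulted (illustrative only; the SDK's actual matching against the failure details attached to the response is more involved).

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import java.util.List;

class DoNotRetrySketch {
  /** A (status code, optional failure detail type) pair that should stop retries. */
  static final class DoNotRetryRule {
    final Status.Code code;
    final Class<?> failureType; // null means "never retry this code"

    DoNotRetryRule(Status.Code code, Class<?> failureType) {
      this.code = code;
      this.failureType = failureType;
    }
  }

  static boolean isNonRetryable(StatusRuntimeException e, List<DoNotRetryRule> rules) {
    for (DoNotRetryRule rule : rules) {
      if (rule.code != e.getStatus().getCode()) {
        continue;
      }
      if (rule.failureType == null) {
        // Unconditional: this status code is never retried.
        return true;
      }
      // A non-null failure type would additionally require matching the failure
      // details carried by the response; that unpacking is omitted in this sketch.
    }
    return false;
  }
}
```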
Refactor GrpcRetryer into GrpcAsyncRetryer and GrpcSyncRetryer
Async retry code was rewritten to be much simpler and have fewer wrappers and handlers
Implementation is reworked to be more unit-testable
Issue temporalio#653
LGTM, let's fork the discussion about numbers into a separate thread.
What was changed
- GrpcRetryer now respects DEADLINE_EXCEEDED as non-retryable
- Refactor GrpcRetryer into GrpcAsyncRetryer and GrpcSyncRetryer
- Async retry code was rewritten to be much simpler and have fewer wrappers and handlers
- Implementation is reworked to be more unit-testable
Why?
As described in issue #653, right now when we reach the DEADLINE from the gRPC context we keep retrying even though it no longer makes sense: all further requests will fail with DEADLINE_EXCEEDED.
Closes #653: WorkflowExecutionUtils and GrpcRetryer don't respect reaching deadline from GRPC context and downstream DEADLINE_EXCEEDED exceptions
The refactoring also helps to address #650: Add unit tests for WorkflowExecutionUtils#getInstanceCloseEvent
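The heart of the fix can be summarized with a simplified sketch (not the exact GrpcRetryerUtils code): before scheduling another attempt, look at the gRPC deadline and stop once it has expired, and treat DEADLINE_EXCEEDED itself as non-retryable, since every further call is guaranteed to fail the same way.

```java
import io.grpc.Context;
import io.grpc.Deadline;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

class DeadlineAwareRetrySketch {
  /** True if another attempt can still be made before the gRPC deadline expires. */
  static boolean deadlineNotExpired() {
    Deadline deadline = Context.current().getDeadline();
    return deadline == null || !deadline.isExpired();
  }

  /** DEADLINE_EXCEEDED is treated as non-retryable: the expired deadline will not come back. */
  static boolean isRetryable(StatusRuntimeException e) {
    return e.getStatus().getCode() != Status.Code.DEADLINE_EXCEEDED;
  }
}
```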
How was this tested:
Integration tests for workflows with a gRPC deadline expiring before the start of the workflow and in the middle of workflow execution.