GrpcRetryer now respects DEADLINE_EXCEEDED as non-retryable #654
Conversation
temporal-serviceclient/src/main/java/io/temporal/internal/retryer/GrpcAsyncRetryer.java (review thread resolved, outdated)
temporal-serviceclient/src/main/java/io/temporal/internal/retryer/GrpcRetryerUtils.java (review thread resolved)
 * @param currentTimeMillis current timestamp
 * @return true if we are out of attempts or out of time to retry
 */
static boolean ranOutOfRetries(
Should we name it positively, like canRetry or isRetryAllowed?
I understand where you are coming from, but from a readability perspective "ran out of retries" describes much better what happened and what we actually check.
canRetry or isRetryAllowed could, for example, be checking whether an exception is retryable, so those names are not very clear.
Renaming it to notRanOutOfRetries and inverting the result doesn't look like a good idea.
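(For readers outside the diff, a minimal standalone sketch of the kind of check being discussed; the parameters and body are illustrative assumptions, not the actual GrpcRetryerUtils signature.)

```java
import java.time.Duration;

class RetryBudgetSketch {
  /**
   * True when either the attempt budget or the time budget is exhausted.
   * Hypothetical parameters; the real ranOutOfRetries may differ.
   */
  static boolean ranOutOfRetries(
      int attempt,
      int maximumAttempts,
      long startTimeMillis,
      Duration expiration,
      long currentTimeMillis) {
    boolean outOfAttempts = maximumAttempts > 0 && attempt >= maximumAttempts;
    boolean outOfTime =
        expiration != null && currentTimeMillis - startTimeMillis >= expiration.toMillis();
    return outOfAttempts || outOfTime;
  }
}
```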
  Thread.currentThread().interrupt();
  throw new CancellationException();
} catch (StatusRuntimeException e) {
  lastException = e;
Do deadline exceeded exceptions have any cause? If not, should we init the cause and put the previous error there? That way we'd have context about what caused the deadline exceeded error in the first place.
I feel a little bit uneasy about it.
First of all, StatusRuntimeException actually doesn't have a cause at all. It doesn't expose any constructor accepting a cause; I think that's because it's meant to be serializable.
Second, I feel that such usage of cause can be very misleading for users. The previous exception didn't really cause DeadlineExceeded; it was caused by a deadline timeout.
I understand where you are coming from, but to do that we should probably create a new class of exceptions. Let's catch up on that and do it in a separate PR if we really want it.
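For the record, if we do go the separate-PR route, the new exception class could look roughly like this (a purely hypothetical class, not part of this change): keep the DEADLINE_EXCEEDED status while attaching the last failure seen before the deadline as the cause, instead of pretending that failure produced the deadline expiration itself.

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

/**
 * Hypothetical wrapper (not in this PR): keeps the DEADLINE_EXCEEDED status
 * and carries the last error observed before the deadline expired as its cause.
 */
class DeadlineExceededAfterFailuresException extends RuntimeException {
  private final Status status;

  DeadlineExceededAfterFailuresException(
      StatusRuntimeException deadlineExceeded, Throwable lastFailureBeforeDeadline) {
    super(deadlineExceeded.getMessage(), lastFailureBeforeDeadline);
    this.status = deadlineExceeded.getStatus();
  }

  Status getStatus() {
    return status;
  }
}
```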
...ceclient/src/main/java/io/temporal/serviceclient/DefaultServiceOperationRpcRetryOptions.java (review thread resolved, outdated)
public static final Duration RETRY_SERVICE_OPERATION_INITIAL_INTERVAL = Duration.ofMillis(20);
public static final Duration RETRY_SERVICE_OPERATION_EXPIRATION_INTERVAL = Duration.ofMinutes(1);
public static final Duration RETRY_SERVICE_OPERATION_MAXIMUM_INTERVAL;
public static final double RETRY_SERVICE_OPERATION_BACKOFF = 1.2;
We've recently reduced the backoff in Go to 2x. It would probably make sense to do the same here.
Thanks for pointing it out! I don't want to put too many meaningful changes and parameter tuning into this PR, because it's already heavy on refactoring. I will make this change in a follow-up PR.
import java.time.Duration;

public class DefaultServiceOperationRpcRetryOptions {
  public static final Duration RETRY_SERVICE_OPERATION_INITIAL_INTERVAL = Duration.ofMillis(20);
The initial interval has recently been increased to 200ms in Go to avoid spamming the server too much.
Thanks for pointing it out! I don't want to put too many meaningful changes and parameter tuning into this PR, because it's already heavy on refactoring. I will make this change in a follow-up PR.
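To make the tuning discussion concrete, here is a small standalone sketch of how these constants drive the retry backoff (plain Java, not the SDK's RpcRetryOptions builder; the maximum-interval value is an assumption since the snippet above doesn't show its initializer). Switching to the Go-side values would only mean changing 20ms to 200ms and 1.2 to 2.0.

```java
import java.time.Duration;

class BackoffSketch {
  // Mirrors the defaults shown in this PR; 200ms / 2.0 are the Go-side values
  // mentioned in the review comments above.
  static final Duration INITIAL_INTERVAL = Duration.ofMillis(20);
  static final Duration MAXIMUM_INTERVAL = Duration.ofSeconds(6); // assumed cap for this sketch
  static final double BACKOFF_COEFFICIENT = 1.2;

  /** Sleep interval before the given (1-based) retry attempt, capped at the maximum. */
  static Duration intervalForAttempt(int attempt) {
    double millis = INITIAL_INTERVAL.toMillis() * Math.pow(BACKOFF_COEFFICIENT, attempt - 1);
    return Duration.ofMillis(Math.min((long) millis, MAXIMUM_INTERVAL.toMillis()));
  }

  public static void main(String[] args) {
    for (int attempt = 1; attempt <= 5; attempt++) {
      System.out.println("attempt " + attempt + " -> " + intervalForAttempt(attempt));
    }
  }
}
```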
.addDoNotRetry(Status.Code.PERMISSION_DENIED, null)
.addDoNotRetry(Status.Code.UNAUTHENTICATED, null)
.addDoNotRetry(Status.Code.UNIMPLEMENTED, null)
.addDoNotRetry(Status.Code.INTERNAL, QueryFailedFailure.class);
I believe Internal should be retryable. See https://www.notion.so/temporalio/Retryable-gRPC-error-codes-e1324307c4a745839b6c4a84a27373b8
We may need to check with @alexshtin to confirm.
Thanks for pointing it out! I don't want to put too many meaningful changes and parameter tuning into this PR, because it's already heavy on refactoring. I will make this change in a follow-up PR.
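As background for readers, a simplified sketch of how a do-not-retry list of (status code, failure type) pairs like the one above is typically consulted (illustrative only; the SDK's actual matching against the failure details attached to the response is more involved).

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import java.util.List;

class DoNotRetrySketch {
  /** A (status code, optional failure detail type) pair that should stop retries. */
  static final class DoNotRetryRule {
    final Status.Code code;
    final Class<?> failureType; // null means "never retry this code"

    DoNotRetryRule(Status.Code code, Class<?> failureType) {
      this.code = code;
      this.failureType = failureType;
    }
  }

  static boolean isNonRetryable(StatusRuntimeException e, List<DoNotRetryRule> rules) {
    for (DoNotRetryRule rule : rules) {
      if (rule.code != e.getStatus().getCode()) {
        continue;
      }
      if (rule.failureType == null) {
        // Unconditional: this status code is never retried.
        return true;
      }
      // A non-null failure type would additionally require matching the failure
      // details carried by the response; that unpacking is omitted in this sketch.
    }
    return false;
  }
}
```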
Refactor GrpcRetryer into GrpcAsyncRetryer and GrpcSyncRetryer
Async retry code was rewritten to be much simpler and have fewer wrappers and handlers
Implementation is reworked to be more unit-testable
Issue temporalio#653
LGTM, let's fork the discussion about numbers into a separate thread.
What was changed
- GrpcRetryer now respects DEADLINE_EXCEEDED as non-retryable
- Refactor GrpcRetryer into GrpcAsyncRetryer and GrpcSyncRetryer
- Async retry code was rewritten to be much simpler and have fewer wrappers and handlers
- Implementation is reworked to be more unit-testable
Why?
As described in issue #653, right now when we reach the DEADLINE from the gRPC context we keep retrying even though it no longer makes sense: all further requests will fail with DEADLINE_EXCEEDED.
Closes #653: WorkflowExecutionUtils and GrpcRetryer don't respect reaching deadline from GRPC context and downstream DEADLINE_EXCEEDED exceptions
The refactoring also helps to address #650: Add unit tests for WorkflowExecutionUtils#getInstanceCloseEvent
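The heart of the fix can be summarized with a simplified sketch (not the exact GrpcRetryerUtils code): before scheduling another attempt, look at the gRPC deadline and stop once it has expired, and treat DEADLINE_EXCEEDED itself as non-retryable, since every further call is guaranteed to fail the same way.

```java
import io.grpc.Context;
import io.grpc.Deadline;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

class DeadlineAwareRetrySketch {
  /** True if another attempt can still be made before the gRPC deadline expires. */
  static boolean deadlineNotExpired() {
    Deadline deadline = Context.current().getDeadline();
    return deadline == null || !deadline.isExpired();
  }

  /** DEADLINE_EXCEEDED is treated as non-retryable: the expired deadline will not come back. */
  static boolean isRetryable(StatusRuntimeException e) {
    return e.getStatus().getCode() != Status.Code.DEADLINE_EXCEEDED;
  }
}
```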
How was this tested:
Integration tests for workflows with a gRPC deadline expiring before the start of the workflow and in the middle of workflow execution.