SessionTokenMismatchRetryPolicy optimization through customer supplied region switch hints #35292
LGTM, thanks
LGTM, thanks @jeet1995
Background
Customers have reported incidents, supported by diagnostics, in which the service returns a storm of 404 / 1002 (NOT_FOUND / READ_SESSION_NOT_AVAILABLE) errors.
Investigation of our SDK-internal retry policies shows that whenever this error occurs, the SDK first sends the request to each replica of the physical partition in a specific region. The SessionTokenMismatchRetryPolicy then repeats this cycle over the replicas of the same region with exponential backoff for a certain time period. Only after the retry window of the SessionTokenMismatchRetryPolicy elapses is the request retried in a different region by the ClientRetryPolicy (specific to read requests).
As a result, the same replica receives the request multiple times, which can add latency before a different region is leveraged.
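The retry order described above can be sketched as a small simulation. This is only an illustrative model, not the SDK's implementation: the class name, method name, backoff values and the "region/replica-N" labels are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only: every name and number here is hypothetical,
// not taken from the azure-cosmos code base.
final class RetryOrderSketch {

    // Returns the ordered list of targets a read request would be sent to:
    // all replicas of the first (local) region are cycled with exponential
    // backoff until the retry window elapses, and only then is the next
    // region tried.
    static List<String> planAttempts(List<String> regions,
                                     int replicasPerRegion,
                                     long initialBackoffMs,
                                     long maxBackoffMs,
                                     long retryWindowMs) {
        List<String> attempts = new ArrayList<>();
        String localRegion = regions.get(0);
        long elapsedMs = 0;
        long backoffMs = initialBackoffMs;

        // Session-mismatch retries stay pinned to the local region.
        while (elapsedMs < retryWindowMs) {
            for (int replica = 0; replica < replicasPerRegion; replica++) {
                attempts.add(localRegion + "/replica-" + replica);
            }
            elapsedMs += backoffMs;                           // wait before the next cycle
            backoffMs = Math.min(backoffMs * 2, maxBackoffMs); // exponential backoff, capped
        }

        // Only after the window elapses does the cross-region retry happen.
        if (regions.size() > 1) {
            attempts.add(regions.get(1) + "/replica-0");
        }
        return attempts;
    }
}
```

With two regions, three replicas per region, a 5 ms initial backoff and a 30 ms window, this model sends nine requests to local replicas (three full cycles) before the first request reaches the second region, which is the latency cost the PR targets.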
Why do READ_SESSION_NOT_AVAILABLE errors occur?
These errors occur when the session token in the request carries a higher GlobalLSN value, or a higher LocalLSN value (specific to multi-write accounts), than that of the session token in the response. This points to either a lagging region or a lagging replica.
Possible workarounds
The question now is whether a different region can be leveraged sooner, or whether it is necessary to cycle through all replicas in a given region first.
What this PR adds
This PR allows application developers to configure hints through a SessionRetryOptions instance. The hint signals to the SDK whether to pin retries on the local region or to move to a remote region sooner, especially when READ_SESSION_NOT_AVAILABLE errors are thrown.
Public API surface changes
Build a SessionRetryOptions instance.
Set the SessionRetryOptions instance on the CosmosClientBuilder instance.
System config provided
When REMOTE_REGION_PREFERRED is set as the region switch hint, set the following JVM config:
Reasoning about region switch hints
[Table: the scenarios in which READ_SESSION_NOT_AVAILABLE errors occur, each compared under the LOCAL_REGION_PREFERRED and REMOTE_REGION_PREFERRED hints.]
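One way to reason about the two hints is as a decision made after each failed attempt in the local region. The sketch below is a hypothetical model, not the SDK's logic: the hint names come from this PR, while the class, the method and the idea of a fixed cap on local attempts are assumptions made for illustration.

```java
// Hypothetical decision model; only the two hint names come from the PR.
final class RegionSwitchHintSketch {

    enum Hint { LOCAL_REGION_PREFERRED, REMOTE_REGION_PREFERRED }

    // Decides whether the next retry should go to a remote region.
    // maxLocalAttemptsWhenRemotePreferred is an assumed cap, not a real SDK setting.
    static boolean shouldSwitchRegion(Hint hint,
                                      int localAttempts,
                                      int maxLocalAttemptsWhenRemotePreferred,
                                      long elapsedMs,
                                      long retryWindowMs) {
        // Either hint: once the retry window has elapsed, leave the local region.
        if (elapsedMs >= retryWindowMs) {
            return true;
        }
        // REMOTE_REGION_PREFERRED: give up on the local region early,
        // as soon as the capped number of local attempts is used up.
        return hint == Hint.REMOTE_REGION_PREFERRED
            && localAttempts >= maxLocalAttemptsWhenRemotePreferred;
    }
}
```

In this model, LOCAL_REGION_PREFERRED keeps the existing behaviour of exhausting the retry window locally, which suits cases where the local region is expected to catch up quickly, whereas REMOTE_REGION_PREFERRED trades extra cross-region hops for a faster exit from a lagging region.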