API for threshold based retries #35166

Merged

Member

@mbhaskar mbhaskar commented May 26, 2023

Description

Because an end-to-end timeout policy can affect availability through its shorter timeouts, we need a strategy to improve availability when an end-to-end timeout is specified.
This PR introduces threshold-based retry execution of requests to improve availability in that case.

API :

This needs:

  • a list of regions to retry on
  • the time at which the retry is to be triggered

A list of regions to exclude from the request and its retries, for example "East US" or "East US, West US". These regions are excluded from the preferred regions list when executing multi-region retries.

The idea is to let the user specify a list of regions on which a request should not be retried. Ideally, this list is a sublist of the preferred regions.

CosmosItemRequestOptions#setExcludeRegions

CosmosItemRequestOptions options = new CosmosItemRequestOptions();
options.setExcludeRegions(List.of("East US", "West US"));

Exclude regions can be used in two ways.

  1. Setting exclude regions routes the request to the first effective region only (see below), so you can use it to steer a request to a particular region.
    Example scenario:
    Preferred regions: regionA, regionB, regionC
  • Send a request to regionA with an endToEndTimeout of 500 ms.
  • If the request fails, send a new request with regionA in the exclude list.
  • The new request skips regionA and goes only to regionB.

This helps in scenarios where the user wants to try the request on a specific region.
Note that this works only when EndToEndTimeoutPolicy is set.

  2. You can set an availability strategy on CosmosEndToEndOperationLatencyPolicy to get better availability when the original request takes very long to execute.

This PR introduces a strategy called ThresholdBasedAvailabilityStrategy

flowchart TD
    A[Request] -->B(Base Request with timeout policy)
    B --> Threshold{timespent < speculativeThreshold}
    Threshold -->| Yes | C{success <= EndToEndTimeout}
    Threshold --> | No | RemoteProcessing[Start requests to other regions in effective region list at T + Tstep*step-1]
    C -->|Yes| D[Return result]
    C -->|No response| E[Cancel and timeout]
    C --> |error response| F[Retry Flow]
    F --> G{timespent < EndToEndTimeout}
    G --> |No| E
    G --> |Yes| F

ThresholdBasedAvailabilityStrategy contains the following parameters

threshold = Threshold Duration in ms
thresholdStep = Threshold step in ms

Threshold is the duration, in ms, to wait for a response from the original request. If no response arrives within it, a request is issued to the next region in the list of effective regions.

Effective regions are computed as shown below, and retries happen only on available effective regions.

effectiveRegions = (preferredRegions - excludeRegions)
AvailabilityStrategy thresholdStrategy = 
   new ThresholdBasedAvailabilityStrategy( /*threshold:*/ Duration.ofMillis(300),
                 /*thresholdStep:*/ Duration.ofMillis(100));

Example:
Preferred regions: "East US 1", "East US 2", "Central US", "West US 2"
Exclude list: "East US 2"
Number of regions = 2

Effective regions = "East US 1", "Central US"
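
The example above can be reproduced with plain JDK collections. This is only a minimal sketch of the set difference described in this PR, not the SDK's implementation: the class `EffectiveRegions`, the method `compute`, and the `maxRegions` parameter (standing in for the "number of regions" value) are hypothetical names, and exact string matching of region names is an assumption.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class EffectiveRegions {
    // Hypothetical helper: preferredRegions minus excludeRegions,
    // keeping the preferred order and at most maxRegions entries.
    static List<String> compute(List<String> preferred, List<String> excluded, int maxRegions) {
        Set<String> skip = new HashSet<>(excluded);
        List<String> effective = new ArrayList<>();
        for (String region : preferred) {
            if (effective.size() == maxRegions) {
                break; // region budget reached
            }
            if (!skip.contains(region)) {
                effective.add(region);
            }
        }
        return effective;
    }

    public static void main(String[] args) {
        List<String> preferred = List.of("East US 1", "East US 2", "Central US", "West US 2");
        List<String> excluded = List.of("East US 2");
        // prints [East US 1, Central US]
        System.out.println(compute(preferred, excluded, 2));
    }
}
```

"West US 2" is not excluded here; it drops out only because the sketch stops after two regions.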

How to use this?

CosmosEndToEndOperationLatencyPolicyConfigBuilder builder 
   = new CosmosEndToEndOperationLatencyPolicyConfigBuilder(/*isEnabled:*/ true, 
               /*endToEndTimeout:*/ Duration.ofMillis(2000));
builder.setAvailabilityStrategy(thresholdStrategy);
CosmosEndToEndOperationLatencyPolicy policyConfig = builder.build();

Enabling it on the client

CosmosAsyncClient cosmosAsyncClient = new CosmosClientBuilder()
    .endpoint(END_POINT)
    .key(KEY)
    .endToEndOperationLatencyPolicyConfig(policyConfig)
    .buildAsyncClient();

This can be enabled or disabled per operation

CosmosItemRequestOptions requestOptions = new CosmosItemRequestOptions();
requestOptions.setCosmosEndToEndOperationLatencyPolicyConfig(policyConfig);
cosmosAsyncClient.getDatabase(DATABASE)
            .getContainer(CONTAINER)
            .readItem("id1", new PartitionKey("id1"), requestOptions, Person.class).block();

A per-operation option always overrides the client-level option.
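
The precedence rule can be pictured as follows. This is a hedged sketch under stated assumptions, not SDK code: `PolicyPrecedence`, its nested `Policy` class, and `resolve` are hypothetical stand-ins that model only an enabled flag and a timeout.

```java
import java.time.Duration;

class PolicyPrecedence {
    // Hypothetical stand-in for the policy config: an on/off flag and a timeout.
    static final class Policy {
        final boolean enabled;
        final Duration endToEndTimeout;

        Policy(boolean enabled, Duration endToEndTimeout) {
            this.enabled = enabled;
            this.endToEndTimeout = endToEndTimeout;
        }
    }

    // A request-level policy, when present, wins over the client-level one.
    static Policy resolve(Policy clientLevel, Policy requestLevel) {
        return requestLevel != null ? requestLevel : clientLevel;
    }

    public static void main(String[] args) {
        Policy client = new Policy(true, Duration.ofMillis(2000));
        Policy disabledForThisCall = new Policy(false, Duration.ofMillis(2000));

        // No request-level policy: the client-level policy applies.
        System.out.println(resolve(client, null).enabled);                // prints true
        // A request-level policy replaces it, here switching the feature off.
        System.out.println(resolve(client, disabledForThisCall).enabled); // prints false
    }
}
```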

How does this work?

The initial execution on the requested region starts at t0.
If no response has been received at t0 + threshold milliseconds, another request is started on the first region in effectiveRegionList.
If no response has been received at t0 + threshold + thresholdStep milliseconds, another request is started on the next region in effectiveRegionList.

This continues only until the number of regions the user has set in the options is reached.
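
The timing described above can be written out as a schedule of start offsets relative to t0. This is a minimal sketch, not the SDK's scheduler: `HedgingSchedule`, `startOffsetsMillis`, and the `maxRequests` cap are hypothetical names, and the sketch assumes each further request starts one thresholdStep after the previous one.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

class HedgingSchedule {
    // Offsets from t0, in milliseconds, at which each request would be started:
    // request 0 at t0, request 1 at threshold, request k at threshold + (k - 1) * thresholdStep.
    static List<Long> startOffsetsMillis(int maxRequests, Duration threshold, Duration thresholdStep) {
        List<Long> offsets = new ArrayList<>();
        for (int k = 0; k < maxRequests; k++) {
            Duration offset = (k == 0)
                ? Duration.ZERO
                : threshold.plus(thresholdStep.multipliedBy(k - 1));
            offsets.add(offset.toMillis());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // threshold = 300 ms, thresholdStep = 100 ms, at most three requests in flight
        // prints [0, 300, 400]
        System.out.println(startOffsetsMillis(3, Duration.ofMillis(300), Duration.ofMillis(100)));
    }
}
```

In practice a later request is started only if no response has arrived by its offset, and the end-to-end timeout still bounds the whole operation.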

Future:

This could be extended to configure an individual timeout and availability strategy for different kinds of operations, such as point and non-point operations.

Possible API on the client

CosmosE2ELatencyPolicyConfigs newGranularConfigs = new CosmosE2ELatencyPolicyConfigsBuilder()
    .pointTimeoutConfig(CosmosEndToEndOperationLatencyPolicyConfig)
    .queryTimeoutConfig(CosmosEndToEndOperationLatencyPolicyConfig)
    .build();

cosmosClientBuilder.setEndToEndLatencyPolicyConfigs(newGranularConfigs);

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

…ide regions to skip from the retry execution
mbhaskar added 6 commits June 6, 2023 08:37
Refactoring
…based-retries-with-excludelists

# Conflicts:
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosQueryRequestOptions.java
@mbhaskar mbhaskar marked this pull request as ready for review June 8, 2023 15:46
@mbhaskar mbhaskar changed the title Draft of API for threshold based retries API for threshold based retries Jun 8, 2023
@azure-sdk
Collaborator

API change check

APIView has identified API level changes in this PR and created following API reviews.

azure-cosmos

// Now try the same request with West US 2 excluded
options.setExcludeRegions(ImmutableList.of("West US 2"));
cosmosItemResponseMono =
createdContainer.readItem(itemToRead.getId(), new PartitionKey(itemToRead.getMypk()), options, EndToEndTimeOutValidationTests.TestObject.class);
Member

should we verify the contactedRegion?

Member

@kushagraThapar kushagraThapar left a comment

LGTM, thanks @mbhaskar

@kushagraThapar kushagraThapar dismissed FabianMeiswinkel’s stale review June 22, 2023 00:01

Dismissing the review as the feedback comments have been incorporated and resolved.

@mbhaskar
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Thanks - LGTM

@mbhaskar
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Member

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Member

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Member

/check-enforcer override

@mbhaskar mbhaskar merged commit bb3e409 into Azure:main Jun 23, 2023

6 participants