
Override Config aws-smithy-client #567

Closed

ymwjbxxq opened this issue Jun 29, 2022 · 12 comments
Labels
feature-request A feature should be added or improved.

ymwjbxxq commented Jun 29, 2022

Describe the feature

Hi all,

As a follow-up to issue #558, where we set client timeouts so the request can be latency-aware together with retries:

#[tokio::main]
async fn main() -> Result<(), Error> {
    tracing_subscriber::fmt()
        .with_ansi(false)
        .without_time()
        .with_max_level(tracing_subscriber::filter::LevelFilter::DEBUG)
        .init();

    let timeout_config = timeout::Config::new()
        .with_api_timeouts(
            timeout::Api::new()
                .with_call_timeout(TriState::Set(Duration::from_millis(150)))
                .with_call_attempt_timeout(TriState::Set(Duration::from_millis(50))),
        );
    let config = aws_config::from_env()
        .retry_config(RetryConfig::new().with_max_attempts(3))
        .timeout_config(timeout_config)
        .load()
        .await;

    let dynamodb_client = aws_sdk_dynamodb::Client::new(&config);

I have noticed the following:

DEBUG aws_smithy_client::retry: attempt 1 failed with Error(TransientError); retrying after 435.692227ms

And the aws-smithy-client/src/retry.rs is configured as:

const MAX_ATTEMPTS: u32 = 3;
const INITIAL_RETRY_TOKENS: usize = 500;
const RETRY_COST: usize = 5;

And I have no way to override the default parameters.

From the tests, I cannot tell exactly where the initial backoff value comes from:
https://github.com/awslabs/smithy-rs/blob/a0539e20b069a7de021c84521d8f3c7ba098ad6d/rust-runtime/aws-smithy-client/src/retry.rs#L276

It seems to me that these lines of code are not correct:

let backoff = b * (r.pow(self.local.attempts - 1) as f64);
let backoff = Duration::from_secs_f64(backoff).min(self.config.max_backoff);

Because:

  1. it does not take a configurable base value into account when calculating the retry delay
  2. it converts everything to seconds, with a maximum of 20s

It should be based on milliseconds instead: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=7deed195b0811c97887dbcd6c98f1a6e

let exponential_delay = (ms * (r.pow(attempts - 1) as f64)) as u64;
let backoff = Duration::from_millis(exponential_delay).min(Duration::from_secs(20));
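For illustration, here is a self-contained version of that millisecond-based calculation. The base_ms parameter, the r = 2 factor, and the 20s cap follow the snippets above; this is just a sketch, not SDK code:

use std::time::Duration;

// Millisecond-based exponential backoff, as suggested above.
// `base_ms` is a hypothetical configurable base; r = 2 as in retry.rs.
fn backoff_for_attempt(base_ms: u64, attempt: u32) -> Duration {
    let r: u64 = 2;
    let exponential_delay = base_ms * r.pow(attempt - 1);
    Duration::from_millis(exponential_delay).min(Duration::from_secs(20))
}

fn main() {
    for attempt in 1..=3 {
        // a base of 100ms yields 100ms, 200ms, 400ms
        println!("attempt {attempt}: {:?}", backoff_for_attempt(100, attempt));
    }
}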

If we could pass the backoff value, it could become:

let retry_config = RetryConfig::new()
    .with_max_attempts(5)
    .with_backoff_delay(100); // ms (proposed setter)

Use Case

When you are latency-aware, you should be able to override the default parameters that configure the backoff.

In my configuration case, I have the following:

  • 3 retries
  • 50ms for a single attempt
  • an overall duration of 150ms

The backoff, however, delays the second attempt by 435.692227ms:
DEBUG aws_smithy_client::retry: attempt 1 failed with Error(TransientError); retrying after 435.692227ms

This means that it will never retry a second time.
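To make the arithmetic explicit, here is a minimal budget check using the numbers from this report (the 436ms figure is the observed backoff, rounded up):

fn main() {
    let call_timeout_ms: u64 = 150;     // with_call_timeout
    let attempt_timeout_ms: u64 = 50;   // with_call_attempt_timeout
    let observed_backoff_ms: u64 = 436; // "retrying after 435.692227ms"

    // The earliest the second attempt could start: first attempt times out, then backoff.
    let second_attempt_start_ms = attempt_timeout_ms + observed_backoff_ms; // 486ms
    assert!(second_attempt_start_ms > call_timeout_ms); // the retry never gets to run
    println!(
        "second attempt would start at {second_attempt_start_ms}ms, \
         after the {call_timeout_ms}ms call timeout has already expired"
    );
}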

Proposed Solution

Be able to override the backoff policy.

If I wish to have this configuration:

  • 3 retries
  • 50ms for a single attempt

I could set up the backoff policy at 10ms. For example, I would have the following after the first failed attempt:

Retry 1 -> retrying after 20ms
Retry 2 -> retrying after 40ms
Retry 3 -> retrying after 60ms

Now I can set my overall "with_call_timeout" duration to 210ms, giving the retries time to happen (see the budget sketch below).
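A rough budget sketch for that configuration, assuming three attempts in total with a 50ms per-attempt timeout and fixed backoffs of 20ms and 40ms before the retries (matching the 210ms figure above):

fn main() {
    let attempt_timeout_ms: u64 = 50;
    let backoffs_ms: [u64; 2] = [20, 40];

    // 3 attempts * 50ms + 20ms + 40ms = 210ms
    let total_ms: u64 = 3 * attempt_timeout_ms + backoffs_ms.iter().sum::<u64>();
    assert_eq!(total_ms, 210); // a with_call_timeout of 210ms leaves room for every retry
    println!("worst-case total: {total_ms}ms");
}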

Other Information

No response

Acknowledgements

  • I may be able to implement this feature request
  • This feature might incur a breaking change

A note for the community

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue, please leave a comment
ymwjbxxq added the feature-request and needs-triage labels Jun 29, 2022
rcoh (Contributor) commented Jul 1, 2022

Thanks! We should figure out the right knobs to enable quicker retries for services like DynamoDB.

ymwjbxxq (Author) commented Jul 5, 2022

Taking the AWS JavaScript SDK as a reference for timeouts and usage, the current retry logic is incorrect (a bug).
After the first failure, the Rust SDK generates a random delay instead of following the 100ms default setting for all services and the 50ms retry base for DynamoDB.

[Screenshot attached: 2022-07-02 at 17:01:59]

Velfi removed the needs-triage label Jul 5, 2022
Velfi (Contributor) commented Jul 5, 2022

Hey @ymwjbxxq, I'm going to look into this issue. While we (the Rust SDK team) aim for parity with the other SDKs, we don't necessarily want to copy their behavior exactly. There is a relatively new internal-only AWS proposal doc that defines how SDKs should handle retries and also covers service-specific defaults. I'd like to stick to their recommendations as much as possible. I'm also going to do a survey of how multiple SDKs currently handle retries and retry configuration.

While I appreciate your PR, I think removing the random element is not the right approach, and I hope you don't mind me proposing an alternative. This might take me some time, but I want to make sure we solve this problem thoroughly enough that we don't have to revisit it later and cause churn for people with custom retry configurations.

Some extra context on backoff and jitter:

ymwjbxxq (Author) commented Jul 5, 2022

It is ok, do not worry.

Maybe it could be random if I do not override the configuration, but that does not really make sense when the random delay is bigger than the overall timeout.

I will show again how I would use it to achieve high performance regardless of the service:

    let timeout_config = timeout::Config::new()
        .with_api_timeouts(
            timeout::Api::new()
                .with_call_timeout(TriState::Set(Duration::from_millis(270)))
                .with_call_attempt_timeout(TriState::Set(Duration::from_millis(30))),
        );
    let config = aws_config::from_env()
        .retry_config(
            RetryConfig::new()
                .with_max_attempts(3)
                .with_base(50), // my backoff/retry interval (proposed setter)
        )
        .timeout_config(timeout_config)
        .load()
        .await;

Given these settings:

  • 3 retries
  • backoff at 50ms interval
  • a single attempt at 30ms
  • total timeout with all retries included at 270ms

I would expect:

  • original attempt -> 30ms -> fails
  • retry 1 -> after 50ms -> starts at 80ms (+ 30ms attempt) -> fails
  • retry 2 -> after 50ms -> starts at 160ms (+ 30ms attempt) -> fails
  • retry 3 -> after 50ms -> starts at 240ms (+ 30ms attempt) -> fails

Current implementation with these settings:

  • 3 retries
  • a single attempt at 30ms
  • total timeout with all retries included at 270ms

What actually happens:

  • original attempt -> 30ms -> fails
  • retry 1 -> after ~500ms, because the delay is random

The current implementation works against the idea of a timeout: in practice I never get a retry, because I hit the overall timeout first, unless I let the SDK do its own thing and eventually come back with a result.

To sum up:
I agree with you about the research, but my alternative proposal would be:

  • You can use your random logic if nothing is set, and the SDK should work within its default values (which you must document somewhere).
  • You should allow me to override it and have predictable timing.
  • If you also want to use jitter, that is fine; just let me override the values so that I can calculate the predictable (maximum) time, as in the sketch below.
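For example, here is a minimal sketch of that "predictable maximum" idea. It assumes exponential backoff from a configurable base and jitter that adds at most 100% of the exponential delay, mirroring the snippets later in this thread; the names are illustrative, not SDK API:

use std::time::Duration;

// Worst-case total duration for a call, assuming `max_attempts` attempts each bounded
// by `attempt_timeout`, an exponential delay of base * 2^(n-1) before retry n, and
// jitter that never exceeds the exponential delay itself.
fn worst_case_total(base: Duration, attempt_timeout: Duration, max_attempts: u32) -> Duration {
    let mut total = attempt_timeout * max_attempts;
    for retry in 1..max_attempts {
        let exp = base * 2u32.pow(retry - 1);
        total += exp * 2; // exponential delay plus maximal jitter
    }
    total
}

fn main() {
    // base 50ms, 30ms per attempt, 3 attempts -> an upper bound the caller can budget for
    let bound = worst_case_total(Duration::from_millis(50), Duration::from_millis(30), 3);
    println!("worst-case total: {bound:?}"); // 390ms with these numbers
}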

ymwjbxxq (Author) commented Jul 5, 2022

Hi @Velfi

I read your links about backoff and jitter, and I think a FullJitterBackoffStrategy would be:

let exp_backoff = self.config.base as f64 * (r.pow(self.local.attempts - 1) as f64);
let mut rng = rand::thread_rng();
let jitter = rand::Rng::gen_range(&mut rng, 0.0..exp_backoff);
let final_backoff = exp_backoff + jitter;
let backoff = Duration::from_millis(final_backoff as u64).min(self.config.max_backoff);

this will generate something like:

exp_backoff 1.0
jitter 0.5013046685201143
final_backoff 1.5013046685201143

exp_backoff 2.0
jitter 1.0432303030223897
final_backoff 3.0432303030223897

exp_backoff 4.0
jitter 2.2986709031456165
final_backoff 6.2986709031456165

or to change the code:

let exp_backoff = self.config.base * (r.pow(self.local.attempts - 1) as u64);
let jitter = fastrand::u64(..exp_backoff);
let final_backoff = exp_backoff + jitter;
let backoff = Duration::from_millis(final_backoff).min(self.config.max_backoff);

At least then the overall timeout, retries included, is a value that can be calculated for the worst-case scenario.
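For comparison, the canonical "full jitter" strategy from the backoff-and-jitter material referenced earlier samples the whole delay from within the exponential window rather than adding jitter on top of it. A minimal sketch using the fastrand crate (as in the second variant above), with illustrative base and cap parameters:

use std::time::Duration;

// Full jitter: sleep = random_between(0, min(cap, base * 2^attempt)).
fn full_jitter_backoff(base_ms: u64, cap_ms: u64, attempt: u32) -> Duration {
    let exp_backoff = base_ms.saturating_mul(2u64.saturating_pow(attempt)).min(cap_ms);
    // fastrand::u64(..n) draws from [0, n); guard against an empty range
    let sleep_ms = if exp_backoff == 0 { 0 } else { fastrand::u64(..exp_backoff) };
    Duration::from_millis(sleep_ms)
}

fn main() {
    // With base = 50ms and cap = 20s, attempt n sleeps at most min(20s, 50 * 2^n) ms.
    for attempt in 0..3 {
        println!("attempt {attempt}: sleep {:?}", full_jitter_backoff(50, 20_000, attempt));
    }
}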

Anyway, I will wait for you guys.

Velfi (Contributor) commented Jul 6, 2022

@ymwjbxxq Thank you for doing so much leg work! These posts are helpful.

ymwjbxxq (Author) commented Jul 7, 2022

@Velfi I did extra tests:

  • I ran DynamoDB without the SDK retries
  • I implemented (hopefully correctly) FullJitterBackoffStrategy and EqualJitterBackoffStrategy, selected based on the DynamoDB error.

The results match my expectations: a faster response!

My configuration, for example, with max 2 retries:

const MAX_ATTEMPTS: u32 = 3;
const BASE: u32 = 10;
with_call_attempt_timeout: 30ms

What I conclude is that the SDK should maybe apply:

  • EqualJitterBackoffStrategy (which I think is the current implementation) for throttling errors, with the BASE as you have it now: 500ms, up to 20 seconds (I think the 4XX exceptions)
  • FullJitterBackoffStrategy for all the rest (the 5XX exceptions)

The same rule should apply to all services.

To make the SDK's life easier, I would not apply any strategy based on errors but use the FullJitterBackoffStrategy with a BASE of 500ms, and allow the user to override the base. This should cover all the needs of SDK users.
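For reference, "equal jitter" (which the comment above guesses is close to the current behaviour) is usually described as half fixed, half random. A sketch with assumed parameter names, again using fastrand:

use std::time::Duration;

// Equal jitter: half of the capped exponential delay is fixed, the other half is random.
fn equal_jitter_backoff(base_ms: u64, cap_ms: u64, attempt: u32) -> Duration {
    let temp = base_ms.saturating_mul(2u64.saturating_pow(attempt)).min(cap_ms);
    let half = temp / 2;
    let jitter = if half == 0 { 0 } else { fastrand::u64(..=half) };
    Duration::from_millis(half + jitter)
}

Equal jitter keeps a guaranteed minimum delay, which suits throttling; full jitter spreads retries more aggressively, which suits transient failures.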

Velfi (Contributor) commented Jul 7, 2022

Hey @ymwjbxxq, here's the latest:

I've been poring over internal specs and documentation on retries and I think I finally have a pretty good view on the state of things. Here are my findings:

  • The various SDKs use different strategies for handling retries. This creates a negative experience for customers because it's inconsistent. I'll refer to these various retry strategies as "legacy" retries. There is published documentation that tells users how to configure "legacy" retries, but these docs are SDK-specific and not very helpful for understanding the retry strategy of SDKs other than the one they were written for.
  • There is an AWS initiative to standardize all SDKs around a retry strategy based on jitter and exponential backoff. I'll refer to this strategy as "standard" retries. Because the Rust SDK is so new, we implemented the new "standard" retry strategy. We don't plan to support "legacy" retries.
  • Customers (like you) have expressed that the "standard" retry's initial backoff delay is longer than they'd expect for services like DynamoDB.

In the Rust SDK, we have implemented the backoff delay calculation this way:

// b is a random number between 0 and 1
// r is a constant that is always set to 2
let backoff = b * (r.pow(self.local.attempts - 1) as f64);
let backoff = Duration::from_secs_f64(backoff).min(self.config.max_backoff);

Duration::from_secs_f64 is the intended method for creating the backoff delay because the delay is calculated on the scale of seconds, not milliseconds. As you found, this does not match the behavior of the JavaScript SDK. However, the JavaScript SDK's retry strategy is the "legacy" strategy, while the calculation above is defined as part of the new "standard" strategy. It's not incorrect, but I can see how it would be surprising.
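To make the scale concrete, here is a small sketch of the delays that calculation can produce (here b is simply redrawn per retry using the fastrand crate; the 20s max_backoff is the default quoted earlier in this issue):

use std::time::Duration;

fn main() {
    let max_backoff = Duration::from_secs(20);
    let r: u32 = 2;
    for attempts in 1..=3u32 {
        // b is a random number between 0 and 1, as in the snippet above
        let b: f64 = fastrand::f64();
        let backoff = b * (r.pow(attempts - 1) as f64);
        let backoff = Duration::from_secs_f64(backoff).min(max_backoff);
        // attempt 1: up to ~1s, attempt 2: up to ~2s, attempt 3: up to ~4s
        println!("attempt {attempts}: backoff {backoff:?}");
    }
}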

My next subject of investigation is to find out why AWS used to recommend retries on the order of milliseconds when the new strategy is acting on the scale of seconds. Once I can get a solid answer on that, I believe I'll have enough information to decide on a resolution.

Velfi (Contributor) commented Jul 18, 2022

Hey @ymwjbxxq, my retry PR got merged and is now pending release. We'll try to get a release out this week, and then you can let me know whether the update improves things for your use case.

See the PR for more info

ymwjbxxq (Author) commented Jul 18, 2022

Great to hear it, @Velfi.

I will test it once it is out.

jdisanti (Contributor) commented:

This went out in the July 21st SDK release.

Repository owner moved this from Pending Release to Done in AWS Rust SDK Public Roadmap Jul 22, 2022
github-actions (bot) commented:

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
