
Override Config aws-smithy-client #567

Closed

ymwjbxxq opened this issue Jun 29, 2022 · 12 comments
Labels
feature-request A feature should be added or improved.

ymwjbxxq commented Jun 29, 2022

Describe the feature

Hi all,

As a follow-up to issue #558, where we set client timeouts so the request can be latency-aware together with retries:

#[tokio::main]
async fn main() -> Result<(), Error> {
    tracing_subscriber::fmt()
        .with_ansi(false)
        .without_time()
        .with_max_level(tracing_subscriber::filter::LevelFilter::DEBUG)
        .init();

    let timeout_config = timeout::Config::new()
        .with_api_timeouts(
            timeout::Api::new()
                .with_call_timeout(TriState::Set(Duration::from_millis(150)))
                .with_call_attempt_timeout(TriState::Set(Duration::from_millis(50))),
        );
    let config = aws_config::from_env()
        .retry_config(RetryConfig::new().with_max_attempts(3))
        .timeout_config(timeout_config)
        .load()
        .await;

    let dynamodb_client = aws_sdk_dynamodb::Client::new(&config);

I have noticed the following:

DEBUG aws_smithy_client::retry: attempt 1 failed with Error(TransientError); retrying after 435.692227ms

And the aws-smithy-client/src/retry.rs is configured as:

const MAX_ATTEMPTS: u32 = 3;
const INITIAL_RETRY_TOKENS: usize = 500;
const RETRY_COST: usize = 5;

And I have no way to override the default parameters.

From the tests, I cannot tell exactly where the initial backoff value comes from:
https://github.com/awslabs/smithy-rs/blob/a0539e20b069a7de021c84521d8f3c7ba098ad6d/rust-runtime/aws-smithy-client/src/retry.rs#L276

It seems to me that these lines of code are not correct:

let backoff = b * (r.pow(self.local.attempts - 1) as f64);
let backoff = Duration::from_secs_f64(backoff).min(self.config.max_backoff);

Because:

  1. it does not take a configurable base value into account when calculating the retry delay
  2. it converts everything to seconds, with a maximum of 20s

It should be based on milliseconds instead: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=7deed195b0811c97887dbcd6c98f1a6e

let exponential_delay = (ms * (r.pow(attempts - 1) as f64)) as u64;
let backoff = Duration::from_millis(exponential_delay).min(Duration::from_secs(20));
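For illustration, here is a self-contained version of that millisecond-based calculation. The base_ms parameter, the r = 2 factor, and the 20s cap follow the snippets above; this is just a sketch, not SDK code:

use std::time::Duration;

// Millisecond-based exponential backoff, as suggested above.
// `base_ms` is a hypothetical configurable base; r = 2 as in retry.rs.
fn backoff_for_attempt(base_ms: u64, attempt: u32) -> Duration {
    let r: u64 = 2;
    let exponential_delay = base_ms * r.pow(attempt - 1);
    Duration::from_millis(exponential_delay).min(Duration::from_secs(20))
}

fn main() {
    for attempt in 1..=3 {
        // a base of 100ms yields 100ms, 200ms, 400ms
        println!("attempt {attempt}: {:?}", backoff_for_attempt(100, attempt));
    }
}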

If we could pass the backoff value, it could become:

let retry_config = RetryConfig::new()
    .with_max_attempts(5)
    .with_backoff_delay(100); // ms (proposed setter)

Use Case

When you are latency-aware, you should be able to override the default parameters that configure the backoff.

In my configuration case, I have the following:

  • 3 retries
  • 50ms for a single attempt
  • an overall duration of 150ms

The backoff, however, delays the second attempt by 435.692227ms:
DEBUG aws_smithy_client::retry: attempt 1 failed with Error(TransientError); retrying after 435.692227ms

This means that it will never retry a second time.
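To make the arithmetic explicit, here is a minimal budget check using the numbers from this report (the 436ms figure is the observed backoff, rounded up):

fn main() {
    let call_timeout_ms: u64 = 150;     // with_call_timeout
    let attempt_timeout_ms: u64 = 50;   // with_call_attempt_timeout
    let observed_backoff_ms: u64 = 436; // "retrying after 435.692227ms"

    // The earliest the second attempt could start: first attempt times out, then backoff.
    let second_attempt_start_ms = attempt_timeout_ms + observed_backoff_ms; // 486ms
    assert!(second_attempt_start_ms > call_timeout_ms); // the retry never gets to run
    println!(
        "second attempt would start at {second_attempt_start_ms}ms, \
         after the {call_timeout_ms}ms call timeout has already expired"
    );
}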

Proposed Solution

Be able to override the backoff policy.

If I wish to have this configuration:

  • 3 retries
  • 50ms for a single attempt

I could set up the backoff policy at 10ms. For example, I would have the following after the first failed attempt:

Retry 1 -> retrying after 20ms
Retry 2 -> retrying after 40ms
Retry 3 -> retrying after 60ms

Now I can set my overall "with_call_timeout" duration to 210ms, giving the retries time to happen (see the budget sketch below).
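A rough budget sketch for that configuration, assuming three attempts in total with a 50ms per-attempt timeout and fixed backoffs of 20ms and 40ms before the retries (matching the 210ms figure above):

fn main() {
    let attempt_timeout_ms: u64 = 50;
    let backoffs_ms: [u64; 2] = [20, 40];

    // 3 attempts * 50ms + 20ms + 40ms = 210ms
    let total_ms: u64 = 3 * attempt_timeout_ms + backoffs_ms.iter().sum::<u64>();
    assert_eq!(total_ms, 210); // a with_call_timeout of 210ms leaves room for every retry
    println!("worst-case total: {total_ms}ms");
}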

Other Information

No response

Acknowledgements

  • I may be able to implement this feature request
  • This feature might incur a breaking change

A note for the community

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue, please leave a comment
ymwjbxxq added the feature-request and needs-triage labels Jun 29, 2022
rcoh (Contributor) commented Jul 1, 2022

Thanks! We should figure out the right knobs to enable quicker retries for services like DynamoDB.

ymwjbxxq (Author) commented Jul 5, 2022

Taking the AWS JavaScript SDK as a reference for timeouts and usage, the current retry logic is incorrect (a bug).
After the first failure, the Rust SDK generates a random delay instead of following the 100ms default setting for all services and the 50ms retry base for DynamoDB.

[Screenshot attached: 2022-07-02 at 17:01:59]

Velfi removed the needs-triage label Jul 5, 2022
Velfi (Contributor) commented Jul 5, 2022

Hey @ymwjbxxq, I'm going to look into this issue. While we (the Rust SDK team) aim for parity with the other SDKs, we don't necessarily want to copy their behavior exactly. There is a relatively new internal-only AWS proposal doc that defines how SDKs should handle retries and also covers service-specific defaults. I'd like to stick to their recommendations as much as possible. I'm also going to do a survey of how multiple SDKs currently handle retries and retry configuration.

While I appreciate your PR, I think removing the random element is not the right approach, and I hope you don't mind me proposing an alternative. This might take me some time, but I want to make sure we solve this problem thoroughly enough that we don't have to revisit it later and cause churn for people with custom retry configurations.

Some extra context on backoff and jitter:

ymwjbxxq (Author) commented Jul 5, 2022

It is ok, do not worry.

Maybe it could be random if I do not override the configuration, but that does not really make sense when the random delay is bigger than the overall timeout.

I will show again how I would use it to achieve high performance regardless of the service:

    let timeout_config = timeout::Config::new()
        .with_api_timeouts(
            timeout::Api::new()
                .with_call_timeout(TriState::Set(Duration::from_millis(270)))
                .with_call_attempt_timeout(TriState::Set(Duration::from_millis(30))),
        );
    let config = aws_config::from_env()
        .retry_config(
            RetryConfig::new()
                .with_max_attempts(3)
                .with_base(50), // my backoff/retry interval (proposed setter)
        )
        .timeout_config(timeout_config)
        .load()
        .await;

Given these settings:

  • 3 retries
  • backoff at 50ms interval
  • a single attempt at 30ms
  • total timeout with all retries included at 270ms

I would expect:

  • original attempt -> 30ms -> fails
  • retry 1 -> after 50ms -> starts at 80ms (+ 30ms attempt) -> fails
  • retry 2 -> after 50ms -> starts at 160ms (+ 30ms attempt) -> fails
  • retry 3 -> after 50ms -> starts at 240ms (+ 30ms attempt) -> fails

Current implementation with these settings:

  • 3 retries
  • a single attempt at 30ms
  • total timeout with all retries included at 270ms

What actually happens:

  • original attempt -> 30ms -> fails
  • retry 1 -> after ~500ms, because the delay is random

The current implementation works against the idea of a timeout: in practice I never get a retry, because I hit the overall timeout first, unless I let the SDK do its own thing and eventually come back with a result.

To sum up:
I agree with you about the research, but my alternative proposal would be:

  • You can use your random logic if nothing is set, and the SDK should work within its default values (which you must document somewhere).
  • You should allow me to override it and have predictable timing.
  • If you also want to use jitter, that is fine; just let me override the values so that I can calculate the predictable (maximum) time, as in the sketch below.
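For example, here is a minimal sketch of that "predictable maximum" idea. It assumes exponential backoff from a configurable base and jitter that adds at most 100% of the exponential delay, mirroring the snippets later in this thread; the names are illustrative, not SDK API:

use std::time::Duration;

// Worst-case total duration for a call, assuming `max_attempts` attempts each bounded
// by `attempt_timeout`, an exponential delay of base * 2^(n-1) before retry n, and
// jitter that never exceeds the exponential delay itself.
fn worst_case_total(base: Duration, attempt_timeout: Duration, max_attempts: u32) -> Duration {
    let mut total = attempt_timeout * max_attempts;
    for retry in 1..max_attempts {
        let exp = base * 2u32.pow(retry - 1);
        total += exp * 2; // exponential delay plus maximal jitter
    }
    total
}

fn main() {
    // base 50ms, 30ms per attempt, 3 attempts -> an upper bound the caller can budget for
    let bound = worst_case_total(Duration::from_millis(50), Duration::from_millis(30), 3);
    println!("worst-case total: {bound:?}"); // 390ms with these numbers
}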

ymwjbxxq (Author) commented Jul 5, 2022

Hi @Velfi

I read your links about backoff and jitter, and I think a FullJitterBackoffStrategy would be:

let exp_backoff = self.config.base as f64 * (r.pow(self.local.attempts - 1) as f64);
let mut rng = rand::thread_rng();
let jitter = rand::Rng::gen_range(&mut rng, 0.0..exp_backoff);
let final_backoff = exp_backoff + jitter;
let backoff = Duration::from_millis(final_backoff as u64).min(self.config.max_backoff);

this will generate something like:

exp_backoff 1.0
jitter 0.5013046685201143
final_backoff 1.5013046685201143

exp_backoff 2.0
jitter 1.0432303030223897
final_backoff 3.0432303030223897

exp_backoff 4.0
jitter 2.2986709031456165
final_backoff 6.2986709031456165

or to change the code:

let exp_backoff = self.config.base * (r.pow(self.local.attempts - 1) as u64);
let jitter = fastrand::u64(..exp_backoff);
let final_backoff = exp_backoff + jitter;
let backoff = Duration::from_millis(final_backoff).min(self.config.max_backoff);

At least then the overall timeout, retries included, is a value that can be calculated for the worst-case scenario.
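For comparison, the canonical "full jitter" strategy from the backoff-and-jitter material referenced earlier samples the whole delay from within the exponential window rather than adding jitter on top of it. A minimal sketch using the fastrand crate (as in the second variant above), with illustrative base and cap parameters:

use std::time::Duration;

// Full jitter: sleep = random_between(0, min(cap, base * 2^attempt)).
fn full_jitter_backoff(base_ms: u64, cap_ms: u64, attempt: u32) -> Duration {
    let exp_backoff = base_ms.saturating_mul(2u64.saturating_pow(attempt)).min(cap_ms);
    // fastrand::u64(..n) draws from [0, n); guard against an empty range
    let sleep_ms = if exp_backoff == 0 { 0 } else { fastrand::u64(..exp_backoff) };
    Duration::from_millis(sleep_ms)
}

fn main() {
    // With base = 50ms and cap = 20s, attempt n sleeps at most min(20s, 50 * 2^n) ms.
    for attempt in 0..3 {
        println!("attempt {attempt}: sleep {:?}", full_jitter_backoff(50, 20_000, attempt));
    }
}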

Anyway, I will wait for you guys.

Velfi (Contributor) commented Jul 6, 2022

@ymwjbxxq Thank you for doing so much leg work! These posts are helpful.

ymwjbxxq (Author) commented Jul 7, 2022

@Velfi I did extra tests:

  • I ran DynamoDB without the SDK retries
  • I implemented (hopefully correctly) FullJitterBackoffStrategy and EqualJitterBackoffStrategy, selected based on the DynamoDB error.

The results match my expectations: a faster response!

My configuration, for example, with max 2 retries:

const MAX_ATTEMPTS: u32 = 3;
const BASE: u32 = 10;
with_call_attempt_timeout: 30ms

What I conclude is that the SDK should maybe apply:

  • EqualJitterBackoffStrategy (which I think is the current implementation) for throttling errors, with the BASE as you have it now: 500ms, up to 20 seconds (I think the 4XX exceptions)
  • FullJitterBackoffStrategy for all the rest (the 5XX exceptions)

The same rule should apply to all services.

To make the SDK's life easier, I would not apply any strategy based on errors but use the FullJitterBackoffStrategy with a BASE of 500ms, and allow the user to override the base. This should cover all the needs of SDK users.
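For reference, "equal jitter" (which the comment above guesses is close to the current behaviour) is usually described as half fixed, half random. A sketch with assumed parameter names, again using fastrand:

use std::time::Duration;

// Equal jitter: half of the capped exponential delay is fixed, the other half is random.
fn equal_jitter_backoff(base_ms: u64, cap_ms: u64, attempt: u32) -> Duration {
    let temp = base_ms.saturating_mul(2u64.saturating_pow(attempt)).min(cap_ms);
    let half = temp / 2;
    let jitter = if half == 0 { 0 } else { fastrand::u64(..=half) };
    Duration::from_millis(half + jitter)
}

Equal jitter keeps a guaranteed minimum delay, which suits throttling; full jitter spreads retries more aggressively, which suits transient failures.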

Velfi (Contributor) commented Jul 7, 2022

Hey @ymwjbxxq, here's the latest:

I've been poring over internal specs and documentation on retries and I think I finally have a pretty good view on the state of things. Here are my findings:

  • The various SDKs use different strategies for handling retries. This creates a negative experience for customers because it's inconsistent. I'll refer to these various retry strategies as "legacy" retries. There is published documentation that tells users how to configure "legacy" retries, but these docs are SDK-specific and not very helpful for understanding the retry strategy of SDKs other than the one they were written for.
  • There is an AWS initiative to standardize all SDKs around a retry strategy based on jitter and exponential backoff. I'll refer to this strategy as "standard" retries. Because the Rust SDK is so new, we implemented the new "standard" retry strategy. We don't plan to support "legacy" retries.
  • Customers (like you) have expressed that the "standard" retry's initial backoff delay is longer than they'd expect for services like DynamoDB.

In the Rust SDK, we have implemented the backoff delay calculation this way:

// b is a random number between 0 and 1
// r is a constant that is always set to 2
let backoff = b * (r.pow(self.local.attempts - 1) as f64);
let backoff = Duration::from_secs_f64(backoff).min(self.config.max_backoff);

Duration::from_secs_f64 is the intended method for creating the backoff delay because the delay is calculated on the scale of seconds, not milliseconds. As you found, this does not match the behavior of the JavaScript SDK. However, the JavaScript SDK's retry strategy is the "legacy" strategy, while the calculation above is defined as part of the new "standard" strategy. It's not incorrect, but I can see how it would be surprising.
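To make the scale concrete, here is a small sketch of the delays that calculation can produce (here b is simply redrawn per retry using the fastrand crate; the 20s max_backoff is the default quoted earlier in this issue):

use std::time::Duration;

fn main() {
    let max_backoff = Duration::from_secs(20);
    let r: u32 = 2;
    for attempts in 1..=3u32 {
        // b is a random number between 0 and 1, as in the snippet above
        let b: f64 = fastrand::f64();
        let backoff = b * (r.pow(attempts - 1) as f64);
        let backoff = Duration::from_secs_f64(backoff).min(max_backoff);
        // attempt 1: up to ~1s, attempt 2: up to ~2s, attempt 3: up to ~4s
        println!("attempt {attempts}: backoff {backoff:?}");
    }
}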

My next subject of investigation is to find out why AWS used to recommend retries on the order of milliseconds when the new strategy is acting on the scale of seconds. Once I can get a solid answer on that, I believe I'll have enough information to decide on a resolution.

Velfi (Contributor) commented Jul 18, 2022

Hey @ymwjbxxq, my retry PR got merged and is now pending release. We'll try to get a release out this week, and then you can let me know whether the update improves things for your use case.

See the PR for more info

ymwjbxxq (Author) commented Jul 18, 2022

Great to hear it, @Velfi.

I will test it once it is out.

jdisanti (Contributor) commented:

This went out in the July 21st SDK release.

Repository owner moved this from Pending Release to Done in AWS Rust SDK Public Roadmap Jul 22, 2022
github-actions (bot) commented:

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
