Standard retry backoff calculation may panic #1133
Comments
I looked into this a little more. If too many attempts are made, the computed backoff reaches f64::MAX, and f64::MAX is then used immediately, causing the panic. The workaround for now is to never specify attempt counts and initial backoffs that could result in f64::MAX being calculated.
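For context, here is a minimal standalone sketch (not SDK code; the numbers are made up) of why an oversized backoff value panics: `Duration::from_secs_f64` panics on values it cannot represent.

```rust
use std::time::Duration;

fn main() {
    // Made-up numbers: a 1 ms initial backoff that doubles on every attempt.
    let initial_backoff_secs = 0.001_f64;
    let attempt = 100;
    // 0.001 * 2^100 is roughly 1.3e27 seconds, far more than Duration can hold;
    // with even more doublings the product saturates to f64::INFINITY.
    let backoff_secs = initial_backoff_secs * 2_f64.powi(attempt);
    // Panics: "can not convert float seconds to Duration: value is either too big or NaN"
    let _backoff = Duration::from_secs_f64(backoff_secs);
}
```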
Thank you for filing this issue.
Hi @trueb2, we're working on a fix to address this panic issue. In the meantime, we have a question about your use case. In general, specifying the number of retry attempts to be 100 does not seem to be a normal workflow. What is the motivation behind this number? Is it out of necessity, or are you simply testing the SDK's behavior to see what happens when the number of attempts is 100?
The use case for this is that there can be network failures or outages that are random but should eventually succeed. So the initial retry duration should be very low and many reattempts should be possible. Choosing the appropriate maximum input values that do not induce a panic is opaque, so 100 seems a reasonable number of max attempts, given that the max backoff is an input and the total operation timeout can be bounded by an absolute maximum duration. Exponential backoff does not make sense after a certain number of backoffs, but the other parameters exposed by the SDK should make it possible to avoid panicking or unreasonable delays. I did find that the high retry attempt limits were out of necessity, as I have hit the maximum backoff limit causing this panic multiple times over the last month. I was not testing the SDK's behavior.
…rom a large float (#3621)

## Motivation and Context
Avoids panic when [Duration for exponential backoff](https://github.com/smithy-lang/smithy-rs/blob/main/rust-runtime/aws-smithy-runtime/src/client/retries/strategy/standard.rs#L150) could not be created from a large float.

## Description
[Duration::from_secs_f64](https://doc.rust-lang.org/core/time/struct.Duration.html#method.from_secs_f64) may panic. This PR switches to the fallible sibling `Duration::try_from_secs_f64` to avoid panics. If `Duration::try_from_secs_f64` returns an `Err`, we fall back to `max_backoff` for subsequent retries. Furthermore, we learned from internal discussion that jitter also needs to be applied to `max_backoff`. This PR updates `calculate_exponential_backoff` to handle all of the said business logic in one place.

## Testing
- Added a unit test `should_not_panic_when_exponential_backoff_duration_could_not_be_created`
- Manually verified the reproduction steps provided [in the original issue](awslabs/aws-sdk-rust#1133)

<details>
<summary>More details</summary>

```
#[tokio::test]
async fn repro_1133() {
    let config = aws_config::from_env()
        .region(Region::new("non-existing-region")) // forces retries
        .retry_config(
            RetryConfig::standard()
                .with_initial_backoff(Duration::from_millis(1))
                .with_max_attempts(100),
        )
        .timeout_config(
            TimeoutConfig::builder()
                .operation_attempt_timeout(Duration::from_secs(180))
                .operation_timeout(Duration::from_secs(600))
                .build(),
        )
        .load()
        .await;
    let client: Client = Client::new(&config);
    let res = client
        .list_objects_v2()
        .bucket("bucket-name-does-not-matter")
        .send()
        .await;
    dbg!(res);
}
```

Without the changes in this PR:
```
---- repro_1133 stdout ----
thread 'repro_1133' panicked at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/time.rs:741:23:
can not convert float seconds to Duration: value is either too big or NaN
stack backtrace:
...
failures:
    repro_1133

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 9 filtered out; finished in 338.18s
```

With the changes in this PR:
```
// no panic
---- repro_1133 stdout ----
res = Err(
    TimeoutError(
        TimeoutError {
            source: MaybeTimeoutError {
                kind: Operation,
                duration: 600s,
            },
        },
    ),
)
```
</details>

## Checklist
- [x] I have updated `CHANGELOG.next.toml` if I made changes to the smithy-rs codegen or runtime crates
- [x] I have updated `CHANGELOG.next.toml` if I made changes to the AWS SDK, generated SDK code, or SDK runtime crates

## Appendix
<details>
<summary>runtime-versioner bug fix</summary>

This PR also fixes a limitation in `runtime-versioner audit`. I included the fix in this PR because the issue occurred under special conditions, and we don't get to reproduce it every day.

The way the issue manifests is this:
1. We have a branch from main whose latest release tag at the time was `release-2024-04-30`.
2. Main has moved ahead and a new `smithy-rs` release, `release-2024-05-08`, has been made.
3. We perform `git merge main`, pre-commit hooks run, and we then get `audit` failures from `runtime-versioner`:
```
2024-05-10T16:54:36.434968Z WARN runtime_versioner::command::audit: there are newer releases since 'release-2024-04-30'
aws-config was changed and version bumped, but the new version number (1.4.0) has already been published to crates.io. Choose a new version number.
aws-runtime was changed and version bumped, but the new version number (1.2.2) has already been published to crates.io. Choose a new version number.
aws-smithy-runtime-api was changed and version bumped, but the new version number (1.6.0) has already been published to crates.io. Choose a new version number.
aws-smithy-types was changed and version bumped, but the new version number (1.1.9) has already been published to crates.io. Choose a new version number.
aws-types was changed and version bumped, but the new version number (1.2.1) has already been published to crates.io. Choose a new version number.
Error: there are audit failures in the runtime crates
```
This happens because when the latest `main` is being merged into our branch, `runtime-versioner audit` should use the `previous_release_tag` `release-2024-05-08` to perform the audit, but [pre-commit hooks run the tool](https://github.com/smithy-lang/smithy-rs/blob/main/.pre-commit-hooks/runtime-versioner.sh#L8) using the latest previous release tag that can be traced back from `HEAD` of our branch, which is `release-2024-04-30`. Hence the error.

The fix adds an environment variable `SMITHY_RS_RUNTIME_VERSIONER_AUDIT_PREVIOUS_RELEASE_TAG` to specify a previous release tag override, in addition to a `--previous-release-tag` command-line argument. Furthermore, the fix relaxes certain checks in `audit`. Taking our example, when `HEAD` is behind `release-2024-05-08`, it is OK not to fail even if `release-2024-05-08` is not an ancestor of `HEAD` (as stated above, `git merge-base --is-ancestor` does not know that while main is being merged), as long as `release-2024-04-28` (the latest release seen from `HEAD`) is an ancestor of `release-2024-05-08`.

(Diagram: main carries the tags `release-2024-04-28` and `release-2024-05-08`; the feature branch forks from main after `release-2024-04-28` and before `release-2024-05-08`, and `git merge main` later merges main, including `release-2024-05-08`, into the feature branch.)

To use the fix, set the environment variable to the new release tag and perform `git merge main`:
```
➜ smithy-rs git:(ysaito/fix-panic-in-exp-backoff) ✗ export SMITHY_RS_RUNTIME_VERSIONER_AUDIT_PREVIOUS_RELEASE_TAG=release-2024-05-08
➜ smithy-rs git:(ysaito/fix-panic-in-exp-backoff) ✗ git merge main
...
Running `/Users/awsaito/src/smithy-rs/tools/target/debug/runtime-versioner audit`
2024-05-10T19:32:26.665578Z WARN runtime_versioner::tag: expected previous release to be 'release-2024-04-30', but 'release-2024-05-08' was specified. Proceeding with 'release-2024-05-08'.
SUCCESS
```
</details>

----
_By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice._
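A rough sketch of the fallback behavior that PR describes (this is not the actual `calculate_exponential_backoff` from aws-smithy-runtime; the function name, signature, and jitter handling here are simplifying assumptions):

```rust
use std::time::Duration;

// Hypothetical helper mirroring the behavior described above; the real
// implementation in aws-smithy-runtime differs in its details.
fn backoff_with_fallback(
    jitter: f64,          // random factor in [0.0, 1.0], assumed to be supplied by the caller
    initial_backoff: f64, // seconds
    attempt: u32,
    max_backoff: Duration,
) -> Duration {
    Duration::try_from_secs_f64(initial_backoff * 2_f64.powi(attempt as i32))
        // Fallible conversion: on overflow or NaN, fall back to max_backoff instead of panicking.
        .unwrap_or(max_backoff)
        // Cap at max_backoff, then apply jitter to whichever duration was selected.
        .min(max_backoff)
        .mul_f64(jitter)
}

fn main() {
    // 1 ms initial backoff at the 100th attempt: the raw value is far too large
    // for Duration, so the fallback path is taken and no panic occurs.
    let d = backoff_with_fallback(0.5, 0.001, 100, Duration::from_secs(20));
    println!("{d:?}"); // 10s
}
```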
Thank you for your response. smithy-lang/smithy-rs#3621 will fix the panic in a way that should still allow you to retry the desired number of times.
Correct, once an exponential backoff duration becomes large enough that it cannot be represented as a `Duration`, subsequent retries will fall back to `max_backoff` (with jitter applied).
@ysaito1001 Thanks for your thorough resolution!
The fix is included in release-2024-05-22.
Comments on closed issues are hard for our team to see.
Describe the bug
The standard retry backoff calculation may panic when it attempts to convert an invalid (too large or NaN) float into a Duration.
Expected Behavior
No panics. The operation should fail gracefully with an error.
Current Behavior
Reproduction Steps
Here is an example retry configuration that eventually hit panics on get object requests.
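The sketch below is an assumed stand-in for such a configuration, adapted from the repro test in smithy-rs#3621 quoted above; the region, bucket, and key are placeholder values.

```rust
use std::time::Duration;

use aws_config::retry::RetryConfig;
use aws_config::timeout::TimeoutConfig;
use aws_config::Region;
use aws_sdk_s3::Client;

#[tokio::main]
async fn main() {
    // A very small initial backoff combined with a high attempt cap: after enough
    // doublings the computed backoff exceeds what Duration::from_secs_f64 accepts,
    // which panicked before the fix in smithy-rs#3621.
    let config = aws_config::from_env()
        .region(Region::new("us-east-1")) // placeholder region
        .retry_config(
            RetryConfig::standard()
                .with_initial_backoff(Duration::from_millis(1))
                .with_max_attempts(100),
        )
        .timeout_config(
            TimeoutConfig::builder()
                .operation_attempt_timeout(Duration::from_secs(180))
                .operation_timeout(Duration::from_secs(600))
                .build(),
        )
        .load()
        .await;

    let client = Client::new(&config);
    // Placeholder bucket/key; the panic surfaced on GetObject calls that kept retrying.
    let res = client
        .get_object()
        .bucket("example-bucket")
        .key("example-key")
        .send()
        .await;
    dbg!(&res);
}
```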
Possible Solution
Err if the duration is invalid before attempting to create the Duration at `aws-sdk-rust/sdk/aws-smithy-runtime/src/client/retries/strategy/standard.rs` (line 150 in 2f2715b).
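A minimal sketch of that suggestion, assuming a hypothetical error type (the fix that actually shipped in smithy-rs#3621 instead falls back to `max_backoff`):

```rust
use std::time::Duration;

// Hypothetical error type, for illustration only.
#[derive(Debug)]
struct InvalidBackoff(String);

fn checked_backoff(backoff_secs: f64) -> Result<Duration, InvalidBackoff> {
    // Err instead of panicking when the value is too large, NaN, or negative.
    Duration::try_from_secs_f64(backoff_secs)
        .map_err(|e| InvalidBackoff(format!("invalid backoff of {backoff_secs} seconds: {e}")))
}

fn main() {
    assert!(checked_backoff(1.5).is_ok());
    // Duration::from_secs_f64 would panic here; the fallible variant returns Err instead.
    assert!(checked_backoff(f64::MAX).is_err());
}
```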
Additional Information/Context
Timeouts typically happen sporadically over time and depend on external factors. I noticed the panics after many requests ran happily through the common code paths. It is difficult to exercise these code paths in real world tests because 99.999 ... % of requests succeed within N attempts and a reasonable time window.
Version
Environment details (OS name and version, etc.)
macOS 14.3
Logs