Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standard retry backoff calculation may panic #1133

Closed
trueb2 opened this issue Apr 24, 2024 · 8 comments
Closed

Standard retry backoff calculation may panic #1133

trueb2 opened this issue Apr 24, 2024 · 8 comments
Assignees
Labels
bug This issue is a bug. p2 This is a standard priority issue pending-release This issue will be fixed by an approved PR that hasn't been released yet.

Comments

@trueb2
Copy link

trueb2 commented Apr 24, 2024

Describe the bug

The standard retry calculation may cause panic by attempting to convert an invalid float into a Duration.

Expected Behavior

No panics. The operation should fail with an error gracefully.

Current Behavior

thread 'actix-server worker 7' panicked at /rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/time.rs:762:23:
can not convert float seconds to Duration: value is either too big or NaN
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::time::Duration::from_secs_f64::panic_cold_display
   3: <aws_smithy_runtime::client::retries::strategy::standard::StandardRetryStrategy as aws_smithy_runtime_api::client::retries::RetryStrategy>::should_attempt_retry
   4: aws_smithy_runtime::client::orchestrator::try_op::{{closure}}::{{closure}}
   5: <aws_smithy_async::future::timeout::Timeout<T,S> as core::future::future::Future>::poll
   6: <aws_smithy_runtime::client::timeout::MaybeTimeoutFuture<InnerFuture> as core::future::future::Future>::poll
   7: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
   8: aws_smithy_runtime::client::orchestrator::invoke_with_stop_point::{{closure}}
   9: aws_sdk_s3::operation::get_object::builders::GetObjectFluentBuilder::send::{{closure}}

Reproduction Steps

Here is an example retry configuration that eventually hit panics on get object requests.

aws-config = { version = "1.1.3", features = ["behavior-version-latest"] }
aws-sdk-s3 = "1.13.0"
      let config: SdkConfig = aws_config::from_env()
          .region(region)
          .endpoint_url(host.clone())
          .retry_config(
              RetryConfig::standard()
                  .with_initial_backoff(Duration::from_millis(1))
                  .with_max_attempts(100)
                  .with_reconnect_mode(
                      aws_sdk_s3::config::retry::ReconnectMode::ReconnectOnTransientError,
                  ),
          )
          .timeout_config(
              TimeoutConfig::builder()
                  .operation_attempt_timeout(Duration::from_secs(180))
                  .operation_timeout(Duration::from_secs(600))
                  .build(),
          )
          .load()
          .await;

      let s3_client: Client = Client::new(&config);

Possible Solution

Err if the duration is invalid before attempting to create the Duration at

Ok(Duration::from_secs_f64(backoff).min(retry_cfg.max_backoff()))

Additional Information/Context

Timeouts typically happen sporadically over time and depend on external factors. I noticed the panics after many requests ran happily through the common code paths. It is difficult to exercise these code paths in real world tests because 99.999 ... % of requests succeed within N attempts and a reasonable time window.

Version

├── aws-config v1.1.4
│   ├── aws-credential-types v1.1.4
│   │   ├── aws-smithy-async v1.1.4
│   │   ├── aws-smithy-runtime-api v1.1.4
│   │   │   ├── aws-smithy-async v1.1.4 (*)
│   │   │   ├── aws-smithy-types v1.1.4
│   │   ├── aws-smithy-types v1.1.4 (*)
│   ├── aws-runtime v1.1.4
│   │   ├── aws-credential-types v1.1.4 (*)
│   │   ├── aws-sigv4 v1.1.4
│   │   │   ├── aws-credential-types v1.1.4 (*)
│   │   │   ├── aws-smithy-eventstream v0.60.4
│   │   │   │   ├── aws-smithy-types v1.1.4 (*)
│   │   │   ├── aws-smithy-http v0.60.4
│   │   │   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   │   │   ├── aws-smithy-types v1.1.4 (*)
│   │   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-smithy-async v1.1.4 (*)
│   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   ├── aws-smithy-http v0.60.4 (*)
│   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-types v1.1.4
│   │   │   ├── aws-credential-types v1.1.4 (*)
│   │   │   ├── aws-smithy-async v1.1.4 (*)
│   │   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   │   ├── aws-smithy-types v1.1.4 (*)
│   ├── aws-sdk-sso v1.12.0
│   │   ├── aws-credential-types v1.1.4 (*)
│   │   ├── aws-runtime v1.1.4 (*)
│   │   ├── aws-smithy-async v1.1.4 (*)
│   │   ├── aws-smithy-http v0.60.4 (*)
│   │   ├── aws-smithy-json v0.60.4
│   │   │   └── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-smithy-runtime v1.1.4
│   │   │   ├── aws-smithy-async v1.1.4 (*)
│   │   │   ├── aws-smithy-http v0.60.4 (*)
│   │   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-types v1.1.4 (*)
│   ├── aws-sdk-ssooidc v1.12.0
│   │   ├── aws-credential-types v1.1.4 (*)
│   │   ├── aws-runtime v1.1.4 (*)
│   │   ├── aws-smithy-async v1.1.4 (*)
│   │   ├── aws-smithy-http v0.60.4 (*)
│   │   ├── aws-smithy-json v0.60.4 (*)
│   │   ├── aws-smithy-runtime v1.1.4 (*)
│   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-types v1.1.4 (*)
│   ├── aws-sdk-sts v1.12.0
│   │   ├── aws-credential-types v1.1.4 (*)
│   │   ├── aws-runtime v1.1.4 (*)
│   │   ├── aws-smithy-async v1.1.4 (*)
│   │   ├── aws-smithy-http v0.60.4 (*)
│   │   ├── aws-smithy-json v0.60.4 (*)
│   │   ├── aws-smithy-query v0.60.4
│   │   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-smithy-runtime v1.1.4 (*)
│   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-smithy-xml v0.60.4
│   │   ├── aws-types v1.1.4 (*)
│   ├── aws-smithy-async v1.1.4 (*)
│   ├── aws-smithy-http v0.60.4 (*)
│   ├── aws-smithy-json v0.60.4 (*)
│   ├── aws-smithy-runtime v1.1.4 (*)
│   ├── aws-smithy-runtime-api v1.1.4 (*)
│   ├── aws-smithy-types v1.1.4 (*)
│   ├── aws-types v1.1.4 (*)
├── aws-sdk-s3 v1.14.0
│   ├── aws-credential-types v1.1.4 (*)
│   ├── aws-runtime v1.1.4 (*)
│   ├── aws-sigv4 v1.1.4 (*)
│   ├── aws-smithy-async v1.1.4 (*)
│   ├── aws-smithy-checksums v0.60.4
│   │   ├── aws-smithy-http v0.60.4 (*)
│   │   ├── aws-smithy-types v1.1.4 (*)
│   ├── aws-smithy-eventstream v0.60.4 (*)
│   ├── aws-smithy-http v0.60.4 (*)
│   ├── aws-smithy-json v0.60.4 (*)
│   ├── aws-smithy-runtime v1.1.4 (*)
│   ├── aws-smithy-runtime-api v1.1.4 (*)
│   ├── aws-smithy-types v1.1.4 (*)
│   ├── aws-smithy-xml v0.60.4 (*)
│   ├── aws-types v1.1.4 (*)
│   ├── aws-config v1.1.4 (*)
│   ├── aws-sdk-s3 v1.14.0 (*)

Environment details (OS name and version, etc.)

macOS 14.3

Logs

thread 'actix-server worker 7' panicked at /rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/time.rs:762:23:
can not convert float seconds to Duration: value is either too big or NaN
stack backtrace:
   2: core::time::Duration::from_secs_f64::panic_cold_display
   3: <aws_smithy_runtime::client::retries::strategy::standard::StandardRetryStrategy as aws_smithy_runtime_api::client::retries::RetryStrategy>::should_attempt_retry
@trueb2 trueb2 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Apr 24, 2024
@trueb2
Copy link
Author

trueb2 commented Apr 24, 2024

I looked into this a little more. If too many attempts are made, then the backoff will be f64::MAX from

https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-smithy-runtime/src/client/retries/strategy/standard.rs#L291

f64::MAX then is used immediately causing the panic. The work around for now is to never specify attempts and initial backoffs that could result in f64::MAX calculated.

@jdisanti jdisanti added p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Apr 24, 2024
@jdisanti
Copy link
Contributor

Thank you for filing this issue.

@ysaito1001
Copy link
Collaborator

Hi @trueb2, we're working on a fix to address this panic issue. In the meantime, we have a question for your use case.

In general, specifying the number of retry attempts to be 100 does not seem to be a normal workflow. What is the motivation behind this number? Is that out of necessity or are you simply testing the SDK's behavior to see what happens with the number of attempts being 100?

@trueb2
Copy link
Author

trueb2 commented May 13, 2024

The use case for this is that there can be network failures or outages that are random but eventually should succeed. So the initial retry duration should be very low and many reattempts should be possible. Choosing the appropriate maximum input values that do not induce panic is opaque, so 100 seems a reasonable number of max attempts given that the max backoff is an input and the total operation timeout can bounded by an absolute maximum duration. Exponential backoff does not make sense after a certain number of backoffs, but the other parameters exposed by the SDK should make it possible to avoid panicking or unreasonable delays.

I did find that the high retry attempt limits were out of necessity as I have hit the maximum backoff limit causing this panic multiple times over the last month. I was not testing the SDK's behavior.

github-merge-queue bot pushed a commit to smithy-lang/smithy-rs that referenced this issue May 13, 2024
…rom a large float (#3621)

## Motivation and Context
Avoids panic when [Duration for exponential
backoff](https://github.com/smithy-lang/smithy-rs/blob/main/rust-runtime/aws-smithy-runtime/src/client/retries/strategy/standard.rs#L150)
could not be created from a large float.

## Description

[Duration::from_secs_f64](https://doc.rust-lang.org/core/time/struct.Duration.html#method.from_secs_f64)
may panic. This PR switches to use a fallible sibling
`Duration::try_from_secs_f64` to avoid panics. If
`Duration::try_from_secs_f64` returns an `Err` we fallback to
`max_backoff` for subsequent retries. Furthermore, we learned from
internal discussion that jitter also needs to be applied to
`max_backoff`. This PR updates `calculate_exponential_backoff` to handle
all the said business logic in one place.

## Testing
- Added a unit test
`should_not_panic_when_exponential_backoff_duration_could_not_be_created`
- Manually verified reproduction steps provided [in the original
PR](awslabs/aws-sdk-rust#1133)
<details> 
<summary>More details</summary>

```
#[tokio::test]
async fn repro_1133() {
    let config = aws_config::from_env()
        .region(Region::new("non-existing-region")) // forces retries
        .retry_config(
            RetryConfig::standard()
                .with_initial_backoff(Duration::from_millis(1))
                .with_max_attempts(100),
        )
        .timeout_config(
            TimeoutConfig::builder()
                .operation_attempt_timeout(Duration::from_secs(180))
                .operation_timeout(Duration::from_secs(600))
                .build(),
        )
        .load()
        .await;

    let client: Client = Client::new(&config);
    let res = client
        .list_objects_v2()
        .bucket("bucket-name-does-not-matter")
        .send()
        .await;

    dbg!(res);
}
```

Without changes in this PR:
```
---- repro_1133 stdout ----
thread 'repro_1133' panicked at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/time.rs:741:23:
can not convert float seconds to Duration: value is either too big or NaN
stack backtrace:
...

failures:
    repro_1133

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 9 filtered out; finished in 338.18s
```

With changes in this PR:
```
// no panic
---- repro_1133 stdout ----
res = Err(
    TimeoutError(
        TimeoutError {
            source: MaybeTimeoutError {
                kind: Operation,
                duration: 600s,
            },
        },
    ),
)
```
</details> 

## Checklist
<!--- If a checkbox below is not applicable, then please DELETE it
rather than leaving it unchecked -->
- [x] I have updated `CHANGELOG.next.toml` if I made changes to the
smithy-rs codegen or runtime crates
- [x] I have updated `CHANGELOG.next.toml` if I made changes to the AWS
SDK, generated SDK code, or SDK runtime crates

## Appendix
<details> 
<summary>runtime-versioner bug fix</summary>

This PR also fixes a limitation in `runtime-versioner audit`. I included
the fix in the PR because the issue occurred with special conditions,
and we don't get to reproduce it every day. The way the issue manifests
is this.

1. We have a branch from the main whose latest release tag at the time
was `release-2024-04-30`
2. The main has moved ahead and a new `smithy-rs` release has been made
`release-2024-05-08`
3. We perform `git merge main`, pre-commit hooks run, and we then get
`audit` failures from `runtime-versioner`:
```
2024-05-10T16:54:36.434968Z  WARN runtime_versioner::command::audit: there are newer releases since 'release-2024-04-30'
aws-config was changed and version bumped, but the new version number (1.4.0) has already been published to crates.io. Choose a new version number.
aws-runtime was changed and version bumped, but the new version number (1.2.2) has already been published to crates.io. Choose a new version number.
aws-smithy-runtime-api was changed and version bumped, but the new version number (1.6.0) has already been published to crates.io. Choose a new version number.
aws-smithy-types was changed and version bumped, but the new version number (1.1.9) has already been published to crates.io. Choose a new version number.
aws-types was changed and version bumped, but the new version number (1.2.1) has already been published to crates.io. Choose a new version number.
Error: there are audit failures in the runtime crates
```
This happens because when the latest `main` is being merged to our
branch, `runtime-versioner audit` should use `previous_release_tag`
`release-2024-05-08` to perform audit but [pre-commit hooks run the
tool](https://github.com/smithy-lang/smithy-rs/blob/main/.pre-commit-hooks/runtime-versioner.sh#L8)
using the latest previous release tag that can be traced back from
`HEAD` of our branch, which is `release-2024-04-30`. Hence the error.

The fix adds an environment variable
`SMITHY_RS_RUNTIME_VERSIONER_AUDIT_PREVIOUS_RELEASE_TAG` to specify a
previous release tag override, in addition to a `--previous-release-tag`
command-line argument.
Furthermore, the fix has relaxed certain checks in `audit`. Taking our
example for instance, when `HEAD` is now behind `release-2024-05-08`,
it's OK to fail even if `release-2024-05-08` is not the ancestor of
`HEAD` (as stated above, `git merge-base --is-ancestor` does not know
that while main is being merged) as long as `release-2024-04-28` (the
latest release seen from `HEAD`) is the ancestor of
`release-2024-05-08`.

```
   release-2024-04-28               release-2024-05-08           
            │                                │                   
────────────┼───────────────┬────────────────┼───────x─────► main
            │               │HEAD            │       x           
                            │ of                     x           
                            │branch                  x           
                            │                        x           
                            │                        x           
                            │              xxxxxxxxxxx           
                            │              x                     
                            │              x  git merge main     
                            │              x                     
                            │              ▼                     
                            │                                    
                            └──────────────────► feature branch  
```
To use the fix, set the environment variable with the new release tag
and perform `git merge main`:
```
➜  smithy-rs git:(ysaito/fix-panic-in-exp-backoff) ✗ export SMITHY_RS_RUNTIME_VERSIONER_AUDIT_PREVIOUS_RELEASE_TAG=release-2024-05-08
➜  smithy-rs git:(ysaito/fix-panic-in-exp-backoff) ✗ git merge main
     ...
     Running `/Users/awsaito/src/smithy-rs/tools/target/debug/runtime-versioner audit`
2024-05-10T19:32:26.665578Z  WARN runtime_versioner::tag: expected previous release to be 'release-2024-04-30', but 'release-2024-05-08' was specified. Proceeding with 'release-2024-05-08'.
SUCCESS
```

</summary>

----

_By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice._
@ysaito1001
Copy link
Collaborator

ysaito1001 commented May 13, 2024

Thank you for your response. smithy-lang/smithy-rs#3621 will fix a panic in a way that it should still allow you to retry the desired number of times.

Exponential backoff does not make sense after a certain number of backoffs,

Correct, once an exponential backoff duration becomes large enough to exceed MAX_BACKOFF (by default 20 seconds), the SDK will use that duration for subsequent retries (jitter is applied to MAX_BACKOFF though).

@ysaito1001 ysaito1001 added the pending-release This issue will be fixed by an approved PR that hasn't been released yet. label May 13, 2024
@trueb2
Copy link
Author

trueb2 commented May 14, 2024

@ysaito1001 Thanks for your thorough resolution!

@ysaito1001
Copy link
Collaborator

The fix is included in release-2024-05-22.

Copy link

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. p2 This is a standard priority issue pending-release This issue will be fixed by an approved PR that hasn't been released yet.
Projects
None yet
Development

No branches or pull requests

3 participants