Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minimum throughput detection incorrectly raises an error #1202

Closed
1 task
GeorgeHahn opened this issue Oct 1, 2024 · 6 comments
Closed
1 task

Minimum throughput detection incorrectly raises an error #1202

GeorgeHahn opened this issue Oct 1, 2024 · 6 comments
Assignees
Labels
bug This issue is a bug. pending-release This issue will be fixed by an approved PR that hasn't been released yet.

Comments

@GeorgeHahn
Copy link

GeorgeHahn commented Oct 1, 2024

Describe the bug

We observe frequent low throughput errors when reading a streamed S3 body in a lambda function. Our theory is that the decompression and business logic for processing the stream cause the async runtime to behave in a CPU-constrained manner and appears to break keepalive on the underlying stream. This appears to be a worst-case scenario for the throughput estimation logic. The attached reproduction returns an error even though reads separated by 120ms are able to receive data.

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

Minimum throughput errors should not be raised unless the connection is unable to provide data.

Current Behavior

A minimum throughput error is incorrectly raised.

Reproduction Steps

Cargo.toml

[package]
name = "s3-throughput-min-repro"
version = "0.1.0"
edition = "2021"

[dependencies]
anyhow = "1.0"
aws-config = { version = "1.5", default-features = false, features = ["client-hyper", "rt-tokio", "rustls"] }
aws-sdk-s3 = { version = "1.42", default-features = false, features = ["rt-tokio", "rustls"] }
clap = { version = "4.5", default-features = false, features = ["std", "derive"]}
tokio = { version = "1.23", default-features = false, features = ["full"] }
tracing = { version = "0.1", default-features = false }
tracing-subscriber = { version = "0.3", default-features = false, features = ["env-filter", "fmt", "std"] }

main.rs

use aws_config::BehaviorVersion;
use aws_sdk_s3::operation::RequestIdExt;
use clap::Parser;
use tokio::io::{AsyncRead, AsyncReadExt};
use tracing::info;

/// Read a file from S3. Reproduces a bug where the AWS SDK returns a low throughput error.
#[derive(Parser, Debug)]
#[command(author, version, about, long_about = None)]
struct Cli {
    /// S3 bucket to read from
    #[arg(long)]
    bucket: String,
    /// S3 path to read
    #[arg(long)]
    key: String,
}

pub async fn remote_object<'a>(
    s3_client: &'a aws_sdk_s3::Client,
    bucket: &str,
    key: &str,
) -> Result<impl AsyncRead, anyhow::Error> {
    let object = s3_client
        .get_object()
        .bucket(bucket)
        .key(key)
        .send()
        .await?;
    if let Some(id) = object.extended_request_id() {
        info!("Request ID: {id}");
    }
    Ok(object.body.into_async_read())
}

async fn repro(opts: Cli) -> Result<(), anyhow::Error> {
    let config = aws_config::defaults(BehaviorVersion::latest())
        .region("us-west-2")
        .load()
        .await;
    let s3_client = aws_sdk_s3::Client::new(&config);

    let mut object = remote_object(&s3_client, &opts.bucket, &opts.key).await?;
    let mut buffer = Vec::with_capacity(5000000);

    // Low throughput detection uses a 10-bin[1] window[2] to track requests
    // over the last second[3,4]. There is a 20 second grace period before a
    // throughput error is raised, implemented as a delay future[5,6]. Errors
    // are suppressed if the stream is not being polled, detected by half of the
    // bins being empty[7].
    //
    // 1: https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-smithy-runtime/src/client/http/body/minimum_throughput/throughput.rs#L273
    // 2: https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-smithy-runtime/src/client/http/body/minimum_throughput/throughput.rs#L334-L339
    // 3: https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-smithy-runtime/src/client/http/body/minimum_throughput/options.rs#L79
    // 4: https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-smithy-runtime/src/client/http/body/minimum_throughput/throughput.rs#L309
    // 5: https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-smithy-runtime-api/src/client/stalled_stream_protection.rs#L16-L20
    // 6: https://github.com/awslabs/aws-sdk-rust/blob/bbf69c2279fa9cdd1af5327ba58c7dad0cb62633/sdk/aws-smithy-runtime/src/client/http/body/minimum_throughput/http_body_0_4_x.rs#L111-L127
    // 7: https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-smithy-runtime/src/client/http/body/minimum_throughput/throughput.rs#L401

    // This repro works by filling the bins with pending requests, which are
    // counted as no data in the throughput calculation. The incoming bytes are
    // logged into bins, but the pending request label is sticky and takes
    // precedence.
    //
    // Full runtime pauses are used to simulate running alongside a CPU-bound
    // workload. (In our case, a lambda that's busy decompressing incoming
    // data.)
    //
    // Seems to me this is effectively the worst-case scenario for slow readers.
    // The throughput estimation does not work if most polls return pending
    // before becoming ready.
    //

    // Send requests just above the resolution limit to fill the throughput buckets
    loop {
        let n = object.read_buf(&mut buffer).await?;
        buffer.clear();

        std::thread::sleep(std::time::Duration::from_millis(120));

        // Good-enough EOF check
        if n == 0 {
            break Ok(());
        }
    }
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    if std::env::var("RUST_LOG").is_err() {
        std::env::set_var("RUST_LOG", "trace");
    }
    tracing_subscriber::fmt::init();
    let opts = Cli::parse();
    let res = repro(opts).await;
    if let Err(e) = res {
        eprintln!("{:?}", e);
    }
}

Possible Solution

No response

Additional Information/Context

No response

Version

├── aws-config v1.5.7
│   ├── aws-credential-types v1.2.1
│   │   ├── aws-smithy-async v1.2.1
│   │   ├── aws-smithy-runtime-api v1.7.2
│   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   ├── aws-smithy-types v1.2.7
│   │   ├── aws-smithy-types v1.2.7 (*)
│   ├── aws-runtime v1.4.3
│   │   ├── aws-credential-types v1.2.1 (*)
│   │   ├── aws-sigv4 v1.2.4
│   │   │   ├── aws-credential-types v1.2.1 (*)
│   │   │   ├── aws-smithy-eventstream v0.60.5
│   │   │   │   ├── aws-smithy-types v1.2.7 (*)
│   │   │   ├── aws-smithy-http v0.60.11
│   │   │   │   ├── aws-smithy-eventstream v0.60.5 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.7.2 (*)
│   │   │   │   ├── aws-smithy-types v1.2.7 (*)
│   │   │   ├── aws-smithy-runtime-api v1.7.2 (*)
│   │   │   ├── aws-smithy-types v1.2.7 (*)
│   │   ├── aws-smithy-async v1.2.1 (*)
│   │   ├── aws-smithy-eventstream v0.60.5 (*)
│   │   ├── aws-smithy-http v0.60.11 (*)
│   │   ├── aws-smithy-runtime v1.7.1
│   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   ├── aws-smithy-http v0.60.11 (*)
│   │   │   ├── aws-smithy-runtime-api v1.7.2 (*)
│   │   │   ├── aws-smithy-types v1.2.7 (*)
│   │   ├── aws-smithy-runtime-api v1.7.2 (*)
│   │   ├── aws-smithy-types v1.2.7 (*)
│   │   ├── aws-types v1.3.3
│   │   │   ├── aws-credential-types v1.2.1 (*)
│   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   ├── aws-smithy-runtime-api v1.7.2 (*)
│   │   │   ├── aws-smithy-types v1.2.7 (*)
│   ├── aws-sdk-sts v1.44.0
│   │   ├── aws-credential-types v1.2.1 (*)
│   │   ├── aws-runtime v1.4.3 (*)
│   │   ├── aws-smithy-async v1.2.1 (*)
│   │   ├── aws-smithy-http v0.60.11 (*)
│   │   ├── aws-smithy-json v0.60.7
│   │   │   └── aws-smithy-types v1.2.7 (*)
│   │   ├── aws-smithy-query v0.60.7
│   │   │   ├── aws-smithy-types v1.2.7 (*)
│   │   ├── aws-smithy-runtime v1.7.1 (*)
│   │   ├── aws-smithy-runtime-api v1.7.2 (*)
│   │   ├── aws-smithy-types v1.2.7 (*)
│   │   ├── aws-smithy-xml v0.60.9
│   │   ├── aws-types v1.3.3 (*)
│   ├── aws-smithy-async v1.2.1 (*)
│   ├── aws-smithy-http v0.60.11 (*)
│   ├── aws-smithy-json v0.60.7 (*)
│   ├── aws-smithy-runtime v1.7.1 (*)
│   ├── aws-smithy-runtime-api v1.7.2 (*)
│   ├── aws-smithy-types v1.2.7 (*)
│   ├── aws-types v1.3.3 (*)
├── aws-sdk-s3 v1.52.0
│   ├── aws-credential-types v1.2.1 (*)
│   ├── aws-runtime v1.4.3 (*)
│   ├── aws-sigv4 v1.2.4 (*)
│   ├── aws-smithy-async v1.2.1 (*)
│   ├── aws-smithy-checksums v0.60.12
│   │   ├── aws-smithy-http v0.60.11 (*)
│   │   ├── aws-smithy-types v1.2.7 (*)
│   ├── aws-smithy-eventstream v0.60.5 (*)
│   ├── aws-smithy-http v0.60.11 (*)
│   ├── aws-smithy-json v0.60.7 (*)
│   ├── aws-smithy-runtime v1.7.1 (*)
│   ├── aws-smithy-runtime-api v1.7.2 (*)
│   ├── aws-smithy-types v1.2.7 (*)
│   ├── aws-smithy-xml v0.60.9 (*)
│   ├── aws-types v1.3.3 (*)


### Environment details (OS name and version, etc.)

Repro tested on macos. We're hitting this in production on x86_64 lambdas running al2023 and 375 MiB of memory.

### Logs

_No response_
@GeorgeHahn GeorgeHahn added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Oct 1, 2024
@ysaito1001
Copy link
Collaborator

Hi @GeorgeHahn, thank you for bringing this to our attention and for providing detailed reproduction steps. We will investigate to see if we can first replicate the issue on our end.

@ysaito1001 ysaito1001 removed the needs-triage This issue or PR still needs to be triaged. label Oct 2, 2024
@ysaito1001
Copy link
Collaborator

Still trying to reproduce it on our end. In the meantime, could you provide us with the following information?

  • The size of an object stored in s3 that tends to trigger a stall stream protection error, given the above reproduction code
  • Snippet of a trace log, if available, leading up to a stall stream protection error you're observing (please be careful to not include sensitive information in the log)

@GeorgeHahn
Copy link
Author

The objects in this case are typically 4-10MiB.

Log attached. I removed the config and setup log lines for simplicity. Let me know if there are any details you'd like to see from that part.
repro.log

@ysaito1001
Copy link
Collaborator

Thank you, we've reproduced the issue on our end. Will further investigate it.

@ysaito1001 ysaito1001 self-assigned this Oct 9, 2024
github-merge-queue bot pushed a commit to smithy-lang/smithy-rs that referenced this issue Oct 9, 2024
## Motivation and Context
awslabs/aws-sdk-rust#1202

## Description
The issue above demonstrated the incorrect
[BinLabel](https://github.com/smithy-lang/smithy-rs/blob/07fe426697cc30ad613568902528c305f953deb1/rust-runtime/aws-smithy-runtime/src/client/http/body/minimum_throughput/throughput.rs#L100-L119)
ordering in
[LogBuffer](https://github.com/smithy-lang/smithy-rs/blob/07fe426697cc30ad613568902528c305f953deb1/rust-runtime/aws-smithy-runtime/src/client/http/body/minimum_throughput/throughput.rs#L173-L183),
the underlying data structure we use for stall stream protection.

The following trace logs are generated from executing the reproduction
steps in the issue above. In the file labeled "no_sleep," we have
commented out
`std::thread::sleep(std::time::Duration::from_millis(120));` from the
reproducer so the updated code can be tested as the happy path.


[s3_throughput_min_repro_no_sleep.log](https://github.com/user-attachments/files/17299373/s3_throughput_min_repro_no_sleep.log)


[s3_throughput_min_repro_with_sleep.log](https://github.com/user-attachments/files/17299447/s3_throughput_min_repro_with_sleep.log)

In both files, it’s important to note that `Bin`s assigned
`TransferredBytes` can be overwritten by `Pending` due to
[`ThroughputLogs::push`](https://github.com/smithy-lang/smithy-rs/blob/07fe426697cc30ad613568902528c305f953deb1/rust-runtime/aws-smithy-runtime/src/client/http/body/minimum_throughput/throughput.rs#L346).
Once a `Bin` is labeled as `Pending`, it cannot be re-labeled.

When this occurs, the only way to avoid the stall stream protection
check going into the grace period is for time to advance beyond the
current `Bin`'s resolution, the `LogBuffer` pushes a new `Bin` during
[`catch_up`](https://github.com/smithy-lang/smithy-rs/blob/07fe426697cc30ad613568902528c305f953deb1/rust-runtime/aws-smithy-runtime/src/client/http/body/minimum_throughput/throughput.rs#L355)
, and this new `Bin` hopefully [gets
assigned](https://github.com/smithy-lang/smithy-rs/blob/07fe426697cc30ad613568902528c305f953deb1/rust-runtime/aws-smithy-runtime/src/client/http/body/minimum_throughput/http_body_0_4_x.rs#L78-L79)
a `TransferredBytes`. However, this new `Bin` could also be overwritten
by Pending in a subsequent call to
[`MinimumThroughputDownloadBody::poll_data`](https://github.com/smithy-lang/smithy-rs/blob/07fe426697cc30ad613568902528c305f953deb1/rust-runtime/aws-smithy-runtime/src/client/http/body/minimum_throughput/http_body_0_4_x.rs#L78-L79),
which can trigger the the grace period if the overall `LogBuffer` looks
like it's violated the stall stream protection check.

The reproducer without sleep does not fail the stall stream protection
obviously because the execution completes way before the grace period
ends, but more importantly because the execution periodically assigns
new `TransferredBytes` `Bin`s in the throughput logs. This effectively
resets the grace period for the stall stream protection (search for
`throughput recovered; exiting grace period` in the
`s3_throughput_min_repro_no_sleep.log`). However, with sleep, `Bin`s
labeled as `TransferredBytes` are frequently (and almost immediately)
overwritten by `Pending`. This results in the execution being unable to
exit the grace period, ultimately leading to a stall stream protection
error.

To resolve this, we make `TransferredBytes` be the top priority in
`BinLabel`. This means once a new `Bin` has earned `TransferredBytes`,
it's green for that time resolution and that it should not be revoked by
`Pending` overwriting it to make it look like no bytes transferred
during that time.

## Testing
- Existing tests in CI
- Added unit tests for `BinLabel` ordering and for `ThroughputLogs`
- Passed the customer's reproduction step
- To confirm the stall stream protection for download still works, I
switched off WiFi while running the customer's reproducer (with sleep)
and it successfully failed with the stall stream protection error:
```
---- s3_throughput_min_repro stdout ----
2024-10-08T23:29:24.999477Z DEBUG aws_smithy_runtime::client::http::body::minimum_throughput::http_body_0_4_x: current throughput: 0 B/s is below minimum: 1 B/s
2024-10-08T23:29:24.999513Z TRACE aws_smithy_runtime::client::http::body::minimum_throughput::http_body_0_4_x: received poll pending
2024-10-08T23:29:24.999530Z DEBUG aws_smithy_runtime::client::http::body::minimum_throughput::http_body_0_4_x: current throughput: 0 B/s is below minimum: 1 B/s
2024-10-08T23:29:25.081811Z TRACE aws_smithy_runtime::client::http::body::minimum_throughput::http_body_0_4_x: received poll pending
2024-10-08T23:29:25.081938Z DEBUG aws_smithy_runtime::client::http::body::minimum_throughput::http_body_0_4_x: current throughput: 0 B/s is below minimum: 1 B/s
test s3_throughput_min_repro ... FAILED
...
called `Result::unwrap()` on an `Err` value: Custom { kind: Other, error: Error { kind: StreamingError(ThroughputBelowMinimum { expected: Throughput { bytes_read: 1, per_time_elapsed: 1s }, actual: Throughput { bytes_read: 0, per_time_elapsed: 1s } }) } }
```

## Checklist
<!--- If a checkbox below is not applicable, then please DELETE it
rather than leaving it unchecked -->
- [x] For changes to the smithy-rs codegen or runtime crates, I have
created a changelog entry Markdown file in the `.changelog` directory,
specifying "client," "server," or both in the `applies_to` key.
- [x] For changes to the AWS SDK, generated SDK code, or SDK runtime
crates, I have created a changelog entry Markdown file in the
`.changelog` directory, specifying "aws-sdk-rust" in the `applies_to`
key.

----

_By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice._
@ysaito1001 ysaito1001 added the pending-release This issue will be fixed by an approved PR that hasn't been released yet. label Oct 9, 2024
@landonxjames
Copy link
Contributor

This should be fixed in today's release of the aws-sdk-s3 crate (1.55.00). Thank you for the reproduction, we've used it to add an integration test so we can detect if this regresses in the future!

Fix: smithy-lang/smithy-rs#3871
Integration test based on your repro: smithy-lang/smithy-rs#3874

Copy link

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. pending-release This issue will be fixed by an approved PR that hasn't been released yet.
Projects
None yet
Development

No branches or pull requests

3 participants