S3 get_object() call sometimes returns partial response #1079

Closed
exabytes18 opened this issue Feb 28, 2024 · 10 comments
Labels
bug This issue is a bug. p2 This is a standard priority issue

Comments

@exabytes18

Describe the bug

We are occasionally seeing S3 get_object() calls return incomplete data.

async fn fetch_object(&self, s3_file_reference: &S3FileReference) -> Result<Value> {
    let response = self
        .s3_client
        .get_object()
        .bucket(s3_file_reference.s3_bucket_name.clone())
        .key(s3_file_reference.s3_key.clone())
        .send()
        .await?;

    // Buffer the full body, then compare the byte count actually received
    // against the Content-Length header reported by S3.
    let bytes = response.body.collect().await?.into_bytes();
    if let Some(content_length) = response.content_length {
        let num_bytes_received = bytes.len() as i64;
        if num_bytes_received != content_length {
            return Err(anyhow!(
                "Downloaded s3://{}/{}; expected content length: {}, got: {}",
                s3_file_reference.s3_bucket_name,
                s3_file_reference.s3_key,
                content_length,
                num_bytes_received
            ));
        }
    }

    Ok(serde_json::from_slice(&bytes)?)
}

Example error:

Downloaded s3://[REDACTED]/[REDACTED]/1709094648119.json; expected content length: 2659912, got: 503174

Expected Behavior

get_object() should return the full object.

Current Behavior

get_object() sometimes returns incomplete data.

Reproduction Steps

I do not have a minimal reproduction script available right now.

Possible Solution

No response

Additional Information/Context

No response

Version

│   │   ├── aws-config v1.1.7
│   │   │   ├── aws-credential-types v1.1.7
│   │   │   │   ├── aws-smithy-async v1.1.7
│   │   │   │   ├── aws-smithy-runtime-api v1.1.7
│   │   │   │   │   ├── aws-smithy-async v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-types v1.1.7
│   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   ├── aws-runtime v1.1.7
│   │   │   │   ├── aws-credential-types v1.1.7 (*)
│   │   │   │   ├── aws-sigv4 v1.1.7
│   │   │   │   │   ├── aws-credential-types v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-eventstream v0.60.4
│   │   │   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-http v0.60.6
│   │   │   │   │   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   │   │   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   │   ├── aws-smithy-async v1.1.7 (*)
│   │   │   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   │   │   ├── aws-smithy-http v0.60.6 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   │   ├── aws-types v1.1.7
│   │   │   │   │   ├── aws-credential-types v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-async v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   ├── aws-sdk-sso v1.15.0
│   │   │   │   ├── aws-credential-types v1.1.7 (*)
│   │   │   │   ├── aws-runtime v1.1.7 (*)
│   │   │   │   ├── aws-smithy-async v1.1.7 (*)
│   │   │   │   ├── aws-smithy-http v0.60.6 (*)
│   │   │   │   ├── aws-smithy-json v0.60.6
│   │   │   │   │   └── aws-smithy-types v1.1.7 (*)
│   │   │   │   ├── aws-smithy-runtime v1.1.7
│   │   │   │   │   ├── aws-smithy-async v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-http v0.60.6 (*)
│   │   │   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   │   ├── aws-types v1.1.7 (*)
│   │   │   ├── aws-sdk-ssooidc v1.15.0
│   │   │   │   ├── aws-credential-types v1.1.7 (*)
│   │   │   │   ├── aws-runtime v1.1.7 (*)
│   │   │   │   ├── aws-smithy-async v1.1.7 (*)
│   │   │   │   ├── aws-smithy-http v0.60.6 (*)
│   │   │   │   ├── aws-smithy-json v0.60.6 (*)
│   │   │   │   ├── aws-smithy-runtime v1.1.7 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   │   ├── aws-types v1.1.7 (*)
│   │   │   ├── aws-sdk-sts v1.15.0
│   │   │   │   ├── aws-credential-types v1.1.7 (*)
│   │   │   │   ├── aws-runtime v1.1.7 (*)
│   │   │   │   ├── aws-smithy-async v1.1.7 (*)
│   │   │   │   ├── aws-smithy-http v0.60.6 (*)
│   │   │   │   ├── aws-smithy-json v0.60.6 (*)
│   │   │   │   ├── aws-smithy-query v0.60.6
│   │   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   │   ├── aws-smithy-runtime v1.1.7 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   │   ├── aws-smithy-xml v0.60.6
│   │   │   │   ├── aws-types v1.1.7 (*)
│   │   │   ├── aws-smithy-async v1.1.7 (*)
│   │   │   ├── aws-smithy-http v0.60.6 (*)
│   │   │   ├── aws-smithy-json v0.60.6 (*)
│   │   │   ├── aws-smithy-runtime v1.1.7 (*)
│   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   ├── aws-types v1.1.7 (*)
│   │   ├── aws-sdk-s3 v1.17.0
│   │   │   ├── aws-credential-types v1.1.7 (*)
│   │   │   ├── aws-runtime v1.1.7 (*)
│   │   │   ├── aws-sigv4 v1.1.7 (*)
│   │   │   ├── aws-smithy-async v1.1.7 (*)
│   │   │   ├── aws-smithy-checksums v0.60.6
│   │   │   │   ├── aws-smithy-http v0.60.6 (*)
│   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   │   ├── aws-smithy-http v0.60.6 (*)
│   │   │   ├── aws-smithy-json v0.60.6 (*)
│   │   │   ├── aws-smithy-runtime v1.1.7 (*)
│   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   ├── aws-smithy-xml v0.60.6 (*)
│   │   │   ├── aws-types v1.1.7 (*)

Environment details (OS name and version, etc.)

Ubuntu 22.04

Logs

No response

@exabytes18 exabytes18 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Feb 28, 2024
@ysaito1001
Collaborator

Thank you for reporting this. Can you enable trace debug logging, reproduce the issue, and then post the logs here? The log will probably be our next best bet without a reproducer.

@ysaito1001 ysaito1001 removed the needs-triage This issue or PR still needs to be triaged. label Mar 1, 2024
@exabytes18
Author

I was finally able to capture some TRACE-level logs from around the time we observed the partial response.

log.txt

@rcoh
Contributor

rcoh commented Mar 13, 2024

Do you have a sense of what "sometimes" is here? 1%? 0.1%? 0.001%?

Trying to figure out what we might need to do to reproduce this on our side.

Are you connecting directly to S3? Is it possible there is an intermediate proxy somewhere that may be truncating the stream?

It seems like at a minimum, the SDK should be doing the validation you're doing so we can return an error if this case is detected.

I'm going to keep digging into this and I'll let you know what I find.

@rcoh
Contributor

rcoh commented Mar 13, 2024

Also, I'd suggest turning on request checksums in your get_object() call:

        .checksum_mode(aws_sdk_s3::types::ChecksumMode::Enabled)
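
In context, that looks something like the following (reusing the names from the snippet in the bug report):

let response = self
    .s3_client
    .get_object()
    .bucket(s3_file_reference.s3_bucket_name.clone())
    .key(s3_file_reference.s3_key.clone())
    // Ask S3 to return checksums so the SDK can verify the body it receives.
    .checksum_mode(aws_sdk_s3::types::ChecksumMode::Enabled)
    .send()
    .await?;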

@rcoh
Contributor

rcoh commented Mar 13, 2024

Are you doing anything notable when constructing your client? I've been unable to reproduce this failure mode—we actually have existing tests for this behavior.

  • If the connection stays open, the read should time out (or hang, if you don't time out the read, as you're doing here); a sketch of configuring SDK timeouts follows this list.
  • If the connection is closed on the server side, you get an EOF error.
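
For reference, a minimal sketch of configuring SDK-level timeouts when loading the shared config (the durations are arbitrary; TimeoutConfig comes from the aws-smithy-types crate):

use std::time::Duration;

use aws_config::BehaviorVersion;
use aws_smithy_types::timeout::TimeoutConfig;

let sdk_config = aws_config::defaults(BehaviorVersion::latest())
    .timeout_config(
        TimeoutConfig::builder()
            // Bound how long a single read on the connection may stall.
            .read_timeout(Duration::from_secs(10))
            // Bound the operation as a whole, across all retry attempts.
            .operation_timeout(Duration::from_secs(60))
            .build(),
    )
    .load()
    .await;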

From the logs, I can see you are using HTTP/1, which exactly matches the scenario we're testing.

We can add an additional safeguard to error out in this case—I think that would be prudent anyway.

@exabytes18
Author

> Do you have a sense of what "sometimes" is here? 1%? 0.1%? 0.001%?

We don't have good metrics on this, but I'd estimate it at somewhere under 0.1%.

> Are you connecting directly to S3? Is it possible there is an intermediate proxy somewhere that may be truncating the stream?

These processes are running in EC2 and hitting S3 directly. There's very little infrastructure in the way: no ECS/EKS, no proxies, about as simple as you can get. Additionally, our VPCs and route tables are configured with S3 Gateway Endpoints.

Notably, I think we typically see these "partial responses" with cross-region S3 calls. For example, the log I posted above came from an EC2 instance in eu-central-1 making a GetObject call to a us-east-1 bucket. This might be significant.

> Also, I'd suggest turning on request checksums in your get_object() call

Excellent callout. We'll add this.

> Are you doing anything notable when constructing your client?

Nothing specifically. Our code looks like this:

let sdk_config = aws_config::defaults(BehaviorVersion::latest())
    .region(Region::new(region.clone()))
    .load()
    .await;

let s3_client = Client::new(&sdk_config);

Perhaps moderately interesting: we instantiate a fresh tokio runtime for essentially every GetObject attempt. Something analogous to the following:

fn some_synchronous_function_which_is_called_infrequently(&self, s3_file_reference: &S3FileReference) -> Result<Value> {
    // Build a throwaway single-threaded runtime, drive the fetch to
    // completion, and drop the runtime when this function returns.
    tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()?
        .block_on(async move { self.fetch_object(s3_file_reference).await })
}

> I've been unable to reproduce this failure mode

Yeah, our own retry logic is able to resolve the issue. That is, our code makes the exact same GetObject call, using the same S3Client, in the exact same tokio runtime, and the second call succeeds. Thus, I would guess that reproducing it is tricky.
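
Roughly, the retry wrapper has this shape (the fetch_object_with_retry name and the attempt count are illustrative, not our exact code):

async fn fetch_object_with_retry(&self, s3_file_reference: &S3FileReference) -> Result<Value> {
    let mut last_err = None;
    for _attempt in 0..3 {
        match self.fetch_object(s3_file_reference).await {
            Ok(value) => return Ok(value),
            Err(e) => last_err = Some(e),
        }
    }
    // Every attempt failed; surface the last error we saw.
    Err(last_err.expect("at least one attempt was made"))
}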

@rcoh
Contributor

rcoh commented Mar 14, 2024

> Perhaps moderately interesting: we instantiate a fresh tokio runtime for essentially every GetObject attempt. Something analogous to the following:

That is almost certainly the problem. Because the Rust client contains a connection pool, weird things happen when you drop the runtime that's running the futures.

I'm working with Sean to fix this on Hyper's side, but I'm also adding a middleware to the Rust SDK to validate content length.

In the meantime, share the runtime between clients :-)
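
A minimal sketch of one way to do that, keeping a process-wide runtime in a std::sync::OnceLock (the shared_runtime helper is illustrative, not SDK API):

use std::sync::OnceLock;
use tokio::runtime::Runtime;

// One runtime for the whole process; it outlives every client, so pooled
// connections are never orphaned by a dropped runtime.
fn shared_runtime() -> &'static Runtime {
    static RUNTIME: OnceLock<Runtime> = OnceLock::new();
    RUNTIME.get_or_init(|| {
        tokio::runtime::Builder::new_current_thread()
            .enable_all()
            .build()
            .expect("failed to build shared tokio runtime")
    })
}

// Then the synchronous caller becomes:
//     shared_runtime().block_on(self.fetch_object(s3_file_reference))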

@rcoh
Contributor

rcoh commented Mar 14, 2024

Fix here: smithy-lang/smithy-rs#3491

github-merge-queue bot pushed a commit to smithy-lang/smithy-rs that referenced this issue Mar 26, 2024
…3491)

## Motivation and Context
There is a rarely-triggered bug (awslabs/aws-sdk-rust#1079) that can occur when the runtime is dropped between requests. Although this is definitely the _wrong thing to do_(tm), we should still aim to at least protect users from bad data in this case.

This adds an interceptor which validates that the body returned is the
correct length.

## Description
- Adds an interceptor that computes the actual content length of the body and compares it to the expected length

## Testing
- Integration-style test. Note that this is very hard to test using Hyper because Hyper will attempt to mitigate this issue.


## Checklist
- [x] I have updated `CHANGELOG.next.toml` if I made changes to the
smithy-rs codegen or runtime crates


----

_By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice._
@jmklix jmklix added p2 This is a standard priority issue pending-release This issue will be fixed by an approved PR that hasn't been released yet. labels Mar 29, 2024
@Velfi Velfi removed the pending-release This issue will be fixed by an approved PR that hasn't been released yet. label May 28, 2024
@Velfi
Contributor

Velfi commented May 28, 2024

@Velfi Velfi closed this as completed May 28, 2024

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
