Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make TransferredBytes be the top of the list in BinLabel #3871

Merged
merged 3 commits into from
Oct 9, 2024

Conversation

ysaito1001
Copy link
Contributor

@ysaito1001 ysaito1001 commented Oct 8, 2024

Motivation and Context

awslabs/aws-sdk-rust#1202

Description

The issue above demonstrated the incorrect BinLabel ordering in LogBuffer, the underlying data structure we use for stall stream protection.

The following trace logs are generated from executing the reproduction steps in the issue above. In the file labeled "no_sleep," we have commented out std::thread::sleep(std::time::Duration::from_millis(120)); from the reproducer so the updated code can be tested as the happy path.

s3_throughput_min_repro_no_sleep.log

s3_throughput_min_repro_with_sleep.log

In both files, it’s important to note that Bins assigned TransferredBytes can be overwritten by Pending due to ThroughputLogs::push. Once a Bin is labeled as Pending, it cannot be re-labeled.

When this occurs, the only way to avoid the stall stream protection check going into the grace period is for time to advance beyond the current Bin's resolution, the LogBuffer pushes a new Bin during catch_up , and this new Bin hopefully gets assigned a TransferredBytes. However, this new Bin could also be overwritten by Pending in a subsequent call to MinimumThroughputDownloadBody::poll_data, which can trigger the the grace period if the overall LogBuffer looks like it's violated the stall stream protection check.

The reproducer without sleep does not fail the stall stream protection obviously because the execution completes way before the grace period ends, but more importantly because the execution periodically assigns new TransferredBytes Bins in the throughput logs. This effectively resets the grace period for the stall stream protection (search for throughput recovered; exiting grace period in the s3_throughput_min_repro_no_sleep.log). However, with sleep, Bins labeled as TransferredBytes are frequently (and almost immediately) overwritten by Pending. This results in the execution being unable to exit the grace period, ultimately leading to a stall stream protection error.

To resolve this, we make TransferredBytes be the top priority in BinLabel. This means once a new Bin has earned TransferredBytes, it's green for that time resolution and that it should not be revoked by Pending overwriting it to make it look like no bytes transferred during that time.

Testing

  • Existing tests in CI
  • Added unit tests for BinLabel ordering and for ThroughputLogs
  • Passed the customer's reproduction step
  • To confirm the stall stream protection for download still works, I switched off WiFi while running the customer's reproducer (with sleep) and it successfully failed with the stall stream protection error:
---- s3_throughput_min_repro stdout ----
2024-10-08T23:29:24.999477Z DEBUG aws_smithy_runtime::client::http::body::minimum_throughput::http_body_0_4_x: current throughput: 0 B/s is below minimum: 1 B/s
2024-10-08T23:29:24.999513Z TRACE aws_smithy_runtime::client::http::body::minimum_throughput::http_body_0_4_x: received poll pending
2024-10-08T23:29:24.999530Z DEBUG aws_smithy_runtime::client::http::body::minimum_throughput::http_body_0_4_x: current throughput: 0 B/s is below minimum: 1 B/s
2024-10-08T23:29:25.081811Z TRACE aws_smithy_runtime::client::http::body::minimum_throughput::http_body_0_4_x: received poll pending
2024-10-08T23:29:25.081938Z DEBUG aws_smithy_runtime::client::http::body::minimum_throughput::http_body_0_4_x: current throughput: 0 B/s is below minimum: 1 B/s
test s3_throughput_min_repro ... FAILED
...
called `Result::unwrap()` on an `Err` value: Custom { kind: Other, error: Error { kind: StreamingError(ThroughputBelowMinimum { expected: Throughput { bytes_read: 1, per_time_elapsed: 1s }, actual: Throughput { bytes_read: 0, per_time_elapsed: 1s } }) } }

Checklist

  • For changes to the smithy-rs codegen or runtime crates, I have created a changelog entry Markdown file in the .changelog directory, specifying "client," "server," or both in the applies_to key.
  • For changes to the AWS SDK, generated SDK code, or SDK runtime crates, I have created a changelog entry Markdown file in the .changelog directory, specifying "aws-sdk-rust" in the applies_to key.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Copy link

github-actions bot commented Oct 8, 2024

A new generated diff is ready to view.

  • AWS SDK (ignoring whitespace)
  • No codegen difference in the Client Test
  • No codegen difference in the Server Test
  • No codegen difference in the Server Test Python
  • No codegen difference in the Server Test Typescript

A new doc preview is ready to view.

@ysaito1001 ysaito1001 marked this pull request as ready for review October 9, 2024 16:11
@ysaito1001 ysaito1001 requested review from a team as code owners October 9, 2024 16:11
Copy link

github-actions bot commented Oct 9, 2024

A new generated diff is ready to view.

  • AWS SDK (ignoring whitespace)
  • No codegen difference in the Client Test
  • No codegen difference in the Server Test
  • No codegen difference in the Server Test Python
  • No codegen difference in the Server Test Typescript

A new doc preview is ready to view.

Copy link
Contributor

@landonxjames landonxjames left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! One non-blocking request, would it be possible to create a test that replicates the reproduction from awslabs/aws-sdk-rust#1202? Might be too difficult since we would need some kind of mocked server to talk to? In that case I would prioritize getting this out since it is currently causing customer issues.

@ysaito1001
Copy link
Contributor Author

Looks good! One non-blocking request, would it be possible to create a test that replicates the reproduction from awslabs/aws-sdk-rust#1202? Might be too difficult since we would need some kind of mocked server to talk to? In that case I would prioritize getting this out since it is currently causing customer issues.

I think it would, using test utilities John originally wrote (it's in integration tests of aws-smithy-runtime).

In that case I would prioritize getting this out since it is currently causing customer issues.

Since I've verified the customer repro and added relevant unit tests, let's get it out first to customers' hands. I'll work on adding an integ test in the next PR.

@ysaito1001 ysaito1001 added this pull request to the merge queue Oct 9, 2024
Merged via the queue into main with commit 1e80349 Oct 9, 2024
44 checks passed
@ysaito1001 ysaito1001 deleted the ysaito/update-log-buffer-bin-label-priority branch October 9, 2024 18:36
ysaito1001 added a commit that referenced this pull request Oct 10, 2024
github-merge-queue bot pushed a commit that referenced this pull request Oct 10, 2024
…t#1202 (#3874)

## Motivation and Context
A follow-up on #3871, responding to [the review
feedback](#3871 (review))

## Testing
- Also confirmed that reverting the change in the above PR (so that
`BinLabel::Pending` becomes the top of the list) failed the integration
test added to this PR, as expected.
```
2024-10-10T19:06:56.417686Z TRACE aws_smithy_runtime::client::http::body::minimum_throughput::http_body_0_4_x: received poll pending
2024-10-10T19:06:56.417694Z DEBUG aws_smithy_runtime::client::http::body::minimum_throughput::http_body_0_4_x: current throughput: 0 B/s is below minimum: 1 B/s
thread 'user_polls_pending_followed_by_data_for_every_bin_in_throughput_logs' panicked at aws-smithy-runtime/tests/stalled_stream_download.rs:252:10:
response MUST NOT timeout: ThroughputBelowMinimum { expected: Throughput { bytes_read: 1, per_time_elapsed: 1s }, actual: Throughput { bytes_read: 0, per_time_elapsed: 1s } }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    user_polls_pending_followed_by_data_for_every_bin_in_throughput_logs
```

----

_By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice._
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants