
Add configurable retry policy for S3 client #21900

Merged 1 commit into trinodb:master from anu/retry-s3-error on May 21, 2024

Conversation

@anusudarsan (Member) commented May 9, 2024

Description

Stress testing and benchmarking of the S3 filesystem revealed errors like the following in the Hive connector:

Caused by: java.io.IOException: Failed to list location: s3://benchmark-sep-hive-us-east-2-tpcds-sf1000-01/sf1000/catalog_returns/cr_returned_date_sk=2451790
    at io.trino.filesystem.s3.S3FileSystem.listFiles(S3FileSystem.java:195)
    at io.trino.filesystem.manager.SwitchingFileSystem.listFiles(SwitchingFileSystem.java:110)
    at io.trino.filesystem.tracing.TracingFileSystem.lambda$listFiles$4(TracingFileSystem.java:109)
    at io.trino.filesystem.tracing.Tracing.withTracing(Tracing.java:47)
    at io.trino.filesystem.tracing.TracingFileSystem.listFiles(TracingFileSystem.java:109)
    at io.trino.filesystem.ForwardingTrinoFileSystem.listFiles(ForwardingTrinoFileSystem.java:89)
    at io.trino.plugin.hive.fs.CachingDirectoryLister.listFilesRecursively(CachingDirectoryLister.java:96)
    at io.trino.plugin.hive.fs.TransactionScopeCachingDirectoryLister.createListingRemoteIterator(TransactionScopeCachingDirectoryLister.java:97)
    at io.trino.plugin.hive.fs.TransactionScopeCachingDirectoryLister.lambda$listInternal$0(TransactionScopeCachingDirectoryLister.java:78)
    at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4955)
    at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3589)
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2328)
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2187)
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2081)
    at com.google.common.cache.LocalCache.get(LocalCache.java:4036)
    at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4950)
    at io.trino.cache.EvictableCache.get(EvictableCache.java:112)
    at io.trino.plugin.hive.fs.TransactionScopeCachingDirectoryLister.listInternal(TransactionScopeCachingDirectoryLister.java:78)
    at io.trino.plugin.hive.fs.TransactionScopeCachingDirectoryLister.listFilesRecursively(TransactionScopeCachingDirectoryLister.java:70)
    at io.trino.plugin.hive.fs.HiveFileIterator$FileStatusIterator.<init>(HiveFileIterator.java:140)
    ... 11 more
Caused by: software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: 3TWYHXY49Z761P7Y, Extended Request ID: nDKplvh5sDhgsEEJAVCPHmiDsF0vUlIQMKbvRfjs1sFK+4WGtsZYtDHn6ed1mCmHx/9VjgWKRoI=)
    at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleErrorResponse(AwsXmlPredicatedResponseHandler.java:156)
    at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleResponse(AwsXmlPredicatedResponseHandler.java:108)
    at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:85)
    at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:43)
    at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler$Crc32ValidationResponseHandler.handle(AwsSyncClientHandler.java:93)
    at software.amazon.awssdk.core.internal.handler.BaseClientHandler.lambda$successTransformationResponseHandler$7(BaseClientHandler.java:279)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:50)
    ... (remainder of stack trace truncated)

This change makes the retry behavior of the native filesystem S3 client configurable, since the default retry mechanism does not appear to be good enough. Following the AWS team's recommendations, we experimented with the maximum error retry count, and raising it from the default of 3 fixed the issue. AWS support also suggests a retry mode of STANDARD for some workloads: per https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/, the default "Equal Jitter" strategy is the loser. So the retry mode is exposed as a configurable setting as well, which might help for some workloads.
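
For illustration, a minimal sketch of how such a retry policy is applied to an AWS SDK v2 S3 client. The values are illustrative and the class name is hypothetical; the actual change wires these through the filesystem configuration rather than hardcoding them:

    import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
    import software.amazon.awssdk.core.retry.RetryMode;
    import software.amazon.awssdk.core.retry.RetryPolicy;
    import software.amazon.awssdk.services.s3.S3Client;

    public class S3RetrySketch
    {
        public static void main(String[] args)
        {
            // Override the SDK defaults (LEGACY mode, 3 retries) with the
            // STANDARD retry mode and a higher maximum retry count.
            try (S3Client client = S3Client.builder()
                    .overrideConfiguration(ClientOverrideConfiguration.builder()
                            .retryPolicy(RetryPolicy.builder(RetryMode.STANDARD)
                                    .numRetries(10)
                                    .build())
                            .build())
                    .build()) {
                // use the client as usual; throttled requests are retried
                // according to the policy above
            }
        }
    }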

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@anusudarsan requested a review from electrum on May 9, 2024 19:48
@cla-bot added the cla-signed label on May 9, 2024
@findepi (Member) commented May 10, 2024

cc @charlesjmorgan @findinpath

@findinpath self-requested a review on May 13, 2024 04:13
@anusudarsan force-pushed the anu/retry-s3-error branch from 6d15dd1 to d663e8b on May 13, 2024 14:35
@@ -83,6 +99,8 @@ public static ObjectCannedACL getCannedAcl(S3FileSystemConfig.ObjectCannedAcl ca
     private HostAndPort httpProxy;
     private boolean httpProxySecure;
     private ObjectCannedAcl objectCannedAcl = ObjectCannedAcl.NONE;
+    private RetryMode retryMode = RetryMode.LEGACY;
+    private int maxErrorRetries = 10;

Contributor

Did you take the number of retries from the Hadoop S3 code?

Member Author (@anusudarsan)

Yes.

@anusudarsan force-pushed the anu/retry-s3-error branch from d663e8b to 99114f7 on May 14, 2024 17:33
@github-actions added the docs label on May 14, 2024
@@ -224,6 +242,32 @@ public S3FileSystemConfig setCannedAcl(ObjectCannedAcl objectCannedAcl)
         return this;
     }

+    public RetryMode getRetryMode()

Contributor

I don't fully follow why we would want to configure the retry mode.

Member Author (@anusudarsan)

Our internal benchmarks did not reveal any particular problems here. But according to AWS support, the STANDARD retry mode helps with throttling issues most of the time. Per https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/, the default "Equal Jitter" strategy is the loser. So making this configurable might help for some workloads, and it wouldn't hurt to expose it.
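
(For context on the linked post, the competing backoff strategies look roughly like this; a sketch with illustrative millisecond parameters, not the SDK's actual internals:)

    import java.util.concurrent.ThreadLocalRandom;

    public class BackoffSketch
    {
        // Capped exponential backoff: min(cap, base * 2^attempt).
        static long expBackoff(int attempt, long baseMillis, long capMillis)
        {
            return Math.min(capMillis, baseMillis * (1L << attempt));
        }

        // "Equal Jitter": half the backoff fixed, half random
        // (the strategy the post finds weakest).
        static long equalJitter(int attempt, long baseMillis, long capMillis)
        {
            long backoff = expBackoff(attempt, baseMillis, capMillis);
            return backoff / 2 + ThreadLocalRandom.current().nextLong(backoff / 2 + 1);
        }

        // "Full Jitter": a uniformly random sleep up to the full backoff,
        // which the post shows spreads retries out much better under load.
        static long fullJitter(int attempt, long baseMillis, long capMillis)
        {
            return ThreadLocalRandom.current().nextLong(expBackoff(attempt, baseMillis, capMillis) + 1);
        }
    }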

Member

Should we change the default to be STANDARD? I wonder why they default to LEGACY.
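
(As merged, the settings surface as ordinary config properties. A sketch of the accessor pairs on S3FileSystemConfig, assuming airlift-style @Config bindings; the class name here and the property names s3.retry-mode and s3.max-error-retries are assumptions, not copied from the diff:)

    import io.airlift.configuration.Config;
    import software.amazon.awssdk.core.retry.RetryMode;

    public class S3FileSystemConfigSketch
    {
        private RetryMode retryMode = RetryMode.LEGACY;
        private int maxErrorRetries = 10;

        public RetryMode getRetryMode()
        {
            return retryMode;
        }

        @Config("s3.retry-mode") // property name is an assumption
        public S3FileSystemConfigSketch setRetryMode(RetryMode retryMode)
        {
            this.retryMode = retryMode;
            return this;
        }

        public int getMaxErrorRetries()
        {
            return maxErrorRetries;
        }

        @Config("s3.max-error-retries") // property name is an assumption
        public S3FileSystemConfigSketch setMaxErrorRetries(int maxErrorRetries)
        {
            this.maxErrorRetries = maxErrorRetries;
            return this;
        }
    }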

@anusudarsan force-pushed the anu/retry-s3-error branch from 99114f7 to ffb7d75 on May 15, 2024 13:12
@electrum merged commit a471b4b into trinodb:master on May 21, 2024
62 checks passed
@github-actions added this to the 449 milestone on May 21, 2024
@anusudarsan deleted the anu/retry-s3-error branch on May 21, 2024 23:04