
HADOOP-19354. S3A: S3AInputStream to be created by factory under S3AStore #7214

Open · wants to merge 2 commits into trunk

Conversation


@steveloughran steveloughran commented Dec 6, 2024

HADOOP-19354

  • Factory interface with a parameter object creation method
  • Base class AbstractS3AInputStream for all streams to create
  • S3AInputStream subclasses that and has a factory
  • Production and test code to use it
  • Input stream callbacks pushed down to S3Store
  • S3Store to dynamically choose factory at startup, stop in close()
  • S3Store to implement the factory interface, completing final binding operations (callbacks, stats)
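A minimal sketch of the factory design listed above; all names here are hypothetical stand-ins, not the actual Hadoop classes:

```java
import java.util.Objects;

// Illustrative sketch of the factory-with-parameter-object pattern described
// above; class and method names are hypothetical, not the real Hadoop ones.
public class Main {

  /** Parameter object carrying everything a stream needs at creation time. */
  static final class StreamParameters {
    final String key;
    StreamParameters(String key) {
      this.key = Objects.requireNonNull(key);
    }
  }

  /** Factory interface with a single parameter-object creation method. */
  interface ObjectInputStreamFactory {
    AbstractObjectInputStream create(StreamParameters parameters);
  }

  /** Base class which all stream implementations extend. */
  abstract static class AbstractObjectInputStream {
    final StreamParameters parameters;
    AbstractObjectInputStream(StreamParameters parameters) {
      this.parameters = parameters;
    }
    abstract String describe();
  }

  /** The "classic" stream, subclassing the base and supplying a factory. */
  static final class ClassicInputStream extends AbstractObjectInputStream {
    ClassicInputStream(StreamParameters p) { super(p); }
    @Override
    String describe() { return "classic:" + parameters.key; }
    static ObjectInputStreamFactory factory() { return ClassicInputStream::new; }
  }

  /** The store picks a factory at startup; hard-coded here for the sketch. */
  static String openAndDescribe(String key) {
    ObjectInputStreamFactory factory = ClassicInputStream.factory();
    return factory.create(new StreamParameters(key)).describe();
  }

  public static void main(String[] args) {
    System.out.println(openAndDescribe("bucket/object"));
  }
}
```

The point of the parameter object is that the store can hold the chosen factory and route every stream open through it, which is what lets the stream type be selected dynamically at startup.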

How was this patch tested?

S3 London.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@steveloughran
Contributor Author

Test failure was from me pushing the disk allocator down into the store while the test case did not set the store up:
[ERROR] testInterruptSimplePut[disk-2](org.apache.hadoop.fs.s3a.scale.ITestS3ABlockOutputStreamInterruption)  Time elapsed: 2.421 s  <<< ERROR!
java.lang.NullPointerException
        at org.apache.hadoop.fs.s3a.impl.ErrorTranslation.maybeExtractChannelException(ErrorTranslation.java:267)
        at org.apache.hadoop.fs.s3a.impl.ErrorTranslation.maybeExtractIOException(ErrorTranslation.java:189)
        at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:212)
        at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:124)
        at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$4(Invoker.java:376)
        at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
        at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:372)
        at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:347)
        at org.apache.hadoop.fs.s3a.WriteOperationHelper.retry(WriteOperationHelper.java:207)
        at org.apache.hadoop.fs.s3a.WriteOperationHelper.putObject(WriteOperationHelper.java:525)
        at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.putObject(S3ABlockOutputStream.java:708)
        at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:500)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:77)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
        at org.apache.hadoop.test.LambdaTestUtils.intercept(LambdaTestUtils.java:410)
        at org.apache.hadoop.fs.s3a.scale.ITestS3ABlockOutputStreamInterruption.expectCloseInterrupted(ITestS3ABlockOutputStreamInterruption.java:406)
        at org.apache.hadoop.fs.s3a.scale.ITestS3ABlockOutputStreamInterruption.testInterruptSimplePut(ITestS3ABlockOutputStreamInterruption.java:386)
 

@steveloughran steveloughran force-pushed the s3/HADOOP-19354-s3a-inputstream-factory branch from 5a32f16 to 7d76047 on December 6, 2024 at 18:45
@apache apache deleted a comment from hadoop-yetus Jan 1, 2025
@apache apache deleted a comment from hadoop-yetus Jan 1, 2025
@apache apache deleted a comment from hadoop-yetus Jan 1, 2025
@steveloughran steveloughran force-pushed the s3/HADOOP-19354-s3a-inputstream-factory branch from a944b86 to 0f01d61 on January 3, 2025 at 17:39
@steveloughran steveloughran marked this pull request as ready for review January 3, 2025 18:08
@apache apache deleted a comment from hadoop-yetus Jan 3, 2025
@apache apache deleted a comment from hadoop-yetus Jan 3, 2025
@apache apache deleted a comment from hadoop-yetus Jan 3, 2025
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 50s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 xmllint 0m 1s xmllint was not available.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 18 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 39m 58s trunk passed
+1 💚 compile 0m 45s trunk passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 compile 0m 35s trunk passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 checkstyle 0m 33s trunk passed
+1 💚 mvnsite 0m 40s trunk passed
+1 💚 javadoc 0m 41s trunk passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 javadoc 0m 33s trunk passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 spotbugs 1m 8s trunk passed
+1 💚 shadedclient 37m 24s branch has no errors when building and testing our client artifacts.
-0 ⚠️ patch 37m 45s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 29s the patch passed
+1 💚 compile 0m 36s the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 javac 0m 36s the patch passed
+1 💚 compile 0m 27s the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 javac 0m 27s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 21s /results-checkstyle-hadoop-tools_hadoop-aws.txt hadoop-tools/hadoop-aws: The patch generated 1 new + 25 unchanged - 0 fixed = 26 total (was 25)
+1 💚 mvnsite 0m 31s the patch passed
-1 ❌ javadoc 0m 30s /results-javadoc-javadoc-hadoop-tools_hadoop-aws-jdkUbuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.txt hadoop-tools_hadoop-aws-jdkUbuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0)
-1 ❌ javadoc 0m 25s /results-javadoc-javadoc-hadoop-tools_hadoop-aws-jdkPrivateBuild-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga.txt hadoop-tools_hadoop-aws-jdkPrivateBuild-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0)
+1 💚 spotbugs 1m 6s the patch passed
+1 💚 shadedclient 37m 39s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 2m 47s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 36s The patch does not generate ASF License warnings.
130m 4s
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7214/8/artifact/out/Dockerfile
GITHUB PR #7214
Optional Tests dupname asflicense codespell detsecrets xmllint compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint
uname Linux 5978404f578e 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 0f01d61
Default Java Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7214/8/testReport/
Max. process+thread count 623 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7214/8/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran steveloughran force-pushed the s3/HADOOP-19354-s3a-inputstream-factory branch from 0f01d61 to e7e454c on January 7, 2025 at 14:36
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 52s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 xmllint 0m 1s xmllint was not available.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 18 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 38m 11s trunk passed
+1 💚 compile 0m 46s trunk passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 compile 0m 34s trunk passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 checkstyle 0m 32s trunk passed
+1 💚 mvnsite 0m 41s trunk passed
+1 💚 javadoc 0m 41s trunk passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 javadoc 0m 32s trunk passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 spotbugs 1m 10s trunk passed
+1 💚 shadedclient 37m 49s branch has no errors when building and testing our client artifacts.
-0 ⚠️ patch 38m 11s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 32s the patch passed
+1 💚 compile 0m 40s the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 javac 0m 40s the patch passed
+1 💚 compile 0m 28s the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 javac 0m 28s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 20s /results-checkstyle-hadoop-tools_hadoop-aws.txt hadoop-tools/hadoop-aws: The patch generated 11 new + 25 unchanged - 0 fixed = 36 total (was 25)
+1 💚 mvnsite 0m 35s the patch passed
-1 ❌ javadoc 0m 30s /patch-javadoc-hadoop-tools_hadoop-aws-jdkUbuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.txt hadoop-aws in the patch failed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.
-1 ❌ javadoc 0m 26s /patch-javadoc-hadoop-tools_hadoop-aws-jdkPrivateBuild-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga.txt hadoop-aws in the patch failed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga.
+1 💚 spotbugs 1m 13s the patch passed
+1 💚 shadedclient 39m 4s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 2m 49s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 36s The patch does not generate ASF License warnings.
130m 26s
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7214/9/artifact/out/Dockerfile
GITHUB PR #7214
Optional Tests dupname asflicense codespell detsecrets xmllint compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint
uname Linux 7aa7731515a7 5.15.0-125-generic #135-Ubuntu SMP Fri Sep 27 13:53:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / e7e454c
Default Java Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7214/9/testReport/
Max. process+thread count 529 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7214/9/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.


@mukund-thakur mukund-thakur left a comment


Overall I like the design and refactoring.
One thought: can we keep the prefetching changes in this PR minimal, focus only on the interface and ClassicInputStream, and create a separate PR for all the prefetching work?

@@ -993,7 +983,7 @@ private void initThreadPools(Configuration conf) {
unboundedThreadPool.allowCoreThreadTimeOut(true);
executorCapacity = intOption(conf,
EXECUTOR_CAPACITY, DEFAULT_EXECUTOR_CAPACITY, 1);
if (prefetchEnabled) {
if (requirements.createFuturePool()) {
Contributor


change the name to prefetchRequirements.

Contributor Author


there's more requirements than just prefetching, e.g if vector IO support is needed then some extra threads are added to the pool passed down.

@steveloughran
Contributor Author

I'm just setting this up so it is ready for the analytics stream work. Making sure that prefetch is also covered is my way of validating the factory model, and of confirming that the options need to include things like asking for a shared thread pool and a per-stream thread pool, with the intent that analytics will use those too.

And once I do that, they all need a single base stream class.

For my vector IO resilience PR, once I have this PR in, I'm going to go back to #7105 and make it something which works with all object input streams

  • probe the stream for being "all in memory"; if so just do the reads sequentially, no need to parallelize.
  • if "partially in memory", give the implementation the list of ranges and have it split them into "all in memory" and "needs retrieval". Again, in-memory blocks can be filled in immediately (needs a lock on removing cache items)
  • range coalesce
  • sort by largest range first (stops the tail being the bottleneck)
  • queue for reading
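The coalesce and sort steps above can be sketched as follows. This is an illustrative outline under simplifying assumptions (no maximum merged size, no per-range metadata), not the Hadoop vector IO code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of the range-processing pipeline above: coalesce
// nearby ranges, then sort largest-first so the tail is not the bottleneck.
public class Main {

  /** A byte range within an object. */
  static final class Range {
    final long offset;
    final long length;
    Range(long offset, long length) {
      this.offset = offset;
      this.length = length;
    }
    long end() { return offset + length; }
  }

  /** Coalesce ranges whose gap is at most maxGap bytes, then sort by size. */
  static List<Range> coalesce(List<Range> ranges, long maxGap) {
    List<Range> sorted = new ArrayList<>(ranges);
    sorted.sort(Comparator.comparingLong((Range r) -> r.offset));
    List<Range> out = new ArrayList<>();
    for (Range r : sorted) {
      if (!out.isEmpty() && r.offset - out.get(out.size() - 1).end() <= maxGap) {
        // close enough to the previous range: merge them
        Range last = out.remove(out.size() - 1);
        long end = Math.max(last.end(), r.end());
        out.add(new Range(last.offset, end - last.offset));
      } else {
        out.add(r);
      }
    }
    // largest range first: stops the tail being the bottleneck
    out.sort(Comparator.comparingLong((Range r) -> r.length).reversed());
    return out;
  }

  public static void main(String[] args) {
    List<Range> merged = coalesce(
        Arrays.asList(new Range(0, 10), new Range(10, 10), new Range(100, 5)), 0);
    for (Range r : merged) {
      System.out.println(r.offset + "+" + r.length);
    }
  }
}
```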

read failure

  1. single range: retry
  2. merged range: complete the successfully read parts
  3. incomplete parts are split into their original ranges and reread individually in the same thread, with retries
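A rough sketch of that failure-handling sequence, with hypothetical names (`Reader`, `readMerged`) standing in for the real stream internals:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch of the failure handling above: if a merged (coalesced)
// range fails to read, fall back to re-reading its original sub-ranges
// individually, each with its own retries. Names are hypothetical.
public class Main {

  /** Minimal reader abstraction; a real one would hit the object store. */
  interface Reader {
    byte[] read(long offset, long length) throws Exception;
  }

  /** Retry a single-range read a fixed number of times. */
  static byte[] readWithRetry(Reader r, long offset, long length, int attempts)
      throws Exception {
    Exception last = null;
    for (int i = 0; i < attempts; i++) {
      try {
        return r.read(offset, length);
      } catch (Exception e) {
        last = e;
      }
    }
    throw last;
  }

  /**
   * Read a merged range; on failure, split back into the original ranges
   * and re-read each individually in the same thread, with retries.
   */
  static List<byte[]> readMerged(Reader r, long mergedOffset, long mergedLength,
      List<long[]> originals) throws Exception {
    try {
      return Collections.singletonList(
          readWithRetry(r, mergedOffset, mergedLength, 1));
    } catch (Exception e) {
      List<byte[]> parts = new ArrayList<>();
      for (long[] o : originals) {
        parts.add(readWithRetry(r, o[0], o[1], 3));  // per-range retries
      }
      return parts;
    }
  }

  public static void main(String[] args) throws Exception {
    // A reader which fails on any range longer than 10 bytes.
    Reader flaky = (offset, length) -> {
      if (length > 10) {
        throw new Exception("simulated failure on merged range");
      }
      return new byte[(int) length];
    };
    List<byte[]> parts = readMerged(flaky, 0, 20,
        Arrays.asList(new long[]{0, 10}, new long[]{10, 10}));
    System.out.println("parts read: " + parts.size());
  }
}
```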

the read failure stuff is essentially in my PR, so maybe we can rebase onto this, merge in and then pull up. Goal: analytics stream gets vector IO.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 50s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 xmllint 0m 1s xmllint was not available.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 18 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 39m 17s trunk passed
+1 💚 compile 0m 44s trunk passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 compile 0m 35s trunk passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 checkstyle 0m 31s trunk passed
+1 💚 mvnsite 0m 41s trunk passed
+1 💚 javadoc 0m 41s trunk passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 javadoc 0m 33s trunk passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 spotbugs 1m 8s trunk passed
+1 💚 shadedclient 37m 31s branch has no errors when building and testing our client artifacts.
-0 ⚠️ patch 37m 53s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 29s the patch passed
+1 💚 compile 0m 36s the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 javac 0m 36s the patch passed
+1 💚 compile 0m 27s the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 javac 0m 27s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 21s /results-checkstyle-hadoop-tools_hadoop-aws.txt hadoop-tools/hadoop-aws: The patch generated 11 new + 25 unchanged - 0 fixed = 36 total (was 25)
+1 💚 mvnsite 0m 32s the patch passed
-1 ❌ javadoc 0m 29s /patch-javadoc-hadoop-tools_hadoop-aws-jdkUbuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.txt hadoop-aws in the patch failed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.
-1 ❌ javadoc 0m 25s /patch-javadoc-hadoop-tools_hadoop-aws-jdkPrivateBuild-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga.txt hadoop-aws in the patch failed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga.
+1 💚 spotbugs 1m 7s the patch passed
+1 💚 shadedclient 37m 7s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 2m 45s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 36s The patch does not generate ASF License warnings.
129m 2s
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7214/10/artifact/out/Dockerfile
GITHUB PR #7214
Optional Tests dupname asflicense codespell detsecrets xmllint compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint
uname Linux 6f6ef8b7b272 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / c35c915
Default Java Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7214/10/testReport/
Max. process+thread count 608 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7214/10/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

this.ioStatistics = streamStatistics.getIOStatistics();
this.inputPolicy = context.getInputPolicy();
streamStatistics.inputPolicySet(inputPolicy.ordinal());
this.boundedThreadPool = parameters.getBoundedThreadPool();


I see boundedThreadPool is used in S3AInputStream but not in S3APrefetchingInputStream, can we keep boundedThreadPool local to S3AInputStream?

Contributor Author


Each stream can declare what it wants thread-pool-wise and we will allocate those to it. If a stream doesn't want a pool, it doesn't get one.
The bounded thread pool passed down is the semaphore pool we also use in uploads. It takes a subset of the shared pool, has its own pending queue, and blocks the caller thread when that pending queue is full.

If the analytics stream doesn't currently need it, don't ask for any.

But I do want the vector IO code moved out of S3AInputStream so it can work with the superclass and all streams get it; that also wants a bounded number of threads.
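The semaphore-pool behaviour described here can be illustrated with a simplified stand-in. This is not Hadoop's actual SemaphoredDelegatingExecutor, just the idea of capping one stream's share of a common pool and blocking the submitter at the cap:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Conceptual sketch: a per-stream wrapper around a shared pool which caps
// how many tasks one stream may have in flight at once.
public class Main {

  static final class BoundedSubmitter {
    private final ExecutorService shared;
    private final Semaphore permits;

    BoundedSubmitter(ExecutorService shared, int maxInFlight) {
      this.shared = shared;
      this.permits = new Semaphore(maxInFlight);
    }

    /** Blocks the caller when this stream already has maxInFlight tasks. */
    Future<?> submit(Runnable task) throws InterruptedException {
      permits.acquire();
      return shared.submit(() -> {
        try {
          task.run();
        } finally {
          permits.release();  // free the slot for the next task
        }
      });
    }
  }

  /** Run n tasks through a bounded submitter over a shared pool. */
  static int runTasks(int n) throws Exception {
    ExecutorService shared = Executors.newFixedThreadPool(4);
    try {
      BoundedSubmitter submitter = new BoundedSubmitter(shared, 2);
      AtomicInteger completed = new AtomicInteger();
      List<Future<?>> futures = new ArrayList<>();
      for (int i = 0; i < n; i++) {
        futures.add(submitter.submit(completed::incrementAndGet));
      }
      for (Future<?> f : futures) {
        f.get();
      }
      return completed.get();
    } finally {
      shared.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("completed: " + runTasks(10));
  }
}
```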


/**
* A stream of data from an S3 object.
* The blase class includes common methods, stores


Nit: spelling base

* This must be re-invoked after replacing the S3Client during test
* runs.
* <p>
* It requires the S3Store to have been instantiated.
* @param conf configuration.

@rajdchak rajdchak Jan 16, 2025


@param conf is no longer required

* @param sharedThreads Number of shared threads to included in the bounded pool.
* @param streamThreads How many threads per stream, ignoring vector IO requirements.
* @param createFuturePool Flag to enable creation of a future pool around the bounded thread pool.
*/


@param vectorSupported missing

@@ -845,7 +826,7 @@ private S3AFileSystemOperations createFileSystemHandler() {
@VisibleForTesting
protected S3AStore createS3AStore(final ClientManager clientManager,
final int rateLimitCapacity) {
return new S3AStoreBuilder()
final S3AStore st = new S3AStoreBuilder()


Nit: rename variable to meaningful name

@steveloughran
Contributor Author

@rajdchak thanks for the comments, will address

I do want to pull up the vector IO support, with integration with prefetch and caching.

For the prefetch/caching stream we'd ask for the requested ranges to be split up into:

  1. ranges which are wholly in memory: satisfy immediately in the current thread (or a copier thread?)
  2. ranges which have an active prefetch that will wholly satisfy the request: somehow wire prefetching up so that as soon as the data arrives, the range gets it.
  3. other ranges (not cached, not prefetched, or only partially in cache): coalesce as needed, then retrieve. Also notify the stream that these ranges are being fetched, so there is no need to prefetch them.

It'd be good to collect stats on cache hit/miss here, to assess the integration of vector reads with ranges. When a list of ranges comes down, there is less need to infer the next range and prefetch it, and I'm not actually sure how important caching becomes. This is why setting Parquet up to use vector IO already appears to give speedups comparable to the published analytics stream benchmarks.

What I want is the best of both worlds: prefetch of row groups from stream inference, and when vector reads come in, satisfy those by returning current/active prefetches or retrieving new ranges through ranged GET requests.

#7105 is where that will go; I've halted that until this is in. And I'll only worry about that integration with prefetched/cached blocks with the analytics stream.

Contributor

@ahmarsuhail ahmarsuhail left a comment


Thanks @steveloughran, looks good to me overall. Just need to allow for the ClientManager to be passed into the factory.

: 0);
// create an executor which is a subset of the
// bounded thread pool.
final SemaphoredDelegatingExecutor pool = new SemaphoredDelegatingExecutor(
Contributor


Just a clarifying question, what is the benefit of creating a new SemaphoredDelegatingExecutor per stream vs just creating this once?

Contributor


ok I think I get it, this is basically a way to ensure a single stream instance does not use up too many threads.

public static ObjectInputStreamFactory createStreamFactory(final Configuration conf) {
// choose the default input stream type
InputStreamType defaultStream = InputStreamType.DEFAULT_STREAM_TYPE;
if (conf.getBoolean(PREFETCH_ENABLED_KEY, false)) {
Contributor


We're saying PREFETCH_ENABLED_KEY is deprecated, but still setting the stream type to prefetch. Is this something we want? If yes, we should make the message clearer to say "we're going to deprecate this in the future, but it works for now".

Contributor Author


I'm trying to say "if you set it, we will tell you not to, but still take the setting as the default, so it can be overridden by the new option".
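That fallback logic might look something like this sketch; option names and types are illustrative, not the actual S3A configuration keys:

```java
import java.util.Locale;

// Hypothetical sketch of the fallback being discussed: the deprecated boolean
// prefetch key only chooses the *default* stream type, and the new stream-type
// option, when set, overrides it.
public class Main {

  enum InputStreamType { CLASSIC, PREFETCH, ANALYTICS }

  /**
   * @param deprecatedPrefetchEnabled value of the old boolean prefetch option
   * @param explicitType value of the new stream-type option, or null if unset
   */
  static InputStreamType chooseStreamType(boolean deprecatedPrefetchEnabled,
      String explicitType) {
    // the deprecated key only sets the default...
    InputStreamType defaultType = deprecatedPrefetchEnabled
        ? InputStreamType.PREFETCH
        : InputStreamType.CLASSIC;
    // ...so the new option, when present, wins
    return explicitType == null
        ? defaultType
        : InputStreamType.valueOf(explicitType.toUpperCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    System.out.println(chooseStreamType(true, null));       // PREFETCH
    System.out.println(chooseStreamType(true, "classic"));  // CLASSIC
  }
}
```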

* Each enum value contains the factory function actually used to create
* the factory.
*/
public enum InputStreamType {
Contributor


As discussed in #7295, the S3SeekableInputStreamFactory requires a client to be passed in. For this, we need a way to pass in the ClientManager here.

Contributor Author


Yeah, will do that. After Service.init() we will pass down a reference to the client manager, though that won't be ready to use until Service.start().

Also, the client manager should declare whether CRT is used or not, even before the client is instantiated (avoids a launch-performance hit). Then the analytics stream can just fail fast in start() based on that flag alone.
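The enum-carrying-factory pattern from the quoted javadoc can be illustrated like this. It is a hedged sketch with hypothetical names; the real enum would take real factory constructors and a configuration or ClientManager rather than an Object:

```java
import java.util.function.Function;

// Illustrative sketch of an enum whose values each carry the factory function
// used to construct the matching stream factory.
public class Main {

  /** Minimal stand-in for a stream factory. */
  interface StreamFactory {
    String name();
  }

  enum InputStreamType {
    CLASSIC(conf -> () -> "classic"),
    PREFETCH(conf -> () -> "prefetch");

    /** The factory function actually used to create the factory. */
    private final Function<Object, StreamFactory> constructor;

    InputStreamType(Function<Object, StreamFactory> constructor) {
      this.constructor = constructor;
    }

    StreamFactory createFactory(Object configuration) {
      return constructor.apply(configuration);
    }
  }

  public static void main(String[] args) {
    System.out.println(InputStreamType.CLASSIC.createFactory(null).name());
  }
}
```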


@steveloughran steveloughran left a comment


/* sorry, had commented back on others but hadn't pressed the submit button. doing it now */

@steveloughran
Contributor Author

(Just had to rebase as it wouldn't merge with the directory marker changes; that is going to make backporting to branch-3.4 harder. FWIW I'm wondering if we should make the leap to a 3.5.0 release with Java 17 as the baseline and keep 3.4.x the maintenance branch with CVE and jar updates only. Not discussed that on the mailing lists yet though...)

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 25m 50s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 xmllint 0m 1s xmllint was not available.
+0 🆗 markdownlint 0m 1s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 18 new or modified test files.
_ trunk Compile Tests _
-1 ❌ mvninstall 0m 24s /branch-mvninstall-root.txt root in trunk failed.
-1 ❌ compile 0m 24s /branch-compile-hadoop-tools_hadoop-aws-jdkUbuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.txt hadoop-aws in trunk failed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.
-1 ❌ compile 0m 23s /branch-compile-hadoop-tools_hadoop-aws-jdkPrivateBuild-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga.txt hadoop-aws in trunk failed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga.
-0 ⚠️ checkstyle 0m 22s /buildtool-branch-checkstyle-hadoop-tools_hadoop-aws.txt The patch fails to run checkstyle in hadoop-aws
-1 ❌ mvnsite 0m 23s /branch-mvnsite-hadoop-tools_hadoop-aws.txt hadoop-aws in trunk failed.
-1 ❌ javadoc 0m 24s /branch-javadoc-hadoop-tools_hadoop-aws-jdkUbuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.txt hadoop-aws in trunk failed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.
-1 ❌ javadoc 0m 24s /branch-javadoc-hadoop-tools_hadoop-aws-jdkPrivateBuild-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga.txt hadoop-aws in trunk failed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga.
-1 ❌ spotbugs 0m 24s /branch-spotbugs-hadoop-tools_hadoop-aws.txt hadoop-aws in trunk failed.
+1 💚 shadedclient 2m 53s branch has no errors when building and testing our client artifacts.
-0 ⚠️ patch 3m 18s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
-1 ❌ mvninstall 0m 25s /patch-mvninstall-hadoop-tools_hadoop-aws.txt hadoop-aws in the patch failed.
-1 ❌ compile 0m 25s /patch-compile-hadoop-tools_hadoop-aws-jdkUbuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.txt hadoop-aws in the patch failed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.
-1 ❌ javac 0m 25s /patch-compile-hadoop-tools_hadoop-aws-jdkUbuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.txt hadoop-aws in the patch failed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.
-1 ❌ compile 0m 24s /patch-compile-hadoop-tools_hadoop-aws-jdkPrivateBuild-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga.txt hadoop-aws in the patch failed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga.
-1 ❌ javac 0m 24s /patch-compile-hadoop-tools_hadoop-aws-jdkPrivateBuild-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga.txt hadoop-aws in the patch failed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga.
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 22s /buildtool-patch-checkstyle-hadoop-tools_hadoop-aws.txt The patch fails to run checkstyle in hadoop-aws
-1 ❌ mvnsite 0m 25s /patch-mvnsite-hadoop-tools_hadoop-aws.txt hadoop-aws in the patch failed.
-1 ❌ javadoc 0m 23s /patch-javadoc-hadoop-tools_hadoop-aws-jdkUbuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.txt hadoop-aws in the patch failed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.
-1 ❌ javadoc 0m 24s /patch-javadoc-hadoop-tools_hadoop-aws-jdkPrivateBuild-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga.txt hadoop-aws in the patch failed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga.
-1 ❌ spotbugs 0m 41s /patch-spotbugs-hadoop-tools_hadoop-aws.txt hadoop-aws in the patch failed.
+1 💚 shadedclient 4m 59s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 0m 10s /patch-unit-hadoop-tools_hadoop-aws.txt hadoop-aws in the patch failed.
+1 💚 asflicense 0m 43s The patch does not generate ASF License warnings.
39m 8s
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7214/12/artifact/out/Dockerfile
GITHUB PR #7214
Optional Tests dupname asflicense codespell detsecrets xmllint compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint
uname Linux 1507c98d6d1f 5.15.0-125-generic #135-Ubuntu SMP Fri Sep 27 13:53:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 88ee1d2
Default Java Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7214/12/testReport/
Max. process+thread count 46 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7214/12/console
versions git=2.25.1 maven=3.6.3
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@ahmarsuhail
Contributor

Thanks @steveloughran, this looks good now. We've just done an initial rebase on this here and we're able to integrate successfully. I will merge this into the feature branch, and then follow up with our changes.

@steveloughran
Contributor Author

@ahmarsuhail will look at it. This is just a rebase and review of the existing changes; the last failure looks VM-related rather than a code problem.

@ahmarsuhail
Contributor

@steveloughran do you want to merge this PR into trunk? Or do you want this to go in via our feature branch?

So either this PR goes into trunk directly, or it can go in as part of the feature branch.

ahmarsuhail added a commit that referenced this pull request Jan 27, 2025
@steveloughran steveloughran force-pushed the s3/HADOOP-19354-s3a-inputstream-factory branch from 88ee1d2 to b5346a1 Compare January 27, 2025 18:16
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 18m 11s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 xmllint 0m 1s xmllint was not available.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 18 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 41m 21s trunk passed
+1 💚 compile 0m 42s trunk passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 compile 0m 34s trunk passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 checkstyle 0m 33s trunk passed
+1 💚 mvnsite 0m 40s trunk passed
+1 💚 javadoc 0m 41s trunk passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 javadoc 0m 33s trunk passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 spotbugs 1m 7s trunk passed
+1 💚 shadedclient 38m 24s branch has no errors when building and testing our client artifacts.
-0 ⚠️ patch 38m 45s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 30s the patch passed
+1 💚 compile 0m 37s the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 javac 0m 37s the patch passed
+1 💚 compile 0m 28s the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 javac 0m 28s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 20s /results-checkstyle-hadoop-tools_hadoop-aws.txt hadoop-tools/hadoop-aws: The patch generated 1 new + 14 unchanged - 11 fixed = 15 total (was 25)
+1 💚 mvnsite 0m 32s the patch passed
-1 ❌ javadoc 0m 30s /patch-javadoc-hadoop-tools_hadoop-aws-jdkUbuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.txt hadoop-aws in the patch failed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.
-1 ❌ javadoc 0m 25s /patch-javadoc-hadoop-tools_hadoop-aws-jdkPrivateBuild-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga.txt hadoop-aws in the patch failed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga.
+1 💚 spotbugs 1m 7s the patch passed
+1 💚 shadedclient 38m 19s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 0m 32s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 35s The patch does not generate ASF License warnings.
148m 18s
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7214/13/artifact/out/Dockerfile
GITHUB PR #7214
Optional Tests dupname asflicense codespell detsecrets xmllint compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint
uname Linux 908740910b43 5.15.0-130-generic #140-Ubuntu SMP Wed Dec 18 17:59:53 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / b5346a1
Default Java Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7214/13/testReport/
Max. process+thread count 607 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7214/13/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran
Contributor Author

Do you think we should fall back if a stream factory fails to load? If the factories depend on third-party libraries, those libs may not be deployed across the cluster.

Good: something works
Bad: you don't know what you've got.

We can/should add an iostats gauge which indicates which stream is in use, and serve it up in both the FS and the stream.

@ahmarsuhail
Contributor

@steveloughran I personally think we should throw the failure and not have a fallback. Users of both the prefetching input stream and AAL will expect performance benefits from using them, and if failures are not visible, it'll lead to people thinking those streams aren't any faster.

@steveloughran
Contributor Author

@ahmarsuhail +1

now, unrelated issue. It looks to me like the jersey update's associated junit stuff has stopped tests being discovered in hadoop-aws. I'm rebasing this PR onto the commit before that one just so I can make progress.

Can you check out and build trunk and tell me whether your run of the hadoop-aws unit tests runs any tests, or is it my setup (across both git clones I have of the repo)?

@steveloughran steveloughran changed the title HADOOP-19354. S3AInputStream to be created by factory under S3AStore HADOOP-19354. S3A: S3AInputStream to be created by factory under S3AStore Jan 28, 2025
@ahmarsuhail
Contributor

@steveloughran I just hit the same issue on my CRT PR, unable to run tests :(

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO]
[INFO] --- failsafe:3.0.0-M1:integration-test (sequential-integration-tests) @ hadoop-aws ---
[INFO]
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO]
[INFO] --- enforcer:3.5.0:enforce (depcheck) @ hadoop-aws ---
[INFO] Rule 0: org.apache.maven.enforcer.rules.dependency.DependencyConvergence passed
[INFO] Rule 1: org.apache.maven.enforcer.rules.dependency.BannedDependencies passed
[INFO]
[INFO] --- failsafe:3.0.0-M1:verify (default-integration-test) @ hadoop-aws ---
[INFO]
[INFO] --- failsafe:3.0.0-M1:verify (sequential-integration-tests) @ hadoop-aws ---

@steveloughran steveloughran force-pushed the s3/HADOOP-19354-s3a-inputstream-factory branch from b5346a1 to 745492d Compare January 28, 2025 16:32
S3 InputStreams are created by a factory class, with the
factory dynamically chosen by the option

  fs.s3a.input.stream.type

Supported values: classic, prefetching, analytics.
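As a configuration example, the option can be set in the usual Hadoop XML form; a minimal sketch (the property name and values come from this commit message, the rest is standard core-site.xml convention):

```xml
<!-- core-site.xml: select the S3A input stream implementation -->
<property>
  <name>fs.s3a.input.stream.type</name>
  <!-- one of: classic, prefetching, analytics -->
  <value>prefetching</value>
</property>
```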

S3AStore

* Manages the creation and service lifecycle of the chosen factory,
  as well as forwarding stream construction requests to the chosen factory.
* Provides the callbacks needed by both the factories and input streams.
* StreamCapabilities.hasCapability(), which is
  relayed to the active factory. This avoids the FS having
  to know what capabilities are available in the stream.
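The arrangement above can be sketched as follows. This is a hypothetical, self-contained illustration, not the actual Hadoop code: the class names mirror the PR, the capability string is invented, and the stream bodies are placeholders.

```java
import java.io.InputStream;
import java.util.Locale;

public class StreamFactorySketch {

    /** Factory interface: creates streams and answers capability probes. */
    interface ObjectInputStreamFactory {
        InputStream create(String key);
        boolean hasCapability(String capability);
    }

    /** "classic" factory: plain blocking reads, no extra capabilities. */
    static class ClassicStreamFactory implements ObjectInputStreamFactory {
        @Override
        public InputStream create(String key) {
            return InputStream.nullInputStream();  // placeholder for a blocking GET stream
        }
        @Override
        public boolean hasCapability(String capability) {
            return false;  // advertises nothing extra
        }
    }

    /** "prefetching" factory: advertises an (invented) prefetch capability. */
    static class PrefetchingStreamFactory implements ObjectInputStreamFactory {
        @Override
        public InputStream create(String key) {
            return InputStream.nullInputStream();  // placeholder for a prefetching stream
        }
        @Override
        public boolean hasCapability(String capability) {
            return "fs.capability.prefetch".equals(capability);  // hypothetical capability name
        }
    }

    /**
     * The store chooses the factory once at startup from the configured type;
     * hasCapability() probes are then relayed to it, so the FS never needs to
     * know which stream implementation is active.
     */
    static ObjectInputStreamFactory chooseFactory(String streamType) {
        switch (streamType.toLowerCase(Locale.ROOT)) {
        case "classic":
            return new ClassicStreamFactory();
        case "prefetching":
            return new PrefetchingStreamFactory();
        default:
            throw new IllegalArgumentException("Unknown stream type: " + streamType);
        }
    }
}
```

The key design point is that capability probing lives with the factory, so adding a new stream type never touches the filesystem class.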
@steveloughran steveloughran force-pushed the s3/HADOOP-19354-s3a-inputstream-factory branch from 745492d to 9c8e753 Compare January 28, 2025 16:33
Ability to create custom streams (type = custom); the factory
class name is read from "fs.s3a.input.stream.custom.factory".
This is mainly for testing, especially CNFE handling and similar failure modes.

Unit test TestStreamFactories for this.
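The custom type amounts to reflective loading, which is also how a missing third-party class surfaces as a ClassNotFoundException at startup. A hypothetical sketch, not the real Hadoop helper:

```java
// Hypothetical sketch of "custom" stream type resolution: the factory class
// name comes from configuration and is instantiated reflectively.
public class CustomFactoryLoader {

    public static Object loadFactory(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ClassNotFoundException e) {
            // the CNFE case the tests exercise: class not on the classpath
            throw new IllegalStateException("Custom factory class not found: " + className, e);
        } catch (ReflectiveOperationException e) {
            // no-arg constructor missing, not instantiable, etc.
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }
}
```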

ObjectInputStreams save and export stream type to assist
these tests too, as it enables assertions on the generated
stream type.

Simplified the logic related to the old prefetch-enabled flag.

If fs.s3a.prefetch.enabled is true, the prefetch stream is returned
and the stream.type option is not used at all. Simpler logic, simpler
docs, fewer support calls.
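That precedence rule can be sketched as follows (a hypothetical illustration of the rule described above, not the actual resolution code):

```java
import java.util.Locale;

public class StreamTypeResolution {

    enum StreamType { CLASSIC, PREFETCHING, ANALYTICS }

    /**
     * The legacy boolean wins outright: when prefetching was explicitly
     * enabled, the newer stream.type option is never consulted.
     */
    static StreamType resolve(boolean prefetchEnabled, String configuredType) {
        if (prefetchEnabled) {
            return StreamType.PREFETCHING;
        }
        return StreamType.valueOf(configuredType.trim().toUpperCase(Locale.ROOT));
    }
}
```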

Parameters supplied to ObjectInputStreamFactory.bind converted
to a parameter object. Allows for more parameters to be added later
if ever required.

ObjectInputStreamFactory returns more requirements to
the store/fs. For this reason
  StreamThreadOptions threadRequirements();
is renamed
  StreamFactoryRequirements factoryRequirements()

VectorIO context changes
* Returned in factoryRequirements()
* existing configuration-reading code moved into
  StreamIntegration.populateVectoredIOContext()
* Streams which don't have custom vector IO, e.g. prefetching,
  can return a minimum seek range of 0.
  This disables range merging in the default PositionedReadable
  implementation, which ensures that they will only be asked for
  data which will actually be read, leaving the prefetch/cache code
  to know exactly what is needed.
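The range-merging point can be illustrated with a toy merge counter; this is a sketch of the general idea only, assuming ranges pre-sorted by offset, and is not the real PositionedReadable vector-read logic:

```java
// Toy illustration: two neighbouring ranges are merged only when the gap
// between them is smaller than the minimum seek distance, so a minimum
// seek of 0 never merges anything and each range is read exactly as asked.
public class RangeMergeSketch {

    /**
     * Count the ranges left after merging. Each range is {offset, length},
     * pre-sorted by offset; a gap smaller than minSeek is cheaper to read
     * through than to seek over, so such neighbours are coalesced.
     */
    static int countAfterMerge(long[][] ranges, long minSeek) {
        if (ranges.length == 0) {
            return 0;
        }
        int count = 1;
        long end = ranges[0][0] + ranges[0][1];  // end of the current merged range
        for (int i = 1; i < ranges.length; i++) {
            long gap = ranges[i][0] - end;
            if (gap >= minSeek) {
                count++;                         // too far away: start a new range
            }
            end = Math.max(end, ranges[i][0] + ranges[i][1]);
        }
        return count;
    }
}
```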

Other
 * Draft docs.
 * Stream capability declares stream type
   & is exported through FS too.
   (todo: test, document, add to bucket-info)
 * ConfigurationHelper.resolveEnum() supersedes
   Configuration.getEnum() with:
     - case independence
     - the fallback is a Supplier&lt;Enum&gt; rather than a simple
       value.
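A hypothetical sketch of such a resolveEnum-style helper (the signature and enum are invented for illustration; only the case independence and Supplier fallback mirror the description above):

```java
import java.util.Locale;
import java.util.function.Supplier;

public class EnumResolver {

    enum Mode { CLASSIC, PREFETCHING }

    /**
     * Case-independent enum lookup; the Supplier fallback is evaluated only
     * when the value is unset or unrecognised, rather than being a fixed
     * default evaluated up front.
     */
    static <E extends Enum<E>> E resolveEnum(
            Class<E> enumClass, String value, Supplier<E> fallback) {
        if (value == null || value.trim().isEmpty()) {
            return fallback.get();
        }
        for (E e : enumClass.getEnumConstants()) {
            if (e.name().equalsIgnoreCase(value.trim())) {
                return e;
            }
        }
        return fallback.get();
    }
}
```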

Change-Id: I2e59300af48042df8173de61d0b3d6139a0ae7fe
@apache apache deleted a comment from hadoop-yetus Jan 30, 2025
@apache apache deleted a comment from hadoop-yetus Jan 30, 2025
@steveloughran
Contributor Author

  • big new version, lots of changes
  • not compatible with anyone rebasing, but I think this is stabilising now. Sorry!
  • this PR is based on the last hadoop-trunk where the tests ran

Not fully tested yet. I want to have the stream type passed down as a -D option
