HADOOP-18221. Drains stream async before closing #4294
Conversation
This is the initial merge of the HADOOP-18028 S3A performance input stream. This patch on its own is incomplete and must be accompanied by all other commits with HADOOP-18028 in their git commit message. Consult the JIRA for that list. Contributed by Bhalchandra Pandit.
…3A prefetching stream (apache#4115) Contributed by PJ Fanning.
Contributed by Ahmar Suhail
…ache#4212) Contributed by Monthon Klongklaew
💔 -1 overall
This message was automatically generated.
Looks good to me!
@ahmarsuhail In hadoop/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java Lines 612 to 623 in e0cd0a8
Can we create a JIRA to add it after rebase?
Thanks danny, I've created https://issues.apache.org/jira/browse/HADOOP-18230
💔 -1 overall
This message was automatically generated.
@steveloughran this is ready for review now, would you be able to take a look?
look at the changes in hadoop trunk s3a input stream in #2584 as a basis for this work. (i plan to rebase this branch this week, so you will have merge problems...sorry)
async draining does deliver speedups, but only if the amount of data to be read is "large enough". for small amounts of data, synchronous draining is lower overhead and guarantees the active http connection can be reused.
when i do the merge there will be an async drain threshold for this.
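A rough sketch of the threshold logic being described, with illustrative names (asyncDrainThreshold, drain() and futurePool here are assumptions, not the actual trunk code):

// sketch only: pick sync vs async drain by how many bytes remain.
// asyncDrainThreshold, drain() and futurePool are illustrative names.
private void closeStream(long remaining) {
  if (remaining <= asyncDrainThreshold) {
    // little data left: draining inline is cheap and guarantees
    // the http connection goes straight back to the pool
    drain(remaining);
  } else {
    // lots of data left: hand the drain off so close() returns fast
    futurePool.submit(() -> drain(remaining));
  }
}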
Io.closeIgnoringIoException(this.inputStream);
Io.closeIgnoringIoException(this.obj);
} catch (Exception e) {
if this happens then the read raised an exception. the stream MUST be aborted to stop it being returned to the http connection pool, as its connection is probably broken
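For illustration, a minimal shape of that rule (abort() is the AWS SDK v1 S3ObjectInputStream method; the other names are assumptions):

try {
  // read to end of stream; helper name is hypothetical
  drainByReadingToEnd(inputStream);
  // clean drain: safe to close and return the connection to the pool
  inputStream.close();
} catch (Exception e) {
  // the drain read failed, so the connection state is unknown;
  // abort() discards the connection instead of recycling it
  inputStream.abort();
}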
public void run() {
  try {
    while (this.inputStream.read() >= 0) {
look at the changes in hadoop trunk s3a input stream here...it reads into a buffer for draining, and is marginally faster
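Something along these lines, sketched (the 16 KB buffer size is arbitrary):

// drain in blocks rather than byte-at-a-time: read(byte[]) amortises
// the per-call overhead, so reaching end-of-stream is marginally faster
byte[] drainBuffer = new byte[16 * 1024];
while (inputStream.read(drainBuffer) >= 0) {
  // data is discarded; the only goal is to hit end of stream
}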
/**
 * Drain task that is submitted to the future pool.
 */
private static class DrainTask implements Runnable {
- declare final to keep style checker happy
- i'd prefer not to use Runnable, instead completable futures; look at drainOrAbortHttpStream() and its use, at which point you can just pass in a function (roughly the shape sketched below)
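Roughly the pattern being suggested, sketched with assumed signatures (drainOrAbortHttpStream() is named above, but its parameters here are guesses):

import java.util.concurrent.CompletableFuture;

// no named DrainTask class: submit a lambda and keep a future that
// callers (or tests) can block on or chain further work onto.
// assumes futurePool implements java.util.concurrent.Executor
CompletableFuture<Void> drainFuture =
    CompletableFuture.runAsync(
        () -> drainOrAbortHttpStream(shouldAbort, inputStream),
        futurePool);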
@@ -98,6 +106,7 @@ public S3File(
this.streamStatistics = streamStatistics;
this.changeTracker = changeTracker;
this.s3Objects = new IdentityHashMap<InputStream, S3Object>();
this.futurePool = context.getFuturePool();
if this is only for the drain, given the context is already stored, you can just get the pool when needed
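i.e. something like this at the single drain call site (context.getFuturePool() appears in the diff above; the submit call and DrainTask's constructor arguments are guesses):

// no futurePool field: the context is already stored, so fetch
// the pool only where the drain is actually submitted
context.getFuturePool().submit(new DrainTask(inputStream, obj));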
Description of PR
If we close the prefetching input stream before the prefetched blocks have finished reading from the S3 input stream, the SDK repeatedly complains "Not all bytes were read from the S3ObjectInputStream". This happened on S3AInputStream as well; see this issue for more details. Closing the stream before draining it will abort the connection, so to allow for connection reuse we drain it asynchronously.
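A hedged sketch of that close path (DrainTask is the class added in this PR; remaining() and the constructor arguments are illustrative):

@Override
public void close() throws IOException {
  if (remaining() > 0) {
    // bytes still unread: drain off the calling thread so close()
    // stays cheap and the http connection can be reused later
    futurePool.submit(new DrainTask(inputStream, obj));
  } else {
    // already at end of stream: a plain close recycles the connection
    inputStream.close();
  }
}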
How was this patch tested?
Tested in eu-west-1 by running
mvn -Dparallel-tests -DtestsThreadCount=16 clean verify