HADOOP-18103. High performance vectored read API in Hadoop #4476

mukund-thakur · 2022-06-21T01:34:15Z

Description of PR

Add support for a multi-range vectored read API in PositionedReadable. The default implementation iterates through the ranges and reads each one synchronously, but the intent is that FSDataInputStream subclasses, especially object store implementations, can provide more efficient readers.
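To illustrate the default behaviour described above, here is a minimal, self-contained sketch of reading a list of ranges one after another. It is not the Hadoop code: the `Range` class, the `readVectored` signature, and the in-memory byte array standing in for a positioned-read stream are all assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch of a sequential vectored read; not the Hadoop classes. */
public class SequentialVectoredRead {

  /** A requested byte range plus the future that will hold its data. */
  static final class Range {
    final long offset;
    final int length;
    final CompletableFuture<ByteBuffer> data = new CompletableFuture<>();

    Range(long offset, int length) {
      this.offset = offset;
      this.length = length;
    }
  }

  /** Stand-in for a positioned read: copies from an in-memory byte array. */
  static void readFully(byte[] source, long position, byte[] dest) {
    if (position < 0 || position + dest.length > source.length) {
      throw new IndexOutOfBoundsException("range outside source");
    }
    System.arraycopy(source, (int) position, dest, 0, dest.length);
  }

  /** Default strategy: read each range in turn on the calling thread. */
  static void readVectored(byte[] source, List<Range> ranges) {
    for (Range r : ranges) {
      try {
        byte[] buf = new byte[r.length];
        readFully(source, r.offset, buf);
        r.data.complete(ByteBuffer.wrap(buf));
      } catch (RuntimeException e) {
        // A failed range fails only its own future.
        r.data.completeExceptionally(e);
      }
    }
  }

  public static void main(String[] args) {
    byte[] file = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
    List<Range> ranges = new ArrayList<>();
    ranges.add(new Range(2, 3));
    ranges.add(new Range(10, 4));
    readVectored(file, ranges);
    for (Range r : ranges) {
      ByteBuffer b = r.data.join();
      System.out.println(new String(b.array(), StandardCharsets.US_ASCII));
    }
  }
}
```

Because every range carries its own future, a faster implementation could complete the same futures from background threads without changing how callers consume the results.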

How was this patch tested?

Added new tests. Ran older test suites in ap-south-1. All good.

For code changes:

Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

Part of HADOOP-18103. Add support for a multi-range vectored read API in PositionedReadable. The default implementation iterates through the ranges and reads each one synchronously, but the intent is that FSDataInputStream subclasses, especially object store implementations, can provide more efficient readers. Also adds an implementation in S3A where smaller ranges are merged and sliced byte buffers are returned to the readers. All the merged ranges are fetched from S3 asynchronously. Contributed By: Owen O'Malley and Mukund Thakur

… maxReadSizeForVectorReads (#3964) Part of HADOOP-18103. Introduces fs.s3a.vectored.read.min.seek.size and fs.s3a.vectored.read.max.merged.size to configure the minimum seek and the maximum read size during a vectored IO operation in the S3A connector. These properties define how the ranges are merged. To completely disable merging, set fs.s3a.vectored.read.max.merged.size to 0. Contributed By: Mukund Thakur
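As a rough illustration of how a minimum seek size and a maximum merged size could drive range merging, here is a self-contained sketch. The class, the method names, and the exact merge rule (merge when the gap is below `minSeek` and the combined span does not exceed `maxMergedSize`) are assumptions for this example rather than a copy of the S3A implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of gap-based range merging; not the S3A code. */
public class RangeMergeSketch {

  /** Half-open byte range [start, end). */
  static final class Range {
    final long start;
    final long end;

    Range(long start, long end) {
      this.start = start;
      this.end = end;
    }

    @Override
    public String toString() {
      return "[" + start + ", " + end + ")";
    }
  }

  /**
   * Sort ranges by start offset, then fold each range into the previous one
   * when the gap between them is smaller than minSeek and the combined span
   * stays within maxMergedSize.
   */
  static List<Range> merge(List<Range> input, long minSeek, long maxMergedSize) {
    List<Range> sorted = new ArrayList<>(input);
    sorted.sort(Comparator.comparingLong((Range r) -> r.start));
    List<Range> out = new ArrayList<>();
    Range current = null;
    for (Range r : sorted) {
      if (current != null
          && r.start - current.end < minSeek
          && Math.max(current.end, r.end) - current.start <= maxMergedSize) {
        current = new Range(current.start, Math.max(current.end, r.end));
      } else {
        if (current != null) {
          out.add(current);
        }
        current = r;
      }
    }
    if (current != null) {
      out.add(current);
    }
    return out;
  }

  public static void main(String[] args) {
    List<Range> ranges = new ArrayList<>();
    ranges.add(new Range(0, 100));
    ranges.add(new Range(150, 300));
    ranges.add(new Range(10_000, 10_100));
    // Illustrative values only: 4 KiB minimum seek, 1 MiB maximum merged size.
    System.out.println(merge(ranges, 4096, 1 << 20));
    // A maximum merged size of 0 means no two ranges can ever be combined.
    System.out.println(merge(ranges, 4096, 0));
  }
}
```

With these inputs the first call combines the two nearby ranges into one and leaves the distant range alone, while the second call returns all three ranges unmerged, which matches the idea that a maximum merged size of 0 disables merging.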

part of HADOOP-18103. Contributed By: Mukund Thakur

Part of HADOOP-18103. Required for the vectored IO feature. None of the current buffer pool implementations is complete. ElasticByteBufferPool doesn't use weak references and could lead to memory leaks, and DirectBufferPool doesn't support caller preferences for direct versus heap buffers and only has a fixed-length buffer implementation. Contributed By: Mukund Thakur
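The idea of a buffer pool that holds its buffers through weak references, while honouring a direct or heap preference, can be sketched as follows. This is a simplified illustration under assumed names (`WeakBufferPoolSketch`, `getBuffer`, `putBuffer`), not the pool added by this patch.

```java
import java.lang.ref.WeakReference;
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a weak-reference buffer pool; not the Hadoop class. */
public class WeakBufferPoolSketch {

  /** Orders pooled buffers by capacity; the sequence number keeps keys unique. */
  private static final class Key implements Comparable<Key> {
    final int capacity;
    final long seq;

    Key(int capacity, long seq) {
      this.capacity = capacity;
      this.seq = seq;
    }

    @Override
    public int compareTo(Key other) {
      int c = Integer.compare(capacity, other.capacity);
      return c != 0 ? c : Long.compare(seq, other.seq);
    }
  }

  private final TreeMap<Key, WeakReference<ByteBuffer>> heap = new TreeMap<>();
  private final TreeMap<Key, WeakReference<ByteBuffer>> direct = new TreeMap<>();
  private long nextSeq = 0;

  private TreeMap<Key, WeakReference<ByteBuffer>> tree(boolean isDirect) {
    return isDirect ? direct : heap;
  }

  /** Return a pooled buffer of at least the given length, or allocate one. */
  public synchronized ByteBuffer getBuffer(boolean isDirect, int length) {
    TreeMap<Key, WeakReference<ByteBuffer>> tree = tree(isDirect);
    Map.Entry<Key, WeakReference<ByteBuffer>> entry;
    while ((entry = tree.ceilingEntry(new Key(length, 0))) != null) {
      tree.remove(entry.getKey());
      ByteBuffer pooled = entry.getValue().get();
      if (pooled != null) {
        return pooled; // capacity may exceed the requested length
      }
      // The referent was garbage collected; drop the entry and keep looking.
    }
    return isDirect ? ByteBuffer.allocateDirect(length) : ByteBuffer.allocate(length);
  }

  /** Return a buffer to the pool; the pool keeps only a weak reference to it. */
  public synchronized void putBuffer(ByteBuffer buffer) {
    buffer.clear();
    tree(buffer.isDirect()).put(new Key(buffer.capacity(), nextSeq++),
        new WeakReference<>(buffer));
  }

  public static void main(String[] args) {
    WeakBufferPoolSketch pool = new WeakBufferPoolSketch();
    ByteBuffer first = pool.getBuffer(false, 64);
    pool.putBuffer(first);
    ByteBuffer second = pool.getBuffer(false, 32);
    System.out.println("reused: " + (first == second));
  }
}
```

Since the pool never holds a strong reference, an idle buffer can be reclaimed by the garbage collector under memory pressure instead of accumulating indefinitely.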

Part of HADOOP-18103. Handles memory fragmentation in the S3A vectored IO implementation by allocating smaller buffers, each sized to the range the user requested, filling them directly from the remote S3 stream, and skipping undesired data between ranges. This patch also aborts active vectored reads when the stream is closed or unbuffer() is called. Contributed By: Mukund Thakur
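The approach described above, filling one exactly sized buffer per requested range from a single stream and skipping the bytes in between, might look roughly like this. The class and method names are hypothetical, and a `ByteArrayInputStream` stands in for the remote S3 stream.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of draining a merged read into per-range buffers. */
public class DrainRangesSketch {

  /** Skip exactly n bytes, tolerating streams whose skip() returns 0. */
  static void skipFully(InputStream in, long n) throws IOException {
    while (n > 0) {
      long skipped = in.skip(n);
      if (skipped <= 0) {
        if (in.read() < 0) {
          throw new EOFException("EOF while skipping");
        }
        skipped = 1;
      }
      n -= skipped;
    }
  }

  /** Read until dest is full or fail with EOFException. */
  static void readFully(InputStream in, byte[] dest) throws IOException {
    int done = 0;
    while (done < dest.length) {
      int n = in.read(dest, done, dest.length - done);
      if (n < 0) {
        throw new EOFException("EOF while reading range");
      }
      done += n;
    }
  }

  /**
   * @param in          stream positioned at streamStart
   * @param streamStart file offset of the first byte the stream will return
   * @param ranges      sorted, non-overlapping {offset, length} pairs
   * @return one buffer per range, each sized to exactly that range
   */
  static List<ByteBuffer> drain(InputStream in, long streamStart, long[][] ranges)
      throws IOException {
    List<ByteBuffer> out = new ArrayList<>();
    long position = streamStart;
    for (long[] r : ranges) {
      skipFully(in, r[0] - position);      // discard the gap before this range
      byte[] buf = new byte[(int) r[1]];   // allocate only what was requested
      readFully(in, buf);
      out.add(ByteBuffer.wrap(buf));
      position = r[0] + r[1];
    }
    return out;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
    long[][] ranges = {{2, 3}, {10, 4}};
    for (ByteBuffer b : drain(new ByteArrayInputStream(data), 0, ranges)) {
      System.out.println(new String(b.array(), StandardCharsets.US_ASCII));
    }
  }
}
```

Compared with reading the whole merged span into one large buffer and slicing it, this keeps each returned buffer independent, so releasing one range does not pin the memory of the others.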

mukund-thakur · 2022-06-21T01:35:17Z

@steveloughran, @mehakmeet So I will go ahead and merge this if Yetus is okay?

hadoop-yetus · 2022-06-21T21:02:06Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 38s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 1s		No case conflicting files found.
+0 🆗	codespell	0m 1s		codespell was not available.
+0 🆗	detsecrets	0m 1s		detect-secrets was not available.
+0 🆗	shelldocs	0m 1s		Shelldocs was not available.
+0 🆗	markdownlint	0m 1s		markdownlint was not available.
+0 🆗	xmllint	0m 1s		xmllint was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
+1 💚	test4tests	0m 0s		The patch appears to include 12 new or modified test files.
			_ trunk Compile Tests _
+0 🆗	mvndep	16m 2s		Maven dependency ordering for branch
+1 💚	mvninstall	25m 10s		trunk passed
+1 💚	compile	23m 3s		trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚	compile	20m 32s		trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	checkstyle	4m 27s		trunk passed
+1 💚	mvnsite	18m 47s		trunk passed
+1 💚	javadoc	8m 13s		trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚	javadoc	7m 23s		trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+0 🆗	spotbugs	0m 35s		branch/hadoop-project no spotbugs output file (spotbugsXml.xml)
+1 💚	shadedclient	50m 33s		branch has no errors when building and testing our client artifacts.
-0 ⚠️	patch	51m 2s		Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
			_ Patch Compile Tests _
+0 🆗	mvndep	0m 58s		Maven dependency ordering for patch
+1 💚	mvninstall	32m 38s		the patch passed
+1 💚	compile	22m 31s		the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
-1 ❌	javac	22m 31s	/results-compile-javac-root-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.txt	root-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 generated 3 new + 2875 unchanged - 0 fixed = 2878 total (was 2875)
+1 💚	compile	20m 29s		the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
-1 ❌	javac	20m 29s	/results-compile-javac-root-jdkPrivateBuild-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07.txt	root-jdkPrivateBuild-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 generated 3 new + 2671 unchanged - 0 fixed = 2674 total (was 2671)
+1 💚	blanks	0m 0s		The patch has no blanks issues.
-0 ⚠️	checkstyle	4m 11s	/results-checkstyle-root.txt	root: The patch generated 3 new + 90 unchanged - 3 fixed = 93 total (was 93)
+1 💚	mvnsite	18m 25s		the patch passed
+1 💚	shellcheck	0m 0s		No new issues.
+1 💚	javadoc	8m 4s		the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚	javadoc	7m 29s		root-jdkPrivateBuild-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 generated 0 new + 2328 unchanged - 1 fixed = 2328 total (was 2329)
+0 🆗	spotbugs	0m 33s		hadoop-project has no data from spotbugs
+1 💚	shadedclient	53m 13s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
-1 ❌	unit	785m 22s	/patch-unit-root.txt	root in the patch failed.
+0 🆗	asflicense	2m 13s		ASF License check generated no output?
		1166m 24s

Reason	Tests
Failed junit tests	hadoop.cli.TestHDFSCLI
	hadoop.tools.dynamometer.workloadgenerator.TestWorkloadGenerator
	hadoop.tools.TestHadoopArchives
	hadoop.streaming.TestStreamingOutputOnlyKeys
	hadoop.yarn.server.router.webapp.TestRouterWebServicesREST
	hadoop.mapred.TestLocalDistributedCacheManager

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4476/1/artifact/out/Dockerfile
GITHUB PR	#4476
Optional Tests	dupname asflicense codespell detsecrets shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint xmllint
uname	Linux 62db4b268209 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	trunk / `645bcd6`
Default Java	Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions	/usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4476/1/testReport/
Max. process+thread count	2872 (vs. ulimit of 5500)
modules	C: hadoop-project hadoop-common-project/hadoop-common hadoop-common-project hadoop-tools/hadoop-aws hadoop-tools/hadoop-benchmark hadoop-tools . U: .
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4476/1/console
versions	git=2.25.1 maven=3.6.3 spotbugs=4.2.2 shellcheck=0.7.0
Powered by	Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

steveloughran · 2022-06-21T21:11:49Z

As discussed in a call, I think this should be merged as a merge commit of the chain, rather than through the GitHub UI. I will help with this; it can be done through the CLI or Sourcetree.

steveloughran · 2022-06-21T21:12:53Z

test failure hadoop.mapred.TestLocalDistributedCacheManager is mine; TestHDFSCLI is known

mukund-thakur · 2022-06-21T22:21:34Z

The other test failures seem unrelated to this:
OutOfMemoryError: unable to create new native thread

mukund-thakur · 2022-06-21T22:24:11Z

./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java:614: final int maxReadSizeVectored = (int) longBytesOption(conf, AWS_S3_VECTOR_READS_MAX_MERGED_READ_SIZE,: Line is longer than 100 characters (found 105). [LineLength]
This is the only new checkstyle warning, and it can be fixed.

This feature adds methods for ranged vectored read operations in PositionedReadable. All streams which implement that interface support the new API. The default implementation reads each range in the vector sequentially. However, specific implementations may provide higher-performance versions. This is done in two places:

* Local FileSystem/Checksum FileSystem
* The S3A client.

The S3A client first coalesces adjacent and "nearby" ranges together, then fetches each range in separate HTTP GET requests, executed in parallel. As such it delivers significant speedups to applications reading separate blocks of data from the same file, columnar data format libraries in particular.

This is the merge commit of the feature branch; the work is in:

HADOOP-11867. Add a high-performance vectored read API.
HADOOP-18104. S3A: Add configs to configure minSeekForVectorReads and maxReadSizeForVectorReads.
HADOOP-18107. Adding scale test for vectored reads for large file
HADOOP-18105. Implement buffer pooling with weak references.
HADOOP-18106. Handle memory fragmentation in S3A Vectored IO.

Contributed By: Owen O'Malley and Mukund Thakur

steveloughran · 2022-06-22T17:50:31Z

merged manually, closing work. great work mukund & owen, once we get this picked up it is going to deliver significant speedups. get those apachecon demos ready!

mukund-thakur added 5 commits June 20, 2022 17:30

HADOOP-18107 Adding scale test for vectored reads for large file (#4273)

0640790

part of HADOOP-18103. Contributed By: Mukund Thakur

mukund-thakur requested a review from steveloughran June 21, 2022 01:34

steveloughran closed this Jun 22, 2022
