HADOOP-18103. High performance vectored read API in Hadoop #4476

Closed
wants to merge 5 commits into from


mukund-thakur
Contributor

@mukund-thakur mukund-thakur commented Jun 21, 2022

Description of PR

Adds a vectored read API to PositionedReadable that accepts multiple ranges. The default implementation iterates through the ranges and reads each one synchronously, but the intent is that FSDataInputStream subclasses, especially object store implementations, can provide more efficient readers.
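
As a rough illustration of the default behaviour described above, the sketch below reads each requested range one after another from an in-memory byte array. The `Range` class and the `readVectored` helper are hypothetical stand-ins invented for this example; they are not the API added by this PR.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only: not the Hadoop PositionedReadable API.
class SequentialVectoredReadSketch {

  /** A requested byte range: starting offset plus length (illustrative type). */
  static final class Range {
    final long offset;
    final int length;

    Range(long offset, int length) {
      this.offset = offset;
      this.length = length;
    }
  }

  /** Reads every range in order, one at a time, from an in-memory "file". */
  static List<ByteBuffer> readVectored(byte[] file, List<Range> ranges) {
    List<ByteBuffer> results = new ArrayList<>();
    for (Range r : ranges) {
      if (r.offset < 0 || r.length < 0 || r.offset + r.length > file.length) {
        throw new IllegalArgumentException("range outside file");
      }
      ByteBuffer buffer = ByteBuffer.allocate(r.length);
      buffer.put(file, (int) r.offset, r.length); // synchronous positioned read
      buffer.flip();                              // expose the data from position 0
      results.add(buffer);
    }
    return results;
  }

  public static void main(String[] args) {
    byte[] file = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
    List<ByteBuffer> out =
        readVectored(file, Arrays.asList(new Range(2, 3), new Range(10, 4)));
    for (ByteBuffer b : out) {
      byte[] bytes = new byte[b.remaining()];
      b.get(bytes);
      System.out.println(new String(bytes, StandardCharsets.US_ASCII));
    }
  }
}
```

Each range blocks until its data is in memory before the next one starts, which is the cost that a store-specific implementation would try to avoid.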

How was this patch tested?

Added new tests. Ran older test suites in ap-south-1. All good.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

part of HADOOP-18103.
Add support for a multi-range vectored read API in PositionedReadable.
The default implementation iterates through the ranges and reads each one
synchronously, but the intent is that FSDataInputStream subclasses can
provide more efficient readers, especially in object store implementations.

Also added an implementation in S3A where smaller ranges are merged and
sliced byte buffers are returned to the readers. All the merged ranges are
fetched from S3 asynchronously.

Contributed By: Owen O'Malley and Mukund Thakur
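
To make the "sliced byte buffers" idea concrete, here is a minimal sketch, assuming a merged read has already landed in one `ByteBuffer`. The `sliceTo` helper and its parameter names are hypothetical and are not taken from the patch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of handing out per-range views of one merged buffer.
class SliceMergedBufferSketch {

  /**
   * Returns a view of the bytes for file range [offset, offset + length),
   * where the merged buffer holds file data starting at mergedOffset.
   * The view shares memory with the merged buffer; nothing is copied.
   */
  static ByteBuffer sliceTo(ByteBuffer merged, long mergedOffset, long offset, int length) {
    long start = offset - mergedOffset;
    if (start < 0 || length < 0 || start + length > merged.limit()) {
      throw new IllegalArgumentException("range is not covered by the merged buffer");
    }
    ByteBuffer view = merged.duplicate(); // own position/limit, shared content
    view.position((int) start);
    view.limit((int) start + length);
    return view.slice();                  // position 0, capacity == length
  }

  static String asText(ByteBuffer b) {
    byte[] bytes = new byte[b.remaining()];
    b.duplicate().get(bytes);             // read without moving the caller's position
    return new String(bytes, StandardCharsets.US_ASCII);
  }

  public static void main(String[] args) {
    // Pretend this buffer holds file bytes 100..109 fetched by one merged read.
    ByteBuffer merged = ByteBuffer.wrap("abcdefghij".getBytes(StandardCharsets.US_ASCII));
    System.out.println(asText(sliceTo(merged, 100, 102, 3))); // cde
    System.out.println(asText(sliceTo(merged, 100, 107, 2))); // hi
  }
}
```

Because the views share the merged buffer's memory, a small slice keeps the whole merged allocation alive for as long as the caller holds it.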
… maxReadSizeForVectorReads (#3964)

Part of HADOOP-18103.
Introduces fs.s3a.vectored.read.min.seek.size and fs.s3a.vectored.read.max.merged.size
to configure the minimum seek and maximum read sizes during a vectored IO operation in the S3A connector.
These properties define how the ranges are merged. To completely
disable merging, set fs.s3a.vectored.read.max.merged.size to 0.

Contributed By: Mukund Thakur
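
One way to picture how a minimum seek size and a maximum merged size could drive merging is the sketch below. The rule it uses (merge two neighbours when the gap between them is smaller than `minSeek` and the combined span does not exceed `maxSize`) is an assumption made for illustration, and `Range` and `merge` are hypothetical names, not the connector's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of range coalescing driven by two thresholds.
class MergeRangesSketch {

  /** Half-open byte range [start, end). */
  static final class Range {
    final long start;
    final long end;

    Range(long start, long end) {
      this.start = start;
      this.end = end;
    }

    @Override
    public String toString() {
      return "[" + start + "," + end + ")";
    }
  }

  /**
   * Sorts the ranges by start offset, then merges neighbours when the gap
   * between them is below minSeek and the merged span stays within maxSize.
   */
  static List<Range> merge(List<Range> input, long minSeek, long maxSize) {
    List<Range> sorted = new ArrayList<>(input);
    sorted.sort(Comparator.comparingLong((Range r) -> r.start));
    List<Range> merged = new ArrayList<>();
    Range current = null;
    for (Range next : sorted) {
      if (current != null
          && next.start - current.end < minSeek
          && Math.max(current.end, next.end) - current.start <= maxSize) {
        // Close enough and small enough: grow the current merged range.
        current = new Range(current.start, Math.max(current.end, next.end));
      } else {
        if (current != null) {
          merged.add(current);
        }
        current = next;
      }
    }
    if (current != null) {
      merged.add(current);
    }
    return merged;
  }

  public static void main(String[] args) {
    List<Range> ranges =
        Arrays.asList(new Range(0, 100), new Range(150, 250), new Range(5000, 5100));
    System.out.println(merge(ranges, 100, 1000)); // first two ranges coalesce
    System.out.println(merge(ranges, 100, 0));    // maxSize 0: nothing is merged
  }
}
```

Under this rule a maximum size of 0 can never be satisfied by a combined span, which matches the note above that setting the maximum to 0 disables merging.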
part of HADOOP-18103.
Required for the vectored IO feature. None of the current buffer pool
implementations is complete: ElasticByteBufferPool doesn't use
weak references and could lead to memory leaks, and
DirectBufferPool doesn't support caller preferences for direct
or heap buffers and only implements fixed-length buffers.

Contributed By: Mukund Thakur
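
The idea of a pool that honours a direct/heap preference while holding returned buffers only through weak references can be sketched as follows. This is a simplified, hypothetical illustration (`WeakBufferPoolSketch` is an invented name, and it keeps at most one buffer per capacity), not the pool implemented in this commit.

```java
import java.lang.ref.WeakReference;
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: pooled buffers are weakly referenced so the garbage
// collector may reclaim them under memory pressure instead of leaking.
class WeakBufferPoolSketch {

  // One tree per buffer kind, keyed by capacity.
  // Simplification: a second buffer with the same capacity replaces the first.
  private final TreeMap<Integer, WeakReference<ByteBuffer>> heap = new TreeMap<>();
  private final TreeMap<Integer, WeakReference<ByteBuffer>> direct = new TreeMap<>();

  private TreeMap<Integer, WeakReference<ByteBuffer>> tree(boolean isDirect) {
    return isDirect ? direct : heap;
  }

  /** Returns a pooled buffer with capacity >= length, or allocates a new one. */
  synchronized ByteBuffer getBuffer(boolean isDirect, int length) {
    TreeMap<Integer, WeakReference<ByteBuffer>> tree = tree(isDirect);
    Map.Entry<Integer, WeakReference<ByteBuffer>> entry = tree.ceilingEntry(length);
    while (entry != null) {
      tree.remove(entry.getKey());
      ByteBuffer pooled = entry.getValue().get();
      if (pooled != null) {
        pooled.clear(); // reset position and limit before reuse
        return pooled;
      }
      // The referent was garbage collected; try the next larger entry.
      entry = tree.ceilingEntry(length);
    }
    return isDirect ? ByteBuffer.allocateDirect(length) : ByteBuffer.allocate(length);
  }

  /** Returns a buffer to the pool; only a weak reference is retained. */
  synchronized void putBuffer(ByteBuffer buffer) {
    tree(buffer.isDirect()).put(buffer.capacity(), new WeakReference<>(buffer));
  }

  public static void main(String[] args) {
    WeakBufferPoolSketch pool = new WeakBufferPoolSketch();
    ByteBuffer first = pool.getBuffer(false, 64);
    pool.putBuffer(first);
    ByteBuffer second = pool.getBuffer(false, 32); // reuses the 64-byte buffer
    System.out.println(first == second);
    System.out.println(pool.getBuffer(true, 16).isDirect());
  }
}
```

The trade-off is that a pooled buffer may vanish between `putBuffer` and the next `getBuffer`, so the pool always needs the fall-back allocation path.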
part of HADOOP-18103.
Handles memory fragmentation in the S3A vectored IO implementation by
allocating smaller buffers, each sized to a user-requested range, and
filling them directly from the remote S3 stream while skipping the undesired
data in between ranges.
This patch also aborts active vectored reads when the stream is
closed or unbuffer() is called.

Contributed By: Mukund Thakur
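
The "fill per-range buffers and skip the gaps" approach can be sketched with a plain `InputStream` standing in for the remote stream. The `fill`, `skipFully`, and `readFully` helpers and the `Range` class are hypothetical names for this illustration and make no claim about how the patch structures its code.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: one stream covers a merged range, and each requested
// sub-range gets its own exactly-sized buffer while gap bytes are discarded.
class DrainAndFillSketch {

  static final class Range {
    final long offset;
    final int length;

    Range(long offset, int length) {
      this.offset = offset;
      this.length = length;
    }
  }

  /** Ranges must be sorted and non-overlapping; the stream starts at streamStart. */
  static List<ByteBuffer> fill(InputStream in, long streamStart, List<Range> sorted)
      throws IOException {
    List<ByteBuffer> out = new ArrayList<>();
    long pos = streamStart;
    for (Range r : sorted) {
      if (r.offset < pos) {
        throw new IllegalArgumentException("ranges must be sorted and non-overlapping");
      }
      skipFully(in, r.offset - pos);       // drop the unwanted bytes in the gap
      byte[] data = new byte[r.length];    // buffer sized to the requested range only
      readFully(in, data);
      out.add(ByteBuffer.wrap(data));
      pos = r.offset + r.length;
    }
    return out;
  }

  static void skipFully(InputStream in, long count) throws IOException {
    long remaining = count;
    while (remaining > 0) {
      long skipped = in.skip(remaining);
      if (skipped <= 0) {
        // skip() may return 0 without reaching EOF, so probe with a read.
        if (in.read() < 0) {
          throw new EOFException("stream ended while skipping");
        }
        skipped = 1;
      }
      remaining -= skipped;
    }
  }

  static void readFully(InputStream in, byte[] data) throws IOException {
    int done = 0;
    while (done < data.length) {
      int n = in.read(data, done, data.length - done);
      if (n < 0) {
        throw new EOFException("stream ended while reading a range");
      }
      done += n;
    }
  }

  public static void main(String[] args) throws IOException {
    // Pretend the stream delivers file bytes starting at offset 1000.
    byte[] remote = "abcdefghijklmnop".getBytes(StandardCharsets.US_ASCII);
    List<ByteBuffer> out = fill(new ByteArrayInputStream(remote), 1000,
        Arrays.asList(new Range(1002, 3), new Range(1010, 4)));
    for (ByteBuffer b : out) {
      System.out.println(new String(b.array(), StandardCharsets.US_ASCII)); // cde, klmn
    }
  }
}
```

Compared with slicing one large merged buffer, each small buffer here can be released independently once its reader is done with it.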
@mukund-thakur
Contributor Author

@steveloughran, @mehakmeet So I will go ahead and merge this if Yetus is okay?

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 38s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 shelldocs 0m 1s Shelldocs was not available.
+0 🆗 markdownlint 0m 1s markdownlint was not available.
+0 🆗 xmllint 0m 1s xmllint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 12 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 16m 2s Maven dependency ordering for branch
+1 💚 mvninstall 25m 10s trunk passed
+1 💚 compile 23m 3s trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 compile 20m 32s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 4m 27s trunk passed
+1 💚 mvnsite 18m 47s trunk passed
+1 💚 javadoc 8m 13s trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javadoc 7m 23s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+0 🆗 spotbugs 0m 35s branch/hadoop-project no spotbugs output file (spotbugsXml.xml)
+1 💚 shadedclient 50m 33s branch has no errors when building and testing our client artifacts.
-0 ⚠️ patch 51m 2s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 58s Maven dependency ordering for patch
+1 💚 mvninstall 32m 38s the patch passed
+1 💚 compile 22m 31s the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
-1 ❌ javac 22m 31s /results-compile-javac-root-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.txt root-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 generated 3 new + 2875 unchanged - 0 fixed = 2878 total (was 2875)
+1 💚 compile 20m 29s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
-1 ❌ javac 20m 29s /results-compile-javac-root-jdkPrivateBuild-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07.txt root-jdkPrivateBuild-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 generated 3 new + 2671 unchanged - 0 fixed = 2674 total (was 2671)
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 4m 11s /results-checkstyle-root.txt root: The patch generated 3 new + 90 unchanged - 3 fixed = 93 total (was 93)
+1 💚 mvnsite 18m 25s the patch passed
+1 💚 shellcheck 0m 0s No new issues.
+1 💚 javadoc 8m 4s the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javadoc 7m 29s root-jdkPrivateBuild-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 generated 0 new + 2328 unchanged - 1 fixed = 2328 total (was 2329)
+0 🆗 spotbugs 0m 33s hadoop-project has no data from spotbugs
+1 💚 shadedclient 53m 13s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 785m 22s /patch-unit-root.txt root in the patch failed.
+0 🆗 asflicense 2m 13s ASF License check generated no output?
1166m 24s
Reason Tests
Failed junit tests hadoop.cli.TestHDFSCLI
hadoop.tools.dynamometer.workloadgenerator.TestWorkloadGenerator
hadoop.tools.TestHadoopArchives
hadoop.streaming.TestStreamingOutputOnlyKeys
hadoop.yarn.server.router.webapp.TestRouterWebServicesREST
hadoop.mapred.TestLocalDistributedCacheManager
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4476/1/artifact/out/Dockerfile
GITHUB PR #4476
Optional Tests dupname asflicense codespell detsecrets shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint xmllint
uname Linux 62db4b268209 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 645bcd6
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4476/1/testReport/
Max. process+thread count 2872 (vs. ulimit of 5500)
modules C: hadoop-project hadoop-common-project/hadoop-common hadoop-common-project hadoop-tools/hadoop-aws hadoop-tools/hadoop-benchmark hadoop-tools . U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4476/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2 shellcheck=0.7.0
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran
Contributor

as discussed in a call, I think this should be merged as a merge commit of the chain, rather than through the github IDE. I will help with this; it can be done through the cli or sourcetree.

@steveloughran
Contributor

test failure hadoop.mapred.TestLocalDistributedCacheManager is mine; TestHDFSCLI is known

@mukund-thakur
Contributor Author

Other test failures seem unrelated to this:
OutOfMemoryError: unable to create new native thread

@mukund-thakur
Contributor Author

./hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java:614: final int maxReadSizeVectored = (int) longBytesOption(conf, AWS_S3_VECTOR_READS_MAX_MERGED_READ_SIZE,: Line is longer than 100 characters (found 105). [LineLength]
This is the only new checkstyle issue, and it can be fixed.

asfgit pushed a commit that referenced this pull request Jun 22, 2022
This feature adds methods for ranged vectored read operations
in PositionedReadable.

All streams which implement that interface support the new API.

The default implementation reads each range in the vector
sequentially.

However, specific implementations may provide higher performance
versions. This is done in two places:

* Local FileSystem/Checksum FileSystem
* The S3A client.

The S3A client first coalesces adjacent and "nearby" ranges
together, then fetches each range in separate HTTP GET requests,
executed in parallel. As such it delivers significant speedups
to applications reading separate blocks of data from the same
file, columnar data format libraries in particular.

This is the merge commit of the feature branch; the work is in

HADOOP-11867. Add a high-performance vectored read API.
HADOOP-18104. S3A: Add configs to configure minSeekForVectorReads and maxReadSizeForVectorReads.
HADOOP-18107. Adding scale test for vectored reads for large file
HADOOP-18105. Implement buffer pooling with weak references.
HADOOP-18106. Handle memory fragmentation in S3A Vectored IO.

Contributed By: Owen O'Malley and Mukund Thakur
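
The "separate requests, executed in parallel" pattern from the merge commit message can be sketched with `CompletableFuture`. Everything here is illustrative: `fetch` is a hypothetical stand-in for one ranged HTTP GET and simply copies from an in-memory array, and `readAsync` is an invented helper rather than the S3A client's code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: one asynchronous task per (already coalesced) range.
class ParallelRangeFetchSketch {

  static final class Range {
    final int offset;
    final int length;

    Range(int offset, int length) {
      this.offset = offset;
      this.length = length;
    }
  }

  /** Stand-in for a single ranged GET; assumes the range lies inside the object. */
  static ByteBuffer fetch(byte[] object, Range r) {
    return ByteBuffer.wrap(Arrays.copyOfRange(object, r.offset, r.offset + r.length));
  }

  /** Submits every range to the pool and returns the futures immediately. */
  static List<CompletableFuture<ByteBuffer>> readAsync(
      byte[] object, List<Range> ranges, ExecutorService pool) {
    List<CompletableFuture<ByteBuffer>> futures = new ArrayList<>();
    for (Range r : ranges) {
      futures.add(CompletableFuture.supplyAsync(() -> fetch(object, r), pool));
    }
    return futures;
  }

  public static void main(String[] args) {
    byte[] object = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<CompletableFuture<ByteBuffer>> futures =
          readAsync(object, Arrays.asList(new Range(0, 4), new Range(12, 4)), pool);
      for (CompletableFuture<ByteBuffer> f : futures) {
        // join() blocks only until this particular range has arrived.
        System.out.println(new String(f.join().array(), StandardCharsets.US_ASCII));
      }
    } finally {
      pool.shutdown();
    }
  }
}
```

A caller such as a columnar reader can start decoding the first range as soon as its future completes while the remaining requests are still in flight.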
@steveloughran
Contributor

merged manually, closing work. great work mukund & owen, once we get this picked up it is going to deliver significant speedups. get those apachecon demos ready!
