HADOOP-18103. High performance vectored read API in Hadoop #4476
Part of HADOOP-18103. Add support for a multiple-range vectored read API in PositionedReadable. The default implementation iterates through the ranges and reads each one synchronously, but the intent is that FSDataInputStream subclasses can provide more efficient readers, especially in object store implementations. Also adds an implementation in S3A in which smaller ranges are merged and sliced byte buffers are returned to the readers. All merged ranges are fetched from S3 asynchronously. Contributed By: Owen O'Malley and Mukund Thakur
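A minimal sketch of the "default" behaviour described above: ranges are read one after another, and each range's result is delivered through a future. This uses only the Java standard library; the `Range` class, the in-memory `byte[]` "file", and the `readVectored` helper are hypothetical stand-ins for illustration and are not the classes added by this patch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

public class SequentialVectoredRead {
    /** Hypothetical stand-in for a file range; not a Hadoop class. */
    static final class Range {
        final long offset;
        final int length;
        final CompletableFuture<ByteBuffer> data = new CompletableFuture<>();

        Range(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Reads each range in turn from an in-memory "file" and completes its future. */
    static void readVectored(byte[] file, List<Range> ranges,
                             IntFunction<ByteBuffer> allocate) {
        for (Range r : ranges) {
            try {
                if (r.offset < 0 || r.offset + r.length > file.length) {
                    throw new IllegalArgumentException("range outside file: " + r.offset);
                }
                // The caller chooses the buffer type (heap or direct) via the allocator.
                ByteBuffer buf = allocate.apply(r.length);
                buf.put(file, (int) r.offset, r.length);
                buf.flip();
                r.data.complete(buf);
            } catch (RuntimeException e) {
                // A failure affects only this range; later ranges are still attempted.
                r.data.completeExceptionally(e);
            }
        }
    }

    public static void main(String[] args) {
        byte[] file = "0123456789abcdefghij".getBytes(StandardCharsets.US_ASCII);
        List<Range> ranges = List.of(new Range(2, 3), new Range(10, 4));
        readVectored(file, ranges, ByteBuffer::allocate);
        for (Range r : ranges) {
            ByteBuffer b = r.data.join();
            byte[] out = new byte[b.remaining()];
            b.get(out);
            // prints "2 -> 234" then "10 -> abcd"
            System.out.println(r.offset + " -> " + new String(out, StandardCharsets.US_ASCII));
        }
    }
}
```

Returning a future per range is what lets a smarter implementation complete the ranges out of order or in parallel without changing the caller's code.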
… maxReadSizeForVectorReads (#3964) Part of HADOOP-18103. Introduces fs.s3a.vectored.read.min.seek.size and fs.s3a.vectored.read.max.merged.size to configure the minimum seek and the maximum read size during a vectored IO operation in the S3A connector. These properties define how ranges are merged. To disable merging completely, set fs.s3a.vectored.read.max.merged.size to 0. Contributed By: Mukund Thakur
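To make the effect of the two settings concrete, here is a rough sketch of one way ranges could be coalesced under a minimum-seek and a maximum-merged-size limit. It is an illustration under stated assumptions, not the S3A code: the class name, the `{offset, length}` array representation, and the exact comparison rules (`<` for the gap, `<=` for the size) are all assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RangeMergeSketch {
    /**
     * Merges {offset, length} ranges into {start, end} spans (end exclusive).
     * A range joins the current span when the gap to it is smaller than minSeek
     * and the resulting span is no larger than maxSize.
     */
    static List<long[]> merge(List<long[]> ranges, long minSeek, long maxSize) {
        List<long[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong((long[] r) -> r[0]));

        List<long[]> merged = new ArrayList<>();
        long[] current = null;
        for (long[] r : sorted) {
            long start = r[0];
            long end = r[0] + r[1];
            boolean fits = current != null
                    && start - current[1] < minSeek
                    && Math.max(end, current[1]) - current[0] <= maxSize;
            if (fits) {
                current[1] = Math.max(current[1], end);
            } else {
                current = new long[] {start, end};
                merged.add(current);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<long[]> ranges = List.of(
                new long[] {0, 100}, new long[] {150, 100}, new long[] {5000, 100});
        // minSeek = 1024 bytes, maxSize = 1,000,000 bytes (illustrative values)
        for (long[] span : merge(ranges, 1024, 1_000_000)) {
            // prints "[0, 250)" then "[5000, 5100)"
            System.out.println("[" + span[0] + ", " + span[1] + ")");
        }
        // With maxSize = 0 no non-empty range can join a span, so nothing is merged.
        System.out.println(merge(ranges, 1024, 0).size()); // prints 3
    }
}
```

In this sketch a larger minimum seek trades extra bytes read for fewer requests, while the maximum merged size caps how much unwanted data any single request can pull in.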
part of HADOOP-18103. Contributed By: Mukund Thakur
Part of HADOOP-18103. Required for the vectored IO feature. None of the current buffer pool implementations is complete: ElasticByteBufferPool doesn't use weak references and could lead to memory leaks, and DirectBufferPool doesn't support caller preferences for direct or heap buffers and only handles fixed-length buffers. Contributed By: Mukund Thakur
Part of HADOOP-18103. Handles memory fragmentation in the S3A vectored IO implementation by allocating smaller buffers, each the size of a user-requested range, filling them directly from the remote S3 stream, and skipping the unwanted data between ranges. This patch also aborts active vectored reads when the stream is closed or unbuffer() is called. Contributed By: Mukund Thakur
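The idea of filling per-range buffers straight from one stream can be sketched as follows. This is a simplified illustration, not the patch's code: it assumes the ranges are sorted and non-overlapping, uses a `ByteArrayInputStream` in place of the remote S3 stream, and the `fill` helper and class name are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class FillFromStreamSketch {
    /**
     * Reads each {offset, length} range from a stream that starts at streamStart,
     * allocating one exactly-sized buffer per range and skipping the bytes between
     * ranges instead of buffering them.
     */
    static List<ByteBuffer> fill(InputStream in, long streamStart, long[][] ranges)
            throws IOException {
        List<ByteBuffer> out = new ArrayList<>();
        long pos = streamStart;
        for (long[] r : ranges) {
            long gap = r[0] - pos;
            if (gap < 0) {
                throw new IllegalArgumentException("ranges must be sorted and disjoint");
            }
            while (gap > 0) {
                long skipped = in.skip(gap);
                if (skipped <= 0) {
                    // skip() may return 0 without being at EOF; fall back to read().
                    if (in.read() < 0) {
                        throw new EOFException("stream ended inside a gap");
                    }
                    skipped = 1;
                }
                gap -= skipped;
            }
            byte[] buf = new byte[(int) r[1]];
            int n = in.readNBytes(buf, 0, buf.length);
            if (n < buf.length) {
                throw new EOFException("stream ended inside a range");
            }
            out.add(ByteBuffer.wrap(buf));
            pos = r[0] + r[1];
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "0123456789abcdefghij".getBytes(StandardCharsets.US_ASCII);
        long[][] ranges = {{2, 3}, {10, 4}};
        for (ByteBuffer b : fill(new ByteArrayInputStream(data), 0, ranges)) {
            // prints "234" then "abcd"
            System.out.println(new String(b.array(), StandardCharsets.US_ASCII));
        }
    }
}
```

Compared with reading the whole merged span into one large buffer and slicing it, this keeps each allocation no bigger than what the caller asked for, so a small range never pins a large buffer in memory.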
@steveloughran, @mehakmeet So will go ahead and merge this if Yetus is okay?
💔 -1 overall
This message was automatically generated.
As discussed in a call, I think this should be merged as a merge commit of the chain, rather than through the GitHub IDE. I will help with this; it can be done through the CLI or Sourcetree.
Test failure hadoop.mapred.TestLocalDistributedCacheManager is mine; TestHDFSCLI is known.
The other test failures seem unrelated to this.
This feature adds methods for ranged vectored read operations in PositionedReadable. All streams which implement that interface support the new API.

The default implementation reads each range in the vector sequentially. However, specific implementations may provide higher-performance versions. This is done in two places:
* Local FileSystem/Checksum FileSystem
* The S3A client.

The S3A client first coalesces adjacent and "nearby" ranges together, then fetches each merged range in a separate HTTP GET request, executed in parallel. As such it delivers significant speedups to applications reading separate blocks of data from the same file, columnar data format libraries in particular.

This is the merge commit of the feature branch; the work is in:
* HADOOP-11867. Add a high-performance vectored read API.
* HADOOP-18104. S3A: Add configs to configure minSeekForVectorReads and maxReadSizeForVectorReads.
* HADOOP-18107. Adding scale test for vectored reads for large file
* HADOOP-18105. Implement buffer pooling with weak references.
* HADOOP-18106. Handle memory fragmentation in S3A Vectored IO.

Contributed By: Owen O'Malley and Mukund Thakur
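As a rough illustration of the "fetch each range in parallel" step, the sketch below submits one task per already-merged span to a thread pool and collects the results through futures. Everything here is an assumption made for the example: `fetch` copies from an in-memory array instead of issuing an HTTP GET, and the pool size and class name are arbitrary.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelFetchSketch {
    /** Stand-in for one ranged GET: copies [start, end) out of an in-memory object. */
    static byte[] fetch(byte[] object, int start, int end) {
        if (start < 0 || end > object.length || start > end) {
            throw new IllegalArgumentException("span outside object");
        }
        return Arrays.copyOfRange(object, start, end);
    }

    /** Starts one asynchronous fetch per {start, end} span and returns the futures in order. */
    static List<CompletableFuture<byte[]>> fetchAll(byte[] object, int[][] spans,
                                                    ExecutorService pool) {
        List<CompletableFuture<byte[]>> futures = new ArrayList<>();
        for (int[] s : spans) {
            futures.add(CompletableFuture.supplyAsync(() -> fetch(object, s[0], s[1]), pool));
        }
        return futures;
    }

    public static void main(String[] args) {
        byte[] object = "0123456789abcdefghij".getBytes(StandardCharsets.US_ASCII);
        int[][] spans = {{0, 5}, {10, 14}};
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            for (CompletableFuture<byte[]> f : fetchAll(object, spans, pool)) {
                // prints "01234" then "abcd"; join() waits for each fetch in list order
                System.out.println(new String(f.join(), StandardCharsets.US_ASCII));
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the futures are returned immediately, a caller can start processing the first block while later blocks are still in flight.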
merged manually, closing work. great work mukund & owen, once we get this picked up it is going to deliver significant speedups. get those apachecon demos ready!
Description of PR
Add support for a multiple-range vectored read API in PositionedReadable. The default implementation iterates through the ranges and reads each one synchronously, but the intent is that FSDataInputStream subclasses can provide more efficient readers, especially in object store implementations.
How was this patch tested?
Added new tests. Ran older test suites in ap-south-1. All good.