HADOOP-18221. Drains stream async before closing #4294
Conversation
This is the initial merge of the HADOOP-18028 S3A performance input stream. This patch on its own is incomplete and must be accompanied by all other commits with HADOOP-18028 in their git commit message. Consult the JIRA for that list. Contributed by Bhalchandra Pandit.
…3A prefetching stream (apache#4115) Contributed by PJ Fanning.
Contributed by Ahmar Suhail
…ache#4212) Contributed by Monthon Klongklaew
💔 -1 overall
This message was automatically generated.
Looks good to me!
@ahmarsuhail In hadoop/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java Lines 612 to 623 in e0cd0a8
Can we create a JIRA to add it after rebase?
Thanks danny, I've created https://issues.apache.org/jira/browse/HADOOP-18230
💔 -1 overall
This message was automatically generated.
@steveloughran this is ready for review now, would you be able to take a look?
look at the changes in hadoop trunk s3a input stream in #2584 as a basis for this work. (i plan to rebase this branch this week, so you will have merge problems...sorry)
async draining does deliver speedups, but only if the amount of data to be read is "large enough". for small amounts of data, synchronous draining is lower overhead and guarantees the active http connection can be reused.
when i do the merge there will be an async drain threshold for this.
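A rough sketch of the threshold logic being described, with illustrative names (asyncDrainThreshold, drain() and futurePool here are assumptions, not the actual trunk code):

// sketch only: pick sync vs async drain by how many bytes remain.
// asyncDrainThreshold, drain() and futurePool are illustrative names.
private void closeStream(long remaining) {
  if (remaining <= asyncDrainThreshold) {
    // little data left: draining inline is cheap and guarantees
    // the http connection goes straight back to the pool
    drain(remaining);
  } else {
    // lots of data left: hand the drain off so close() returns fast
    futurePool.submit(() -> drain(remaining));
  }
}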
Io.closeIgnoringIoException(this.inputStream);
Io.closeIgnoringIoException(this.obj);
} catch (Exception e) {
if this happens then the read raised an exception. the stream MUST be aborted to stop it being returned to the http connection pool, as its connection is probably broken
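For illustration, a minimal shape of that rule (abort() is the AWS SDK v1 S3ObjectInputStream method; the other names are assumptions):

try {
  // read to end of stream; helper name is hypothetical
  drainByReadingToEnd(inputStream);
  // clean drain: safe to close and return the connection to the pool
  inputStream.close();
} catch (Exception e) {
  // the drain read failed, so the connection state is unknown;
  // abort() discards the connection instead of recycling it
  inputStream.abort();
}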
public void run() {
  try {
    while (this.inputStream.read() >= 0) {
look at the changes in hadoop trunk s3a input stream here...it reads into a buffer for draining, and is marginally faster
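Something along these lines, sketched (the 16 KB buffer size is arbitrary):

// drain in blocks rather than byte-at-a-time: read(byte[]) amortises
// the per-call overhead, so reaching end-of-stream is marginally faster
byte[] drainBuffer = new byte[16 * 1024];
while (inputStream.read(drainBuffer) >= 0) {
  // data is discarded; the only goal is to hit end of stream
}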
/**
 * Drain task that is submitted to the future pool.
 */
private static class DrainTask implements Runnable {
- declare final to keep style checker happy
- i'd prefer not to use Runnable, instead completable futures; look at drainOrAbortHttpStream() and its use, at which point you can just pass in a function (roughly the shape sketched below)
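Roughly the pattern being suggested, sketched with assumed signatures (drainOrAbortHttpStream() is named above, but its parameters here are guesses):

import java.util.concurrent.CompletableFuture;

// no named DrainTask class: submit a lambda and keep a future that
// callers (or tests) can block on or chain further work onto.
// assumes futurePool implements java.util.concurrent.Executor
CompletableFuture<Void> drainFuture =
    CompletableFuture.runAsync(
        () -> drainOrAbortHttpStream(shouldAbort, inputStream),
        futurePool);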
@@ -98,6 +106,7 @@ public S3File(
this.streamStatistics = streamStatistics;
this.changeTracker = changeTracker;
this.s3Objects = new IdentityHashMap<InputStream, S3Object>();
this.futurePool = context.getFuturePool();
if this is only for the drain, given the context is already stored, you can just get the pool when needed
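i.e. something like this at the single drain call site (context.getFuturePool() appears in the diff above; the submit call and DrainTask's constructor arguments are guesses):

// no futurePool field: the context is already stored, so fetch
// the pool only where the drain is actually submitted
context.getFuturePool().submit(new DrainTask(inputStream, obj));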
Description of PR
If we close the prefetching input stream before the prefetched blocks have finished reading from the S3 input stream, the SDK repeatedly complains "Not all bytes were read from the S3ObjectInputStream". This happened on S3AInputStream as well; see this issue for more details. Closing the stream before draining it will abort the connection, so to allow for connection reuse we drain it asynchronously.
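A hedged sketch of that close path (DrainTask is the class added in this PR; remaining() and the constructor arguments are illustrative):

@Override
public void close() throws IOException {
  if (remaining() > 0) {
    // bytes still unread: drain off the calling thread so close()
    // stays cheap and the http connection can be reused later
    futurePool.submit(new DrainTask(inputStream, obj));
  } else {
    // already at end of stream: a plain close recycles the connection
    inputStream.close();
  }
}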
How was this patch tested?
Tested in eu-west-1 by running
mvn -Dparallel-tests -DtestsThreadCount=16 clean verify