-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDFS-15413. add dfs.client.read.striped.datanode.max.attempts to fix read ecfile timeout #5829
base: trunk
Are you sure you want to change the base?
Conversation
💔 -1 overall
This message was automatically generated. |
cc @zhangshuyan0 Would you mind to take a review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should add tests around this, which can reproduce these issues, maybe by setting a lower value for socket timeout.
Should cover scenarios, where
- Connection to DN containing DataBlock is established.
- Connection to DN containing ParityBlock is established.
- When there are missing/lost nodes in the pipeline
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java
Outdated
Show resolved
Hide resolved
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java
Outdated
Show resolved
Hide resolved
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java
Outdated
Show resolved
Hide resolved
Please also check the checkstyle and blannks reported by Yetus. Thanks. @Neilxzn |
Fix these checkstyle and add unit test. Please review it again. Thanks |
💔 -1 overall
This message was automatically generated. |
@ayushtkn @zhangshuyan0 looks like the remaining failing checks are unrelated, and the feedback was addressed. Any chance for another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this test is reproducing the issue for me, I reverted the changes in StripeReader and ran the test & it still passed.
If that gets sorted, We should add a test, where one DN is dead, like same test, but kill a DN
...t/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSStripedInputStreamWithTimeout.java
Show resolved
Hide resolved
...t/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSStripedInputStreamWithTimeout.java
Outdated
Show resolved
Hide resolved
Hi @Neilxzn Any progress here? Thanks. |
Hi @Neilxzn , any chance you have time to finish this up? |
Sorry for the late reply. I have been busy with other things recently. I will try to submit a new unit test tomorrow. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
I can pass the unit test hadoop.hdfs.TestDFSStripedInputStreamWithTimeout in my local development environment, but it fails on GitHub Jenkins. |
@Neilxzn I tried & it fails locally
To reproduce: |
Thank you. I will check it again soon. |
@haiyang1987 Our online environment (70 PB EC Data cluster, spark + hive olap) has already applied this patch. So far, everything is running normally. |
Noted, thanks for you work ~ |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java
Outdated
Show resolved
Hide resolved
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java
Outdated
Show resolved
Hide resolved
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @haiyang1987 said, we should make sure pread
work well after retrying.
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java
Outdated
Show resolved
Hide resolved
@Neilxzn Hi, this patch is very useful, would you mind further fixing this PR? |
Sorry for my late reply. I have updated the patch based on the suggestions above. Please review it again. @haiyang1987 @zhangshuyan0 |
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java
Outdated
Show resolved
Hide resolved
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java
Outdated
Show resolved
Hide resolved
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java
Outdated
Show resolved
Hide resolved
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java
Outdated
Show resolved
Hide resolved
💔 -1 overall
This message was automatically generated. |
The UT |
💔 -1 overall
This message was automatically generated. |
add Datanode$closeDataXceiverServer method to close connnect for testing. |
💔 -1 overall
This message was automatically generated. |
please fix checkstyle, thanks~ |
Should we suppress this checkstyle warning? Or are there any better suggestions? |
🎊 +1 overall
This message was automatically generated. |
I believe we've started encountering this issue as well, would be great to get this in |
} catch (IOException e) { | ||
//Clear buffer to make next decode success | ||
strategy.getReadBuffer().clear(); | ||
if (curAttempts < readDNMaxAttempts) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may have been different in the initial implementation, but on the first pass through the while (true)
, curAttemtps
will already be 1 so there will be no retries by default I think?
Description of PR
https://issues.apache.org/jira/browse/HDFS-15413
Offer a available patch to fix HDFS-15413. This patch add dfs.client.read.striped.datanode.max.attempts config to allow users to adjust the number of dn retries to solve the problem of Datanode timeout when reading EC files.
How was this patch tested?
no add test. just test in our cluster
For code changes:
add dfs.client.read.striped.datanode.max.attempts config