-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDFS-16896 clear ignoredNodes list when we clear deadnode list on ref… #5322
Conversation
…etchLocations. ignoredNodes list is only used on hedged read codepath
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change generally looks good to me.
One concern is that when refetchLocation
is called inside hedgedFetchBlockByteRange
we could remove a node from the ignoredList that is already part of the futures array. This would lead to multiple reads to the same node.
What I'm thinking of is
- Node A is added to futures array
- getFirstToComplete throws an InterruptedException when doing hedgedService.take()
- We call retchLocation, which remove node A from ignored list.
- The while loop re-adds Node A to the futures list.
I don't know if this actually can happen. Even if it is technically possible, it may not be an issue. Thoughts?
I ran each of the failing unit test classes in Intellij individually.
They all passed with your PR applied. So the failures are just test flakiness. |
…request to same node
yes @simbadzina I agree this was an issue. I've resolved this now. As you pointed out the future is removed in |
💔 -1 overall
This message was automatically generated. |
// if refetch is true then all nodes are in deadlist or ignorelist | ||
// we should loop through all futures and remove them so we do not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add punctuation here and start new sentences with caps. That will make the comment easier to follow.
What is the deadlist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed comments. deadlist is actually deadNodes (I fixed that comment as well.)
When connections fail (in both hedged and non hedged code path) nodes are added to the deadNodes collection to try other nodes. Once chooseDataNode
returns null
(or more accurately getBestNodeDNAddrPair
) it calls refetchLocations
which clears the deadNodes clearLocalDeadNodes()
and now with my change, also clears the ignore list.
Note we have added an assumption to this method refetchLocations
. The comment I added to refetchLocations
/**
* RefetchLocations should only be called when there are no active requests
* to datanodes. In the hedged read case this means futures should be empty
*/
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks good to me. The failing tests in TestAuditLogs.java pass locally for me.
/** | ||
* Clear list of ignored nodes used for hedged reads. | ||
*/ | ||
private void clearIgnoredNodes(Collection<DatanodeInfo> ignoredNodes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit. Param documentation.
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java
Show resolved
Hide resolved
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java
Outdated
Show resolved
Hide resolved
// If refetch is true, then all nodes are in deadNodes or ignoredNodes. | ||
// We should loop through all futures and remove them, so we do not | ||
// have concurrent requests to the same node. | ||
// Once all futures are cleared, we can clear the ignoredNodes and retry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignored and dead nodes are cleared, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the thing i am trying to emphasize is the && futures.isEmpty()
check which is specific to how ignored nodes is cleared
assertEquals(3, input.getHedgedReadOpsLoopNumForTesting()); | ||
// The result of 9 is due to 2 blocks by 4 iterations plus one because | ||
// hedgedReadOpsLoopNumForTesting is incremented at start of the loop. | ||
assertEquals(9, input.getHedgedReadOpsLoopNumForTesting()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are tripling the IO per hedged request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are actually 4x'ing, the comment was meant to help clarify.
If you recall the issue was we only previously tried each block once, the change is to make hedged reads follow the same number of retires as non hedged reads which has 3 retry loops.
This example has 2 blocks, and the last loop is when it exists.
Previously 2+1 and now 8+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is pretty close. (Ugh is the hedged read code you are fixing ugly. :frown:)
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java
Outdated
Show resolved
Hide resolved
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 LGTM
apache#5322) HDFS-16896 clear ignoredNodes list when we clear deadnode list on refetchLocations. ignoredNodes list is only used on hedged read codepath Co-authored-by: Tom McCormick <[email protected]>
#5322) (#5444) Cherry picked from: 162288b Co-authored-by: Tom McCormick <[email protected]>
💔 -1 overall
This message was automatically generated. |
apache#5322) HDFS-16896 clear ignoredNodes list when we clear deadnode list on refetchLocations. ignoredNodes list is only used on hedged read codepath Co-authored-by: Tom McCormick <[email protected]>
…adnode list on ref… (apache#5322) (apache#5444) Cherry picked from: 162288b Co-authored-by: Tom McCormick <[email protected]>
apache#5322) (apache#5444) (apache#99) Cherry picked from: 162288b Co-authored-by: Tom McCormick <[email protected]> ACLOVERRIDE for oss email Co-authored-by: Tom <[email protected]>
…etchLocations. ignoredNodes list is only used on hedged read codepath
Description of PR
clear ignoredNodes list when we clear deadnode list on refetchLocations. ignoredNodes list is only used on hedged read codepath
How was this patch tested?
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?