HDFS-16303. Improve handling of datanode lost while decommissioning #3675
Conversation
Seems the unit tests failed on an unrelated error:
Added "TestRollingUpgrade.testRollback" as a potentially flaky test here: https://issues.apache.org/jira/browse/HDFS-15646 |
Thank you @KevinWikant for your patch. The change mostly looks good to me.
@sodonnel @jojochuang I suppose you have more insights about DN decommission. Would you check this PR?
  protected BlockManager blockManager;
  protected Namesystem namesystem;
  protected DatanodeAdminManager dnAdmin;
  protected Configuration conf;

- protected final Queue<DatanodeDescriptor> pendingNodes = new ArrayDeque<>();
+ private final PriorityQueue<DatanodeDescriptor> pendingNodes = new PriorityQueue<>(
This change increases the time complexity of adding an element from O(1) to O(logN); however, N is bounded by the number of DNs decommissioning (i.e. at most several hundred, even in a large cluster). Therefore I think the performance effect is not critical.
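As a rough illustration of the ordering the new queue provides, here is a self-contained sketch; the Node class, field name, and comparator below are stand-ins for illustration, not the actual DatanodeDescriptor API or the exact comparator used in the patch:

```java
import java.util.PriorityQueue;

public class PendingNodesQueueSketch {
  // Stand-in for DatanodeDescriptor: only the last-heartbeat timestamp matters here.
  static class Node {
    final String name;
    final long lastHeartbeatMillis;
    Node(String name, long lastHeartbeatMillis) {
      this.name = name;
      this.lastHeartbeatMillis = lastHeartbeatMillis;
    }
  }

  public static void main(String[] args) {
    // Order nodes so that the most recently heartbeated (i.e. healthiest) node is
    // polled first; dead nodes with stale heartbeats sink to the back of the queue.
    PriorityQueue<Node> pendingNodes = new PriorityQueue<>(
        (a, b) -> Long.compare(b.lastHeartbeatMillis, a.lastHeartbeatMillis));

    long now = System.currentTimeMillis();
    pendingNodes.add(new Node("dead-dn", now - 3_600_000L)); // stale heartbeat
    pendingNodes.add(new Node("healthy-dn", now));           // fresh heartbeat

    // The healthy node is de-queued (and therefore tracked) before the dead one.
    System.out.println(pendingNodes.poll().name); // healthy-dn
    System.out.println(pendingNodes.poll().name); // dead-dn
  }
}
```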
This is quite a big change. I have a couple of thoughts.

If a node goes dead while decommissioning, would it not be better to just remove it from the decommission monitor rather than keep tracking it there at all? If the node comes alive again, it should be entered back into the monitor. We could either detect it is dead in the monitor and remove it from tracking then, or have the place that logs the mentioned "node is dead while decommission in progress" remove it from the monitor.

The DatanodeAdminBackoffMonitor is probably rarely used, if it is used at all, but I don't think it has a tracking limit at the moment. Perhaps it should have; it was designed to run with less overhead than the default monitor, but perhaps if you decommissioned hundreds of nodes at a time it would struggle, I am not sure.
I agree with this statement & this is essentially the behavior this code change provides, with one caveat: "If the node comes alive again OR if there are no (potentially healthy) nodes in the pendingNodes queue, it will be entered back into the monitor."

Consider the two alternatives: A) never track unhealthy nodes while they remain in the pendingNodes queue, versus B) track unhealthy nodes only when there are free tracking slots (the behavior of this change). Note that neither alternative prevents healthy nodes from being decommissioned.

With Option B), if there are more nodes being decommissioned than can be tracked, any unhealthy nodes being tracked will be removed from the tracked set & re-queued (with a priority lower than all healthy nodes in the priority queue); then the healthy nodes will be de-queued & moved to the tracked set. Once the healthy nodes are decommissioned, the unhealthy nodes can be tracked again.

It may seem Option A) is more performant because it avoids tracking unhealthy nodes each DatanodeAdminMonitor cycle, but this may not be the case, as we would still need to check the health status of all the unhealthy nodes in the pendingNodes queue to determine if they should be tracked again. Furthermore, tracking unhealthy nodes is the current behavior & as far as I know it has not caused any performance problems. With Option B), unhealthy nodes are only tracked (and their health status is only checked) when there are fewer healthy nodes than "dfs.namenode.decommission.max.concurrent.tracked.nodes". This may result in superior performance for Option B), as it only needs to check the health of up to "dfs.namenode.decommission.max.concurrent.tracked.nodes" unhealthy nodes, rather than the health of all nodes in the pendingNodes queue.

Note that the log "node is dead while decommission in progress" occurs from within the DatanodeAdminMonitor:
If a DN goes dead and then re-registers, it will be added back into the pending nodes, so I don't think we need to continue to track it in the decommission monitor when it goes dead. We can just handle the dead event in the decommission monitor and stop tracking it, clearing a slot for another node. Then it will be re-added by the existing logic above if it comes back.
Based on unit testing & code inspection, I think "dfs.namenode.decommission.max.concurrent.tracked.nodes" still applies to the DatanodeAdminBackoffMonitor. From the code, DatanodeAdminBackoffMonitor:
So:
The new unit test "TestDecommissionWithBackoffMonitor.testRequeueUnhealthyDecommissioningNodes" will fail without the changes made to "TestDecommissionWithBackoffMonitor". It fails because the unhealthy nodes have filled up the tracked set (i.e. outOfServiceBlocks) & the healthy nodes are stuck in the pendingNodes queue.
Thanks for sharing this! I will make this change, test it, & get back to you.
Ah yes, the BackoffMonitor calls the method below, which is indeed limited by max tracked nodes:
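For readers without the source handy, a minimal, self-contained sketch of the kind of limiting loop being referenced (stand-in types and field names, paraphrased from this discussion rather than copied from DatanodeAdminBackoffMonitor):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class ProcessPendingNodesSketch {
  // dfs.namenode.decommission.max.concurrent.tracked.nodes (0 would mean "no limit")
  static final int MAX_CONCURRENT_TRACKED_NODES = 100;

  static final Queue<String> pendingNodes = new ArrayDeque<>();
  static final Map<String, Object> outOfServiceNodeBlocks = new HashMap<>();

  // Pull nodes from the pending queue into the tracked set only while the
  // tracking limit has not been reached.
  static void processPendingNodes() {
    while (!pendingNodes.isEmpty()
        && (MAX_CONCURRENT_TRACKED_NODES == 0
            || outOfServiceNodeBlocks.size() < MAX_CONCURRENT_TRACKED_NODES)) {
      outOfServiceNodeBlocks.put(pendingNodes.poll(), null);
    }
  }
}
```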
@sodonnel After going through the re-implementation of the logic, I think I found a caveat worth mentioning. Note the implementation of "isNodeHealthyForDecommissionOrMaintenance":
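A paraphrased sketch of what the discussion describes this method as doing (not the exact BlockManager source; the parameters below are simplifications introduced for illustration):

```java
public class NodeHealthCheckSketch {
  // A dead node can still be considered "healthy enough" to finish decommissioning
  // once nothing remains under-replicated or pending reconstruction.
  static boolean isNodeHealthyForDecommissionOrMaintenance(
      boolean nodeIsAlive, long lowRedundancyBlocks, long pendingReconstructionBlocks) {
    if (nodeIsAlive) {
      return true;
    }
    // Node is dead: only allow the transition to DECOMMISSIONED if there is
    // nothing left to re-replicate.
    return lowRedundancyBlocks == 0 && pendingReconstructionBlocks == 0;
  }

  public static void main(String[] args) {
    System.out.println(isNodeHealthyForDecommissionOrMaintenance(false, 0, 0)); // true: can be DECOMMISSIONED
    System.out.println(isNodeHealthyForDecommissionOrMaintenance(false, 5, 0)); // false: stays DECOMMISSION_INPROGRESS
  }
}
```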
Here's the happy case:
Here's a less happy case:
The problem is that dead datanodes can eventually be DECOMMISSIONED if the namenode eliminates all LowRedundancy blocks (because of the "isNodeHealthyForDecommissionOrMaintenance" logic); but by removing the dead datanode from the DatanodeAdminManager entirely (until it comes back alive), we prevent the dead datanode from transitioning to DECOMMISSIONED even though it could have. So the impact is that a dead datanode which never comes back alive will never transition from DECOMMISSION_INPROGRESS to DECOMMISSIONED, even though it could have.
@sodonnel let me know your thoughts, but I think the problem with removing a dead node from the DatanodeAdminManager until it comes back alive is that it will never be decommissioned if it never comes alive again. Do you see any major downside in keeping the dead nodes in the pendingNodes priority queue behind all the alive nodes? Because of the priority queue ordering, the dead nodes will not block decommissioning of alive nodes.
For me, DECOMMISSION_IN_PROGRESS + DEAD is an error state that means decommission has effectively failed. There is a case where it can complete, but what does that really mean - if the node is dead, it has not been gracefully stopped. If it wasn't for the way decommission is triggered using the hosts files, I would suggest switching it back to IN_SERVICE + DEAD and letting it be treated like any other dead host.

If you have some monitoring tool tracking the decommission and it sees "DECOMMISSIONED", then it assumes the decommission went fine. If it sees DECOMMISSION_IN_PROGRESS + DEAD, then it's a flag that the admin needs to go look into it, as it should not have happened - perhaps they need to bring the node back, or conclude that the cluster is still OK without it (no missing blocks), add it to the exclude list, and forget about it.

My feeling is that the priority queue idea adds some more complexity to an already hard-to-follow process / code area, and I wonder if it is better to just remove the node from the monitor and let it be dealt with manually, which may be required a lot of the time anyway?
Having faced the similar situation of
The case which I have described where dead node decommissioning completes can occur when:
In this case, the node did go dead while decommissioning, but there were no LowRedundancy blocks thanks to redundant block replicas. From the user perspective, the loss of the decommissioning node did not impact the outcome of the decommissioning process. Had the node not gone dead while decommissioning, the eventual outcome would be the same in that the node is decommissioned & there are no LowRedundancy blocks.

If there are LowRedundancy blocks, then a dead datanode will remain decommissioning, because if the dead node were to come alive again it may be able to recover the LowRedundancy blocks. But if there are no LowRedundancy blocks, then when the node comes alive again it will immediately transition to decommissioned anyway, so why not make it decommissioned while it's still dead?

Also, I don't think the priority queue is adding much complexity; it's just putting healthy nodes (with more recent heartbeat times) ahead of unhealthy nodes (with older heartbeat times), such that healthy nodes are decommissioned first.

I also want to call out another caveat with the approach of removing the node from the DatanodeAdminManager, which I uncovered while unit testing. If we leave the node in DECOMMISSION_IN_PROGRESS & remove the node from DatanodeAdminManager, then the following callstack should re-add the datanode to the DatanodeAdminManager when it comes alive again:
The problem is this condition: "!node.isDecommissionInProgress()" (Line 177 in 62c86ea).
Because the dead datanode is left in DECOMMISSION_INPROGRESS, "startTrackingNode" is not invoked because of the "!node.isDecommissionInProgress()" condition. Simply removing the condition "!node.isDecommissionInProgress()" will not work well, because "startTrackingNode" is not idempotent:
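To make both points concrete, here is a minimal, self-contained sketch (stand-in types and method bodies, assuming, as discussed above, that the re-registration path eventually reaches startDecommission; this is not the upstream DatanodeAdminManager code):

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class StartDecommissionGuardSketch {
  enum AdminState { IN_SERVICE, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

  static class Node {
    AdminState state = AdminState.IN_SERVICE;
    boolean isDecommissionInProgress() { return state == AdminState.DECOMMISSION_INPROGRESS; }
    boolean isDecommissioned() { return state == AdminState.DECOMMISSIONED; }
  }

  static final Queue<Node> pendingNodes = new ArrayDeque<>();

  static void startDecommission(Node node) {
    // (1) The guard: a dead node that re-registers while still DECOMMISSION_INPROGRESS
    // falls into the "else" branch and never reaches startTrackingNode().
    if (!node.isDecommissionInProgress() && !node.isDecommissioned()) {
      node.state = AdminState.DECOMMISSION_INPROGRESS;
      startTrackingNode(node);
    }
  }

  static void startTrackingNode(Node node) {
    // (2) No membership check: calling this twice for the same node would double-add it,
    // which is why simply dropping the guard above is not safe.
    pendingNodes.add(node);
  }

  public static void main(String[] args) {
    Node deadThenRestarted = new Node();
    deadThenRestarted.state = AdminState.DECOMMISSION_INPROGRESS;
    startDecommission(deadThenRestarted);
    System.out.println("pending after re-registration: " + pendingNodes.size()); // 0
  }
}
```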
I can think of 2 obvious ways to handle this: A) check whether the node is already tracked before re-adding it, or B) only re-track the node when the registration comes from a restarted datanode.

For "A)", the challenge is that we need to ensure the DatanodeDescriptor is not in "pendingReplication" or "outOfServiceBlocks", which could be a fairly costly call to execute repeatedly. Also, I am not even sure such a check is thread-safe, given there is no locking used as part of "startDecommission" or "startTrackingNode".

For "B)", the awareness of whether a registerDatanode call is related to a restarted datanode is available here. So this information would need to be passed down the callstack to a method "startDecommission(DatanodeDescriptor node, boolean isNodeRestart)". Because of the modified method signature, all the other invocations of "startDecommission" would need to specify isNodeRestart=false.

Given this additional hurdle in the approach of removing a dead datanode from the DatanodeAdminManager, are we sure it will be less complex than the proposed change? In short:
@sodonnel @virajjasani @aajisaka let me know your thoughts. I am still more in favor of tracking dead datanodes in DatanodeAdminManager (when there are LowRedundancy blocks), but if the community thinks it's better to remove the dead datanodes from DatanodeAdminManager, I can implement proposal "C)".
I have implemented the alternative where dead DECOMMISSION_INPROGRESS nodes are removed from the DatanodeAdminManager immediately: #3746. I created a separate PR so that the 2 alternatives can be compared; the diff for each alternative is quite different. @sodonnel @virajjasani @aajisaka let me know your thoughts.
@sodonnel The existing test "TestDecommissioningStatus.testDecommissionStatusAfterDNRestart" will be problematic for the proposed alternative of removing a dead DECOMMISSION_INPROGRESS node from the DatanodeAdminManager: #3746. As previously stated, removing the dead DECOMMISSION_INPROGRESS node from the DatanodeAdminManager means that when there are no LowRedundancy blocks, the dead node will remain in DECOMMISSION_INPROGRESS rather than transitioning to DECOMMISSIONED. This violates the expectation that the unit test is enforcing, which is that a dead DECOMMISSION_INPROGRESS node should transition to DECOMMISSIONED when there are no LowRedundancy blocks:
Line 451 in 6342d5e
Therefore, I think this is a good argument to remain in favor of the original proposed change.
I would also add that if you look at the implementation of the proposed alternative of removing a dead DECOMMISSION_INPROGRESS node from the DatanodeAdminManager (https://github.com/apache/hadoop/pull/3746/files), it is not any less complex than this change, due to the aforementioned caveats that need to be dealt with.
Thank you @KevinWikant @sodonnel @virajjasani for your discussion, and thanks @KevinWikant again for the deep dive. I'm in favor of the approach in this PR.
The expectation was added by HDFS-7374. This JIRA allowed decommissioning of dead DataNodes, and the feature has been enabled for a long time, since Hadoop 2.7.0. I don't want to break the expectation because breaking it would surprise administrators.
ping @sodonnel let me know your thoughts.
Thank you @KevinWikant for your update. I reviewed your patch again and added some comments.
Unit testing failed due to unrelated flaky tests
+1, I'll commit tomorrow if there is no objection.
Sorry, I could not get to this PR last week. I will review later this week, but I don't mean to block this work. If I find something odd or something as an improvement over this, we can anyways get it clarified later on the PR/Jira or create an addendum PR later.

With a quick glance, just one question for now: overall it seems the goal is to improve and continue the decommissioning of healthy nodes over unhealthy ones (by removing and then re-queueing the entries). Hence, if a few nodes are really in a bad state (hardware/network issues), the plan is to keep re-queueing them until more nodes are getting decommissioned than max tracked nodes, right? Since an unhealthy node getting decommissioned might anyways require some sort of retry, shall we requeue them even if the condition is not met (i.e. total no. of decommissions in progress < max tracked nodes) as limited retries? I am just thinking at a high level, yet to catch up with the PR.

Also, good to know HDFS-7374 is not broken.
Yeah, this test failure is not relevant. Even after the recent attempt, it is still flaky; we might require better insights for this test.
@virajjasani, please see my response to your comments below
It's the opposite: the unhealthy nodes will only be re-queued when there are more nodes being decommissioned than max tracked nodes. Otherwise, if there are fewer nodes being decommissioned than max tracked nodes, then the unhealthy nodes will not be re-queued, because they do not risk blocking the decommissioning of queued healthy nodes (i.e. because the queue is empty).

One potential performance impact that comes to mind is that if there are, say, 200 unhealthy decommissioning nodes & max tracked nodes = 100, then this may cause some churn in the queueing/de-queueing process, because each DatanodeAdminMonitor tick all 100 tracked nodes will be re-queued & then 100 queued nodes will be de-queued/tracked. Note that this churn (and any associated performance impact) will only take effect when:
The amount of re-queued/de-queued nodes per tick can be quantified as:
This churn of queueing/de-queueing will not occur at all under typical decommissioning scenarios (i.e. where there isn't a large number of dead decommissioning nodes). One idea to mitigate this is to have DatanodeAdminMonitor maintain counters tracking the number of healthy nodes in the pendingNodes queue; this count could then be used to make an improved re-queue decision. In particular, unhealthy nodes would only be re-queued if there are healthy nodes in the pendingNodes queue. But this approach has some flaws; for example, an unhealthy node in the queue could come alive again, but then an unhealthy node in the tracked set wouldn't be re-queued because the healthy queued node count hasn't been updated. To solve this, we would need to scan the pendingNodes queue periodically to update the healthy/unhealthy node counts, and this scan could prove expensive.
If there are fewer nodes being decommissioned than max tracked nodes, then there are no nodes in the pendingNodes queue & all nodes are being tracked for decommissioning. Therefore, there is no possibility that any healthy nodes are blocked in the pendingNodes queue (preventing them from being decommissioned), & so in my opinion there is no benefit to re-queueing the unhealthy nodes in this case. Furthermore, this would negatively impact performance through frequent re-queueing & de-queueing.
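A minimal, self-contained sketch of the re-queue decision described above (the names and collections are stand-ins introduced for illustration, not the actual DatanodeAdminMonitor fields; in the real change the pending queue is the priority queue, so re-queued dead nodes land behind healthy ones):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class RequeueDecisionSketch {
  static class Node {
    final String name;
    final boolean healthy;
    Node(String name, boolean healthy) { this.name = name; this.healthy = healthy; }
  }

  static final int MAX_TRACKED = 2;
  static final Queue<Node> pendingNodes = new ArrayDeque<>();
  static final List<Node> trackedNodes = new ArrayList<>();

  static void monitorTick() {
    // Only re-queue unhealthy tracked nodes when other nodes are waiting for a slot,
    // i.e. when more nodes are decommissioning than the tracking limit allows.
    if (!pendingNodes.isEmpty()) {
      trackedNodes.removeIf(n -> {
        if (!n.healthy) {
          pendingNodes.add(n); // behind healthier nodes in the real priority queue
          return true;
        }
        return false;
      });
    }
    // Fill freed slots from the front of the queue.
    while (trackedNodes.size() < MAX_TRACKED && !pendingNodes.isEmpty()) {
      trackedNodes.add(pendingNodes.poll());
    }
  }

  public static void main(String[] args) {
    trackedNodes.add(new Node("dead-dn-1", false));
    trackedNodes.add(new Node("dead-dn-2", false));
    pendingNodes.add(new Node("healthy-dn", true));
    monitorTick();
    trackedNodes.forEach(n -> System.out.println("tracked: " + n.name));
    pendingNodes.forEach(n -> System.out.println("pending: " + n.name));
  }
}
```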
Yes, makes sense. Thanks.
Left a few minor comments, looks good overall. Thanks @KevinWikant
+1 (non-binding), pending QA
🎊 +1 overall
This message was automatically generated.
Thanks @aajisaka & @virajjasani for the PR approvals! Are we good to merge now? |
Merged. Thank you @KevinWikant @virajjasani @sodonnel.
…pache#3675)
Co-authored-by: Kevin Wikant <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>
Description of PR
Fixes a bug in Hadoop HDFS where if more than "dfs.namenode.decommission.max.concurrent.tracked.nodes" datanodes are lost while in state decommissioning, then all forward progress towards decommissioning any datanodes (including healthy datanodes) is blocked
JIRA: https://issues.apache.org/jira/browse/HDFS-16303
How was this patch tested?
Unit Testing
Added new unit tests:
All "TestDecommission", "TestDecommissionWithBackoffMonitor", & "DatanodeAdminMonitorBase" tests pass when run locally
Note that without the "DatanodeAdminManager" changes the new test "testRequeueUnhealthyDecommissioningNodes" fails because it times out waiting for the healthy nodes to be decommissioned
Manual Testing
For code changes:
Have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?