
HDFS-16303. Improve handling of datanode lost while decommissioning #3675

Merged: 5 commits into apache:trunk on Dec 23, 2021

Conversation

@KevinWikant (Contributor)

Description of PR

Fixes a bug in Hadoop HDFS where, if more than "dfs.namenode.decommission.max.concurrent.tracked.nodes" datanodes are lost while in state decommissioning, all forward progress towards decommissioning any datanodes (including healthy datanodes) is blocked.

JIRA: https://issues.apache.org/jira/browse/HDFS-16303

How was this patch tested?

Unit Testing

Added new unit tests:

  • TestDecommission.testRequeueUnhealthyDecommissioningNodes (& TestDecommissionWithBackoffMonitor.testRequeueUnhealthyDecommissioningNodes)
  • DatanodeAdminMonitorBase.testPendingNodesQueueOrdering
  • DatanodeAdminMonitorBase.testPendingNodesQueueReverseOrdering

All "TestDecommission", "TestDecommissionWithBackoffMonitor", & "DatanodeAdminMonitorBase" tests pass when run locally

Note that without the "DatanodeAdminManager" changes the new test "testRequeueUnhealthyDecommissioningNodes" fails because it times out waiting for the healthy nodes to be decommissioned

> mvn -Dtest=TestDecommission#testRequeueUnhealthyDecommissioningNodes test
...
[ERROR] Errors: 
[ERROR]   TestDecommission.testRequeueUnhealthyDecommissioningNodes:1776 » Timeout Timed...
> mvn -Dtest=TestDecommissionWithBackoffMonitor#testRequeueUnhealthyDecommissioningNodes test
...
[ERROR] Errors: 
[ERROR]   TestDecommissionWithBackoffMonitor>TestDecommission.testRequeueUnhealthyDecommissioningNodes:1776 » Timeout

Manual Testing

  • create Hadoop cluster with:
    • 30 datanodes initially
    • custom Namenode JAR containing this change
    • hdfs-site configuration "dfs.namenode.decommission.max.concurrent.tracked.nodes = 10"
> cat /etc/hadoop/conf/hdfs-site.xml | grep -A 1 'tracked'
    <name>dfs.namenode.decommission.max.concurrent.tracked.nodes</name>
    <value>10</value>
  • reproduce the bug: https://issues.apache.org/jira/browse/HDFS-16303
    • start decommissioning over 20 datanodes
    • terminate 20 datanodes while they are in state decommissioning
    • observe the Namenode logs to validate that there are 20 unhealthy datanodes stuck "in Decommission In Progress"
2021-11-15 17:57:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
2021-11-15 17:57:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.

2021-11-15 17:58:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
2021-11-15 17:58:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.

2021-11-15 17:58:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
2021-11-15 17:58:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.

2021-11-15 17:59:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
2021-11-15 17:59:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.
  • scale-up to 25 healthy datanodes & then decommission 22 of those datanodes (all but 3)
    • observe the Namenode logs to validate those 22 healthy datanodes are decommissioned (i.e. HDFS-16303 is solved)
2021-11-15 17:59:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
2021-11-15 17:59:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.

2021-11-15 18:00:14,487 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 42 nodes decommissioning but only 10 nodes will be tracked at a time. 32 nodes are currently queued waiting to be decommissioned.
2021-11-15 18:00:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 42 nodes decommissioning but only 10 nodes will be tracked at a time. 32 nodes are currently queued waiting to be decommissioned.
2021-11-15 18:01:14,486 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 32 nodes decommissioning but only 10 nodes will be tracked at a time. 32 nodes are currently queued waiting to be decommissioned.
2021-11-15 18:01:44,486 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 32 nodes decommissioning but only 10 nodes will be tracked at a time. 22 nodes are currently queued waiting to be decommissioned.
2021-11-15 18:02:14,486 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 22 nodes decommissioning but only 10 nodes will be tracked at a time. 22 nodes are currently queued waiting to be decommissioned.

2021-11-15 18:02:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 12 nodes are currently queued waiting to be decommissioned.
2021-11-15 18:02:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 8 nodes which are dead while in Decommission In Progress.

2021-11-15 18:03:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
2021-11-15 18:03:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.

For code changes:

  • [yes] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • [n/a] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • [n/a] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • [no] If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?


@KevinWikant (Contributor, Author) commented Nov 18, 2021

Seems the unit tests failed on an unrelated error:

[ERROR] Failures:
[ERROR] org.apache.hadoop.hdfs.TestRollingUpgrade.testRollback(org.apache.hadoop.hdfs.TestRollingUpgrade)
[ERROR] Run 1: TestRollingUpgrade.testRollback:328->checkMxBeanIsNull:299 expected null, but was:<javax.management.openmbean.CompositeDataSupport(compositeType=javax.management.openmbean.CompositeType(name=org.apache.hadoop.hdfs.protocol.RollingUpgradeInfo$Bean,items=((itemName=blockPoolId,itemType=javax.management.openmbean.SimpleType(name=java.lang.String)),(itemName=createdRollbackImages,itemType=javax.management.openmbean.SimpleType(name=java.lang.Boolean)),(itemName=finalizeTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)),(itemName=startTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)))),contents={blockPoolId=BP-1039609189-172.17.0.2-1637204447583, createdRollbackImages=true, finalizeTime=0, startTime=1637204448659})>
[ERROR] Run 2: TestRollingUpgrade.testRollback:328->checkMxBeanIsNull:299 expected null, but was:<javax.management.openmbean.CompositeDataSupport(compositeType=javax.management.openmbean.CompositeType(name=org.apache.hadoop.hdfs.protocol.RollingUpgradeInfo$Bean,items=((itemName=blockPoolId,itemType=javax.management.openmbean.SimpleType(name=java.lang.String)),(itemName=createdRollbackImages,itemType=javax.management.openmbean.SimpleType(name=java.lang.Boolean)),(itemName=finalizeTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)),(itemName=startTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)))),contents={blockPoolId=BP-1039609189-172.17.0.2-1637204447583, createdRollbackImages=true, finalizeTime=0, startTime=1637204448659})>
[ERROR] Run 3: TestRollingUpgrade.testRollback:328->checkMxBeanIsNull:299 expected null, but was:<javax.management.openmbean.CompositeDataSupport(compositeType=javax.management.openmbean.CompositeType(name=org.apache.hadoop.hdfs.protocol.RollingUpgradeInfo$Bean,items=((itemName=blockPoolId,itemType=javax.management.openmbean.SimpleType(name=java.lang.String)),(itemName=createdRollbackImages,itemType=javax.management.openmbean.SimpleType(name=java.lang.Boolean)),(itemName=finalizeTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)),(itemName=startTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)))),contents={blockPoolId=BP-1039609189-172.17.0.2-1637204447583, createdRollbackImages=true, finalizeTime=0, startTime=1637204448659})>
[INFO]

Added "TestRollingUpgrade.testRollback" as a potentially flaky test here: https://issues.apache.org/jira/browse/HDFS-15646

@aajisaka (Member) left a comment

Thank you @KevinWikant for your patch. The change mostly looks good to me.

@sodonnel @jojochuang I suppose you have more insights about DN decommission. Would you check this PR?

protected BlockManager blockManager;
protected Namesystem namesystem;
protected DatanodeAdminManager dnAdmin;
protected Configuration conf;

- protected final Queue<DatanodeDescriptor> pendingNodes = new ArrayDeque<>();
+ private final PriorityQueue<DatanodeDescriptor> pendingNodes = new PriorityQueue<>(
A reviewer (Member) commented on the changed line:
This change increases the time complexity of adding an element from O(1) to O(logN); however, N is bounded by the number of DNs that are decommissioning (i.e. at most several hundred even in a large cluster). Therefore I think the performance effect is not critical.
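
As an illustration of the ordering being discussed, here is a minimal, self-contained sketch (my own stand-in types, not the Hadoop code; the Node class and its heartbeat field approximate DatanodeDescriptor and its last-heartbeat timestamp): the priority queue simply orders pending nodes so that recently heartbeating nodes are de-queued and tracked before long-dead ones.

import java.util.Comparator;
import java.util.PriorityQueue;

public class PendingNodesOrderingSketch {
  // Stand-in for DatanodeDescriptor: only the last-heartbeat timestamp matters here.
  static class Node {
    final String name;
    final long lastHeartbeatMillis;
    Node(String name, long lastHeartbeatMillis) {
      this.name = name;
      this.lastHeartbeatMillis = lastHeartbeatMillis;
    }
  }

  public static void main(String[] args) {
    // Most recent heartbeat first, so live nodes are de-queued (and tracked) before dead ones.
    PriorityQueue<Node> pendingNodes = new PriorityQueue<>(
        Comparator.comparingLong((Node n) -> n.lastHeartbeatMillis).reversed());

    long now = System.currentTimeMillis();
    pendingNodes.add(new Node("deadNode", now - 3600000L));  // no heartbeat for an hour
    pendingNodes.add(new Node("healthyNode", now));          // heartbeat just received

    System.out.println(pendingNodes.poll().name); // healthyNode
    System.out.println(pendingNodes.poll().name); // deadNode
  }
}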

@sodonnel (Contributor)

This is quite a big change. I have a couple of thoughts.

If a node goes dead while decommissioning, would it not be better to just remove it from the decommission monitor rather than keep tracking it there at all? If the node comes alive again, it should be entered back into the monitor.

We could either detect it is dead in the monitor and remove it from tracking then, or have the place that logs the mentioned "node is dead while decommission in progress" remove it from the monitor.

The DatanodeAdminBackoffMonitor is probably rarely used, if it is used at all, but I don't think it has a tracking limit at the moment. Perhaps it should have one; it was designed to run with less overhead than the default monitor, but if you decommissioned 100's of nodes at a time it might struggle, I am not sure.

@KevinWikant (Contributor, Author) commented Nov 22, 2021

If a node goes dead while decommissioning, would it not be better to just remove it from the decommission monitor rather than keep tracking it there at all? If the node comes alive again, it should be entered back into the monitor.
We could either detect it is dead in the monitor and remove it from tracking then, or have the place that logs the mentioned "node is dead while decommission in progress" remove it from the monitor.

I agree with this statement, and this is essentially the behavior this code change provides, with one caveat: "If the node comes alive again OR if there are no (potentially healthy) nodes in the pendingNodes queue, it will be entered back into the monitor".

Consider the 2 alternatives:
A) Do not track an unhealthy node until it becomes healthy
B) Do not track an unhealthy node until it becomes healthy OR until there are fewer healthy nodes being decommissioned than "dfs.namenode.decommission.max.concurrent.tracked.nodes"

Note that neither alternative prevents healthy nodes from being decommissioned. With Option B), if there are more nodes being decommissioned than can be tracked, any unhealthy nodes being tracked are removed from the tracked set and re-queued (with a priority lower than all healthy nodes in the priority queue); the healthy nodes are then de-queued and moved to the tracked set, and once the healthy nodes are decommissioned the unhealthy nodes can be tracked again.

It may seem Option A) is more performant because it avoids tracking unhealthy nodes on each DatanodeAdminMonitor cycle, but this may not be the case: we would still need to check the health status of all the unhealthy nodes in the pendingNodes queue to determine if they should be tracked again. Furthermore, tracking unhealthy nodes is the current behavior and, as far as I know, it has not caused any performance problems.

With Option B), unhealthy nodes are only tracked (and their health status is only checked) when there are fewer healthy nodes than "dfs.namenode.decommission.max.concurrent.tracked.nodes". This may result in superior performance for Option B), as it only needs to check the health of up to "dfs.namenode.decommission.max.concurrent.tracked.nodes" unhealthy nodes, rather than the health of all nodes in the pendingNodes queue.
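
To make the Option B) re-queue decision concrete, here is a small self-contained sketch (hypothetical code with simplified stand-in types, not the actual DatanodeAdminMonitor implementation) of the per-cycle behavior described above:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class RequeueDecisionSketch {
  // Stand-in for DatanodeDescriptor: only liveness matters for this sketch.
  static class Node {
    final String name;
    final boolean alive;
    Node(String name, boolean alive) { this.name = name; this.alive = alive; }
  }

  public static void main(String[] args) {
    int maxConcurrentTrackedNodes = 2;
    List<Node> trackedNodes = new ArrayList<>(List.of(
        new Node("dead-1", false), new Node("dead-2", false)));
    Queue<Node> pendingNodes = new ArrayDeque<>(List.of(
        new Node("healthy-1", true), new Node("healthy-2", true)));

    // When the tracked set is full, re-queue unhealthy tracked nodes so that
    // queued (potentially healthy) nodes can take their slots. In the real
    // change the pending queue is a priority queue, so re-queued dead nodes
    // land behind the healthy ones.
    if (trackedNodes.size() >= maxConcurrentTrackedNodes) {
      trackedNodes.removeIf(node -> {
        if (!node.alive) {
          pendingNodes.add(node);
          return true;
        }
        return false;
      });
    }
    // Fill the freed slots from the head of the queue (healthy nodes first).
    while (trackedNodes.size() < maxConcurrentTrackedNodes && !pendingNodes.isEmpty()) {
      trackedNodes.add(pendingNodes.poll());
    }

    trackedNodes.forEach(node -> System.out.println("tracking " + node.name)); // healthy-1, healthy-2
  }
}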

Note that the log message "node is dead while decommission in progress" occurs from within the DatanodeAdminMonitor:

@sodonnel (Contributor)

In DatanodeManager.registerDatanode(), it has logic to add a node into the decommission workflow if the node has a DECOMMISSIONED status in the hosts / combined hosts file, as it calls startAdminOperationIfNecessary which looks like:

void startAdminOperationIfNecessary(DatanodeDescriptor nodeReg) {
    long maintenanceExpireTimeInMS =
        hostConfigManager.getMaintenanceExpirationTimeInMS(nodeReg);
    // If the registered node is in exclude list, then decommission it
    if (getHostConfigManager().isExcluded(nodeReg)) {
      datanodeAdminManager.startDecommission(nodeReg);
    } else if (nodeReg.maintenanceNotExpired(maintenanceExpireTimeInMS)) {
      datanodeAdminManager.startMaintenance(nodeReg, maintenanceExpireTimeInMS);
    }
  }

If a DN goes dead and then re-registers, it will be added back into the pending nodes, so I don't think we need to continue to track it in the decommission monitor when it goes dead. We can just handle the dead event in the decommission monitor and stop tracking it, clearing a slot for another node. Then it will be re-added if it comes back by existing logic above.

@KevinWikant (Contributor, Author) commented Nov 22, 2021

The DatanodeAdminBackoffMonitor is probably rarely used, if it is used at all, but I don't think it has a tracking limit at the moment. Perhaps it should have one; it was designed to run with less overhead than the default monitor, but if you decommissioned 100's of nodes at a time it might struggle, I am not sure.

Based on unit testing & code inspection, I think "dfs.namenode.decommission.max.concurrent.tracked.nodes" still applies to the DatanodeAdminBackoffMonitor.

From the code, DatanodeAdminBackoffMonitor:

So:

The new unit test "TestDecommissionWithBackoffMonitor.testRequeueUnhealthyDecommissioningNodes" will fail without the changes made to "DatanodeAdminBackoffMonitor". It fails because the unhealthy nodes have filled up the tracked set (i.e. outOfServiceNodeBlocks) & the healthy nodes are stuck in the pendingNodes queue.

@KevinWikant (Contributor, Author)

If a DN goes dead and then re-registers, it will be added back into the pending nodes, so I don't think we need to continue to track it in the decommission monitor when it goes dead. We can just handle the dead event in the decommission monitor and stop tracking it, clearing a slot for another node. Then it will be re-added if it comes back by existing logic above.

Thanks for sharing this! I will make this change, test it, & get back to you

@sodonnel (Contributor)

Ah yes, the BackoffMonitor calls the method below, which is indeed limited by the max tracked nodes:

private void processPendingNodes() {
    while (!pendingNodes.isEmpty() &&
        (maxConcurrentTrackedNodes == 0 ||
            outOfServiceNodeBlocks.size() < maxConcurrentTrackedNodes)) {
      outOfServiceNodeBlocks.put(pendingNodes.poll(), null);
    }
  }

@KevinWikant (Contributor, Author) commented Nov 26, 2021

If a DN goes dead and then re-registers, it will be added back into the pending nodes, so I don't think we need to continue to track it in the decommission monitor when it goes dead. We can just handle the dead event in the decommission monitor and stop tracking it, clearing a slot for another node. Then it will be re-added if it comes back by existing logic above.

@sodonnel After going through the re-implementation of the logic, I think I found a caveat worth mentioning

Note the implementation of "isNodeHealthyForDecommissionOrMaintenance":
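
The referenced snippet is not reproduced here, but the behavior this discussion relies on can be paraphrased with the following self-contained sketch (my own simplified stand-in, not the actual DatanodeAdminManager method): a dead node is still allowed to finish decommissioning as long as the NameNode has no remaining LowRedundancy blocks.

public class NodeHealthSketch {
  // Hypothetical paraphrase of the check named above; the real logic lives in
  // DatanodeAdminManager#isNodeHealthyForDecommissionOrMaintenance.
  static boolean isNodeHealthyForDecommissionOrMaintenance(boolean nodeIsAlive,
      long lowRedundancyBlocks) {
    if (nodeIsAlive) {
      return true; // a live node can always make progress towards DECOMMISSIONED
    }
    // A dead node may only complete decommissioning once there are no
    // LowRedundancy blocks left, i.e. every block it held still has enough
    // live replicas elsewhere.
    return lowRedundancyBlocks == 0;
  }

  public static void main(String[] args) {
    System.out.println(isNodeHealthyForDecommissionOrMaintenance(false, 0));  // true: can become DECOMMISSIONED
    System.out.println(isNodeHealthyForDecommissionOrMaintenance(false, 12)); // false: stays DECOMMISSION_INPROGRESS
  }
}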

Here's the happy case:

  • a dead datanode is removed from DatanodeAdminManager (i.e. removed from both outOfServiceNodeBlocks & pendingNodes) with AdminState=DECOMMISSION_INPROGRESS
  • a dead datanode comes back alive, gets re-added to DatanodeAdminManager, & transitions to AdminState=DECOMMISSIONED

Here's a less happy case:

  • a dead datanode is removed from DatanodeAdminManager with AdminState=DECOMMISSION_INPROGRESS
  • the dead datanode never comes back alive & therefore remains with AdminState=DECOMMISSION_INPROGRESS forever

The problem is that dead datanodes can eventually be DECOMMISSIONED if the namenode eliminates all LowRedundancy blocks (because of the "isNodeHealthyForDecommissionOrMaintenance" logic); by removing the dead datanode from the DatanodeAdminManager entirely (until it comes back alive), we prevent the dead datanode from transitioning to DECOMMISSIONED even though it could have.

So the impact is that a dead datanode which never comes back alive will never transition from DECOMMISSION_INPROGRESS to DECOMMISSIONED, even though it could have.

@KevinWikant (Contributor, Author)

@sodonnel let me know your thoughts, but I think the problem with removing a dead node from the DatanodeAdminManager until it comes back alive again is that it will never be decommissioned if it never comes alive again

Do you see any major downside in keeping the dead nodes in the pendingNodes priority queue behind all the alive nodes? Because of the priority queue ordering the dead nodes will not block decommissioning of alive nodes

@sodonnel (Contributor)

For me, DECOMMISSION_IN_PROGRESS + DEAD is an error state that means decommission has effectively failed. There is a case where it can complete, but what does that really mean - if the node is dead, it has not been gracefully stopped. If it wasn't for the way decommission is triggered using the hosts files, I would suggest switching it back to IN_SERVICE + DEAD, and let it be treated like any other dead host.

If you have some monitoring tool tracking the decommission, and it sees "DECOMMISSIONED", then it assumes the decommission went fine.

If it sees DECOMMISSION_IN_PROGRESS + DEAD, then it's a flag that the admin needs to go look into it, as it should not have happened - perhaps they need to bring the node back, or conclude that the cluster is still OK without it (no missing blocks) and add it to the exclude list and forget about it.

My feeling is that the priority queue idea adds some more complexity to an already hard to follow process / code area and I wonder if it is better to just remove the node from the monitor and let it be dealt with manually, which may be required a lot of the time anyway?

@virajjasani (Contributor)

the priority queue idea adds some more complexity to an already hard to follow process / code area and I wonder if it is better to just remove the node from the monitor and let it be dealt with manually, which may be required a lot of the time anyway?

Having faced a similar situation where the DECOMMISSION_IN_PROGRESS + DEAD state required manual intervention, I agree with the approach of removing the node, and also with the point that this is already a complex process to follow, so introducing a new priority queue would complicate it even further.

@KevinWikant (Contributor, Author) commented Nov 30, 2021

DECOMMISSION_IN_PROGRESS + DEAD is an error state that means decommission has effectively failed. There is a case where it can complete, but what does that really mean - if the node is dead, it has not been gracefully stopped.

The case which I have described where dead node decommissioning completes can occur when:

  • a decommissioning node goes dead, but all of its blocks still have block replicas on other live nodes
  • the namenode is eventually able to satisfy the minimum replication of all blocks (by replicating the under-replicated blocks from the live nodes)
  • the dead decommissioning node is transitioned to decommissioned

In this case, the node did go dead while decommissioning, but there were no LowRedundancy blocks thanks to redundant block replicas. From the user perspective, the loss of the decommissioning node did not impact the outcome of the decommissioning process. Had the node not gone dead while decommissioning, the eventual outcome would be the same: the node is decommissioned and there are no LowRedundancy blocks.

If there are LowRedundancy blocks, then a dead datanode will remain decommissioning, because if the dead node were to come alive again it might be able to recover the LowRedundancy blocks. But if there are no LowRedundancy blocks, then when the node comes alive again it will immediately transition to decommissioned anyway, so why not make it decommissioned while it's still dead?

Also, I don't think the priority queue is adding much complexity; it's just putting healthy nodes (with more recent heartbeat times) ahead of unhealthy nodes (with older heartbeat times), so that healthy nodes are decommissioned first.


I also want to call out another caveat with the approach of removing the node from the DatanodeAdminManager which I uncovered while unit testing

If we leave the node in DECOMMISSION_IN_PROGRESS & remove the node from DatanodeAdminManager, then the following callstack should re-add the datanode to the DatanodeAdminManager when it comes alive again:

The problem is this condition "!node.isDecommissionInProgress()":
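
The referenced code is not reproduced here, but the issue can be sketched as follows; this is hypothetical, simplified code in which only the "!node.isDecommissionInProgress()" guard and the startTrackingNode call named in this thread come from the real method.

enum AdminState { IN_SERVICE, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

class NodeSketch {
  AdminState adminState = AdminState.IN_SERVICE;
  boolean isDecommissionInProgress() {
    return adminState == AdminState.DECOMMISSION_INPROGRESS;
  }
}

public class StartDecommissionSketch {
  static void startDecommission(NodeSketch node) {
    // The guard discussed above: a node already in DECOMMISSION_INPROGRESS is
    // skipped, so startTrackingNode is never called for a dead node that
    // re-registers while still in that state.
    if (!node.isDecommissionInProgress()) {
      node.adminState = AdminState.DECOMMISSION_INPROGRESS;
      startTrackingNode(node);
    }
  }

  static void startTrackingNode(NodeSketch node) {
    System.out.println("node handed to the decommission monitor");
  }

  public static void main(String[] args) {
    NodeSketch deadNode = new NodeSketch();
    deadNode.adminState = AdminState.DECOMMISSION_INPROGRESS; // state it was left in when it died
    startDecommission(deadNode); // prints nothing: the node is never re-tracked
  }
}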

Because the dead datanode is left in DECOMMISSION_INPROGRESS, "startTrackingNode" is not invoked because of the "!node.isDecommissionInProgress()" condition

Simply removing the condition "!node.isDecommissionInProgress()" will not function well because "startTrackingNode" is not idempotent:

I can think of a few obvious ways to handle this:
A) make calls to "startTrackingNode" idempotent (meaning that if the DatanodeDescriptor is already tracked, it does not get added to the DatanodeAdminManager)
B) modify startDecommission such that its aware of if the invocation is for a datanode which was just restarted after being dead such that it can still invoke "startTrackingNode" even though "node.isDecommissionInProgress()"
C) add a flag to DatanodeDescriptor which indicates if the datanode is being tracked within DatanodeAdminManager, then check this flag in startDecommission

For "A)", the challenge is that we need to ensure the DatanodeDescriptor is not in "pendingReplication" or "outOfServiceBlocks" which could be a fairly costly call to execute repeatedly. Also, I am not even sure such a check is thread-safe given there is no locking used as part of "startDecommission" or "startTrackingNode"

For "B)", the awareness of if a registerDatanode call is related to a restarted datanode is available here. So this information would need to be passed down the callstack to a method "startDecommission(DatanodeDescriptor node, boolean isNodeRestart)". Because of the modified method signature, all the other invocations of "startDecommission" will need to specify isNodeRestart=false

Given this additional hurdle in the approach of removing a dead datanode from the DatanodeAdminManager, are we sure it will be less complex than the proposed change?


In short:

  • I don't think there is any downside in moving a dead datanode to decommissioned when there are no LowRedundancy blocks because this would happen immediately anyway were the node to come back alive (and get re-added to DatanodeAdminManager)
  • the approach of removing a dead datanode from the DatanodeAdminManager will not work properly without some significant refactoring of the "startDecommission" method & related code

@sodonnel @virajjasani @aajisaka let me know your thoughts. I am still more in favor of tracking dead datanodes in DatanodeAdminManager (when there are LowRedundancy blocks), but if the community thinks it's better to remove the dead datanodes from DatanodeAdminManager I can implement proposal "C)".

@KevinWikant (Contributor, Author)

I have implemented the alternative where dead DECOMMISSION_INPROGRESS nodes are removed from the DatanodeAdminManager immediately: #3746

I created a separate PR so that the two alternatives can be compared; the diff for each alternative is quite different.

@sodonnel @virajjasani @aajisaka let me know your thoughts

@KevinWikant (Contributor, Author)

@sodonnel The existing test "TestDecommissioningStatus.testDecommissionStatusAfterDNRestart" will be problematic for the proposed alternative of removing a dead DECOMMISSION_INPROGRESS node from the DatanodeAdminManager: #3746

As previously stated, removing the dead DECOMMISSION_INPROGRESS node from the DatanodeAdminManager means that when there are no LowRedundancy blocks the dead node will remain in DECOMMISSION_INPROGRESS rather than transitioning to DECOMMISSIONED

This violates the expectation that the unit test is enforcing, which is that a dead DECOMMISSION_INPROGRESS node should transition to DECOMMISSIONED when there are no LowRedundancy blocks.

"Delete the under-replicated file, which should let the DECOMMISSION_IN_PROGRESS node become DECOMMISSIONED"

Therefore, I think this is a good argument to remain more in favor of the original proposed change

@KevinWikant (Contributor, Author)

I would also add that if you look at the implementation of the proposed alternative of removing a dead DECOMMISSION_INPROGRESS node from the DatanodeAdminManager (https://github.com/apache/hadoop/pull/3746/files), it is not any less complex than this change, due to the aforementioned caveats that need to be dealt with.

@aajisaka (Member) commented Dec 7, 2021

Thank you @KevinWikant @sodonnel @virajjasani for your discussion, and thanks @KevinWikant again for the deep dive.

I'm in favor of the approach in this PR.

This violates the expectation that the unit test is enforcing, which is that a dead DECOMMISSION_INPROGRESS node should transition to DECOMMISSIONED when there are no LowRedundancy blocks.

The expectation is added by HDFS-7374. This JIRA allowed decommissioning of dead DataNodes, and the feature has been enabled for a long time since Hadoop 2.7.0. I don't want to break the expectation because breaking it surprises the administrators.

@aajisaka (Member) commented Dec 8, 2021

ping @sodonnel let me know your thoughts.


@aajisaka (Member) left a comment

Thank you @KevinWikant for your update. I reviewed your patch again and added some comments.


@KevinWikant (Contributor, Author)

Unit testing failed due to unrelated flaky tests

[ERROR] Errors:
[ERROR] org.apache.hadoop.hdfs.TestRollingUpgrade.testRollback(org.apache.hadoop.hdfs.TestRollingUpgrade)
[ERROR] Run 1: TestRollingUpgrade.testRollback:329->waitForNullMxBean:361 » Timeout Timed out...
[ERROR] Run 2: TestRollingUpgrade.testRollback:329->waitForNullMxBean:361 » Timeout Timed out...
[ERROR] Run 3: TestRollingUpgrade.testRollback:329->waitForNullMxBean:361 » Timeout Timed out...
[INFO]
[WARNING] Flakes:
[WARNING] org.apache.hadoop.hdfs.TestRollingUpgrade.testCheckpoint(org.apache.hadoop.hdfs.TestRollingUpgrade)
[ERROR] Run 1: TestRollingUpgrade.testCheckpoint:599->testCheckpoint:686 Test resulted in an unexpected exit
[INFO] Run 2: PASS

@aajisaka (Member) left a comment

+1, I'll commit tomorrow if there is no objection.

@virajjasani (Contributor) commented Dec 20, 2021

Sorry, I could not get to this PR last week. I will review later this week, but I don't mean to block this work. If I find something odd or something that could be improved, we can always get it clarified later on the PR/Jira or create an addendum PR.
Thanks for your work @KevinWikant, this might be really helpful going forward.

With a quick glance, just one question for now: overall it seems the goal is to improve and continue the decommissioning of healthy nodes over unhealthy ones (by removing and then re-queueing the entries), hence if a few nodes are really in a bad state (hardware/network issues), the plan is to keep re-queueing them until more nodes are getting decommissioned than max tracked nodes, right? Since an unhealthy node getting decommissioned might anyways require some sort of retry, shall we requeue them even if the condition is not met (i.e. total no. of decommissions in progress < max tracked nodes), as limited retries? I am just thinking at a high level, yet to catch up with the PR.

Also, good to know HDFS-7374 is not broken.

@virajjasani (Contributor)

Unit testing failed due to unrelated flaky tests

[ERROR] Errors:
[ERROR] org.apache.hadoop.hdfs.TestRollingUpgrade.testRollback(org.apache.hadoop.hdfs.TestRollingUpgrade)
[ERROR] Run 1: TestRollingUpgrade.testRollback:329->waitForNullMxBean:361 » Timeout Timed out...
[ERROR] Run 2: TestRollingUpgrade.testRollback:329->waitForNullMxBean:361 » Timeout Timed out...
[ERROR] Run 3: TestRollingUpgrade.testRollback:329->waitForNullMxBean:361 » Timeout Timed out...
[INFO]
[WARNING] Flakes:
[WARNING] org.apache.hadoop.hdfs.TestRollingUpgrade.testCheckpoint(org.apache.hadoop.hdfs.TestRollingUpgrade)
[ERROR] Run 1: TestRollingUpgrade.testCheckpoint:599->testCheckpoint:686 Test resulted in an unexpected exit
[INFO] Run 2: PASS

Yeah, this test failure is not relevant. Even after the recent attempt it is still flaky; we might require better insights into this test.

@KevinWikant (Contributor, Author) commented Dec 20, 2021

@virajjasani, please see my response to your comments below

hence if few nodes are really in bad state (hardware/network issues), the plan is to keep re-queueing them until more nodes are getting decommissioned than max tracked nodes right?

It's the opposite: the unhealthy nodes will only be re-queued when there are more nodes being decommissioned than max tracked nodes. Otherwise, if there are fewer nodes being decommissioned than max tracked nodes, the unhealthy nodes will not be re-queued, because they do not risk blocking the decommissioning of queued healthy nodes (i.e. because the queue is empty).

One potential performance impact that comes to mind: if there are, say, 200 unhealthy decommissioning nodes and max tracked nodes = 100, this may cause some churn in the queueing/de-queueing process, because on each DatanodeAdminMonitor tick all 100 tracked nodes will be re-queued and then 100 queued nodes will be de-queued/tracked. Note that this churn (and any associated performance impact) will only take effect when:

  • there are more nodes being decommissioned than max tracked nodes
  • AND either:
    • number of healthy decommissioning nodes < max tracked nodes
    • number of unhealthy decommissioning nodes > max tracked nodes

The amount of re-queued/de-queued nodes per tick can be quantified as:

numRequeue = numDecommissioning <= numTracked ? 0 : numDeadDecommissioning - (numDecommissioning - numTracked)

This churn of queueing/de-queueing will not occur at all under typical decommissioning scenarios (i.e. where there isn't a large number of dead decommissioning nodes).
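
As a quick sanity check of the expression above, here is a tiny self-contained sketch (a hypothetical helper, not part of the patch) that evaluates it for a few scenarios:

public class RequeueChurnSketch {
  // Hypothetical helper implementing the expression quoted above.
  static int numRequeue(int numDecommissioning, int numDeadDecommissioning, int numTracked) {
    return numDecommissioning <= numTracked
        ? 0
        : numDeadDecommissioning - (numDecommissioning - numTracked);
  }

  public static void main(String[] args) {
    // 200 dead decommissioning nodes, max tracked = 100: all 100 tracked slots
    // hold dead nodes and are re-queued every tick.
    System.out.println(numRequeue(200, 200, 100)); // 100

    // 150 decommissioning nodes of which 60 are dead, max tracked = 100.
    System.out.println(numRequeue(150, 60, 100)); // 10

    // Fewer decommissioning nodes than the tracking limit: no churn at all.
    System.out.println(numRequeue(80, 30, 100)); // 0
  }
}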

One idea to mitigate this is to have DatanodeAdminMonitor maintain counters tracking the number of healthy nodes in the pendingNodes queue; this count could then be used to make an improved re-queue decision, where unhealthy nodes are only re-queued if there are healthy nodes in the pendingNodes queue. But this approach has some flaws; for example, an unhealthy node in the queue could come alive again, yet an unhealthy node in the tracked set would not be re-queued because the healthy queued node count has not been updated. To solve this, we would need to periodically scan the pendingNodes queue to update the healthy/unhealthy node counts, and this scan could prove expensive.

Since unhealthy node getting decommissioned might anyways require some sort of retry, shall we requeue them even if the condition is not met (i.e. total no of decomm in progress < max tracked nodes) as a limited retries?

If there are fewer nodes being decommissioned than max tracked nodes, then there are no nodes in the pendingNodes queue & all nodes are being tracked for decommissioning. Therefore, there is no possibility that any healthy nodes are blocked in the pendingNodes queue (preventing them from being decommissioned) & so in my opinion there is no benefit to re-queueing the unhealthy nodes in this case. Furthermore, this will negatively impact performance through frequent re-queueing & de-queueing.

@virajjasani (Contributor)

If there are fewer nodes being decommissioned than max tracked nodes, then there are no nodes in the pendingNodes queue & all nodes are being tracked for decommissioning. Therefore, there is no possibility that any healthy nodes are blocked in the pendingNodes queue

Yes makes sense. Thanks

@virajjasani (Contributor) left a comment

Left a few minor comments, looks good overall. Thanks @KevinWikant

@virajjasani (Contributor) left a comment

+1 (non-binding), pending QA

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 1m 20s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 4 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 42m 1s trunk passed
+1 💚 compile 1m 40s trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 compile 1m 25s trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 checkstyle 1m 4s trunk passed
+1 💚 mvnsite 1m 35s trunk passed
+1 💚 javadoc 1m 12s trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 1m 35s trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 spotbugs 3m 43s trunk passed
+1 💚 shadedclient 26m 52s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 26s the patch passed
+1 💚 compile 1m 37s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javac 1m 37s the patch passed
+1 💚 compile 1m 22s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 javac 1m 22s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 57s hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 45 unchanged - 1 fixed = 45 total (was 46)
+1 💚 mvnsite 1m 33s the patch passed
+1 💚 javadoc 0m 58s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 1m 29s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 spotbugs 3m 46s the patch passed
+1 💚 shadedclient 27m 18s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 350m 2s hadoop-hdfs in the patch passed.
+1 💚 asflicense 1m 13s The patch does not generate ASF License warnings.
471m 19s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/5/artifact/out/Dockerfile
GITHUB PR #3675
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux 11c131aeedfc 4.15.0-162-generic #170-Ubuntu SMP Mon Oct 18 11:38:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / b57cbef
Default Java Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/5/testReport/
Max. process+thread count 1803 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/5/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@KevinWikant (Contributor, Author)

Thanks @aajisaka & @virajjasani for the PR approvals! Are we good to merge now?

@aajisaka merged commit d20b598 into apache:trunk on Dec 23, 2021
@aajisaka (Member)

Merged. Thank you @KevinWikant @virajjasani @sodonnel.
