
HDFS-16303. Improve handling of datanode lost while decommissioning #3675

Merged: 5 commits into apache:trunk on Dec 23, 2021

Conversation

@KevinWikant (Contributor)

Description of PR

Fixes a bug in Hadoop HDFS where, if more than "dfs.namenode.decommission.max.concurrent.tracked.nodes" datanodes are lost while in state decommissioning, all forward progress towards decommissioning any datanodes (including healthy datanodes) is blocked.

JIRA: https://issues.apache.org/jira/browse/HDFS-16303

How was this patch tested?

Unit Testing

Added new unit tests:

  • TestDecommission.testRequeueUnhealthyDecommissioningNodes (& TestDecommissionWithBackoffMonitor.testRequeueUnhealthyDecommissioningNodes)
  • DatanodeAdminMonitorBase.testPendingNodesQueueOrdering
  • DatanodeAdminMonitorBase.testPendingNodesQueueReverseOrdering

All "TestDecommission", "TestDecommissionWithBackoffMonitor", & "DatanodeAdminMonitorBase" tests pass when run locally

Note that without the "DatanodeAdminManager" changes the new test "testRequeueUnhealthyDecommissioningNodes" fails because it times out waiting for the healthy nodes to be decommissioned

> mvn -Dtest=TestDecommission#testRequeueUnhealthyDecommissioningNodes test
...
[ERROR] Errors: 
[ERROR]   TestDecommission.testRequeueUnhealthyDecommissioningNodes:1776 » Timeout Timed...
> mvn -Dtest=TestDecommissionWithBackoffMonitor#testRequeueUnhealthyDecommissioningNodes test
...
[ERROR] Errors: 
[ERROR]   TestDecommissionWithBackoffMonitor>TestDecommission.testRequeueUnhealthyDecommissioningNodes:1776 » Timeout

Manual Testing

  • create Hadoop cluster with:
    • 30 datanodes initially
    • custom Namenode JAR containing this change
    • hdfs-site configuration "dfs.namenode.decommission.max.concurrent.tracked.nodes = 10"
> cat /etc/hadoop/conf/hdfs-site.xml | grep -A 1 'tracked'
    <name>dfs.namenode.decommission.max.concurrent.tracked.nodes</name>
    <value>10</value>
  • reproduce the bug: https://issues.apache.org/jira/browse/HDFS-16303
    • start decommissioning over 20 datanodes
    • terminate 20 datanodes while they are in state decommissioning
    • observe the Namenode logs to validate that there are 20 unhealthy datanodes stuck "in Decommission In Progress"
2021-11-15 17:57:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
2021-11-15 17:57:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.

2021-11-15 17:58:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
2021-11-15 17:58:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.

2021-11-15 17:58:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
2021-11-15 17:58:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.

2021-11-15 17:59:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
2021-11-15 17:59:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.
  • scale-up to 25 healthy datanodes & then decommission 22 of those datanodes (all but 3)
    • observe the Namenode logs to validate those 22 healthy datanodes are decommissioned (i.e. HDFS-16303 is solved)
2021-11-15 17:59:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
2021-11-15 17:59:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.

2021-11-15 18:00:14,487 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 42 nodes decommissioning but only 10 nodes will be tracked at a time. 32 nodes are currently queued waiting to be decommissioned.
2021-11-15 18:00:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 42 nodes decommissioning but only 10 nodes will be tracked at a time. 32 nodes are currently queued waiting to be decommissioned.
2021-11-15 18:01:14,486 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 32 nodes decommissioning but only 10 nodes will be tracked at a time. 32 nodes are currently queued waiting to be decommissioned.
2021-11-15 18:01:44,486 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 32 nodes decommissioning but only 10 nodes will be tracked at a time. 22 nodes are currently queued waiting to be decommissioned.
2021-11-15 18:02:14,486 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 22 nodes decommissioning but only 10 nodes will be tracked at a time. 22 nodes are currently queued waiting to be decommissioned.

2021-11-15 18:02:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 12 nodes are currently queued waiting to be decommissioned.
2021-11-15 18:02:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 8 nodes which are dead while in Decommission In Progress.

2021-11-15 18:03:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
2021-11-15 18:03:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.

For code changes:

  • [yes] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • [n/a] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • [n/a] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • [no] If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?


@KevinWikant (Contributor, Author) commented Nov 18, 2021

Seems the unit tests failed on an unrelated error:

[ERROR] Failures:
[ERROR] org.apache.hadoop.hdfs.TestRollingUpgrade.testRollback(org.apache.hadoop.hdfs.TestRollingUpgrade)
[ERROR] Run 1: TestRollingUpgrade.testRollback:328->checkMxBeanIsNull:299 expected null, but was:<javax.management.openmbean.CompositeDataSupport(compositeType=javax.management.openmbean.CompositeType(name=org.apache.hadoop.hdfs.protocol.RollingUpgradeInfo$Bean,items=((itemName=blockPoolId,itemType=javax.management.openmbean.SimpleType(name=java.lang.String)),(itemName=createdRollbackImages,itemType=javax.management.openmbean.SimpleType(name=java.lang.Boolean)),(itemName=finalizeTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)),(itemName=startTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)))),contents={blockPoolId=BP-1039609189-172.17.0.2-1637204447583, createdRollbackImages=true, finalizeTime=0, startTime=1637204448659})>
[ERROR] Run 2: TestRollingUpgrade.testRollback:328->checkMxBeanIsNull:299 expected null, but was:<javax.management.openmbean.CompositeDataSupport(compositeType=javax.management.openmbean.CompositeType(name=org.apache.hadoop.hdfs.protocol.RollingUpgradeInfo$Bean,items=((itemName=blockPoolId,itemType=javax.management.openmbean.SimpleType(name=java.lang.String)),(itemName=createdRollbackImages,itemType=javax.management.openmbean.SimpleType(name=java.lang.Boolean)),(itemName=finalizeTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)),(itemName=startTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)))),contents={blockPoolId=BP-1039609189-172.17.0.2-1637204447583, createdRollbackImages=true, finalizeTime=0, startTime=1637204448659})>
[ERROR] Run 3: TestRollingUpgrade.testRollback:328->checkMxBeanIsNull:299 expected null, but was:<javax.management.openmbean.CompositeDataSupport(compositeType=javax.management.openmbean.CompositeType(name=org.apache.hadoop.hdfs.protocol.RollingUpgradeInfo$Bean,items=((itemName=blockPoolId,itemType=javax.management.openmbean.SimpleType(name=java.lang.String)),(itemName=createdRollbackImages,itemType=javax.management.openmbean.SimpleType(name=java.lang.Boolean)),(itemName=finalizeTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)),(itemName=startTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)))),contents={blockPoolId=BP-1039609189-172.17.0.2-1637204447583, createdRollbackImages=true, finalizeTime=0, startTime=1637204448659})>
[INFO]

Added "TestRollingUpgrade.testRollback" as a potentially flaky test here: https://issues.apache.org/jira/browse/HDFS-15646

@aajisaka (Member) left a comment

Thank you @KevinWikant for your patch. The change mostly looks good to me.

@sodonnel @jojochuang I suppose you have more insights about DN decommission. Would you check this PR?

protected BlockManager blockManager;
protected Namesystem namesystem;
protected DatanodeAdminManager dnAdmin;
protected Configuration conf;

- protected final Queue<DatanodeDescriptor> pendingNodes = new ArrayDeque<>();
+ private final PriorityQueue<DatanodeDescriptor> pendingNodes = new PriorityQueue<>(
A reviewer (Member) commented on the changed line:
This change increases the time complexity of adding an element from O(1) to O(logN); however, N is bounded by the number of DNs that are decommissioning (i.e. at most several hundred even in a large cluster). Therefore I think the performance effect is not critical.
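
As an illustration of the ordering being discussed, here is a minimal, self-contained sketch (my own stand-in types, not the Hadoop code; the Node class and its heartbeat field approximate DatanodeDescriptor and its last-heartbeat timestamp): the priority queue simply orders pending nodes so that recently heartbeating nodes are de-queued and tracked before long-dead ones.

import java.util.Comparator;
import java.util.PriorityQueue;

public class PendingNodesOrderingSketch {
  // Stand-in for DatanodeDescriptor: only the last-heartbeat timestamp matters here.
  static class Node {
    final String name;
    final long lastHeartbeatMillis;
    Node(String name, long lastHeartbeatMillis) {
      this.name = name;
      this.lastHeartbeatMillis = lastHeartbeatMillis;
    }
  }

  public static void main(String[] args) {
    // Most recent heartbeat first, so live nodes are de-queued (and tracked) before dead ones.
    PriorityQueue<Node> pendingNodes = new PriorityQueue<>(
        Comparator.comparingLong((Node n) -> n.lastHeartbeatMillis).reversed());

    long now = System.currentTimeMillis();
    pendingNodes.add(new Node("deadNode", now - 3600000L));  // no heartbeat for an hour
    pendingNodes.add(new Node("healthyNode", now));          // heartbeat just received

    System.out.println(pendingNodes.poll().name); // healthyNode
    System.out.println(pendingNodes.poll().name); // deadNode
  }
}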

@sodonnel (Contributor)

This is quite a big change. I have a couple of thoughts.

If a node goes dead while decommissioning, would it not be better to just remove it from the decommission monitor rather than keep tracking it there at all? If the node comes alive again, it should be entered back into the monitor.

We could either detect it is dead in the monitor and remove it from tracking then, or have the place that logs the mentioned "node is dead while decommission in progress" remove it from the monitor.

The DatanodeAdminBackoffMonitor is probably rarely used, if it is used at all, but I don't think it has a tracking limit at the moment. Perhaps it should have one; it was designed to run with less overhead than the default monitor, but if you decommissioned 100's of nodes at a time it might struggle, I am not sure.

@KevinWikant (Contributor, Author) commented Nov 22, 2021

If a node goes dead while decommissioning, would it not be better to just remove it from the decommission monitor rather than keep tracking it there at all? If the node comes alive again, it should be entered back into the monitor.
We could either detect it is dead in the monitor and remove it from tracking then, or have the place that logs the mentioned "node is dead while decommission in progress" remove it from the monitor.

I agree with this statement, and this is essentially the behavior this code change provides, with one caveat: "If the node comes alive again OR if there are no (potentially healthy) nodes in the pendingNodes queue, it will be entered back into the monitor".

Consider the 2 alternatives:
A) Do not track an unhealthy node until it becomes healthy
B) Do not track an unhealthy node until it becomes healthy OR until there are fewer healthy nodes being decommissioned than "dfs.namenode.decommission.max.concurrent.tracked.nodes"

Note that neither alternative prevents healthy nodes from being decommissioned. With Option B), if there are more nodes being decommissioned than can be tracked, any unhealthy nodes being tracked are removed from the tracked set and re-queued (with a priority lower than all healthy nodes in the priority queue); the healthy nodes are then de-queued and moved to the tracked set, and once the healthy nodes are decommissioned the unhealthy nodes can be tracked again.

It may seem Option A) is more performant because it avoids tracking unhealthy nodes on each DatanodeAdminMonitor cycle, but this may not be the case: we would still need to check the health status of all the unhealthy nodes in the pendingNodes queue to determine if they should be tracked again. Furthermore, tracking unhealthy nodes is the current behavior and, as far as I know, it has not caused any performance problems.

With Option B), unhealthy nodes are only tracked (and their health status is only checked) when there are fewer healthy nodes than "dfs.namenode.decommission.max.concurrent.tracked.nodes". This may result in superior performance for Option B), as it only needs to check the health of up to "dfs.namenode.decommission.max.concurrent.tracked.nodes" unhealthy nodes, rather than the health of all nodes in the pendingNodes queue.
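
To make the Option B) re-queue decision concrete, here is a small self-contained sketch (hypothetical code with simplified stand-in types, not the actual DatanodeAdminMonitor implementation) of the per-cycle behavior described above:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class RequeueDecisionSketch {
  // Stand-in for DatanodeDescriptor: only liveness matters for this sketch.
  static class Node {
    final String name;
    final boolean alive;
    Node(String name, boolean alive) { this.name = name; this.alive = alive; }
  }

  public static void main(String[] args) {
    int maxConcurrentTrackedNodes = 2;
    List<Node> trackedNodes = new ArrayList<>(List.of(
        new Node("dead-1", false), new Node("dead-2", false)));
    Queue<Node> pendingNodes = new ArrayDeque<>(List.of(
        new Node("healthy-1", true), new Node("healthy-2", true)));

    // When the tracked set is full, re-queue unhealthy tracked nodes so that
    // queued (potentially healthy) nodes can take their slots. In the real
    // change the pending queue is a priority queue, so re-queued dead nodes
    // land behind the healthy ones.
    if (trackedNodes.size() >= maxConcurrentTrackedNodes) {
      trackedNodes.removeIf(node -> {
        if (!node.alive) {
          pendingNodes.add(node);
          return true;
        }
        return false;
      });
    }
    // Fill the freed slots from the head of the queue (healthy nodes first).
    while (trackedNodes.size() < maxConcurrentTrackedNodes && !pendingNodes.isEmpty()) {
      trackedNodes.add(pendingNodes.poll());
    }

    trackedNodes.forEach(node -> System.out.println("tracking " + node.name)); // healthy-1, healthy-2
  }
}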

Note that the log message "node is dead while decommission in progress" occurs from within the DatanodeAdminMonitor:

@sodonnel (Contributor)

In DatanodeManager.registerDatanode(), it has logic to add a node into the decommission workflow if the node has a DECOMMISSIONED status in the hosts / combined hosts file, as it calls startAdminOperationIfNecessary which looks like:

void startAdminOperationIfNecessary(DatanodeDescriptor nodeReg) {
    long maintenanceExpireTimeInMS =
        hostConfigManager.getMaintenanceExpirationTimeInMS(nodeReg);
    // If the registered node is in exclude list, then decommission it
    if (getHostConfigManager().isExcluded(nodeReg)) {
      datanodeAdminManager.startDecommission(nodeReg);
    } else if (nodeReg.maintenanceNotExpired(maintenanceExpireTimeInMS)) {
      datanodeAdminManager.startMaintenance(nodeReg, maintenanceExpireTimeInMS);
    }
  }

If a DN goes dead and then re-registers, it will be added back into the pending nodes, so I don't think we need to continue to track it in the decommission monitor when it goes dead. We can just handle the dead event in the decommission monitor and stop tracking it, clearing a slot for another node. Then it will be re-added if it comes back by existing logic above.

@KevinWikant (Contributor, Author) commented Nov 22, 2021

The DatanodeAdminBackoffMonitor is probably rarely used, if it is used at all, but I don't think it has a tracking limit at the moment. Perhaps it should have one; it was designed to run with less overhead than the default monitor, but if you decommissioned 100's of nodes at a time it might struggle, I am not sure.

Based on unit testing & code inspection, I think "dfs.namenode.decommission.max.concurrent.tracked.nodes" still applies to the DatanodeAdminBackoffMonitor.

From the code, DatanodeAdminBackoffMonitor:

So:

The new unit test "TestDecommissionWithBackoffMonitor.testRequeueUnhealthyDecommissioningNodes" will fail without the changes made to "DatanodeAdminBackoffMonitor". It fails because the unhealthy nodes have filled up the tracked set (i.e. outOfServiceNodeBlocks) & the healthy nodes are stuck in the pendingNodes queue.

@KevinWikant (Contributor, Author)

If a DN goes dead and then re-registers, it will be added back into the pending nodes, so I don't think we need to continue to track it in the decommission monitor when it goes dead. We can just handle the dead event in the decommission monitor and stop tracking it, clearing a slot for another node. Then it will be re-added if it comes back by existing logic above.

Thanks for sharing this! I will make this change, test it, & get back to you

@sodonnel (Contributor)

Ah yes, the BackoffMonitor calls the method below, which is indeed limited by the max tracked nodes:

private void processPendingNodes() {
    while (!pendingNodes.isEmpty() &&
        (maxConcurrentTrackedNodes == 0 ||
            outOfServiceNodeBlocks.size() < maxConcurrentTrackedNodes)) {
      outOfServiceNodeBlocks.put(pendingNodes.poll(), null);
    }
  }

@KevinWikant (Contributor, Author) commented Nov 26, 2021

If a DN goes dead and then re-registers, it will be added back into the pending nodes, so I don't think we need to continue to track it in the decommission monitor when it goes dead. We can just handle the dead event in the decommission monitor and stop tracking it, clearing a slot for another node. Then it will be re-added if it comes back by existing logic above.

@sodonnel After going through the re-implementation of the logic, I think I found a caveat worth mentioning

Note the implementation of "isNodeHealthyForDecommissionOrMaintenance":
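
The referenced snippet is not reproduced here, but the behavior this discussion relies on can be paraphrased with the following self-contained sketch (my own simplified stand-in, not the actual DatanodeAdminManager method): a dead node is still allowed to finish decommissioning as long as the NameNode has no remaining LowRedundancy blocks.

public class NodeHealthSketch {
  // Hypothetical paraphrase of the check named above; the real logic lives in
  // DatanodeAdminManager#isNodeHealthyForDecommissionOrMaintenance.
  static boolean isNodeHealthyForDecommissionOrMaintenance(boolean nodeIsAlive,
      long lowRedundancyBlocks) {
    if (nodeIsAlive) {
      return true; // a live node can always make progress towards DECOMMISSIONED
    }
    // A dead node may only complete decommissioning once there are no
    // LowRedundancy blocks left, i.e. every block it held still has enough
    // live replicas elsewhere.
    return lowRedundancyBlocks == 0;
  }

  public static void main(String[] args) {
    System.out.println(isNodeHealthyForDecommissionOrMaintenance(false, 0));  // true: can become DECOMMISSIONED
    System.out.println(isNodeHealthyForDecommissionOrMaintenance(false, 12)); // false: stays DECOMMISSION_INPROGRESS
  }
}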

Here's the happy case:

  • a dead datanode is removed from DatanodeAdminManager (i.e. removed from both outOfServiceNodeBlocks & pendingNodes) with AdminState=DECOMMISSION_INPROGRESS
  • a dead datanode comes back alive, gets re-added to DatanodeAdminManager, & transitions to AdminState=DECOMMISSIONED

Here's a less happy case:

  • a dead datanode is removed from DatanodeAdminManager with AdminState=DECOMMISSION_INPROGRESS
  • the dead datanode never comes back alive & therefore remains with AdminState=DECOMMISSION_INPROGRESS forever

The problem is that dead datanodes can eventually be DECOMMISSIONED if the namenode eliminates all LowRedundancy blocks (because of the "isNodeHealthyForDecommissionOrMaintenance" logic); by removing the dead datanode from the DatanodeAdminManager entirely (until it comes back alive), we prevent the dead datanode from transitioning to DECOMMISSIONED even though it could have.

So the impact is that a dead datanode which never comes back alive will never transition from DECOMMISSION_INPROGRESS to DECOMMISSIONED, even though it could have.

@KevinWikant (Contributor, Author)

@sodonnel let me know your thoughts, but I think the problem with removing a dead node from the DatanodeAdminManager until it comes back alive again is that it will never be decommissioned if it never comes alive again

Do you see any major downside in keeping the dead nodes in the pendingNodes priority queue behind all the alive nodes? Because of the priority queue ordering the dead nodes will not block decommissioning of alive nodes

@sodonnel (Contributor)

For me, DECOMMISSION_IN_PROGRESS + DEAD is an error state that means decommission has effectively failed. There is a case where it can complete, but what does that really mean - if the node is dead, it has not been gracefully stopped. If it wasn't for the way decommission is triggered using the hosts files, I would suggest switching it back to IN_SERVICE + DEAD, and let it be treated like any other dead host.

If you have some monitoring tool tracking the decommission, and it sees "DECOMMISSIONED", then it assumes the decommission went fine.

If it sees DECOMMISSION_IN_PROGRESS + DEAD, then it's a flag that the admin needs to go look into it, as it should not have happened - perhaps they need to bring the node back, or conclude that the cluster is still OK without it (no missing blocks) and add it to the exclude list and forget about it.

My feeling is that the priority queue idea adds some more complexity to an already hard to follow process / code area and I wonder if it is better to just remove the node from the monitor and let it be dealt with manually, which may be required a lot of the time anyway?

@virajjasani (Contributor)

the priority queue idea adds some more complexity to an already hard to follow process / code area and I wonder if it is better to just remove the node from the monitor and let it be dealt with manually, which may be required a lot of the time anyway?

Having faced a similar situation where the DECOMMISSION_IN_PROGRESS + DEAD state required manual intervention, I agree with the approach of removing the node, and also with the point that this is already a complex process to follow, so introducing a new priority queue would complicate it even further.

@KevinWikant (Contributor, Author) commented Nov 30, 2021

DECOMMISSION_IN_PROGRESS + DEAD is an error state that means decommission has effectively failed. There is a case where it can complete, but what does that really mean - if the node is dead, it has not been gracefully stopped.

The case which I have described where dead node decommissioning completes can occur when:

  • a decommissioning node goes dead, but all of its blocks still have block replicas on other live nodes
  • the namenode is eventually able to satisfy the minimum replication of all blocks (by replicating the under-replicated blocks from the live nodes)
  • the dead decommissioning node is transitioned to decommissioned

In this case, the node did go dead while decommissioning, but there were no LowRedundancy blocks thanks to redundant block replicas. From the user perspective, the loss of the decommissioning node did not impact the outcome of the decommissioning process. Had the node not gone dead while decommissioning, the eventual outcome would be the same: the node is decommissioned and there are no LowRedundancy blocks.

If there are LowRedundancy blocks, then a dead datanode will remain decommissioning, because if the dead node were to come alive again it might be able to recover the LowRedundancy blocks. But if there are no LowRedundancy blocks, then when the node comes alive again it will immediately transition to decommissioned anyway, so why not make it decommissioned while it's still dead?

Also, I don't think the priority queue is adding much complexity; it's just putting healthy nodes (with more recent heartbeat times) ahead of unhealthy nodes (with older heartbeat times), so that healthy nodes are decommissioned first.


I also want to call out another caveat with the approach of removing the node from the DatanodeAdminManager which I uncovered while unit testing

If we leave the node in DECOMMISSION_IN_PROGRESS & remove the node from DatanodeAdminManager, then the following callstack should re-add the datanode to the DatanodeAdminManager when it comes alive again:

The problem is this condition "!node.isDecommissionInProgress()":
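
The referenced code is not reproduced here, but the issue can be sketched as follows; this is hypothetical, simplified code in which only the "!node.isDecommissionInProgress()" guard and the startTrackingNode call named in this thread come from the real method.

enum AdminState { IN_SERVICE, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

class NodeSketch {
  AdminState adminState = AdminState.IN_SERVICE;
  boolean isDecommissionInProgress() {
    return adminState == AdminState.DECOMMISSION_INPROGRESS;
  }
}

public class StartDecommissionSketch {
  static void startDecommission(NodeSketch node) {
    // The guard discussed above: a node already in DECOMMISSION_INPROGRESS is
    // skipped, so startTrackingNode is never called for a dead node that
    // re-registers while still in that state.
    if (!node.isDecommissionInProgress()) {
      node.adminState = AdminState.DECOMMISSION_INPROGRESS;
      startTrackingNode(node);
    }
  }

  static void startTrackingNode(NodeSketch node) {
    System.out.println("node handed to the decommission monitor");
  }

  public static void main(String[] args) {
    NodeSketch deadNode = new NodeSketch();
    deadNode.adminState = AdminState.DECOMMISSION_INPROGRESS; // state it was left in when it died
    startDecommission(deadNode); // prints nothing: the node is never re-tracked
  }
}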

Because the dead datanode is left in DECOMMISSION_INPROGRESS, "startTrackingNode" is not invoked because of the "!node.isDecommissionInProgress()" condition

Simply removing the condition "!node.isDecommissionInProgress()" will not function well because "startTrackingNode" is not idempotent:

I can think of a few obvious ways to handle this:
A) make calls to "startTrackingNode" idempotent (meaning that if the DatanodeDescriptor is already tracked, it does not get added to the DatanodeAdminManager)
B) modify startDecommission such that its aware of if the invocation is for a datanode which was just restarted after being dead such that it can still invoke "startTrackingNode" even though "node.isDecommissionInProgress()"
C) add a flag to DatanodeDescriptor which indicates if the datanode is being tracked within DatanodeAdminManager, then check this flag in startDecommission

For "A)", the challenge is that we need to ensure the DatanodeDescriptor is not in "pendingReplication" or "outOfServiceBlocks" which could be a fairly costly call to execute repeatedly. Also, I am not even sure such a check is thread-safe given there is no locking used as part of "startDecommission" or "startTrackingNode"

For "B)", the awareness of if a registerDatanode call is related to a restarted datanode is available here. So this information would need to be passed down the callstack to a method "startDecommission(DatanodeDescriptor node, boolean isNodeRestart)". Because of the modified method signature, all the other invocations of "startDecommission" will need to specify isNodeRestart=false

Given this additional hurdle in the approach of removing a dead datanode from the DatanodeAdminManager, are we sure it will be less complex than the proposed change?


In short:

  • I don't think there is any downside in moving a dead datanode to decommissioned when there are no LowRedundancy blocks because this would happen immediately anyway were the node to come back alive (and get re-added to DatanodeAdminManager)
  • the approach of removing a dead datanode from the DatanodeAdminManager will not work properly without some significant refactoring of the "startDecommission" method & related code

@sodonnel @virajjasani @aajisaka let me know your thoughts. I am still more in favor of tracking dead datanodes in DatanodeAdminManager (when there are LowRedundancy blocks), but if the community thinks it's better to remove the dead datanodes from DatanodeAdminManager I can implement proposal "C)".

@KevinWikant (Contributor, Author)

I have implemented the alternative where dead DECOMMISSION_INPROGRESS nodes are removed from the DatanodeAdminManager immediately: #3746

I created a separate PR so that the two alternatives can be compared; the diff for each alternative is quite different.

@sodonnel @virajjasani @aajisaka let me know your thoughts

@KevinWikant (Contributor, Author)

@sodonnel The existing test "TestDecommissioningStatus.testDecommissionStatusAfterDNRestart" will be problematic for the proposed alternative of removing a dead DECOMMISSION_INPROGRESS node from the DatanodeAdminManager: #3746

As previously stated, removing the dead DECOMMISSION_INPROGRESS node from the DatanodeAdminManager means that when there are no LowRedundancy blocks the dead node will remain in DECOMMISSION_INPROGRESS rather than transitioning to DECOMMISSIONED

This violates the expectation that the unit test is enforcing, which is that a dead DECOMMISSION_INPROGRESS node should transition to DECOMMISSIONED when there are no LowRedundancy blocks.

"Delete the under-replicated file, which should let the DECOMMISSION_IN_PROGRESS node become DECOMMISSIONED"

Therefore, I think this is a good argument to remain more in favor of the original proposed change

@KevinWikant (Contributor, Author)

I would also add that if you look at the implementation of the proposed alternative of removing a dead DECOMMISSION_INPROGRESS node from the DatanodeAdminManager (https://github.com/apache/hadoop/pull/3746/files), it is not any less complex than this change, due to the aforementioned caveats that need to be dealt with.

@aajisaka (Member) commented Dec 7, 2021

Thank you @KevinWikant @sodonnel @virajjasani for your discussion, and thanks @KevinWikant again for the deep dive.

I'm in favor of the approach in this PR.

This violates the expectation that the unit test is enforcing, which is that a dead DECOMMISSION_INPROGRESS node should transition to DECOMMISSIONED when there are no LowRedundancy blocks.

The expectation is added by HDFS-7374. This JIRA allowed decommissioning of dead DataNodes, and the feature has been enabled for a long time since Hadoop 2.7.0. I don't want to break the expectation because breaking it surprises the administrators.

@aajisaka (Member) commented Dec 8, 2021

ping @sodonnel let me know your thoughts.


@aajisaka (Member) left a comment

Thank you @KevinWikant for your update. I reviewed your patch again and added some comments.


@KevinWikant (Contributor, Author)

Unit testing failed due to unrelated flaky tests

[ERROR] Errors:
[ERROR] org.apache.hadoop.hdfs.TestRollingUpgrade.testRollback(org.apache.hadoop.hdfs.TestRollingUpgrade)
[ERROR] Run 1: TestRollingUpgrade.testRollback:329->waitForNullMxBean:361 » Timeout Timed out...
[ERROR] Run 2: TestRollingUpgrade.testRollback:329->waitForNullMxBean:361 » Timeout Timed out...
[ERROR] Run 3: TestRollingUpgrade.testRollback:329->waitForNullMxBean:361 » Timeout Timed out...
[INFO]
[WARNING] Flakes:
[WARNING] org.apache.hadoop.hdfs.TestRollingUpgrade.testCheckpoint(org.apache.hadoop.hdfs.TestRollingUpgrade)
[ERROR] Run 1: TestRollingUpgrade.testCheckpoint:599->testCheckpoint:686 Test resulted in an unexpected exit
[INFO] Run 2: PASS

@aajisaka (Member) left a comment

+1, I'll commit tomorrow if there is no objection.

@virajjasani (Contributor) commented Dec 20, 2021

Sorry, I could not get to this PR last week. I will review later this week, but I don't mean to block this work. If I find something odd or something that could be improved, we can always get it clarified later on the PR/Jira or create an addendum PR.
Thanks for your work @KevinWikant, this might be really helpful going forward.

With a quick glance, just one question for now: overall it seems the goal is to improve and continue the decommissioning of healthy nodes over unhealthy ones (by removing and then re-queueing the entries), hence if a few nodes are really in a bad state (hardware/network issues), the plan is to keep re-queueing them until more nodes are getting decommissioned than max tracked nodes, right? Since an unhealthy node getting decommissioned might anyways require some sort of retry, shall we requeue them even if the condition is not met (i.e. total no. of decommissions in progress < max tracked nodes), as limited retries? I am just thinking at a high level, yet to catch up with the PR.

Also, good to know HDFS-7374 is not broken.

@virajjasani (Contributor)

Unit testing failed due to unrelated flaky tests

[ERROR] Errors:
[ERROR] org.apache.hadoop.hdfs.TestRollingUpgrade.testRollback(org.apache.hadoop.hdfs.TestRollingUpgrade)
[ERROR] Run 1: TestRollingUpgrade.testRollback:329->waitForNullMxBean:361 » Timeout Timed out...
[ERROR] Run 2: TestRollingUpgrade.testRollback:329->waitForNullMxBean:361 » Timeout Timed out...
[ERROR] Run 3: TestRollingUpgrade.testRollback:329->waitForNullMxBean:361 » Timeout Timed out...
[INFO]
[WARNING] Flakes:
[WARNING] org.apache.hadoop.hdfs.TestRollingUpgrade.testCheckpoint(org.apache.hadoop.hdfs.TestRollingUpgrade)
[ERROR] Run 1: TestRollingUpgrade.testCheckpoint:599->testCheckpoint:686 Test resulted in an unexpected exit
[INFO] Run 2: PASS

Yeah, this test failure is not relevant. Even after the recent attempt it is still flaky; we might require better insights into this test.

@KevinWikant (Contributor, Author) commented Dec 20, 2021

@virajjasani, please see my response to your comments below

hence if few nodes are really in bad state (hardware/network issues), the plan is to keep re-queueing them until more nodes are getting decommissioned than max tracked nodes right?

It's the opposite: the unhealthy nodes will only be re-queued when there are more nodes being decommissioned than max tracked nodes. Otherwise, if there are fewer nodes being decommissioned than max tracked nodes, the unhealthy nodes will not be re-queued, because they do not risk blocking the decommissioning of queued healthy nodes (i.e. because the queue is empty).

One potential performance impact that comes to mind: if there are, say, 200 unhealthy decommissioning nodes and max tracked nodes = 100, this may cause some churn in the queueing/de-queueing process, because on each DatanodeAdminMonitor tick all 100 tracked nodes will be re-queued and then 100 queued nodes will be de-queued/tracked. Note that this churn (and any associated performance impact) will only take effect when:

  • there are more nodes being decommissioned than max tracked nodes
  • AND either:
    • number of healthy decommissioning nodes < max tracked nodes
    • number of unhealthy decommissioning nodes > max tracked nodes

The amount of re-queued/de-queued nodes per tick can be quantified as:

numRequeue = numDecommissioning <= numTracked ? 0 : numDeadDecommissioning - (numDecommissioning - numTracked)

This churn of queueing/de-queueing will not occur at all under typical decommissioning scenarios (i.e. where there isn't a large number of dead decommissioning nodes).
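
As a quick sanity check of the expression above, here is a tiny self-contained sketch (a hypothetical helper, not part of the patch) that evaluates it for a few scenarios:

public class RequeueChurnSketch {
  // Hypothetical helper implementing the expression quoted above.
  static int numRequeue(int numDecommissioning, int numDeadDecommissioning, int numTracked) {
    return numDecommissioning <= numTracked
        ? 0
        : numDeadDecommissioning - (numDecommissioning - numTracked);
  }

  public static void main(String[] args) {
    // 200 dead decommissioning nodes, max tracked = 100: all 100 tracked slots
    // hold dead nodes and are re-queued every tick.
    System.out.println(numRequeue(200, 200, 100)); // 100

    // 150 decommissioning nodes of which 60 are dead, max tracked = 100.
    System.out.println(numRequeue(150, 60, 100)); // 10

    // Fewer decommissioning nodes than the tracking limit: no churn at all.
    System.out.println(numRequeue(80, 30, 100)); // 0
  }
}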

One idea to mitigate this is to have DatanodeAdminMonitor maintain counters tracking the number of healthy nodes in the pendingNodes queue; this count could then be used to make an improved re-queue decision, where unhealthy nodes are only re-queued if there are healthy nodes in the pendingNodes queue. But this approach has some flaws; for example, an unhealthy node in the queue could come alive again, yet an unhealthy node in the tracked set would not be re-queued because the healthy queued node count has not been updated. To solve this, we would need to periodically scan the pendingNodes queue to update the healthy/unhealthy node counts, and this scan could prove expensive.

Since unhealthy node getting decommissioned might anyways require some sort of retry, shall we requeue them even if the condition is not met (i.e. total no of decomm in progress < max tracked nodes) as a limited retries?

If there are fewer nodes being decommissioned than max tracked nodes, then there are no nodes in the pendingNodes queue & all nodes are being tracked for decommissioning. Therefore, there is no possibility that any healthy nodes are blocked in the pendingNodes queue (preventing them from being decommissioned) & so in my opinion there is no benefit to re-queueing the unhealthy nodes in this case. Furthermore, this will negatively impact performance through frequent re-queueing & de-queueing.

@virajjasani (Contributor)

If there are fewer nodes being decommissioned than max tracked nodes, then there are no nodes in the pendingNodes queue & all nodes are being tracked for decommissioning. Therefore, there is no possibility that any healthy nodes are blocked in the pendingNodes queue

Yes makes sense. Thanks

@virajjasani (Contributor) left a comment

Left a few minor comments, looks good overall. Thanks @KevinWikant

@virajjasani (Contributor) left a comment

+1 (non-binding), pending QA

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 1m 20s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 4 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 42m 1s trunk passed
+1 💚 compile 1m 40s trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 compile 1m 25s trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 checkstyle 1m 4s trunk passed
+1 💚 mvnsite 1m 35s trunk passed
+1 💚 javadoc 1m 12s trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 1m 35s trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 spotbugs 3m 43s trunk passed
+1 💚 shadedclient 26m 52s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 26s the patch passed
+1 💚 compile 1m 37s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javac 1m 37s the patch passed
+1 💚 compile 1m 22s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 javac 1m 22s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 57s hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 45 unchanged - 1 fixed = 45 total (was 46)
+1 💚 mvnsite 1m 33s the patch passed
+1 💚 javadoc 0m 58s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 1m 29s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 spotbugs 3m 46s the patch passed
+1 💚 shadedclient 27m 18s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 350m 2s hadoop-hdfs in the patch passed.
+1 💚 asflicense 1m 13s The patch does not generate ASF License warnings.
471m 19s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/5/artifact/out/Dockerfile
GITHUB PR #3675
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux 11c131aeedfc 4.15.0-162-generic #170-Ubuntu SMP Mon Oct 18 11:38:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / b57cbef
Default Java Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/5/testReport/
Max. process+thread count 1803 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/5/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@KevinWikant (Contributor, Author)

Thanks @aajisaka & @virajjasani for the PR approvals! Are we good to merge now?

@aajisaka merged commit d20b598 into apache:trunk on Dec 23, 2021
@aajisaka (Member)

Merged. Thank you @KevinWikant @virajjasani @sodonnel.
