HDFS-16303. Improve handling of datanode lost while decommissioning #3746

KevinWikant · 2021-12-02T21:40:33Z

Description of PR

Fixes a bug in Hadoop HDFS where if more than "dfs.namenode.decommission.max.concurrent.tracked.nodes" datanodes are lost while in state decommissioning, then all forward progress towards decommissioning any datanodes (including healthy datanodes) is blocked

JIRA: https://issues.apache.org/jira/browse/HDFS-16303

Additional Details

To solve this HDFS bug, there are 2 different proposals:

the original proposal in this PR which continues to track a dead DECOMMISSION_INPROGRESS node in the DatanodeAdminManager until there are no LowRedundancy blocks on HDFS
this new PR which removes a dead DECOMMISSION_INPROGRESS node from the DatanodeAdminManager immediately leaving the node in DECOMMISSION_INPROGRESS regardless of whether or not there are any LowRedundancy blocks on HDFS

These 2 different implementations will largely behave the same from a user perspective. There is however 1 key difference:

in the original PR when there a no LowRedundancy blocks a dead DECOMMISSION_INPROGRESS node will be transitioned to DECOMMISSIONED.
in the new PR a dead DECOMMISSION_INPROGRESS node will remain in DECOMMISSION_INPROGRESS indefinitely regardless of if there are no LowRedundancy blocks. That is, unless the node either comes alive again or the user removes it from the exclude file.

How was this patch tested?

3 new unit tests added to both "TestDecommission" & "TestDecommissionWithBackoffMonitor":

testDeadNodesRemainDecommissionInProgress
testRevivedDeadNodeIsDecommissioned
testDeadNodeWithDecommissionInProgressRemovedFromExcludeFile

For code changes:

[yes] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
[n/a] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
[n/a] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
[no] If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

KevinWikant · 2021-12-02T21:41:15Z

...p-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java

@@ -25,9 +25,11 @@
 import java.util.Set;
 import java.util.concurrent.ExecutionException;

+import org.apache.hadoop.fs.Path;


will remove the unused imports added to this class in the next revision

hadoop-yetus · 2021-12-03T04:49:26Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 57s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 1s		codespell was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
+1 💚	test4tests	0m 0s		The patch appears to include 3 new or modified test files.
			_ trunk Compile Tests _
+1 💚	mvninstall	39m 29s		trunk passed
+1 💚	compile	1m 45s		trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚	compile	1m 35s		trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚	checkstyle	1m 12s		trunk passed
+1 💚	mvnsite	1m 43s		trunk passed
+1 💚	javadoc	1m 20s		trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚	javadoc	1m 56s		trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚	spotbugs	4m 15s		trunk passed
+1 💚	shadedclient	30m 27s		branch has no errors when building and testing our client artifacts.
-0 ⚠️	patch	30m 51s		Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
			_ Patch Compile Tests _
+1 💚	mvninstall	1m 41s		the patch passed
+1 💚	compile	1m 38s		the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚	javac	1m 38s		the patch passed
+1 💚	compile	1m 28s		the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚	javac	1m 28s		the patch passed
+1 💚	blanks	0m 0s		The patch has no blanks issues.
-0 ⚠️	checkstyle	1m 6s	/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt	hadoop-hdfs-project/hadoop-hdfs: The patch generated 7 new + 159 unchanged - 0 fixed = 166 total (was 159)
+1 💚	mvnsite	1m 39s		the patch passed
+1 💚	javadoc	0m 56s		the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚	javadoc	1m 31s		the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚	spotbugs	3m 52s		the patch passed
+1 💚	shadedclient	24m 6s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
-1 ❌	unit	236m 49s	/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt	hadoop-hdfs in the patch passed.
+1 💚	asflicense	0m 48s		The patch does not generate ASF License warnings.
		357m 4s

Reason	Tests
Failed junit tests	hadoop.hdfs.server.namenode.TestDecommissioningStatus
	hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier
	hadoop.hdfs.server.namenode.TestDecommissioningStatusWithBackoffMonitor

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3746/1/artifact/out/Dockerfile
GITHUB PR	#3746
Optional Tests	dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname	Linux 443350b2f5e6 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	trunk / `8d48eea`
Default Java	Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Multi-JDK versions	/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Test Results	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3746/1/testReport/
Max. process+thread count	3184 (vs. ulimit of 5500)
modules	C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3746/1/console
versions	git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by	Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

KevinWikant · 2021-12-03T16:14:54Z

@sodonnel The existing test "TestDecommissioningStatus.testDecommissionStatusAfterDNRestart" will be problematic for this change

As previously stated, removing the dead DECOMMISSION_INPROGRESS node from the DatanodeAdminManager means that when there are no LowRedundancy blocks the dead node will remain in DECOMMISSION_INPROGRESS rather than transitioning to DECOMMISSIONED

This violates the expectation the the unit test is enforcing which is that a dead DECOMMISSION_INPROGRESS node should transition to DECOMMISSIONED when there are no LowRedundancy blocks

"Delete the under-replicated file, which should let the DECOMMISSION_IN_PROGRESS node become DECOMMISSIONED"

hadoop/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestDecommissioningStatus.java

Line 451 in 6342d5e

assertTrue("the node should be decommissioned",

Therefore, I think this is a good argument to remain more in favor of the original proposed change: #3675

KevinWikant · 2021-12-23T13:58:33Z

Closing this PR in favor of this alternate solution: #3675

HDFS-16303. Improve handling of datanode lost while decommissioning

e2c252c

KevinWikant commented Dec 2, 2021

View reviewed changes

KevinWikant mentioned this pull request Dec 2, 2021

HDFS-16303. Improve handling of datanode lost while decommissioning #3675

Merged

Merge branch 'trunk' into feature/hdfs-16303-v2

8d48eea

KevinWikant closed this Dec 23, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HDFS-16303. Improve handling of datanode lost while decommissioning #3746

HDFS-16303. Improve handling of datanode lost while decommissioning #3746

KevinWikant commented Dec 2, 2021 •

edited

Loading

KevinWikant Dec 2, 2021

hadoop-yetus commented Dec 3, 2021

KevinWikant commented Dec 3, 2021

KevinWikant commented Dec 23, 2021

HDFS-16303. Improve handling of datanode lost while decommissioning #3746

HDFS-16303. Improve handling of datanode lost while decommissioning #3746

Conversation

KevinWikant commented Dec 2, 2021 • edited Loading

Description of PR

Additional Details

How was this patch tested?

For code changes:

KevinWikant Dec 2, 2021

Choose a reason for hiding this comment

hadoop-yetus commented Dec 3, 2021

KevinWikant commented Dec 3, 2021

KevinWikant commented Dec 23, 2021

KevinWikant commented Dec 2, 2021 •

edited

Loading