Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDFS-16303. Improve handling of datanode lost while decommissioning #3746

Closed
wants to merge 2 commits into from

Conversation

KevinWikant
Copy link
Contributor

@KevinWikant KevinWikant commented Dec 2, 2021

Description of PR

Fixes a bug in Hadoop HDFS where if more than "dfs.namenode.decommission.max.concurrent.tracked.nodes" datanodes are lost while in state decommissioning, then all forward progress towards decommissioning any datanodes (including healthy datanodes) is blocked

JIRA: https://issues.apache.org/jira/browse/HDFS-16303

Additional Details

To solve this HDFS bug, there are 2 different proposals:

  • the original proposal in this PR which continues to track a dead DECOMMISSION_INPROGRESS node in the DatanodeAdminManager until there are no LowRedundancy blocks on HDFS
  • this new PR which removes a dead DECOMMISSION_INPROGRESS node from the DatanodeAdminManager immediately leaving the node in DECOMMISSION_INPROGRESS regardless of whether or not there are any LowRedundancy blocks on HDFS

These 2 different implementations will largely behave the same from a user perspective. There is however 1 key difference:

  • in the original PR when there a no LowRedundancy blocks a dead DECOMMISSION_INPROGRESS node will be transitioned to DECOMMISSIONED.
  • in the new PR a dead DECOMMISSION_INPROGRESS node will remain in DECOMMISSION_INPROGRESS indefinitely regardless of if there are no LowRedundancy blocks. That is, unless the node either comes alive again or the user removes it from the exclude file.

How was this patch tested?

3 new unit tests added to both "TestDecommission" & "TestDecommissionWithBackoffMonitor":

  • testDeadNodesRemainDecommissionInProgress
  • testRevivedDeadNodeIsDecommissioned
  • testDeadNodeWithDecommissionInProgressRemovedFromExcludeFile

For code changes:

  • [yes] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • [n/a] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • [n/a] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • [no] If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@@ -25,9 +25,11 @@
import java.util.Set;
import java.util.concurrent.ExecutionException;

import org.apache.hadoop.fs.Path;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will remove the unused imports added to this class in the next revision

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 57s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 3 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 39m 29s trunk passed
+1 💚 compile 1m 45s trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 compile 1m 35s trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 checkstyle 1m 12s trunk passed
+1 💚 mvnsite 1m 43s trunk passed
+1 💚 javadoc 1m 20s trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 1m 56s trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 spotbugs 4m 15s trunk passed
+1 💚 shadedclient 30m 27s branch has no errors when building and testing our client artifacts.
-0 ⚠️ patch 30m 51s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 41s the patch passed
+1 💚 compile 1m 38s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javac 1m 38s the patch passed
+1 💚 compile 1m 28s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 javac 1m 28s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 1m 6s /results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs-project/hadoop-hdfs: The patch generated 7 new + 159 unchanged - 0 fixed = 166 total (was 159)
+1 💚 mvnsite 1m 39s the patch passed
+1 💚 javadoc 0m 56s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 1m 31s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 spotbugs 3m 52s the patch passed
+1 💚 shadedclient 24m 6s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 236m 49s /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs in the patch passed.
+1 💚 asflicense 0m 48s The patch does not generate ASF License warnings.
357m 4s
Reason Tests
Failed junit tests hadoop.hdfs.server.namenode.TestDecommissioningStatus
hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier
hadoop.hdfs.server.namenode.TestDecommissioningStatusWithBackoffMonitor
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3746/1/artifact/out/Dockerfile
GITHUB PR #3746
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux 443350b2f5e6 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 8d48eea
Default Java Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3746/1/testReport/
Max. process+thread count 3184 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3746/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@KevinWikant
Copy link
Contributor Author

@sodonnel The existing test "TestDecommissioningStatus.testDecommissionStatusAfterDNRestart" will be problematic for this change

As previously stated, removing the dead DECOMMISSION_INPROGRESS node from the DatanodeAdminManager means that when there are no LowRedundancy blocks the dead node will remain in DECOMMISSION_INPROGRESS rather than transitioning to DECOMMISSIONED

This violates the expectation the the unit test is enforcing which is that a dead DECOMMISSION_INPROGRESS node should transition to DECOMMISSIONED when there are no LowRedundancy blocks

"Delete the under-replicated file, which should let the DECOMMISSION_IN_PROGRESS node become DECOMMISSIONED"

Therefore, I think this is a good argument to remain more in favor of the original proposed change: #3675

@KevinWikant
Copy link
Contributor Author

Closing this PR in favor of this alternate solution: #3675

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants