HDFS-16513. [SBN read] Observer Namenode should not trigger the edits rolling of active Namenode #4087

Open
wants to merge 2 commits into
base: trunk

Contributor

@tomscut tomscut commented Mar 21, 2022

JIRA: HDFS-16513.

To avoid frequent edits rolling, we should stop the Observer Namenode (OBN) from triggering edits rolling on the active Namenode (ANN).

It is sufficient to keep only the triggering by the Standby Namenode (SNN) and the automatic rolling by the ANN itself.
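
A minimal sketch of the intended behaviour, not the actual patch. The names `LogRollGate`, `NodeState`, and `shouldTriggerActiveLogRoll` are hypothetical, and the assumption is that a node asks the active to roll only after it has gone at least one roll period without loading edits, with an Observer excluded from that path:

```java
public class LogRollGate {
  /** Hypothetical stand-in for the HA service states discussed in this PR. */
  public enum NodeState { ACTIVE, STANDBY, OBSERVER }

  /**
   * Returns true if this node should ask the active Namenode to roll its edit log.
   * Assumption: only a Standby that has not loaded edits for at least
   * rollPeriodMs triggers a roll; an Observer never does.
   */
  public static boolean shouldTriggerActiveLogRoll(
      NodeState state, long msSinceLastLoad, long rollPeriodMs) {
    if (state != NodeState.STANDBY) {
      return false; // Observer (and Active) never trigger a remote roll
    }
    return msSinceLastLoad >= rollPeriodMs;
  }

  public static void main(String[] args) {
    long period = 120_000L; // illustrative roll period in milliseconds
    System.out.println(shouldTriggerActiveLogRoll(NodeState.STANDBY, 130_000L, period));
    System.out.println(shouldTriggerActiveLogRoll(NodeState.OBSERVER, 130_000L, period));
  }
}
```

With this shape, the Standby path is unchanged and only the Observer is filtered out, which is the narrow change this PR proposes.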

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:----------|--------:|:--------|:--------|
| +0 🆗 | reexec | 0m 51s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 ❌ | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| | | | | _ trunk Compile Tests _ |
| +1 💚 | mvninstall | 35m 29s | | trunk passed |
| +1 💚 | compile | 1m 31s | | trunk passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04 |
| +1 💚 | compile | 1m 26s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | checkstyle | 1m 1s | | trunk passed |
| +1 💚 | mvnsite | 1m 32s | | trunk passed |
| +1 💚 | javadoc | 1m 9s | | trunk passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04 |
| +1 💚 | javadoc | 1m 34s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | spotbugs | 3m 16s | | trunk passed |
| +1 💚 | shadedclient | 23m 44s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 1m 22s | | the patch passed |
| +1 💚 | compile | 1m 25s | | the patch passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04 |
| +1 💚 | javac | 1m 25s | | the patch passed |
| +1 💚 | compile | 1m 15s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | javac | 1m 15s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 💚 | checkstyle | 0m 52s | | the patch passed |
| +1 💚 | mvnsite | 1m 23s | | the patch passed |
| +1 💚 | javadoc | 0m 55s | | the patch passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04 |
| +1 💚 | javadoc | 1m 21s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | spotbugs | 3m 29s | | the patch passed |
| +1 💚 | shadedclient | 24m 1s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 💚 | unit | 237m 34s | | hadoop-hdfs in the patch passed. |
| +1 💚 | asflicense | 0m 46s | | The patch does not generate ASF License warnings. |
| | | 343m 30s | | |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4087/1/artifact/out/Dockerfile |
| GITHUB PR | #4087 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
| uname | Linux 956ddb79f286 4.15.0-169-generic #177-Ubuntu SMP Thu Feb 3 10:50:38 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 0cf9128 |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4087/1/testReport/ |
| Max. process+thread count | 2880 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4087/1/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |

This message was automatically generated.

@tomscut tomscut changed the title HDFS-16513. [SBN read] Observer Namenode does not trigger the edits r… HDFS-16513. [SBN read] Observer Namenode does not trigger the edits rolling of active Namenode Mar 22, 2022
@tomscut tomscut changed the title HDFS-16513. [SBN read] Observer Namenode does not trigger the edits rolling of active Namenode HDFS-16513. [SBN read] Observer Namenode should not trigger the edits rolling of active Namenode Mar 28, 2022
@tomscut
Contributor Author

tomscut commented Mar 28, 2022

Hi @xkrogen @sunchao @tamaashu @ayushtkn @ferhui @virajjasani , please take a look at this. Thanks.

@tomscut
Contributor Author

tomscut commented Mar 30, 2022

> To avoid frequent edits rolling, we should disable OBN from triggering the edits rolling of active Namenode.

Hi @sunchao, please have a look at this. Thank you very much.

Member

@sunchao sunchao left a comment

I think this is pretty similar to https://issues.apache.org/jira/browse/HDFS-14378. If I remember correctly, the better approach would be to have ANN roll its own edit logs.

Even though we address the observer issue here, in a real scenario there could still be multiple SNNs.

@tomscut
Contributor Author

tomscut commented Mar 31, 2022

> I think this is pretty similar to https://issues.apache.org/jira/browse/HDFS-14378. If I remember correctly, the better approach would be to have ANN roll its own edit logs.
>
> Even though we address the observer issue here, in a real scenario there could still be multiple SNNs.

Thank you @sunchao very much for your review.

The active Namenode does roll its logs automatically at intervals. Simply stopping all SNNs from triggering edit log rolls on the active might be risky (see HDFS-2737). However, stopping the OBN from triggering edit log rolls on the active has no side effects. What do you think?

@tomscut
Contributor Author

tomscut commented Apr 6, 2022

Hi @xkrogen, could you please also take a look? Thanks.

Contributor

@xkrogen xkrogen left a comment

Hi @tomscut, sorry for the delay in my response.

I am inclined to agree with @sunchao that the approach laid out in HDFS-14378 is a better long-term solution.

> It might be risky (see HDFS-2737) to simply stop all SNNs from triggering edit log rolls on the active.

Can you clarify what from HDFS-2737 makes you feel that it is risky? I skimmed the discussion and didn't notice anything alarming. You may also want to see this comment on HDFS-14378 where this same point was discussed.

That all being said, I think this PR may be a good step in the interim, since HDFS-14378 is a more substantial change. I would appreciate some other opinions, though.
cc @simbadzina @aajisaka @shvachko

@@ -1938,6 +1938,14 @@ public boolean isInStandbyState() {
        HAServiceState.OBSERVER == haContext.getState().getServiceState();
  }

  public boolean isInObserverState() {
    if (haContext == null || haContext.getState() == null) {
      return haEnabled;
Contributor

Seems like this was probably copied from isInStandbyState()? But I don't think it's right. If we can't find a state, we assume STANDBY state. If we assume STANDBY state because a valid state could not be found, then isInObserverState() should be false. So I think we should just return false here.

Contributor Author

> Seems like this was probably copied from isInStandbyState()? But I don't think it's right. If we can't find a state, we assume STANDBY state. If we assume STANDBY state because a valid state could not be found, then isInObserverState() should be false. So I think we should just return false here.

I agree with you. Thanks.
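
The agreed behaviour can be sketched as below. This is an illustration only: `ObserverStateCheck`, `ServiceState`, and `Context` are hypothetical, simplified stand-ins (the real code reads the service state through `haContext.getState().getServiceState()`), and the point is just the fallback value when no state is available.

```java
public class ObserverStateCheck {
  /** Hypothetical stand-in for Hadoop's HAServiceState enum. */
  public enum ServiceState { ACTIVE, STANDBY, OBSERVER }

  /** Hypothetical, simplified HA context: holds the current state, which may be null. */
  public static final class Context {
    private final ServiceState state;

    public Context(ServiceState state) {
      this.state = state;
    }

    public ServiceState getState() {
      return state;
    }
  }

  /**
   * Reports whether the node is an Observer. When no state is available the
   * node is treated as a plain Standby, so the answer is false rather than
   * the haEnabled fallback used by isInStandbyState().
   */
  public static boolean isInObserverState(Context haContext) {
    if (haContext == null || haContext.getState() == null) {
      return false;
    }
    return haContext.getState() == ServiceState.OBSERVER;
  }

  public static void main(String[] args) {
    System.out.println(isInObserverState(null));
    System.out.println(isInObserverState(new Context(ServiceState.OBSERVER)));
    System.out.println(isInObserverState(new Context(ServiceState.STANDBY)));
  }
}
```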

@tomscut
Contributor Author

tomscut commented Apr 13, 2022

> Hi @tomscut, sorry for the delay in my response.
>
> I am inclined to agree with @sunchao that the approach laid out in HDFS-14378 is a better long-term solution.
>
> > It might be risky (see HDFS-2737) to simply stop all SNNs from triggering edit log rolls on the active.
>
> Can you clarify what from HDFS-2737 makes you feel that it is risky? I skimmed the discussion and didn't notice anything alarming. You may also want to see this comment on HDFS-14378 where this same point was discussed.
>
> That all being said, I think this PR may be a good step in the interim, since HDFS-14378 is a more substantial change. I would appreciate some other opinions, though. cc @simbadzina @aajisaka @shvachko

Thank you @xkrogen very much for your comments.
It is mentioned in the description of HDFS-2737:

> Currently, the edit log tailing process can only read finalized log segments. So, if the active NN is not rolling its logs periodically, the SBN will lag a lot. This also causes many datanode messages to be queued up in the PendingDatanodeMessage structure.
>
> To combat this, the active NN needs to roll its logs periodically – perhaps based on a time threshold, or perhaps based on a number of transactions. I'm not sure yet whether it's better to have the NN roll on its own or to have the SBN ask the active NN to roll its logs.

The pendingDatanodeMessage issue mentioned here strikes me as a bit risky. However, since SBN read was added, the journal supports reading in-progress edits. If in-progress reads are enabled, then even if we stop all SNNs from triggering edit rolls, the pendingDatanodeMessage problem is not too serious.

I would also appreciate some other opinions.

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:----------|--------:|:--------|:--------|
| +0 🆗 | reexec | 0m 44s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 ❌ | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| | | | | _ trunk Compile Tests _ |
| +1 💚 | mvninstall | 43m 23s | | trunk passed |
| +1 💚 | compile | 1m 42s | | trunk passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04 |
| +1 💚 | compile | 1m 33s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | checkstyle | 1m 13s | | trunk passed |
| +1 💚 | mvnsite | 1m 37s | | trunk passed |
| +1 💚 | javadoc | 1m 9s | | trunk passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04 |
| +1 💚 | javadoc | 1m 37s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | spotbugs | 3m 57s | | trunk passed |
| +1 💚 | shadedclient | 26m 30s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 1m 24s | | the patch passed |
| +1 💚 | compile | 1m 30s | | the patch passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04 |
| +1 💚 | javac | 1m 30s | | the patch passed |
| +1 💚 | compile | 1m 21s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | javac | 1m 21s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 💚 | checkstyle | 1m 0s | | the patch passed |
| +1 💚 | mvnsite | 1m 30s | | the patch passed |
| +1 💚 | javadoc | 0m 55s | | the patch passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04 |
| +1 💚 | javadoc | 1m 33s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | spotbugs | 3m 52s | | the patch passed |
| +1 💚 | shadedclient | 26m 3s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 💚 | unit | 240m 42s | | hadoop-hdfs in the patch passed. |
| +1 💚 | asflicense | 0m 49s | | The patch does not generate ASF License warnings. |
| | | 360m 33s | | |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4087/2/artifact/out/Dockerfile |
| GITHUB PR | #4087 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
| uname | Linux ec3b877e8bf4 4.15.0-169-generic #177-Ubuntu SMP Thu Feb 3 10:50:38 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 0b1946b |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4087/2/testReport/ |
| Max. process+thread count | 3012 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4087/2/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |

This message was automatically generated.

@xkrogen
Contributor

xkrogen commented Apr 13, 2022

> The pendingDatanodeMessage issue mentioned here strikes me as a bit risky. ...

I'm not following. The issue described from HDFS-2737 says that "if the active NN is not rolling its logs periodically ... many datanode messages [will] be queued up in the PendingDatanodeMessage structure". Certainly it is bad if we don't have a way to ensure that the logs are rolled regularly. But HDFS-14378 just proposes making the ANN roll its own edit logs, instead of relying on the SbNN to roll them. I don't see the risk -- we are still ensuring that the logs are rolled periodically, just triggered by the ANN itself instead of the SbNN.
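
To make "the ANN rolls its own edit logs" concrete, here is a rough sketch of a self-triggered roll check. It is not taken from this PR or from HDFS-14378: `AutoRollSketch` and `shouldAutoRoll` are hypothetical names, and the rule shown (roll once the open segment exceeds a multiple of the checkpoint transaction count) is an assumption modelled loosely on the `dfs.namenode.edit.log.autoroll.multiplier.threshold` and `dfs.namenode.checkpoint.txns` settings. A time-based trigger would have the same shape with elapsed time in place of the transaction count.

```java
public class AutoRollSketch {
  /**
   * Assumed rule: the active rolls its own edit log once the open segment
   * holds more transactions than multiplier * checkpointTxns.
   */
  public static boolean shouldAutoRoll(
      long txnsInOpenSegment, double multiplier, long checkpointTxns) {
    long rollThreshold = (long) (multiplier * checkpointTxns);
    return txnsInOpenSegment > rollThreshold;
  }

  public static void main(String[] args) {
    // Illustrative values only: threshold is 0.5 * 1,000,000 = 500,000 transactions.
    System.out.println(shouldAutoRoll(600_000L, 0.5, 1_000_000L));
    System.out.println(shouldAutoRoll(400_000L, 0.5, 1_000_000L));
  }
}
```

Either way, the decision is made locally on the active, so it does not depend on how many Standby or Observer nodes are running.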

@tomscut
Contributor Author

tomscut commented Apr 14, 2022

Thank you @xkrogen for your detailed explanation. I left out some information. You are right.

I thought the ANN's automatic edit-rolling feature came first, and that the question of whether to let the SNN trigger the ANN to roll edits was discussed afterwards. I got the order of the two wrong.

I also thought that "if the active NN is not rolling its logs periodically" referred to cases where the configured roll interval is very large, or where EditLogTailerThread exits because of some unknown exception, so that the ANN cannot roll its logs normally. In that reading, letting the SNN trigger the ANN to roll edits simply adds another layer of assurance. I made a mistake here.

@tomscut
Contributor Author

tomscut commented Apr 14, 2022

In summary, at this stage, should we first disable triggerActiveLogRoll for the OBN only, or disable it for all SNNs directly?

@xkrogen @sunchao I look forward to your discussion. Thanks a lot.
