HDFS-17129. mis-order of ibr and fbr on datanode #6244

LiuGuH · 2023-11-02T09:47:52Z

Description of PR

HDFS-16016 , provide new thread to handler IBR. That is a greate improvement. But it maybe casue the mis-order of ibr and fbr

hadoop-yetus · 2023-11-02T14:57:17Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 30s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 0s		codespell was not available.
+0 🆗	detsecrets	0m 0s		detect-secrets was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
-1 ❌	test4tests	0m 0s		The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
			_ trunk Compile Tests _
+1 💚	mvninstall	37m 21s		trunk passed
+1 💚	compile	0m 58s		trunk passed with JDK Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04
+1 💚	compile	0m 54s		trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
+1 💚	checkstyle	0m 52s		trunk passed
+1 💚	mvnsite	0m 57s		trunk passed
+1 💚	javadoc	0m 54s		trunk passed with JDK Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04
+1 💚	javadoc	1m 24s		trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
+1 💚	spotbugs	2m 13s		trunk passed
+1 💚	shadedclient	27m 10s		branch has no errors when building and testing our client artifacts.
			_ Patch Compile Tests _
+1 💚	mvninstall	0m 48s		the patch passed
+1 💚	compile	0m 51s		the patch passed with JDK Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04
+1 💚	javac	0m 51s		the patch passed
+1 💚	compile	0m 46s		the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
+1 💚	javac	0m 46s		the patch passed
+1 💚	blanks	0m 0s		The patch has no blanks issues.
+1 💚	checkstyle	0m 39s		the patch passed
+1 💚	mvnsite	0m 50s		the patch passed
+1 💚	javadoc	0m 42s		the patch passed with JDK Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04
+1 💚	javadoc	1m 17s		the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
+1 💚	spotbugs	2m 17s		the patch passed
+1 💚	shadedclient	22m 50s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
-1 ❌	unit	201m 40s	/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt	hadoop-hdfs in the patch passed.
+1 💚	asflicense	0m 36s		The patch does not generate ASF License warnings.
		308m 9s

Reason	Tests
Failed junit tests	hadoop.hdfs.TestDFSUtil

Subsystem	Report/Notes
Docker	ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6244/1/artifact/out/Dockerfile
GITHUB PR	#6244
Optional Tests	dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname	Linux 92e94f4dd354 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	trunk / `2508505`
Default Java	Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
Multi-JDK versions	/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
Test Results	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6244/1/testReport/
Max. process+thread count	3639 (vs. ulimit of 5500)
modules	C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6244/1/console
versions	git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by	Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

yuanboliu · 2023-11-06T02:24:41Z

+1.
We've seen that a block is marked as missing block because of the mis-order after applying HDFS-16016. This bug can be reproduced as below:

client write block to dn1 and dn2 as pipeline
dn1/dn2 has over 1 million blocks and fbr is reported one by one disk.
when fbr is being reported, client finish writing, dn1/dn2 starts to trigger ibr in different thread.
nn receives all the operations and because of the namesystem write lock, the operations are executed in line as dn1-ibr, dn2-ibr, dn1-fbr-disk, dn2-fbr-disk. After this, the block is marked as a missing block.

hadoop-yetus · 2023-11-14T06:51:49Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 28s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 1s		codespell was not available.
+0 🆗	detsecrets	0m 1s		detect-secrets was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
-1 ❌	test4tests	0m 0s		The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
			_ trunk Compile Tests _
+1 💚	mvninstall	40m 6s		trunk passed
+1 💚	compile	0m 58s		trunk passed with JDK Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04
+1 💚	compile	0m 53s		trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
+1 💚	checkstyle	0m 50s		trunk passed
+1 💚	mvnsite	0m 56s		trunk passed
+1 💚	javadoc	0m 54s		trunk passed with JDK Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04
+1 💚	javadoc	1m 24s		trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
+1 💚	spotbugs	2m 22s		trunk passed
+1 💚	shadedclient	28m 15s		branch has no errors when building and testing our client artifacts.
			_ Patch Compile Tests _
+1 💚	mvninstall	0m 40s		the patch passed
+1 💚	compile	0m 51s		the patch passed with JDK Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04
+1 💚	javac	0m 51s		the patch passed
+1 💚	compile	0m 44s		the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
+1 💚	javac	0m 44s		the patch passed
+1 💚	blanks	0m 0s		The patch has no blanks issues.
+1 💚	checkstyle	0m 38s		the patch passed
+1 💚	mvnsite	0m 43s		the patch passed
+1 💚	javadoc	0m 40s		the patch passed with JDK Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04
+1 💚	javadoc	1m 13s		the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
+1 💚	spotbugs	2m 0s		the patch passed
+1 💚	shadedclient	24m 0s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
+1 💚	unit	193m 58s		hadoop-hdfs in the patch passed.
+1 💚	asflicense	0m 35s		The patch does not generate ASF License warnings.
		304m 29s

Subsystem	Report/Notes
Docker	ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6244/2/artifact/out/Dockerfile
GITHUB PR	#6244
Optional Tests	dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname	Linux 8314d376cf74 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	trunk / `272e27d`
Default Java	Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
Multi-JDK versions	/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05
Test Results	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6244/2/testReport/
Max. process+thread count	3578 (vs. ulimit of 5500)
modules	C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6244/2/console
versions	git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by	Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

LiuGuH · 2023-11-14T07:02:45Z

@Hexiaoqiao , hi sir , do you have time to review it ? The Test Results are all finished without errors. Thanks

LiuGuH · 2023-11-14T07:48:35Z

When BPServiceActor.blockreport() invokes , it executes ibrManager.sendIBRs first. So to prevent mis-order, we can use synchronized with BPServiceActor.blockreport and ibrManager.sendIBRs.

It will make block report order like that:
sendIBRs ,sendIBRs ,BPServiceActor.blockreport(sendIBRs, fbr) ,sendIBRs ,sendIBRs
So with this we can make the right order with blockreport.

LiuGuH · 2023-11-15T03:16:40Z

@virajjasani could you reivew it ?Thanks

ayushtkn · 2023-11-15T08:19:16Z

I have been reading the comments on the original ticket, and mostly share same thought as @Hexiaoqiao, the one case in the end mentioned was answered here, I believe:
https://issues.apache.org/jira/browse/HDFS-16016?focusedCommentId=17783162&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17783162
and the reporter I believe tried without HDFS-16898? In general I didn't see IBR/FBR chocking at Datanode side, FBR anyway is very low frequency, the only place FBR creates problem is at NN side, when almost all DN shoot at one time, there is a logic to handle this scenario as well, Anyway.....

I was under the impression like when a DN pushes a FBR, all the pending IBR are pushed before? it isn't that the case now?

What is the fix, put both IBR & BR under lock? earlier only IBR was under lock or what? what case was the earlier lock solving?

I am just curious, I am not gonna merge this in general, will let other folks involved in the original ticket chase this..

LiuGuH · 2023-11-15T08:59:10Z

the only place FBR creates problem is at NN side

DN side heartbeat will be blocked untils fbr finished .
With large cold data storage disk in one datanode, DN log like that:
INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sentblock report 0x587d5aa72264fafc with lease ID 0x979291640945f45 to namenode: hostname/ip:8022, containing 42 storage report(s), of which we sent 42. The reports had 15114085 total blocks and used 42 RPC(s). This took 3381 msecs to generate and 25967 msecs for RPC and NN processing.
It may block 30s and if with more blocks and dn is busy with read and write , it will take longer time.

earlier only IBR was under lock or what? what case was the earlier lock solving?

With HDFS-16016 , only IBR was under lock to guarantee ibr atomicity. And #5888 (comment) explains why mis-order

Thanks. @ayushtkn

slfan1989 · 2024-01-15T05:40:13Z

We need to discuss how to address the issue described in this pull request, which is currently blocking the release of hadoop-3.4.0 RC1.

After reading discussions in #5888 and #6244, I've identified two key pull requests:

#5408 (HDFS-16898. Remove write lock for processCommandFromActor of DataNode to reduce impact on heartbeat).
#2998 (HDFS-16016. BPServiceActor to provide a new thread to handle IBR).

It may take our a long time to completely solve the issue. My personal suggestion is to roll back #5408 (HDFS-16898) and #2998 (HDFS-16016).

@ayushtkn @Hexiaoqiao @virajjasani

cc: @LiuGuH @6450

Hexiaoqiao · 2024-01-15T06:40:31Z

It may take me a long time to completely solve the issue. My personal suggestion is to roll back #5408 (HDFS-16898) and #2998 (HDFS-16016).

@slfan1989 Thanks for your works. +1 from my side.

slfan1989 · 2024-01-15T13:43:10Z

It may take me a long time to completely solve the issue. My personal suggestion is to roll back #5408 (HDFS-16898) and #2998 (HDFS-16016).

@slfan1989 Thanks for your works. +1 from my side.

@Hexiaoqiao Thank you for opinions!

slfan1989 · 2024-01-15T13:44:38Z

@ayushtkn How do you think we should handle this?

ayushtkn · 2024-01-15T13:49:32Z

If there is no quick fix, I am good with reverting & reopening the tickets which caused it rather than holding the RC

tasanuma · 2024-01-15T14:02:43Z

Why should we consider reverting HDFS-16898? The original problem of IBR conflicting with the heartbeat was significant in our environment. By backporting HDFS-16016, we managed to improve the situation. There's a possibility that HDFS-16898 could resolve this issue as well, according to the discussion in HDFS-16016. If the mis-order is indeed caused by HDFS-16016, wouldn't reverting just HDFS-16016 be enough? I apologize if I'm wrong, I'm in the middle of reviewing related PR discussions.

tasanuma · 2024-01-16T04:45:12Z

I'd like to share our situation in more detail. In my company, we are using hadoop-3.3.0 with many patches. We had a problem with IBR becoming very slow, which is exactly the issue described in HDFS-16898. Then, we backported HDFS-16016 and the problem was solved (HDFS-16898 didn't exist at that time).

Now, I have confirmed that our problem is fixed by HDFS-16898 without HDFS-16016. So, I would suggest only reverting HDFS-16016 for now. It may no longer be needed.

slfan1989 · 2024-01-16T07:08:10Z

I'd like to share our situation in more detail. In my company, we are using hadoop-3.3.0 with many patches. We had a problem with IBR becoming very slow, which is exactly the issue described in HDFS-16898. Then, we backported HDFS-16016 and the problem was solved (HDFS-16898 didn't exist at that time).

Now, I have confirmed that our problem is fixed by HDFS-16898 without HDFS-16016. So, I would suggest only reverting HDFS-16016 for now. It may no longer be needed.

@tasanuma Thanks for your feedback! It's very valuable. I will only revert HDFS-16016.

cc: @Hexiaoqiao @ayushtkn

github-actions bot added HDFS trunk labels Nov 2, 2023

fix mis-order of ibr and fbr.

272e27d

LiuGuH force-pushed the trunk-HDFS-17129-2 branch from 2508505 to 272e27d Compare November 14, 2023 01:46

ayushtkn requested review from Hexiaoqiao, slfan1989 and tasanuma January 12, 2024 21:21

slfan1989 mentioned this pull request Jan 17, 2024

HDFS-16016. BPServiceActor to provide new thread to handle IBR #2998

Merged

tasanuma mentioned this pull request Jan 17, 2024

HDFS-17343. Revert "HDFS-16016. BPServiceActor to provide new thread to handle IBR (#2998)" #6457

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HDFS-17129. mis-order of ibr and fbr on datanode #6244

HDFS-17129. mis-order of ibr and fbr on datanode #6244

LiuGuH commented Nov 2, 2023

hadoop-yetus commented Nov 2, 2023

yuanboliu commented Nov 6, 2023

hadoop-yetus commented Nov 14, 2023

LiuGuH commented Nov 14, 2023 •

edited

Loading

LiuGuH commented Nov 14, 2023 •

edited

Loading

LiuGuH commented Nov 15, 2023

ayushtkn commented Nov 15, 2023

LiuGuH commented Nov 15, 2023 •

edited

Loading

slfan1989 commented Jan 15, 2024 •

edited

Loading

Hexiaoqiao commented Jan 15, 2024

slfan1989 commented Jan 15, 2024

slfan1989 commented Jan 15, 2024

ayushtkn commented Jan 15, 2024

tasanuma commented Jan 15, 2024

tasanuma commented Jan 16, 2024 •

edited

Loading

slfan1989 commented Jan 16, 2024 •

edited

Loading

HDFS-17129. mis-order of ibr and fbr on datanode #6244

Are you sure you want to change the base?

HDFS-17129. mis-order of ibr and fbr on datanode #6244

Conversation

LiuGuH commented Nov 2, 2023

Description of PR

hadoop-yetus commented Nov 2, 2023

yuanboliu commented Nov 6, 2023

hadoop-yetus commented Nov 14, 2023

LiuGuH commented Nov 14, 2023 • edited Loading

LiuGuH commented Nov 14, 2023 • edited Loading

LiuGuH commented Nov 15, 2023

ayushtkn commented Nov 15, 2023

LiuGuH commented Nov 15, 2023 • edited Loading

slfan1989 commented Jan 15, 2024 • edited Loading

Hexiaoqiao commented Jan 15, 2024

slfan1989 commented Jan 15, 2024

slfan1989 commented Jan 15, 2024

ayushtkn commented Jan 15, 2024

tasanuma commented Jan 15, 2024

tasanuma commented Jan 16, 2024 • edited Loading

slfan1989 commented Jan 16, 2024 • edited Loading

LiuGuH commented Nov 14, 2023 •

edited

Loading

LiuGuH commented Nov 14, 2023 •

edited

Loading

LiuGuH commented Nov 15, 2023 •

edited

Loading

slfan1989 commented Jan 15, 2024 •

edited

Loading

tasanuma commented Jan 16, 2024 •

edited

Loading

slfan1989 commented Jan 16, 2024 •

edited

Loading