HDFS-17129. mis-order of ibr and fbr on datanode #6244
base: trunk
💔 -1 overall
This message was automatically generated. |
+1.
@Hexiaoqiao Hi sir, do you have time to review this? The tests have all finished without errors. Thanks. |
When BPServiceActor.blockreport() is invoked, it executes ibrManager.sendIBRs first. So, to prevent mis-ordering, we can synchronize BPServiceActor.blockreport and ibrManager.sendIBRs on the same lock. That will make the block report order look like this: |
@virajjasani Could you review it? Thanks. |
I have been reading the comments on the original ticket and mostly share the same thoughts as @Hexiaoqiao; the one case mentioned at the end was answered here, I believe: I was under the impression that when a DN pushes an FBR, all the pending IBRs are pushed first. Isn't that the case now? What is the fix: putting both IBR & BR under the lock? Was only IBR under the lock earlier? What case was the earlier lock solving? I am just curious. I am not going to merge this in general; I will let the other folks involved in the original ticket chase this. |
The DN-side heartbeat will be blocked until the FBR has finished.
With HDFS-16016, only the IBR was under the lock, to guarantee IBR atomicity. #5888 (comment) explains why the mis-ordering occurs. Thanks, @ayushtkn |
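For readers following the locking discussion, here is a minimal sketch of the idea proposed above: the IBR sender and the full block report path take the same lock, so an FBR always flushes pending IBRs first and no IBR can be sent in the middle of it. This is an illustration under stated assumptions, not the code in this PR: `ReportOrderingSketch`, `reportLock`, `addPendingIbr`, and `sentLog` are hypothetical names, and reports are modeled as plain strings rather than real NameNode RPCs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the DataNode reporting path; not the real Hadoop classes.
class ReportOrderingSketch {
    // One lock shared by the IBR path and the FBR path (assumed name).
    private final Object reportLock = new Object();
    private final List<String> pendingIbrs = new ArrayList<>();
    // Records the order in which reports would reach the NameNode.
    private final List<String> sentToNameNode = new ArrayList<>();

    /** Queue an incremental report, e.g. a block was received or deleted. */
    void addPendingIbr(String blockEvent) {
        synchronized (reportLock) {
            pendingIbrs.add(blockEvent);
        }
    }

    /** What a dedicated IBR thread would call: drain and send pending IBRs atomically. */
    void sendIbrs() {
        synchronized (reportLock) {
            for (String event : pendingIbrs) {
                sentToNameNode.add("IBR:" + event);
            }
            pendingIbrs.clear();
        }
    }

    /** Full block report: flush pending IBRs, then send the FBR, all under the same lock. */
    void blockReport(String snapshot) {
        synchronized (reportLock) {
            sendIbrs(); // Java intrinsic locks are reentrant, so this nested call is safe.
            sentToNameNode.add("FBR:" + snapshot);
        }
    }

    /** Copy of the send order, for inspection. */
    List<String> sentLog() {
        synchronized (reportLock) {
            return new ArrayList<>(sentToNameNode);
        }
    }
}
```

The cost of this design is the one raised in this thread: anything else that needs the lock waits until the full block report has finished.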
We need to discuss how to address the issue described in this pull request, which is currently blocking the release of hadoop-3.4.0 RC1. After reading discussions in #5888 and #6244, I've identified two key pull requests: #5408 (HDFS-16898. Remove write lock for processCommandFromActor of DataNode to reduce impact on heartbeat). It may take us a long time to completely solve the issue. My personal suggestion is to roll back #5408 (HDFS-16898) and #2998 (HDFS-16016). |
@slfan1989 Thanks for your work. +1 from my side. |
@Hexiaoqiao Thank you for your opinions! |
@ayushtkn How do you think we should handle this? |
If there is no quick fix, I am good with reverting & reopening the tickets which caused it rather than holding the RC |
Why should we consider reverting HDFS-16898? The original problem of IBR conflicting with the heartbeat was significant in our environment. By backporting HDFS-16016, we managed to improve the situation. There's a possibility that HDFS-16898 could resolve this issue as well, according to the discussion in HDFS-16016. If the mis-order is indeed caused by HDFS-16016, wouldn't reverting just HDFS-16016 be enough? I apologize if I'm wrong, I'm in the middle of reviewing related PR discussions. |
I'd like to share our situation in more detail. In my company, we are using hadoop-3.3.0 with many patches. We had a problem with IBR becoming very slow, which is exactly the issue described in HDFS-16898. Then, we backported HDFS-16016 and the problem was solved (HDFS-16898 didn't exist at that time). Now, I have confirmed that our problem is fixed by HDFS-16898 without HDFS-16016. So, I would suggest only reverting HDFS-16016 for now. It may no longer be needed. |
@tasanuma Thanks for your feedback! It's very valuable. I will only revert HDFS-16016. cc: @Hexiaoqiao @ayushtkn |
Description of PR
HDFS-16016 provides a new thread to handle IBRs. That is a great improvement, but it may cause mis-ordering of IBRs and FBRs.