-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDFS-16898. Remove write lock for processCommandFromActor of DataNode to reduce impact on heartbeat #5330
Conversation
hi, @srinivasst,@ayushtkn ,@goiri, @tomscut . could you please help me review the code ? does this modification is right? |
💔 -1 overall
This message was automatically generated. |
75f1c4c
to
81cc655
Compare
💔 -1 overall
This message was automatically generated. |
Thanks for the PR @hfutatzhanghb. |
if (actor == bpServiceToActive) { | ||
return processCommandFromActive(cmd, actor); | ||
} else { | ||
return processCommandFromStandby(cmd, actor); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even before HDFS-6788, this part was at least covered by synchronized lock on the actor thread
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @virajjasani . thanks for your careful review. Surely, before HDFS-6788, this part was covered by synchronized lock.
but in method processCommandFromActive
and processCommandFromStandby
, it just use the parameter actor to print log info. The lock here is just trying to decide actor is whether bpServiceToActive or not and determine to execute either processCommandFromActive or processCommandFromStandby.
when occurs switchover between active namenode and standby namenode, the datanodes would be set to stale status, in stale status, we are not allowed to delete blocks directly, we put those blocks into postponedMisreplicatedBlocks. So, even we execute the DatanodeCommand from the previous active namenode(now standby), it is okay.
hi, @virajjasani . Thanks for your replying. Some logs are like below: and we grep some logs as below: we can draw a conclusion that the execution time of processCommandFromActor method is very high, even more than 119 seconds. And in processCommandFromActor method, it uses the write lock which is the same one as updateActorStatesFromHeartbeat method used. The updateActorStatesFromHeartbeat method is in offerService method, so this could hang the hearbeat thread. In our production cluster, we have use this feature, it works well. |
81cc655
to
3d74257
Compare
@hfutatzhanghb Thanks for your works here. Totally agree issue it caused, however I am concerned about if it is safe to move process command out of write lock. do you have any lock analysis here? Thanks again. |
@Hexiaoqiao , thank for your replying~, I will try to draw some pictures to describe it soonly. |
Great. It will be more helpful to push this improvement forward. |
💔 -1 overall
This message was automatically generated. |
Thank you @hfutatzhanghb. I did a quick glance and we don't hit this log line in our clusters so far but this PR has interesting fix. I will check this further for any more resource contention. |
In the meantime, I have two nits if you would like to consider:
With WARN level, it will likely come up front while debugging any slowness issues. |
It is great to prevent the heartbeat from being affected by command processing. I checked that processCommandFromXXX() doesn't access any members inside BPOfferService that can be changed. |
@virajjasani , i totally agree with your opinions. i will modify the code laterly. |
3d74257
to
a1cc014
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. +1 from my side. Let's wait what will Yetus say.
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 (non-binding), there is no resource contention within processCommandFromActive and processCommandFromStandby as they don't access Namesapce info, BP actor instance that represent active namenode, registration info etc. We should be safe here.
Thanks @hfutatzhanghb for addressing previous comment. Would be good to backport to 3.3
The failed unit test |
… to reduce impact on heartbeat (#5330). Contributed by ZhangHB. Reviewed-by: zhangshuyan <[email protected]> Reviewed-by: Viraj Jasani <[email protected]> Signed-off-by: He Xiaoqiao <[email protected]>
The git message is not correct. Try to revert and committed again. |
@hfutatzhanghb This PR could not cherrypick to branch-3.3 smoothly. Would you mind to submit another PR for branch-3.3? |
@Hexiaoqiao , done~, please have a look. thanks. |
…tor (apache#5330). Contributed by ZhangHB. Reviewed-by: zhangshuyan <[email protected]> Reviewed-by: Viraj Jasani <[email protected]> Signed-off-by: He Xiaoqiao <[email protected]>
…ndFromActor (apache#5330). Contributed by ZhangHB." This reverts commit eb04ecd.
… to reduce impact on heartbeat (apache#5330). Contributed by ZhangHB. Reviewed-by: zhangshuyan <[email protected]> Reviewed-by: Viraj Jasani <[email protected]> Signed-off-by: He Xiaoqiao <[email protected]>
Hi @Hexiaoqiao , sir, the original discuss is here~ please take a look, thanks a lot for your reviewing. |
…r of DataNode to reduce impact on heartbeat (apache#5330)
Now in method processCommandFromActor, we have code like below:
if method processCommandFromActive costs much time, the write lock would not release.
It maybe block the
updateActorStatesFromHeartbeat
method in offerService,furthermore, it can cause the lastcontact of datanode very high, even dead when lastcontact beyond 600s.here we can make write lock fine-grain in processCommandFromActor method to address this problem