fix issue #264: private network halts randomly #323
Merged
Proposed changes
This PR fixes issue #264.
Types of changes
What types of changes does your code introduce to XDC network?
Impacted Components
Which part of the codebase this PR will touch base on,
Checklist
Comments
About issue #264
Issue #264 reported that our XDC private network halts after a few days because some nodes stop producing blocks at random times. The problem usually appears after about 2 weeks of running (minimum 2 days, maximum 3 months). I found the root cause after extended testing and debugging.
The function `syncWithPeer` posts some events
The function `syncWithPeer` in the downloader posts a `StartEvent` when it starts, and a `DoneEvent` or `FailedEvent` when it exits.
https://github.com/XinFinOrg/XDPoSChain/blob/master/eth/downloader/downloader.go#L410-L419
The function `update` sets the variable `self.mining` according to the events from the function `syncWithPeer`
The miner is stopped when it receives a `StartEvent`, and the variable `self.mining` is set to 0. The miner is started again when it receives a `DoneEvent` or `FailedEvent`.
https://github.com/XinFinOrg/XDPoSChain/blob/master/miner/miner.go#L85-L127
The function `commitNewWork` exits when `self.mining` equals 0
However, the function `syncWithPeer` sometimes blocks: it posts the `StartEvent` but never posts a `DoneEvent` or `FailedEvent`. This leaves the miner stopped forever. When it is time to produce a block, the miner returns early from the function `commitNewWork` because `self.mining` is 0:
https://github.com/XinFinOrg/XDPoSChain/blob/master/miner/worker.go#L520-L544
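The chain described so far (a start event pauses the miner, and only a done or failed event resumes it) can be sketched in a few lines. This is a minimal model and not the XDPoSChain code: the `miner` struct, the `handle` method, the `blocks` counter and the lowercase event names are hypothetical stand-ins chosen for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// event models the three downloader events named in this PR.
type event int

const (
	startEvent event = iota
	doneEvent
	failedEvent
)

// miner is a hypothetical, stripped-down stand-in for the real miner.
type miner struct {
	mining int32 // 1 = may produce blocks, 0 = paused while syncing
	blocks int   // blocks "produced" in this sketch
}

// handle pauses the miner on a start event and resumes it on done or failed.
func (m *miner) handle(ev event) {
	switch ev {
	case startEvent:
		atomic.StoreInt32(&m.mining, 0)
	case doneEvent, failedEvent:
		atomic.StoreInt32(&m.mining, 1)
	}
}

// commitNewWork returns early while mining is 0, so no block is produced.
func (m *miner) commitNewWork() bool {
	if atomic.LoadInt32(&m.mining) == 0 {
		return false
	}
	m.blocks++
	return true
}

func main() {
	m := &miner{mining: 1}

	// The sync starts but never reports that it finished.
	m.handle(startEvent)
	for i := 0; i < 3; i++ {
		m.commitNewWork()
	}
	fmt.Println("blocks produced while stuck:", m.blocks)

	// Only a done or failed event lets the miner work again.
	m.handle(doneEvent)
	m.commitNewWork()
	fmt.Println("blocks produced after the done event:", m.blocks)
}
```

In this model, a missing done or failed event is enough to keep `commitNewWork` returning early indefinitely, which matches the halted-node symptom.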
The function `syncWithPeer` calls the function `spawnSync`
The function `syncWithPeer` calls the function `spawnSync`:
https://github.com/XinFinOrg/XDPoSChain/blob/master/eth/downloader/downloader.go#L410-L481
The function `spawnSync`
The function `spawnSync` calls the functions:
https://github.com/XinFinOrg/XDPoSChain/blob/master/eth/downloader/downloader.go#L485-L513
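As a rough illustration of this pattern, the sketch below runs several functions concurrently and collects one result per function from an error channel. It is a simplified model, not the code behind the link: `runAll`, `errStuck` and `demoTimeout` are hypothetical names, and the timeout is an addition of this sketch so that the example terminates when a function never reports back.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// demoTimeout and errStuck exist only in this sketch.
const demoTimeout = 200 * time.Millisecond

var errStuck = errors.New("a function never reported its result")

// runAll starts every function in its own goroutine and then receives
// one result per function from errc, stopping at the first error.
func runAll(fns []func() error, timeout time.Duration) error {
	errc := make(chan error, len(fns))
	for _, fn := range fns {
		fn := fn // capture the loop variable for the goroutine
		go func() { errc <- fn() }()
	}
	for i := 0; i < len(fns); i++ {
		select {
		case err := <-errc:
			if err != nil {
				return err
			}
		case <-time.After(timeout):
			// Without this case the receive would block forever.
			return fmt.Errorf("waiting for result %d: %w", i, errStuck)
		}
	}
	return nil
}

func main() {
	ok := func() error { return nil }
	block := make(chan struct{})
	stuck := func() error { <-block; return nil }

	fmt.Println("all return:", runAll([]func() error{ok, ok}, demoTimeout))

	err := runAll([]func() error{ok, stuck}, demoTimeout)
	fmt.Println("one never returns, stuck:", errors.Is(err, errStuck))

	close(block) // release the blocked goroutine before exiting
}
```

If the timeout case is removed, a single function that never returns is enough to keep the caller waiting on the channel indefinitely.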
When the problem occurred, I found that the function `spawnSync` was blocked at `err = <-errc` in the loop when `i` was 4, because the function `processFullSyncContent` had not returned and so had not sent a value to the channel `errc`.
The function `d.queue.Close` calls the function `q.active.Broadcast`
The function `d.queue.Close` calls the function `q.active.Broadcast`:
https://github.com/XinFinOrg/XDPoSChain/blob/master/eth/downloader/queue.go#L148-L151
Please notice that there is no `q.lock.Lock()` in this function.
The function `processFullSyncContent` calls the function `queue.Results`
The function `processFullSyncContent` calls the function `d.queue.Results` in a loop:
https://github.com/XinFinOrg/XDPoSChain/blob/master/eth/downloader/downloader.go#L1328-L1330
The function `queue.Results` calls the function `q.active.Wait()`
The function `Results` in queue calls the function `q.active.Wait()`:
https://github.com/XinFinOrg/XDPoSChain/blob/master/eth/downloader/queue.go#L350-L360
When the problem occurred, I found that the function `Results` was blocked at `q.active.Wait()`.
Why `Results` is blocked
In most cases, `q.active.Broadcast()` in the function `Close` runs after `q.active.Wait()` in the function `Results`, so `q.active.Wait()` returns and the function `Results` continues to run. But in rare cases the calls interleave like this:
1. `Results`: `nproc` is 0 and `q.closed` is false, so it enters the loop
2. `Close`: calls `q.active.Broadcast()`
3. `Results`: calls `q.active.Wait()`
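The ordering above comes down to a `Broadcast` that happens while no goroutine is waiting yet. A minimal sketch of that ordering, using only `sync.Cond` from the Go standard library and none of the downloader code, is shown below. The function name `wokenAfterEarlyBroadcast` and the constant `demoTimeout` are hypothetical, and the timeout is there only so the example can report the result instead of hanging.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// demoTimeout exists only so this sketch can observe a blocked waiter.
const demoTimeout = 200 * time.Millisecond

// wokenAfterEarlyBroadcast calls Broadcast before any goroutine waits and
// reports whether a Wait that starts afterwards returns within timeout.
func wokenAfterEarlyBroadcast(timeout time.Duration) bool {
	var mu sync.Mutex
	cond := sync.NewCond(&mu)

	cond.Broadcast() // nobody is waiting yet, so this wakes nobody

	woke := make(chan struct{})
	go func() {
		mu.Lock()
		cond.Wait() // starts waiting after the Broadcast already happened
		mu.Unlock()
		close(woke)
	}()

	select {
	case <-woke:
		return true
	case <-time.After(timeout):
		// The waiter is still parked; this sketch leaves it blocked.
		return false
	}
}

func main() {
	fmt.Println("woken:", wokenAfterEarlyBroadcast(demoTimeout)) // prints "woken: false"
}
```

A `sync.Cond` does not remember past notifications, so a waiter that arrives late stays parked until some later `Signal` or `Broadcast`.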
So `q.active.Broadcast()` ran before `q.active.Wait()`. This makes `q.active.Wait()` wait forever, so the function `Results` is blocked and never returns its result.
How to fix this bug
I changed the function `Close` so that it acquires `q.lock` before calling `q.active.Broadcast()`.
Since the function `Results` already holds the lock while the function `Close` is waiting for it, `q.active.Broadcast()` cannot run before `q.active.Wait()`. When the function `Results` reaches `q.active.Wait()`, the lock is released, so the function `Close` can acquire it and call `q.active.Broadcast()`, which lets `q.active.Wait()` return and the function `Results` continue to run.
Please refer to https://github.com/ethereum/go-ethereum/blob/master/eth/downloader/queue.go#L192-L197
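To make the locking argument concrete, here is a hedged sketch of a queue whose `Close` takes the lock before broadcasting. It is a reduced model written for this explanation, not the code from this PR and not the go-ethereum snippet: the `queue` struct is cut down to four fields, `items` stands in for the real pending results (the real wait condition uses `nproc`), and `newQueue`, `closeUnblocks` and `demoTimeout` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// demoTimeout bounds how long the sketch waits before declaring a hang.
const demoTimeout = time.Second

// queue is a hypothetical, reduced model of the downloader queue.
type queue struct {
	lock   sync.Mutex
	active *sync.Cond
	closed bool
	items  []int
}

func newQueue() *queue {
	q := &queue{}
	q.active = sync.NewCond(&q.lock)
	return q
}

// Results blocks until items are available or the queue is closed.
func (q *queue) Results() []int {
	q.lock.Lock()
	defer q.lock.Unlock()
	for len(q.items) == 0 && !q.closed {
		q.active.Wait() // releases q.lock while parked, reacquires on wake-up
	}
	out := q.items
	q.items = nil
	return out
}

// Close takes q.lock first, so it cannot run between the check and the
// Wait inside Results.
func (q *queue) Close() {
	q.lock.Lock()
	q.closed = true
	q.active.Broadcast()
	q.lock.Unlock()
}

// closeUnblocks races Results against Close and reports whether Results
// returned within timeout.
func closeUnblocks(timeout time.Duration) bool {
	q := newQueue()
	done := make(chan struct{})
	go func() {
		q.Results()
		close(done)
	}()
	q.Close()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	for i := 0; i < 1000; i++ {
		if !closeUnblocks(demoTimeout) {
			fmt.Println("Results stayed blocked on iteration", i)
			return
		}
	}
	fmt.Println("Results returned after Close in every run")
}
```

With the lock held in `Close`, `Results` either sees `closed` already set when it checks the condition, or is already parked in `Wait` when the broadcast arrives, so both orderings let it return.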