
fix issue #264: private network halts randomly #323

Merged
merged 1 commit into XinFinOrg:dev-upgrade on Dec 18, 2023

Conversation

@gzliudan gzliudan (Collaborator) commented Sep 6, 2023

Proposed changes

This PR fixes issue #264.

Types of changes

What types of changes does your code introduce to the XDC network?

  • [✅] Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)
  • Regular KTLO or any other maintenance work, e.g. code style
  • CICD Improvement

Impacted Components

Which part of the codebase does this PR touch?

  • Consensus
  • Account
  • [✅] Network
  • Geth
  • Smart Contract
  • External components
  • Not sure (Please specify below)

Checklist

  • [✅] This PR has sufficient test coverage (unit/integration test) OR I have provided reason in the PR description for not having test coverage
  • Provide an end-to-end test plan in the PR description on how to manually test it on the devnet/testnet.
  • Tested the backwards compatibility.
  • Tested with XDC nodes running this version co-exist with those running the previous version.
  • Relevant documentation has been updated as part of this PR
  • N/A

Comments

About issue #264

Issue #264 reported that our XDC private network halts after a few days because some nodes stop producing blocks at a random time. The problem usually appears within 2 weeks, but can show up after as little as 2 days or as long as 3 months. I found the root cause after a long period of testing and debugging.

The function syncWithPeer posts some events

The function syncWithPeer in the downloader posts a StartEvent when it starts, and posts a DoneEvent or FailedEvent when it exits:

https://github.com/XinFinOrg/XDPoSChain/blob/master/eth/downloader/downloader.go#L410-L419

func (d *Downloader) syncWithPeer(p *peerConnection, hash common.Hash, td *big.Int) (err error) {
	d.mux.Post(StartEvent{})
	defer func() {
		// reset on error
		if err != nil {
			d.mux.Post(FailedEvent{err})
		} else {
			d.mux.Post(DoneEvent{})
		}
	}()
	// ......

The function update sets the variable self.mining according to the events from the function syncWithPeer

The miner is stopped when it receives a StartEvent, which sets the variable self.mining to 0 (via Stop), and it is started again when it receives a DoneEvent or FailedEvent:

https://github.com/XinFinOrg/XDPoSChain/blob/master/miner/miner.go#L85-L127

func (self *Miner) update() {
	events := self.mux.Subscribe(downloader.StartEvent{}, downloader.DoneEvent{}, downloader.FailedEvent{})
	for ev := range events.Chan() {
		switch ev.Data.(type) {
		case downloader.StartEvent:
			atomic.StoreInt32(&self.canStart, 0)
			if self.Mining() {
				self.Stop()
				atomic.StoreInt32(&self.shouldStart, 1)
				log.Info("Mining aborted due to sync")
			}
		case downloader.DoneEvent, downloader.FailedEvent:
			shouldStart := atomic.LoadInt32(&self.shouldStart) == 1

			atomic.StoreInt32(&self.canStart, 1)
			atomic.StoreInt32(&self.shouldStart, 0)
			if shouldStart {
				self.Start(self.coinbase)
			}
		}
	}
}

func (self *Miner) Start(coinbase common.Address) {
	atomic.StoreInt32(&self.shouldStart, 1)
	self.SetEtherbase(coinbase)

	if atomic.LoadInt32(&self.canStart) == 0 {
		log.Info("Network syncing, will start miner afterwards")
		return
	}
	atomic.StoreInt32(&self.mining, 1)

	log.Info("Starting mining operation")
	self.worker.start()
	self.worker.commitNewWork()
}

func (self *Miner) Stop() {
	self.worker.stop()
	atomic.StoreInt32(&self.mining, 0)
	atomic.StoreInt32(&self.shouldStart, 0)
}

The function commitNewWork exits when self.mining equals 0

However, the function syncWithPeer sometimes blocks: it posts the StartEvent but never posts a DoneEvent or FailedEvent. This leaves the miner stopped forever. When it is time to produce a block, the function commitNewWork returns early because self.mining is 0:

https://github.com/XinFinOrg/XDPoSChain/blob/master/miner/worker.go#L520-L544

func (self *worker) commitNewWork() {
    // ......

	if !self.announceTxs && atomic.LoadInt32(&self.mining) == 0 {
		return
	}
	// ......
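
To make the failure mode concrete, here is a toy sketch of the interaction above (my own simplification, not the real miner: plain strings stand in for the downloader events and the worker is omitted). After a StartEvent that is never followed by a DoneEvent or FailedEvent, canStart stays 0, so mining stays 0 forever:

package main

import (
	"fmt"
	"sync/atomic"
)

type miner struct{ canStart, shouldStart, mining int32 }

func (m *miner) handle(ev string) {
	switch ev {
	case "StartEvent":
		atomic.StoreInt32(&m.canStart, 0)
		if atomic.LoadInt32(&m.mining) == 1 {
			atomic.StoreInt32(&m.mining, 0) // Stop()
			atomic.StoreInt32(&m.shouldStart, 1)
		}
	case "DoneEvent", "FailedEvent":
		atomic.StoreInt32(&m.canStart, 1)
		if atomic.LoadInt32(&m.shouldStart) == 1 {
			m.Start()
		}
	}
}

func (m *miner) Start() {
	atomic.StoreInt32(&m.shouldStart, 1)
	if atomic.LoadInt32(&m.canStart) == 0 {
		return // "Network syncing, will start miner afterwards"
	}
	atomic.StoreInt32(&m.mining, 1)
}

func main() {
	m := &miner{canStart: 1}
	m.Start()              // mining == 1
	m.handle("StartEvent") // sync begins: mining == 0, shouldStart == 1
	// syncWithPeer never posts DoneEvent/FailedEvent, so canStart is
	// never reset to 1 and commitNewWork sees mining == 0 forever.
	fmt.Println("mining =", atomic.LoadInt32(&m.mining)) // prints: mining = 0
}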

The function syncWithPeer calls the function spawnSync

The function syncWithPeer calls the function spawnSync:

https://github.com/XinFinOrg/XDPoSChain/blob/master/eth/downloader/downloader.go#L410-L481

func (d *Downloader) syncWithPeer(p *peerConnection, hash common.Hash, td *big.Int) (err error) {
    // ......

	fetchers := []func() error{
		func() error { return d.fetchHeaders(p, origin+1, pivot) }, // Headers are always retrieved
		func() error { return d.fetchBodies(origin + 1) },          // Bodies are retrieved during normal and fast sync
		func() error { return d.fetchReceipts(origin + 1) },        // Receipts are retrieved during fast sync
		func() error { return d.processHeaders(origin+1, pivot, td) },
	}
	if d.mode == FastSync {
		fetchers = append(fetchers, func() error { return d.processFastSyncContent(latest) })
	} else if d.mode == FullSync {
		fetchers = append(fetchers, func() error { return d.processFullSyncContent(height) })
	}
	return d.spawnSync(fetchers)
}

The function spawnSync

Two calls in the function spawnSync matter here:

  • d.queue.Close
  • d.processFullSyncContent

https://github.com/XinFinOrg/XDPoSChain/blob/master/eth/downloader/downloader.go#L485-L513

func (d *Downloader) spawnSync(fetchers []func() error) error {
	var wg sync.WaitGroup
	errc := make(chan error, len(fetchers))
	wg.Add(len(fetchers))
	for _, fn := range fetchers {
		fn := fn
		go func() { defer wg.Done(); errc <- fn() }()
	}
	// Wait for the first error, then terminate the others.
	var err error
	for i := 0; i < len(fetchers); i++ {
		if i == len(fetchers)-1 {
			// Close the queue when all fetchers have exited.
			// This will cause the block processor to end when
			// it has processed the queue.
			d.queue.Close()
		}
		if err = <-errc; err != nil {
			break
		}
	}
	// ......

When the problem occurred, I found that the function spawnSync was blocked at err = <-errc in the loop when i was 4, because the function processFullSyncContent did not return and therefore never sent a value to the channel errc.
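
This blocking shape is easy to reproduce in isolation. Below is a toy version of the receive loop (my own sketch, not the real downloader: four fetchers return immediately, the fifth stands in for the stuck processFullSyncContent, and a timeout makes the hang visible):

package main

import (
	"fmt"
	"time"
)

func main() {
	fetchers := []func() error{
		func() error { return nil }, // fetchHeaders
		func() error { return nil }, // fetchBodies
		func() error { return nil }, // fetchReceipts
		func() error { return nil }, // processHeaders
		func() error { select {} }, // processFullSyncContent never returns
	}
	errc := make(chan error, len(fetchers))
	for _, fn := range fetchers {
		fn := fn
		go func() { errc <- fn() }()
	}
	for i := 0; i < len(fetchers); i++ {
		select {
		case err := <-errc:
			fmt.Printf("receive %d: err = %v\n", i, err)
		case <-time.After(time.Second):
			// In the real code this receive blocks forever.
			fmt.Printf("blocked at err = <-errc when i is %d\n", i)
			return
		}
	}
}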

The function d.queue.Close calls the function q.active.Broadcast

The function d.queue.Close calls the function q.active.Broadcast():

https://github.com/XinFinOrg/XDPoSChain/blob/master/eth/downloader/queue.go#L148-L151

func (q *queue) Close() {
	q.closed = true
	q.active.Broadcast()
}

Please note that there is no q.lock.Lock() in this function.

The function processFullSyncContent calls the function queue.Results

The function processFullSyncContent calls the function d.queue.Results in a loop:

https://github.com/XinFinOrg/XDPoSChain/blob/master/eth/downloader/downloader.go#L1328-L1330

func (d *Downloader) processFullSyncContent(height uint64) error {
	for {
		results := d.queue.Results(true)

The function queue.Results calls the function q.active.Wait()

The function Results in queue calls the function q.active.Wait():

https://github.com/XinFinOrg/XDPoSChain/blob/master/eth/downloader/queue.go#L350-L360

func (q *queue) Results(block bool) []*fetchResult {
	q.lock.Lock()
	defer q.lock.Unlock()

	// Count the number of items available for processing
	nproc := q.countProcessableItems()
	for nproc == 0 && !q.closed {
		if !block {
			return nil
		}
		q.active.Wait()
        // ...

When the problem occurred, I found that the function Results was blocked at q.active.Wait().

Why Results is blocked

In most cases, q.active.Broadcast() in the function Close runs after q.active.Wait() in the function Results, so q.active.Wait() returns and the function Results continues to run. But in rare cases the following interleaving occurs:

  • in the function Results: nproc is 0 and q.closed is false, so it enters the loop
  • the function Close runs: it sets q.closed and calls q.active.Broadcast()
  • in the function Results: q.active.Wait() is called

Here q.active.Broadcast() ran before q.active.Wait(), so the wakeup is lost: q.active.Wait() waits forever, the function Results stays blocked, and it never returns its results. The sketch below reproduces this interleaving deterministically.
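
A minimal, self-contained sketch of the lost wakeup (my own toy code, not the XDPoSChain queue: the sleep inside Results artificially widens the race window so the bad interleaving happens on every run):

package main

import (
	"fmt"
	"sync"
	"time"
)

type queue struct {
	lock   sync.Mutex
	active *sync.Cond
	closed bool
}

// Buggy Close, as in the code above: it does not take q.lock.
func (q *queue) Close() {
	q.closed = true
	q.active.Broadcast()
}

func (q *queue) Results() {
	q.lock.Lock()
	defer q.lock.Unlock()
	for !q.closed {
		// Widen the window between the closed check and Wait().
		// Close() runs in this window: its Broadcast finds no
		// waiter, so the Wait() below never wakes up.
		time.Sleep(100 * time.Millisecond)
		q.active.Wait()
	}
}

func main() {
	q := &queue{}
	q.active = sync.NewCond(&q.lock)

	done := make(chan struct{})
	go func() { q.Results(); close(done) }()

	time.Sleep(50 * time.Millisecond) // Results has passed the closed check
	q.Close()                         // the broadcast is lost

	select {
	case <-done:
		fmt.Println("Results returned")
	case <-time.After(time.Second):
		fmt.Println("Results is blocked forever at q.active.Wait()")
	}
}

Replacing Close here with the lock-taking version from the next section makes the sketch print "Results returned" instead.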

How to fix this bug

I changed the function Close to:

func (q *queue) Close() {
	q.lock.Lock()
	q.closed = true
	q.lock.Unlock()
	q.active.Broadcast()
}

Since the function Results already holds the lock, the function Close must wait for it, so q.active.Broadcast() can no longer run in the window between the closed check and q.active.Wait(). Only when the function Results reaches q.active.Wait() (which releases the lock) can the function Close acquire the lock, set q.closed, and call q.active.Broadcast(); the wakeup is then delivered, q.active.Wait() returns, and the function Results continues to run.

For reference, upstream go-ethereum applies the same fix: https://github.com/ethereum/go-ethereum/blob/master/eth/downloader/queue.go#L192-L197

func (q *queue) Close() {
	q.lock.Lock()
	q.closed = true
	q.active.Signal()
	q.lock.Unlock()
}
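
Two details are worth noting about the upstream version. First, go-ethereum signals while still holding the lock, whereas this PR unlocks before broadcasting; both orderings are correct, because sync.Cond explicitly allows Signal and Broadcast to be called without holding the lock, and the safety comes from writing q.closed under the same lock that Results holds while checking it. Second, Broadcast wakes every waiter while Signal wakes only one, so Broadcast is the more conservative choice if more than one goroutine could ever be blocked in Results.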

@liam-lai liam-lai merged commit d249d30 into XinFinOrg:dev-upgrade Dec 18, 2023
@liam-lai liam-lai mentioned this pull request Dec 27, 2023
@gzliudan gzliudan deleted the fix-issue-264 branch February 27, 2024 11:42