
fix issue #264: private network halts randomly #323

Merged
merged 1 commit into XinFinOrg:dev-upgrade on Dec 18, 2023

Conversation

@gzliudan gzliudan (Collaborator) commented Sep 6, 2023

Proposed changes

This PR fixes issue #264.

Types of changes

What types of changes does your code introduce to the XDC network?

  • [✅] Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)
  • Regular KTLO or any other maintenance work, e.g. code style
  • CICD Improvement

Impacted Components

Which part of the codebase does this PR touch?

  • Consensus
  • Account
  • [✅] Network
  • Geth
  • Smart Contract
  • External components
  • Not sure (Please specify below)

Checklist

  • [✅] This PR has sufficient test coverage (unit/integration test) OR I have provided reason in the PR description for not having test coverage
  • Provide an end-to-end test plan in the PR description on how to manually test it on the devnet/testnet.
  • Tested the backwards compatibility.
  • Tested with XDC nodes running this version co-exist with those running the previous version.
  • Relevant documentation has been updated as part of this PR
  • N/A

Comments

About issue #264

Issue #264 reported that our XDC private network halts after a few days because some nodes stop producing blocks at a random time. The problem usually appears within 2 weeks, but can show up after as little as 2 days or as long as 3 months. I found the root cause after a long period of testing and debugging.

The function syncWithPeer posts some events

The function syncWithPeer in the downloader posts a StartEvent when it starts, and posts a DoneEvent or FailedEvent when it exits:

https://github.com/XinFinOrg/XDPoSChain/blob/master/eth/downloader/downloader.go#L410-L419

func (d *Downloader) syncWithPeer(p *peerConnection, hash common.Hash, td *big.Int) (err error) {
	d.mux.Post(StartEvent{})
	defer func() {
		// reset on error
		if err != nil {
			d.mux.Post(FailedEvent{err})
		} else {
			d.mux.Post(DoneEvent{})
		}
	}()
	// ......

The function update sets the variable self.mining according to the events from the function syncWithPeer

The miner is stopped when it receives a StartEvent, which sets the variable self.mining to 0 (via Stop), and it is started again when it receives a DoneEvent or FailedEvent:

https://github.com/XinFinOrg/XDPoSChain/blob/master/miner/miner.go#L85-L127

func (self *Miner) update() {
	events := self.mux.Subscribe(downloader.StartEvent{}, downloader.DoneEvent{}, downloader.FailedEvent{})
	for ev := range events.Chan() {
		switch ev.Data.(type) {
		case downloader.StartEvent:
			atomic.StoreInt32(&self.canStart, 0)
			if self.Mining() {
				self.Stop()
				atomic.StoreInt32(&self.shouldStart, 1)
				log.Info("Mining aborted due to sync")
			}
		case downloader.DoneEvent, downloader.FailedEvent:
			shouldStart := atomic.LoadInt32(&self.shouldStart) == 1

			atomic.StoreInt32(&self.canStart, 1)
			atomic.StoreInt32(&self.shouldStart, 0)
			if shouldStart {
				self.Start(self.coinbase)
			}
		}
	}
}

func (self *Miner) Start(coinbase common.Address) {
	atomic.StoreInt32(&self.shouldStart, 1)
	self.SetEtherbase(coinbase)

	if atomic.LoadInt32(&self.canStart) == 0 {
		log.Info("Network syncing, will start miner afterwards")
		return
	}
	atomic.StoreInt32(&self.mining, 1)

	log.Info("Starting mining operation")
	self.worker.start()
	self.worker.commitNewWork()
}

func (self *Miner) Stop() {
	self.worker.stop()
	atomic.StoreInt32(&self.mining, 0)
	atomic.StoreInt32(&self.shouldStart, 0)
}

The function commitNewWork exits when self.mining equals 0

However, the function syncWithPeer sometimes blocks: it posts the StartEvent but never posts a DoneEvent or FailedEvent. This leaves the miner stopped forever. When it is time to produce a block, the function commitNewWork returns early because self.mining is 0:

https://github.com/XinFinOrg/XDPoSChain/blob/master/miner/worker.go#L520-L544

func (self *worker) commitNewWork() {
    // ......

	if !self.announceTxs && atomic.LoadInt32(&self.mining) == 0 {
		return
	}
	// ......
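
To make the failure mode concrete, here is a toy sketch of the interaction above (my own simplification, not the real miner: plain strings stand in for the downloader events and the worker is omitted). After a StartEvent that is never followed by a DoneEvent or FailedEvent, canStart stays 0, so mining stays 0 forever:

package main

import (
	"fmt"
	"sync/atomic"
)

type miner struct{ canStart, shouldStart, mining int32 }

func (m *miner) handle(ev string) {
	switch ev {
	case "StartEvent":
		atomic.StoreInt32(&m.canStart, 0)
		if atomic.LoadInt32(&m.mining) == 1 {
			atomic.StoreInt32(&m.mining, 0) // Stop()
			atomic.StoreInt32(&m.shouldStart, 1)
		}
	case "DoneEvent", "FailedEvent":
		atomic.StoreInt32(&m.canStart, 1)
		if atomic.LoadInt32(&m.shouldStart) == 1 {
			m.Start()
		}
	}
}

func (m *miner) Start() {
	atomic.StoreInt32(&m.shouldStart, 1)
	if atomic.LoadInt32(&m.canStart) == 0 {
		return // "Network syncing, will start miner afterwards"
	}
	atomic.StoreInt32(&m.mining, 1)
}

func main() {
	m := &miner{canStart: 1}
	m.Start()              // mining == 1
	m.handle("StartEvent") // sync begins: mining == 0, shouldStart == 1
	// syncWithPeer never posts DoneEvent/FailedEvent, so canStart is
	// never reset to 1 and commitNewWork sees mining == 0 forever.
	fmt.Println("mining =", atomic.LoadInt32(&m.mining)) // prints: mining = 0
}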

The function syncWithPeer calls the function spawnSync

The function syncWithPeer calls the function spawnSync:

https://github.com/XinFinOrg/XDPoSChain/blob/master/eth/downloader/downloader.go#L410-L481

func (d *Downloader) syncWithPeer(p *peerConnection, hash common.Hash, td *big.Int) (err error) {
    // ......

	fetchers := []func() error{
		func() error { return d.fetchHeaders(p, origin+1, pivot) }, // Headers are always retrieved
		func() error { return d.fetchBodies(origin + 1) },          // Bodies are retrieved during normal and fast sync
		func() error { return d.fetchReceipts(origin + 1) },        // Receipts are retrieved during fast sync
		func() error { return d.processHeaders(origin+1, pivot, td) },
	}
	if d.mode == FastSync {
		fetchers = append(fetchers, func() error { return d.processFastSyncContent(latest) })
	} else if d.mode == FullSync {
		fetchers = append(fetchers, func() error { return d.processFullSyncContent(height) })
	}
	return d.spawnSync(fetchers)
}

The function spawnSync

Two calls in the function spawnSync matter here:

  • d.queue.Close
  • d.processFullSyncContent

https://github.com/XinFinOrg/XDPoSChain/blob/master/eth/downloader/downloader.go#L485-L513

func (d *Downloader) spawnSync(fetchers []func() error) error {
	var wg sync.WaitGroup
	errc := make(chan error, len(fetchers))
	wg.Add(len(fetchers))
	for _, fn := range fetchers {
		fn := fn
		go func() { defer wg.Done(); errc <- fn() }()
	}
	// Wait for the first error, then terminate the others.
	var err error
	for i := 0; i < len(fetchers); i++ {
		if i == len(fetchers)-1 {
			// Close the queue when all fetchers have exited.
			// This will cause the block processor to end when
			// it has processed the queue.
			d.queue.Close()
		}
		if err = <-errc; err != nil {
			break
		}
	}
	// ......

When the problem occurred, I found that the function spawnSync was blocked at err = <-errc in the loop when i was 4, because the function processFullSyncContent did not return and therefore never sent a value to the channel errc.
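
This blocking shape is easy to reproduce in isolation. Below is a toy version of the receive loop (my own sketch, not the real downloader: four fetchers return immediately, the fifth stands in for the stuck processFullSyncContent, and a timeout makes the hang visible):

package main

import (
	"fmt"
	"time"
)

func main() {
	fetchers := []func() error{
		func() error { return nil }, // fetchHeaders
		func() error { return nil }, // fetchBodies
		func() error { return nil }, // fetchReceipts
		func() error { return nil }, // processHeaders
		func() error { select {} }, // processFullSyncContent never returns
	}
	errc := make(chan error, len(fetchers))
	for _, fn := range fetchers {
		fn := fn
		go func() { errc <- fn() }()
	}
	for i := 0; i < len(fetchers); i++ {
		select {
		case err := <-errc:
			fmt.Printf("receive %d: err = %v\n", i, err)
		case <-time.After(time.Second):
			// In the real code this receive blocks forever.
			fmt.Printf("blocked at err = <-errc when i is %d\n", i)
			return
		}
	}
}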

The function d.queue.Close calls the function q.active.Broadcast

The function d.queue.Close calls the function q.active.Broadcast():

https://github.com/XinFinOrg/XDPoSChain/blob/master/eth/downloader/queue.go#L148-L151

func (q *queue) Close() {
	q.closed = true
	q.active.Broadcast()
}

Please note that there is no q.lock.Lock() in this function.

The function processFullSyncContent calls the function queue.Results

The function processFullSyncContent calls the function d.queue.Results in a loop:

https://github.com/XinFinOrg/XDPoSChain/blob/master/eth/downloader/downloader.go#L1328-L1330

func (d *Downloader) processFullSyncContent(height uint64) error {
	for {
		results := d.queue.Results(true)

The function queue.Results calls the function q.active.Wait()

The function Results in queue calls the function q.active.Wait():

https://github.com/XinFinOrg/XDPoSChain/blob/master/eth/downloader/queue.go#L350-L360

func (q *queue) Results(block bool) []*fetchResult {
	q.lock.Lock()
	defer q.lock.Unlock()

	// Count the number of items available for processing
	nproc := q.countProcessableItems()
	for nproc == 0 && !q.closed {
		if !block {
			return nil
		}
		q.active.Wait()
        // ...

When the problem occurred, I found that the function Results was blocked at q.active.Wait().

Why Results is blocked

In most cases, q.active.Broadcast() in the function Close runs after q.active.Wait() in the function Results, so q.active.Wait() returns and the function Results continues to run. But in rare cases the following interleaving occurs:

  • in the function Results: nproc is 0 and q.closed is false, so it enters the loop
  • the function Close runs: it sets q.closed and calls q.active.Broadcast()
  • in the function Results: q.active.Wait() is called

Here q.active.Broadcast() ran before q.active.Wait(), so the wakeup is lost: q.active.Wait() waits forever, the function Results stays blocked, and it never returns its results. The sketch below reproduces this interleaving deterministically.
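
A minimal, self-contained sketch of the lost wakeup (my own toy code, not the XDPoSChain queue: the sleep inside Results artificially widens the race window so the bad interleaving happens on every run):

package main

import (
	"fmt"
	"sync"
	"time"
)

type queue struct {
	lock   sync.Mutex
	active *sync.Cond
	closed bool
}

// Buggy Close, as in the code above: it does not take q.lock.
func (q *queue) Close() {
	q.closed = true
	q.active.Broadcast()
}

func (q *queue) Results() {
	q.lock.Lock()
	defer q.lock.Unlock()
	for !q.closed {
		// Widen the window between the closed check and Wait().
		// Close() runs in this window: its Broadcast finds no
		// waiter, so the Wait() below never wakes up.
		time.Sleep(100 * time.Millisecond)
		q.active.Wait()
	}
}

func main() {
	q := &queue{}
	q.active = sync.NewCond(&q.lock)

	done := make(chan struct{})
	go func() { q.Results(); close(done) }()

	time.Sleep(50 * time.Millisecond) // Results has passed the closed check
	q.Close()                         // the broadcast is lost

	select {
	case <-done:
		fmt.Println("Results returned")
	case <-time.After(time.Second):
		fmt.Println("Results is blocked forever at q.active.Wait()")
	}
}

Replacing Close here with the lock-taking version from the next section makes the sketch print "Results returned" instead.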

How to fix this bug

I changed the function Close to:

func (q *queue) Close() {
	q.lock.Lock()
	q.closed = true
	q.lock.Unlock()
	q.active.Broadcast()
}

Since the function Results already holds the lock, the function Close must wait for it, so q.active.Broadcast() can no longer run in the window between the closed check and q.active.Wait(). Only when the function Results reaches q.active.Wait() (which releases the lock) can the function Close acquire the lock, set q.closed, and call q.active.Broadcast(); the wakeup is then delivered, q.active.Wait() returns, and the function Results continues to run.

For reference, upstream go-ethereum applies the same fix: https://github.com/ethereum/go-ethereum/blob/master/eth/downloader/queue.go#L192-L197

func (q *queue) Close() {
	q.lock.Lock()
	q.closed = true
	q.active.Signal()
	q.lock.Unlock()
}
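
Two details are worth noting about the upstream version. First, go-ethereum signals while still holding the lock, whereas this PR unlocks before broadcasting; both orderings are correct, because sync.Cond explicitly allows Signal and Broadcast to be called without holding the lock, and the safety comes from writing q.closed under the same lock that Results holds while checking it. Second, Broadcast wakes every waiter while Signal wakes only one, so Broadcast is the more conservative choice if more than one goroutine could ever be blocked in Results.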

@liam-lai liam-lai merged commit d249d30 into XinFinOrg:dev-upgrade Dec 18, 2023
@liam-lai liam-lai mentioned this pull request Dec 27, 2023
@gzliudan gzliudan deleted the fix-issue-264 branch February 27, 2024 11:42