Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

miner, consensus/clique: avoid memory leak during block stasis #23861

Merged
merged 1 commit into from
Nov 5, 2021

Conversation

holiman
Copy link
Contributor

@holiman holiman commented Nov 5, 2021

When a clique network is not progressing, we see a memory leak:

Screenshot 2021-11-05 at 14-50-10 Single Geth - Grafana

This is pretty bad, because whenever one signer falls off, then the rest of them eventually will crash too -- so one failing sealer will wreak havoc, causing other sealers to crash and making it very difficult to get the network live again.

Every 3 seconds (miner.recommit interval) we shove a new task to be handed, where we e.g. deep-copy the receipts and other stuff and send it over a channel w.taskCh <- &task{receipts: receipts, state: s, block: block, createdAt: time.Now()}:

That get's shoved into a map

             w.pendingTasks[sealHash] = task

So every 3 seconds (miner recommit), we leak a block into memory, basically.
The routine that would have cleaned that up :

    clearPending := func(number uint64) {
        w.pendingMu.Lock()
        for h, t := range w.pendingTasks {
            if t.block.NumberU64()+staleThreshold <= number {
                delete(w.pendingTasks, h)
            }
        }
        w.pendingMu.Unlock()
    }

Won't clean close to head, and besides, it's not executed unless blocks progress.

When we hit this case:

Nov 05 15:08:18 rinkeby-aws-eu-west-3-001 geth INFO [11-05|14:08:18.254] Signed recently, must wait for others
Nov 05 15:08:18 rinkeby-aws-eu-west-3-001 geth INFO [11-05|14:08:18.254] Commit new mining work number=9,589,541 sealhash=df3e3b..b20b0c uncles=0 txs=237 gas=29,998,852 fees=5.408818027 elapsed=74.878ms
Nov 05 15:08:21 rinkeby-aws-eu-west-3-001 geth INFO [11-05|14:08:21.253] Signed recently, must wait for others
Nov 05 15:08:21 rinkeby-aws-eu-west-3-001 geth INFO [11-05|14:08:21.253] Commit new mining work number=9,589,541 sealhash=16ca1f..b229ee uncles=0 txs=237 gas=29,998,852 fees=5.408818027 elapsed=73.620ms
Nov 05 15:08:24 rinkeby-aws-eu-west-3-001 geth INFO [11-05|14:08:24.253] Signed recently, must wait for others
Nov 05 15:08:24 rinkeby-aws-eu-west-3-001 geth INFO [11-05|14:08:24.254] Commit new mining work number=9,589,541 sealhash=c3e488..6dd7a3 uncles=0 txs=237 gas=29,998,852 fees=5.408818027 elapsed=72.891ms

Then Signed recently, must wait for others means that we did not actually start a sealing-task at all, so there's no point tracking the work package. This PR changes the behavior so that

  1. We return an error from Clique, instead of logging + return nil,
  2. When the engine returns an error, we un-track the work package.

@holiman holiman requested a review from karalabe as a code owner November 5, 2021 14:32
@holiman holiman added this to the 1.10.12 milestone Nov 5, 2021
@holiman
Copy link
Contributor Author

holiman commented Nov 5, 2021

Was running latest stable until 15:40, then restarted with this PR. So seems to work fine:

Screenshot 2021-11-05 at 16-05-01 Single Geth - Grafana

Copy link
Member

@karalabe karalabe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

@holiman holiman merged commit 476fb56 into ethereum:master Nov 5, 2021
sidhujag pushed a commit to syscoin/go-ethereum that referenced this pull request Nov 7, 2021
…eum#23861)

This PR fixes a problem which arises on clique networks when there is a network stall. Previously, the worker packages were tracked, even if the sealing engine decided not to seal the block (due to clique rules about recent signing). These tracked-but-not-sealed blocks kept building up in memory. 
This PR changes the situation so the sealing engine instead returns an error, and the worker can thus un-track the package.
@holiman holiman deleted the fix_clique branch November 10, 2021 18:32
yongjun925 pushed a commit to DODOEX/go-ethereum that referenced this pull request Dec 3, 2022
…eum#23861)

This PR fixes a problem which arises on clique networks when there is a network stall. Previously, the worker packages were tracked, even if the sealing engine decided not to seal the block (due to clique rules about recent signing). These tracked-but-not-sealed blocks kept building up in memory. 
This PR changes the situation so the sealing engine instead returns an error, and the worker can thus un-track the package.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants