Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core/state/snapshot: fix BAD BLOCK error when snapshot is generating #23635

Merged

Conversation

zzyalbert
Copy link
Contributor

@zzyalbert zzyalbert commented Sep 24, 2021

This pr is trying to resolve the BAD BLOCK problem in #23531, it is a little obscure to find out

It occurred when geth was syncing from the block of some days ago in full sync mode, while generating snapshots from the beginning

The following is the process how 'BAD BLOCK' occurrs

  • The snapshot is generating, its genMarker comes to some key, let's assume diskLayer.genMarker = accountHash + storageHash0
  • Then a storage key accountHash + storageHash1 (storageHash1 < storageHash0) was read from snapshot diskLayer because accountHash + storageHash1 < diskLayer.genMarker. And the storage kv <accountHash + storageHash1, value1> will be cached in diskLayer.cache
  • After that a new block is written to the chain by BlockChain.writeBlockWithState(), and the bottom diffLayer is prepared to be merged to diskLayer by 'snapshot.diffToDisk()', thus will send a abort signal to snapshot generating process
  • Then generating process comes to checkAndFlush(core/state/snapshot/generate.go:638, see picture below), receives the abort signal. But it will set diskLayer.genMarker=accountHash and thus makes diskLayer.genMarker become smaller(accountHash < accountHash + storageHash1)

image

  • 'snapshot.diffToDisk()' receives abort finish signal from snapshot generation process and continues to write data to diskLayer. A new value of key accountHash + storageHash1 is in that data but will be ignored because its key is greater than diskLayer.genMarker(accountHash)
  • As the snapshot generation proceeding, the new value of accountHash + storageHash1 will be finally written to diskLayer(core/state/snapshot/generate.go:673, see picture below) and diskLayer.genMarker will become larger than accountHash + storageHash1, let's assume diskLayer.genMarker = accountHash + storageHash3, but the value in diskLayer.cache still remains OLD

image

  • Then next time when accountHash + storageHash1 is read, the old value in diskLayer.cache will be returned and that will cause a BAD BLOCK error

In terms of the resolution, I have 2 proposals

  1. Make sure the diskLayer.genMarker will never go smaller, which was applied in this pr
  2. Delete the key in diskLayer.cache to make sure any stale value will be evicted, like following code(zzyalbert@ad30ba7)
    badblockfix2

I currently prefer the first one because I think the second will increase a lot of cache query

@buddh0
Copy link
Contributor

buddh0 commented Sep 24, 2021

Good job!
and I prefer the first one.
@holiman please review this patch, because I think a lot of people is suffering from this problem.

@zzyalbert
Copy link
Contributor Author

@holiman @karalabe @rjl493456442 could you please review my pr?

@holiman
Copy link
Contributor

holiman commented Oct 18, 2021

Sorry, I somehow missed this PR -- looks like you went through a lot of debugging to find the problem. I'll investigate this asap, and try to think about if we can create a testcase to somehow find this

@rjl493456442
Copy link
Member

rjl493456442 commented Oct 19, 2021

Then generating process comes to checkAndFlush(core/state/snapshot/generate.go:638, see picture below), receives the abort signal. But it will set diskLayer.genMarker=accountHash and thus makes diskLayer.genMarker become smaller(accountHash < accountHash + storageHash1)

Hi @zzyalbert I don't get it why the genMarker can be smaller. checkAndFlush will be called in onAccount callback and onStorage callback. If generators are generating the storage data(account hash x + storage hash y), then the genMarker will be set to account hash x + storage hash y in case of abort signal occurs. The generation marker should be strictly matched with the latest generated state.

@zzyalbert
Copy link
Contributor Author

Hi @zzyalbert I don't get it why the genMarker can be smaller. checkAndFlush will be called in onAccount callback and onStorage callback. If generators are generating the storage data(account hash x + storage hash y), then the genMarker will be set to account hash x + storage hash y in case of abort signal occurs. The generation marker should be strictly matched with the latest generated state.

@rjl493456442 Yeah, it sure is in most cases. But checkAndFlush in onAccount will be called in front of that in onStorage . Therefore, if the genMarker is account hash x + storage hash y when the snap generating process is
resumed and interrupted again, then checkAndFlush in onAccount will be called and set genMarker to account hash x(storage hash y is missing) and return back. That's why genMarker becomes smaller(account hash x < account hash x + storage hash y)

I think this case could happen because the goroutine of resuming snapshot generation may not be scheduled in time

@rjl493456442
Copy link
Member

rjl493456442 commented Oct 19, 2021

But checkAndFlush in onAccount will be called in front of that in onStorage is it possible?

Whenever generator receives interruption signal in storage callback(is generating storage data), then the aborted error will be bubbled up to interrupt the entire generation. checkAndFlush in onAccount shouldn't be called.

@holiman
Copy link
Contributor

holiman commented Oct 19, 2021

I agree with @rjl493456442 -- I don't see how that could happen either. However, I'll add this and see if I can catch it:

			if bytes.Compare(currentLocation, dl.genMarker) < 0 {
				panic(fmt.Sprintf("curr < genMarker: %x < %x", currentLocation, dl.genMarker))
			}

@holiman
Copy link
Contributor

holiman commented Oct 19, 2021

Oh lookie (did a goerli sync)

INFO [10-19|10:12:26.255] Aborting state snapshot generation       root=715fc8..264cec in=006ce8..ad5d07 at=036b63..bb3db0 accounts=12791                slots=32074               storage=3.26MiB elapsed=1.296s      eta=12m59.165s
INFO [10-19|10:12:26.255] Resuming state snapshot generation       root=e7fcfd..acf130 in=006ce8..ad5d07 at=036b63..bb3db0 accounts=12791                slots=32074               storage=3.26MiB elapsed=1.297s      eta=12m59.766s
INFO [10-19|10:12:26.276] Aborting state snapshot generation       root=e7fcfd..acf130 at=006e85..d3da8a accounts=12978                slots=32337               storage=3.29MiB elapsed=1.318s      eta=13m0.806s
INFO [10-19|10:12:26.277] Resuming state snapshot generation       root=bbdf9d..d28662 at=006e85..d3da8a accounts=12978                slots=32337               storage=3.29MiB elapsed=1.318s      eta=13m0.806s
INFO [10-19|10:12:26.304] Aborting state snapshot generation       root=bbdf9d..d28662 in=00712f..6e5e53 at=0175b7..a01db9 accounts=13271                slots=32847               storage=3.35MiB elapsed=1.345s      eta=12m58.016s
INFO [10-19|10:12:26.304] Resuming state snapshot generation       root=d2a390..31a8ae in=00712f..6e5e53 at=0175b7..a01db9 accounts=13271                slots=32847               storage=3.35MiB elapsed=1.346s      eta=12m58.594s
panic: curr < genMarker: 00712f25a5ae002cfec1f6407b07331e3b919ee425027043359f42769e6e5e53 < 00712f25a5ae002cfec1f6407b07331e3b919ee425027043359f42769e6e5e530175b7a638427703f0dbe7bb9bbf987a2551717b34e79f33b5b1008d1fa01db9

goroutine 214313 [running]:
github.com/ethereum/go-ethereum/core/state/snapshot.(*diskLayer).generate.func1({0xc017010fa0, 0x46, 0x46})
	github.com/ethereum/go-ethereum/core/state/snapshot/generate.go:564 +0x445
github.com/ethereum/go-ethereum/core/state/snapshot.(*diskLayer).generate.func2({0xc016f31340, 0x20, 0x20}, {0xc0399fc9b0, 0x46, 0x46}, 0x0, 0x0)
	github.com/ethereum/go-ethereum/core/state/snapshot/generate.go:641 +0x62e
github.com/ethereum/go-ethereum/core/state/snapshot.(*diskLayer).generateRange.func1({0xc016f31340, 0x12, 0x5}, {0xc0399fc9b0, 0x2, 0x2})
	github.com/ethereum/go-ethereum/core/state/snapshot/generate.go:407 +0x2e
github.com/ethereum/go-ethereum/core/state/snapshot.(*proofResult).forEach(0xc031376dc0, 0xc001435c80)
	github.com/ethereum/go-ethereum/core/state/snapshot/generate.go:237 +0x8b
github.com/ethereum/go-ethereum/core/state/snapshot.(*diskLayer).generateRange(0xc016db95f0, {0xd2, 0xa3, 0x90, 0xcd, 0x4c, 0x30, 0xe1, 0x58, 0x2b, ...}, ...)
	github.com/ethereum/go-ethereum/core/state/snapshot/generate.go:407 +0xa32
github.com/ethereum/go-ethereum/core/state/snapshot.(*diskLayer).generate(0xc016db95f0, 0xc03d176540)
	github.com/ethereum/go-ethereum/core/state/snapshot/generate.go:711 +0x3a6
created by github.com/ethereum/go-ethereum/core/state/snapshot.diffToDisk
	github.com/ethereum/go-ethereum/core/state/snapshot/snapshot.go:645 +0xbb8

@buddh0
Copy link
Contributor

buddh0 commented Oct 19, 2021

But checkAndFlush in onAccount will be called in front of that in onStorage is it possible?

Whenever generator receives interruption signal in storage callback(is generating storage data), then the aborted error will be bubbled up to interrupt the entire generation. checkAndFlush in onAccount shouldn't be called.

checkAndFlush in onAccount would be called when the snap generating process is resumed
then error happens

@fjl fjl added this to the 1.10.11 milestone Oct 19, 2021
@fjl fjl removed the status:triage label Oct 19, 2021
@holiman
Copy link
Contributor

holiman commented Oct 19, 2021

alternative fix, courtesy of @karalabe :

diff --git a/core/state/snapshot/generate.go b/core/state/snapshot/generate.go
index 3e11b4ac6b..c179a14539 100644
--- a/core/state/snapshot/generate.go
+++ b/core/state/snapshot/generate.go
@@ -563,6 +563,9 @@ func (dl *diskLayer) generate(stats *generatorStats) {
 			// Flush out the batch anyway no matter it's empty or not.
 			// It's possible that all the states are recovered and the
 			// generation indeed makes progress.
+			if bytes.Compare(currentLocation, dl.genMarker) < 0 {
+				panic(fmt.Sprintf("curr < genMarker: %x < %x", currentLocation, dl.genMarker))
+			}
 			journalProgress(batch, currentLocation, stats)
 
 			if err := batch.Write(); err != nil {
@@ -635,7 +638,11 @@ func (dl *diskLayer) generate(stats *generatorStats) {
 			stats.accounts++
 		}
 		// If we've exceeded our batch allowance or termination was requested, flush to disk
-		if err := checkAndFlush(accountHash[:]); err != nil {
+		marker := accountHash[:]
+		if accMarker != nil && bytes.Equal(accountHash[:], accMarker) && len(dl.genMarker) > common.HashLength {
+			marker = append(marker, dl.genMarker[common.HashLength:]...)
+		}
+		if err := checkAndFlush(marker); err != nil {
 			return err
 		}
 		// If the iterated account is the contract, create a further loop to

I'll spin up syncs on our benchmarking machines to validate it

@zzyalbert
Copy link
Contributor Author

alternative fix, courtesy of @karalabe :

diff --git a/core/state/snapshot/generate.go b/core/state/snapshot/generate.go
index 3e11b4ac6b..c179a14539 100644
--- a/core/state/snapshot/generate.go
+++ b/core/state/snapshot/generate.go
@@ -563,6 +563,9 @@ func (dl *diskLayer) generate(stats *generatorStats) {
 			// Flush out the batch anyway no matter it's empty or not.
 			// It's possible that all the states are recovered and the
 			// generation indeed makes progress.
+			if bytes.Compare(currentLocation, dl.genMarker) < 0 {
+				panic(fmt.Sprintf("curr < genMarker: %x < %x", currentLocation, dl.genMarker))
+			}
 			journalProgress(batch, currentLocation, stats)
 
 			if err := batch.Write(); err != nil {
@@ -635,7 +638,11 @@ func (dl *diskLayer) generate(stats *generatorStats) {
 			stats.accounts++
 		}
 		// If we've exceeded our batch allowance or termination was requested, flush to disk
-		if err := checkAndFlush(accountHash[:]); err != nil {
+		marker := accountHash[:]
+		if accMarker != nil && bytes.Equal(accountHash[:], accMarker) && len(dl.genMarker) > common.HashLength {
+			marker = append(marker, dl.genMarker[common.HashLength:]...)
+		}
+		if err := checkAndFlush(marker); err != nil {
 			return err
 		}
 		// If the iterated account is the contract, create a further loop to

I'll spin up syncs on our benchmarking machines to validate it

Oh, thanks, I really appreciate it. I'll think about it

@zzyalbert zzyalbert force-pushed the fix_bad_block_with_snapshot_generation branch from 49cd470 to 3f98d0e Compare October 20, 2021 02:12
@zzyalbert
Copy link
Contributor Author

alternative fix, courtesy of @karalabe :
...
I'll spin up syncs on our benchmarking machines to validate it

@holiman I think this fix looks more reasonable. I add some code comments to describe the issue as well as some minor update

// If the snap generation goes here after interrupted, genMarker may go backward
// when last genMarker is consisted of accountHash and storageHash
if accMarker != nil && bytes.Equal(marker, accMarker) && len(dl.genMarker) > common.HashLength {
marker = dl.genMarker[:]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes marker use the same backing-slice as dl.genMarker, whereas previously it was a copy. Might not matter in this case, but generally it's a less safe pattern

@holiman holiman merged commit 312e02b into ethereum:master Oct 20, 2021
@holiman
Copy link
Contributor

holiman commented Oct 20, 2021

Thanks a lot for this @zzyalbert , I've had a feeling there was a bug there somewhere, but I haven't been able to pinpoint it, so kudos to you for doing that!

sidhujag pushed a commit to syscoin/go-ethereum that referenced this pull request Oct 20, 2021
…thereum#23635)

* core/state/snapshot: fix BAD BLOCK error when snapshot is generating

* core/state/snapshot: alternative fix for the snapshot generator

* add comments and minor update

Co-authored-by: Martin Holst Swende <[email protected]>
maoueh pushed a commit to streamingfast/go-ethereum that referenced this pull request Nov 12, 2021
…thereum#23635) (ethereum#485)

* core/state/snapshot: fix BAD BLOCK error when snapshot is generating

* core/state/snapshot: alternative fix for the snapshot generator

* add comments and minor update

Co-authored-by: Martin Holst Swende <[email protected]>

* core/state/snapshot: fix BAD BLOCK error when snapshot is generating

* core/state/snapshot: alternative fix for the snapshot generator

* add comments and minor update

Co-authored-by: Martin Holst Swende <[email protected]>

Co-authored-by: Ziyuan Zhong <[email protected]>
Co-authored-by: Martin Holst Swende <[email protected]>
greatestvillain added a commit to greatestvillain/go-ethereum that referenced this pull request Jan 24, 2022
yongjun925 pushed a commit to DODOEX/go-ethereum that referenced this pull request Dec 3, 2022
…thereum#23635)

* core/state/snapshot: fix BAD BLOCK error when snapshot is generating

* core/state/snapshot: alternative fix for the snapshot generator

* add comments and minor update

Co-authored-by: Martin Holst Swende <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants