Snap sync leads to corrupted state #22534

Closed

holiman opened this issue Mar 19, 2021 · 5 comments

@holiman
Contributor

holiman commented Mar 19, 2021

This was against Geth/v1.10.0-unstable-8f03e3b1-20210220/linux-amd64/go1.16. I have the full logs, with TRACE on some portions of it.

We've seen this on @MariusVanDerWijden's node as well, and something similar recently happened on one of the benchmarkers.

Interesting parts of the logs:

Terminates snapshot sync cycle, pending is 0:

TRACE[02-22|19:10:36.935] Delivering set of healing trienodes      peer=a751eb04         reqid=3279015607415421165 trienodes=1   bytes=39.00B
DEBUG[02-22|19:10:36.935] Persisted set of healing data            type=trienodes               bytes=6.21KiB
DEBUG[02-22|19:10:36.935] Terminating snapshot sync cycle          root="56b98f…549127"
INFO [02-22|19:10:36.935] State heal in progress                   [email protected] [email protected]       pending=0
TRACE[02-22|19:10:36.935] State sync spinning down                 active=0 finished=0
TRACE[02-22|19:10:36.945] Peer throughput measurements updated     peer=ca5d01b0         hps=2603.508  bps=308.852 rps=155.252 sps=0.000 miss=0 rtt=127.002199ms

It then continues downloading headers/bodies for another five minutes, and then it decides that snapshot sync is already completed:

DEBUG[02-22|19:16:09.248] Inserting fast-sync blocks               items=15       firstnum=11908300 firsthash="4ff1bc…43c652" lastnumn=11908314 lasthash="8bd984…7f4ce4"
INFO [02-22|19:16:09.274] Imported new block receipts              count=15      elapsed=26.263ms    number=11908314 hash="8bd984…7f4ce4" age=20m30s     size=1.44MiB
TRACE[02-22|19:16:09.274] State sync starting                      root="56b98f…549127"
DEBUG[02-22|19:16:09.274] Snapshot sync already completed 
TRACE[02-22|19:16:09.274] State sync spinning down                 active=0 finished=0
DEBUG[02-22|19:16:09.274] Committing fast sync pivot as new head   number=11908315 hash="b8473c…b87bf0"
INFO [02-22|19:16:09.276] Imported new block receipts              count=1       elapsed=1.639ms     number=11908315 hash="b8473c…b87bf0" age=20m9s      size=113.69KiB
INFO [02-22|19:16:09.279] Rebuilding state snapshot 
INFO [02-22|19:16:09.281] Committed new head block                 number=11908315 hash="b8473c…b87bf0"
INFO [02-22|19:16:09.282] Wiper running, state snapshotting paused accounts=0                  slots=0                  storage=0.00B elapsed="934.583µs"
INFO [02-22|19:16:09.380] Deallocated state bloom                  items=762715723 errorrate=0.001
DEBUG[02-22|19:16:09.380] Inserting downloaded chain               items=3         firstnum=11908316 firsthash="e33586…12d149" lastnum=11908318 lasthash="32db31…17e4ba"
TRACE[02-22|19:16:09.955] Fetching full headers                    peer=ca5d01b0         count=192     from=11908406
TRACE[02-22|19:16:09.987] Scheduling new headers                   peer=ca5d01b0         count=1       from=11908406
TRACE[02-22|19:16:09.987] Fetching full headers                    peer=ca5d01b0         count=192     from=11908407
TRACE[02-22|19:16:10.020] All headers delayed, waiting             peer=ca5d01b0
INFO [02-22|19:16:11.303] Deleted state snapshot leftovers         kind=accounts wiped=8893 elapsed=2.022s
INFO [02-22|19:16:12.214] Imported new chain segment               blocks=3 txs=671 mgas=37.249 elapsed=2.811s      mgasps=13.248 number=11908318 hash="32db31…17e4ba" age=19m27s     dirty=5.14MiB
INFO [02-22|19:16:12.216] Upgrading chain index                    type=bloombits               percentage=40
INFO [02-22|19:16:12.216] Imported new block headers               count=1       elapsed=2.228s      number=11908406 hash="0209a1…11a442"
DEBUG[02-22|19:16:12.216] Inserting downloaded chain               items=87        firstnum=11908319 firsthash="398748…54d37f" lastnum=11908405 lasthash="445a39…8a8e14"
INFO [02-22|19:16:12.957] Deleted state snapshot leftovers         kind=storage  wiped=0    elapsed=1.653s
INFO [02-22|19:16:12.957] Compacting snapshot account area  

And it starts generating the snapshot. However, after another 9 minutes, it finally reports Snap sync complete -- by which time the generator has already been at work for a while.

INFO [02-22|19:25:02.688] Aborting state snapshot generation       root="69e485…8c7900" in="0042a5…807326" at="f3be76…5f8c01" accounts=121217             slots=843141             storage=65.73MiB  elapsed=8m29.614s   eta=139h3m27.721s
INFO [02-22|19:25:02.694] Resuming state snapshot generation       root="8e11f0…0ff2f1" in="0042a5…807326" at="f3be76…5f8c01" accounts=121217             slots=843141             storage=65.73MiB  elapsed=8m29.620s   eta=139h3m33.615s
INFO [02-22|19:25:04.076] Aborting state snapshot generation       root="8e11f0…0ff2f1" in="0042a5…807326" at="f580c0…17d51a" accounts=121217             slots=848398             storage=66.10MiB  elapsed=8m31.002s   eta=139h26m11.189s
INFO [02-22|19:25:04.081] Imported new chain segment               blocks=2      txs=334      mgas=24.995 elapsed=3.502s      mgasps=7.136  number=11908448 hash="80804a…25bf8e" age=2m38s      dirty=217.28MiB
INFO [02-22|19:25:04.081] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=206.21KiB throttle=318
DEBUG[02-22|19:25:04.081] Synchronisation terminated               elapsed=1h56m27.889s
INFO [02-22|19:25:04.081] Fast sync complete, auto disabling 
INFO [02-22|19:25:04.082] Snap sync complete, auto disabling 
INFO [02-22|19:25:04.083] Resuming state snapshot generation       root="b9ebf5…555f64" in="0042a5…807326" at="f580c0…17d51a" accounts=121217             slots=848398             storage=66.10MiB  elapsed=8m31.008s    eta=139h26m17.083s
DEBUG[02-22|19:25:04.112] Synchronising with the network           peer=7166e5a694a38132a01eacc075f77c54f5d4051195aa166fc3b1562f6988d278 eth=65 head="6e3021…164c43" td=21344725780625395672749 mode=full
DEBUG[02-22|19:25:04.112] Retrieving remote chain head             peer=7166e5a6
DEBUG[02-22|19:25:04.137] Remote head identified, no pivot         peer=7166e5a6         number=11908459 hash="6e3021…164c43"
DEBUG[02-22|19:25:04.137] Looking for common ancestor              peer=7166e5a6         local=11908448 remote=11908459

And eventually

INFO [02-24|05:54:56.134] Generating state snapshot                root="01ecf5…29122c" at="f9fe47…b2aa84" accounts=116658241          slots=386540660          storage=32.90GiB   elapsed=34h38m23.060s eta=49m56.31s
ERROR[02-24|05:54:59.299] Generator failed to access storage trie  root="01ecf5…29122c" account="fa03dd…fe0ce8" stroot="0e72a1…351856" err="missing trie node 0e72a13c4752aa80f72dd27b1c859a78c6b90d8df0a898c131b9b2253b351856 (path )"

@holiman
Contributor Author

holiman commented Mar 19, 2021

This problem seems to be more prone to happen when the node is totally clean, i.e., it also has to download bodies. This causes the snap part of the sync to complete before the block part, and that seems to sometimes trigger this behaviour.

@holiman
Contributor Author

holiman commented Mar 19, 2021

On bench03

https://my.papertrailapp.com/systems/5646602342/events?q=program%3Ageth&focus=1309285025577111555&selected=1309285025577111555

[03-18|19:18:17.895] Rebuilding state snapshot
[03-18|19:18:17.898] Committed new head block number=12064387 hash="ac8b02…72635d

https://my.papertrailapp.com/systems/5646602342/events?q=program%3Ageth&focus=1309285820703903747&selected=1309285820703903747

[03-18|19:21:27.471] Snap sync complete, auto disabling

@rjl493456442
Member

Looks like it's the expected behavior:

  1. The state sync is done (by snap sync); the pivot point state is committed, so the background generator starts to run.
  2. The chain sync is still not finished, so it keeps syncing for a while.
  3. The pivot point won't be changed once it's committed.
  4. After several minutes, the chain sync eventually finishes, so it reports "Snap sync complete, auto disabling".

In theory, it can be wrong in this case. The generation target is the pivot point trie and all the updates of the following blocks will be handled by diffToDisk function.
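The diffToDisk idea referenced above can be sketched in miniature as follows. This is an illustrative model only, with invented types; geth's real implementation lives in core/state/snapshot and is far more involved:

```go
package main

import "fmt"

// diskLayer models the persisted snapshot at the pivot-point state.
type diskLayer struct {
	accounts map[string]string
}

// diffLayer models the in-memory account updates produced by one block
// imported after the pivot.
type diffLayer struct {
	accounts map[string]string
}

// diffToDisk flattens a diff layer's account updates onto the disk layer,
// moving the persisted snapshot forward past the pivot point.
func diffToDisk(disk *diskLayer, diff *diffLayer) {
	for addr, val := range diff.accounts {
		disk.accounts[addr] = val
	}
}

func main() {
	disk := &diskLayer{accounts: map[string]string{"0xaa": "nonce=1"}}
	diff := &diffLayer{accounts: map[string]string{
		"0xaa": "nonce=2", // overwritten by the post-pivot block
		"0xbb": "nonce=1", // newly created account
	}}
	diffToDisk(disk, diff)
	fmt.Println(len(disk.accounts)) // 2 accounts after flattening
}
```

The point of the layered design is that the generation target can stay fixed at the pivot trie while later blocks only ever touch diff layers, which are merged down afterwards.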

@holiman
Contributor Author

holiman commented Mar 22, 2021

What kicks it off is that the pivot is committed, but the downloader still has another chunk of blocks after the pivot (afterP). Once the pivot is committed, the afterP blocks are imported "normally". This kicks off the generator, which starts wiping.

But the downloader doesn't exit until after importing the afterP blocks, and only then does the external caller say "Fast sync complete".

However, I don't see anything important that we actually do around "Fast sync complete", other than logging the message. So I still don't see why this would lead to data corruption.
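The ordering in the logs can be modeled as a small event trace. This is a toy sketch of the sequence described in the comments, with invented function names that do not correspond to geth's actual downloader API:

```go
package main

import "fmt"

// syncer records the order of events at the tail end of a snap sync.
type syncer struct {
	events []string
}

func (s *syncer) log(e string) { s.events = append(s.events, e) }

// commitPivot models the pivot block being committed as the new head,
// which immediately kicks off the snapshot generator (and its wiper).
func (s *syncer) commitPivot() {
	s.log("pivot committed")
	s.log("snapshot generator started (wiping)")
}

// importAfterP models the downloader importing the leftover chunk of
// post-pivot blocks through the normal import path.
func (s *syncer) importAfterP(n int) {
	for i := 0; i < n; i++ {
		s.log(fmt.Sprintf("imported post-pivot block %d", i+1))
	}
}

// finish models the downloader exiting, after which the external caller
// finally logs the completion message.
func (s *syncer) finish() {
	s.log("snap sync complete, auto disabling")
}

func main() {
	s := &syncer{}
	s.commitPivot()
	s.importAfterP(2)
	s.finish()
	for _, e := range s.events {
		fmt.Println(e)
	}
}
```

The key point the trace makes visible is that the generator starts at pivot commit, minutes before the "Snap sync complete" message, so the two events are not synchronized with each other.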

holiman added a commit to rjl493456442/go-ethereum that referenced this issue Mar 24, 2021
karalabe added a commit that referenced this issue Mar 24, 2021
* eth/protocols/snap: fix snap sync

* eth/protocols/snap: fix tests

* eth: fix tiny

* eth: update tests

* eth: update tests

* core/state/snapshot: testcase for #22534

* eth/protocols/snap: fix boundary loss on full-but-proven range

* core/state/snapshot: lintfix

* eth: address comment

* eth: fix handler

Co-authored-by: Martin Holst Swende <[email protected]>
Co-authored-by: Péter Szilágyi <[email protected]>
@holiman
Contributor Author

holiman commented Mar 25, 2021

As far as we know, this is solved by #22553

@holiman holiman closed this as completed Mar 25, 2021
atif-konasl pushed a commit to frozeman/pandora-execution-engine that referenced this issue Oct 15, 2021