Automatic sync fails after transient error syncing to Freezer DB with "missing parent" error #22112

vdamle · 2021-01-04T20:20:25Z

System information

Geth version: 1.9.7
OS & Version: Linux (Kubernetes on Azure) - running clique/POA
Commit hash : stable

Expected behaviour

After a transient failure to write blocks to freezer the node restarts (as expected), and should be able to sync with it's peers and catch up to the rest of the chain.

Actual behaviour

Node truncates freezer to a very old block (due to a missing full block) and when it receives a batch of blocks from a peer, it is unable to sync those blocks with a missing parent error. As seen in the logs below, the node experiences a temporary failure to flush entries to the freezer. On a restart, it finds that:

Tip of freezer is at 150571
Tip of the chain is at 240572
Most recent full block is 33605

This is a non-archive node, but even then it is strange that the most recent full block should be so far behind. This results in the freezer getting truncated and the node attempts to sync the most recent blocks. However, that action fails because of a missing parent error on a batch of 26 most recent blocks. The same error keeps repeated across a rolling window of blocks. Eventually, we were able to recover the node only after performing a fresh sync (wiping all chain data).

INFO [12-23|22:44:14.269] Deep froze chain segment                 blocks=12    elapsed=179.503ms number=150533 hash=d41e42…b264c1
INFO [12-23|22:45:14.452] Deep froze chain segment                 blocks=12    elapsed=182.750ms number=150545 hash=103e90…ef8568
INFO [12-23|22:46:42.021] Imported new chain segment               blocks=1     txs=0 mgas=0.000 elapsed=2.394ms   mgasps=0.000  number=240564 hash=b5f25c…030f1b dirty=87.87KiB
INFO [12-23|22:46:47.022] Imported new chain segment               blocks=1     txs=0 mgas=0.000 elapsed=2.617ms   mgasps=0.000  number=240565 hash=fe3a1b…a8e538 dirty=87.87KiB
CRIT [12-23|22:47:21.184] Failed to flush frozen tables            err="[sync /qdata/ethereum/geth/chaindata/ancient/headers.cidx: bad file descriptor]"

WARN [12-23|22:47:27.767] Head state missing, repairing chain      number=240571 hash=ceb1b7…c0d84f
INFO [12-23|22:47:27.932] Deep froze chain segment                 blocks=13 elapsed=695.626ms number=150570 hash=3cb347…a1c6bd
DEBUG[12-23|22:48:28.127] Ancient blocks frozen already            number=240571 hash=ceb1b7…c0d84f frozen=150571
DEBUG[12-23|22:49:28.128] Ancient blocks frozen already            number=240571 hash=ceb1b7…c0d84f frozen=150571
DEBUG[12-23|22:50:28.128] Ancient blocks frozen already            number=240571 hash=ceb1b7…c0d84f frozen=150571
INFO [12-23|22:50:40.494] Rewound blockchain to past state         number=33604  hash=8bdb53…4cef1b
INFO [12-23|22:50:40.638] Loaded most recent local header          number=240571 hash=ceb1b7…c0d84f td=240572 age=3m23s
INFO [12-23|22:50:40.638] Loaded most recent local full block      number=33604  hash=8bdb53…4cef1b td=33605  age=1w4d23h
INFO [12-23|22:50:40.638] Loaded most recent local fast block      number=240571 hash=ceb1b7…c0d84f td=240572 age=3m23s
WARN [12-23|22:51:11.596] Truncating freezer table                 database=/qdata/ethereum/geth/chaindata/ancient table=diffs    items=150571 limit=33605
WARN [12-23|22:51:11.750] Truncating freezer table                 database=/qdata/ethereum/geth/chaindata/ancient table=headers  items=150571 limit=33605
WARN [12-23|22:51:11.866] Truncating freezer table                 database=/qdata/ethereum/geth/chaindata/ancient table=hashes   items=150571 limit=33605
WARN [12-23|22:51:11.953] Truncating freezer table                 database=/qdata/ethereum/geth/chaindata/ancient table=bodies   items=150571 limit=33605
WARN [12-23|22:51:12.044] Truncating freezer table                 database=/qdata/ethereum/geth/chaindata/ancient table=receipts items=150571 limit=33605

DEBUG[12-23|22:51:13.280] Inserting downloaded chain               items=49     firstnum=240566 firsthash=6180ef…9759bf lastnum=240614 lasthash=71d5bb…52d486
DEBUG[12-23|22:51:13.606] Pruned ancestor, inserting as sidechain  number=240566 hash=6180ef…9759bf
DEBUG[12-23|22:51:13.619] Injected sidechain block                 number=240572 hash=ade660…1158f5 diff=1     elapsed=127.799µs txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.622] Injected sidechain block                 number=240573 hash=e9e493…494f1e diff=1     elapsed=96.699µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.624] Injected sidechain block                 number=240574 hash=81aa3d…1cde52 diff=1     elapsed=101.899µs txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.626] Injected sidechain block                 number=240575 hash=bc0ce7…617837 diff=1     elapsed=86.999µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.628] Injected sidechain block                 number=240576 hash=3bbdea…d91170 diff=1     elapsed=85.099µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.630] Injected sidechain block                 number=240577 hash=7b80a6…ebc272 diff=1     elapsed=148.099µs txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.633] Injected sidechain block                 number=240578 hash=3ff37a…36382c diff=1     elapsed=91.699µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.635] Injected sidechain block                 number=240579 hash=dfebcb…78379f diff=1     elapsed=96.199µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.637] Injected sidechain block                 number=240580 hash=a02165…a74d1c diff=1     elapsed=79.799µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.639] Injected sidechain block                 number=240581 hash=3b4cc0…7e87d5 diff=1     elapsed=108.499µs txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.641] Injected sidechain block                 number=240582 hash=cde4a8…e2983f diff=1     elapsed=87.999µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.644] Injected sidechain block                 number=240583 hash=0146ad…71b09e diff=1     elapsed=111.199µs txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.646] Injected sidechain block                 number=240584 hash=2d2e1e…bd08dd diff=1     elapsed=138.599µs txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.648] Injected sidechain block                 number=240585 hash=b35b4b…c04883 diff=1     elapsed=111.099µs txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.651] Injected sidechain block                 number=240586 hash=a38f0e…68a7af diff=1     elapsed=88.999µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.653] Injected sidechain block                 number=240587 hash=19247a…2c90f0 diff=1     elapsed=78.799µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.655] Injected sidechain block                 number=240588 hash=391dda…476f21 diff=1     elapsed=106.498µs txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.657] Injected sidechain block                 number=240589 hash=46463c…70c966 diff=1     elapsed=150.098µs txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.660] Injected sidechain block                 number=240590 hash=aee9e4…e46c1e diff=1     elapsed=63.799µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.661] Injected sidechain block                 number=240591 hash=efe965…ea8c03 diff=1     elapsed=68.099µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.664] Injected sidechain block                 number=240592 hash=b1b592…92e578 diff=1     elapsed=63.6µs    txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.666] Injected sidechain block                 number=240593 hash=da4bd5…5b6677 diff=1     elapsed=95.399µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.668] Injected sidechain block                 number=240594 hash=71b9b8…1f1687 diff=1     elapsed=66.5µs    txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.670] Injected sidechain block                 number=240595 hash=8bf6a9…553deb diff=1     elapsed=79.2µs    txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.672] Injected sidechain block                 number=240596 hash=d0cf30…3a04b1 diff=1     elapsed=98.499µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.674] Injected sidechain block                 number=240597 hash=1bfb25…5bce90 diff=1     elapsed=83.999µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.676] Injected sidechain block                 number=240598 hash=f69b76…a99954 diff=1     elapsed=83.999µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.679] Injected sidechain block                 number=240599 hash=04c285…abf8e1 diff=1     elapsed=91.599µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.681] Injected sidechain block                 number=240600 hash=d48283…acdab2 diff=1     elapsed=77.3µs    txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.683] Injected sidechain block                 number=240601 hash=55bb47…6692de diff=1     elapsed=84.999µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.685] Injected sidechain block                 number=240602 hash=f7de83…70ba11 diff=1     elapsed=67.699µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.690] Injected sidechain block                 number=240603 hash=02b770…dd89fd diff=1     elapsed=55.1µs    txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.694] Injected sidechain block                 number=240604 hash=389ac9…1f111c diff=1     elapsed=67.199µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.698] Injected sidechain block                 number=240605 hash=542fc1…83922e diff=1     elapsed=57.499µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.702] Injected sidechain block                 number=240606 hash=c5c2e3…b1c1eb diff=1     elapsed=77.899µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.706] Injected sidechain block                 number=240607 hash=6dcb2c…dfc10f diff=1     elapsed=81.599µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.710] Injected sidechain block                 number=240608 hash=052e05…9c5a5f diff=1     elapsed=116.699µs txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.714] Injected sidechain block                 number=240609 hash=a391d4…6ba0bc diff=1     elapsed=59.1µs    txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.717] Injected sidechain block                 number=240610 hash=90cfcb…d9012e diff=1     elapsed=187.199µs txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.721] Injected sidechain block                 number=240611 hash=06fb4d…8cadee diff=1     elapsed=53.1µs    txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.725] Injected sidechain block                 number=240612 hash=6affa9…095b4a diff=1     elapsed=67.699µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.729] Injected sidechain block                 number=240613 hash=a678cf…ecda76 diff=1     elapsed=101.899µs txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
DEBUG[12-23|22:51:13.732] Injected sidechain block                 number=240614 hash=71d5bb…52d486 diff=1     elapsed=86.899µs  txs=0 gas=0 uncles=0 root=86bc5d…3ef70d
INFO [12-23|22:51:14.035] Network syncing, will start miner afterwards
DEBUG[12-23|22:51:18.023] Downloaded item processing failed on sidechain import index=49 err="missing parent"
DEBUG[12-23|22:51:18.023] Reset ancient limit to zero
DEBUG[12-23|22:51:18.023] Synchronisation terminated               elapsed=4.751s
DEBUG[12-23|22:51:18.023] Reset ancient limit to zero
WARN [12-23|22:51:18.023] Synchronisation failed, dropping peer    peer=9ef106b661dd09ab err="retrieved hash chain is invalid"
DEBUG[12-23|22:51:18.023] Removing Ethereum peer                   peer=9ef106b661dd09ab

Steps to reproduce the behaviour

We do not know what causes the original issue with flushing contents to freezer DB. However, if one could re-create such a failure via a unit test, it should be pretty straight forward to see the rest of the failure.

Backtrace

None

The text was updated successfully, but these errors were encountered:

DGKSK8LIFE · 2021-01-04T20:42:47Z

Maybe this is instigated by containerization/k8s? Are you using kubectl proxy by any chance?

vdamle · 2021-01-04T21:01:55Z

No, we are not. In general, there is no failure to sync blocks when adding new nodes to an existing chain (even one which is > 90k blocks). So direct and ongoing push to the freezer DB works just fine. The CRIT error when syncing data to the freezer could be due to the underlying filesystem -- but that shouldn't mean that a subsequent recovery/catch up isn't possible, should it?

DGKSK8LIFE · 2021-01-04T21:06:16Z

Well if you're hosting a node, you have to sync (which requires downloading things off the Ethereum blockchain). So a failure in the underlying filesystem could instigate an error in syncing. So if the filesystem is broken/there are issues with it, then a subsequent recovery would probably error out as well, right?

vdamle · 2021-01-04T21:27:56Z

I mentioned in my initial summary that we are running a private chain running clique/POA. Also a re-sync of the same node works just fine (stop geth, wipe data and sync again)

holiman · 2021-01-05T08:01:15Z

This is the root of you problems:

CRIT [12-23|22:47:21.184] Failed to flush frozen tables            err="[sync /qdata/ethereum/geth/chaindata/ancient/headers.cidx: bad file descriptor]"

Apparently, this failure used the CRIT log level, which is not good -- since it causes an abnormal shutdown, which can cause further corruption.

I suspect that the missing parent error comes from a mismatch between what's in leveldb versus what's in ancient db.

…um#22112)

karalabe · 2021-01-05T11:59:29Z

Btw, you're running a fairly old release (20 releases old). I think it's probably time to upgrade before trying again :)

holiman · 2021-01-05T14:17:21Z

Oh, I just saw that you reported this against v1.9.7.
In #21409 , which was merged for 1.9.20, @karalabe reworked a lot of the repair cases, to ensure that geth properly handled all the quirky scenarios that can happen on an abrupt shutdown.

Whatever happened on 1.9.7, it's not really actionable, because it's too old, and with the major refactorings done since then, there's really no point in digging further on this issue.

I'll close this, recommend that you use a more recent version of geth, and if you experience a similar issue again, please file a new ticket.

holiman · 2021-01-05T14:17:51Z

Oh and I just saw that @karalabe already commented just that :/

…um#22112)

vdamle added the type:bug label Jan 4, 2021

holiman added a commit to holiman/go-ethereum that referenced this issue Jan 5, 2021

core/rawdb: don't hard-exit from freezer on fsync failure (ref ethere…

5e1b86a

…um#22112)

holiman mentioned this issue Jan 5, 2021

core/rawdb: avoid exiting geth from freezer on fsync failure (ref #22112) #22118

Closed

holiman closed this as completed Jan 5, 2021

vdamle pushed a commit to kaleido-io/go-ethereum that referenced this issue Jan 5, 2021

core/rawdb: don't hard-exit from freezer on fsync failure (ref ethere…

ca45504

…um#22112)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automatic sync fails after transient error syncing to Freezer DB with "missing parent" error #22112

Automatic sync fails after transient error syncing to Freezer DB with "missing parent" error #22112

vdamle commented Jan 4, 2021

DGKSK8LIFE commented Jan 4, 2021

vdamle commented Jan 4, 2021 •

edited

Loading

DGKSK8LIFE commented Jan 4, 2021 •

edited

Loading

vdamle commented Jan 4, 2021 •

edited

Loading

holiman commented Jan 5, 2021

karalabe commented Jan 5, 2021

holiman commented Jan 5, 2021

holiman commented Jan 5, 2021

Automatic sync fails after transient error syncing to Freezer DB with "missing parent" error #22112

Automatic sync fails after transient error syncing to Freezer DB with "missing parent" error #22112

Comments

vdamle commented Jan 4, 2021

System information

Expected behaviour

Actual behaviour

Steps to reproduce the behaviour

Backtrace

DGKSK8LIFE commented Jan 4, 2021

vdamle commented Jan 4, 2021 • edited Loading

DGKSK8LIFE commented Jan 4, 2021 • edited Loading

vdamle commented Jan 4, 2021 • edited Loading

holiman commented Jan 5, 2021

karalabe commented Jan 5, 2021

holiman commented Jan 5, 2021

holiman commented Jan 5, 2021

vdamle commented Jan 4, 2021 •

edited

Loading

DGKSK8LIFE commented Jan 4, 2021 •

edited

Loading

vdamle commented Jan 4, 2021 •

edited

Loading