Ledger panics (index out of bounds) on 2nd new epoch after block height > consensus_params.evidence.max_age_num_blocks #1897

Closed · iskay opened this issue Sep 15, 2023 · 2 comments · Labels: bug

iskay commented Sep 15, 2023

(using Namada version v0.22.0 and Cometbft version 0.37.2)

All validators and full nodes on our local testnet would panic with the following error shortly after block height 100000, which corresponds to the default value of consensus_params.evidence.max_age_num_blocks:

I[2023-09-01|10:07:45.614] finalizing commit of block                   module=consensus height=100421 hash=F87B325287D0864A6B1D5F964B173137D801EBB9BE39C2D26A3C845B4F1396CB root=735053B51C4A798D319906F46C05351CD135F5078602277381606FE089161B0B num_txs=0
2023-09-01T10:07:45.618931Z  INFO namada_core::ledger::storage::wl_storage: Began a new epoch 457
2023-09-01T10:07:45.618952Z  INFO namada_apps::node::ledger::shell::finalize_block: Block height: 100421, epoch: 457, is new epoch: true.
The application panicked (crashed).
Message:  index out of bounds: the len is 456 but the index is 456
Location: apps/src/lib/node/ledger/shell/finalize_block.rs:695

Backtrace omitted.
Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
2023-09-01T10:07:47.281671Z  INFO namada_apps::node::ledger::shims::abcipp_shim: ABCI response channel didn't respond
E[2023-09-01|10:07:47.282] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=consensus
The application panicked (crashed).
Message:  called `Result::unwrap()` on an `Err` value: RecvError(())
Location: /usr/local/cargo/git/checkouts/tower-abci-0d01b039e0b7a0c9/cf9573d/src/server.rs:163

Backtrace omitted.
Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
I[2023-09-01|10:07:47.283] service stop                                 module=abci-client connection=consensus msg="Stopping socketClient service" impl=socketClient
E[2023-09-01|10:07:47.283] error in proxyAppConn.EndBlock               module=state err="read message: EOF"
E[2023-09-01|10:07:47.284] consensus connection terminated. Did the application crash? Please restart CometBFT module=proxy err="read message: EOF"
E[2023-09-01|10:07:47.284] CONSENSUS FAILURE!!!                         module=consensus err="failed to apply block; error read message: EOF" stack="goroutine 262 [running]:\nruntime/debug.Stack()\n\truntime/debug/stack.go:24 +0x65\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine.func2()\n\tgithub.com/cometbft/cometbft/consensus/state.go:732 +0x4c\npanic({0xe92440, 0xc001a434a0})\n\truntime/panic.go:838 +0x207\ngithub.com/cometbft/cometbft/consensus.(*State).finalizeCommit(0xc0000ae000, 0x18845)\n\tgithub.com/cometbft/cometbft/consensus/state.go:1709 +0xf05\ngithub.com/cometbft/cometbft/consensus.(*State).tryFinalizeCommit(0xc0000ae000, 0x18845)\n\tgithub.com/cometbft/cometbft/consensus/state.go:1609 +0x2ff\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit.func1()\n\tgithub.com/cometbft/cometbft/consensus/state.go:1544 +0xa5\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit(0xc0000ae000, 0x18845, 0x0)\n\tgithub.com/cometbft/cometbft/consensus/state.go:1582 +0xcb7\ngithub.com/cometbft/cometbft/consensus.(*State).addVote(0xc0000ae000, 0xc0017345a0, {0xc000108810, 0x28})\n\tgithub.com/cometbft/cometbft/consensus/state.go:2212 +0xcbf\ngithub.com/cometbft/cometbft/consensus.(*State).tryAddVote(0xc0000ae000, 0xc0017345a0, {0xc000108810?, 0xc00027ff00?})\n\tgithub.com/cometbft/cometbft/consensus/state.go:2001 +0x2c\ngithub.com/cometbft/cometbft/consensus.(*State).handleMsg(0xc0000ae000, {{0x127bfa0?, 0xc0012b6820?}, {0xc000108810?, 0x0?}})\n\tgithub.com/cometbft/cometbft/consensus/state.go:861 +0x44b\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine(0xc0000ae000, 0x0)\n\tgithub.com/cometbft/cometbft/consensus/state.go:768 +0x419\ncreated by github.com/cometbft/cometbft/consensus.(*State).OnStart\n\tgithub.com/cometbft/cometbft/consensus/state.go:379 +0x12d\n"
I[2023-09-01|10:07:47.284] service stop                                 module=consensus wal=/root/.local/share/namada/luminara.79474f00ace3ef7ca2712/cometbft/data/cs.wal/wal msg="Stopping baseWAL service" impl=baseWAL
I[2023-09-01|10:07:47.284] signal trapped                               module=main msg="captured terminated, exiting..."
I[2023-09-01|10:07:47.285] service stop                                 module=consensus wal=/root/.local/share/namada/luminara.79474f00ace3ef7ca2712/cometbft/data/cs.wal/wal msg="Stopping Group service" impl=Group
I[2023-09-01|10:07:47.285] service stop                                 module=main msg="Stopping Node service" impl=Node
I[2023-09-01|10:07:47.285] Stopping Node                                module=main

At the start of a new epoch, PoS recalculates inflation with respect to the previous epoch by reading pred_epochs.first_block_heights at the index equal to the previous epoch number:

let first_block_of_last_epoch = self
    .wl_storage
    .storage
    .block
    .pred_epochs
    .first_block_heights[last_epoch.0 as usize]
    .0;

but this doesn't account for the fact that pred_epochs is trimmed to keep only epochs that began fewer than consensus_params.evidence.max_age_num_blocks blocks ago. Once that height is reached, indices no longer correspond directly to epoch numbers: the first epoch transition after trimming begins reads the current epoch's entry instead of the previous one's, and the second transition indexes past the end of the vector and panics.
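
To illustrate with a minimal, self-contained sketch (the struct below is a simplified stand-in for the ledger's actual pred_epochs type, and the epoch numbers and heights are made up for the example):

// Simplified stand-in for block.pred_epochs; not the real Namada type.
struct Epochs {
    // Epoch number of the oldest entry still retained after trimming.
    first_known_epoch: u64,
    // First block height of each retained epoch, oldest first.
    first_block_heights: Vec<u64>,
}

fn main() {
    // Suppose epochs 0..=454 have been trimmed away; only 455 and 456 remain.
    let pred_epochs = Epochs {
        first_known_epoch: 455,
        first_block_heights: vec![100_201, 100_421], // illustrative heights
    };
    let last_epoch: u64 = 456;

    // Buggy lookup: uses the epoch number as a direct index. The vector only
    // has 2 entries, so index 456 is the "index out of bounds" panic above.
    // let h = pred_epochs.first_block_heights[last_epoch as usize];

    // Offset lookup: subtract the first epoch still known before indexing.
    let idx = (last_epoch - pred_epochs.first_known_epoch) as usize;
    assert_eq!(pred_epochs.first_block_heights[idx], 100_421);
}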

Steps:

  1. Network nodes consistently panic at block height ~100200, at the start of a new epoch.
  2. Halving the epoch duration produces roughly double the number of epochs, but nodes still panic at the same block height.
  3. Changing the constant evidence_max_age_num_blocks in core/src/ledger/storage/wl_storage.rs from 100000 to 1000 makes nodes panic on the second new epoch after height 1000 instead.
  4. After making the following change and relaunching the network, nodes no longer panic at heights > consensus_params.evidence.max_age_num_blocks. Our localnet is currently at block 111000 and counting:
        // Index of the last epoch, offset by the first epoch still retained
        // in pred_epochs (older entries have been trimmed away)
        let last_epoch_index: usize = last_epoch.0 as usize
            - self.wl_storage.storage.block.pred_epochs.first_known_epoch.0 as usize;
        // Get the first block height of the last epoch
        let first_block_of_last_epoch = self
            .wl_storage
            .storage
            .block
            .pred_epochs
            .first_block_heights[last_epoch_index]
            .0;

(I'm not sure this is the 'right' way to modify the code, but it at least seems to confirm the cause and a possible fix.)
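
For robustness, the same lookup could also be written with checked subtraction and slice::get, so that an inconsistent pred_epochs surfaces as a descriptive error rather than a raw out-of-bounds panic. This is only a sketch of the idea, not necessarily what the eventual fix does; the error messages are illustrative:

// Sketch: offset the index and fail with descriptive messages if the
// epoch bookkeeping is ever inconsistent (messages are illustrative).
let first_known_epoch = self.wl_storage.storage.block.pred_epochs.first_known_epoch;
let last_epoch_index = last_epoch
    .0
    .checked_sub(first_known_epoch.0)
    .expect("the last epoch should not precede the first known epoch")
    as usize;
let first_block_of_last_epoch = self
    .wl_storage
    .storage
    .block
    .pred_epochs
    .first_block_heights
    .get(last_epoch_index)
    .expect("the first block height of the last epoch should be known")
    .0;

This still halts the node if the invariant is violated, but the message points straight at the epoch bookkeeping instead of a bare index.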

iskay added the bug label on Sep 15, 2023
Fraccaman (Member) commented:
@iskay thank you for the detailed report! We will look into this!

Fraccaman (Member) commented:
Closed by #1898
