Ledger panics (index out of bounds) on 2nd new epoch after block height > consensus_params.evidence.max_age_num_blocks #1897

Closed · iskay opened this issue Sep 15, 2023 · 2 comments · Labels: bug

iskay commented Sep 15, 2023

(using Namada version v0.22.0 and Cometbft version 0.37.2)

All validators and full nodes on our local testnet would panic with the following error shortly after block height 100000, which corresponds to the default value of consensus_params.evidence.max_age_num_blocks:

I[2023-09-01|10:07:45.614] finalizing commit of block                   module=consensus height=100421 hash=F87B325287D0864A6B1D5F964B173137D801EBB9BE39C2D26A3C845B4F1396CB root=735053B51C4A798D319906F46C05351CD135F5078602277381606FE089161B0B num_txs=0
2023-09-01T10:07:45.618931Z  INFO namada_core::ledger::storage::wl_storage: Began a new epoch 457
2023-09-01T10:07:45.618952Z  INFO namada_apps::node::ledger::shell::finalize_block: Block height: 100421, epoch: 457, is new epoch: true.
The application panicked (crashed).
Message:  index out of bounds: the len is 456 but the index is 456
Location: apps/src/lib/node/ledger/shell/finalize_block.rs:695

Backtrace omitted.
Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
2023-09-01T10:07:47.281671Z  INFO namada_apps::node::ledger::shims::abcipp_shim: ABCI response channel didn't respond
E[2023-09-01|10:07:47.282] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=consensus
The application panicked (crashed).
Message:  called `Result::unwrap()` on an `Err` value: RecvError(())
Location: /usr/local/cargo/git/checkouts/tower-abci-0d01b039e0b7a0c9/cf9573d/src/server.rs:163

Backtrace omitted.
Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
I[2023-09-01|10:07:47.283] service stop                                 module=abci-client connection=consensus msg="Stopping socketClient service" impl=socketClient
E[2023-09-01|10:07:47.283] error in proxyAppConn.EndBlock               module=state err="read message: EOF"
E[2023-09-01|10:07:47.284] consensus connection terminated. Did the application crash? Please restart CometBFT module=proxy err="read message: EOF"
E[2023-09-01|10:07:47.284] CONSENSUS FAILURE!!!                         module=consensus err="failed to apply block; error read message: EOF" stack="goroutine 262 [running]:\nruntime/debug.Stack()\n\truntime/debug/stack.go:24 +0x65\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine.func2()\n\tgithub.com/cometbft/cometbft/consensus/state.go:732 +0x4c\npanic({0xe92440, 0xc001a434a0})\n\truntime/panic.go:838 +0x207\ngithub.com/cometbft/cometbft/consensus.(*State).finalizeCommit(0xc0000ae000, 0x18845)\n\tgithub.com/cometbft/cometbft/consensus/state.go:1709 +0xf05\ngithub.com/cometbft/cometbft/consensus.(*State).tryFinalizeCommit(0xc0000ae000, 0x18845)\n\tgithub.com/cometbft/cometbft/consensus/state.go:1609 +0x2ff\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit.func1()\n\tgithub.com/cometbft/cometbft/consensus/state.go:1544 +0xa5\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit(0xc0000ae000, 0x18845, 0x0)\n\tgithub.com/cometbft/cometbft/consensus/state.go:1582 +0xcb7\ngithub.com/cometbft/cometbft/consensus.(*State).addVote(0xc0000ae000, 0xc0017345a0, {0xc000108810, 0x28})\n\tgithub.com/cometbft/cometbft/consensus/state.go:2212 +0xcbf\ngithub.com/cometbft/cometbft/consensus.(*State).tryAddVote(0xc0000ae000, 0xc0017345a0, {0xc000108810?, 0xc00027ff00?})\n\tgithub.com/cometbft/cometbft/consensus/state.go:2001 +0x2c\ngithub.com/cometbft/cometbft/consensus.(*State).handleMsg(0xc0000ae000, {{0x127bfa0?, 0xc0012b6820?}, {0xc000108810?, 0x0?}})\n\tgithub.com/cometbft/cometbft/consensus/state.go:861 +0x44b\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine(0xc0000ae000, 0x0)\n\tgithub.com/cometbft/cometbft/consensus/state.go:768 +0x419\ncreated by github.com/cometbft/cometbft/consensus.(*State).OnStart\n\tgithub.com/cometbft/cometbft/consensus/state.go:379 +0x12d\n"
I[2023-09-01|10:07:47.284] service stop                                 module=consensus wal=/root/.local/share/namada/luminara.79474f00ace3ef7ca2712/cometbft/data/cs.wal/wal msg="Stopping baseWAL service" impl=baseWAL
I[2023-09-01|10:07:47.284] signal trapped                               module=main msg="captured terminated, exiting..."
I[2023-09-01|10:07:47.285] service stop                                 module=consensus wal=/root/.local/share/namada/luminara.79474f00ace3ef7ca2712/cometbft/data/cs.wal/wal msg="Stopping Group service" impl=Group
I[2023-09-01|10:07:47.285] service stop                                 module=main msg="Stopping Node service" impl=Node
I[2023-09-01|10:07:47.285] Stopping Node                                module=main

At the start of a new epoch, PoS recalculates inflation with respect to the previous epoch by reading pred_epochs.first_block_heights at the index equal to the previous epoch number:

let first_block_of_last_epoch = self
    .wl_storage
    .storage
    .block
    .pred_epochs
    .first_block_heights[last_epoch.0 as usize]
    .0;

but this doesn't account for the fact that pred_epochs is trimmed to keep only epochs that began fewer than consensus_params.evidence.max_age_num_blocks blocks ago. Once that height is reached, indices no longer correspond directly to epoch numbers: the first epoch transition after trimming begins reads the current epoch's entry instead of the previous one's, and the second transition indexes past the end of the vector and panics.
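
To illustrate with a minimal, self-contained sketch (the struct below is a simplified stand-in for the ledger's actual pred_epochs type, and the epoch numbers and heights are made up for the example):

// Simplified stand-in for block.pred_epochs; not the real Namada type.
struct Epochs {
    // Epoch number of the oldest entry still retained after trimming.
    first_known_epoch: u64,
    // First block height of each retained epoch, oldest first.
    first_block_heights: Vec<u64>,
}

fn main() {
    // Suppose epochs 0..=454 have been trimmed away; only 455 and 456 remain.
    let pred_epochs = Epochs {
        first_known_epoch: 455,
        first_block_heights: vec![100_201, 100_421], // illustrative heights
    };
    let last_epoch: u64 = 456;

    // Buggy lookup: uses the epoch number as a direct index. The vector only
    // has 2 entries, so index 456 is the "index out of bounds" panic above.
    // let h = pred_epochs.first_block_heights[last_epoch as usize];

    // Offset lookup: subtract the first epoch still known before indexing.
    let idx = (last_epoch - pred_epochs.first_known_epoch) as usize;
    assert_eq!(pred_epochs.first_block_heights[idx], 100_421);
}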

Steps:

  1. Network nodes consistently panic at block height ~100200, at the start of a new epoch.
  2. Halving the epoch duration produces roughly double the number of epochs, but nodes still panic at the same block height.
  3. Changing the constant evidence_max_age_num_blocks in core/src/ledger/storage/wl_storage.rs from 100000 to 1000 makes nodes panic on the second new epoch after height 1000 instead.
  4. After making the following change and relaunching the network, nodes no longer panic at heights > consensus_params.evidence.max_age_num_blocks. Our localnet is currently at block 111000 and counting:
        // Index of the last epoch, offset by the first epoch still retained
        // in pred_epochs (older entries have been trimmed away)
        let last_epoch_index: usize = last_epoch.0 as usize
            - self.wl_storage.storage.block.pred_epochs.first_known_epoch.0 as usize;
        // Get the first block height of the last epoch
        let first_block_of_last_epoch = self
            .wl_storage
            .storage
            .block
            .pred_epochs
            .first_block_heights[last_epoch_index]
            .0;

(I'm not sure this is the 'right' way to modify the code, but it at least seems to confirm the cause and a possible fix.)
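
For robustness, the same lookup could also be written with checked subtraction and slice::get, so that an inconsistent pred_epochs surfaces as a descriptive error rather than a raw out-of-bounds panic. This is only a sketch of the idea, not necessarily what the eventual fix does; the error messages are illustrative:

// Sketch: offset the index and fail with descriptive messages if the
// epoch bookkeeping is ever inconsistent (messages are illustrative).
let first_known_epoch = self.wl_storage.storage.block.pred_epochs.first_known_epoch;
let last_epoch_index = last_epoch
    .0
    .checked_sub(first_known_epoch.0)
    .expect("the last epoch should not precede the first known epoch")
    as usize;
let first_block_of_last_epoch = self
    .wl_storage
    .storage
    .block
    .pred_epochs
    .first_block_heights
    .get(last_epoch_index)
    .expect("the first block height of the last epoch should be known")
    .0;

This still halts the node if the invariant is violated, but the message points straight at the epoch bookkeeping instead of a bare index.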

iskay added the bug label on Sep 15, 2023
Fraccaman (Member) commented:
@iskay thank you for the detailed report! We will look into this!

Fraccaman (Member) commented:
Closed by #1898
