
Bankhash mismatch running #34623 against mainnet #34876

Closed

HaoranYi opened this issue Jan 22, 2024 · 19 comments

@HaoranYi
Contributor
HaoranYi commented Jan 22, 2024

Problem

My node running a recent commit (6a9f729) from master against mainnet crashed with a bankhash mismatch on slot 243108000 (https://explorer.solana.com/block/243108000).

   0: rust_begin_unwind
             at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
   2: hashbrown::map::HashMap<K,V,S,A>::retain
   3: solana_core::replay_stage::ReplayStage::dump_then_repair_correct_slots
   4: solana_core::replay_stage::ReplayStage::new::{{closure}}
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
[2024-01-21T03:16:11.202872991Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solReplayStage" one=1i message="panicked at core/src/replay_stage.rs:1414:25:
    We have tried to repair duplicate slot: 243108000 more than 10 times and are unable to freeze a block with bankhash HMk3tMMympeHyBRpoDrqMLxfLjSUZvNiLYx6i2JCuNGZ, instead we have a block with bankhash Some(7h4iAoXfX6KwitVhazib3fCnr3J4koFu7ArJZTY5heFZ). This is most likely a bug in the runtime. At this point manual intervention is needed to make progress. Exiting" location="core/src/replay_stage.rs:1414:25" version="1.18.0 (src:00000000; feat:4046558620, client:SolanaLabs)"
[2024-01-21T03:16:11.580033138Z INFO  solana_metrics::metrics] datapoint: cluster_slots_service-timing lowest_slot_elapsed=2350i process_cluster_slots_updates_elapsed=237083i

Not sure if any of our canary testing nodes caught this error too?

Proposed Solution

Not sure yet, but I will investigate.

@HaoranYi self-assigned this Jan 22, 2024
@HaoranYi changed the title from "Bankhash mismatch running e2c2029ac4974687f38ec711fa67512fbd710386 against mainnet" to "Bankhash mismatch running https://github.com/solana-labs/solana/pull/34623 against mainnet" Jan 22, 2024
@HaoranYi changed the title from "Bankhash mismatch running https://github.com/solana-labs/solana/pull/34623 against mainnet" to "Bankhash mismatch running #34623 against mainnet" Jan 22, 2024
@HaoranYi
Contributor Author

243107999 matches
good: bank frozen: 243107999 hash: 7X1s4w65yBnwz18fKg5tamNrSHnK1AZyBhQEZRjctDhc
bad: bank frozen: 243107999 hash: 7X1s4w65yBnwz18fKg5tamNrSHnK1AZyBhQEZRjctDhc

243108000 mismatches
good: bank frozen: 243108000 hash: HMk3tMMympeHyBRpoDrqMLxfLjSUZvNiLYx6i2JCuNGZ
bad: bank frozen: 243108000 hash: 7h4iAoXfX6KwitVhazib3fCnr3J4koFu7ArJZTY5heFZ

@steviez
Contributor

steviez commented Jan 22, 2024

And you have confirmed it with and without the commit you linked above, 6a9f729?

@HaoranYi
Contributor Author

bad

[2024-01-21T03:16:02.013644297Z INFO  solana_runtime::bank] 
bank frozen: 243108000 hash: 7h4iAoXfX6KwitVhazib3fCnr3J4koFu7ArJZTY5heFZ 
accounts_delta: GCj8CFVaqeHwkVgaLb6f9PaPYeZMofBsKL1op3LXUJJ8 signature_count: 500 last_blockhash: Ant9w6LfnGbG4Jpm5VMxmyRpzN3myx3dDpAQSE3kTwcs 
capitalization: 567558615827727708, epoch_accounts_hash: xwb6iXsG3vdHgUTAFkgUiACYPwYkW8F7465CNL9WVaC, 
stats: BankHashStats { num_updated_accounts: 1465, num_removed_accounts: 14, num_lamports_stored: 39484939731971, total_data_len: 10492554, num_executable_accounts: 1 }

good

[2024-01-21T03:16:02.022355358Z INFO  solana_runtime::bank] 
bank frozen: 243108000 hash: HMk3tMMympeHyBRpoDrqMLxfLjSUZvNiLYx6i2JCuNGZ 
accounts_delta: GCj8CFVaqeHwkVgaLb6f9PaPYeZMofBsKL1op3LXUJJ8 signature_count: 500 last_blockhash: Ant9w6LfnGbG4Jpm5VMxmyRpzN3myx3dDpAQSE3kTwcs 
capitalization: 567558615826502748, epoch_accounts_hash: 824tUYuwAKFv2kKz5m2Xf8YHNYhYUhsqwohjmrvTp3Be, 
stats: BankHashStats { num_updated_accounts: 1465, num_removed_accounts: 14, num_lamports_stored: 39484939731971, total_data_len: 10492554, num_executable_accounts: 1 }

@HaoranYi
Contributor Author

> And you have confirmed it with and without the commit you linked above, 6a9f729?

Yes. #34623 is not the problem.
I have verified with the ledger tool that, with and without #34623, we get the same bad bank hash 7h4iAoXfX6KwitVhazib3fCnr3J4koFu7ArJZTY5heFZ.

@HaoranYi
Contributor Author

It seems that on 243108000 the epoch_accounts_hash mismatched, which could cause the bank hash mismatch?

@steviez
Contributor

steviez commented Jan 22, 2024

> Yes. #34623 is not the problem.
> I have verified with the ledger tool that, with and without #34623, we get the same bad bank hash 7h4iAoXfX6KwitVhazib3fCnr3J4koFu7ArJZTY5heFZ.

Gotcha, maybe we need to bisect then. We seemingly don't have a last known good commit, but I also would have expected the canaries to hit this if the issue had been introduced long ago.

And just making sure, you were running 6a9f729 with no other modifications?

@HaoranYi
Contributor Author

HaoranYi commented Jan 22, 2024

@brooksprumo My understanding is that if the epoch_accounts_hash mismatched, then at the bank where we fold the epoch_accounts_hash into the bank hash (i.e. 3/4 into the epoch?), we will get a bank hash mismatch? Is that right?

@brooksprumo
Contributor

brooksprumo commented Jan 22, 2024

The epoch accounts hash is a (full) accounts hash calculation taken at the rooted slot 1/4 into the epoch. That value is saved into accounts-db and then hashed into the bank (when freezing) that's 3/4 into the epoch.

If the EAH values are different, that would imply an account hash calculation mismatch.

And yes, if the EAH values are different, that will cause the bank hashes to be different.
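To make the timing concrete, here is a minimal sketch of where those 1/4 and 3/4 offsets fall, assuming only the offsets described above and the epoch boundaries quoted later in this thread (this is not the actual runtime logic):

fn main() {
    // Mainnet epoch boundaries from this thread (see the Python check below).
    let epoch_start: u64 = 242_784_000;
    let epoch_end: u64 = 243_215_999;
    let slots_in_epoch = epoch_end - epoch_start + 1; // 432_000 slots

    // The EAH is calculated from the rooted bank ~1/4 into the epoch...
    let eah_calculation_slot = epoch_start + slots_in_epoch / 4; // 242_892_000
    // ...and mixed into the hash of the bank frozen ~3/4 into the epoch.
    let eah_inclusion_slot = epoch_start + slots_in_epoch * 3 / 4; // 243_108_000

    println!("calculate EAH at ~{eah_calculation_slot}, include it at ~{eah_inclusion_slot}");
    // The inclusion slot is exactly the slot where the mismatch surfaced.
    assert_eq!(eah_inclusion_slot, 243_108_000);
}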

@steviez
Contributor

steviez commented Jan 22, 2024

Here is the snippet where the EAH gets mixed into the bank hash, which I was looking up for my own understanding as well:

solana/runtime/src/bank.rs

Lines 6963 to 7004 in 9db4e84

let mut hash = hashv(&[
    self.parent_hash.as_ref(),
    accounts_delta_hash.0.as_ref(),
    &signature_count_buf,
    self.last_blockhash().as_ref(),
]);
let epoch_accounts_hash = self.should_include_epoch_accounts_hash().then(|| {
    let epoch_accounts_hash = self.wait_get_epoch_accounts_hash();
    hash = hashv(&[hash.as_ref(), epoch_accounts_hash.as_ref().as_ref()]);
    epoch_accounts_hash
});
let buf = self
    .hard_forks
    .read()
    .unwrap()
    .get_hash_data(slot, self.parent_slot());
if let Some(buf) = buf {
    let hard_forked_hash = extend_and_hash(&hash, &buf);
    warn!("hard fork at slot {slot} by hashing {buf:?}: {hash} => {hard_forked_hash}");
    hash = hard_forked_hash;
}
let bank_hash_stats = self
    .rc
    .accounts
    .accounts_db
    .get_bank_hash_stats(slot)
    .expect("No bank hash stats were found for this bank, that should not be possible");
info!(
    "bank frozen: {slot} hash: {hash} accounts_delta: {} signature_count: {} last_blockhash: {} capitalization: {}{}, stats: {bank_hash_stats:?}",
    accounts_delta_hash.0,
    self.signature_count(),
    self.last_blockhash(),
    self.capitalization(),
    if let Some(epoch_accounts_hash) = epoch_accounts_hash {
        format!(", epoch_accounts_hash: {:?}", epoch_accounts_hash.as_ref())
    } else {
        "".to_string()
    }
);

@HaoranYi
Contributor Author

Yeah. 243108000 is at 75% of the epoch.

>>> x = 243108000
>>> begin = 242784000
>>> end = 243215999
>>> (x-begin)/(end-begin)
0.7500017361151299

@steviez
Contributor

steviez commented Jan 22, 2024

> bad
>
> [2024-01-21T03:16:02.013644297Z INFO  solana_runtime::bank]
> bank frozen: 243108000 hash: 7h4iAoXfX6KwitVhazib3fCnr3J4koFu7ArJZTY5heFZ
> accounts_delta: GCj8CFVaqeHwkVgaLb6f9PaPYeZMofBsKL1op3LXUJJ8 signature_count: 500 last_blockhash: Ant9w6LfnGbG4Jpm5VMxmyRpzN3myx3dDpAQSE3kTwcs
> capitalization: 567558615827727708, epoch_accounts_hash: xwb6iXsG3vdHgUTAFkgUiACYPwYkW8F7465CNL9WVaC,
> stats: BankHashStats { num_updated_accounts: 1465, num_removed_accounts: 14, num_lamports_stored: 39484939731971, total_data_len: 10492554, num_executable_accounts: 1 }
>
> good
>
> [2024-01-21T03:16:02.022355358Z INFO  solana_runtime::bank]
> bank frozen: 243108000 hash: HMk3tMMympeHyBRpoDrqMLxfLjSUZvNiLYx6i2JCuNGZ
> accounts_delta: GCj8CFVaqeHwkVgaLb6f9PaPYeZMofBsKL1op3LXUJJ8 signature_count: 500 last_blockhash: Ant9w6LfnGbG4Jpm5VMxmyRpzN3myx3dDpAQSE3kTwcs
> capitalization: 567558615826502748, epoch_accounts_hash: 824tUYuwAKFv2kKz5m2Xf8YHNYhYUhsqwohjmrvTp3Be,
> stats: BankHashStats { num_updated_accounts: 1465, num_removed_accounts: 14, num_lamports_stored: 39484939731971, total_data_len: 10492554, num_executable_accounts: 1 }

Hmm yeah, it looks like the only things that differ here are the epoch_accounts_hash and capitalization. The fact that your node did not diverge previously suggests that the account that caused the EAH to diverge did NOT appear as part of any recent bank hashes. So I don't think replaying the slot will give us any useful information. Rather, I think we would have to examine each account to find the offending one.
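For illustration, a hedged sketch of that account-by-account comparison. The dump files and their two-column "<pubkey> <account_hash>" format are hypothetical; the real data would have to be dumped from each node's accounts-db:

use std::collections::HashMap;
use std::fs;

// Load a hypothetical dump with one "<pubkey> <account_hash>" pair per line.
fn load(path: &str) -> HashMap<String, String> {
    fs::read_to_string(path)
        .expect("failed to read dump")
        .lines()
        .filter_map(|line| {
            let mut cols = line.split_whitespace();
            Some((cols.next()?.to_string(), cols.next()?.to_string()))
        })
        .collect()
}

fn main() {
    // File names are placeholders for per-node account dumps.
    let good = load("good_accounts.txt");
    let bad = load("bad_accounts.txt");

    // Report accounts whose hashes differ, plus accounts present on one side only.
    for (pubkey, good_hash) in &good {
        match bad.get(pubkey) {
            Some(bad_hash) if bad_hash != good_hash => {
                println!("MISMATCH {pubkey}: good={good_hash} bad={bad_hash}");
            }
            None => println!("missing on bad node: {pubkey}"),
            _ => {}
        }
    }
    for pubkey in bad.keys().filter(|k| !good.contains_key(*k)) {
        println!("extra on bad node: {pubkey}");
    }
}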

@brooksprumo
Contributor

> Yeah. 243108000 is at 75% of the epoch.

Also, the EAH is only part of the "bank frozen" log line for the one bank¹ that's including the EAH.

Footnotes

1. There can be multiple forks such that more than one bank includes the EAH, since this is only "frozen". Eventually only one of these banks will be rooted.

@brooksprumo
Contributor

> Rather, I think we would have to examine each account to find the offending one.

Yeah, that's what I would think too. The accounts hash calculation happens way after the bank hash is calculated (at the EAH start slot), so replaying the slot won't include anything about the EAH. We occasionally see accounts hash mismatches on snapshots, and they haven't been reproducible before; they're often theorized to be a HW issue, especially disk related.

@steviez
Contributor

steviez commented Jan 22, 2024

Note that I see this at startup when unpacking the snapshot:

[2024-01-22T20:02:23.357305050Z WARN  solana_runtime::bank]
verify failed: slot: 243107789, 4n2M6Y7tgkjUfahST7DFPLL1Su4V3cWGxo3gduJSv99N (calculated) != 95ahPUghb36UjDVtAmpLHGhYWFSwZ22spWW6k78AgGHQ (expected)

[2024-01-22T20:02:23.357326060Z INFO  solana_metrics::metrics]
datapoint: verify_snapshot_bank clean_us=4i shrink_us=1i verify_accounts_us=577i verify_bank_us=9414i
thread 'main' panicked at /solana-labs/solana-secondary/runtime/src/snapshot_bank_utils.rs:374:9:
Snapshot bank for slot 243107789 failed to verify

@HaoranYi
Contributor Author

HaoranYi commented Jan 22, 2024

Yeah, I see that too.

[2024-01-22T21:29:57.829048056Z INFO  solana_metrics::metrics] datapoint: accounts_db_active hash=0i
[2024-01-22T21:29:57.834151537Z WARN  solana_accounts_db::accounts_db] Mismatched total lamports: 567558617892842802 calculated: 567558617891617842
[2024-01-22T21:29:57.834178147Z WARN  solana_accounts_db::accounts] verify_accounts_hash failed: MismatchedTotalLamports(567558617891617842, 567558617892842802), slot: 243107789
[2024-01-22T21:29:57.834662541Z INFO  solana_runtime::bank] Initial background accounts hash verification has stopped

I am working on a fix for this.
d76a5a3

@HaoranYi
Contributor Author

I think this is due to the reward PDA account that was created in the previous epoch when I ran my node with partitioned rewards enabled (#34809). That PR is incomplete: it only patches the bank hash to ignore the PDA accounts, but it didn't take care of the epoch hash and the bank lamport adjustment. Therefore, we fail at the slot where we include the epoch hash into the bank hash.

I have pushed fixes for this just now.
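A side check on the numbers in this thread (an observation from the logs above, not something the participants computed): the capitalization gap between the bad and good banks at slot 243108000 and the total-lamports gap at snapshot slot 243107789 are both 1,224,960 lamports, which is consistent with a single stray account like the reward PDA described above.

fn main() {
    // "bank frozen" capitalization at slot 243108000 (bad vs good node):
    let cap_bad: u64 = 567_558_615_827_727_708;
    let cap_good: u64 = 567_558_615_826_502_748;
    // Snapshot verification totals at slot 243107789 (expected vs calculated):
    let snapshot_expected: u64 = 567_558_617_892_842_802;
    let snapshot_calculated: u64 = 567_558_617_891_617_842;

    // Both gaps agree, pointing at the same stray balance.
    assert_eq!(cap_bad - cap_good, 1_224_960);
    assert_eq!(snapshot_expected - snapshot_calculated, 1_224_960);
    println!("both gaps are {} lamports", cap_bad - cap_good);
}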

@steviez
Contributor

steviez commented Jan 22, 2024

> And just making sure, you were running 6a9f729 with no other modifications?

> I think this is due to the reward PDA account that was created in the previous epoch when I ran my node with partitioned rewards enabled (#34809). That PR is incomplete.

> I have pushed fixes for this just now.

So to confirm: you believe that an account from a PR that has not landed yet altered your account state and caused your node to diverge? And the fixes were pushed to your PR?

@HaoranYi
Contributor Author

HaoranYi commented Jan 22, 2024

> So to confirm: you believe that an account from a PR that has not landed yet altered your account state and caused your node to diverge? And the fixes were pushed to your PR?

Yes, that's correct. All of this is specific to my node, which was running with --partitioned-epoch-rewards-force-enable-single-slot on and off, which messed up the accounts stored on my local node.

@HaoranYi
Contributor Author

HaoranYi commented Jan 22, 2024

Closing the issue since this is specific to my node.
