Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chain head block is sometimes not available on the node that actually mined the block #6140

Closed
matthew1001 opened this issue Nov 7, 2023 · 7 comments · Fixed by #6143
Closed
Assignees
Labels
bug Something isn't working non mainnet (private networks) not related to mainnet features - covers privacy, permissioning, IBFT2, QBFT

Comments

@matthew1001
Copy link
Contributor

matthew1001 commented Nov 7, 2023

Description

I've been experiencing occasional CHAIN_HEAD_NOT_AVAILABLE errors in a single-validator QBFT chain. Adding some extra sysout logs to a local build I can see that when this happens, the failing check is logged out before the existing "Produced [#block-number]" log entry for that requested block:

MRW: Blockchain head returned nothing for the block hash 0x10b029608b21f63b34884b36e29030332200677ed4ad0890b46342c494e29753?
2023-11-07 11:13:29.056+00:00 | BftProcessorExecutor-QBFT-0 | INFO  | QbftBesuControllerBuilder | Produced #5,182 / 21 tx / 11 pending / 40,001,640 (40.0%) gas / (0x10b029608b21f63b34884b36e29030332200677ed4ad0890b46342c494e29753)
2023-11-07 11:13:29.094+00:00 | EthScheduler-Transactions-0 | WARN  | TransactionPool | rejecting transaction 0x140a7a8af4bd571fdbc251f11e92a84f6796728eff20bff704c1896865a8a87e due to chain head not available yet

Accept

Steps to Reproduce (Bug)

  1. Create a 4 node QBFT chain.
  2. Only one of the nodes is a validator. On a single laptop this appears to be easy to recreate in 5-10 minutes, probably due to CPU load.
  3. Enabled transaction pool trace on the validator node
  4. Submit ~ 50TPS to a non-validator node
  5. After a short period of time you should hit EthScheduler-Transactions-0 | TRACE | TransactionPool | rejecting transaction

(Note that this error is a WARN error as of #6126)

Expected behavior:
No CHAIN_HEAD_NOT_AVAILABLE errors are seen in this sort of load testing

@matthew1001 matthew1001 added the bug Something isn't working label Nov 7, 2023
@matthew1001
Copy link
Contributor Author

matthew1001 commented Nov 7, 2023

A few notes after initial investigation.

  • DefaultBlockchain.appendBlock(...) is synchronized, and calls appendBlockHelper(...) to update both the chain itself and the (volatile) chainHeader reference.
  • TransactionPool.getChainHeadBlockHeader(...) calls 2 methods in DefaultBlockchain:
    • getBlockHeader(...)
    • getChainHeadHash(...)
    • Neither of these are synchronized

I'm not sure if there is sufficient protection around the call to getBlockHeader() being called with the new chainHeader hash, but on a chain that hasn't yet committed the new block. updater.commit() is called at the end of appendBlockHelper but I think by this point chainHeader will have been updated to the new chain head, and a caller of getBlockHeader(...) could end up reading stale data with the new chain header.

As a side note, there is a block cache mechanism that is only used if numberOfBlocksToCache != 0, and after a quick skim through the code I can't see any paths that set it >0

@fab-10 any thoughts? I'm currently running the same test with getBlockHeader() and getChainHeadHash() marked as synchronized. I'm not sure this is the fix needed, but it's just a quick way of proving if it is the logic described above that is causing the issue.

@fab-10
Copy link
Contributor

fab-10 commented Nov 8, 2023

cc: @ahamlat for any perfomance implication of adding synchronization there.

@fab-10
Copy link
Contributor

fab-10 commented Nov 8, 2023

@matthew1001 how the tests are going?
I am also thinking about an approach limited to the tx, that could retry, a bit later, the validation of a tx in case the data it needs is not yet available.

@matthew1001
Copy link
Contributor Author

@fab-10 I've been running now for many hours and haven't recreated the problem. Before I could recreate within 20-30 minutes very reliably.

Note that in the PR I've only added synchronized to getBlockHeader(), not getChainHeadHash(), as I don't think the latter is necessary.

I did wonder about any performance side-effects, but I don't know if getBlockHeader() is on very performance-sensitive code paths. The codepath we are hitting this on is the path for adding a remote transaction. Ultimately, if the blocks and chain header need updating atomically then there needs to be a mutex of some sort.

My gut feel would be to see if performance testing for the next release shows a noticeable change, and if so this could be reverted and we'd be left with the same issue we currently have, to rethink a fix for.

@ahamlat
Copy link
Contributor

ahamlat commented Nov 10, 2023

@matthew1001 Regarding block cache and when numberOfBlocksToCache is set, you can check this PR #6009. This field is initialized when Besu is started with --cache-last-blocks=n. This is specifically done here when BesuControllerBuilder creates the blockchain object.

@matthew1001
Copy link
Contributor Author

Thanks @ahamlat for the info, I wasn't aware of the new option. We might give that a try in some of our performance tests with something small like --cache-last-blocks=48 to see if we observe an improvement in high-throughput tests. We're also looking at whether eth_getBlockReceipts gives us improvements in our applications, and that might reduce the benefit of --cache-last-blocks since getting TX receipts make up a lot of application calls in some tests.

@ahamlat
Copy link
Contributor

ahamlat commented Nov 10, 2023

Actually --cache-last-blocks=n blocks will improve a lot eth_getBlockReceipts with the current implementation, and even better once the implementation is reviewed as I suggested in Discord, because this options also caches the receipts

@non-fungible-nelson non-fungible-nelson added the non mainnet (private networks) not related to mainnet features - covers privacy, permissioning, IBFT2, QBFT label Nov 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non mainnet (private networks) not related to mainnet features - covers privacy, permissioning, IBFT2, QBFT
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants