[DRAFT] [miner] Reduce fork and orphan rate with an interruptable miner #3335

jcnelson · 2022-10-11T00:48:20Z

This PR addresses #3334, #3324, and #3323.

I was looking at ways by which flashblocks and forks can originate at random, especially when the mempool is under load. It turns out, there are a few common ways by which a node can be prevented from learning the latest chain tip in a timely manner, which increases the likelihood of these two phenomena. These ways are both related to how write-locks are acquired in the StacksChainState database state:

Relayer / ChainsCoordinator conflict: When the relayer thread trying to mine a block, it must currently open a write-lock to the underlying chainstate DB's MARF because it will ultimately store a state trie representing the new block (or microblock state) to the database. This in turn prevents the ChainsCoordinator thread from processing new blocks. The reverse is also true -- the ChainsCoordinator can prevent the relayer from mining. The effect of a congested mempool here is that the relayer can sometimes prevent the ChainsCoordinator thread from making progress for some time, even if it's trying to write a new Stacks block. Worse, the block the relayer is mining is going to be an orphan if it wins sortition.
Relayer / P2P conflict: Currently, the relayer processes a block or microblock it produces before it advertises it to the peer network. But, the relayer can get stuck while doing this if the ChainsCoordinator is actively processing new chainstate. This prevents the peer from quickly announcing their block to the peer network, which in turn delays other peers from learning about a new chain tip, which in turn gives their relayer threads a better chance of mining an orphan. Having a congested mempool amplifies this effect, because the relayer thread can start mining and get stuck mining for a long, long time before the ChainsCoordinator thread can act on a newly-announced block.

This PR does the following to address it (see system schematic):

/// System schematic.
/// Legend:
///    |------|    Thread
///    /------\    Shared memory
///    @------@    Database
///    .------.    Code module
///
///
///                           |------------------|
///                           |  RunLoop thread  |   [1,7]
///                           |   .----------.   |--------------------------------------.
///                           |   .StacksNode.   |                                      |
///                           |---.----------.---|                                      |
///                    [1]        |     |    |     [1]                                  |
///              .----------------*     |    *---------------.                          |
///              |                  [3] |                    |                          |
///              V                      |                    V                          V
///      |----------------|             |    [9,10]   |---------------| [11] |--------------------------|
/// .--- | Relayer thread | <-----------|-----------> |   P2P Thread  | <--- | ChainsCoordinator thread | <--.
/// |    |----------------|             V             |---------------|      |--------------------------|    |
/// |            |     |          /-------------\    [2,3]    |    |              |          |               |
/// |        [1] |     *--------> /   Globals   \ <-----------*----|--------------*          | [4]           |
/// |            |     [2,3,7]    /-------------\                  |                         |               |
/// |            V                                                 V [5]                     V               |
/// |    |----------------|                                 @--------------@        @------------------@     |
/// |    |  Miner thread  | <------------------------------ @  Mempool DB  @        @  Chainstate DBs  @     |
/// |    |----------------|             [6]                 @--------------@        @------------------@     |
/// |                                                                                        ^               |
/// |                                               [8]                                      |               |
/// *----------------------------------------------------------------------------------------*               |
/// |                                               [7]                                                      |
/// *--------------------------------------------------------------------------------------------------------*
///
/// [1]  Spawns
/// [2]  Synchronize unconfirmed state
/// [3]  Enable/disable miner
/// [4]  Processes block data
/// [5]  Stores unconfirmed transactions
/// [6]  Reads unconfirmed transactions
/// [7]  Signals block arrival
/// [8]  Store blocks and microblocks
/// [9]  Pushes retrieved blocks and microblocks
/// [10] Broadcasts new blocks, microblocks, and transactions
/// [11] Notifies about new transaction attachment events

First, it refactors testnet/src/stacks-node/neon_node.rs from a hairball of hard-to-follow functions into a set of properly defined structs and struct implementations that contain the state for the threads they represent. It also refactors all shared-thread state into a Globals struct to keep it accessible. I was forced to do this refactoring in order to effectively reason about how the node behaves; it was getting very hard to follow.

Second, it addresses these two problems by doing the following:

Interruptible Miner Thread: Mining blocks and microblocks now occurs in a dedicated miner thread, which the relayer thread spawns. This miner thread can be interrupted when new Stacks block data or burnchain data arrives, which in turn causes the miner to exit and in doing so, releases the write lock. The ChainsCoordinator, Relayer, and RunLoop threads now instruct the miner to stop whenever they need to process new Stacks blocks or burnchain blocks.
Announce Before Processing: The node now announces blocks before it processes them. This is fine because a correct miner will produce a valid block, so there's no need to process it first. It's more important for the network's health that other nodes can receive it and process it.

Regarding testing, this PR adds the beginnings of a couple of "smoke tests" that are too expensive to run in CI, but demonstrate that a set of nodes will produce a high-quality chain even when the mempool is congested with CPU-bound transactions, which forces the miner thread to be active for nearly 100% of the burnchain block time. I'm labeling this PR as a draft until we have a strategy in place for smoke testing.

…hensible

…chain or stacks block, signal to any running miner thread that it should cease what it's doing. This prevents the miner from holding open the write lock on the underlying DB, and in doing so, blocking the coordinator (whose operation takes precedence since the miner builds off of a chain tip produced by the coordinator)

…higher than a certain height; add a read-only DB query to see if we have already processed a Stacks block (so we don't need a write lock on the chainstate); catch microblock forks earlier by checking for common ancestors in addition to sequence numbers;

…d-only snapshot of the the unconfirmed state. Used by a separate miner thread so it doesn't need to hold the relayer thread's chainstate

…hared mutex. Mining is blocked as long as at least one thread wants it blocked, and is unblocked if no threads want it blocked. Blocking/unblocking is idempotent when done from the same thread.

…r can add its own transactions while bypassing the usual checks that would prevent them (such as fees or nonce conflicts). Used to submit poison-microblock transactions

…rrowed Option<&'a u64>

…m ongoing commit state (so another thread can be used to send the burnchain commit)

…er pass to unblock mining (since tests tend to only have one node, and thus no downloader passes complete)

…all the state for all the threads is clearly identified and encapsulated in structs, as is thread-shared global state. Also, everything is now documented. Neon node should be much more approachable now.

…ate, and blocks/unblocks the miner in response to new burnchain data arriving. This prevents the miner from stalling the node in the event of a slow mempool walk.

…forks are detected and the offending miner gets punished, and also start working out multi-miner tests for verifying that the chain continues to work well under load.

…_microblocks to have the node wait for a block to arrive before starting tenure

…en we win tenure, broadcast the block _before_ processing it. Also, because blocks can get mined out-of-order relative to burnchain blocks, advance the miner tip only if the tip is higher in the stacks chain

… miners

obycode · 2022-10-20T02:00:54Z

I'll start a mock miner now and review some time tonight.

testnet/stacks-node/src/neon_node.rs

…VRF keypair in the miner thread if we're in mock-mining mode

…nsider transactions if they have higher nonces than an unmineable transaction

…ails for some reason, then cache them until the end of the iteration loop and store them then as a transaction

…he mempool walk from caching updated nonces, but it cannot prevent it

…we haven't mined a microblock yet. Also, remove gratuitous debug messages.

Fix/retry store nonces

obycode

Ok, to the best of my knowledge, this looks good to me. Thanks for the great documentation. That was very helpful. Might be a good idea to get another review from someone who has looked at neon_node before this.

gregorycoppola · 2022-10-20T21:27:57Z

https://github.com/stacks-network/stacks-blockchain/pull/3335/files/e5e39b1b04f497e4b703d81a0875842d5e3f6c69#r992508541

are these tests set up to run properly now? i still set it as TODO

jcnelson · 2022-10-21T01:17:30Z

@gregorycoppola Yes, they're still disabled in CI. There's no real way to run them in github CI because they take too long. But separately, we could create a smoke test CI that ran them nightly and reported results. I think that's best handled in a separate PR, because it'll take some time to settle on a smoke test procedure we're happy with.

… behavior of mempool walks

jcnelson · 2022-10-21T01:45:07Z

@obycode Thanks! I've added @lgalabru, who wrote the original neon_node implementation.

lgalabru

Thanks @jcnelson! Rewriting the runloop was in my "someday" to do list.
I had a radically different design in mind for the next iteration. I suspected a performance issue, but another big concern that I would love to see addressed somehow is testability of the runloop, which would require a lot of decoupling work.
Anyway, I'm glad you were able to get there!

Question concerning the spec. I have a good understanding of why any new burnchain data needs to interrupt the miner, but not sure to follow why new Stacks block should be interrupting the mining loop? could this be an attack vector? De facto, miners are trusting their bitcoin node, but they don't have this relationship with the stacks-node peers?

Concerning the failing tests, are they passing on your machine? It's a new runloop so we probably need to be diligent.

Beside my question on interruption caused by Stacks blocks, I think my only concern with this PR is around testing: I see a good amount of assertions being removed, and maybe not enough assertions being introduced.
Feel free to merge if you think that the manual testing that went (and/or planned) through this upgrade gives us enough confidence!

lgalabru · 2022-10-21T17:56:32Z

testnet/stacks-node/src/tests/neon_integrations.rs

+        eprintln!("\n\nBuild block {}\n\n", i);
+        btc_regtest_controller.build_next_block(1);
+        sleep_ms(block_time_ms);
+    }


I see the arrange and act phases, but can't see the asserts. Are the assertions happening somewhere else?

Right -- this is the beginning of a smoke test that won't run in CI, but would run as part of a nightly run (it takes too long to run in CI).

lgalabru · 2022-10-21T17:59:25Z

testnet/stacks-node/src/tests/neon_integrations.rs

-    assert_eq!(account_0.nonce, 20);
-
-    let account_1 = get_account(&http_origin, &spender_addrs[1]);
-    assert_eq!(account_1.nonce, 1);


Why are we getting rid of these tests?

This deletion is part of mining_transactions_is_fair.

lgalabru · 2022-10-21T17:59:36Z

testnet/stacks-node/src/tests/neon_integrations.rs

-    conf.events_observers.push(EventObserverConfig {
-        endpoint: format!("localhost:{}", test_observer::EVENT_OBSERVER_PORT),
-        events_keys: vec![EventKeyType::AnyEvent],
-    });


Why are we getting rid of these tests?

This gets rid of mining_transactions_is_fair. This test no longer reflects the mining strategy we use. If Alice sends 10 transactions collectively worth 10 STX, but Bob sends 2 transactions worth 100 STX, and the block can only fit Alice's or Bob's transactions, then Bob's transactions should be prioritized over Alice's. This isn't "fair" in that Alice's low-value transactions will get ignored for being low-fee, but it produces a higher profit for miners (which is what we really want).

jcnelson · 2022-10-21T19:49:20Z

@lgalabru Question concerning the spec. I have a good understanding of why any new burnchain data needs to interrupt the miner, but not sure to follow why new Stacks block should be interrupting the mining loop? could this be an attack vector? De facto, miners are trusting their bitcoin node, but they don't have this relationship with the stacks-node peers?

So, the problem we're trying to address is that a node could start mining before it has received the latest Stacks data for the canonical chain tip. If this happens, then the block the miner will produce will be a sibling block to the chain tip, and could end up getting mined (leading to a short-term fork). By making it so that the arrival of new Stacks blocks or microblocks (i.e. that this node doesn't already have) interrupts the miner, it gives the system a chance to process them first, and in doing so, learn the canonical chain tip before it tries mining again.

I see that the the check that does this here is blindingly trusting the NetworkResult data, which does open a DDoS vector whereby someone can just post the same blocks over and over again and stall mining. I'll fix it so that the miner is only interrupted by the arrival of new block or microblock data, which can be checked and reported by the Relayer::process_network_result() method.

…firmed microblocks when processing a NetworkResult

…ork result contained new data

blockstack-devops · 2024-11-16T00:21:49Z

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

jcnelson added 28 commits October 5, 2022 22:37

Merge branch 'master' into feat/interruptable-miner

627fc84

chore: clean up canonical stacks tip memo code to make it more compre…

6023176

…hensible

chore: remove fmt warning

0b8397b

feat: add make_readonly_owned() function that effectively takes a rea…

6e81ee4

…d-only snapshot of the the unconfirmed state. Used by a separate miner thread so it doesn't need to hold the relayer thread's chainstate

chore: remove fmt warnings

d47df1f

feat: make the miner loop interruptable by checking the status of a s…

b1ceaba

…hared mutex. Mining is blocked as long as at least one thread wants it blocked, and is unblocked if no threads want it blocked. Blocking/unblocking is idempotent when done from the same thread.

chore: add MinerAborted error variant

ab0d95a

feat: add a miner-driven mempool submission function so that the mine…

0ea6586

…r can add its own transactions while bypassing the usual checks that would prevent them (such as fees or nonce conflicts). Used to submit poison-microblock transactions

feat: NetworkResult now reports the p2p thread's burn height

6b0befd

feat: use new NetworkResult contructor

d2c1c59

chore: exit_at_block_height is now a bare Option<u64> instead of a bo…

a27d9f4

…rrowed Option<&'a u64>

feat: make it possible to create a burnchain tx submission client fro…

46a76c6

…m ongoing commit state (so another thread can be used to send the burnchain commit)

chre: make it so the test framework can avoid the need for a download…

567303d

…er pass to unblock mining (since tests tend to only have one node, and thus no downloader passes complete)

fix: accommodate poison-microblock transaction evaluation results

9d60275

feat: untangle and restructure the hairball that was neon_node. Now, …

361a3f2

…all the state for all the threads is clearly identified and encapsulated in structs, as is thread-shared global state. Also, everything is now documented. Neon node should be much more approachable now.

fix: send a copy of the exit at block height

c1d2132

feat: make it so the neon runloop sets up structured global thread st…

69ce0ff

…ate, and blocks/unblocks the miner in response to new burnchain data arriving. This prevents the miner from stalling the node in the event of a slow mempool walk.

docs: document that this is only for helium

8da076e

feat: add a poison-microblock integration test to verify that stream …

dc58cfc

…forks are detected and the offending miner gets punished, and also start working out multi-miner tests for verifying that the chain continues to work well under load.

feat: add Error::ChannelClosed() variant to detect thread hangups

b5e7ee5

feat: log when we store a block

d08cd0a

feat: add wait_time_for_blocks config option similar to wait_time_for…

c19ad0f

…_microblocks to have the node wait for a block to arrive before starting tenure

fix: wait for one inv and download pass before running tenure, and wh…

b3b6525

…en we win tenure, broadcast the block _before_ processing it. Also, because blocks can get mined out-of-order relative to burnchain blocks, advance the miner tip only if the tip is higher in the stacks chain

feat: add integration tests for chain quality when there are multiple…

c3656a3

… miners

fix: avoid gratuitous call to pick_higher_tip

c3148e7

feat: document the structure of the Stacks node

b7b6879

jcnelson added the L1 Working Group Issue or PR related to improving L1 label Oct 11, 2022

jcnelson requested a review from obycode October 11, 2022 00:48

jcnelson added 4 commits October 19, 2022 21:53

Merge branch 'develop' into feat/interruptable-miner

bc91a0a

chore: log when we get microblock data (and poison results)

f933814

fix: fix failing integration test due to using a new block version

9f9795b

fix: use new config options to get tests to pass

458e055

obycode reviewed Oct 20, 2022

View reviewed changes

testnet/stacks-node/src/neon_node.rs Show resolved Hide resolved

jcnelson added 9 commits October 20, 2022 10:36

fix: calculate poll time in millis, not seconds; generate mock-miner …

34189b6

…VRF keypair in the miner thread if we're in mock-mining mode

chore: update test comments to indicate that we're not supposed to co…

0929f2b

…nsider transactions if they have higher nonces than an unmineable transaction

chore: require rust 1.61

940969e

chore: test-only deps shouldn't be included as top-level imports

6338cd3

fix: if the act of storing mempool nonces in a write-through manner f…

2896e40

…ails for some reason, then cache them until the end of the iteration loop and store them then as a transaction

fix: add a test to verify that a concurrent writer thread may delay t…

83f54f2

…he mempool walk from caching updated nonces, but it cannot prevent it

chore: add changelog

a0fddb7

fix: when mock-mining, always re-try minig an anchored block even if …

33458fb

…we haven't mined a microblock yet. Also, remove gratuitous debug messages.

Merge pull request #3352 from stacks-network/fix/retry-store-nonces

049da9d

Fix/retry store nonces

obycode approved these changes Oct 20, 2022

View reviewed changes

jcnelson requested a review from lgalabru October 21, 2022 01:43

fix: remove mining fairness test, since this is no longer the desired…

1f7b295

… behavior of mempool walks

gregorycoppola approved these changes Oct 21, 2022

View reviewed changes

lgalabru reviewed Oct 21, 2022

View reviewed changes

jcnelson added 2 commits October 21, 2022 16:21

feat: report number of *new* blocks, confirmed microblocks, and uncon…

b1b0e2c

…firmed microblocks when processing a NetworkResult

fix: only disable the miner thread if the relayer reports that a netw…

aba47e4

…ork result contained new data

jcnelson merged commit 9d89725 into develop Oct 22, 2022

blockstack-devops added the locked label Nov 16, 2024

stacks-network locked as resolved and limited conversation to collaborators Nov 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DRAFT] [miner] Reduce fork and orphan rate with an interruptable miner #3335

[DRAFT] [miner] Reduce fork and orphan rate with an interruptable miner #3335

jcnelson commented Oct 11, 2022

obycode commented Oct 20, 2022

obycode left a comment

gregorycoppola commented Oct 20, 2022

jcnelson commented Oct 21, 2022

jcnelson commented Oct 21, 2022

lgalabru left a comment

lgalabru Oct 21, 2022

jcnelson Oct 21, 2022

lgalabru Oct 21, 2022

jcnelson Oct 21, 2022

lgalabru Oct 21, 2022

jcnelson Oct 21, 2022

jcnelson commented Oct 21, 2022

blockstack-devops commented Nov 16, 2024

[DRAFT] [miner] Reduce fork and orphan rate with an interruptable miner #3335

[DRAFT] [miner] Reduce fork and orphan rate with an interruptable miner #3335

Conversation

jcnelson commented Oct 11, 2022

obycode commented Oct 20, 2022

obycode left a comment

Choose a reason for hiding this comment

gregorycoppola commented Oct 20, 2022

jcnelson commented Oct 21, 2022

jcnelson commented Oct 21, 2022

lgalabru left a comment

Choose a reason for hiding this comment

lgalabru Oct 21, 2022

Choose a reason for hiding this comment

jcnelson Oct 21, 2022

Choose a reason for hiding this comment

lgalabru Oct 21, 2022

Choose a reason for hiding this comment

jcnelson Oct 21, 2022

Choose a reason for hiding this comment

lgalabru Oct 21, 2022

Choose a reason for hiding this comment

jcnelson Oct 21, 2022

Choose a reason for hiding this comment

jcnelson commented Oct 21, 2022

blockstack-devops commented Nov 16, 2024