[DRAFT] [miner] Reduce fork and orphan rate with an interruptable miner #3335
Conversation
…chain or stacks block, signal to any running miner thread that it should cease what it's doing. This prevents the miner from holding open the write lock on the underlying DB, and in doing so, blocking the coordinator (whose operation takes precedence since the miner builds off of a chain tip produced by the coordinator)
…higher than a certain height; add a read-only DB query to see if we have already processed a Stacks block (so we don't need a write lock on the chainstate); catch microblock forks earlier by checking for common ancestors in addition to sequence numbers;
…d-only snapshot of the unconfirmed state. Used by a separate miner thread so it doesn't need to hold the relayer thread's chainstate
…hared mutex. Mining is blocked as long as at least one thread wants it blocked, and is unblocked if no threads want it blocked. Blocking/unblocking is idempotent when done from the same thread.
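A minimal sketch of those blocking semantics (hypothetical names; the real state lives in the node's shared globals): mining stays blocked while any thread's block request is outstanding, and repeat calls from one thread are idempotent because the set is keyed on the thread's ID.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical sketch of "mining is blocked while at least one thread wants
/// it blocked"; not the actual stacks-node API.
#[derive(Clone, Default)]
pub struct MinerBlocker {
    blockers: Arc<Mutex<HashSet<thread::ThreadId>>>,
}

impl MinerBlocker {
    /// Idempotent: calling twice from the same thread has no extra effect.
    pub fn block(&self) {
        self.blockers.lock().unwrap().insert(thread::current().id());
    }
    /// Idempotent: removing a thread that never blocked is a no-op.
    pub fn unblock(&self) {
        self.blockers.lock().unwrap().remove(&thread::current().id());
    }
    /// Mining may proceed only if no thread currently wants it blocked.
    pub fn mining_allowed(&self) -> bool {
        self.blockers.lock().unwrap().is_empty()
    }
}

fn main() {
    let blocker = MinerBlocker::default();
    assert!(blocker.mining_allowed());
    blocker.block();
    blocker.block(); // idempotent from the same thread
    assert!(!blocker.mining_allowed());
    blocker.unblock(); // one unblock suffices to undo the repeated blocks
    assert!(blocker.mining_allowed());
}
```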
…r can add its own transactions while bypassing the usual checks that would prevent them (such as fees or nonce conflicts). Used to submit poison-microblock transactions
…rrowed Option<&'a u64>
…m ongoing commit state (so another thread can be used to send the burnchain commit)
…er pass to unblock mining (since tests tend to only have one node, and thus no downloader passes complete)
…all the state for all the threads is clearly identified and encapsulated in structs, as is thread-shared global state. Also, everything is now documented. Neon node should be much more approachable now.
…ate, and blocks/unblocks the miner in response to new burnchain data arriving. This prevents the miner from stalling the node in the event of a slow mempool walk.
…forks are detected and the offending miner gets punished, and also start working out multi-miner tests for verifying that the chain continues to work well under load.
…_microblocks to have the node wait for a block to arrive before starting tenure
…en we win tenure, broadcast the block _before_ processing it. Also, because blocks can get mined out-of-order relative to burnchain blocks, advance the miner tip only if the tip is higher in the stacks chain
I'll start a mock miner now and review some time tonight.
…VRF keypair in the miner thread if we're in mock-mining mode
…nsider transactions if they have higher nonces than an unmineable transaction
…ails for some reason, then cache them until the end of the iteration loop and store them then as a transaction
…he mempool walk from caching updated nonces, but it cannot prevent it
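A toy sketch of the retry strategy from the commit above (hypothetical names; a `HashMap` stands in for the mempool DB and its transaction): nonces that fail to persist mid-walk are cached and flushed in one batch at the end of the iteration.

```rust
use std::collections::HashMap;

/// Flush all cached nonce updates in one "transaction" (a HashMap is a toy
/// stand-in for the mempool DB; the real code uses a SQL transaction).
fn flush_cached_nonces(cache: &mut HashMap<String, u64>, db: &mut HashMap<String, u64>) {
    for (addr, nonce) in cache.drain() {
        db.insert(addr, nonce);
    }
}

fn main() {
    let mut retry_cache = HashMap::new();
    let mut db = HashMap::new();
    // Suppose these two nonce updates failed to store during the walk:
    retry_cache.insert("SP_ALICE".to_string(), 21u64);
    retry_cache.insert("SP_BOB".to_string(), 3u64);
    // At the end of the iteration loop, store them all at once.
    flush_cached_nonces(&mut retry_cache, &mut db);
    assert!(retry_cache.is_empty());
    assert_eq!(db["SP_ALICE"], 21);
    assert_eq!(db["SP_BOB"], 3);
}
```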
…we haven't mined a microblock yet. Also, remove gratuitous debug messages.
Fix/retry store nonces
Ok, to the best of my knowledge, this looks good to me. Thanks for the great documentation. That was very helpful. Might be a good idea to get another review from someone who has looked at neon_node before this.
Are these tests set up to run properly now? I still set it as TODO.
@gregorycoppola Yes, they're still disabled in CI. There's no real way to run them in GitHub CI because they take too long. But separately, we could create a smoke test CI that ran them nightly and reported results. I think that's best handled in a separate PR, because it'll take some time to settle on a smoke test procedure we're happy with.
… behavior of mempool walks
Thanks @jcnelson! Rewriting the runloop was on my "someday" to-do list.
I had a radically different design in mind for the next iteration. I suspected a performance issue, but another big concern I would love to see addressed somehow is the testability of the runloop, which would require a lot of decoupling work.
Anyway, I'm glad you were able to get there!
Question concerning the spec: I have a good understanding of why any new burnchain data needs to interrupt the miner, but I'm not sure I follow why a new Stacks block should interrupt the mining loop. Could this be an attack vector? De facto, miners trust their bitcoin node, but they don't have that relationship with their stacks-node peers.
Concerning the failing tests, are they passing on your machine? It's a new runloop so we probably need to be diligent.
Besides my question on interruption caused by Stacks blocks, I think my only concern with this PR is around testing: I see a good amount of assertions being removed, and maybe not enough assertions being introduced.
Feel free to merge if you think that the manual testing that went (and/or planned) through this upgrade gives us enough confidence!
```rust
    eprintln!("\n\nBuild block {}\n\n", i);
    btc_regtest_controller.build_next_block(1);
    sleep_ms(block_time_ms);
}
```
I see the `arrange` and `act` phases, but can't see the `assert`s. Are the assertions happening somewhere else?
Right -- this is the beginning of a smoke test that won't run in CI, but would run as part of a nightly run (it takes too long to run in CI).
```rust
assert_eq!(account_0.nonce, 20);

let account_1 = get_account(&http_origin, &spender_addrs[1]);
assert_eq!(account_1.nonce, 1);
```
Why are we getting rid of these tests?
This deletion is part of `mining_transactions_is_fair`.
```rust
conf.events_observers.push(EventObserverConfig {
    endpoint: format!("localhost:{}", test_observer::EVENT_OBSERVER_PORT),
    events_keys: vec![EventKeyType::AnyEvent],
});
```
Why are we getting rid of these tests?
This gets rid of `mining_transactions_is_fair`. This test no longer reflects the mining strategy we use. If Alice sends 10 transactions collectively worth 10 STX, but Bob sends 2 transactions worth 100 STX, and the block can only fit Alice's or Bob's transactions, then Bob's transactions should be prioritized over Alice's. This isn't "fair" in that Alice's low-value transactions will get ignored for being low-fee, but it produces a higher profit for miners (which is what we really want).
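The fee-prioritized selection described above can be sketched as follows (a toy model, not the actual miner; names and the flat per-transaction capacity are illustrative):

```rust
/// Given (name, fee) pairs and a block that can fit `capacity` transactions,
/// take the highest-paying transactions first.
fn select_txs(mut txs: Vec<(&'static str, u64)>, capacity: usize) -> Vec<&'static str> {
    txs.sort_by(|a, b| b.1.cmp(&a.1)); // highest fee first (stable sort)
    txs.into_iter().take(capacity).map(|(name, _)| name).collect()
}

fn main() {
    // Alice's many low-fee transactions lose to Bob's two high-fee ones
    // when block space is scarce.
    let mempool = vec![("alice_1", 1), ("alice_2", 1), ("bob_1", 50), ("bob_2", 50)];
    assert_eq!(select_txs(mempool, 2), vec!["bob_1", "bob_2"]);
}
```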
So, the problem we're trying to address is that a node could start mining before it has received the latest Stacks data for the canonical chain tip. If this happens, then the block the miner produces will be a sibling of the chain tip, and could end up getting mined (leading to a short-term fork). By making the arrival of new Stacks blocks or microblocks (i.e. ones that this node doesn't already have) interrupt the miner, we give the system a chance to process them first, and in doing so, learn the canonical chain tip before it tries mining again. I see that the check that does this here is blindly trusting the …
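The key qualifier above is that only data the node does *not* already have should interrupt the miner; a toy sketch of that dedup check (hypothetical names; block hashes abbreviated as strings):

```rust
use std::collections::HashSet;

/// Only a Stacks block we have NOT already processed should interrupt the
/// miner; replayed known blocks are ignored, so peers cannot force constant
/// interruptions by re-announcing old data.
fn should_interrupt_miner(known_blocks: &HashSet<&str>, arrived: &str) -> bool {
    !known_blocks.contains(arrived)
}

fn main() {
    let known: HashSet<&str> = ["block_1", "block_2"].into_iter().collect();
    assert!(!should_interrupt_miner(&known, "block_1")); // already processed
    assert!(should_interrupt_miner(&known, "block_3")); // genuinely new tip data
}
```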
…firmed microblocks when processing a NetworkResult
…ork result contained new data
This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
This PR addresses #3334, #3324, and #3323.
I was looking at ways by which flashblocks and forks can originate at random, especially when the mempool is under load. It turns out there are a few common ways a node can be prevented from learning the latest chain tip in a timely manner, which increases the likelihood of these two phenomena. Both are related to how write-locks are acquired on the `StacksChainState` database:

**Relayer / ChainsCoordinator conflict:** When the relayer thread is trying to mine a block, it must currently open a write-lock on the underlying chainstate DB's MARF, because it will ultimately store a state trie representing the new block (or microblock state) to the database. This in turn prevents the ChainsCoordinator thread from processing new blocks. The reverse is also true: the ChainsCoordinator can prevent the relayer from mining. The effect of a congested mempool here is that the relayer can sometimes prevent the ChainsCoordinator thread from making progress for some time, even if the latter is trying to write a new Stacks block. Worse, the block the relayer is mining is going to be an orphan if it wins sortition.
**Relayer / P2P conflict:** Currently, the relayer processes a block or microblock it produces before it advertises it to the peer network. But the relayer can get stuck while doing this if the ChainsCoordinator is actively processing new chainstate. This prevents the node from quickly announcing its block to the peer network, which in turn delays other peers from learning about a new chain tip, which in turn gives their relayer threads a better chance of mining an orphan. A congested mempool amplifies this effect, because the relayer thread can start mining and stay stuck mining for a long, long time before the ChainsCoordinator thread can act on a newly-announced block.
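The lock conflict described above can be illustrated with a toy model (`std::sync::Mutex` stands in for the MARF write-lock; this is not the real chainstate API):

```rust
use std::sync::{Arc, Mutex};

/// The coordinator can process a newly-arrived block only if it can acquire
/// the chainstate write-lock (a unit Mutex stands in for the MARF).
fn coordinator_can_process(chainstate_write_lock: &Mutex<()>) -> bool {
    chainstate_write_lock.try_lock().is_ok()
}

fn main() {
    let lock = Arc::new(Mutex::new(()));
    let miner_guard = lock.lock().unwrap(); // relayer is mining a block
    assert!(!coordinator_can_process(&lock)); // coordinator is stalled meanwhile
    drop(miner_guard); // miner interrupted or finished: lock released
    assert!(coordinator_can_process(&lock));
}
```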
This PR does the following to address it (see system schematic):
First, it refactors `testnet/src/stacks-node/neon_node.rs` from a hairball of hard-to-follow functions into a set of properly defined structs and struct implementations that contain the state for the threads they represent. It also refactors all shared-thread state into a `Globals` struct to keep it accessible. I was forced to do this refactoring in order to effectively reason about how the node behaves; it was getting very hard to follow.

Second, it addresses these two problems by doing the following:
**Interruptible Miner Thread:** Mining blocks and microblocks now occurs in a dedicated miner thread, which the relayer thread spawns. This miner thread can be interrupted when new Stacks block data or burnchain data arrives, which causes the miner to exit and, in doing so, release the write lock. The ChainsCoordinator, Relayer, and RunLoop threads now instruct the miner to stop whenever they need to process new Stacks blocks or burnchain blocks.
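A minimal sketch of that interruption mechanism (hypothetical signatures, not the actual miner code): the miner polls a shared flag between units of work, and bails out when another thread raises it.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Build a block from the mempool, checking the interrupt flag between
/// transactions. Returning None abandons the in-progress block so the
/// chainstate write-lock can be released promptly.
fn mine_block(interrupt: &AtomicBool, mempool: &[&str]) -> Option<Vec<String>> {
    let mut block_txs = Vec::new();
    for tx in mempool {
        if interrupt.load(Ordering::SeqCst) {
            return None; // new chain data arrived: stop and let it be processed
        }
        block_txs.push(tx.to_string()); // apply one transaction to the block
    }
    Some(block_txs)
}

fn main() {
    let interrupt = AtomicBool::new(false);
    assert!(mine_block(&interrupt, &["tx1", "tx2"]).is_some());
    interrupt.store(true, Ordering::SeqCst); // e.g. a new burnchain block arrived
    assert!(mine_block(&interrupt, &["tx1", "tx2"]).is_none());
}
```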
**Announce Before Processing:** The node now announces blocks before it processes them. This is fine because a correct miner will produce a valid block, so there's no need to process it first. It's more important for the network's health that other nodes can receive and process it quickly.
Regarding testing, this PR adds the beginnings of a couple of "smoke tests" that are too expensive to run in CI, but demonstrate that a set of nodes will produce a high-quality chain even when the mempool is congested with CPU-bound transactions, which forces the miner thread to be active for nearly 100% of the burnchain block time. I'm labeling this PR as a draft until we have a strategy in place for smoke testing.