local-cluster: fix flaky test_rpc_block_subscribe #34421
Conversation
Force-pushed from d8bafec to c9d6ad0 (compare).
Codecov Report
Additional details and impacted files:

@@            Coverage Diff            @@
##           master    #34421    +/-   ##
==========================================
  Coverage    81.8%     81.8%
==========================================
  Files         767       819       +52
  Lines      209267    220904    +11637
==========================================
+ Hits       171303    180839     +9536
- Misses      37964     40065     +2101
Force-pushed from c9d6ad0 to 21960a6 (compare).
Great find!
Hmm interesting, so in the RPC service, on a failure to notify/fetch a block (solana/rpc/src/rpc_subscriptions.rs, line 1056 in 383aa04), when a later bank at slot S + N is voted on, all blocks between S and N show up in slots_to_notify (solana/rpc/src/rpc_subscriptions.rs, line 1023 in 383aa04).
Ideally, I think it makes sense for the written-block notification to be decoupled from the block-voted notification so we could notify separately, since block writing and block voting aren't necessarily coupled, and we wouldn't want block writing to delay notification of the block being voted.
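To make the mechanism above concrete, here is a minimal, self-contained Rust sketch, not the actual rpc_subscriptions.rs code: the BlockNotifier struct, the last_unnotified_slot field, and the fetch_block closure are assumed stand-ins for the real subscription state and blockstore lookup. It only illustrates that the cursor advances on a successful notification, so after a failure at slot S every slot up to the next voted slot reappears in slots_to_notify.

```rust
// Minimal sketch (not the actual rpc_subscriptions.rs code) of the behavior above.
struct BlockNotifier {
    last_unnotified_slot: u64,
}

impl BlockNotifier {
    // `fetch_block` is a hypothetical stand-in for the blockstore lookup; it fails
    // while the slot's shreds have not all been inserted yet.
    fn notify_up_to(&mut self, voted_slot: u64, fetch_block: impl Fn(u64) -> Result<String, String>) {
        let slots_to_notify = self.last_unnotified_slot..=voted_slot;
        for slot in slots_to_notify {
            match fetch_block(slot) {
                Ok(block) => {
                    println!("notify subscribers: slot {slot} -> {block}");
                    // Only a successful notification advances the cursor.
                    self.last_unnotified_slot = slot + 1;
                }
                Err(err) => {
                    // Subscribers are sent an error, the cursor stays put, and the
                    // slot is retried when the next bank is voted on.
                    println!("notify subscribers: slot {slot} -> error: {err}");
                    break;
                }
            }
        }
    }
}

fn main() {
    let mut notifier = BlockNotifier { last_unnotified_slot: 3 };
    // Slot 3 is voted before its last shreds land in blockstore: subscribers see an error.
    notifier.notify_up_to(3, |_| Err("block not yet complete".to_string()));
    // When slot 4 is voted on, slots 3 and 4 are both notified.
    notifier.notify_up_to(4, |slot| Ok(format!("block for slot {slot}")));
}
```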
]
.iter()
.map(|s| (Arc::new(Keypair::from_base58_string(s)), true))
.take(node_stakes.len())
.collect::<Vec<_>>();
let rpc_node_pubkey = &validator_keys[1].0.pubkey();
Might be good to document or point to this PR for why we chose this validator key
added the comment to the subscribe call
Correct. So for this example, in the first iteration we would get a block update for slot 3 with an error.
Then, when slot 4 was voted on, we would get 2 notifications.
Agreed, I'm not sure why voting is the trigger here. At least for the
Force-pushed from 21960a6 to 67272d0 (compare).
Problem
When a leader creates a bank, it applies new entries to the working bank and then sends them to the broadcast stage to be shredded, stored in blockstore, and transmitted.
There is a race between replay and the broadcast stage: sometimes replay will vote on the bank before the broadcast stage has had a chance to shred the last batch of entries and store them in blockstore.
When replay votes on the bank, it notifies the RPC subscribers of the new block through the AggregateCommitmentService (solana/core/src/replay_stage.rs, lines 2249 to 2255 in a2d7be0).
However, in the case mentioned, blockstore does not yet have all of the data shreds, and the subscribers are sent an RpcBlockUpdateError because the RpcBlockUpdate cannot be constructed (solana/ledger/src/blockstore.rs, line 2189 in a2d7be0).
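For illustration, here is a minimal sketch of that failure path, not the actual blockstore.rs code: SlotMeta, its fields, is_full, and get_complete_block below are simplified stand-ins for blockstore's bookkeeping. It only shows that a complete block cannot be assembled until the slot's last data shred has been inserted.

```rust
// Simplified sketch (assumed shapes, not the real blockstore types): a slot is "full"
// once shreds 0..=last_index have all been inserted consecutively.
struct SlotMeta {
    // Number of consecutive data shreds inserted so far.
    consumed: u64,
    // Index of the slot's last data shred, once known.
    last_index: Option<u64>,
}

impl SlotMeta {
    fn is_full(&self) -> bool {
        matches!(self.last_index, Some(last) if self.consumed > last)
    }
}

// Stand-in for the blockstore lookup the RPC subscription path performs: if broadcast
// stage hasn't finished inserting the last FEC set, no complete block can be returned,
// and the caller falls back to sending an RpcBlockUpdateError to subscribers.
fn get_complete_block(meta: &SlotMeta) -> Result<&'static str, &'static str> {
    if !meta.is_full() {
        return Err("slot is not yet full");
    }
    Ok("complete block")
}

fn main() {
    // Replay has voted, but the last shreds (say 60..=63) are still being inserted.
    let racing = SlotMeta { consumed: 60, last_index: Some(63) };
    assert!(get_complete_block(&racing).is_err());

    // Once broadcast stage finishes its blockstore insert, the lookup succeeds.
    let full = SlotMeta { consumed: 64, last_index: Some(63) };
    assert!(get_complete_block(&full).is_ok());
}
```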
The change #34330 seems to exacerbate this race by increasing the number of shreds in the last FEC set. My naive hypothesis is that the extra coding shreds in the last FEC set slow down broadcast stage's blockstore insert such that the replay vote usually occurs first.
Summary of Changes
Subscribe to the RPC of the non-leader node, to avoid being sent errors due to this race.
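Roughly, the subscription side of that change looks like the sketch below, assuming the blocking PubsubClient::block_subscribe API from solana-pubsub-client and the RpcBlockSubscribeFilter / RpcBlockSubscribeConfig types from solana-rpc-client-api; the exact crate paths, the config default, and the subscribe_to_non_leader helper are assumptions for illustration, not code copied from the PR.

```rust
use solana_pubsub_client::pubsub_client::PubsubClient;
use solana_rpc_client_api::config::{RpcBlockSubscribeConfig, RpcBlockSubscribeFilter};

fn subscribe_to_non_leader(rpc_pubsub_addr: std::net::SocketAddr) {
    // Subscribe to the non-leader validator's websocket endpoint, so the race between
    // the leader's replay vote and its broadcast-stage blockstore insert cannot
    // surface as RpcBlockUpdateError notifications in the test.
    let (_subscription, receiver) = PubsubClient::block_subscribe(
        &format!("ws://{rpc_pubsub_addr}"),
        RpcBlockSubscribeFilter::All,
        Some(RpcBlockSubscribeConfig::default()),
    )
    .expect("block_subscribe");

    // Drain a few notifications; each should carry a block and no error.
    for response in receiver.iter().take(3) {
        assert!(response.value.err.is_none());
    }
}
```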
If we decide the race needs to be fixed, a solution could be to return an Option<BlockUpdate> that is None in the case that !slot_meta.is_full() and we are the leader for that slot. This would neither notify nor increment the last notified slot, so the block can be sent out on the next iteration (solana/rpc/src/rpc_subscriptions.rs, lines 1034 to 1050 in 383aa04).
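A minimal sketch of that idea, again not the real rpc_subscriptions.rs code: slot_is_full and is_leader_for_slot are hypothetical stand-ins for the SlotMeta and leader-schedule checks, and Result<String, String> stands in for the real block/error payloads.

```rust
// None means "skip for now": don't notify subscribers and don't advance the
// last-notified cursor, so the slot is retried on the next voted bank.
fn build_block_update(
    slot: u64,
    slot_is_full: bool,
    is_leader_for_slot: bool,
) -> Option<Result<String, String>> {
    if !slot_is_full {
        if is_leader_for_slot {
            // Our own broadcast stage just hasn't finished its blockstore insert yet;
            // suppress the spurious error instead of notifying it.
            return None;
        }
        // Non-leader case: the data is genuinely missing, surface the error.
        return Some(Err(format!("slot {slot}: block unavailable")));
    }
    Some(Ok(format!("slot {slot}: complete block")))
}

fn main() {
    // Leader racing its own broadcast stage: no notification, cursor stays put.
    assert!(build_block_update(3, false, true).is_none());
    // Same slot once blockstore has caught up: a normal notification goes out.
    assert!(matches!(build_block_update(3, true, true), Some(Ok(_))));
    // A non-leader with an incomplete slot still reports an error.
    assert!(matches!(build_block_update(3, false, false), Some(Err(_))));
}
```

Returning None rather than an error keeps the retry implicit: the caller simply stops iterating and leaves the last-notified cursor where it was.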
Alternatively, we could delay the leader's vote or RPC notification entirely until the last FEC set has been inserted into blockstore.
Fixes #32863