
local-cluster: fix flaky test_rpc_block_subscribe #34421

Merged: 3 commits into solana-labs:master on Dec 14, 2023

Conversation

@AshwinSekar (Contributor) commented Dec 12, 2023

Problem

When a leader creates a bank, it applies new entries to the working bank and then sends them to broadcast stage to be shredded, stored in blockstore, and transmitted.

There is a race between replay and broadcast stage, and sometimes replay will vote on the bank before broadcast stage has a chance to shred the last batch of entries and store them in blockstore.

When replay votes on the bank, it notifies the RPC subscribers of the new block through the AggregateCommitmentService:

Self::update_commitment_cache(
    bank.clone(),
    bank_forks.read().unwrap().root(),
    progress.get_fork_stats(bank.slot()).unwrap().total_stake,
    lockouts_sender,
);
update_commitment_cache_time.stop();

However, in the case described above, blockstore does not yet have all of the data shreds, so the subscribers are sent an RpcBlockUpdateError because the RpcBlockUpdate cannot be constructed:

Err(BlockstoreError::SlotUnavailable)

The change #34330 seems to exacerbate this race by increasing the number of shreds in the last FEC set. My naive hypothesis is that the extra coding shreds in the last FEC set slow down broadcast stage's blockstore insert such that the replay vote usually occurs first.
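
To see why this ordering is flaky rather than deterministic, here is a toy standalone illustration of the race; it is a heavily simplified model with hypothetical names, not the actual broadcast stage or replay stage code:

// Toy model of the race (hypothetical names, heavily simplified; not the real
// broadcast/replay code): broadcast stage inserting the last FEC set into
// blockstore races with replay voting on the bank and notifying subscribers.
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};
use std::thread;

fn main() {
    let last_fec_set_in_blockstore = Arc::new(AtomicBool::new(false));

    // "Broadcast stage": shreds the last entry batch and stores it in blockstore.
    let broadcast = {
        let flag = Arc::clone(&last_fec_set_in_blockstore);
        thread::spawn(move || flag.store(true, Ordering::SeqCst))
    };

    // "Replay": votes on the bank and triggers the RPC block notification.
    // Depending on scheduling, either branch can be taken, which is exactly the
    // flakiness the test was hitting.
    if last_fec_set_in_blockstore.load(Ordering::SeqCst) {
        println!("subscriber receives the complete block");
    } else {
        println!("subscriber receives RpcBlockUpdateError: slot not yet in blockstore");
    }

    broadcast.join().unwrap();
}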

Summary of Changes

Subscribe to the RPC of the non-leader node to avoid being sent errors due to this race.
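
For illustration, a minimal sketch of what this amounts to in the test, assuming the blocking PubsubClient from the solana pubsub client crate; the import paths, config fields, and the rpc_node_pubsub_url variable are assumptions here, not code taken from this PR:

// Hedged sketch (not the test as merged): subscribe for block updates on the
// non-leader validator's pubsub endpoint so the leader-side blockstore race can
// no longer surface as an RpcBlockUpdateError in the test.
// Assumed import paths and config fields; verify against the actual crates.
use solana_pubsub_client::pubsub_client::PubsubClient;
use solana_rpc_client_api::config::{RpcBlockSubscribeConfig, RpcBlockSubscribeFilter};
use solana_sdk::commitment_config::CommitmentConfig;

// `rpc_node_pubsub_url` is hypothetical: the ws:// address of the validator
// identified by `rpc_node_pubkey` in the test setup.
let (_subscription, receiver) = PubsubClient::block_subscribe(
    &rpc_node_pubsub_url,
    RpcBlockSubscribeFilter::All,
    Some(RpcBlockSubscribeConfig {
        commitment: Some(CommitmentConfig::confirmed()),
        ..RpcBlockSubscribeConfig::default()
    }),
)
.expect("block_subscribe against the non-leader node");

// Each received item is a Response<RpcBlockUpdate>; against the non-leader node
// the `err` field should stay empty once the slot is confirmed.
let update = receiver.recv().unwrap();
assert!(update.value.err.is_none());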

If we decide the race itself needs to be fixed, a solution could be to return an Option<BlockUpdate> with None in the case that !slot_meta.is_full() and we are the leader for that slot. This would neither notify nor increment the last notified slot, so that the block can be sent out on the next iteration:

let block_update_result = blockstore
    .get_complete_block(s, false)
    .map_err(|e| {
        error!("get_complete_block error: {}", e);
        RpcBlockUpdateError::BlockStoreError
    })
    .and_then(|block| filter_block_result_txs(block, s, params));
match block_update_result {
    Ok(block_update) => {
        if let Some(block_update) = block_update {
            notifier.notify(
                RpcResponse::from(RpcNotificationResponse {
                    context: RpcNotificationContext { slot: s },
                    value: block_update,
                }),
                subscription,
            );
        }
        // ...
    }
    // ...
}
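
A minimal sketch of that guard, under the assumption that the notification path can see the slot's SlotMeta and knows whether this node was the slot's leader; slot_meta and is_leader_for_slot are hypothetical names, not part of the existing rpc_subscriptions.rs code:

// Hedged sketch of the proposed guard; `slot_meta` and `is_leader_for_slot`
// are assumed helpers, and the downstream filter call would need to handle None.
let block_update_result = blockstore
    .get_complete_block(s, false)
    .map(Some)
    .or_else(|e| {
        if !slot_meta.is_full() && is_leader_for_slot(s) {
            // Our own block simply has not finished landing in blockstore yet:
            // skip the notification and leave w_last_unnotified_slot untouched so
            // that slot `s` is picked up again on the next iteration.
            Ok(None)
        } else {
            error!("get_complete_block error: {}", e);
            Err(RpcBlockUpdateError::BlockStoreError)
        }
    });
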
Alternatively, we could delay the leader's vote or the RPC notification entirely until the last FEC set has been inserted into blockstore.

Fixes #32863


codecov bot commented Dec 12, 2023

Codecov Report

Merging #34421 (67272d0) into master (e4e043c) will increase coverage by 0.0%.
Report is 1547 commits behind head on master.
The diff coverage is 92.3%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #34421      +/-   ##
==========================================
  Coverage    81.8%    81.8%              
==========================================
  Files         767      819      +52     
  Lines      209267   220904   +11637     
==========================================
+ Hits       171303   180839    +9536     
- Misses      37964    40065    +2101     

@carllin (Contributor) commented Dec 14, 2023

Great find!

If we decide the race itself needs to be fixed, a solution could be to return an Option<BlockUpdate> with None in the case that !slot_meta.is_full() and we are the leader for that slot. This would neither notify nor increment the last notified slot, so that the block can be sent out on the next iteration:

Hmm interesting, so in the RPC service right now, on a failure to fetch/notify block S for a specific single subscription, if w_last_unnotified_slot is not incremented:

*w_last_unnotified_slot = s + 1;

does that mean that the next time we get a notification for slot S+N, all blocks between S and S+N show up in slots_to_notify:

for s in slots_to_notify {

even if some of those blocks were already notified in a previous iteration?

Ideally, I think it makes sense for the written-block notification to be decoupled from the block-voted notification so we could notify separately, since block writing and block voting aren't necessarily coupled, and we wouldn't want block writing to delay the notification that a block was voted on.

carllin previously approved these changes Dec 14, 2023, leaving a review comment on this part of the test:
]
.iter()
.map(|s| (Arc::new(Keypair::from_base58_string(s)), true))
.take(node_stakes.len())
.collect::<Vec<_>>();
let rpc_node_pubkey = &validator_keys[1].0.pubkey();
Might be good to document or point to this PR for why we chose this validator key

AshwinSekar (Author) replied:
added the comment to the subscribe call

@AshwinSekar (Contributor, Author) commented Dec 14, 2023

does that mean that the next time we get a notification for slot S+N, all blocks between S and S+N show up in slots_to_notify?

Correct. So for this example, in the first iteration we would get a block update for slot 3 with:

RpcBlockUpdate {
    slot: 3,
    block: None,
    err: BlockStoreError,
}

Then when slot 4 is voted on, we would get two notifications:

RpcBlockUpdate {
    slot: 3,
    block: <actual block>,
    err: None,
}

RpcBlockUpdate {
    slot: 4,
    block: None,
    err: BlockStoreError,
}
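
To make the cursor behavior concrete, here is a tiny standalone model of the flow described above; it is an assumed simplification with hypothetical names, not the real rpc_subscriptions.rs code, in which the cursor advances only when a complete block is available:

// Toy model of the notification cursor (hypothetical, not the real RPC code):
// for each voted slot we walk from the last unnotified slot up to the voted slot,
// emit one update per slot, and advance the cursor only when the block is complete.
#[derive(Debug)]
enum Update {
    Block(u64),               // slot notified with the actual block
    Error(u64, &'static str), // slot notified with an error payload
}

fn notify_up_to(cursor: &mut u64, voted_slot: u64, block_is_complete: impl Fn(u64) -> bool) -> Vec<Update> {
    let mut sent = Vec::new();
    for s in *cursor..=voted_slot {
        if block_is_complete(s) {
            sent.push(Update::Block(s));
            *cursor = s + 1; // only a successful fetch advances the cursor
        } else {
            // the subscriber still sees a notification carrying the error, but the
            // cursor stays put, so slot `s` is retried on the next iteration
            sent.push(Update::Error(s, "BlockStoreError"));
        }
    }
    sent
}

fn main() {
    let mut cursor = 3;
    // Slot 3 is voted before its last FEC set lands in blockstore.
    println!("{:?}", notify_up_to(&mut cursor, 3, |_| false));
    // => [Error(3, "BlockStoreError")]

    // Slot 4 is voted; slot 3 is now complete, slot 4 hits the same race.
    println!("{:?}", notify_up_to(&mut cursor, 4, |s| s == 3));
    // => [Block(3), Error(4, "BlockStoreError")]
}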

Ideally, I think it makes sense for the written-block notification to be decoupled from the block-voted notification so we could notify separately, since block writing and block voting aren't necessarily coupled, and we wouldn't want block writing to delay the notification that a block was voted on.

Agreed, I'm not sure why voting is the trigger here. At least for the Processed confirmation level I would assume subscribers would want to know as soon as any block is available in blockstore.

@AshwinSekar merged commit 4f11161 into solana-labs:master Dec 14, 2023
34 checks passed
Successfully merging this pull request may close these issues.

Flaky Test: solana-local-cluster::local_cluster test_rpc_block_subscribe