
local-cluster: fix flaky test_rpc_block_subscribe #34421

Merged: 3 commits into solana-labs:master on Dec 14, 2023

Conversation

@AshwinSekar (Contributor) commented Dec 12, 2023

Problem

When a leader creates a bank, it applies new entries to the working bank and then sends them to broadcast stage to be shredded, stored in blockstore, and transmitted.

There is a race between replay and broadcast stage, and sometimes replay will vote on the bank before broadcast stage has a chance to shred the last batch of entries and store them in blockstore.

When replay votes on the bank, it notifies the RPC subscribers of the new block through the AggregateCommitmentService:

Self::update_commitment_cache(
    bank.clone(),
    bank_forks.read().unwrap().root(),
    progress.get_fork_stats(bank.slot()).unwrap().total_stake,
    lockouts_sender,
);
update_commitment_cache_time.stop();

However, in the case described above, blockstore does not yet have all of the data shreds, so the subscribers are sent an RpcBlockUpdateError because the RpcBlockUpdate cannot be constructed:

Err(BlockstoreError::SlotUnavailable)

The change #34330 seems to exacerbate this race by increasing the number of shreds in the last FEC set. My naive hypothesis is that the extra coding shreds in the last FEC set slow down broadcast stage's blockstore insert such that the replay vote usually occurs first.
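
To see why this ordering is flaky rather than deterministic, here is a toy standalone illustration of the race; it is a heavily simplified model with hypothetical names, not the actual broadcast stage or replay stage code:

// Toy model of the race (hypothetical names, heavily simplified; not the real
// broadcast/replay code): broadcast stage inserting the last FEC set into
// blockstore races with replay voting on the bank and notifying subscribers.
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};
use std::thread;

fn main() {
    let last_fec_set_in_blockstore = Arc::new(AtomicBool::new(false));

    // "Broadcast stage": shreds the last entry batch and stores it in blockstore.
    let broadcast = {
        let flag = Arc::clone(&last_fec_set_in_blockstore);
        thread::spawn(move || flag.store(true, Ordering::SeqCst))
    };

    // "Replay": votes on the bank and triggers the RPC block notification.
    // Depending on scheduling, either branch can be taken, which is exactly the
    // flakiness the test was hitting.
    if last_fec_set_in_blockstore.load(Ordering::SeqCst) {
        println!("subscriber receives the complete block");
    } else {
        println!("subscriber receives RpcBlockUpdateError: slot not yet in blockstore");
    }

    broadcast.join().unwrap();
}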

Summary of Changes

Subscribe to the RPC of the non-leader node to avoid being sent errors due to this race.
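
For illustration, a minimal sketch of what this amounts to in the test, assuming the blocking PubsubClient from the solana pubsub client crate; the import paths, config fields, and the rpc_node_pubsub_url variable are assumptions here, not code taken from this PR:

// Hedged sketch (not the test as merged): subscribe for block updates on the
// non-leader validator's pubsub endpoint so the leader-side blockstore race can
// no longer surface as an RpcBlockUpdateError in the test.
// Assumed import paths and config fields; verify against the actual crates.
use solana_pubsub_client::pubsub_client::PubsubClient;
use solana_rpc_client_api::config::{RpcBlockSubscribeConfig, RpcBlockSubscribeFilter};
use solana_sdk::commitment_config::CommitmentConfig;

// `rpc_node_pubsub_url` is hypothetical: the ws:// address of the validator
// identified by `rpc_node_pubkey` in the test setup.
let (_subscription, receiver) = PubsubClient::block_subscribe(
    &rpc_node_pubsub_url,
    RpcBlockSubscribeFilter::All,
    Some(RpcBlockSubscribeConfig {
        commitment: Some(CommitmentConfig::confirmed()),
        ..RpcBlockSubscribeConfig::default()
    }),
)
.expect("block_subscribe against the non-leader node");

// Each received item is a Response<RpcBlockUpdate>; against the non-leader node
// the `err` field should stay empty once the slot is confirmed.
let update = receiver.recv().unwrap();
assert!(update.value.err.is_none());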

If we decide the race itself needs to be fixed, a solution could be to return an Option<BlockUpdate> with None in the case that !slot_meta.is_full() and we are the leader for that slot. This would neither notify nor increment the last notified slot, so that the block can be sent out on the next iteration:

let block_update_result = blockstore
    .get_complete_block(s, false)
    .map_err(|e| {
        error!("get_complete_block error: {}", e);
        RpcBlockUpdateError::BlockStoreError
    })
    .and_then(|block| filter_block_result_txs(block, s, params));
match block_update_result {
    Ok(block_update) => {
        if let Some(block_update) = block_update {
            notifier.notify(
                RpcResponse::from(RpcNotificationResponse {
                    context: RpcNotificationContext { slot: s },
                    value: block_update,
                }),
                subscription,
            );
        }
        // ...
    }
    // ...
}
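
A minimal sketch of that guard, under the assumption that the notification path can see the slot's SlotMeta and knows whether this node was the slot's leader; slot_meta and is_leader_for_slot are hypothetical names, not part of the existing rpc_subscriptions.rs code:

// Hedged sketch of the proposed guard; `slot_meta` and `is_leader_for_slot`
// are assumed helpers, and the downstream filter call would need to handle None.
let block_update_result = blockstore
    .get_complete_block(s, false)
    .map(Some)
    .or_else(|e| {
        if !slot_meta.is_full() && is_leader_for_slot(s) {
            // Our own block simply has not finished landing in blockstore yet:
            // skip the notification and leave w_last_unnotified_slot untouched so
            // that slot `s` is picked up again on the next iteration.
            Ok(None)
        } else {
            error!("get_complete_block error: {}", e);
            Err(RpcBlockUpdateError::BlockStoreError)
        }
    });
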
Alternatively, we could delay the leader's vote or the RPC notification entirely until the last FEC set has been inserted into blockstore.

Fixes #32863


codecov bot commented Dec 12, 2023

Codecov Report

Merging #34421 (67272d0) into master (e4e043c) will increase coverage by 0.0%.
Report is 1547 commits behind head on master.
The diff coverage is 92.3%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #34421      +/-   ##
==========================================
  Coverage    81.8%    81.8%              
==========================================
  Files         767      819      +52     
  Lines      209267   220904   +11637     
==========================================
+ Hits       171303   180839    +9536     
- Misses      37964    40065    +2101     

@carllin (Contributor) commented Dec 14, 2023

Great find!

If we decide the race itself needs to be fixed, a solution could be to return an Option<BlockUpdate> with None in the case that !slot_meta.is_full() and we are the leader for that slot. This would neither notify nor increment the last notified slot, so that the block can be sent out on the next iteration:

Hmm interesting, so in the RPC service right now, on a failure to fetch/notify block S for a specific single subscription, if w_last_unnotified_slot is not incremented:

*w_last_unnotified_slot = s + 1;

does that mean that the next time we get a notification for slot S+N, all blocks between S and S+N show up in slots_to_notify:

for s in slots_to_notify {

even if some of those blocks were already notified in a previous iteration?

Ideally, I think it makes sense for the written-block notification to be decoupled from the block-voted notification so we could notify separately, since block writing and block voting aren't necessarily coupled, and we wouldn't want block writing to delay the notification that a block was voted on.

carllin previously approved these changes Dec 14, 2023, leaving a review comment on this part of the test:
]
.iter()
.map(|s| (Arc::new(Keypair::from_base58_string(s)), true))
.take(node_stakes.len())
.collect::<Vec<_>>();
let rpc_node_pubkey = &validator_keys[1].0.pubkey();
Might be good to document or point to this PR for why we chose this validator key

AshwinSekar (Author) replied:
added the comment to the subscribe call

@AshwinSekar (Contributor, Author) commented Dec 14, 2023

does that mean that the next time we get a notification for slot S+N, all blocks between S and S+N show up in slots_to_notify?

Correct. So for this example, in the first iteration we would get a block update for slot 3 with:

RpcBlockUpdate {
    slot: 3,
    block: None,
    err: BlockStoreError,
}

Then when slot 4 is voted on, we would get two notifications:

RpcBlockUpdate {
    slot: 3,
    block: <actual block>,
    err: None,
}

RpcBlockUpdate {
    slot: 4,
    block: None,
    err: BlockStoreError,
}
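
To make the cursor behavior concrete, here is a tiny standalone model of the flow described above; it is an assumed simplification with hypothetical names, not the real rpc_subscriptions.rs code, in which the cursor advances only when a complete block is available:

// Toy model of the notification cursor (hypothetical, not the real RPC code):
// for each voted slot we walk from the last unnotified slot up to the voted slot,
// emit one update per slot, and advance the cursor only when the block is complete.
#[derive(Debug)]
enum Update {
    Block(u64),               // slot notified with the actual block
    Error(u64, &'static str), // slot notified with an error payload
}

fn notify_up_to(cursor: &mut u64, voted_slot: u64, block_is_complete: impl Fn(u64) -> bool) -> Vec<Update> {
    let mut sent = Vec::new();
    for s in *cursor..=voted_slot {
        if block_is_complete(s) {
            sent.push(Update::Block(s));
            *cursor = s + 1; // only a successful fetch advances the cursor
        } else {
            // the subscriber still sees a notification carrying the error, but the
            // cursor stays put, so slot `s` is retried on the next iteration
            sent.push(Update::Error(s, "BlockStoreError"));
        }
    }
    sent
}

fn main() {
    let mut cursor = 3;
    // Slot 3 is voted before its last FEC set lands in blockstore.
    println!("{:?}", notify_up_to(&mut cursor, 3, |_| false));
    // => [Error(3, "BlockStoreError")]

    // Slot 4 is voted; slot 3 is now complete, slot 4 hits the same race.
    println!("{:?}", notify_up_to(&mut cursor, 4, |s| s == 3));
    // => [Block(3), Error(4, "BlockStoreError")]
}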

Ideally, I think it makes sense for the written-block notification to be decoupled from the block-voted notification so we could notify separately, since block writing and block voting aren't necessarily coupled, and we wouldn't want block writing to delay the notification that a block was voted on.

Agreed, I'm not sure why voting is the trigger here. At least for the Processed confirmation level I would assume subscribers would want to know as soon as any block is available in blockstore.

@AshwinSekar merged commit 4f11161 into solana-labs:master Dec 14, 2023
34 checks passed
Successfully merging this pull request may close these issues.

Flaky Test: solana-local-cluster::local_cluster test_rpc_block_subscribe