Fix backfill stalling #5192

pawanjay176 · 2024-02-05T20:22:36Z

Issue Addressed

N/A

Proposed Changes

Hopefully fixes all the stalled-backfill issues we have been seeing.

The root cause is that the peer_disconnected function returns early if its call to retry_batch_download returns an error. This can happen when we have no synced peers to retry from.
This early return leaves multiple batches in BatchState::Downloading instead of moving them to BatchState::AwaitingDownload.
If the processing_target batch is one of the batches that failed to change state, backfill effectively stalls.
When we resume backfill after getting new peers, we make the following calls, either of which could in principle re-request the target:

Call to resume_batches

lighthouse/beacon_node/network/src/sync/backfill_sync/mod.rs

Lines 1002 to 1016 in c7e5dd1

    
    fn resume_batches(&mut self, network: &mut SyncNetworkContext<T>) -> Result<(), BackFillError> {
        let batch_ids_to_retry = self
            .batches
            .iter()
            .filter_map(|(batch_id, batch)| {
                // In principle there should only ever be on of these, and we could terminate the
                // loop early, however the processing is negligible and we continue the search
                // for robustness to handle potential future modification
                if matches!(batch.state(), BatchState::AwaitingDownload) {
                    Some(*batch_id)
                } else {
                    None
                }
            })
            .collect::<Vec<_>>();

Since the processing_target is stuck in BatchState::Downloading, it is not requested in this call

Call to request_batches which does nothing because of this condition:

lighthouse/beacon_node/network/src/sync/backfill_sync/mod.rs

Lines 1083 to 1091 in c7e5dd1

    
    if self
        .batches
        .iter()
        .filter(|&(_epoch, batch)| in_buffer(batch))
        .count()
        > BACKFILL_BATCH_BUFFER_SIZE as usize
    {
        return None;
    }

In most of the logs I have seen for this issue, batches.len() is greater than BACKFILL_BATCH_BUFFER_SIZE.

Hence, the target can never be requested again and the target state can never change until a restart.
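
To make the stall concrete, here is a minimal toy model of the two checks quoted above. It is a sketch under stated assumptions, not the Lighthouse code: the two-variant BatchState, the helper names batches_to_retry, can_request_new_batch and target_is_stuck, the buffer size of 2, and the choice to count every batch as buffered are all made up for illustration.

```rust
use std::collections::BTreeMap;

// Simplified stand-in for the real enum; only the two states discussed here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BatchState {
    AwaitingDownload,
    Downloading,
}

// Assumed value, chosen only to keep the example small.
pub const BACKFILL_BATCH_BUFFER_SIZE: usize = 2;

/// Same filter shape as the quoted `resume_batches` excerpt:
/// only batches in `AwaitingDownload` are picked up for a retry.
pub fn batches_to_retry(batches: &BTreeMap<u64, BatchState>) -> Vec<u64> {
    batches
        .iter()
        .filter_map(|(id, state)| matches!(state, BatchState::AwaitingDownload).then_some(*id))
        .collect()
}

/// Same comparison as the quoted buffer check: no new batch is requested
/// once the buffered count exceeds the limit. Assumption: this toy model
/// counts every batch as buffered instead of using the real `in_buffer`.
pub fn can_request_new_batch(batches: &BTreeMap<u64, BatchState>) -> bool {
    batches.len() <= BACKFILL_BATCH_BUFFER_SIZE
}

/// The target is stuck when neither path can make progress on it.
pub fn target_is_stuck(batches: &BTreeMap<u64, BatchState>, target: u64) -> bool {
    !batches_to_retry(batches).contains(&target) && !can_request_new_batch(batches)
}
```

With three batches all left in Downloading, batches_to_retry returns nothing and can_request_new_batch is false, so the target stays stuck until something moves it back to AwaitingDownload.
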
This PR handles the error from retry_batch_download inside peer_disconnected instead of short-circuiting.
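
The difference between short-circuiting and handling the error can be sketched as follows. This is a simplified model based only on the description above, not the actual diff: the slice of batch states, the synced_peers counter, the NoSyncedPeers error type, and both function names are hypothetical, and returning the last error after the loop is just one possible way to handle it.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BatchState {
    AwaitingDownload,
    Downloading,
}

// Hypothetical error type standing in for the real backfill error.
#[derive(Debug, PartialEq, Eq)]
pub struct NoSyncedPeers;

/// Stand-in for the retry: fails when there is no synced peer,
/// otherwise puts the batch back into `Downloading`.
pub fn retry_batch_download(
    state: &mut BatchState,
    synced_peers: usize,
) -> Result<(), NoSyncedPeers> {
    if synced_peers == 0 {
        return Err(NoSyncedPeers);
    }
    *state = BatchState::Downloading;
    Ok(())
}

/// Short-circuiting shape: `?` leaves the loop on the first failed retry,
/// so the remaining batches are never moved out of `Downloading`.
pub fn peer_disconnected_early_return(
    batches: &mut [BatchState],
    synced_peers: usize,
) -> Result<(), NoSyncedPeers> {
    for state in batches.iter_mut() {
        *state = BatchState::AwaitingDownload;
        retry_batch_download(state, synced_peers)?;
    }
    Ok(())
}

/// Handled shape: remember the error, keep updating every batch,
/// and report the failure only after the loop has finished.
pub fn peer_disconnected_handled(
    batches: &mut [BatchState],
    synced_peers: usize,
) -> Result<(), NoSyncedPeers> {
    let mut result = Ok(());
    for state in batches.iter_mut() {
        *state = BatchState::AwaitingDownload;
        if let Err(e) = retry_batch_download(state, synced_peers) {
            result = Err(e);
        }
    }
    result
}
```

With zero synced peers, the first version leaves all but the first batch in Downloading, while the second leaves every batch in AwaitingDownload, where a later resume can pick it up.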

AgeManning

Yeah nice find.

I also noticed that when we short-circuit we don't remove the peer from the participating_peers() mapping either, which looks like it could cause dormant issues.
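
One way to avoid a stale entry of this kind, sketched under assumptions: take the peer's entry out of the map before doing any fallible work. The map layout, the PeerId and BatchId aliases, the RetryFailed error, and on_peer_disconnected are all hypothetical and do not reflect how Lighthouse stores participating peers.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins for the real identifier types.
pub type PeerId = u32;
pub type BatchId = u64;

#[derive(Debug, PartialEq, Eq)]
pub struct RetryFailed;

/// Removes the peer's entry first, then retries each of its batches.
/// Because the removal happens before any fallible call, a failed retry
/// cannot leave the disconnected peer behind in the map.
pub fn on_peer_disconnected(
    active_requests: &mut HashMap<PeerId, HashSet<BatchId>>,
    peer: PeerId,
    mut retry: impl FnMut(BatchId) -> Result<(), RetryFailed>,
) -> Result<(), RetryFailed> {
    // An unknown peer yields an empty set, so the loop below is a no-op.
    let batch_ids = active_requests.remove(&peer).unwrap_or_default();
    let mut result = Ok(());
    for id in batch_ids {
        if let Err(e) = retry(id) {
            result = Err(e);
        }
    }
    result
}
```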

AgeManning · 2024-02-08T01:29:38Z

@Mergifyio queue

mergify · 2024-02-08T01:29:41Z

queue

The pull request has been merged automatically at 0b59d10

* Prevent early short circuit in `peer_disconnected`
* lint

Prevent early short circuit in peer_disconnected

80b4b08

pawanjay176 force-pushed the fix-backfill branch from 0e3175c to 80b4b08 Compare February 5, 2024 20:25

pawanjay176 added the ready-for-review The code is ready for review label Feb 5, 2024

pawanjay176 requested a review from AgeManning February 5, 2024 20:48

lint

f89eb60

AgeManning approved these changes Feb 8, 2024

AgeManning added ready-for-merge This PR is ready to merge. v5.0.0 Q1 2024 and removed ready-for-review The code is ready for review labels Feb 8, 2024

mergify bot added a commit that referenced this pull request Feb 8, 2024

Merge of #5192

81dad59

This was referenced Feb 8, 2024

merge queue: embarking unstable (675a231) and [#5177 + #5192 + #4870 + #5029] together #5217

Closed

improve libp2p connected peer metrics #4870

Merged

chore(docs): amend port guidance to enable QUIC support #5029

Merged

Improve network parameters #5177

Merged

mergify bot merged commit 0b59d10 into sigp:unstable Feb 8, 2024
29 checks passed

danielrachi1 pushed a commit to danielrachi1/lighthouse that referenced this pull request Feb 14, 2024

Fix backfill stalling (sigp#5192)

40a905c

* Prevent early short circuit in `peer_disconnected`
* lint

michaelsproul mentioned this pull request Feb 20, 2024

Backfilling is not resumed after paused #3715

Open

chong-he mentioned this pull request Mar 29, 2024

can't synch. #5492

Closed
