kvserver: a slow follower should not stall the quota pool #113054
cc @cockroachdb/replication
Hello, @annabelto and I would like to take this issue. We are Master's students at The University of Texas at Austin and are looking to contribute to an open source project concerned with distributed systems for our coursework in the subject. Can you also provide some more information on the deliverables of this issue? Specifically, we're unfamiliar with "upreplication". We can see that the long-term plan is to phase out this feature, but we would like to help address it in the short term.
Hi @mzone242, thanks for your interest! We haven't decided on a direction here yet, but may know more next week.
We're leaning towards removing the quota pool outright, since Raft will naturally pace replication at the quorum rate. See #106063.
Is that something we can help with at this time?
We can start by adding a cluster setting that allows disabling the quota pool. I wrote up a brief outline over in #106063 (comment); let's continue the discussion on that issue.
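For concreteness, here is a minimal sketch of what such a setting might look like, assuming the `settings.RegisterBoolSetting` helper and a hypothetical setting name `kv.raft.quota_pool.enabled`; the actual name, class, and wiring are still to be decided over in #106063.

```go
// A hypothetical cluster setting; the name, description, and setting class
// here are assumptions, not the final design.
package kvserver

import "github.com/cockroachdb/cockroach/pkg/settings"

// quotaPoolEnabled, when set to false, would bypass the proposal quota pool
// entirely, so a slow follower could no longer throttle proposals on a range.
var quotaPoolEnabled = settings.RegisterBoolSetting(
	settings.SystemOnly,
	"kv.raft.quota_pool.enabled", // hypothetical name
	"when false, proposals are not throttled by the per-range quota pool",
	true,
)
```

Roughly speaking, the quota acquisition path would then consult this setting and skip quota acquisition when it is disabled.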
The quota pool halts proposals when a certain number of proposals are in flight but not yet replicated and applied on all replicas. This means that a single slow follower can stall the entire range. There are exceptions: we ignore followers we haven't communicated with in a while, and we pause followers that are IO-overloaded (see the simplified sketch after the file references below):
cockroach/pkg/kv/kvserver/replica_proposal_quota.go, lines 172 to 174 (at d80a9f5)
cockroach/pkg/kv/kvserver/replica_proposal_quota.go, lines 217 to 224 (at d80a9f5)
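The following is a simplified, self-contained sketch of the bookkeeping described above, not the actual implementation: quota is released only up to the slowest follower that is still tracked, so one live-but-slow follower holds back the whole range. The follower fields and thresholds are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

type follower struct {
	id           int
	appliedIndex uint64    // highest log index the follower has applied
	lastActive   time.Time // last time we heard from it
	ioOverloaded bool      // paused due to IO overload
}

// minTrackedIndex returns the index up to which proposal quota can be
// released: the minimum applied index over all followers, except those we
// haven't heard from recently or have paused for IO overload.
func minTrackedIndex(followers []follower, now time.Time, inactiveAfter time.Duration) uint64 {
	minIdx := uint64(math.MaxUint64)
	for _, f := range followers {
		if now.Sub(f.lastActive) > inactiveAfter || f.ioOverloaded {
			continue // exception: ignore inactive or IO-overloaded followers
		}
		if f.appliedIndex < minIdx {
			minIdx = f.appliedIndex
		}
	}
	return minIdx
}

func main() {
	now := time.Now()
	followers := []follower{
		{id: 1, appliedIndex: 1000, lastActive: now},
		{id: 2, appliedIndex: 1000, lastActive: now},
		{id: 3, appliedIndex: 10, lastActive: now}, // live but slow: gates quota release
	}
	fmt.Println(minTrackedIndex(followers, now, 10*time.Second)) // prints 10
}
```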
However, if a follower is live but simply slow, it will drag the whole range down with it. We saw this happen in #113053, for example, where a follower was stuck in an append loop and kept rejecting MsgApps for hours, which stalled the range.
We shouldn't let a minority of followers stall the range. Instead, we should give up on them and allow progress with the remaining quorum. However, this requires the allocator to upreplicate elsewhere, since otherwise the range will stall if we lose one of the healthy replicas.
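As a hedged sketch of that alternative (illustrative only; the real change would live in the quota pool or in replication admission control): instead of waiting for every tracked follower, release quota up to the index that a quorum of replicas has reached, i.e. the quorum-th highest applied index.

```go
package main

import (
	"fmt"
	"sort"
)

// quorumIndex returns the highest log index that at least `quorum` replicas
// have applied, so a minority of slow followers no longer gates quota release.
// Assumes 1 <= quorum <= len(appliedIndexes).
func quorumIndex(appliedIndexes []uint64, quorum int) uint64 {
	sorted := append([]uint64(nil), appliedIndexes...)
	// Sort descending; the quorum-th entry is the index a quorum has reached.
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	return sorted[quorum-1]
}

func main() {
	// Three replicas, one far behind; a quorum of 2 can still make progress.
	fmt.Println(quorumIndex([]uint64{1000, 990, 10}, 2)) // prints 990
}
```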
This is related to several previous discussions, as well as to replication admission control, which we hope will replace the quota pool in the longer term. For example:
Jira issue: CRDB-32733
Epic CRDB-39900