kvserver: a slow follower should not stall the quota pool #113054

Open
erikgrinaker opened this issue Oct 25, 2023 · 6 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments


erikgrinaker commented Oct 25, 2023

The quota pool halts new proposals once a certain volume of proposals is in flight, i.e. not yet replicated and applied on all replicas. This means that a single slow follower can stall the entire range. There are exceptions: we ignore followers we haven't communicated with in a while, as well as followers that are paused because they are IO overloaded:

if !r.mu.lastUpdateTimes.isFollowerActiveSince(rep.ReplicaID, now, r.store.cfg.RangeLeaseDuration) {
	return
}

if _, paused := r.mu.pausedFollowers[roachpb.ReplicaID(id)]; paused {
	// We are dropping MsgApp to this store, so we are effectively treating
	// it as non-live for the purpose of replication and are letting it fall
	// behind intentionally.
	//
	// See #79215.
	return
}

However, if the follower is live but simply slow, then it will drag the whole range down with it. We saw this happen e.g. in #113053, where the follower was stuck in an append loop, and kept rejecting MsgApps for hours, which stalled the range.

We shouldn't let a minority of followers stall the range. Instead, we should give up on them and allow progress with the remaining quorum. However, this requires the allocator to upreplicate elsewhere, since the range will otherwise stall if we lose one of the healthy replicas.
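One way to picture "progress with the remaining quorum" is to release quota up to the index that a majority has acknowledged, rather than the index that every tracked replica has acknowledged. The sketch below is an illustration under that assumption, with a hypothetical `quorumIndex` helper; it is not a proposed patch.

```go
package main

import "sort"

// quorumIndex returns the highest log index that a majority of replicas
// have acknowledged, given one match index per replica (leader included).
// It returns 0 for an empty input.
func quorumIndex(matchIndexes []uint64) uint64 {
	if len(matchIndexes) == 0 {
		return 0
	}
	// Copy so the caller's slice is not reordered.
	sorted := append([]uint64(nil), matchIndexes...)
	// Sort descending: the quorum-th highest entry is acknowledged by
	// at least a quorum of replicas.
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(sorted)/2 + 1
	return sorted[quorum-1]
}
```

With three replicas at indexes 100, 98, and 10, this yields 98, whereas a minimum over all replicas yields 10. The trade-off is the one noted above: once the slow replica is left behind, losing either healthy replica leaves no quorum that is caught up.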

This is related to several previous discussions, and also to replication admission control, which we hope will replace the quota pool in the longer term.

Jira issue: CRDB-32733

Epic CRDB-39900

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication labels Oct 25, 2023

blathers-crl bot commented Oct 25, 2023

cc @cockroachdb/replication


mzone242 commented Nov 5, 2023

Hello, @annabelto and I would like to take this issue. We are Master's students at The University of Texas at Austin and are looking to contribute to an open source project concerned with distributed systems for our coursework in the subject.

Can you also provide some more information on the deliverables of this issue? Specifically, we're unfamiliar with "upreplication". We can see that the long term plan is to phase out this feature, but would like to help address it in the short term.

erikgrinaker commented

Hi @mzone242, thanks for your interest! We haven't decided on a direction here yet, but may know more next week.


erikgrinaker commented Nov 14, 2023

We're leaning towards removing the quota pool outright, since Raft will naturally pace replication at the quorum rate. See #106063.

mzone242 commented

Is that something we can help with at this time?

erikgrinaker commented

We can start by adding a cluster setting that allows disabling the quota pool. I wrote up a brief outline over in #106063 (comment), let's continue the discussion on that issue.
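As a rough illustration of what gating the pool on a setting could look like, here is a self-contained sketch. Everything in it is hypothetical: `quotaPoolDisabled` is a plain atomic flag standing in for a cluster setting (it does not use CockroachDB's settings package), and `quotaPool` is a toy that returns an error where the real pool would block.

```go
package main

import (
	"errors"
	"sync"
	"sync/atomic"
)

// quotaPoolDisabled stands in for a cluster setting. Hypothetical name;
// the zero value (false) keeps the pool enabled by default.
var quotaPoolDisabled atomic.Bool

var errNoQuota = errors.New("insufficient proposal quota")

// quotaPool is a toy, non-blocking quota pool used only for illustration.
type quotaPool struct {
	mu        sync.Mutex
	available uint64
}

// tryAcquire reserves size units of quota and returns the amount actually
// reserved, which the caller later passes to release. When the pool is
// disabled it admits the proposal without reserving anything.
func (q *quotaPool) tryAcquire(size uint64) (reserved uint64, err error) {
	if quotaPoolDisabled.Load() {
		return 0, nil
	}
	q.mu.Lock()
	defer q.mu.Unlock()
	if size > q.available {
		return 0, errNoQuota
	}
	q.available -= size
	return size, nil
}

// release returns previously reserved quota to the pool.
func (q *quotaPool) release(reserved uint64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.available += reserved
}
```

Returning the reserved amount from `tryAcquire` means a proposal admitted while the pool was disabled releases zero later, so toggling the flag at runtime cannot inflate the pool.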

@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024
@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024