kvserver: replicate and split queue interactions can cause degraded availability #107520

tbg · 2023-07-25T09:10:14Z

Describe the problem

See the detailed analysis in #104588.

When a lot of (say, size-based) splits happen alongside replica movement, one can end up in a situation where lots of replicas need a raft snapshot, and the raft snapshots trickle in very slowly, essentially due to "keyspace contention" between "stale" replicas (that haven't caught up across all splits) and "new" snapshots (that reflect all splits).

Fundamentally, this is because splitting a range for which a follower needs a snapshot results in two ranges for which a follower needs a snapshot, but the snapshot for the right-hand side can only go through once the snapshot for the left-hand side has. This interdependence between snapshots, which is not visible to the raft snapshot queue, causes the backlog of snapshots to be processed very slowly, especially once it has ballooned to hundreds of snapshots, which can easily happen with enough splits. Additionally, if the snapshots involved are large, there are further pathologies such as the lease changing hands while snapshots are still in flight¹, resulting in wasted work.
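
To make the dependency concrete, here is a toy model in Python. It is not CockroachDB code: `drain_snapshots`, the rule that a snapshot is rejected whenever its span overlaps a different replica already on the follower, and the "retry at the back of the queue" behavior are all simplifying assumptions for illustration. The follower holds a stale replica of r1 covering the whole pre-split keyspace, and the queue does not know that r2 and r3 are blocked until r1's snapshot has shrunk that stale replica.

```python
from collections import deque
from typing import Dict, List, Tuple

Span = Tuple[str, str]  # half-open key span [start, end)


def overlaps(a: Span, b: Span) -> bool:
    """Report whether two half-open key spans intersect."""
    return a[0] < b[1] and b[0] < a[1]


def drain_snapshots(
    follower: Dict[int, Span], queue: List[Tuple[int, Span]]
) -> Tuple[List[int], int]:
    """Apply queued snapshots to a follower (hypothetical model).

    follower maps range ID -> span of the replica currently on the follower.
    queue lists (range ID, post-split span) in the order the queue tries them.
    Returns (order in which snapshots applied, number of rejected attempts).
    """
    pending = deque(queue)
    applied: List[int] = []
    rejected = 0
    stalled = 0  # consecutive rejections, used to stop if nothing can progress
    while pending and stalled < len(pending):
        range_id, span = pending.popleft()
        # Assumption: a snapshot cannot be applied while its span overlaps
        # a replica of a *different* range on the follower.
        blocked = any(
            other_id != range_id and overlaps(span, other_span)
            for other_id, other_span in follower.items()
        )
        if blocked:
            rejected += 1
            stalled += 1
            pending.append((range_id, span))  # retry later
            continue
        follower[range_id] = span  # applying r1's snapshot shrinks the stale r1
        applied.append(range_id)
        stalled = 0
    return applied, rejected


if __name__ == "__main__":
    # Follower still has r1 = [a, z); the leader has since split it in three.
    stale = {1: ("a", "z")}
    unlucky_order = [(3, ("n", "z")), (2, ("g", "n")), (1, ("a", "g"))]
    print(drain_snapshots(dict(stale), unlucky_order))
```

In this model, every right-hand-side snapshot tried before r1's is a wasted attempt. With hundreds of right-hand sides queued ahead of the left-hand side they depend on, the wasted attempts grow accordingly, and with large snapshots each one is expensive.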

To Reproduce

Not sure how to reliably trigger this. These kinds of issues have kept us busy for a long time². Usually, stressing a test that suitably combines rebalancing and splits while verifying that there aren't any raft snapshots is enough to see these kinds of interactions.

Expected behavior

The goal should be that the only raft snapshots we ever see are due to log truncations (in which case we may wonder if the log truncation heuristics could be improved, but this is outside of the scope of this issue).

Jira issue: CRDB-30091

Epic CRDB-39952

¹ This happens a few times in roachtest: splits/largerange/size=32GiB,nodes=6 failed [raft snaps; needs #106813] #104588, but I'm not sure why.
² See https://cockroachlabs.atlassian.net/wiki/spaces/CORE/pages/64749670/Raft+Snapshots+and+why+you+see+them+when+you+oughtn+t (internal).

blathers-crl · 2023-07-25T09:10:16Z

cc @cockroachdb/replication

tbg added the C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) and T-kv-replication labels Jul 25, 2023

tbg mentioned this issue Jul 25, 2023

roachtest: splits/largerange/size=32GiB,nodes=6 failed [raft snaps; needs #106813] #104588

Closed

exalate-issue-sync bot added the T-kv (KV Team) label and removed the T-kv-replication label Jun 28, 2024

github-project-automation bot added this to KV Aug 28, 2024

github-project-automation bot moved this to Incoming in KV Aug 28, 2024
