
kvflowcontrol,admission: use flow control during raft log catchup post node-restart #98710

Open
irfansharif opened this issue Mar 15, 2023 · 3 comments
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA T-admission-control Admission Control

Comments

@irfansharif
Contributor

irfansharif commented Mar 15, 2023

Is your feature request related to a problem? Please describe.

We've seen in write-heavy workloads that node restarts can result in LSM inversion due to a rapid onset of raft log catchup appends. This problem was touched on recently in #96521 + #95159 -- those issues amounted to 23.1 changes that avoid an immediate transfer of leases to newly-restarted nodes until their LSM is healthier, in order to stave off latency impact for leaseholder traffic. But catchup appends can still invert the LSM, which affects non-leaseholder traffic.

This issue proposes using the general flow control mechanism we're introducing in #95563 to pace the rate of catchup raft log appends to prevent LSM inversion entirely. With such a mechanism, we'd be able to transfer leases immediately to newly restarted nodes without lease-holder impact, and also avoid latency impact on follower traffic. #80607 is slightly related -- we could apply flow tokens to raft snapshots too to cover the general case of "catchup write traffic".

Describe the solution you'd like

// I11. What happens when a node is restarted and is being caught up rapidly
// through raft log appends? We know of cases where the initial log appends
// and subsequent state machine application can be large enough to invert the
// LSM[^9]. Imagine large block writes with a uniform key distribution; we
// may persist log entries rapidly across many replicas (without inverting
// the LSM, so follower pausing is also of no help) and during state
// machine application, create lots of overlapping files/sublevels in L0.
// - We want to pace the initial rate of log appends while factoring in the
// effect of the subsequent state machine application on L0 (modulo [^9]). We
// can use flow tokens for this too. In I3a we outlined how for quorum writes
// that include a replica on some recently re-started node, we need to wait
// for it to be sufficiently caught up before deducting/blocking for flow tokens.
// Until that point we can use flow tokens on sender nodes that wish to send
// catchup MsgApps to the newly-restarted node. Similar to the steady state,
// flow tokens are only returned once log entries are logically admitted
// (which takes into account any apply-time write amplification, modulo [^9]).
// Once the node is sufficiently caught up with respect to all its raft logs,
// it can transition into the mode described in I3a where we deduct/block for
// flow tokens for subsequent quorum writes.
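To make the proposed sender-side pacing concrete, here's a minimal Go sketch. This is not the kvflowcontrol API: the `catchupPacer` type, its methods, and the token budget are hypothetical names invented for illustration. It demonstrates the idea above: deduct flow tokens before sending each catchup MsgApp to the restarted node, block the sender when the budget is exhausted, and return tokens only once the entry is logically admitted (post-apply) on the receiver.

```go
// Hypothetical sketch, not the kvflowcontrol API. Illustrates pacing catchup
// MsgApps on the sender by deducting flow tokens up front and only returning
// them once the receiver reports logical admission (post-apply).
package main

import (
	"context"
	"fmt"
	"sync"
)

// catchupPacer gates catchup raft log appends to a restarted node behind a
// fixed budget of flow tokens (bytes). Tokens are deducted before a MsgApp
// is sent and returned only when the entry is logically admitted below raft
// on the receiver, which accounts for apply-time write amplification.
type catchupPacer struct {
	mu        sync.Mutex
	available int64 // tokens (bytes) currently available
	waiters   []chan struct{}
}

func newCatchupPacer(budget int64) *catchupPacer {
	return &catchupPacer{available: budget}
}

// admit blocks until `size` tokens can be deducted, pacing the sender.
func (p *catchupPacer) admit(ctx context.Context, size int64) error {
	for {
		p.mu.Lock()
		if p.available >= size {
			p.available -= size
			p.mu.Unlock()
			return nil
		}
		ch := make(chan struct{})
		p.waiters = append(p.waiters, ch)
		p.mu.Unlock()
		select {
		case <-ch: // tokens were returned; retry the deduction
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// onLogicallyAdmitted is invoked when the receiver acknowledges that the
// entry was logically admitted; it returns the tokens and wakes blocked
// senders.
func (p *catchupPacer) onLogicallyAdmitted(size int64) {
	p.mu.Lock()
	p.available += size
	waiters := p.waiters
	p.waiters = nil
	p.mu.Unlock()
	for _, ch := range waiters {
		close(ch)
	}
}

func main() {
	p := newCatchupPacer(4 << 20) // illustrative 4 MiB catchup budget
	ctx := context.Background()

	// Sender side: before each catchup MsgApp, wait for tokens.
	entrySize := int64(1 << 20)
	if err := p.admit(ctx, entrySize); err == nil {
		fmt.Println("sent catchup MsgApp")
	}
	// Receiver side (via a piggybacked ack): return tokens post-admission.
	p.onLogicallyAdmitted(entrySize)
}
```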

// [^9]: With async raft storage writes (#17500, etcd-io/raft#8), we can
// decouple raft log appends and state machine application (see #94854 and
// #94853). So we could append at a higher rate than applying. Since
// application can be arbitrarily deferred, we can cause severe LSM
// inversions. Do we want some form of pacing of log appends then,
// relative to observed state machine application? Perhaps specifically in
// cases where we're more likely to append faster than apply, like node
// restarts. We're likely to defeat AC's IO control otherwise.
// - For what it's worth, this "deferred application with high read-amp"
// was also a problem before async raft storage writes. Consider many
// replicas on an LSM, all of which appended a few raft log entries
// without applying, and at apply time across all those replicas, we end
// up inverting the LSM.
// - Since we don't want to wait below raft, one way to bound the lag between
// appended entries and applied ones is to only release flow tokens for
// an entry at position P once the applied state position >= P - delta.
// We'd have to be careful, if we're not applying due to quorum loss
// (as a result of remote node failure(s)), we don't want to deplete
// flow tokens and cause interference on other ranges.
// - If this all proves too complicated, we could just not let state
// machine application get significantly behind due to local scheduling
// reasons by using the same goroutine to do both async raft log writes
// and state machine application.
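As a minimal sketch of the lag-bounding idea in the footnote (release tokens for an entry at position P only once the applied position is >= P - delta): the `lagBoundedReleaser` type and its `maxLag` knob below are hypothetical, not part of kvflowcontrol.

```go
// Hypothetical sketch of "release tokens for position P only once the
// applied index reaches P - delta". Names and the delta value are
// illustrative, not an existing implementation.
package main

import "fmt"

// pendingEntry tracks flow tokens held for a log entry at a given raft
// position until its release is permitted.
type pendingEntry struct {
	position int64
	tokens   int64
}

// lagBoundedReleaser holds per-entry tokens and releases them only when the
// applied state machine position has caught up to within maxLag entries.
type lagBoundedReleaser struct {
	maxLag  int64
	pending []pendingEntry // ordered by position
}

// track records tokens deducted for an appended-but-not-yet-applied entry.
func (r *lagBoundedReleaser) track(position, tokens int64) {
	r.pending = append(r.pending, pendingEntry{position: position, tokens: tokens})
}

// onApplied is called when the applied index advances; it returns the number
// of tokens that may now be handed back to the flow token bucket.
func (r *lagBoundedReleaser) onApplied(appliedIndex int64) int64 {
	var released int64
	i := 0
	for ; i < len(r.pending); i++ {
		// Release tokens for position P only once appliedIndex >= P - maxLag.
		if appliedIndex < r.pending[i].position-r.maxLag {
			break
		}
		released += r.pending[i].tokens
	}
	r.pending = r.pending[i:]
	return released
}

func main() {
	r := &lagBoundedReleaser{maxLag: 2}
	r.track(10, 512)
	r.track(11, 512)
	r.track(12, 512)

	fmt.Println(r.onApplied(7))  // 0: applied index too far behind
	fmt.Println(r.onApplied(9))  // 1024: positions 10 and 11 are within the lag bound
	fmt.Println(r.onApplied(12)) // 512: position 12 released
}
```

The same caveat from the footnote applies: if application stalls for reasons other than local scheduling (e.g. quorum loss), a release rule like this would pin tokens and could interfere with other ranges, so any real implementation would need an escape hatch.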

This is also touched on here: https://reviewable.io/reviews/cockroachdb/cockroach/96642#-NOF0fl20XuBrK4Ywlzp.

Jira issue: CRDB-25460

@irfansharif irfansharif added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-admission-control labels Mar 15, 2023
@exalate-issue-sync exalate-issue-sync bot added the T-kv KV Team label Mar 15, 2023
@irfansharif
Contributor Author

#101321 talks about how follower pausing post-restart (due to L0 overload caused by raft log catchup), combined with a rolling restart of another node, could lead to range unavailability. If we used flow control during raft log catchup, follower pausing would not kick in on the first node. We'd still want to surface in our healthcheck API the fact that a node is actively being caught up using flow tokens, or that follower pausing is kicking in somewhere, so operators don't restart another node while ranges are effectively under-replicated, lest they lose quorum. (If a follower is paused, or a node is being caught up via flow tokens, some ranges are "under-replicated" in the sense that not all replicas have all log entries.)
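A hedged sketch of that health signal, assuming a per-range view the store could expose; the names (`rangeCatchupState`, `nodeReadyForNextRestart`) are invented for illustration and don't correspond to an existing API. The point is only that orchestration should hold off restarting the next node while any range is still mid-catchup behind flow tokens or has a paused follower.

```go
// Hypothetical sketch of a "safe to restart the next node?" signal.
package main

import "fmt"

// rangeCatchupState is the per-range view a store might expose.
type rangeCatchupState struct {
	rangeID           int64
	catchingUpViaFlow bool // catchup MsgApps still gated behind flow tokens
	hasPausedFollower bool // follower pausing is active for this range
}

// nodeReadyForNextRestart reports whether restarting another node looks safe
// from this node's perspective: no range is effectively under-replicated
// because it's mid-catchup or has a paused follower.
func nodeReadyForNextRestart(ranges []rangeCatchupState) (bool, int) {
	notReady := 0
	for _, r := range ranges {
		if r.catchingUpViaFlow || r.hasPausedFollower {
			notReady++
		}
	}
	return notReady == 0, notReady
}

func main() {
	ranges := []rangeCatchupState{
		{rangeID: 1},
		{rangeID: 2, catchingUpViaFlow: true},
	}
	ok, n := nodeReadyForNextRestart(ranges)
	fmt.Printf("ready=%t, ranges still catching up or paused: %d\n", ok, n)
}
```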

@irfansharif irfansharif added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label May 17, 2023
@shralex shralex removed the T-kv KV Team label Oct 24, 2023
@exalate-issue-sync exalate-issue-sync bot added the T-kv KV Team label Oct 24, 2023
@exalate-issue-sync exalate-issue-sync bot added T-admission-control Admission Control and removed T-kv KV Team labels Oct 26, 2023
@nicktrav nicktrav added the P-3 Issues/test failures with no fix SLA label Jan 17, 2024
@dshjoshi

dshjoshi commented Oct 1, 2024

@nicktrav @sumeerbhola Is this taken care of by RAC v2? If so, can we close this out?

@nicktrav
Collaborator

nicktrav commented Oct 3, 2024

Let's keep this open until that work completes and we can re-evaluate.
