
kvflowcontrol,admission: use flow control during raft log catchup post node-restart #98710

Open
irfansharif opened this issue Mar 15, 2023 · 3 comments
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA T-admission-control Admission Control

Comments

@irfansharif
Contributor

irfansharif commented Mar 15, 2023

Is your feature request related to a problem? Please describe.

We've seen in write-heavy workloads that node restarts can result in LSM inversion due to a rapid onset of raft log catchup appends. This problem was touched on recently in #96521 + #95159 -- those issues amounted to 23.1 changes that avoid an immediate transfer of leases to newly-restarted nodes until their LSM is healthier, in order to stave off latency impact for leaseholder traffic. But catchup appends can still invert the LSM, which affects non-leaseholder traffic.

This issue proposes using the general flow control mechanism we're introducing in #95563 to pace the rate of catchup raft log appends to prevent LSM inversion entirely. With such a mechanism, we'd be able to transfer leases immediately to newly restarted nodes without lease-holder impact, and also avoid latency impact on follower traffic. #80607 is slightly related -- we could apply flow tokens to raft snapshots too to cover the general case of "catchup write traffic".

Describe the solution you'd like

// I11. What happens when a node is restarted and is being caught up rapidly
// through raft log appends? We know of cases where the initial log appends
// and subsequent state machine application can be large enough to invert the
// LSM[^9]. Imagine large block writes with a uniform key distribution; we
// may persist log entries rapidly across many replicas (without inverting
// the LSM, so follower pausing is also of no help) and during state
// machine application, create lots of overlapping files/sublevels in L0.
// - We want to pace the initial rate of log appends while factoring in the
// effect of the subsequent state machine application on L0 (modulo [^9]). We
// can use flow tokens for this too. In I3a we outlined how for quorum writes
// that include a replica on some recently re-started node, we need to wait
// for it to be sufficiently caught up before deducting/blocking for flow tokens.
// Until that point we can use flow tokens on sender nodes that wish to send
// catchup MsgApps to the newly-restarted node. Similar to the steady state,
// flow tokens are only returned once log entries are logically admitted
// (which takes into account any apply-time write amplification, modulo [^9]).
// Once the node is sufficiently caught up with respect to all its raft logs,
// it can transition into the mode described in I3a where we deduct/block for
// flow tokens for subsequent quorum writes.
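To make the proposed sender-side pacing concrete, here's a minimal Go sketch. This is not the kvflowcontrol API: the `catchupPacer` type, its methods, and the token budget are hypothetical names invented for illustration. It demonstrates the idea above: deduct flow tokens before sending each catchup MsgApp to the restarted node, block the sender when the budget is exhausted, and return tokens only once the entry is logically admitted (post-apply) on the receiver.

```go
// Hypothetical sketch, not the kvflowcontrol API. Illustrates pacing catchup
// MsgApps on the sender by deducting flow tokens up front and only returning
// them once the receiver reports logical admission (post-apply).
package main

import (
	"context"
	"fmt"
	"sync"
)

// catchupPacer gates catchup raft log appends to a restarted node behind a
// fixed budget of flow tokens (bytes). Tokens are deducted before a MsgApp
// is sent and returned only when the entry is logically admitted below raft
// on the receiver, which accounts for apply-time write amplification.
type catchupPacer struct {
	mu        sync.Mutex
	available int64 // tokens (bytes) currently available
	waiters   []chan struct{}
}

func newCatchupPacer(budget int64) *catchupPacer {
	return &catchupPacer{available: budget}
}

// admit blocks until `size` tokens can be deducted, pacing the sender.
func (p *catchupPacer) admit(ctx context.Context, size int64) error {
	for {
		p.mu.Lock()
		if p.available >= size {
			p.available -= size
			p.mu.Unlock()
			return nil
		}
		ch := make(chan struct{})
		p.waiters = append(p.waiters, ch)
		p.mu.Unlock()
		select {
		case <-ch: // tokens were returned; retry the deduction
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// onLogicallyAdmitted is invoked when the receiver acknowledges that the
// entry was logically admitted; it returns the tokens and wakes blocked
// senders.
func (p *catchupPacer) onLogicallyAdmitted(size int64) {
	p.mu.Lock()
	p.available += size
	waiters := p.waiters
	p.waiters = nil
	p.mu.Unlock()
	for _, ch := range waiters {
		close(ch)
	}
}

func main() {
	p := newCatchupPacer(4 << 20) // illustrative 4 MiB catchup budget
	ctx := context.Background()

	// Sender side: before each catchup MsgApp, wait for tokens.
	entrySize := int64(1 << 20)
	if err := p.admit(ctx, entrySize); err == nil {
		fmt.Println("sent catchup MsgApp")
	}
	// Receiver side (via a piggybacked ack): return tokens post-admission.
	p.onLogicallyAdmitted(entrySize)
}
```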

// [^9]: With async raft storage writes (#17500, etcd-io/raft#8), we can
// decouple raft log appends and state machine application (see #94854 and
// #94853). So we could append at a higher rate than applying. Since
// application can be arbitrarily deferred, we can cause severe LSM
// inversions. Do we want some form of pacing of log appends then,
// relative to observed state machine application? Perhaps specifically in
// cases where we're more likely to append faster than apply, like node
// restarts. We're likely to defeat AC's IO control otherwise.
// - For what it's worth, this "deferred application with high read-amp"
// was also a problem before async raft storage writes. Consider many
// replicas on an LSM, all of which appended a few raft log entries
// without applying, and at apply time across all those replicas, we end
// up inverting the LSM.
// - Since we don't want to wait below raft, one way to bound the lag between
// appended entries and applied ones is to only release flow tokens for
// an entry at position P once the applied state position >= P - delta.
// We'd have to be careful, if we're not applying due to quorum loss
// (as a result of remote node failure(s)), we don't want to deplete
// flow tokens and cause interference on other ranges.
// - If this all proves too complicated, we could just not let state
// machine application get significantly behind due to local scheduling
// reasons by using the same goroutine to do both async raft log writes
// and state machine application.
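As a minimal sketch of the lag-bounding idea in the footnote (release tokens for an entry at position P only once the applied position is >= P - delta): the `lagBoundedReleaser` type and its `maxLag` knob below are hypothetical, not part of kvflowcontrol.

```go
// Hypothetical sketch of "release tokens for position P only once the
// applied index reaches P - delta". Names and the delta value are
// illustrative, not an existing implementation.
package main

import "fmt"

// pendingEntry tracks flow tokens held for a log entry at a given raft
// position until its release is permitted.
type pendingEntry struct {
	position int64
	tokens   int64
}

// lagBoundedReleaser holds per-entry tokens and releases them only when the
// applied state machine position has caught up to within maxLag entries.
type lagBoundedReleaser struct {
	maxLag  int64
	pending []pendingEntry // ordered by position
}

// track records tokens deducted for an appended-but-not-yet-applied entry.
func (r *lagBoundedReleaser) track(position, tokens int64) {
	r.pending = append(r.pending, pendingEntry{position: position, tokens: tokens})
}

// onApplied is called when the applied index advances; it returns the number
// of tokens that may now be handed back to the flow token bucket.
func (r *lagBoundedReleaser) onApplied(appliedIndex int64) int64 {
	var released int64
	i := 0
	for ; i < len(r.pending); i++ {
		// Release tokens for position P only once appliedIndex >= P - maxLag.
		if appliedIndex < r.pending[i].position-r.maxLag {
			break
		}
		released += r.pending[i].tokens
	}
	r.pending = r.pending[i:]
	return released
}

func main() {
	r := &lagBoundedReleaser{maxLag: 2}
	r.track(10, 512)
	r.track(11, 512)
	r.track(12, 512)

	fmt.Println(r.onApplied(7))  // 0: applied index too far behind
	fmt.Println(r.onApplied(9))  // 1024: positions 10 and 11 are within the lag bound
	fmt.Println(r.onApplied(12)) // 512: position 12 released
}
```

The same caveat from the footnote applies: if application stalls for reasons other than local scheduling (e.g. quorum loss), a release rule like this would pin tokens and could interfere with other ranges, so any real implementation would need an escape hatch.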

This is also touched on here: https://reviewable.io/reviews/cockroachdb/cockroach/96642#-NOF0fl20XuBrK4Ywlzp.

Jira issue: CRDB-25460

@irfansharif irfansharif added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-admission-control labels Mar 15, 2023
@exalate-issue-sync exalate-issue-sync bot added the T-kv KV Team label Mar 15, 2023
@irfansharif
Contributor Author

#101321 talks about how follower pausing post-restart (due to L0 overload caused by raft log catchup), combined with a rolling restart of another node, could lead to range unavailability. If we used flow control during raft log catchup, follower pausing would not kick in on the first node. We'd still want to surface in our healthcheck API the fact that a node is actively being caught up using flow tokens, or that follower pausing is kicking in somewhere, so operators don't restart another node while ranges are effectively under-replicated, lest they lose quorum. (If a follower is paused, or a node is being caught up via flow tokens, some ranges are "under-replicated" in the sense that not all replicas have all log entries.)
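A hedged sketch of that health signal, assuming a per-range view the store could expose; the names (`rangeCatchupState`, `nodeReadyForNextRestart`) are invented for illustration and don't correspond to an existing API. The point is only that orchestration should hold off restarting the next node while any range is still mid-catchup behind flow tokens or has a paused follower.

```go
// Hypothetical sketch of a "safe to restart the next node?" signal.
package main

import "fmt"

// rangeCatchupState is the per-range view a store might expose.
type rangeCatchupState struct {
	rangeID           int64
	catchingUpViaFlow bool // catchup MsgApps still gated behind flow tokens
	hasPausedFollower bool // follower pausing is active for this range
}

// nodeReadyForNextRestart reports whether restarting another node looks safe
// from this node's perspective: no range is effectively under-replicated
// because it's mid-catchup or has a paused follower.
func nodeReadyForNextRestart(ranges []rangeCatchupState) (bool, int) {
	notReady := 0
	for _, r := range ranges {
		if r.catchingUpViaFlow || r.hasPausedFollower {
			notReady++
		}
	}
	return notReady == 0, notReady
}

func main() {
	ranges := []rangeCatchupState{
		{rangeID: 1},
		{rangeID: 2, catchingUpViaFlow: true},
	}
	ok, n := nodeReadyForNextRestart(ranges)
	fmt.Printf("ready=%t, ranges still catching up or paused: %d\n", ok, n)
}
```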

@irfansharif irfansharif added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label May 17, 2023
@shralex shralex removed the T-kv KV Team label Oct 24, 2023
@exalate-issue-sync exalate-issue-sync bot added the T-kv KV Team label Oct 24, 2023
@exalate-issue-sync exalate-issue-sync bot added T-admission-control Admission Control and removed T-kv KV Team labels Oct 26, 2023
@nicktrav nicktrav added the P-3 Issues/test failures with no fix SLA label Jan 17, 2024
@dshjoshi

dshjoshi commented Oct 1, 2024

@nicktrav @sumeerbhola Is this taken care of by RAC v2? If so, can we close this out?

@nicktrav
Collaborator

nicktrav commented Oct 3, 2024

Let's keep this open until that work completes and we can re-evaluate.
