
kvserver: DistSender should detect lease expiration and redirect requests #105168

Closed · erikgrinaker opened this issue Jun 20, 2023 · 5 comments · Fixed by #118943

Labels: C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. · C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) · T-kv KV Team

Comments

@erikgrinaker
Contributor

erikgrinaker commented Jun 20, 2023

The DistSender makes an educated guess about who the current leaseholder is, sends a request to it, and then simply waits for the response -- typically either a successful response or a NotLeaseHolderError telling it who the leaseholder is. However, in some cases the request never returns, and the DistSender waits indefinitely. Typical causes are:

  • Disk stalls.
  • Replica stalls.
  • Raft reproposals, e.g. due to network partition.

With expiration-based leases, the remote replica will eventually lose its lease in these cases, but the DistSender cache will keep pointing to the stalled replica. Requests also remain stuck, which can stall the entire workload if it has a bounded number of workers/connections that all end up blocked.

The DistSender should detect when the remote replica loses its lease, and try to discover a new leaseholder elsewhere, redirecting the requests there when possible. Write requests can't easily be retried though, so we should only cancel them once we confirm that there is in fact a new leaseholder elsewhere.
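
As a rough illustration only (hypothetical names, not the actual DistSender code), a minimal sketch of the intended shape: send with a cancellable context, and cancel only once a new leaseholder has been confirmed elsewhere.

    package distsketch

    import "context"

    // sendWithLeaseWatch is a hypothetical sketch of the desired behavior: the
    // request is sent with a cancellable context, and is only cancelled once a
    // new leaseholder has been confirmed elsewhere (signalled on leaseMoved),
    // so writes are not abandoned while their outcome is still unknown.
    func sendWithLeaseWatch(ctx context.Context, leaseMoved <-chan struct{},
        send func(context.Context) error) error {
        sendCtx, cancel := context.WithCancel(ctx)
        defer cancel()

        done := make(chan error, 1)
        go func() { done <- send(sendCtx) }()

        select {
        case err := <-done:
            return err
        case <-leaseMoved:
            // A new leaseholder is known, so the caller can safely redirect.
            // Note that this still waits for the RPC to observe cancellation.
            cancel()
            return <-done
        }
    }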

This would improve resolution of both partial partitions (#103769) and disk/replica stalls (#104262).

Jira issue: CRDB-28907

Epic CRDB-25200

erikgrinaker added the C-bug, C-enhancement, and T-kv labels Jun 20, 2023
exalate-issue-sync bot added the T-kv-replication label and removed the T-kv label Jun 22, 2023
@blathers-crl

blathers-crl bot commented Jun 22, 2023

cc @cockroachdb/replication

@erikgrinaker
Contributor Author

erikgrinaker commented Jan 22, 2024

Some quick KV/Native/Scan/rows=1 benchmarks to measure the base cost of the tracking. Cases:

  1. base: no changes.

  2. cancel: add a cancellable context.

    sendCtx, cancel := context.WithCancel(ctx)
    br, err = transport.SendNext(sendCtx, ba)
    cancel()
  3. cancel+map: track the cancel function in a mutex-guarded map.

    sendCtx, cancel := context.WithCancel(ctx)
    
    ds.mu.Lock()
    ds.mu.cancelFns[ba] = cancel
    ds.mu.Unlock()
    
    br, err = transport.SendNext(sendCtx, ba)
    cancel()
    
    ds.mu.Lock()
    delete(ds.mu.cancelFns, ba)
    ds.mu.Unlock()
  4. cancel+timer: set up a timer to trip after lease expiration.

    sendCtx, cancel := context.WithCancel(ctx)
    timer := time.AfterFunc(4*time.Second, func() { // should use lease expiration
    	cancel()
    })
    
    br, err = transport.SendNext(sendCtx, ba)
    timer.Stop()
    cancel()

Results:

name          old time/op  new time/op  delta
cancel        40.5µs ± 2%  40.9µs ± 2%   ~      (p=0.113 n=10+9)
cancel+map    40.5µs ± 2%  40.9µs ± 3%   ~      (p=0.280 n=10+10)
cancel+timer  40.5µs ± 2%  41.5µs ± 1%  +2.57%  (p=0.000 n=10+10)

Tracking the cancel functions in a map appears to be the cheapest option, although it will cause contention at higher concurrencies, since all DistSender outbound requests to a replica go through this mutex.

Using a timer that only trips after the lease expires would avoid this contention, but has a significant 2.57% baseline cost. There might be some way to optimize this cost down by reusing a timer somehow, but that would likely hit the same contention issues.

We could also shard the mutex and map to alleviate the contention.
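
A rough sketch of what a sharded cancel-function registry could look like (hypothetical types and names, assuming each request can be hashed to a shard by some per-request ID):

    package distsketch

    import (
        "context"
        "sync"
    )

    const numShards = 16

    // cancelShard guards one shard of the cancel functions for in-flight requests.
    type cancelShard struct {
        mu        sync.Mutex
        cancelFns map[uint64]context.CancelFunc
    }

    // cancelRegistry shards the cancel-function map to reduce mutex contention.
    type cancelRegistry struct {
        shards [numShards]cancelShard
    }

    func newCancelRegistry() *cancelRegistry {
        r := &cancelRegistry{}
        for i := range r.shards {
            r.shards[i].cancelFns = map[uint64]context.CancelFunc{}
        }
        return r
    }

    // register records the cancel function under a request ID and returns a
    // function that removes it again once the request completes.
    func (r *cancelRegistry) register(reqID uint64, cancel context.CancelFunc) (unregister func()) {
        s := &r.shards[reqID%numShards]
        s.mu.Lock()
        s.cancelFns[reqID] = cancel
        s.mu.Unlock()
        return func() {
            s.mu.Lock()
            delete(s.cancelFns, reqID)
            s.mu.Unlock()
        }
    }

    // cancelAll cancels every tracked request, e.g. once the replica's lease is
    // known to have moved elsewhere.
    func (r *cancelRegistry) cancelAll() {
        for i := range r.shards {
            s := &r.shards[i]
            s.mu.Lock()
            for _, cancel := range s.cancelFns {
                cancel()
            }
            s.mu.Unlock()
        }
    }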

In any case, it seems possible to make the base cost here negligible, so I'll prototype this further.

@erikgrinaker
Contributor Author

Some takeaways from yesterday's meeting, in response to the prototype in #118755:

  • The problem can be decomposed into two parts:

    1. Detecting a new lease elsewhere, such that subsequent requests don't get stuck.
    2. Cancelling in-flight requests.
  • Cancelling in-flight requests can be challenging.

    • Local requests can get stuck on a syscall, and won't respond to context cancellation.
    • Follower reads can get stuck on a non-leaseholder replica.
    • Requests can get stuck at other places in the stack. Consider e.g. a gateway mutex deadlock or a stuck network syscall.
  • A generalized solution to cancelling stuck requests probably needs to involve client timeouts.

  • If we assume client timeouts, which cover a broader set of scenarios, then this reduces to solving problem 1: making sure the retry following the client timeout doesn't get stuck all over again. This might involve:

    • Tracking failures by replica, including client timeouts (but possibly not client disconnects).
    • Deprioritizing problematic replicas (no successful requests) and attempting others first (see the sketch after this list).
    • Probing replicas either for a new lease or for recovery from problems (possibly becoming a circuit breaker).
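
A rough sketch of the failure-tracking and deprioritization idea (hypothetical types and names, not the prototype code): record per-replica outcomes and order candidate replicas so that ones without recent successes are attempted last.

    package distsketch

    import (
        "sort"
        "sync"
        "time"
    )

    // replicaHealth tracks recent request outcomes for a single replica.
    type replicaHealth struct {
        lastSuccess time.Time
        failures    int
    }

    // healthTracker records per-replica outcomes and orders candidate replicas
    // so that replicas with recent failures are attempted last.
    type healthTracker struct {
        mu       sync.Mutex
        replicas map[int]*replicaHealth // keyed by a hypothetical replica ID
    }

    func newHealthTracker() *healthTracker {
        return &healthTracker{replicas: map[int]*replicaHealth{}}
    }

    func (t *healthTracker) recordSuccess(id int) {
        t.mu.Lock()
        defer t.mu.Unlock()
        h := t.get(id)
        h.lastSuccess = time.Now()
        h.failures = 0
    }

    func (t *healthTracker) recordFailure(id int) {
        t.mu.Lock()
        defer t.mu.Unlock()
        t.get(id).failures++
    }

    // get returns the health record for a replica, creating it if needed.
    // The caller must hold t.mu.
    func (t *healthTracker) get(id int) *replicaHealth {
        h, ok := t.replicas[id]
        if !ok {
            h = &replicaHealth{}
            t.replicas[id] = h
        }
        return h
    }

    // order sorts candidate replica IDs so that replicas with fewer recent
    // failures (and more recent successes) come first.
    func (t *healthTracker) order(ids []int) {
        t.mu.Lock()
        defer t.mu.Unlock()
        sort.SliceStable(ids, func(i, j int) bool {
            hi, hj := t.get(ids[i]), t.get(ids[j])
            if hi.failures != hj.failures {
                return hi.failures < hj.failures
            }
            return hi.lastSuccess.After(hj.lastSuccess)
        })
    }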

@sumeerbhola
Collaborator

(drive-by)

> A generalized solution to cancelling stuck requests probably needs to involve client timeouts.

Client timeouts make me very very uncomfortable. Stating the obvious, but I think it is worth saying: there are normal reasons for queueing, including AC (which can queue for many seconds), and timeouts can cause throughput to plummet in these cases by re-queueing (causing RPC and AC overhead). Also, if some real work was already done, it is wasted and needs to be repeated.

For cases where the server can detect the misbehavior, say a syscall stuck for 1s, it should time out and return an error. For the client=>server network being unavailable, simple heartbeating over the TCP connection should suffice -- aren't we getting this with gRPC?
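
As a minimal illustration of such a server-side watchdog (hypothetical helper, not an existing API; since a stuck syscall cannot be interrupted from Go, the blocked goroutine is abandoned and an error is returned instead):

    package distsketch

    import (
        "errors"
        "time"
    )

    var errOperationStuck = errors.New("operation exceeded deadline; treating it as stuck")

    // withWatchdog runs op and returns an error if it hasn't completed within
    // timeout. The goroutine running op is abandoned rather than interrupted,
    // since a stuck syscall cannot be cancelled; the server can then return an
    // error to the client instead of blocking forever.
    func withWatchdog(timeout time.Duration, op func() error) error {
        done := make(chan error, 1)
        go func() { done <- op() }()
        select {
        case err := <-done:
            return err
        case <-time.After(timeout):
            return errOperationStuck
        }
    }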

@erikgrinaker
Contributor Author

> (drive-by)
>
> A generalized solution to cancelling stuck requests probably needs to involve client timeouts.
>
> Client timeouts make me very very uncomfortable. Stating the obvious, but I think it is worth saying: there are normal reasons for queueing, including AC (which can queue for many seconds), and timeouts can cause throughput to plummet in these cases by re-queueing (causing RPC and AC overhead). Also, if some real work was already done, it is wasted and needs to be repeated.

Agreed. This isn't to say that a solution here will require client timeouts, rather that we can't guarantee in 24.1 that requests will never get stuck without client timeouts.

The latest prototype in #118943 will send RPC probes before tripping a circuit breaker, regardless of client timeouts.

> For cases where the server can detect the misbehavior, say a syscall stuck for 1s, it should time out and return an error. For the client=>server network being unavailable, simple heartbeating over the TCP connection should suffice -- aren't we getting this with gRPC?

We currently can't, though. Most IO calls do not currently respond to context cancellation, so the goroutine simply blocks forever. This doesn't only happen with disk IO: we have also had escalations where individual network socket writes blocked indefinitely -- the RPC heartbeats were ineffective, because the socket close call also blocked indefinitely.

In any case, we should not rely on the server's active cooperation to detect itself as faulty; it should passively be considered failed when it fails to take action.

craig bot closed this as completed in bf013ea Mar 4, 2024
craig bot pushed a commit that referenced this issue Mar 12, 2024
119865: kvcoord: don't probe idle replicas with tripped circuit breakers r=erikgrinaker a=erikgrinaker

Previously, DistSender circuit breakers would continually probe replicas with tripped circuit breakers. This could result in a large number of concurrent probe goroutines.

To reduce the number of probes, this patch instead only probes replicas that have seen traffic in the past few probe intervals. Otherwise, the probe exits (leaving the breaker tripped), and a new probe will be launched on the next request to the replica. This will increase latency for the first request(s) following recovery, as they must wait for the probe to succeed and for the DistSender to retry the replica. In the common case (e.g. with a disk stall), the lease will shortly have moved to a different replica and the DistSender will stop sending requests to it.

Replicas with tripped circuit breakers are also made eligible for garbage collection, but with a higher threshold of 1 hour to avoid overeager GC.
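
As a rough sketch of the probe-loop shape described above (hypothetical names, not the actual kvcoord code): the loop exits if the replica has been idle for more than a few probe intervals, leaving the breaker tripped until the next request arrives.

    package distsketch

    import (
        "sync/atomic"
        "time"
    )

    // breakerProbe sketches the "only probe recently active replicas" idea.
    type breakerProbe struct {
        lastRequest atomic.Int64 // unix nanos of the last request to this replica
    }

    // noteRequest is called whenever a request is sent to the replica.
    func (p *breakerProbe) noteRequest() {
        p.lastRequest.Store(time.Now().UnixNano())
    }

    // probeLoop runs probeOnce every interval while the breaker is tripped, but
    // gives up if the replica has been idle for more than idleThreshold.
    func (p *breakerProbe) probeLoop(interval, idleThreshold time.Duration,
        probeOnce func() (recovered bool)) {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for range ticker.C {
            idle := time.Since(time.Unix(0, p.lastRequest.Load()))
            if idle > idleThreshold {
                // Replica is idle: exit and leave the breaker tripped. A new
                // probe will be launched by the next request to the replica.
                return
            }
            if probeOnce() {
                return // the replica recovered; the breaker can be untripped
            }
        }
    }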

Resolves #119917.
Touches #104262.
Touches #105168.
Epic: none
Release note: None

120168: sql: reserve tenant 2 r=dt a=dt

We might want this later, and it is easier to reserve it now than it will be later if we let 24.1 UA use it.

Release note: none.
Epic: none.

120197: db-console/debug: clarify cpu profile link r=dt a=dt

The all-caps MEMORY OVERHEAD in the title for this one profile looks out of place compared to the rest of the page and is somewhat confusing: is it profiling the memory overhead, as implied by putting it in the title, which for all other links says what the tool does? Or does the tool itself have overhead? While in practice it is the latter, serving any request has _some_ overhead, so this is hardly worth the odd treatment.

Release note: none.
Epic: none.

120263: restore: record provenance of restored tenant for pcr r=dt a=dt

Release note (enterprise change): a restored backup of a virtual cluster, produced via BACKUP VIRTUAL CLUSTER and RESTORE VIRTUAL CLUSTER, can now be used to start PCR replication without an initial scan.
Epic: none.

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: David Taylor <[email protected]>
github-project-automation bot moved this to Closed in KV Aug 28, 2024