
kvserver: DistSender should detect lease expiration and redirect requests #105168

Closed · erikgrinaker opened this issue Jun 20, 2023 · 5 comments · Fixed by #118943

Labels: C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. · C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) · T-kv KV Team

Comments

@erikgrinaker
Contributor

erikgrinaker commented Jun 20, 2023

The DistSender makes an educated guess about who the current leaseholder is, sends a request to it, and then simply waits for the response -- typically either a successful response or a NotLeaseHolderError telling it who the leaseholder is. However, in some cases the request never returns, and the DistSender waits indefinitely. Typical causes are:

  • Disk stalls.
  • Replica stalls.
  • Raft reproposals, e.g. due to network partition.

With expiration-based leases, the remote replica will eventually lose its lease in these cases, but the DistSender cache will keep pointing to the stalled replica. Requests also remain stuck, which can stall the entire workload if it has a bounded number of workers/connections that all end up blocked.

The DistSender should detect when the remote replica loses its lease, and try to discover a new leaseholder elsewhere, redirecting the requests there when possible. Write requests can't easily be retried though, so we should only cancel them once we confirm that there is in fact a new leaseholder elsewhere.
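
As a rough illustration only (hypothetical names, not the actual DistSender code), a minimal sketch of the intended shape: send with a cancellable context, and cancel only once a new leaseholder has been confirmed elsewhere.

    package distsketch

    import "context"

    // sendWithLeaseWatch is a hypothetical sketch of the desired behavior: the
    // request is sent with a cancellable context, and is only cancelled once a
    // new leaseholder has been confirmed elsewhere (signalled on leaseMoved),
    // so writes are not abandoned while their outcome is still unknown.
    func sendWithLeaseWatch(ctx context.Context, leaseMoved <-chan struct{},
        send func(context.Context) error) error {
        sendCtx, cancel := context.WithCancel(ctx)
        defer cancel()

        done := make(chan error, 1)
        go func() { done <- send(sendCtx) }()

        select {
        case err := <-done:
            return err
        case <-leaseMoved:
            // A new leaseholder is known, so the caller can safely redirect.
            // Note that this still waits for the RPC to observe cancellation.
            cancel()
            return <-done
        }
    }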

This would improve resolution of both partial partitions (#103769) and disk/replica stalls (#104262).

Jira issue: CRDB-28907

Epic CRDB-25200

erikgrinaker added the C-bug, C-enhancement, and T-kv labels Jun 20, 2023
exalate-issue-sync bot added the T-kv-replication label and removed the T-kv label Jun 22, 2023
@blathers-crl

blathers-crl bot commented Jun 22, 2023

cc @cockroachdb/replication

@erikgrinaker
Contributor Author

erikgrinaker commented Jan 22, 2024

Some quick KV/Native/Scan/rows=1 benchmarks to measure the base cost of the tracking. Cases:

  1. base: no changes.

  2. cancel: add a cancellable context.

    sendCtx, cancel := context.WithCancel(ctx)
    br, err = transport.SendNext(sendCtx, ba)
    cancel()
  3. cancel+map: track the cancel function in a mutex-guarded map.

    sendCtx, cancel := context.WithCancel(ctx)
    
    ds.mu.Lock()
    ds.mu.cancelFns[ba] = cancel
    ds.mu.Unlock()
    
    br, err = transport.SendNext(sendCtx, ba)
    cancel()
    
    ds.mu.Lock()
    delete(ds.mu.cancelFns, ba)
    ds.mu.Unlock()
  4. cancel+timer: set up a timer to trip after lease expiration.

    sendCtx, cancel := context.WithCancel(ctx)
    timer := time.AfterFunc(4*time.Second, func() { // should use lease expiration
    	cancel()
    })
    
    br, err = transport.SendNext(sendCtx, ba)
    timer.Stop()
    cancel()

Results:

name          old time/op  new time/op  delta
cancel        40.5µs ± 2%  40.9µs ± 2%   ~      (p=0.113 n=10+9)
cancel+map    40.5µs ± 2%  40.9µs ± 3%   ~      (p=0.280 n=10+10)
cancel+timer  40.5µs ± 2%  41.5µs ± 1%  +2.57%  (p=0.000 n=10+10)

Tracking the cancel functions in a map appears to be the cheapest option, although it will cause contention at higher concurrencies, since all DistSender outbound requests to a replica go through this mutex.

Using a timer that only trips after the lease expires would avoid this contention, but has a significant 2.57% baseline cost. There might be some way to optimize this cost down by reusing a timer somehow, but that would likely hit the same contention issues.

We could also shard the mutex and map to alleviate the contention.
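
A rough sketch of what a sharded cancel-function registry could look like (hypothetical types and names, assuming each request can be hashed to a shard by some per-request ID):

    package distsketch

    import (
        "context"
        "sync"
    )

    const numShards = 16

    // cancelShard guards one shard of the cancel functions for in-flight requests.
    type cancelShard struct {
        mu        sync.Mutex
        cancelFns map[uint64]context.CancelFunc
    }

    // cancelRegistry shards the cancel-function map to reduce mutex contention.
    type cancelRegistry struct {
        shards [numShards]cancelShard
    }

    func newCancelRegistry() *cancelRegistry {
        r := &cancelRegistry{}
        for i := range r.shards {
            r.shards[i].cancelFns = map[uint64]context.CancelFunc{}
        }
        return r
    }

    // register records the cancel function under a request ID and returns a
    // function that removes it again once the request completes.
    func (r *cancelRegistry) register(reqID uint64, cancel context.CancelFunc) (unregister func()) {
        s := &r.shards[reqID%numShards]
        s.mu.Lock()
        s.cancelFns[reqID] = cancel
        s.mu.Unlock()
        return func() {
            s.mu.Lock()
            delete(s.cancelFns, reqID)
            s.mu.Unlock()
        }
    }

    // cancelAll cancels every tracked request, e.g. once the replica's lease is
    // known to have moved elsewhere.
    func (r *cancelRegistry) cancelAll() {
        for i := range r.shards {
            s := &r.shards[i]
            s.mu.Lock()
            for _, cancel := range s.cancelFns {
                cancel()
            }
            s.mu.Unlock()
        }
    }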

In any case, it seems possible to make the base cost here negligible, so I'll prototype this further.

@erikgrinaker
Contributor Author

Some takeaways from yesterday's meeting, in response to the prototype in #118755:

  • The problem can be decomposed into two parts:

    1. Detecting a new lease elsewhere, such that subsequent requests don't get stuck.
    2. Cancelling in-flight requests.
  • Cancelling in-flight requests can be challenging.

    • Local requests can get stuck on a syscall, and won't respond to context cancellation.
    • Follower reads can get stuck on a non-leaseholder replica.
    • Requests can get stuck at other places in the stack. Consider e.g. a gateway mutex deadlock or a stuck network syscall.
  • A generalized solution to cancelling stuck requests probably needs to involve client timeouts.

  • If we assume client timeouts, which cover a broader set of scenarios, then this reduces to solving problem 1: making sure the retry following the client timeout doesn't get stuck all over again. This might involve:

    • Tracking failures by replica, including client timeouts (but possibly not client disconnects).
    • Deprioritizing problematic replicas (no successful requests) and attempting others first (see the sketch after this list).
    • Probing replicas either for a new lease or for recovery from problems (possibly becoming a circuit breaker).
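
A rough sketch of the failure-tracking and deprioritization idea (hypothetical types and names, not the prototype code): record per-replica outcomes and order candidate replicas so that ones without recent successes are attempted last.

    package distsketch

    import (
        "sort"
        "sync"
        "time"
    )

    // replicaHealth tracks recent request outcomes for a single replica.
    type replicaHealth struct {
        lastSuccess time.Time
        failures    int
    }

    // healthTracker records per-replica outcomes and orders candidate replicas
    // so that replicas with recent failures are attempted last.
    type healthTracker struct {
        mu       sync.Mutex
        replicas map[int]*replicaHealth // keyed by a hypothetical replica ID
    }

    func newHealthTracker() *healthTracker {
        return &healthTracker{replicas: map[int]*replicaHealth{}}
    }

    func (t *healthTracker) recordSuccess(id int) {
        t.mu.Lock()
        defer t.mu.Unlock()
        h := t.get(id)
        h.lastSuccess = time.Now()
        h.failures = 0
    }

    func (t *healthTracker) recordFailure(id int) {
        t.mu.Lock()
        defer t.mu.Unlock()
        t.get(id).failures++
    }

    // get returns the health record for a replica, creating it if needed.
    // The caller must hold t.mu.
    func (t *healthTracker) get(id int) *replicaHealth {
        h, ok := t.replicas[id]
        if !ok {
            h = &replicaHealth{}
            t.replicas[id] = h
        }
        return h
    }

    // order sorts candidate replica IDs so that replicas with fewer recent
    // failures (and more recent successes) come first.
    func (t *healthTracker) order(ids []int) {
        t.mu.Lock()
        defer t.mu.Unlock()
        sort.SliceStable(ids, func(i, j int) bool {
            hi, hj := t.get(ids[i]), t.get(ids[j])
            if hi.failures != hj.failures {
                return hi.failures < hj.failures
            }
            return hi.lastSuccess.After(hj.lastSuccess)
        })
    }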

@sumeerbhola
Collaborator

(drive-by)

> A generalized solution to cancelling stuck requests probably needs to involve client timeouts.

Client timeouts make me very very uncomfortable. Stating the obvious, but I think it is worth saying: there are normal reasons for queueing, including AC (which can queue for many seconds), and timeouts can cause throughput to plummet in these cases by re-queueing (causing RPC and AC overhead). Also, if some real work was already done, it is wasted and needs to be repeated.

For cases where the server can detect the misbehavior, say a syscall stuck for 1s, it should time out and return an error. For the client=>server network being unavailable, simple heartbeating over the TCP connection should suffice -- aren't we getting this with gRPC?
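
As a minimal illustration of such a server-side watchdog (hypothetical helper, not an existing API; since a stuck syscall cannot be interrupted from Go, the blocked goroutine is abandoned and an error is returned instead):

    package distsketch

    import (
        "errors"
        "time"
    )

    var errOperationStuck = errors.New("operation exceeded deadline; treating it as stuck")

    // withWatchdog runs op and returns an error if it hasn't completed within
    // timeout. The goroutine running op is abandoned rather than interrupted,
    // since a stuck syscall cannot be cancelled; the server can then return an
    // error to the client instead of blocking forever.
    func withWatchdog(timeout time.Duration, op func() error) error {
        done := make(chan error, 1)
        go func() { done <- op() }()
        select {
        case err := <-done:
            return err
        case <-time.After(timeout):
            return errOperationStuck
        }
    }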

@erikgrinaker
Contributor Author

> (drive-by)
>
> A generalized solution to cancelling stuck requests probably needs to involve client timeouts.
>
> Client timeouts make me very very uncomfortable. Stating the obvious, but I think it is worth saying: there are normal reasons for queueing, including AC (which can queue for many seconds), and timeouts can cause throughput to plummet in these cases by re-queueing (causing RPC and AC overhead). Also, if some real work was already done, it is wasted and needs to be repeated.

Agreed. This isn't to say that a solution here will require client timeouts, rather that we can't guarantee in 24.1 that requests will never get stuck without client timeouts.

The latest prototype in #118943 will send RPC probes before tripping a circuit breaker, regardless of client timeouts.

> For cases where the server can detect the misbehavior, say a syscall stuck for 1s, it should time out and return an error. For the client=>server network being unavailable, simple heartbeating over the TCP connection should suffice -- aren't we getting this with gRPC?

We currently can't, though. Most IO calls do not currently respond to context cancellation, so the goroutine simply blocks forever. This doesn't only happen with disk IO: we have also had escalations where individual network socket writes blocked indefinitely -- the RPC heartbeats were ineffective, because the socket close call also blocked indefinitely.

In any case, we should not rely on the server's active cooperation to detect itself as faulty; it should passively be considered failed when it fails to take action.

craig bot closed this as completed in bf013ea Mar 4, 2024
craig bot pushed a commit that referenced this issue Mar 12, 2024
119865: kvcoord: don't probe idle replicas with tripped circuit breakers r=erikgrinaker a=erikgrinaker

Previously, DistSender circuit breakers would continually probe replicas with tripped circuit breakers. This could result in a large number of concurrent probe goroutines.

To reduce the number of probes, this patch instead only probes replicas that have seen traffic in the past few probe intervals. Otherwise, the probe exits (leaving the breaker tripped), and a new probe will be launched on the next request to the replica. This will increase latency for the first request(s) following recovery, as they must wait for the probe to succeed and for the DistSender to retry the replica. In the common case (e.g. with a disk stall), the lease will shortly have moved to a different replica and the DistSender will stop sending requests to it.

Replicas with tripped circuit breakers are also made eligible for garbage collection, but with a higher threshold of 1 hour to avoid overeager GC.
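
As a rough sketch of the probe-loop shape described above (hypothetical names, not the actual kvcoord code): the loop exits if the replica has been idle for more than a few probe intervals, leaving the breaker tripped until the next request arrives.

    package distsketch

    import (
        "sync/atomic"
        "time"
    )

    // breakerProbe sketches the "only probe recently active replicas" idea.
    type breakerProbe struct {
        lastRequest atomic.Int64 // unix nanos of the last request to this replica
    }

    // noteRequest is called whenever a request is sent to the replica.
    func (p *breakerProbe) noteRequest() {
        p.lastRequest.Store(time.Now().UnixNano())
    }

    // probeLoop runs probeOnce every interval while the breaker is tripped, but
    // gives up if the replica has been idle for more than idleThreshold.
    func (p *breakerProbe) probeLoop(interval, idleThreshold time.Duration,
        probeOnce func() (recovered bool)) {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for range ticker.C {
            idle := time.Since(time.Unix(0, p.lastRequest.Load()))
            if idle > idleThreshold {
                // Replica is idle: exit and leave the breaker tripped. A new
                // probe will be launched by the next request to the replica.
                return
            }
            if probeOnce() {
                return // the replica recovered; the breaker can be untripped
            }
        }
    }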

Resolves #119917.
Touches #104262.
Touches #105168.
Epic: none
Release note: None

120168: sql: reserve tenant 2 r=dt a=dt

We might want this later, and it is easier to reserve it now than it will be later if we let 24.1 UA use it.

Release note: none.
Epic: none.

120197: db-console/debug: clarify cpu profile link r=dt a=dt

The all-caps MEMORY OVERHEAD in the title for this one profile looks out of place compared to the rest of the page and is somewhat confusing: is it profiling the memory overhead, as implied by putting it in the title, which for all other links says what the tool does? Or does the tool itself have overhead? While in practice it is the latter, serving any request has _some_ overhead, so this is hardly worth the odd treatment.

Release note: none.
Epic: none.

120263: restore: record provenance of restored tenant for pcr r=dt a=dt

Release note (enterprise change): a restored backup of a virtual cluster, produced via BACKUP VIRTUAL CLUSTER and RESTORE VIRTUAL CLUSTER, can now be used to start PCR replication without an initial scan.
Epic: none.

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: David Taylor <[email protected]>
github-project-automation bot moved this to Closed in KV Aug 28, 2024