kvserver: DistSender should detect lease expiration and redirect requests #105168
cc @cockroachdb/replication
Some quick benchmark results:
Seems like tracking the cancel functions in a map is likely the cheapest option, although this will cause contention at higher concurrencies since all DistSender outbound requests to a replica will go through this mutex. Using a timer that only trips after the lease expires would avoid this contention, but has a significant 2.57% baseline cost. There might be some way to optimize this cost down by reusing a timer somehow, but that would likely hit the same contention issues. We could also shard the mutex and map to alleviate the contention. In any case, it seems possible to make the base cost here negligible, so I'll prototype this further.
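For illustration, here is a minimal Go sketch of the sharded mutex-and-map approach described above. All names (`cancelTracker`, `ReplicaID`, etc.) are hypothetical stand-ins, not the actual DistSender code: requests register their cancel function keyed by replica, and a lease-expiration signal cancels everything in flight to that replica.

```go
package sketch

import (
	"context"
	"hash/fnv"
	"sync"
	"sync/atomic"
)

// ReplicaID stands in for however the DistSender identifies a replica.
type ReplicaID string

const numShards = 16

// shard holds cancel functions for in-flight requests to the replicas that
// hash to it. Sharding spreads mutex contention across the request path.
type shard struct {
	mu      sync.Mutex
	cancels map[ReplicaID]map[int64]context.CancelFunc
}

// cancelTracker tracks cancel functions for outbound requests, sharded by
// replica so the per-request bookkeeping stays cheap under high concurrency.
type cancelTracker struct {
	shards [numShards]shard
	nextID atomic.Int64
}

func newCancelTracker() *cancelTracker {
	t := &cancelTracker{}
	for i := range t.shards {
		t.shards[i].cancels = make(map[ReplicaID]map[int64]context.CancelFunc)
	}
	return t
}

func (t *cancelTracker) shardFor(r ReplicaID) *shard {
	h := fnv.New32a()
	h.Write([]byte(r))
	return &t.shards[h.Sum32()%numShards]
}

// Track registers a request's cancel function and returns an untrack function
// to call once the request completes.
func (t *cancelTracker) Track(r ReplicaID, cancel context.CancelFunc) (untrack func()) {
	id := t.nextID.Add(1)
	s := t.shardFor(r)
	s.mu.Lock()
	if s.cancels[r] == nil {
		s.cancels[r] = make(map[int64]context.CancelFunc)
	}
	s.cancels[r][id] = cancel
	s.mu.Unlock()

	return func() {
		s.mu.Lock()
		delete(s.cancels[r], id)
		s.mu.Unlock()
	}
}

// CancelAll cancels every in-flight request to the given replica, e.g. when
// its lease is observed to have expired.
func (t *cancelTracker) CancelAll(r ReplicaID) {
	s := t.shardFor(r)
	s.mu.Lock()
	for _, cancel := range s.cancels[r] {
		cancel()
	}
	delete(s.cancels, r)
	s.mu.Unlock()
}
```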
Some takeaways from yesterday's meeting, in response to the prototype in #118755:
(drive-by)
Client timeouts make me very very uncomfortable. Stating the obvious, but I think it is worth saying: there are normal reasons for queueing, including AC (which can queue for many seconds), and timeouts can cause throughput to plummet in these cases by re-queueing (causing RPC and AC overhead). Also, if some real work was already done, it is wasted and needs to be repeated. For cases where the server can detect the misbehavior, say a syscall stuck for 1s, it should time out and return an error. For the client=>server network being unavailable, simple heartbeating over the TCP connection should suffice -- aren't we getting this with gRPC?
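For reference, gRPC does support client-side keepalive pings over the existing connection. A minimal sketch of how that is configured in Go (the values and the `dialWithKeepalive` helper are illustrative, not CockroachDB's actual RPC settings):

```go
package sketch

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive opens a gRPC connection with client-side keepalives.
// If the server stops acking pings, the transport is torn down and pending
// RPCs on the connection fail instead of hanging forever.
func dialWithKeepalive(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // ping after 10s without activity
			Timeout:             5 * time.Second,  // fail if no ack within 5s
			PermitWithoutStream: true,             // ping even with no active RPCs
		}),
	)
}
```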
Agreed. This isn't to say that a solution here will require client timeouts, rather that we can't guarantee in 24.1 that requests will never get stuck without client timeouts. The latest prototype in #118943 will send RPC probes before tripping a circuit breaker, regardless of client timeouts.
We currently can't, though. Most IO calls do not currently respond to context cancellation, so the goroutine simply blocks forever. This doesn't only happen with disk IO; we have also had escalations where individual network socket writes blocked indefinitely -- the RPC heartbeats were ineffectual, because the socket close call also blocked indefinitely. In any case, we should not rely on the active cooperation of the server to detect itself as faulty; it should passively fail when it fails to take action.
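To illustrate why context cancellation alone doesn't help here: a blocking syscall never observes the context, so the usual workaround is to wait for it in a separate goroutine and abandon (leak) that goroutine on cancellation. A sketch of that general pattern (the `writeWithContext` helper is hypothetical, not CockroachDB code):

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"
)

// writeWithContext runs a blocking write in a separate goroutine. If ctx is
// canceled first, the caller returns, but the write goroutine is leaked: the
// underlying syscall keeps blocking because it never observes ctx.
func writeWithContext(ctx context.Context, f *os.File, data []byte) error {
	done := make(chan error, 1)
	go func() {
		_, err := f.Write(data) // may block indefinitely on a stalled disk
		done <- err
	}()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		// The write is still in flight; we only stop waiting for it.
		return ctx.Err()
	}
}

func main() {
	f, err := os.CreateTemp("", "example")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	// On a healthy disk this prints nil; on a stalled disk it would print
	// context.DeadlineExceeded after 1s while the write goroutine stays stuck.
	fmt.Println(writeWithContext(ctx, f, []byte("hello")))
}
```

Note that the abandoned goroutine, and whatever resources it holds, remains stuck; this matches the escalations above where even the socket close blocked.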
119865: kvcoord: don't probe idle replicas with tripped circuit breakers r=erikgrinaker a=erikgrinaker

Previously, DistSender circuit breakers would continually probe replicas with tripped circuit breakers. This could result in a large number of concurrent probe goroutines. To reduce the number of probes, this patch instead only probes replicas that have seen traffic in the past few probe intervals. Otherwise, the probe exits (leaving the breaker tripped), and a new probe will be launched on the next request to the replica. This will increase latency for the first request(s) following recovery, as it must wait for the probe to succeed and the DistSender to retry the replica. In the common case (e.g. with a disk stall), the lease will shortly have moved to a different replica and the DistSender will stop sending requests to it. Replicas with tripped circuit breakers are also made eligible for garbage collection, but with a higher threshold of 1 hour to avoid overeager GC.

Resolves #119917. Touches #104262. Touches #105168.
Epic: none
Release note: None

120168: sql: reserve tenant 2 r=dt a=dt

We might want this later, and it is easier to reserve now than it will be later if we let 24.1 UA use it.

Release note: none.
Epic: none.

120197: db-console/debug: clarify cpu profile link r=dt a=dt

The all-caps MEMORY OVERHEAD in the title for this one profile looks out of place compared to the rest of the page and is somewhat confusing: is it profiling the memory overhead, as implied by the title, which for all other links says what the tool does? Or does the tool itself have overhead? While in practice it is the latter, serving any request has _some_ overhead, so this is hardly worth this weird treatment.

Release note: none.
Epic: none.

120263: restore: record provenance of restored tenant for pcr r=dt a=dt

Release note (enterprise change): a restored backup of a virtual cluster, produced via BACKUP VIRTUAL CLUSTER and RESTORE VIRTUAL CLUSTER, can now be used to start PCR replication without an initial scan.

Epic: none.

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: David Taylor <[email protected]>
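A rough sketch of the idle-replica probe policy described in 119865 above; the types and names are hypothetical stand-ins under stated assumptions, not the actual DistSender circuit-breaker code:

```go
package sketch

import (
	"sync/atomic"
	"time"
)

const (
	probeInterval = time.Second
	idleIntervals = 3 // stop probing after ~3 intervals without traffic
)

// replicaBreaker is a stand-in for a per-replica circuit breaker.
type replicaBreaker struct {
	tripped      atomic.Bool
	lastRequest  atomic.Int64 // unix nanos of the last request to this replica
	probeRunning atomic.Bool
}

// OnRequest is called on every request to the replica. If the breaker is
// tripped and no probe is running, a new probe is launched lazily.
func (b *replicaBreaker) OnRequest() {
	b.lastRequest.Store(time.Now().UnixNano())
	if b.tripped.Load() && b.probeRunning.CompareAndSwap(false, true) {
		go b.probeLoop()
	}
}

// probeLoop probes the replica until it recovers or goes idle. If the replica
// stops seeing traffic, the probe exits and leaves the breaker tripped; the
// next request to the replica relaunches it.
func (b *replicaBreaker) probeLoop() {
	defer b.probeRunning.Store(false)
	for {
		if b.probe() {
			b.tripped.Store(false)
			return
		}
		idle := time.Since(time.Unix(0, b.lastRequest.Load()))
		if idle > idleIntervals*probeInterval {
			return // idle replica: stop probing, keep the breaker tripped
		}
		time.Sleep(probeInterval)
	}
}

// probe would send a cheap RPC to the replica; stubbed out here.
func (b *replicaBreaker) probe() bool { return false }
```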
The DistSender makes an educated guess about who the current leaseholder is, sends a request to it, and then simply waits for the response -- typically either a successful response or a NotLeaseHolderError instructing it who the leaseholder is. However, in some cases the request never returns, and the DistSender waits indefinitely. Typical cases are partial network partitions and disk/replica stalls.

With expiration-based leases, the remote replica will eventually lose its lease in these cases, but the DistSender cache will keep pointing to the stalled replica. Requests also remain stuck, which can cause the entire workload to stall if it has a bounded number of workers/connections that all get stuck.
The DistSender should detect when the remote replica loses its lease, and try to discover a new leaseholder elsewhere, redirecting the requests there when possible. Write requests can't easily be retried though, so we should only cancel them once we confirm that there is in fact a new leaseholder elsewhere.
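As a sketch of that policy (hypothetical names and a hypothetical `lookupLeaseholder` callback, not the actual implementation): only cancel a stuck request once a lookup confirms the lease has moved to a different replica, so that writes are never abandoned while the original replica may still hold the lease.

```go
package sketch

import (
	"context"
	"time"
)

// LeaseInfo is a stand-in for whatever the DistSender uses to identify the
// current leaseholder of a range.
type LeaseInfo struct{ Replica string }

// cancelWhenLeaseMoves watches the leaseholder for a range and cancels the
// in-flight request only once a *different* leaseholder is confirmed. This
// avoids canceling a write that could still be applied by the replica the
// request was sent to while it holds the lease.
func cancelWhenLeaseMoves(
	ctx context.Context,
	sentTo string,
	lookupLeaseholder func(context.Context) (LeaseInfo, error),
	cancelRequest context.CancelFunc,
) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // request finished or caller gave up
		case <-ticker.C:
			lease, err := lookupLeaseholder(ctx)
			if err != nil {
				continue // can't confirm anything; keep waiting
			}
			if lease.Replica != "" && lease.Replica != sentTo {
				// A new leaseholder exists elsewhere: safe to cancel the
				// stuck request and redirect it there.
				cancelRequest()
				return
			}
		}
	}
}
```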
This would improve resolution of both partial partitions (#103769) and disk/replica stalls (#104262).
Jira issue: CRDB-28907
Epic CRDB-25200