kvserver: improve observability of "slow" requests #74509
Labels
A-kv-observability
C-enhancement
T-observability
When a request arrives at a Replica, with one (?) exception we generally expect it to complete within "at most a few replication round-trip latencies". The one exception is intent handling: waiting for a contending txn to terminate can take any amount of time, since our txns are interactive and can be held open indefinitely. Perhaps also noteworthy are range merges, which can block the range as well, but I think this shouldn't be treated differently from any other failure to make progress if it takes too long.
The sort of convention that has established itself in the KV area is to print "have been waiting" messages.

At the time of writing, we have the following relevant examples (restricted to `kvserver`). Note that the last one corresponds to the exception mentioned above; it's not usually a KV problem when this occurs, but it's often enough a problem for end-users that we want to highlight it in the same way.

These all cover smaller parts of the entire stack. The lease acquisition one is mostly useless, since the "for proposing command" one sits right under it, and in the common case a lease acquisition is slow because the proposal isn't going through in time. Slow latches, similarly, are mostly caused by slow replication.
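For concreteness, here is a minimal standalone sketch of the "have been waiting" pattern, using only the standard library; the helper name `withSlowLogging` and its signature are invented for illustration and are not the actual `kvserver` utilities:

```go
package main

import (
	"context"
	"log"
	"time"
)

// withSlowLogging runs op and, while it is still in flight, periodically
// prints a "have been waiting" message once the threshold has elapsed.
// Illustrative only; the real messages are emitted from specific places
// in the lease/latch/proposal code paths.
func withSlowLogging(
	ctx context.Context, what string, threshold time.Duration, op func(context.Context) error,
) error {
	done := make(chan struct{})
	defer close(done)

	go func() {
		t := time.NewTimer(threshold)
		defer t.Stop()
		start := time.Now()
		for {
			select {
			case <-done:
				return
			case <-ctx.Done():
				return
			case <-t.C:
				log.Printf("have been waiting %.2fs for %s", time.Since(start).Seconds(), what)
				t.Reset(threshold)
			}
		}
	}()
	return op(ctx)
}

func main() {
	_ = withSlowLogging(context.Background(), "proposing command", time.Second,
		func(ctx context.Context) error {
			time.Sleep(3500 * time.Millisecond) // stand-in for a slow proposal
			return nil
		})
}
```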
It seems that there's a place for a unified model where we give each incoming request a budget (say 15s, the current `SlowReplicationThreshold`) and do something whenever the request exceeds it for any reason. The question is what the "something" is. If all requests were traced, we could print the (tail of the) recording, and this would likely be illuminating (the current "have been waiting" messages could be printed into the trace instead, for instant parity). We don't actually want to trace all requests, but we do want all requests to be traceable on demand (#73407), so maybe certain events (like the entire request taking >15s, or one of the individual messages above firing) could flip the span to verbose. We could also have a `DumpStore` for these partial slow-request traces, to shift away from looking at slow request traces inside of logs and toward getting them in a better format (i.e. Jaeger).

There is a lot we could do here, and I'm sure @andreimatei has some good ideas or perhaps already a plan.
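To make the budget idea a bit more concrete, here is a rough sketch of the control flow; `runWithBudget`, the callback, and the logging are invented for illustration (the real version would presumably flip the trace span to verbose or hand a partial recording to the dump store rather than log):

```go
package main

import (
	"context"
	"log"
	"time"
)

// onBudgetExceeded is the "something" to do when a request overruns its
// budget: in the proposal above, flip the request's span to verbose or
// capture a partial recording. Here it just logs.
type onBudgetExceeded func(elapsed time.Duration)

// runWithBudget executes req and invokes exceeded (at most once) if req
// is still running after the budget elapses. Sketch only, not actual
// kvserver plumbing.
func runWithBudget(
	ctx context.Context, budget time.Duration, exceeded onBudgetExceeded, req func(context.Context) error,
) error {
	start := time.Now()
	done := make(chan struct{})
	defer close(done)

	go func() {
		select {
		case <-done:
		case <-ctx.Done():
		case <-time.After(budget):
			exceeded(time.Since(start))
		}
	}()
	return req(ctx)
}

func main() {
	_ = runWithBudget(context.Background(), 100*time.Millisecond,
		func(elapsed time.Duration) {
			log.Printf("request exceeded its budget after %s; would flip span to verbose here", elapsed)
		},
		func(ctx context.Context) error {
			time.Sleep(250 * time.Millisecond) // stand-in for a request that blows its budget
			return nil
		})
}
```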
The reason I started thinking about this issue is my work on #33007: as of #71806, we will trip the breaker only if a proposal on the local replica hangs for >15s. This isn't a very complete condition. For example, a `replica.mu` deadlock would never cause anyone to trip the breaker. If we had the "global budget" and set it up suitably, before needing access to (and without accessing) any locks, we could get something that has no false negatives. If we then also don't outright trip the breaker on a slow request, but instead run a suitably goroutined one-off invocation of the probe and trip the breaker only when that fails, we'd be in business (sketched below).

Related: #72092
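A sketch of the "probe instead of trip" idea, with a toy `breaker` type standing in for the real per-replica circuit breaker and a placeholder probe; none of the names here correspond to the actual #33007 / #71806 code:

```go
package main

import (
	"context"
	"errors"
	"log"
	"sync"
	"time"
)

// breaker is a trivial stand-in for the per-replica circuit breaker.
type breaker struct {
	mu      sync.Mutex
	tripped bool
}

func (b *breaker) Trip(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tripped = true
	log.Printf("breaker tripped: %v", err)
}

// onSlowRequest is invoked when a request exceeds its budget. Instead of
// tripping the breaker outright, it launches a one-off probe in its own
// goroutine and trips only if the probe itself fails within its deadline.
func onSlowRequest(ctx context.Context, b *breaker, probe func(context.Context) error) {
	go func() {
		probeCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
		defer cancel()
		if err := probe(probeCtx); err != nil {
			b.Trip(err)
		}
	}()
}

func main() {
	var b breaker
	onSlowRequest(context.Background(), &b, func(ctx context.Context) error {
		// A real probe would attempt something like a no-op write through
		// replication; this placeholder fails to show the control flow.
		return errors.New("replication probe timed out")
	})
	time.Sleep(100 * time.Millisecond) // give the async probe time to run
}
```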
Jira issue: CRDB-12126
Epic CRDB-32145