kvserver: improve observability of "slow" requests #74509
Labels
A-kv-observability
C-enhancement
T-observability
When a request arrives at a Replica, with one (?) exception we generally expect it to complete within "at most a few replication round-trip latencies". The one exception is intent handling: waiting for a contending txn to terminate can take any amount of time, since our txns are interactive and can be held open indefinitely. Perhaps also noteworthy are range merges, which can block the range as well, but I think this shouldn't be treated differently from any other failure to make progress if it takes too long.
The sort of convention that has established itself in the KV area is to print "have been waiting" messages.

At the time of writing, we have the following relevant examples (restricted to `kvserver`). Note that the last one corresponds to the exception mentioned above; it's not usually a KV problem when this occurs, but it's often enough a problem for end-users that we want to highlight it in the same way.

These all cover smaller parts of the entire stack. The lease acquisition one is mostly useless, since the "for proposing command" one sits right under it, and in the common case a lease acquisition is slow because the proposal isn't going through in time. Slow latches, similarly, are mostly caused by slow replication.
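For concreteness, here is a minimal standalone sketch of the "have been waiting" pattern, using only the standard library; the helper name `withSlowLogging` and its signature are invented for illustration and are not the actual `kvserver` utilities:

```go
package main

import (
	"context"
	"log"
	"time"
)

// withSlowLogging runs op and, while it is still in flight, periodically
// prints a "have been waiting" message once the threshold has elapsed.
// Illustrative only; the real messages are emitted from specific places
// in the lease/latch/proposal code paths.
func withSlowLogging(
	ctx context.Context, what string, threshold time.Duration, op func(context.Context) error,
) error {
	done := make(chan struct{})
	defer close(done)

	go func() {
		t := time.NewTimer(threshold)
		defer t.Stop()
		start := time.Now()
		for {
			select {
			case <-done:
				return
			case <-ctx.Done():
				return
			case <-t.C:
				log.Printf("have been waiting %.2fs for %s", time.Since(start).Seconds(), what)
				t.Reset(threshold)
			}
		}
	}()
	return op(ctx)
}

func main() {
	_ = withSlowLogging(context.Background(), "proposing command", time.Second,
		func(ctx context.Context) error {
			time.Sleep(3500 * time.Millisecond) // stand-in for a slow proposal
			return nil
		})
}
```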
It seems that there's a place for a unified model where we give each incoming request a budget (say 15s, the current `SlowReplicationThreshold`) and do something whenever the request exceeds it for any reason. The question is what the "something" is. If all requests were traced, we could print the (tail of the) recording, and this would likely be illuminating (the current "have been waiting" messages could be printed into the trace instead, for instant parity). We don't actually want to trace all requests, but we do want all requests to be traceable on demand (#73407), so maybe certain events (like the entire request taking >15s, or one of the individual messages above firing) could flip the span to verbose. We could also have a `DumpStore` for these partial slow-request traces, to shift away from looking at slow request traces inside of logs and toward getting them in a better format (i.e. Jaeger).

There is a lot we could do here, and I'm sure @andreimatei has some good ideas or perhaps already a plan.
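To make the budget idea a bit more concrete, here is a rough sketch of the control flow; `runWithBudget`, the callback, and the logging are invented for illustration (the real version would presumably flip the trace span to verbose or hand a partial recording to the dump store rather than log):

```go
package main

import (
	"context"
	"log"
	"time"
)

// onBudgetExceeded is the "something" to do when a request overruns its
// budget: in the proposal above, flip the request's span to verbose or
// capture a partial recording. Here it just logs.
type onBudgetExceeded func(elapsed time.Duration)

// runWithBudget executes req and invokes exceeded (at most once) if req
// is still running after the budget elapses. Sketch only, not actual
// kvserver plumbing.
func runWithBudget(
	ctx context.Context, budget time.Duration, exceeded onBudgetExceeded, req func(context.Context) error,
) error {
	start := time.Now()
	done := make(chan struct{})
	defer close(done)

	go func() {
		select {
		case <-done:
		case <-ctx.Done():
		case <-time.After(budget):
			exceeded(time.Since(start))
		}
	}()
	return req(ctx)
}

func main() {
	_ = runWithBudget(context.Background(), 100*time.Millisecond,
		func(elapsed time.Duration) {
			log.Printf("request exceeded its budget after %s; would flip span to verbose here", elapsed)
		},
		func(ctx context.Context) error {
			time.Sleep(250 * time.Millisecond) // stand-in for a request that blows its budget
			return nil
		})
}
```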
The reason I started thinking about this issue is my work on #33007: as of #71806, we will trip the breaker only if a proposal on the local replica hangs for >15s. This isn't a very complete condition. For example, a `replica.mu` deadlock would never cause anyone to trip the breaker. If we had the "global budget" and set it up suitably, before needing access to (and without accessing) any locks, we could get something that has no false negatives. If we then also don't outright trip the breaker on a slow request, but instead run a suitably goroutined one-off invocation of the probe and trip the breaker only when that fails, we'd be in business (sketched below).

Related: #72092
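A sketch of the "probe instead of trip" idea, with a toy `breaker` type standing in for the real per-replica circuit breaker and a placeholder probe; none of the names here correspond to the actual #33007 / #71806 code:

```go
package main

import (
	"context"
	"errors"
	"log"
	"sync"
	"time"
)

// breaker is a trivial stand-in for the per-replica circuit breaker.
type breaker struct {
	mu      sync.Mutex
	tripped bool
}

func (b *breaker) Trip(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tripped = true
	log.Printf("breaker tripped: %v", err)
}

// onSlowRequest is invoked when a request exceeds its budget. Instead of
// tripping the breaker outright, it launches a one-off probe in its own
// goroutine and trips only if the probe itself fails within its deadline.
func onSlowRequest(ctx context.Context, b *breaker, probe func(context.Context) error) {
	go func() {
		probeCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
		defer cancel()
		if err := probe(probeCtx); err != nil {
			b.Trip(err)
		}
	}()
}

func main() {
	var b breaker
	onSlowRequest(context.Background(), &b, func(ctx context.Context) error {
		// A real probe would attempt something like a no-op write through
		// replication; this placeholder fails to show the control flow.
		return errors.New("replication probe timed out")
	})
	time.Sleep(100 * time.Millisecond) // give the async probe time to run
}
```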
Jira issue: CRDB-12126
Epic CRDB-32145