
kvserver: improve observability of "slow" requests #74509

Open
tbg opened this issue Jan 6, 2022 · 0 comments
Labels: A-kv-observability, C-enhancement, T-observability

tbg (Member) commented Jan 6, 2022

When a request arrives at a Replica, with one (?) exception we generally expect it to complete within "at most a few replication round-trip latencies". The one exception is intent handling: waiting for a contending txn to terminate can take arbitrarily long, since our txns are interactive and can be held open indefinitely. Range merges are perhaps also noteworthy, since they can block the range as well, but I think a merge that takes too long shouldn't be treated differently from any other failure to make progress.

The convention that has established itself in the KV area is to print "have been waiting" messages when an operation blocks for longer than expected. At the time of writing, we have the following relevant examples (restricted to kvserver). Note that the last one corresponds to the exception mentioned above; it's not usually a KV problem when it occurs, but it is often enough a problem for end users that we want to highlight it in the same way.

pkg/kv/kvserver/replica_range_lease.go: log.Warningf(ctx, "have been waiting %s attempting to acquire lease",
pkg/kv/kvserver/replica_write.go: s.Printf(`have been waiting %.2fs for proposing command %s.
pkg/kv/kvserver/spanlatch/manager.go: log.Warningf(ctx, "have been waiting %s to acquire %s latch %s, held by %s latch %s",
pkg/kv/kvserver/txnwait/queue.go: log.Warningf(ctx, "pusher %s: have been waiting %.2fs for pushee %s",

Each of these covers only a small part of the stack. The lease acquisition message is mostly useless, since the "for proposing command" message sits right under it, and in the common case a lease acquisition is slow because the proposal isn't going through in time. Slow latches, similarly, are mostly caused by slow replication.
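For reference, these call sites all share roughly the same shape. Here's a minimal sketch of that pattern using a hypothetical `slowWaitWarning` helper (the real call sites are inlined and differ in their details and thresholds):

```go
package kvserver

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/cockroachdb/cockroach/pkg/util/timeutil"
)

// slowWaitWarning blocks on `done` and, once `threshold` has elapsed,
// periodically logs a "have been waiting" warning. Hypothetical helper,
// shown only to illustrate the shared pattern.
func slowWaitWarning(
	ctx context.Context, threshold time.Duration, what string, done <-chan struct{},
) {
	tBegin := timeutil.Now()
	t := time.NewTimer(threshold)
	defer t.Stop()
	for {
		select {
		case <-done:
			return
		case <-ctx.Done():
			return
		case <-t.C:
			log.Warningf(ctx, "have been waiting %s for %s", timeutil.Since(tBegin), what)
			t.Reset(threshold)
		}
	}
}
```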

It seems that there's a place for a unified model, where we give each incoming request a budget (say 15s, the current SlowReplicationThreshold) and do something whenever the request exceeds it, for whatever reason. The question is what the "something" is. If all requests were traced, we could print the (tail of the) recording, and that would likely be illuminating (the current "have been waiting" messages could be printed into the trace instead, for instant parity). We don't actually want to trace all requests, but we do want all requests to be traceable on demand (#73407), so maybe certain events (like the entire request taking >15s, or one of the individual messages above firing) flip the span to verbose. We could also have a DumpStore for these partial slow-request traces, to shift away from looking at slow-request traces in the logs and toward getting something in a better format (e.g. Jaeger).
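To make the budget idea a bit more concrete, here's a rough stand-alone sketch. All names (`withRequestBudget`, `onSlow`) are hypothetical and this is not a proposal for the actual plumbing:

```go
package kvserver

import (
	"context"
	"time"
)

// withRequestBudget runs op and, if it hasn't returned within budget,
// invokes onSlow exactly once with the elapsed time. onSlow is where we'd
// flip the request's trace span to verbose, print the tail of the recording,
// or hand the partial recording to a DumpStore.
func withRequestBudget(
	ctx context.Context,
	budget time.Duration,
	onSlow func(elapsed time.Duration),
	op func(context.Context) error,
) error {
	start := time.Now()
	done := make(chan struct{})
	defer close(done)
	go func() {
		t := time.NewTimer(budget)
		defer t.Stop()
		select {
		case <-t.C:
			onSlow(time.Since(start))
		case <-done:
			// Request finished within budget; nothing to do.
		}
	}()
	return op(ctx)
}
```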

There is a lot we could do here and I'm sure @andreimatei has some good ideas or perhaps already a plan.

The reason I started thinking about this is my work on #33007: as of #71806, we will trip the breaker only if a proposal on the local replica hangs for >15s. That condition is far from complete; for example, a replica.mu deadlock would never cause anyone to trip the breaker. If we had the "global budget" and set it up suitably early, before needing (and without acquiring) any locks, we could get something with no false negatives. If we then also didn't trip the breaker outright on a slow request, but instead ran a one-off invocation of the probe in its own goroutine and tripped the breaker only when that fails, we'd be in business.
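A sketch of that last step, again with illustrative names only (the `breaker` interface below is a stand-in, not the actual circuit breaker API from #33007/#71806):

```go
package kvserver

import (
	"context"
	"time"
)

// breaker is a minimal stand-in for the replica circuit breaker.
type breaker interface {
	Trip(err error)
}

// onSlowRequest is invoked when a request exceeds its budget. Rather than
// tripping the breaker directly, it launches a one-off probe in its own
// goroutine and trips the breaker only if the probe itself fails.
func onSlowRequest(ctx context.Context, br breaker, probe func(context.Context) error) {
	go func() {
		// Bound the probe so that a wedged replica (e.g. a replica.mu
		// deadlock) can't hang this goroutine indefinitely.
		probeCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
		defer cancel()
		if err := probe(probeCtx); err != nil {
			br.Trip(err)
		}
	}()
}
```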

Related: #72092

Jira issue: CRDB-12126

Epic: CRDB-32145
