kvserver: quota pool observability #79756

Open
12 tasks
tbg opened this issue Apr 11, 2022 · 3 comments
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@tbg
Member

tbg commented Apr 11, 2022

Is your feature request related to a problem? Please describe.

We know of several instances where the behavior of the quota pool was poorly understood.
The quota pool exports no metrics, nor is there a way to pull information from it proactively.

Common questions are:

  • is the quota pool throttling, and if so, how severely?
  • is a given quota pool ignoring followers, and which ones?

Describe the solution you'd like

Metrics:

  • Slow acquisitions, i.e. a Gauge that tracks all ongoing acquisitions that have triggered logSlowRaftProposalQuotaAcquisition.
  • Quota pool wait times, i.e. a latency histogram into which each Acquire() call records how long it waited. This provides a measure of how pervasively the quota pool throttles across the replicas.
  • Quota pool allocations, i.e. a bytes histogram into which each allocation (by any replicaID) is recorded. The rate of the sum effectively provides a throughput for the range, and also has the breakdown of raft proposal sizes.
  • the number of times this code ignores a follower for the purposes of proposal quota enforcement, with one counter metric (or prometheus label) for each reason:
    • follower inactive (lastUpdateTimes check)
    • no healthy conn (ConnHealth check)
    • below base index

Observability via kvserverpb.RangeInfo:

  • ApproximateQuota() (already there)
  • Capacity
  • quota release queue length and base index
    // The base index is the index up to (including) which quota was already
    // released. That is, the first element in quotaReleaseQueue below is
    // released as the base index moves up by one, etc.
    proposalQuotaBaseIndex uint64
    // Once the leader observes a proposal come 'out of Raft', we add the size
    // of the associated command to a queue of quotas we have yet to release
    // back to the quota pool. At that point ownership of the quota is
    // transferred from r.mu.proposals to this queue.
    // We'll release the respective quota once all replicas have persisted the
    // corresponding entry into their logs (or once we give up waiting on some
    // replica because it looks like it's dead).
    quotaReleaseQueue []*quotapool.IntAlloc
  • the current result of this logic as applied to each follower, i.e. a slice [replicaID, ignoreReason] mirroring the requested metrics above.
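
As a rough illustration of the fields listed above, the quota pool state could be exposed in a shape like the following. This is only a sketch: `QuotaPoolInfo`, `FollowerQuotaStatus`, and their field names are hypothetical and are not part of the existing kvserverpb.RangeInfo definition.

```go
package main

// FollowerQuotaStatus is a hypothetical per-follower entry: the replica and
// the reason (if any) it is currently ignored for quota enforcement.
type FollowerQuotaStatus struct {
	ReplicaID    int32
	IgnoreReason string // empty when the follower is taken into account
}

// QuotaPoolInfo is a hypothetical container for the proposed observability
// fields; it is not the real RangeInfo message.
type QuotaPoolInfo struct {
	ApproximateQuota   int64  // currently available quota
	Capacity           int64  // total capacity of the pool
	BaseIndex          uint64 // index up to which quota was already released
	ReleaseQueueLength int    // number of entries in the quota release queue
	Followers          []FollowerQuotaStatus
}

// IgnoredFollowers returns only the followers that are currently excluded
// from quota enforcement.
func (q QuotaPoolInfo) IgnoredFollowers() []FollowerQuotaStatus {
	var out []FollowerQuotaStatus
	for _, f := range q.Followers {
		if f.IgnoreReason != "" {
			out = append(out, f)
		}
	}
	return out
}
```

Using the same reason names here as for the metrics would let an operator correlate a counter spike with the specific followers involved.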

Problem ranges:

  • highlight slow proposal quota acquisitions (>15s), as these are definitely a problem. (Empty quota is not a problem: it is the expected steady state when a follower is slightly slower than the rest.)

Logging:

  • over each (say) 10s interval, save up to (say) ten raft statuses, keyed by distinct rangeID+replicaID in a store-wide map, for followers that the quota pool did not consider for release. At the end of each interval, print a message if there is anything to print (in the common case there is not), for example:

quota pool not enforced for:
r100/5: follower_inactive base=150 status=[insert raft status]
r100/7: base_index base=160 status=[...]
r200/2: no_healthy_conn [...]
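
The periodic logging described above could be sketched roughly as follows. All names (`IgnoreLog`, `Record`, `Flush`) are hypothetical, the raft status is carried as an opaque string, and the caller is assumed to invoke `Flush` about every 10 seconds and log the result when it is non-empty.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// replicaKey identifies one replica of one range.
type replicaKey struct {
	RangeID   int64
	ReplicaID int32
}

// ignoredEntry holds what is printed for one ignored follower.
type ignoredEntry struct {
	Reason string
	Base   uint64
	Status string // opaque rendering of the raft status
}

// IgnoreLog is a hypothetical store-wide collector that keeps at most max
// distinct replicas per interval.
type IgnoreLog struct {
	mu      sync.Mutex
	max     int
	entries map[replicaKey]ignoredEntry
}

// NewIgnoreLog returns a collector that keeps up to max entries per interval.
func NewIgnoreLog(max int) *IgnoreLog {
	return &IgnoreLog{max: max, entries: make(map[replicaKey]ignoredEntry)}
}

// Record saves the latest state for a replica. New replicas are dropped once
// the per-interval cap is reached; known replicas are overwritten.
func (l *IgnoreLog) Record(rangeID int64, replicaID int32, reason string, base uint64, status string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	k := replicaKey{RangeID: rangeID, ReplicaID: replicaID}
	if _, ok := l.entries[k]; !ok && len(l.entries) >= l.max {
		return
	}
	l.entries[k] = ignoredEntry{Reason: reason, Base: base, Status: status}
}

// Flush renders the collected entries, sorted by range and replica, and
// resets the collector. It returns "" when there is nothing to print.
func (l *IgnoreLog) Flush() string {
	l.mu.Lock()
	defer l.mu.Unlock()
	if len(l.entries) == 0 {
		return ""
	}
	keys := make([]replicaKey, 0, len(l.entries))
	for k := range l.entries {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].RangeID != keys[j].RangeID {
			return keys[i].RangeID < keys[j].RangeID
		}
		return keys[i].ReplicaID < keys[j].ReplicaID
	})
	var sb strings.Builder
	sb.WriteString("quota pool not enforced for:\n")
	for _, k := range keys {
		e := l.entries[k]
		fmt.Fprintf(&sb, "r%d/%d: %s base=%d status=%s\n",
			k.RangeID, k.ReplicaID, e.Reason, e.Base, e.Status)
	}
	l.entries = make(map[replicaKey]ignoredEntry)
	return sb.String()
}
```

Keying by rangeID+replicaID means a follower that is ignored many times within one interval still produces a single line, which keeps the message short even when a store has many affected ranges.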

Describe alternatives you've considered

Additional context

The above suggestion was written up quickly, and shouldn't be considered final or flawless.

Jira issue: CRDB-15931

@blathers-crl

This comment was marked as outdated.

@blathers-crl blathers-crl bot added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Apr 11, 2022
@jlinder jlinder added sync-me and removed sync-me labels May 20, 2022
@tbg
Member Author

tbg commented Jun 30, 2022

Related: #83263

@maryliag maryliag added A-kv Anything in KV that doesn't belong in a more specific category. T-kv KV Team and removed T-sql-observability labels Apr 24, 2023
@tbg
Member Author

tbg commented Jul 3, 2023

This issue is obsolete if we disable/remove the quota pool, x-ref #106063
