balance: Instrument metrics in pool balancer #2558

Merged 1 commit on Dec 13, 2023

@olix0r (Member) commented Dec 13, 2023

This change instruments the PoolQueue and P2CPool with metrics that track endpoint updates and queue runtime state, including:

  • The count of queue gate state changes (as driven by failfast).
  • The timestamp of the last gate state change (so that it's possible to determine how long a balancer has been in failfast).
  • The number of requests that enter the queue.
  • The distribution of in-queue latencies.
  • The current length of the queue.
  • The number of endpoint updates by update type.
  • The current number of endpoints.
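
To make the list above concrete, here is a minimal sketch of a metrics struct covering each item. It uses only the standard library and plain fields; every name here (`BalancerMetrics`, `Gate`, `Update`, `BOUNDS`) is hypothetical and is not taken from the proxy's code, which exports its metrics through a Prometheus registry rather than a struct like this.

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical failfast gate states.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Gate { Open, Shut }

/// Hypothetical endpoint update types.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum Update { Reset, Add, Remove }

/// Upper bounds (in seconds) of the latency buckets; one extra overflow
/// bucket catches anything larger. The values are illustrative only.
pub const BOUNDS: [f64; 3] = [0.01, 0.1, 1.0];

#[derive(Debug)]
pub struct BalancerMetrics {
    pub gate: Gate,
    pub gate_changes: u64,                      // count of gate state changes
    pub last_gate_change_secs: Option<f64>,     // Unix timestamp of last change
    pub requests: u64,                          // requests that entered the queue
    pub queue_len: u64,                         // current queue length
    pub latency_buckets: [u64; 4],              // per-bucket (non-cumulative) counts
    pub endpoint_updates: HashMap<Update, u64>, // updates by update type
    pub endpoints: usize,                       // current number of endpoints
}

impl BalancerMetrics {
    pub fn new() -> Self {
        Self {
            gate: Gate::Open,
            gate_changes: 0,
            last_gate_change_secs: None,
            requests: 0,
            queue_len: 0,
            latency_buckets: [0; 4],
            endpoint_updates: HashMap::new(),
            endpoints: 0,
        }
    }

    /// Records a gate transition; repeated reports of the same state are ignored.
    pub fn set_gate(&mut self, gate: Gate, now: SystemTime) {
        if self.gate == gate {
            return;
        }
        self.gate = gate;
        self.gate_changes += 1;
        self.last_gate_change_secs =
            now.duration_since(UNIX_EPOCH).ok().map(|d| d.as_secs_f64());
    }

    /// A request enters the queue.
    pub fn enqueue(&mut self) {
        self.requests += 1;
        self.queue_len += 1;
    }

    /// A request leaves the queue after waiting for `waited`.
    pub fn dequeue(&mut self, waited: Duration) {
        self.queue_len = self.queue_len.saturating_sub(1);
        let secs = waited.as_secs_f64();
        let idx = BOUNDS.iter().position(|b| secs <= *b).unwrap_or(BOUNDS.len());
        self.latency_buckets[idx] += 1;
    }

    /// An endpoint update of the given type leaves `endpoints` in the pool.
    pub fn endpoint_update(&mut self, kind: Update, endpoints: usize) {
        *self.endpoint_updates.entry(kind).or_insert(0) += 1;
        self.endpoints = endpoints;
    }
}
```

With the gate-change timestamp exported, the time spent in failfast can be derived at query time as the current time minus that timestamp while the gate is shut.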

@olix0r olix0r requested a review from a team as a code owner December 13, 2023 03:31
codecov bot commented Dec 13, 2023

Codecov Report

Merging #2558 (d6932ca) into main (b4cdafb) will decrease coverage by 0.50%.
The diff coverage is 31.92%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2558      +/-   ##
==========================================
- Coverage   67.73%   67.24%   -0.50%     
==========================================
  Files         331      333       +2     
  Lines       14838    14985     +147     
==========================================
+ Hits        10050    10076      +26     
- Misses       4788     4909     +121     
| Files | Coverage Δ |
|---|---|
| linkerd/proxy/balance/src/lib.rs | 92.00% <ø> (ø) |
| linkerd/proxy/pool/src/service.rs | 80.64% <100.00%> (+2.86%) ⬆️ |
| linkerd/proxy/pool/src/worker.rs | 86.66% <83.33%> (+0.57%) ⬆️ |
| linkerd/proxy/pool/src/lib.rs | 21.73% <21.73%> (ø) |
| linkerd/proxy/balance/src/pool/p2c.rs | 79.45% <54.54%> (-11.62%) ⬇️ |
| linkerd/proxy/balance/src/pool.rs | 0.00% <0.00%> (ø) |
| linkerd/proxy/pool/src/failfast.rs | 46.37% <19.56%> (-53.63%) ⬇️ |

... and 4 files with indirect coverage changes


Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@olix0r olix0r merged commit a349575 into main Dec 13, 2023
22 of 23 checks passed
@olix0r olix0r deleted the ver/prom-balq branch December 13, 2023 03:53
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Dec 13, 2023
This change culminates recent work to restructure the balancer to use a
PoolQueue so that balancer changes may occur independently of request
processing. This replaces independent discovery buffering so that the
balancer task is responsible for polling discovery streams without
independent buffering. Requests are buffered and processed as soon as
the pool has available backends. Fail-fast circuit breaking is enforced
on the balancer's queue so that requests can't get stuck in a queue
indefinitely.
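
As a rough illustration of the queueing behavior described above, here is a simplified, synchronous model: requests are buffered, dispatched as soon as the pool reports an available backend, and failed once the pool has been unavailable for longer than a failfast timeout. The names (`Queue`, `Outcome`, `tick`) and the tick-driven design are assumptions made for this sketch; the proxy's actual implementation is asynchronous and differs in detail.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// What happened to a request (identified here by a plain `u32`).
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Dispatched(u32),
    FailedFast(u32),
}

/// Hypothetical model of a request queue guarded by a failfast timeout.
pub struct Queue {
    pending: VecDeque<u32>,
    unavailable_for: Duration,
    failfast_after: Duration,
    in_failfast: bool,
}

impl Queue {
    pub fn new(failfast_after: Duration) -> Self {
        Self {
            pending: VecDeque::new(),
            unavailable_for: Duration::ZERO,
            failfast_after,
            in_failfast: false,
        }
    }

    /// Accepts a request. While in failfast the request is rejected
    /// immediately instead of being buffered.
    pub fn push(&mut self, id: u32) -> Option<Outcome> {
        if self.in_failfast {
            return Some(Outcome::FailedFast(id));
        }
        self.pending.push_back(id);
        None
    }

    /// Advances the model by `elapsed`. `ready` says whether the pool
    /// currently has an available backend.
    pub fn tick(&mut self, elapsed: Duration, ready: bool) -> Vec<Outcome> {
        if ready {
            // A backend is available: leave failfast and drain the buffer.
            self.unavailable_for = Duration::ZERO;
            self.in_failfast = false;
            return self.pending.drain(..).map(Outcome::Dispatched).collect();
        }
        self.unavailable_for += elapsed;
        if self.unavailable_for >= self.failfast_after {
            // Unavailable for too long: fail everything that is waiting.
            self.in_failfast = true;
            return self.pending.drain(..).map(Outcome::FailedFast).collect();
        }
        Vec::new()
    }
}
```

In this model a request never waits longer than the failfast timeout: it is either dispatched when a backend becomes available or failed when the timeout elapses.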

In general, the new balancer is instrumented directly with metrics, and
the relevant metric name prefix and labelset are provided by the stack.
The metrics include detailed queue metrics, such as request (in-queue)
latency histograms, as well as failfast states, discovery update counts,
and balancer endpoint pool sizes.
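
One way to picture a stack-provided name prefix and labelset is the sketch below, which renders a single sample line in the Prometheus text format. `MetricsParams`, its fields, and the example metric and label names are all hypothetical, and label values are assumed to need no escaping; the proxy itself registers its metrics through a Prometheus registry rather than formatting strings by hand.

```rust
/// Hypothetical parameters that a stack could hand to a balancer so that
/// its metrics are named and labeled consistently.
pub struct MetricsParams {
    pub prefix: String,
    pub labels: Vec<(String, String)>,
}

impl MetricsParams {
    /// Renders one sample as `<prefix>_<name>{k="v",...} <value>`.
    /// Label values are assumed to contain no quotes or backslashes.
    pub fn render(&self, name: &str, value: u64) -> String {
        let labels = self
            .labels
            .iter()
            .map(|(k, v)| format!("{}=\"{}\"", k, v))
            .collect::<Vec<_>>()
            .join(",");
        format!("{}_{}{{{}}} {}", self.prefix, name, labels, value)
    }
}
```

For example, with the prefix `balancer` and the labels `backend="web"` and `port="8080"`, `render("queue_length", 3)` yields `balancer_queue_length{backend="web",port="8080"} 3`.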

---

* outbound: Move queues into the concrete stack (linkerd/linkerd2-proxy#2539)
* metrics: Remove unused features (linkerd/linkerd2-proxy#2542)
* Add the PoolQueue middleware (linkerd/linkerd2-proxy#2540)
* ci: Fixup codecov config (linkerd/linkerd2-proxy#2545)
* ci: Cancel prior runs (linkerd/linkerd2-proxy#2546)
* ci: Skip ARM builds during non-release CI (linkerd/linkerd2-proxy#2547)
* deps: Update tokio, tonic, and prost (linkerd/linkerd2-proxy#2544)
* build(deps): bump tj-actions/changed-files from 40.2.0 to 40.2.1 (linkerd/linkerd2-proxy#2549)
* metrics: Use prometheus-client for proxy_build_info (linkerd/linkerd2-proxy#2551)
* balance: Add a p2c Pool implementation (linkerd/linkerd2-proxy#2541)
* metrics: Export process metrics using prometheus-client (linkerd/linkerd2-proxy#2552)
* linkerd_identity: split `linkerd_identity::Id` into DNS and URI variants (linkerd/linkerd2-proxy#2538)
* outbound: Move HTTP balancer into its own module (linkerd/linkerd2-proxy#2554)
* app: Setup prom registry for use in balancers (linkerd/linkerd2-proxy#2555)
* vscode: Move workspace settings to devcontainer (linkerd/linkerd2-proxy#2557)
* build(deps): bump tj-actions/changed-files from 40.2.1 to 40.2.2 (linkerd/linkerd2-proxy#2556)
* balance: Instrument metrics in pool balancer (linkerd/linkerd2-proxy#2558)
* Enable PoolQueue balancer (linkerd/linkerd2-proxy#2559)

Signed-off-by: Oliver Gould <[email protected]>
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Dec 14, 2023