
feat: add metrics to proxy #1017

Merged: 39 commits merged into next from santiagopittella-add-metrics-to-proxy on Jan 3, 2025

Conversation

@SantiagoPittella (Collaborator) commented Dec 12, 2024

closes #1004

@SantiagoPittella force-pushed the santiagopittella-add-metrics-to-proxy branch 2 times, most recently from f20401d to 4a4e1a9 on December 19, 2024 at 19:26
@SantiagoPittella marked this pull request as ready for review on December 19, 2024 at 20:20
feat: add request latency and count metrics

feat: add grafana dashboard

feat: add metrics endpoint to proxy config

chore: update changelog

docs: update README

chore: add prometheus.yml
@SantiagoPittella force-pushed the santiagopittella-add-metrics-to-proxy branch from 7d655d5 to 06be9fb on December 19, 2024 at 20:22
@Mirko-von-Leipzig (Contributor) left a comment

LGTM in terms of implementation.

It also showcases what I dislike about doing metrics inline: the metrics code overshadows the actual logic, which is a bummer (and not this PR's fault).

I'd be interested in trying out an approach where we move all the metrics code into a tracing layer where we inspect the tracing events and update metrics based on those. This would centralize our metrics code into one spot and we only have to instrument the code with tracing. The downside is having to couple something in the traces with the metrics.

There is also the OpenTelemetry metrics option instead of Prometheus.
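A rough sketch (not part of this PR) of what that tracing-layer approach could look like: the layer watches tracing events and updates Prometheus metrics, so the proxy code only emits events. The crate choices, the `request_count` metric, and the `proxy::request` target are assumptions for illustration.

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter, IntCounter};
use tracing::{Event, Subscriber};
use tracing_subscriber::layer::{Context, Layer};

// Hypothetical counter, registered in the default Prometheus registry.
static REQUEST_COUNT: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!("request_count", "Total requests seen by the proxy").unwrap()
});

struct MetricsLayer;

impl<S: Subscriber> Layer<S> for MetricsLayer {
    fn on_event(&self, event: &Event<'_>, _ctx: Context<'_, S>) {
        // All metrics logic lives here; the instrumented code only emits events,
        // e.g. `tracing::info!(target: "proxy::request", "request received")`.
        if event.metadata().target() == "proxy::request" {
            REQUEST_COUNT.inc();
        }
    }
}
```

The layer would then be stacked onto the existing subscriber (e.g. `tracing_subscriber::registry().with(MetricsLayer)`), which is where the coupling between trace events and metrics mentioned above shows up.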

Review threads (resolved): bin/tx-prover/src/commands/mod.rs, bin/tx-prover/src/proxy/metrics.rs
@bobbinth (Contributor) left a comment

Looks good! Thank you! Not a full review yet, but I left some comments inline.

Review threads (resolved): bin/tx-prover/README.md, bin/tx-prover/proxy_grafana_dashboard.json, bin/tx-prover/prometheus.yml, bin/tx-prover/src/commands/proxy.rs, bin/tx-prover/src/commands/mod.rs, bin/tx-prover/src/proxy/metrics.rs
@bobbinth (Contributor) left a comment

Thank you! Looks good. I left some more comments inline. A couple of general comments:

  • I think the prover service code is getting a bit convoluted and difficult to follow. We should think about how to better organize it (e.g., move LoadBalancerState into a separate module and limit its functionality to state management). We can do this in a follow-up PR though.
  • What does the current dashboard look like? I looked at the example here, but I think it is missing quite a few things.

Regarding the second point, ideally we'd have something like this:

  1. A graph showing total number of requests, rejected requests (e.g., because the queue is full or rate limit was triggered) over time. These could also be in req/min units.
  2. A graph showing total number of workers and the number of busy workers over time.
  3. A graph showing request latency and queue latency over time.
  4. A graph showing queue size over time.

The above graphs would require combining multiple metrics in a single graph - I'm not sure how easy this is (or if it is doable at all).

Review threads (resolved): bin/tx-prover/src/proxy/metrics.rs, bin/tx-prover/src/proxy/mod.rs
@@ -100,7 +114,9 @@ impl LoadBalancerState {
let mut available_workers = self.workers.write().await;
if let Some(w) = available_workers.iter_mut().find(|w| *w == &worker) {
w.set_availability(true);
WORKER_UTILIZATION.dec();
A Contributor commented on the diff above:

Is it possible that a busy worker gets removed from the list before we get here? If so, we'd never decrement the number of utilized workers, and this metric may become corrupted.

An alternative could be to have a single function which computes the number of busy workers and updates the metric accordingly.
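A minimal sketch of that alternative, assuming a helper on `LoadBalancerState` and a `Worker::is_available` accessor (both hypothetical; `WORKER_UTILIZATION` and the `workers` lock appear in the diff above):

```rust
impl LoadBalancerState {
    /// Recomputes the busy-worker gauge from the current worker list instead of
    /// incrementing/decrementing it at every call site, so a worker that is
    /// removed while busy cannot leave the metric permanently off by one.
    async fn update_worker_utilization(&self) {
        let workers = self.workers.read().await;
        let busy = workers.iter().filter(|w| !w.is_available()).count();
        WORKER_UTILIZATION.set(busy as i64);
    }
}
```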

@SantiagoPittella (Author) replied:

Yes, it might get outdated in that case. I'm thinking of another way to keep this metric correct without having to set it in a lot of places.

@SantiagoPittella (Author) replied:

I've been thinking about this and realized that every time this method is called, pop_available_worker was called before it, so WORKER_UTILIZATION was incremented. Even if the worker that was processing the request was deleted (because it became unhealthy, or manually through the CLI with miden-tx-prover remove-worker), we should always decrement WORKER_UTILIZATION in this method (add_available_worker).

I'm moving it below the braces.
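Roughly, based on the diff above, the decrement moves outside the `if let` so it runs even when the worker is no longer in the list:

```rust
let mut available_workers = self.workers.write().await;
if let Some(w) = available_workers.iter_mut().find(|w| *w == &worker) {
    w.set_availability(true);
}
// Always decrement: pop_available_worker incremented the gauge earlier, even if
// the worker has since been removed (unhealthy, or removed through the CLI).
WORKER_UTILIZATION.dec();
```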

Review threads (resolved): bin/tx-prover/src/proxy/mod.rs, bin/tx-prover/src/proxy/metrics.rs, bin/tx-prover/README.md, bin/tx-prover/prometheus.yml
@tomyrd (Collaborator) left a comment

Looks good overall! Left a couple of small comments. Couldn't test it locally yet, will approve after that.

@@ -61,6 +61,8 @@ max_retries_per_request = 1
max_req_per_sec = 5
# Interval to check the health of the workers
health_check_interval_secs = 1
# Port of the metrics server
prometheus_port = 6192
A Collaborator commented:

Not from this PR, but we are missing available_workers_polling_time_ms in this example config.

@SantiagoPittella (Author) replied:

I added it in this PR, it was simple enough.
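For context, a metrics endpoint like the one configured above usually just serves the default Prometheus registry rendered with the text encoder. A minimal, framework-agnostic sketch (the function name is hypothetical, and the HTTP wiring is whatever the proxy already uses):

```rust
use prometheus::{Encoder, TextEncoder};

// Render every metric registered in the default registry in the Prometheus
// text exposition format; this is the body a GET on the metrics port returns.
fn render_metrics() -> Vec<u8> {
    let metric_families = prometheus::gather();
    let mut buf = Vec::new();
    TextEncoder::new()
        .encode(&metric_families, &mut buf)
        .expect("encoding gathered metrics should not fail");
    buf
}
```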

Review thread (resolved): bin/tx-prover/src/proxy/metrics.rs
@SantiagoPittella (Author) commented:

@bobbinth regarding your questions:

  1. I would maybe try to group charts according to the type of data they show. For example, the top row could be about requests, the middle row could be about workers, the bottom row could be about the queue. Or maybe it would be better to arrange them in columns.

I found a way to group graphs and created the three categories. In the picture below, the "Workers" section is collapsed.

image

  2. I'm wondering if we should introduce a metric called "accepted requests". This would be requests that made it into the queue. This way, the request chart could show total requests, accepted requests, and failed requests (i.e., the requests that were accepted but then failed to be processed).

I'm not quite sure what this metric represents. Is it the same as request_count - rate_limited_requests? If so, we can create the graph using a query on the existing metrics.

  3. It may make sense to combine Queue drop rate and rate-limited requests charts into one. This chart could show 2 series: (1) number of rate-limited requests, and (2) number of dropped requests.

I don't see how those metrics are related; I thought of those as two different graphs in different sections, but it looks like I'm missing something. I can change the graph, though, if you think that will add some value. Let me know.

  4. Not sure how useful Requests per worker chart will be if we have a dozen or more workers. Is there a way to visualize this as a table (rather than a chart)?

Yes, I agree that it is not the best way to display it. I will try some alternatives.

@bobbinth (Contributor) commented Jan 2, 2025

I'm not quite sure what this metric represents. Is it the same as request_count - rate_limited_requests? If so, we can create the graph using a query on the existing metrics.

Accepted requests would be request_count - rate_limited_requests - queue_drop_count - basically, all requests that made it into the queue.

If possible, I'd put this metric into the same chart as "Requests". So, we'd have 3 series there: (1) total requests (blue), (2) accepted requests (green), and (3) failed requests (red).

I don't see how those metrics are related; I thought of those as two different graphs in different sections, but it looks like I'm missing something. I can change the graph, though, if you think that will add some value. Let me know.

See comment above - but basically: there are 2 reasons why a request can be rejected: (1) the user exceeded their rate limit, and (2) the queue is full.

In addition to the changes described above, I'd probably replace the current "Rate-Limited Requests" graph with a "Rejected requests" graph which shows both rate-limited requests and requests dropped due to the queue being full (as 2 separate series).

@SantiagoPittella (Author) commented Jan 2, 2025

Replacing the graphs with what you just mentioned, we are left with something like this (note that I also changed the requests-per-worker chart to a heat map):

Requests:
image

Workers and queue:
image

I sent them separately because the dashboard no longer fits on my screen at a decent font size.

@bobbinth (Contributor) commented Jan 2, 2025

@SantiagoPittella - looks good! A few suggestions:

  1. I would probably move "Rejected Requests" under "Total Requests handled". This way the "Requests" and "Success Rate" charts would be aligned (and ideally they should be on the same time scale).
  2. I think we can probably get rid of the "Queue drop rate" chart, as the "Rejected Requests" chart has the same info.
  3. The heatmap looks a bit weird - but if we can't do a simple table, I'm fine with keeping it this way.
  4. In the legend of the "Requests" chart, could "Accepted requests" come before "Failed requests"?

@igamigo (Collaborator) commented Jan 2, 2025

One thought for querying metrics is that maybe we want to group all responses under a single metric with a "status code" label. Then queries become more versatile (status_code >= 400, etc.). But maybe this is overkill for now because the set of different responses is small enough that they can be handled separately (and maybe this was discussed already).
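A minimal sketch of that idea (metric and function names are hypothetical): a single counter vector with a `status_code` label, which queries can then slice however they like.

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

static RESPONSE_COUNT: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "response_count",
        "Responses returned by the proxy, labeled by HTTP status code",
        &["status_code"]
    )
    .unwrap()
});

// Called wherever the proxy finishes a response; a dashboard can then select,
// say, all client/server errors with a label matcher like status_code=~"4..|5..".
fn record_response(status: u16) {
    let code = status.to_string();
    RESPONSE_COUNT.with_label_values(&[code.as_str()]).inc();
}
```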

@SantiagoPittella (Author) commented Jan 2, 2025

  1. I would probably move "Rejected Requests" under "Total Requests handled". This way the "Requests" and "Success Rate" charts would be aligned (and ideally they should be on the same time scale).

Ok!

  2. I think we can probably get rid of the "Queue drop rate" chart, as the "Rejected Requests" chart has the same info.

Sure.

  3. The heatmap looks a bit weird - but if we can't do a simple table, I'm fine with keeping it this way.

This is the graph using a table; it shows every worker as an option in the dropdown menu. It is a good way to display the data when using many workers, though it is not as good for visualizing how the load balancer is distributing jobs to each worker.
image

  4. In the legend of the "Requests" chart, could "Accepted requests" come before "Failed requests"?

Yes, I'm changing it.

cc @bobbinth

@SantiagoPittella (Author) commented Jan 2, 2025

One thought for querying metrics is that maybe we want to group all responses under a single metric with a "status code" label. Then queries become more versatile (status_code >= 400, etc.). But maybe this is overkill for now because the set of different responses is small enough that they can be handled separately (and maybe this was discussed already).

Sounds nice; some of the current metrics (RATE_LIMITED_REQUESTS, REQUEST_FAILURE_COUNT) might be replaced with the new approach. I agree, though, that it might be a bit more complex than what we need right now, for the same reasons you mentioned: we don't have that many cases.

cc @igamigo

@SantiagoPittella (Author) commented:

Addressing @bobbinth's comments, this is the current state of the dashboard:

Screenshot 2025-01-02 at 20 24 30
Screenshot 2025-01-02 at 20 25 04

@bobbinth (Contributor) left a comment

Looks good! Thank you! I left just one small comment inline.

Regarding the dashboard layout, I would make the following changes:

  • Let's swap "Requests" and "Total requests handled" charts so that "Requests" is to the left of "Total requests handled".
  • Let's change "Total requests handled" chart to show only rejected requests (we have the total requests metric in the "Requests" chart already). This would involve:
    • Remove "Total requests" series from the chart.
    • Rename the chart to "Rejected requests".
    • Rename "Rate limited" series to "Rate limited requests".
    • Rename "Dropped by full queue" to "Queue overflow requests"
    • Could we also change the vertical axis to be requests/minute rather than just requests? (This would make it similar to the "Requests" chart.)
  • In the Workers section, could we move the "Unhealthy workers" series into its own chart? I would put this chart between the "Workers" chart and the "Requests per worker" table.

Review thread (resolved): bin/tx-prover/tx_prover_service_grafana_dashboard.json
@bobbinth (Contributor) commented Jan 3, 2025

@SantiagoPittella - could you upload the latest dashboard views?

@SantiagoPittella (Author) commented Jan 3, 2025

@SantiagoPittella - could you upload the latest dashboard views?

Sure @bobbinth

Screenshot 2025-01-03 at 13 12 44

Screenshot 2025-01-03 at 13 13 02

@bobbinth (Contributor) commented Jan 3, 2025

All looks good! I'm not sure if @igamigo or @tomyrd want to take another look - but I'm good with merging this.

One separate question about the actual data in the graphs (if this is just sample data, please ignore): it seems like we get a pretty high request latency (i.e., between 8 and 10 seconds) and most of this is because of queue latency. I'm assuming this means that requests are waiting in the queue because there are too many requests compared to workers. But then looking at workers, we have 3 workers but never utilize more than 2? Or is this because of sampling granularity?

@SantiagoPittella (Author) commented:

One separate question about the actual data in the graphs (if this is just sample data, please ignore): it seems like we get a pretty high request latency (i.e., between 8 and 10 seconds) and most of this is because of queue latency. I'm assuming this means that requests are waiting in the queue because there are too many requests compared to workers. But then looking at workers, we have 3 workers but never utilize more than 2? Or is this because of sampling granularity?

There is a problem with how we assign requests to workers; we need to tweak the mechanism a bit. In this case all 3 workers are picking up jobs, but one of them is handling a really small portion. If we look at the table:
image

We can see that it is handling only a few requests.

We need to distribute the requests more evenly, and that should solve it.
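Purely as an illustration of distributing requests more evenly (this is not the actual change tracked in #1009; all names are hypothetical), a simple round-robin cursor over the worker list could look like this:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct RoundRobin {
    next: AtomicUsize,
}

impl RoundRobin {
    fn new() -> Self {
        Self { next: AtomicUsize::new(0) }
    }

    /// Picks workers in rotation so no single worker is consistently preferred.
    fn pick<'a, T>(&self, workers: &'a [T]) -> Option<&'a T> {
        if workers.is_empty() {
            return None;
        }
        // fetch_add wraps on overflow, so the index stays in range indefinitely.
        let i = self.next.fetch_add(1, Ordering::Relaxed) % workers.len();
        workers.get(i)
    }
}
```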

@igamigo (Collaborator) left a comment

LGTM! The only comment I would carry over to future issues is the one related to labeling metrics (it seems unnecessary right now, as we discussed before), but other than that everything looks good.

@SantiagoPittella (Author) commented:

LGTM! The only comment I would carry over to future issues is the one related to labeling metrics (it seems unnecessary right now, as we discussed before), but other than that everything looks good.

@igamigo, I created this issue to track it: #1048

@SantiagoPittella (Author) commented:

We can see that it is handling only a few requests.

We need to distribute the requests more evenly, and that should solve it.

I mentioned this task to be addressed in #1009. It should be a rather small change.

@SantiagoPittella merged commit c98b1b7 into next on Jan 3, 2025. 9 checks passed.
@SantiagoPittella deleted the santiagopittella-add-metrics-to-proxy branch on January 3, 2025 at 18:28.