
feat: add metrics to proxy #1017

Merged: 39 commits merged into next from santiagopittella-add-metrics-to-proxy on Jan 3, 2025

Conversation

@SantiagoPittella (Collaborator) commented Dec 12, 2024

closes #1004

@SantiagoPittella force-pushed the santiagopittella-add-metrics-to-proxy branch 2 times, most recently from f20401d to 4a4e1a9 on December 19, 2024 at 19:26
@SantiagoPittella marked this pull request as ready for review on December 19, 2024 at 20:20
feat: add request latency and count metrics

feat: add grafana dashboard

feat: add metrics endpoint to proxy config

chore: update changelog

docs: update README

chore: add prometheus.yml
@SantiagoPittella force-pushed the santiagopittella-add-metrics-to-proxy branch from 7d655d5 to 06be9fb on December 19, 2024 at 20:22
@Mirko-von-Leipzig (Contributor) left a comment

LGTM in terms of implementation.

It also showcases what I dislike about doing metrics inline: the metrics code overshadows the actual logic, which is a bummer (and not this PR's fault).

I'd be interested in trying out an approach where we move all the metrics code into a tracing layer where we inspect the tracing events and update metrics based on those. This would centralize our metrics code into one spot and we only have to instrument the code with tracing. The downside is having to couple something in the traces with the metrics.

There is also the OpenTelemetry metrics option instead of Prometheus.
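A rough sketch (not part of this PR) of what that tracing-layer approach could look like: the layer watches tracing events and updates Prometheus metrics, so the proxy code only emits events. The crate choices, the `request_count` metric, and the `proxy::request` target are assumptions for illustration.

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter, IntCounter};
use tracing::{Event, Subscriber};
use tracing_subscriber::layer::{Context, Layer};

// Hypothetical counter, registered in the default Prometheus registry.
static REQUEST_COUNT: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!("request_count", "Total requests seen by the proxy").unwrap()
});

struct MetricsLayer;

impl<S: Subscriber> Layer<S> for MetricsLayer {
    fn on_event(&self, event: &Event<'_>, _ctx: Context<'_, S>) {
        // All metrics logic lives here; the instrumented code only emits events,
        // e.g. `tracing::info!(target: "proxy::request", "request received")`.
        if event.metadata().target() == "proxy::request" {
            REQUEST_COUNT.inc();
        }
    }
}
```

The layer would then be stacked onto the existing subscriber (e.g. `tracing_subscriber::registry().with(MetricsLayer)`), which is where the coupling between trace events and metrics mentioned above shows up.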

Review threads (resolved): bin/tx-prover/src/commands/mod.rs, bin/tx-prover/src/proxy/metrics.rs
@bobbinth (Contributor) left a comment

Looks good! Thank you! Not a full review yet, but I left some comments inline.

Review threads (resolved): bin/tx-prover/README.md, bin/tx-prover/proxy_grafana_dashboard.json, bin/tx-prover/prometheus.yml, bin/tx-prover/src/commands/proxy.rs, bin/tx-prover/src/commands/mod.rs, bin/tx-prover/src/proxy/metrics.rs
@bobbinth (Contributor) left a comment

Thank you! Looks good. I left some more comments inline. A couple of general comments:

  • I think the prover service code is getting a bit convoluted and difficult to follow. We should think about how to better organize it (e.g., move LoadBalancerState into a separate module and limit its functionality to state management). We can do this in a follow-up PR though.
  • What does the current dashboard look like? I looked at the example here, but I think it is missing quite a few things.

Regarding the second point, ideally we'd have something like this:

  1. A graph showing total number of requests, rejected requests (e.g., because the queue is full or rate limit was triggered) over time. These could also be in req/min units.
  2. A graph showing total number of workers and the number of busy workers over time.
  3. A graph showing request latency and queue latency over time.
  4. A graph showing queue size over time.

The above graphs would require combining multiple metrics in a single graph - I'm not sure how easy this is (or if it is doable at all).

Review threads (resolved): bin/tx-prover/src/proxy/metrics.rs, bin/tx-prover/src/proxy/mod.rs
@@ -100,7 +114,9 @@ impl LoadBalancerState {
let mut available_workers = self.workers.write().await;
if let Some(w) = available_workers.iter_mut().find(|w| *w == &worker) {
w.set_availability(true);
WORKER_UTILIZATION.dec();
A Contributor commented on the diff above:

Is it possible that a busy worker gets removed from the list before we get here? If so, we'd never decrement the number of utilized workers, and this metric may become corrupted.

An alternative could be to have a single function which computes the number of busy workers and updates the metric accordingly.
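A minimal sketch of that alternative, assuming a helper on `LoadBalancerState` and a `Worker::is_available` accessor (both hypothetical; `WORKER_UTILIZATION` and the `workers` lock appear in the diff above):

```rust
impl LoadBalancerState {
    /// Recomputes the busy-worker gauge from the current worker list instead of
    /// incrementing/decrementing it at every call site, so a worker that is
    /// removed while busy cannot leave the metric permanently off by one.
    async fn update_worker_utilization(&self) {
        let workers = self.workers.read().await;
        let busy = workers.iter().filter(|w| !w.is_available()).count();
        WORKER_UTILIZATION.set(busy as i64);
    }
}
```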

@SantiagoPittella (Author) replied:

Yes, it might get outdated in that case. I'm thinking of another way to keep this metric correct without having to set it in a lot of places.

@SantiagoPittella (Author) replied:

I've been thinking about this and realized that every time this method is called, pop_available_worker was called before it, so WORKER_UTILIZATION was incremented. Even if the worker that was processing the request was deleted (because it became unhealthy, or manually through the CLI with miden-tx-prover remove-worker), we should always decrement WORKER_UTILIZATION in this method (add_available_worker).

I'm moving it below the braces.
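Roughly, based on the diff above, the decrement moves outside the `if let` so it runs even when the worker is no longer in the list:

```rust
let mut available_workers = self.workers.write().await;
if let Some(w) = available_workers.iter_mut().find(|w| *w == &worker) {
    w.set_availability(true);
}
// Always decrement: pop_available_worker incremented the gauge earlier, even if
// the worker has since been removed (unhealthy, or removed through the CLI).
WORKER_UTILIZATION.dec();
```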

Review threads (resolved): bin/tx-prover/src/proxy/mod.rs, bin/tx-prover/src/proxy/metrics.rs, bin/tx-prover/README.md, bin/tx-prover/prometheus.yml
@tomyrd (Collaborator) left a comment

Looks good overall! Left a couple of small comments. Couldn't test it locally yet, will approve after that.

@@ -61,6 +61,8 @@ max_retries_per_request = 1
max_req_per_sec = 5
# Interval to check the health of the workers
health_check_interval_secs = 1
# Port of the metrics server
prometheus_port = 6192
A Collaborator commented:

Not from this PR, but we are missing available_workers_polling_time_ms in this example config.

@SantiagoPittella (Author) replied:

I added it in this PR, it was simple enough.
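For context, a metrics endpoint like the one configured above usually just serves the default Prometheus registry rendered with the text encoder. A minimal, framework-agnostic sketch (the function name is hypothetical, and the HTTP wiring is whatever the proxy already uses):

```rust
use prometheus::{Encoder, TextEncoder};

// Render every metric registered in the default registry in the Prometheus
// text exposition format; this is the body a GET on the metrics port returns.
fn render_metrics() -> Vec<u8> {
    let metric_families = prometheus::gather();
    let mut buf = Vec::new();
    TextEncoder::new()
        .encode(&metric_families, &mut buf)
        .expect("encoding gathered metrics should not fail");
    buf
}
```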

Review thread (resolved): bin/tx-prover/src/proxy/metrics.rs
@SantiagoPittella (Author) commented:

@bobbinth regarding your questions:

  1. I would maybe try to group charts according to the type of data they show. For example, the top row could be about requests, the middle row could be about workers, the bottom row could be about the queue. Or maybe it would be better to arrange them in columns.

I found a way to group graphs and created the three categories. In the picture below, the "Workers" section is collapsed.

image

  2. I'm wondering if we should introduce a metric called "accepted requests". This would be requests that made it into the queue. This way, the request chart could show total requests, accepted requests, and failed requests (i.e., the requests that were accepted but then failed to be processed).

I'm not quite sure what this metric represents. Is it the same as request_count - rate_limited_requests? If so, we can create the graph using a query on the existing metrics.

  3. It may make sense to combine Queue drop rate and rate-limited requests charts into one. This chart could show 2 series: (1) number of rate-limited requests, and (2) number of dropped requests.

I don't see how those metrics are related; I thought of those as two different graphs in different sections, but it looks like I'm missing something. I can change the graph, though, if you think that will add some value. Let me know.

  4. Not sure how useful Requests per worker chart will be if we have a dozen or more workers. Is there a way to visualize this as a table (rather than a chart)?

Yes, I agree that it is not the best way to display it. I will try some alternatives.

@bobbinth (Contributor) commented Jan 2, 2025

I'm not quite sure what this metric represents. Is it the same as request_count - rate_limited_requests? If so, we can create the graph using a query on the existing metrics.

Accepted requests would be request_count - rate_limited_requests - queue_drop_count - basically, all requests that made it into the queue.

If possible, I'd put this metric into the same chart as "Requests". So, we'd have 3 series there: (1) total requests (blue), (2) accepted requests (green), and (3) failed requests (red).

I don't see how those metrics are related; I thought of those as two different graphs in different sections, but it looks like I'm missing something. I can change the graph, though, if you think that will add some value. Let me know.

See comment above - but basically: there are 2 reasons why a request can be rejected: (1) the user exceeded their rate limit, and (2) the queue is full.

In addition to the changes described above, I'd probably replace the current "Rate-Limited Requests" graph with a "Rejected requests" graph which shows both rate-limited requests and requests dropped due to the queue being full (as 2 separate series).

@SantiagoPittella (Author) commented Jan 2, 2025

Replacing the graphs with what you just mentioned, we are left with something like this (note that I also changed the requests-per-worker chart to a heat map):

Requests:
image

Workers and queue:
image

I sent them separately because the dashboard no longer fits on my screen at a decent font size.

@bobbinth (Contributor) commented Jan 2, 2025

@SantiagoPittella - looks good! A few suggestions:

  1. I would probably move "Rejected Requests" under "Total Requests handled". This way the "Requests" and "Success Rate" charts would be aligned (and ideally they should be on the same time scale).
  2. I think we can probably get rid of the "Queue drop rate" chart, as the "Rejected Requests" chart has the same info.
  3. The heatmap looks a bit weird - but if we can't do a simple table, I'm fine with keeping it this way.
  4. In the legend of the "Requests" chart, could "Accepted requests" come before "Failed requests"?

@igamigo (Collaborator) commented Jan 2, 2025

One thought for querying metrics is that maybe we want to group all responses under a single metric with a "status code" label. Then queries become more versatile (status_code >= 400, etc.). But maybe this is overkill for now because the set of different responses is small enough that they can be handled separately (and maybe this was discussed already).
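A minimal sketch of that idea (metric and function names are hypothetical): a single counter vector with a `status_code` label, which queries can then slice however they like.

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

static RESPONSE_COUNT: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "response_count",
        "Responses returned by the proxy, labeled by HTTP status code",
        &["status_code"]
    )
    .unwrap()
});

// Called wherever the proxy finishes a response; a dashboard can then select,
// say, all client/server errors with a label matcher like status_code=~"4..|5..".
fn record_response(status: u16) {
    let code = status.to_string();
    RESPONSE_COUNT.with_label_values(&[code.as_str()]).inc();
}
```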

@SantiagoPittella (Author) commented Jan 2, 2025

  1. I would probably move "Rejected Requests" under "Total Requests handled". This way the "Requests" and "Success Rate" charts would be aligned (and ideally they should be on the same time scale).

Ok!

  2. I think we can probably get rid of the "Queue drop rate" chart, as the "Rejected Requests" chart has the same info.

Sure.

  3. The heatmap looks a bit weird - but if we can't do a simple table, I'm fine with keeping it this way.

This is the graph using a table; it shows every worker as an option in the dropdown menu. It is a good way to display the data when using many workers, though it is not as good for visualizing how the load balancer is distributing jobs to each worker.
image

  4. In the legend of the "Requests" chart, could "Accepted requests" come before "Failed requests"?

Yes, I'm changing it.

cc @bobbinth

@SantiagoPittella (Author) commented Jan 2, 2025

One thought for querying metrics is that maybe we want to group all responses under a single metric with a "status code" label. Then queries become more versatile (status_code >= 400, etc.). But maybe this is overkill for now because the set of different responses is small enough that they can be handled separately (and maybe this was discussed already).

Sounds nice; some of the current metrics (RATE_LIMITED_REQUESTS, REQUEST_FAILURE_COUNT) might be replaced with the new approach. I agree, though, that it might be a bit more complex than what we need right now, for the same reasons you mentioned: we don't have that many cases.

cc @igamigo

@SantiagoPittella (Author) commented:

Addressing @bobbinth's comments, this is the current state of the dashboard:

Screenshot 2025-01-02 at 20 24 30
Screenshot 2025-01-02 at 20 25 04

@bobbinth (Contributor) left a comment

Looks good! Thank you! I left just one small comment inline.

Regarding the dashboard layout, I would make the following changes:

  • Let's swap "Requests" and "Total requests handled" charts so that "Requests" is to the left of "Total requests handled".
  • Let's change "Total requests handled" chart to show only rejected requests (we have the total requests metric in the "Requests" chart already). This would involve:
    • Remove "Total requests" series from the chart.
    • Rename the chart to "Rejected requests".
    • Rename "Rate limited" series to "Rate limited requests".
    • Rename "Dropped by full queue" to "Queue overflow requests"
    • Could we also change the vertical axis to be requests/minute rather than just requests? (This would make it similar to the "Requests" chart.)
  • In the Workers section, could we move the "Unhealthy workers" series into its own chart? I would put this chart between the "Workers" chart and the "Requests per worker" table.

Review thread (resolved): bin/tx-prover/tx_prover_service_grafana_dashboard.json
@bobbinth (Contributor) commented Jan 3, 2025

@SantiagoPittella - could you upload the latest dashboard views?

@SantiagoPittella (Author) commented Jan 3, 2025

@SantiagoPittella - could you upload the latest dashboard views?

Sure @bobbinth

Screenshot 2025-01-03 at 13 12 44

Screenshot 2025-01-03 at 13 13 02

@bobbinth (Contributor) commented Jan 3, 2025

All looks good! I'm not sure if @igamigo or @tomyrd want to take another look - but I'm good with merging this.

One separate question about the actual data in the graphs (if this is just sample data, please ignore): it seems like we get a pretty high request latency (i.e., between 8 and 10 seconds) and most of this is because of queue latency. I'm assuming this means that requests are waiting in the queue because there are too many requests compared to workers. But then looking at workers, we have 3 workers but never utilize more than 2? Or is this because of sampling granularity?

@SantiagoPittella (Author) commented:

One separate question about the actual data in the graphs (if this is just sample data, please ignore): it seems like we get a pretty high request latency (i.e., between 8 and 10 seconds) and most of this is because of queue latency. I'm assuming this means that requests are waiting in the queue because there are too many requests compared to workers. But then looking at workers, we have 3 workers but never utilize more than 2? Or is this because of sampling granularity?

There is a problem with how we assign requests to workers; we need to tweak the mechanism a bit. In this case all 3 workers are picking up jobs, but one of them is handling a really small portion. If we look at the table:
image

We can see that it is handling only a few requests.

We need to distribute the requests more evenly, and that should solve it.
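Purely as an illustration of distributing requests more evenly (this is not the actual change tracked in #1009; all names are hypothetical), a simple round-robin cursor over the worker list could look like this:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct RoundRobin {
    next: AtomicUsize,
}

impl RoundRobin {
    fn new() -> Self {
        Self { next: AtomicUsize::new(0) }
    }

    /// Picks workers in rotation so no single worker is consistently preferred.
    fn pick<'a, T>(&self, workers: &'a [T]) -> Option<&'a T> {
        if workers.is_empty() {
            return None;
        }
        // fetch_add wraps on overflow, so the index stays in range indefinitely.
        let i = self.next.fetch_add(1, Ordering::Relaxed) % workers.len();
        workers.get(i)
    }
}
```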

@igamigo (Collaborator) left a comment

LGTM! The only comment I would carry over to future issues is the one related to labeling metrics (it seems unnecessary right now, as we discussed before), but other than that everything looks good.

@SantiagoPittella (Author) commented:

LGTM! The only comment I would carry over to future issues is the one related to labeling metrics (it seems unnecessary right now, as we discussed before), but other than that everything looks good.

@igamigo, I created this issue to track it: #1048

@SantiagoPittella (Author) commented:

We can see that it is handling only a few requests.

We need to distribute the requests more evenly, and that should solve it.

I mentioned this task to be addressed in #1009. It should be a rather small change.

@SantiagoPittella merged commit c98b1b7 into next on Jan 3, 2025. 9 checks passed.
@SantiagoPittella deleted the santiagopittella-add-metrics-to-proxy branch on January 3, 2025 at 18:28.