feat: add metrics to proxy #1017
Conversation
- feat: add request latency and count metrics
- feat: add grafana dashboard
- feat: add metrics endpoint to proxy config
- chore: update changelog
- docs: update README
- chore: add prometheus.yml
LGTM in terms of implementation.
It also showcases what I dislike about actively doing metrics inline: the metrics code overshadows the actual code logic, which is a bummer (and not this PR's fault).
I'd be interested in trying out an approach where we move all the metrics code into a tracing layer, where we inspect the tracing events and update metrics based on those. This would centralize our metrics code in one spot, and we would only have to instrument the code with tracing. The downside is having to couple something in the traces with the metrics.
There is also the open-telemetry metrics option instead of prometheus.
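To make the idea concrete, here is a rough sketch (an assumption, not code from this PR) of a `tracing_subscriber` layer that updates a Prometheus counter from tracing events; the metric name, the `proxy_request` target, and the wiring are all illustrative.

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter, IntCounter};
use tracing::{Event, Subscriber};
use tracing_subscriber::layer::{Context, Layer};

// Hypothetical counter; in the PR the metrics are defined elsewhere.
static REQUEST_COUNT: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!("proxy_requests_total", "Total requests seen by the proxy").unwrap()
});

struct MetricsLayer;

impl<S: Subscriber> Layer<S> for MetricsLayer {
    fn on_event(&self, event: &Event<'_>, _ctx: Context<'_, S>) {
        // Inspect the event and update metrics here, so the request-handling
        // code only needs `tracing` instrumentation (the target name is assumed).
        if event.metadata().target() == "proxy_request" {
            REQUEST_COUNT.inc();
        }
    }
}

// Installed next to the usual layers, e.g. (with tracing_subscriber::prelude::* in scope):
// tracing_subscriber::registry().with(MetricsLayer).init();
```

The coupling mentioned above shows up here: the layer has to agree with the instrumented code on targets and field names.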
Looks good! Thank you! Not a full review yet, but I left some comments inline.
Thank you! Looks good. I left some more comments inline. A couple of general comments:
- I think the prover service code is getting a bit convoluted and difficult to follow. We should think about how to better organize it (e.g., move `LoadBalancerState` into a separate module and limit its functionality to state management). We can do this in a follow-up PR though.
- What does the current dashboard look like? I looked at the example here, but I think it is missing quite a few things.
Regarding the second point, ideally we'd have something like this:
- A graph showing total number of requests, rejected requests (e.g., because the queue is full or rate limit was triggered) over time. These could also be in req/min units.
- A graph showing total number of workers and the number of busy workers over time.
- A graph showing request latency and queue latency over time.
- A graph showing queue size over time.
The above graphs would require combining multiple metrics in a single graph - I'm not sure how easy this is (or if it is doable at all).
bin/tx-prover/src/proxy/mod.rs
```
@@ -100,7 +114,9 @@ impl LoadBalancerState {
        let mut available_workers = self.workers.write().await;
        if let Some(w) = available_workers.iter_mut().find(|w| *w == &worker) {
            w.set_availability(true);
            WORKER_UTILIZATION.dec();
```
Is it possible that a busy worker gets removed from the list before we get here? If so, we'd never decrement the number of utilized workers, and this metric may become corrupted.
An alternative could be to have a single function which computes the number of busy workers and updates the metric accordingly.
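A rough sketch of that alternative (an assumption, not code from this PR): recompute the busy-worker count from the worker list and set the gauge in one place. It assumes `workers` is a tokio `RwLock<Vec<Worker>>` as in the hunk above, that `Worker` exposes a hypothetical `is_available()` getter, and that `WORKER_UTILIZATION` is an integer gauge.

```rust
impl LoadBalancerState {
    // Single place that derives the metric from the source of truth,
    // instead of incrementing/decrementing it at every call site.
    async fn refresh_worker_utilization(&self) {
        let workers = self.workers.read().await;
        let busy = workers.iter().filter(|w| !w.is_available()).count();
        WORKER_UTILIZATION.set(busy as i64);
    }
}
```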
Yes, it might get outdated in that case. I'm thinking of another way to keep this metric correct without having to set it in a lot of places.
I've been thinking about this and realized that every time this method is called, `pop_available_worker` was called before it, and therefore `WORKER_UTILIZATION` was incremented. Even if the worker that was processing the request was deleted, either for being unhealthy or manually through the CLI (`miden-tx-prover remove-worker`), we should always decrement `WORKER_UTILIZATION` in this method (`add_available_worker`).
I'm moving it below the braces.
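For reference, a sketch of that change against the hunk quoted earlier: the decrement moves out of the `if let` body, so it runs on every call to `add_available_worker`, even if the worker was already removed from the list.

```rust
let mut available_workers = self.workers.write().await;
if let Some(w) = available_workers.iter_mut().find(|w| *w == &worker) {
    w.set_availability(true);
}
// Decrement unconditionally: a matching pop_available_worker call
// incremented the gauge before we got here.
WORKER_UTILIZATION.dec();
```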
Looks good overall! Left a couple of small comments. Couldn't test it locally yet, will approve after that.
```
@@ -61,6 +61,8 @@ max_retries_per_request = 1
max_req_per_sec = 5
# Interval to check the health of the workers
health_check_interval_secs = 1
# Port of the metrics server
prometheus_port = 6192
```
Not from this PR, but we are missing `available_workers_polling_time_ms` in this example config.
I added it in this PR, it was simple enough.
@bobbinth regarding your questions:
I found a way to group graphs and created the three categories. In the picture below the "Workers" section is collapsed.
Not quite sure what this metric represents. Is it the same as …?
I don't see how those metrics are related; I thought of those two as different graphs in different sections, but it looks like I'm missing something. I can change the graph if you think that would add some value. Let me know.
Yes, I agree that is not the best way to display it. I will try some alternatives.
Accepted requests would be … If possible, I'd put this metric into the same chart as "Requests". So we'd have 3 series there: (1) total requests (blue), (2) accepted requests (green), and (3) failed requests (red).
See comment above, but basically: there are 2 reasons why a request can be rejected: (1) the user exceeded their rate limit, and (2) the queue is full. In addition to the changes described above, I'd probably replace the current "Rate-Limited Requests" graph with a "Rejected requests" graph which shows both rate-limited requests and requests dropped due to the queue being full (as 2 separate series).
@SantiagoPittella - looks good! A few suggestions:
One thought for querying metrics: maybe we want to group all responses under a single metric with a "status code" label. Then queries become more versatile (…).
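A minimal sketch of that idea (an assumption, not code from this PR), using the `prometheus` crate's labeled counters; the metric and label names are illustrative.

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

// One counter for all responses, labeled by status code.
static RESPONSES: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "proxy_responses_total",
        "Proxy responses grouped by status code",
        &["status_code"]
    )
    .unwrap()
});

fn record_response(status: u16) {
    let code = status.to_string();
    RESPONSES.with_label_values(&[code.as_str()]).inc();
}
```

Queries can then slice by label, e.g. `sum by (status_code) (rate(proxy_responses_total[1m]))` in PromQL.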
Ok!
Sure.
This is the graph using a table; it will show every worker as an option in the dropdown menu. It is a good way to display the data when using many workers, though it is not that good for visualizing how the load balancer is distributing jobs to each worker.
Yes, I'm changing it. cc @bobbinth
Sounds nice; some of the current metrics (…). cc @igamigo
Addressing @bobbinth, this is the current state of the dashboard:
Looks good! Thank you! I left just one small comment inline.
Regarding the dashboard layout, I would make the following changes:
- Let's swap "Requests" and "Total requests handled" charts so that "Requests" is to the left of "Total requests handled".
- Let's change the "Total requests handled" chart to show only rejected requests (we have the total requests metric in the "Requests" chart already). This would involve:
  - Remove the "Total requests" series from the chart.
  - Rename the chart to "Rejected requests".
  - Rename the "Rate limited" series to "Rate limited requests".
  - Rename "Dropped by full queue" to "Queue overflow requests".
- Could we also change the vertical axes to be requests/minute rather than just requests? (This would make it similar to the "Requests" chart.)
- In the Workers section, could we move "Unhealthy workers" into its own chart? I would put this chart between the "Workers" chart and the "Requests per worker" table.
@SantiagoPittella - could you upload the latest dashboard views?
Sure, @bobbinth.
All looks good! I'm not sure if @igamigo or @tomyrd want to take another look - but I'm good with merging this. One separate question about the actual data in the graphs (if this is just sample data, please ignore): it seems like we get a pretty high request latency (i.e., between 8 and 10 seconds) and most of this is because of queue latency. I'm assuming this means that requests are waiting in the queue because there are too many requests compared to workers. But then looking at workers, we have 3 workers but never utilize more than 2? Or is this because of sampling granularity?
LGTM! The only comment I would carry over to future issues is the one related to labeling metrics (it seems unnecessary right now, as we discussed before), but other than that everything looks good.
I mentioned this task to be addressed in #1009. It should be a rather small change.
closes #1004