Add prometheus metrics for batch calls #403
Conversation
crates/shared/src/transport/http.rs (Outdated)
@@ -147,6 +147,9 @@ impl BatchTransport for HttpTransport {
    async move {
        let _guard = metrics.on_request_start("batch");
        for call in &calls {
---
Based on the docs for scopeguard (https://docs.rs/scopeguard/latest/scopeguard/), this should not work as expected. As I understand it, the point is to increment the inflight counter before executing the request, and then to decrement inflight and increment completed after executing the request. With your code, I would expect all of this to happen inside the inner scope.

It seems to me that we actually need to change the signature of the on_request_start function to receive a slice of methods (not only one method), and then handle that change inside on_request_start.
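For illustration, a minimal sketch of the guard pattern under discussion. This is not the repo's actual implementation: the struct shape and field names are assumptions based on the metrics mentioned in this thread.

```rust
use prometheus::{IntCounterVec, IntGaugeVec};

struct TransportMetrics {
    // Assumed fields mirroring the inflight/completed metrics discussed here.
    requests_inflight: IntGaugeVec,
    requests_complete: IntCounterVec,
}

impl TransportMetrics {
    // Increment inflight eagerly; the returned guard decrements inflight and
    // increments completed when it is dropped. The guard's lifetime decides
    // when that happens, so creating it inside the `async move` block means
    // all of the bookkeeping runs inside that inner scope.
    fn on_request_start<'a>(&'a self, method: &str) -> impl Drop + 'a {
        self.requests_inflight.with_label_values(&[method]).inc();
        let method = method.to_string();
        scopeguard::guard(self, move |metrics| {
            metrics
                .requests_inflight
                .with_label_values(&[method.as_str()])
                .dec();
            metrics
                .requests_complete
                .with_label_values(&[method.as_str()])
                .inc();
        })
    }
}
```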
---
I liked that calling on_request_start() a bunch of times is easy to implement, but changing the signature to receive an iterator of labels is probably harder to misuse, and it would allow us to write the code such that only a single timer needs to be created per batch call. 🤔
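A hedged sketch of that alternative, building on the sketch above; the name on_requests_start and the slice argument are illustrative, not a settled API:

```rust
impl TransportMetrics {
    // Illustrative variant: one call registers every method of a batch, so a
    // single guard (and, by extension, a single timer) covers the whole batch.
    fn on_requests_start<'a>(&'a self, methods: &[&str]) -> impl Drop + 'a {
        let methods: Vec<String> = methods.iter().map(|m| m.to_string()).collect();
        for method in &methods {
            self.requests_inflight
                .with_label_values(&[method.as_str()])
                .inc();
        }
        scopeguard::guard(self, move |metrics| {
            for method in &methods {
                metrics
                    .requests_inflight
                    .with_label_values(&[method.as_str()])
                    .dec();
                metrics
                    .requests_complete
                    .with_label_values(&[method.as_str()])
                    .inc();
            }
        })
    }
}
```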
Codecov Report
@@            Coverage Diff             @@
##             main     #403      +/-   ##
==========================================
- Coverage   64.08%   64.03%   -0.06%
==========================================
  Files         222      222
  Lines       41848    41876      +28
==========================================
- Hits        26819    26816       -3
- Misses      15029    15060      +31
---
LGTM
---

        .await
        .unwrap();
    let metric_storage =
        TransportMetrics::instance(global_metrics::get_metric_storage_registry()).unwrap();
---
One issue I see with this test is that other tests may influence its result (since it uses a global storage registry). This could lead to flaky tests when run with other code that increments HTTP metrics.
Since this is an ignored test, this isn't super critical. Maybe we can create a local registry that is used just for this test.
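For illustration, a self-contained sketch of the local-registry idea using the prometheus crate directly; the metric name and label here are illustrative:

```rust
use prometheus::{IntCounterVec, Opts, Registry};

#[test]
fn counts_in_a_test_local_registry() {
    // A registry created just for this test; other tests bumping the global
    // HTTP metrics cannot influence assertions made against it.
    let registry = Registry::new();
    let counter = IntCounterVec::new(
        Opts::new("inner_batch_requests", "requests inside batch calls"),
        &["method"],
    )
    .unwrap();
    registry.register(Box::new(counter.clone())).unwrap();

    counter.with_label_values(&["eth_call"]).inc();
    assert_eq!(counter.with_label_values(&["eth_call"]).get(), 1);
}
```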
---
Lg
crates/shared/src/transport/http.rs (Outdated)
let _guard = metrics.on_request_start("batch");
let _guards: Vec<_> = calls
    .iter()
    .map(|call| metrics.on_request_start(method_name(call)))
    .collect();
---
This mixes up some things in the metrics.
- We record both the whole batch and the individual requests making it up, but there is no way to distinguish the two. If I make one batch request containing one inner request, we count 2 total requests even though 1 is more accurate.
- It is not useful to record timing information for the requests making up a batch: they all take the same duration as the batched request. If a request takes one second and I make 3 in a row, I get an average of 1 second. But if I make a batch request that includes these three, we get a 3 second average.

I feel it would be more useful to duplicate

/// Number of completed RPC requests for ethereum node.
#[metric(labels("method"))]
requests_complete: prometheus::IntCounterVec,

into a new metric whose only responsibility is to count individual requests inside of batch requests: inner_batch_requests_complete.
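The duplicated declaration could look something like this (the field name is the one suggested above; the doc comment wording is illustrative):

```rust
/// Number of inner requests executed as part of batch RPC requests.
#[metric(labels("method"))]
inner_batch_requests_complete: prometheus::IntCounterVec,
```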
---
Good point. I agree that we should separate out timing information for batch requests, as it will be low signal (each request in the batch will take the total time, as you suggested). This could lead to us thinking "why is eth_blockNumber taking so long?" when in fact it only does so because it is included in a batch with 1000 complicated eth_calls.
---
To be clear, I don't think we need timing for the inner requests of the batches at all. We only need to count them. So I would only duplicate the one metric I mentioned and not the histogram or in-flight.
---
How about not duplicating counters and instead just adding an argument to on_request_start() which allows discarding timing data? That seems like a low-effort solution on the Rust side as well as the Grafana side.

Edit: A separate function would mean we wouldn't have to touch existing call sites to add the new argument, though.
---
on_request_start() does a bunch of things, most of which are not useful for the inner batch calls. It feels cleaner to me to handle the inner batch requests completely separately, because all we want to do there is increment one prometheus counter. No need for the DropGuard logic either, because that is only used for timing the request.
---
> I don't think we need timing for the inner requests of the batches at all

I think we should keep timing information for the batch, but not for the inner calls. This is what I meant with my ambiguous "should separate timing information for batch requests". I think we are arguing for the same thing.
---
I would:
- Keep track of success/failure for all calls (regular and inner batch) per method.
- Keep track of timing for regular calls and for the complete batch, but not for the inner batch calls themselves.
crates/shared/src/transport/http.rs (Outdated)
calls.iter().for_each(|call| {
    metrics
        .inner_batch_requests_complete
        .with_label_values(&[method_name(call)])
        .inc()
});
---
nit: Technically the requests only complete after execute_rpc() in line 157. That shouldn't matter in the grand scheme of things, though.
---
It is unclear whether we want to measure before or after. If we measure before, it is possible that the request is dropped before the node handles it. If we measure after, it is possible that the node handled it but we dropped the response before receiving it. If the goal is to measure the number of calls for pricing, then I'm slightly leaning toward incrementing before. We could rename the metric to "initiated" instead of "complete". Anyway, it's a small detail.
---
Yes indeed, the goal is to measure calls for pricing.

> Could rename the metric to "initiated" instead of "complete"

Sounds good. Will update.
---

Fixes #389

Adds metrics for each RPC call within a batch call: individual calls within a batch call now get counted in our metrics.
Test Plan
cargo test -- inner_batch_requests_metrics_success --nocapture --ignored