
Add prometheus metrics for batch calls #403

Merged: 6 commits from rpc-call-metrics into main on Aug 4, 2022
Conversation


@ahhda ahhda commented Aug 3, 2022

Fixes #389

Adds metrics for each RPC call within a batch call, so individual calls inside a batch now get counted in our metrics.

[Screenshot: 2022-08-03 at 3 57 17 PM]

Test Plan

  • Run tests using the command cargo test -- inner_batch_requests_metrics_success --nocapture --ignored

@@ -147,6 +147,9 @@ impl BatchTransport for HttpTransport {

async move {
let _guard = metrics.on_request_start("batch");
for call in &calls {

Based on the docs for scopeguard, this should not work as expected:
https://docs.rs/scopeguard/latest/scopeguard/

So, as I understand it, the point here is to increment the in-flight counter before executing the request, and then decrement the in-flight counter and increment the completed counter after executing the request.

With your code, I would expect all of this to happen inside the inner scope.

It seems to me that we actually need to change the signature of the on_request_start function to receive a slice of methods (not only one method), and then handle that change inside on_request_start.


I liked that calling on_request_start() a bunch of times is easy to implement, but changing the signature to receive an iterator of labels is probably harder to misuse and would allow us to write the code such that only a single timer needs to be created per batch call. 🤔
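A minimal sketch of what such a batch-aware signature could look like, assuming illustrative field, type, and function names rather than this repo's actual TransportMetrics API: a single guard starts one timer under the "batch" label and does the per-method in-flight/complete bookkeeping for every call inside the batch when it is dropped.

// Sketch only: hypothetical names, plain prometheus types.
use prometheus::{HistogramVec, IntCounterVec, IntGaugeVec};

pub struct TransportMetrics {
    requests_inflight: IntGaugeVec,
    requests_complete: IntCounterVec,
    requests_duration_seconds: HistogramVec,
}

pub struct BatchRequestGuard<'a> {
    metrics: &'a TransportMetrics,
    methods: Vec<String>,
    timer: Option<prometheus::HistogramTimer>,
}

impl Drop for BatchRequestGuard<'_> {
    fn drop(&mut self) {
        // One duration observation for the whole batch ...
        if let Some(timer) = self.timer.take() {
            timer.observe_duration();
        }
        // ... but per-method bookkeeping for every call inside it.
        for method in &self.methods {
            self.metrics
                .requests_inflight
                .with_label_values(&[method.as_str()])
                .dec();
            self.metrics
                .requests_complete
                .with_label_values(&[method.as_str()])
                .inc();
        }
    }
}

impl TransportMetrics {
    /// Start tracking a batch: one "batch" timer plus per-method in-flight counts.
    pub fn on_batch_request_start<'a>(
        &self,
        methods: impl IntoIterator<Item = &'a str>,
    ) -> BatchRequestGuard<'_> {
        let methods: Vec<String> = methods.into_iter().map(str::to_owned).collect();
        for method in &methods {
            self.requests_inflight
                .with_label_values(&[method.as_str()])
                .inc();
        }
        let timer = self
            .requests_duration_seconds
            .with_label_values(&["batch"])
            .start_timer();
        BatchRequestGuard {
            metrics: self,
            methods,
            timer: Some(timer),
        }
    }
}

The transport would then hold a single guard for the lifetime of the batch, e.g. let _guard = metrics.on_batch_request_start(calls.iter().map(method_name)).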


codecov-commenter commented Aug 3, 2022

Codecov Report

Merging #403 (8759466) into main (7d3e85e) will decrease coverage by 0.05%.
The diff coverage is 0.00%.

❗ Current head 8759466 differs from pull request most recent head f1d7177. Consider uploading reports for the commit f1d7177 to get more accurate results

@@            Coverage Diff             @@
##             main     #403      +/-   ##
==========================================
- Coverage   64.08%   64.03%   -0.06%     
==========================================
  Files         222      222              
  Lines       41848    41876      +28     
==========================================
- Hits        26819    26816       -3     
- Misses      15029    15060      +31     


@nlordell nlordell left a comment


LGTM

.await
.unwrap();
let metric_storage =
TransportMetrics::instance(global_metrics::get_metric_storage_registry()).unwrap();

One issue I see with this test is that other tests may influence its result (since it uses a global storage registry). This could lead to flaky tests when run with other code that increments HTTP metrics.

Since this is an ignored test, this isn't super critical. Maybe we can create a local registry that is used just for this test.
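For illustration, here is the general pattern with a registry owned by the test, so nothing else can bump the counters it asserts on. This is a sketch using plain prometheus types; the repo's TransportMetrics would need an equivalently scoped metric storage registry instead of global_metrics::get_metric_storage_registry().

// Sketch only: demonstrates test-local metric isolation, not the repo's actual test.
use prometheus::{IntCounterVec, Opts, Registry};

#[test]
fn inner_batch_requests_metrics_success_isolated() {
    // Registry owned by this test; other tests never register into or read from it.
    let registry = Registry::new();
    let inner_batch_requests = IntCounterVec::new(
        Opts::new(
            "inner_batch_requests_complete",
            "Number of RPC requests inside batch requests",
        ),
        &["method"],
    )
    .unwrap();
    registry
        .register(Box::new(inner_batch_requests.clone()))
        .unwrap();

    // ... run the batched transport against these metrics, then assert on the
    // local counter without interference from other tests:
    inner_batch_requests.with_label_values(&["eth_call"]).inc();
    assert_eq!(
        inner_batch_requests.with_label_values(&["eth_call"]).get(),
        1
    );
}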

@ahhda ahhda marked this pull request as ready for review August 3, 2022 10:43
@ahhda ahhda requested a review from a team as a code owner August 3, 2022 10:43
@sunce86 sunce86 left a comment


Lg

Comment on lines 149 to 153
let _guard = metrics.on_request_start("batch");
let _guards: Vec<_> = calls
.iter()
.map(|call| metrics.on_request_start(method_name(call)))
.collect();

This mixes up some things in the metrics.

  1. We record the whole batch and the individual requests making it up, but there is no way to distinguish the two. If I make one batch request with one internal request, we count 2 total requests even though 1 is more accurate.
  2. It is not useful to record timing information for the requests making up a batch. They all take the same duration as the batched request. If a request takes one second and I make 3 in a row, I get an average of 1 second. But if I make a batch request that includes these three, we get a 3 second average.

I feel it would be more useful to duplicate

    /// Number of completed RPC requests for ethereum node.
    #[metric(labels("method"))]
    requests_complete: prometheus::IntCounterVec,

into a new metric whose only responsibility is to count individual requests inside of batch requests: inner_batch_requests_complete.
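For example, the duplicated counter could sit right next to the existing field. This is a sketch of the field declaration only; the surrounding struct and derive are unchanged and omitted, and the doc comment wording is illustrative.

    /// Number of completed RPC requests for ethereum node.
    #[metric(labels("method"))]
    requests_complete: prometheus::IntCounterVec,

    /// Number of RPC requests inside batch requests.
    #[metric(labels("method"))]
    inner_batch_requests_complete: prometheus::IntCounterVec,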

@nlordell nlordell Aug 3, 2022


Good point. I agree that we should separate timing information for batch requests as it will be low signal (each request in the batch will take the total time, as you suggested). This could lead to us thinking "why is eth_blockNumber taking so long?" when in fact it only takes long because it is included in a batch with 1000 complicated eth_calls.


To be clear, I don't think we need timing for the inner requests of the batches at all. We only need to count them. So I would only duplicate the one metric I mentioned and not the histogram or in-flight.

@MartinquaXD MartinquaXD Aug 3, 2022


How about not duplicating counters and instead just adding an argument to on_request_start() which allows discarding timing data? That seems like a low effort solution on the Rust side as well as the Grafana side.
Edit: A separate function would mean we wouldn't have to touch existing call sites to add the new argument, though.


on_request_start() does a bunch of things, most of which are not useful for the inner batch calls. It feels cleaner to me to handle the inner batch requests completely separately, because all we want to do there is increment one prometheus counter. No need for DropGuard logic either, because that is only used for timing the request.
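A sketch of what that separate, guard-free path could look like (hypothetical helper name; the snippet further down in this thread inlines the increment directly in the transport instead):

// Sketch only: counting inner batch calls needs no timing and therefore no guard.
use prometheus::IntCounterVec;

struct TransportMetrics {
    inner_batch_requests_complete: IntCounterVec,
}

impl TransportMetrics {
    /// Count one RPC call that is part of a batch request. There is nothing to
    /// time or undo, so a plain increment suffices and no DropGuard is returned.
    fn on_inner_batch_request(&self, method: &str) {
        self.inner_batch_requests_complete
            .with_label_values(&[method])
            .inc();
    }
}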


> I don't think we need timing for the inner requests of the batches at all

I think we should keep timing information for the batch, but not for the inner calls. This is what I meant with my ambiguous "should separate timing information for batch requests". I think we are arguing for the same thing.


I would:

  • Keep track of success/failure for all calls (regular and inner batch) per method
  • Keep track of timing for regular calls and for the complete batch, but not for the inner batch calls themselves.

Comment on lines 150 to 155
calls.iter().for_each(|call| {
metrics
.inner_batch_requests_complete
.with_label_values(&[method_name(call)])
.inc()
});

nit: Technically the requests only complete after execute_rpc() in line 157. That shouldn't matter in the grand scheme of things, though.


It is unclear whether we want to measure before or after. If we measure before, it is possible that the request is dropped before the node handles it. If we measure after, it is possible that the node handled the request but we dropped it before receiving the response. If the goal is to measure the number of calls for pricing, then I'm slightly leaning towards incrementing before. Could rename the metric to "initiated" instead of "complete". Anyway, it's a small detail.

@ahhda ahhda (author)

Yes indeed, the goal is to measure calls for pricing.

> Could rename the metric to "initiated" instead of "complete"

Sounds good. Will update.

@ahhda ahhda enabled auto-merge (squash) August 4, 2022 10:49
@ahhda ahhda disabled auto-merge August 4, 2022 10:50
@ahhda ahhda enabled auto-merge (squash) August 4, 2022 10:52
@ahhda ahhda merged commit 13455a8 into main Aug 4, 2022
@ahhda ahhda deleted the rpc-call-metrics branch August 4, 2022 10:52
@github-actions github-actions bot locked and limited conversation to collaborators Aug 4, 2022
Development

Successfully merging this pull request may close these issues.

Add a prometheus counter for the number of RPC calls
6 participants