
Add prometheus metrics for batch calls #403

Merged: 6 commits from rpc-call-metrics into main on Aug 4, 2022
Conversation


@ahhda ahhda commented Aug 3, 2022

Fixes #389

Adds metrics for each RPC call within a batch call, so individual calls inside a batch now get counted in our metrics.

[Screenshot: 2022-08-03 at 3 57 17 PM]

Test Plan

  • Run tests using the command cargo test -- inner_batch_requests_metrics_success --nocapture --ignored

@@ -147,6 +147,9 @@ impl BatchTransport for HttpTransport {

async move {
let _guard = metrics.on_request_start("batch");
for call in &calls {

Based on the docs for scopeguard, this should not work as expected:
https://docs.rs/scopeguard/latest/scopeguard/

So, as I understand it, the point here is to increment the in-flight counter before executing the request, and then decrement the in-flight counter and increment the completed counter after executing the request.

With your code, I would expect all of this to happen inside the inner scope.

It seems to me that we actually need to change the signature of the on_request_start function to receive a slice of methods (not only one method), and then handle that change inside on_request_start.


I liked that calling on_request_start() a bunch of times is easy to implement, but changing the signature to receive an iterator of labels is probably harder to misuse and would allow us to write the code such that only a single timer needs to be created per batch call. 🤔
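A minimal sketch of what such a batch-aware signature could look like, assuming illustrative field, type, and function names rather than this repo's actual TransportMetrics API: a single guard starts one timer under the "batch" label and does the per-method in-flight/complete bookkeeping for every call inside the batch when it is dropped.

// Sketch only: hypothetical names, plain prometheus types.
use prometheus::{HistogramVec, IntCounterVec, IntGaugeVec};

pub struct TransportMetrics {
    requests_inflight: IntGaugeVec,
    requests_complete: IntCounterVec,
    requests_duration_seconds: HistogramVec,
}

pub struct BatchRequestGuard<'a> {
    metrics: &'a TransportMetrics,
    methods: Vec<String>,
    timer: Option<prometheus::HistogramTimer>,
}

impl Drop for BatchRequestGuard<'_> {
    fn drop(&mut self) {
        // One duration observation for the whole batch ...
        if let Some(timer) = self.timer.take() {
            timer.observe_duration();
        }
        // ... but per-method bookkeeping for every call inside it.
        for method in &self.methods {
            self.metrics
                .requests_inflight
                .with_label_values(&[method.as_str()])
                .dec();
            self.metrics
                .requests_complete
                .with_label_values(&[method.as_str()])
                .inc();
        }
    }
}

impl TransportMetrics {
    /// Start tracking a batch: one "batch" timer plus per-method in-flight counts.
    pub fn on_batch_request_start<'a>(
        &self,
        methods: impl IntoIterator<Item = &'a str>,
    ) -> BatchRequestGuard<'_> {
        let methods: Vec<String> = methods.into_iter().map(str::to_owned).collect();
        for method in &methods {
            self.requests_inflight
                .with_label_values(&[method.as_str()])
                .inc();
        }
        let timer = self
            .requests_duration_seconds
            .with_label_values(&["batch"])
            .start_timer();
        BatchRequestGuard {
            metrics: self,
            methods,
            timer: Some(timer),
        }
    }
}

The transport would then hold a single guard for the lifetime of the batch, e.g. let _guard = metrics.on_batch_request_start(calls.iter().map(method_name)).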


codecov-commenter commented Aug 3, 2022

Codecov Report

Merging #403 (8759466) into main (7d3e85e) will decrease coverage by 0.05%.
The diff coverage is 0.00%.

❗ Current head 8759466 differs from pull request most recent head f1d7177. Consider uploading reports for the commit f1d7177 to get more accurate results

@@            Coverage Diff             @@
##             main     #403      +/-   ##
==========================================
- Coverage   64.08%   64.03%   -0.06%     
==========================================
  Files         222      222              
  Lines       41848    41876      +28     
==========================================
- Hits        26819    26816       -3     
- Misses      15029    15060      +31     


@nlordell nlordell left a comment


LGTM

.await
.unwrap();
let metric_storage =
TransportMetrics::instance(global_metrics::get_metric_storage_registry()).unwrap();

One issue I see with this test is that other tests may influence its result (since it uses a global storage registry). This could lead to flaky tests when run with other code that increments HTTP metrics.

Since this is an ignored test, this isn't super critical. Maybe we can create a local registry that is used just for this test.
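For illustration, here is the general pattern with a registry owned by the test, so nothing else can bump the counters it asserts on. This is a sketch using plain prometheus types; the repo's TransportMetrics would need an equivalently scoped metric storage registry instead of global_metrics::get_metric_storage_registry().

// Sketch only: demonstrates test-local metric isolation, not the repo's actual test.
use prometheus::{IntCounterVec, Opts, Registry};

#[test]
fn inner_batch_requests_metrics_success_isolated() {
    // Registry owned by this test; other tests never register into or read from it.
    let registry = Registry::new();
    let inner_batch_requests = IntCounterVec::new(
        Opts::new(
            "inner_batch_requests_complete",
            "Number of RPC requests inside batch requests",
        ),
        &["method"],
    )
    .unwrap();
    registry
        .register(Box::new(inner_batch_requests.clone()))
        .unwrap();

    // ... run the batched transport against these metrics, then assert on the
    // local counter without interference from other tests:
    inner_batch_requests.with_label_values(&["eth_call"]).inc();
    assert_eq!(
        inner_batch_requests.with_label_values(&["eth_call"]).get(),
        1
    );
}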

@ahhda ahhda marked this pull request as ready for review August 3, 2022 10:43
@ahhda ahhda requested a review from a team as a code owner August 3, 2022 10:43
@sunce86 sunce86 left a comment


Lg

Comment on lines 149 to 153
let _guard = metrics.on_request_start("batch");
let _guards: Vec<_> = calls
.iter()
.map(|call| metrics.on_request_start(method_name(call)))
.collect();

This mixes up some things in the metrics.

  1. We record the whole batch and the individual requests making it up, but there is no way to distinguish the two. If I make one batch request with one internal request, we count 2 total requests even though 1 is more accurate.
  2. It is not useful to record timing information for the requests making up a batch. They all take the same duration as the batched request. If a request takes one second and I make 3 in a row, I get an average of 1 second. But if I make a batch request that includes these three, we get a 3 second average.

I feel it would be more useful to duplicate

    /// Number of completed RPC requests for ethereum node.
    #[metric(labels("method"))]
    requests_complete: prometheus::IntCounterVec,

into a new metric whose only responsibility is to count individual requests inside of batch requests: inner_batch_requests_complete.
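For example, the duplicated counter could sit right next to the existing field. This is a sketch of the field declaration only; the surrounding struct and derive are unchanged and omitted, and the doc comment wording is illustrative.

    /// Number of completed RPC requests for ethereum node.
    #[metric(labels("method"))]
    requests_complete: prometheus::IntCounterVec,

    /// Number of RPC requests inside batch requests.
    #[metric(labels("method"))]
    inner_batch_requests_complete: prometheus::IntCounterVec,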

@nlordell nlordell Aug 3, 2022


Good point. I agree that we should separate timing information for batch requests as it will be low signal (each request in the batch will take the total time, as you suggested). This could lead to us thinking "why is eth_blockNumber taking so long?" when in fact it only takes long because it is included in a batch with 1000 complicated eth_calls.


To be clear, I don't think we need timing for the inner requests of the batches at all. We only need to count them. So I would only duplicate the one metric I mentioned and not the histogram or in-flight.

@MartinquaXD MartinquaXD Aug 3, 2022


How about not duplicating counters and instead just adding an argument to on_request_start() which allows discarding timing data? That seems like a low effort solution on the Rust side as well as the Grafana side.
Edit: A separate function would mean we wouldn't have to touch existing call sites to add the new argument, though.


on_request_start() does a bunch of things, most of which are not useful for the inner batch calls. It feels cleaner to me to handle the inner batch requests completely separately, because all we want to do there is increment one prometheus counter. No need for DropGuard logic either, because that is only used for timing the request.
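A sketch of what that separate, guard-free path could look like (hypothetical helper name; the snippet further down in this thread inlines the increment directly in the transport instead):

// Sketch only: counting inner batch calls needs no timing and therefore no guard.
use prometheus::IntCounterVec;

struct TransportMetrics {
    inner_batch_requests_complete: IntCounterVec,
}

impl TransportMetrics {
    /// Count one RPC call that is part of a batch request. There is nothing to
    /// time or undo, so a plain increment suffices and no DropGuard is returned.
    fn on_inner_batch_request(&self, method: &str) {
        self.inner_batch_requests_complete
            .with_label_values(&[method])
            .inc();
    }
}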


> I don't think we need timing for the inner requests of the batches at all

I think we should keep timing information for the batch, but not for the inner calls. This is what I meant with my ambiguous "should separate timing information for batch requests". I think we are arguing for the same thing.


I would:

  • Keep track of success/failure for all calls (regular and inner batch) per method
  • Keep track of timing for regular calls and for the complete batch, but not for the inner batch calls themselves.

Comment on lines 150 to 155
calls.iter().for_each(|call| {
metrics
.inner_batch_requests_complete
.with_label_values(&[method_name(call)])
.inc()
});

nit: Technically the requests only complete after execute_rpc() in line 157. That shouldn't matter in the grand scheme of things, though.


It is unclear whether we want to measure before or after. If we measure before, it is possible that the request is dropped before the node handles it. If we measure after, it is possible that the node handled the request but we dropped it before receiving the response. If the goal is to measure the number of calls for pricing, then I'm slightly leaning towards incrementing before. Could rename the metric to "initiated" instead of "complete". Anyway, it's a small detail.

@ahhda ahhda (author)

Yes indeed, the goal is to measure calls for pricing.

> Could rename the metric to "initiated" instead of "complete"

Sounds good. Will update.

@ahhda ahhda enabled auto-merge (squash) August 4, 2022 10:49
@ahhda ahhda disabled auto-merge August 4, 2022 10:50
@ahhda ahhda enabled auto-merge (squash) August 4, 2022 10:52
@ahhda ahhda merged commit 13455a8 into main Aug 4, 2022
@ahhda ahhda deleted the rpc-call-metrics branch August 4, 2022 10:52
@github-actions github-actions bot locked and limited conversation to collaborators Aug 4, 2022
Development

Successfully merging this pull request may close these issues.

Add a prometheus counter for the number of RPC calls
6 participants