Add prometheus metrics for batch calls #403

Merged · 6 commits · Aug 4, 2022
33 changes: 33 additions & 0 deletions crates/shared/src/transport/http.rs
@@ -147,6 +147,10 @@ impl BatchTransport for HttpTransport {

async move {
let _guard = metrics.on_request_start("batch");
let _guards: Vec<_> = calls
.iter()
.map(|call| metrics.on_request_start(method_name(call)))
.collect();
Contributor:
This mixes up some things in the metrics.

  1. We record the whole batch and the individual requests making it up, but there is no way to distinguish the two. If I make one batch request containing one inner request, we count 2 total requests even though 1 is more accurate.
  2. It is not useful to record timing information for the requests making up a batch. They all take the same duration as the batched request itself. If a request takes one second and I make 3 in a row, I get an average of 1 second. But if I make a batch request that includes these three, we get a 3-second average.
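The averaging skew in point 2 can be made concrete with a small standalone sketch (plain Rust, no prometheus types; purely illustrative):

```rust
// Three sequential 1-second calls average 1s; the same three calls inside
// one 3-second batch each get recorded with the batch's total duration,
// so the per-method average jumps to 3s.
fn average(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}

fn main() {
    let sequential = [1.0, 1.0, 1.0]; // each call timed on its own
    let batch_total = 3.0; // the whole batch takes 3 seconds
    let inner_in_batch = [batch_total; 3]; // each inner call reports the batch duration

    println!("sequential avg: {}", average(&sequential)); // 1
    println!("batched avg:    {}", average(&inner_in_batch)); // 3
}
```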

I feel it would be more useful to duplicate

    /// Number of completed RPC requests for ethereum node.
    #[metric(labels("method"))]
    requests_complete: prometheus::IntCounterVec,

into a new metric whose only responsibility is to count the individual requests inside batch requests: inner_batch_requests_complete.
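The split being proposed could look roughly like this (a std-only toy stand-in for the prometheus counters; the real type would be prometheus::IntCounterVec, and all names here are illustrative):

```rust
use std::collections::HashMap;

// Toy metrics struct: one counter family for top-level requests, a separate
// one for the requests *inside* batch requests, as suggested above.
#[derive(Default)]
struct Metrics {
    /// Completed top-level RPC requests, labeled by method ("batch" for batches).
    requests_complete: HashMap<&'static str, u64>,
    /// Completed requests inside batch requests, labeled by method.
    inner_batch_requests_complete: HashMap<&'static str, u64>,
}

impl Metrics {
    fn on_request(&mut self, method: &'static str) {
        *self.requests_complete.entry(method).or_default() += 1;
    }
    fn on_inner_batch_request(&mut self, method: &'static str) {
        *self.inner_batch_requests_complete.entry(method).or_default() += 1;
    }
}

fn main() {
    let mut metrics = Metrics::default();
    // One batch containing three inner calls counts as one top-level request...
    metrics.on_request("batch");
    // ...while its inner calls land in the separate counter family.
    for method in ["eth_blockNumber", "eth_chainId", "eth_call"] {
        metrics.on_inner_batch_request(method);
    }
    assert_eq!(metrics.requests_complete["batch"], 1);
    assert_eq!(metrics.inner_batch_requests_complete["eth_call"], 1);
}
```

With this shape, "total requests" stays accurate (a batch is one request) while per-method volume inside batches is still observable.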

@nlordell (Contributor, Aug 3, 2022):
Good point. I agree that we should separate timing information for batch requests, as it will be low signal (each request in the batch will take the total time, as you suggested). This could lead to us thinking "why is eth_blockNumber taking so long?" when in fact it only does so because it is included in a batch with 1000 complicated eth_calls.

Contributor:
To be clear, I don't think we need timing for the inner requests of the batches at all. We only need to count them. So I would only duplicate the one metric I mentioned and not the histogram or in-flight.

@MartinquaXD (Contributor, Aug 3, 2022):
How about not duplicating counters and instead adding an argument to on_request_start() which allows discarding timing data? That seems like a low-effort solution on the Rust side as well as the Grafana side.
Edit: A separate function would mean we wouldn't have to touch existing call sites to add the new argument, though.

Contributor:
on_request_start does a bunch of things, most of which are not useful for the inner batch calls. It feels cleaner to me to handle the inner batch requests completely separately, because all we want to do there is increment one prometheus counter. No need for DropGuard logic either, because that is only used for timing the request.
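The guard-versus-plain-counter distinction can be sketched in isolation (std-only; RequestGuard, on_request_start, and on_inner_batch_request are illustrative names, not the crate's actual API):

```rust
use std::sync::{Arc, Mutex};
use std::time::Instant;

#[derive(Default)]
struct Timings {
    durations_ms: Vec<u128>,
    inner_batch_count: u64,
}

// Guard that records elapsed time when dropped -- only needed where we
// actually want timing, i.e. top-level (including whole-batch) requests.
struct RequestGuard {
    started: Instant,
    timings: Arc<Mutex<Timings>>,
}

impl Drop for RequestGuard {
    fn drop(&mut self) {
        let elapsed = self.started.elapsed().as_millis();
        self.timings.lock().unwrap().durations_ms.push(elapsed);
    }
}

fn on_request_start(timings: &Arc<Mutex<Timings>>) -> RequestGuard {
    RequestGuard { started: Instant::now(), timings: Arc::clone(timings) }
}

// Inner batch calls: just increment a counter, no guard, no timing.
fn on_inner_batch_request(timings: &Arc<Mutex<Timings>>) {
    timings.lock().unwrap().inner_batch_count += 1;
}

fn main() {
    let timings = Arc::new(Mutex::new(Timings::default()));
    {
        let _guard = on_request_start(&timings); // times the whole batch
        for _ in 0..3 {
            on_inner_batch_request(&timings); // counted, not timed
        }
    } // guard dropped here -> exactly one duration sample recorded
    let t = timings.lock().unwrap();
    assert_eq!(t.durations_ms.len(), 1);
    assert_eq!(t.inner_batch_count, 3);
}
```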

Contributor:
> I don't think we need timing for the inner requests of the batches at all

I think we should keep timing information for the batch, but not for the inner calls. This is what I meant with my ambiguous "should separate timing information for batch requests". I think we are arguing for the same thing.

Contributor:
I would:

  • Keep track of success/failure for all calls (regular and inner batch) per method
  • Keep track of timing for regular calls and for the complete batch, but not for the inner batch calls themselves.


let outputs = execute_rpc(client, inner, id, &Request::Batch(calls)).await?;
handle_batch_response(&ids, outputs)
@@ -258,6 +262,8 @@ impl TransportMetrics {
#[cfg(test)]
mod tests {
use super::*;
use crate::transport::create_env_test_transport;
use crate::Web3;

#[test]
fn handles_batch_response_being_in_different_order_than_input() {
@@ -303,4 +309,31 @@ mod tests {
)
.is_err());
}

#[tokio::test]
#[ignore]
async fn batch_call_success() {
let http = create_env_test_transport();
let web3 = Web3::new(http);
let request = web3.transport().prepare("eth_blockNumber", Vec::default());
let request2 = web3.transport().prepare("eth_chainId", Vec::default());
let _batch_response = web3
.transport()
.send_batch([request, request2])
.await
.unwrap()
.into_iter();
MartinquaXD marked this conversation as resolved.
let metric_storage =
TransportMetrics::instance(global_metrics::get_metric_storage_registry()).unwrap();
Contributor:

One issue I see with this test is that other tests may influence its result (since it uses a global storage registry). This could lead to flaky tests when run with other code that increments HTTP metrics.

Since this is an ignored test, this isn't super critical. Maybe we can create a local registry that is used just for this test.
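The flakiness concern can be demonstrated without prometheus at all (a std atomic counter stands in for the global metric storage registry; the local struct stands in for the suggested per-test registry — all names illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::OnceLock;

// Process-global counter, shared by every test -- the situation the
// review comment warns about.
static GLOBAL_REQUESTS: OnceLock<AtomicU64> = OnceLock::new();

fn global_requests() -> &'static AtomicU64 {
    GLOBAL_REQUESTS.get_or_init(|| AtomicU64::new(0))
}

// A registry created locally, so each test starts from zero.
struct LocalMetrics {
    requests: AtomicU64,
}

fn main() {
    // Another test (or earlier code) already bumped the global counter.
    global_requests().fetch_add(5, Ordering::Relaxed);

    // A test asserting on the global counter inherits that history.
    global_requests().fetch_add(1, Ordering::Relaxed);
    assert_eq!(global_requests().load(Ordering::Relaxed), 6); // not 1!

    // A local registry is isolated and deterministic.
    let local = LocalMetrics { requests: AtomicU64::new(0) };
    local.requests.fetch_add(1, Ordering::Relaxed);
    assert_eq!(local.requests.load(Ordering::Relaxed), 1);
}
```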

for method_name in ["eth_blockNumber", "eth_chainId", "batch"] {
let number_calls = metric_storage
.requests_complete
.with_label_values(&[method_name]);
let inflight_calls = metric_storage
.requests_inflight
.with_label_values(&[method_name]);
assert_eq!(number_calls.get(), 1);
assert_eq!(inflight_calls.get(), 0);
}
}
}