Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(metrics): measure call time and wait time separately #1858

Merged
merged 4 commits into from
Oct 30, 2023

Conversation

folex
Copy link
Member

@folex folex commented Oct 26, 2023

Description

  • Report actor mailbox size and actor counts in every poll. Before this PR, it was only reported when there are interpretation_stats.
  • Measure lock wait time separately from marine execution time in AppService
  • Measure time it takes for service call execution to be scheduled on Blocking Pool in Actor
  • Separate call execution time measurement from schedule wait time in Actor

Motivation

Including queuing time from latency measurements skews measurements when system is saturated. In such form, measurement doesn't make much sense: it will obviously grow as more work arrives.

Instead, execution latency should be measured separately, to allow for applying eg Little's Law to estimate system's throughput at given execution latencies.

Another benefit of that separation is that we get a clear picture of actual Marine performance.

Concerns

We have two sets of metrics that describe execution latencies.

One is one the higher level: in Plumber-Actor. It measures Builtin calls latencies as close as possible, which is good. However, for Service Calls its measurements include RWLock and Mutex latencies, which makes these measurements skewed.

To the rescue comes another measurement, lower in the stack, at AppService level. That one (now) measures pure Marine call latency, and time it took to wait for a Service lock to be available.

The mixture of measurement sites and absence of clear naming makes it hard to navigate metrics.

The dashboards in Grafana must be upgraded with great care, then. And documenting these caveats is crucial for external users' ability to observe Nox through these metrics.

@folex folex requested review from gurinderu and kmd-fl October 26, 2023 21:52
@folex folex changed the title fix(metrics): always report mailbox size, actor count fix(metrics): measure call time and wait time separately Oct 26, 2023
@@ -182,13 +183,20 @@ impl ServicesMetricsExternal {
"number of modules per services",
);

let call_time_msec: Family<_, _> = register(
let call_time_sec: Family<_, _> = register(
sub_registry,
Family::new_with_constructor(|| Histogram::new(execution_time_buckets())),
"call_time_msec",
Copy link
Member Author

@folex folex Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why it's called msec. @kmd-fl wdyt?

@folex folex requested a review from ValeryAntopol October 26, 2023 23:41
@folex folex merged commit 73bab7e into master Oct 30, 2023
13 checks passed
@folex folex deleted the fix_actor_metrics branch October 30, 2023 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants