[Telemetry] collect event loop delays on server & browser #101283

Bamieh · 2021-06-03T14:35:05Z

Summary

As part of measuring Kibana performance, we want to monitor event loop delays.

This would help us detect how often customers face delays in computation and I/O. We can try to correlate this data with memory size, server/browser uptime and outgoing requests to get a better picture of the state of Kibana when the ramp-up starts to happen.

On the server

We have access to APIs from Node.js core to monitor the event loop:

perf_hooks.monitorEventLoopDelay()
perf_hooks.performance.eventLoopUtilization()
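
As a rough illustration of the second API, the sketch below samples event loop utilization between two points in time. The helper name `createUtilizationSampler` is hypothetical and not part of Kibana or Node.js; only `performance.eventLoopUtilization()` comes from Node.js core.

```javascript
const { performance } = require('perf_hooks');

// Hypothetical helper: each call to the returned function reports the
// utilization of the event loop since the previous call.
function createUtilizationSampler() {
  let last = performance.eventLoopUtilization();
  return function sample() {
    const current = performance.eventLoopUtilization();
    // Passing two readings returns the delta between them:
    // { idle, active, utilization }, with idle/active in milliseconds.
    const delta = performance.eventLoopUtilization(current, last);
    last = current;
    return delta;
  };
}
```

A caller could invoke the sampler on a timer and report `delta.utilization`, a ratio of active time to total time for the sampled window.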

On the browser

On the browser we can use PerformanceTiming to read timestamps (milliseconds since the Unix epoch) for navigation and page-load milestones. Subtracting two of them gives the time taken by a given phase:

connectEnd: 1622730377902
connectStart: 1622730377902
domComplete: 1622730378157
domContentLoadedEventEnd: 1622730378143
domContentLoadedEventStart: 1622730378143
domInteractive: 1622730378034
domLoading: 1622730377994
domainLookupEnd: 1622730377902
domainLookupStart: 1622730377902
fetchStart: 1622730377902
loadEventEnd: 1622730378157
loadEventStart: 1622730378157
navigationStart: 1622730377899
requestStart: 1622730377902
responseEnd: 1622730377982
responseStart: 1622730377902
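
To show how these timestamps turn into durations, here is a minimal sketch that subtracts pairs of fields. The function name `summarizeTiming` and the choice of which phases to report are assumptions for illustration only.

```javascript
// Hypothetical helper: derives durations (in milliseconds) from a
// PerformanceTiming-shaped object such as window.performance.timing.
function summarizeTiming(t) {
  return {
    dnsMs: t.domainLookupEnd - t.domainLookupStart,
    connectMs: t.connectEnd - t.connectStart,
    responseMs: t.responseEnd - t.responseStart,
    domInteractiveMs: t.domInteractive - t.navigationStart,
    loadMs: t.loadEventEnd - t.navigationStart,
  };
}

// Using the sample values listed above:
const sample = summarizeTiming({
  navigationStart: 1622730377899,
  domainLookupStart: 1622730377902,
  domainLookupEnd: 1622730377902,
  connectStart: 1622730377902,
  connectEnd: 1622730377902,
  responseStart: 1622730377902,
  responseEnd: 1622730377982,
  domInteractive: 1622730378034,
  loadEventEnd: 1622730378157,
});
// sample.responseMs is 80, sample.domInteractiveMs is 135, sample.loadMs is 258
```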

We can also approximate perf_hooks.monitorEventLoopDelay() with JavaScript timers:

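// Latest measured delay, in milliseconds.
let eventLoopDelay = 0;

function measureEventLoopDelay() {
  const intervalMs = 500;
  const timer = setInterval(() => {
    const last = window.performance.now();
    // setImmediate is not a standard browser API; a polyfill is assumed here.
    setImmediate(() => {
      eventLoopDelay = window.performance.now() - last;
    });
  }, intervalMs);

  // Return the timer id so the caller can stop sampling with clearInterval().
  return timer;
}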

Details

In the Node.js event loop, setInterval callbacks run in the timers phase at the beginning of a cycle, while setImmediate callbacks run in the check phase near the end of it. Measuring the difference in timing between the two approximates the time it took for the cycle to complete.

elasticmachine · 2021-06-03T14:35:07Z

Pinging @elastic/kibana-core (Team:Core)

mshustov · 2021-06-03T17:11:36Z

Do you want to collect delays for every loop? or with a configurable interval? or calculate it lazily by request?
Nodejs already returns an IntervalHistogram. What value should we report? 95perc, maybe?

Bamieh · 2021-06-07T10:39:51Z

> Do you want to collect delays for every loop? or with a configurable interval? or calculate it lazily by request?

In terms of the usability of this metric, a snapshot of the current delay taken at the moment usage is collected will not provide any useful insights.

perf_hooks.monitorEventLoopDelay() will keep accumulating delay samples until disable() is called; reset() clears the data collected so far.

My plan is to call monitor.reset() once a day and report an array of daily histograms:

{
  dailyEventLoopDelays: [
    {
      timestamp: '<timestamp>',
      min: 8314880,
      max: 2241855487,
      mean: 11560498.484671826,
      exceeds: 0,
      stddev: 23112618.446909714,
      percentiles: {
        0: 8314880,
        50: 10887168,
        75: 12468224,
        87.5: 12607488,
        93.75: 12615680,
        96.875: 12632064,
        98.4375: 12656640,
        99.21875: 12697600,
        99.609375: 13582336,
        99.8046875: 16637952,
        99.90234375: 21200896,
        99.951171875: 26902528,
        99.9755859375: 74121216,
        99.98779296875: 584581120,
        99.993896484375: 2239758336,
        100: 2239758336,
      },
    },
    ...
  ]
}
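
A minimal sketch of how such a daily entry could be produced with the Node.js API is shown below. The helper names `startDelayMonitor` and `snapshotAndReset` are hypothetical, and the scheduling (once per day) is left to the caller; this is not the implementation from the eventual PR. The values reported by the monitor are in nanoseconds.

```javascript
const { monitorEventLoopDelay } = require('perf_hooks');

// Hypothetical helper: start sampling the event loop delay.
// `resolution` is the sampling rate in milliseconds.
function startDelayMonitor(resolution = 10) {
  const monitor = monitorEventLoopDelay({ resolution });
  monitor.enable();
  return monitor;
}

// Hypothetical helper: copy the current histogram into a plain object
// shaped like one entry of dailyEventLoopDelays, then clear the histogram.
function snapshotAndReset(monitor) {
  const snapshot = {
    timestamp: new Date().toISOString(),
    min: monitor.min,
    max: monitor.max,
    mean: monitor.mean,
    exceeds: monitor.exceeds,
    stddev: monitor.stddev,
    // monitor.percentiles is a Map of percentile -> value.
    percentiles: Object.fromEntries(monitor.percentiles),
  };
  monitor.reset();
  return snapshot;
}
```

Calling `snapshotAndReset(monitor)` from a daily timer and pushing the result onto an array would yield the structure above.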

Daily granularity is consistent with the rest of the daily aggregated events we collect. This allows understanding what recent changes might be causing fluctuations in the delay on a per level basis.

On the telemetry cluster: Calculating the average of averages should be equal to the total average of the delay in the kibana process which is useful for the snapshot usage we collect.
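
One caveat worth illustrating: a plain average of daily means matches the overall mean only when every day contributes the same number of samples. If sample counts differ (for example, after a restart mid-day), the means need to be weighted. The sketch below assumes each daily entry also carries a hypothetical `count` field, which is not part of the report shape shown above.

```javascript
// Hypothetical helper: combine per-day means into an overall mean,
// weighting each day by its (assumed) sample count.
function combineDailyMeans(days) {
  let total = 0;
  let count = 0;
  for (const day of days) {
    total += day.mean * day.count;
    count += day.count;
  }
  return count === 0 ? NaN : total / count;
}

// One short day (1 sample, mean 10) and one longer day (3 samples, mean 20):
const overall = combineDailyMeans([
  { mean: 10, count: 1 },
  { mean: 20, count: 3 },
]);
// overall is 17.5, whereas the unweighted average of the two means is 15
```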

> Nodejs already returns an IntervalHistogram. What value should we report? 95perc, maybe?

I want to report the whole histogram along with the process total uptime to provide useful insight into the delays happening in the kibana process.

cc @thesmallestduck

mshustov · 2021-06-07T11:30:44Z

> My plan is to monitor.reset() every 1 day and report an array of 1 day histograms

Is a high level of granularity really useful? I'd guess we are interested in a sub-set of data (mean, min, max, 50th, 75th, 95th, 99th)

> In terms of the usability of this metric a snapshot of the current delay when the usage is collected it will not provide any useful insights.

Let's say that's true, then what do we use in the browser? Is there a package that collects data in form of IntervalHistogram?

Bamieh · 2021-06-07T11:41:55Z

> I'd guess we are interested in a sub-set of data (mean, min, max, 50th, 75th, 95th, 99th)

True, I don't think we need all the levels provided by the Node.js performance histogram for our case. What you mentioned above should be enough.

> Let's say that's true, then what do we use in the browser? Is there a package that collects data in form of IntervalHistogram?

I'd leave the implementation details for the PR, but just to highlight my thought process: we can easily build our own IntervalHistogram without any libraries, using the same resolution we are using on the server (10 seconds by default).

Ideally I would be able to use the Node.js Histogram implementation, and then the browser would just send an array of averages to be calculated on the server side (we're sending the data anyway to be stored for the usage report).

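const perf_hooks = require('perf_hooks');

// create histogram
const histogram = perf_hooks.createHistogram();

// to update every x resolution:
histogram.record(1231);
// recordDelta() records the nanoseconds elapsed since its previous call:
histogram.recordDelta();
// after collecting the last day's data, clear the histogram:
histogram.reset();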
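
For the "build our own" option on the browser, a library-free recorder could look roughly like the sketch below. The class name `DelayRecorder` and its methods are hypothetical; it keeps raw samples and computes the subset of statistics discussed above (min, max, mean, 50th, 75th, 95th, 99th) using the nearest-rank method.

```javascript
// Hypothetical, library-free recorder for event loop delay samples.
class DelayRecorder {
  constructor() {
    this.samples = [];
  }

  // Record one delay measurement (any consistent unit, e.g. milliseconds).
  record(delay) {
    this.samples.push(delay);
  }

  // Summarize the recorded samples, or return null if there are none.
  summarize() {
    const sorted = [...this.samples].sort((a, b) => a - b);
    const n = sorted.length;
    if (n === 0) return null;
    // Nearest-rank percentile: the value at position ceil(p/100 * n).
    const percentile = (p) => sorted[Math.max(0, Math.ceil((p * n) / 100) - 1)];
    const sum = sorted.reduce((acc, v) => acc + v, 0);
    return {
      min: sorted[0],
      max: sorted[n - 1],
      mean: sum / n,
      percentiles: {
        50: percentile(50),
        75: percentile(75),
        95: percentile(95),
        99: percentile(99),
      },
    };
  }

  // Drop all samples, e.g. after reporting a day's summary.
  reset() {
    this.samples = [];
  }
}
```

Keeping raw samples is the simplest approach but grows with the sampling rate; a bucketed histogram would bound memory at the cost of precision.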

In addition to the delays in the loop on the browser, I was thinking of collecting PerformanceTiming metrics provided by the window.performance API (although still experimental). These would provide more useful timers IMO.

mshustov · 2021-06-07T15:46:15Z

> In addition to the delays in the loop on the browser I was thinking of collecting PerformanceTiming metrics provided by the window.performance API (although still experimental). These would provide more useful timers IMO.

I agree it can be useful, but let's do it as a separate task?

afharo · 2021-08-17T11:14:47Z

@Bamieh since the PR was merged, can we close this issue? Are there any pending items?

Bamieh · 2021-08-19T11:09:14Z

@afharo The browser side of things is not implemented, only the server side was implemented.

pgayvallet · 2024-07-05T07:21:04Z

With our recent shift of priorities due to serverless, I don't think we will ever really need or want to collect browser-side event loop delay, so I'll go ahead and close this (but feel free to reopen if you think I shouldn't have)

Bamieh added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc enhancement New value added to drive a business result Feature:Telemetry v7.14.0 labels Jun 3, 2021

Bamieh changed the title ~~[Telemetry] collect event loop delays on client & browser~~ [Telemetry] collect event loop delays on server & browser Jun 3, 2021

Bamieh mentioned this issue Jun 8, 2021

[Telemetry] Track event loop delays on the server #101580

Merged

pgayvallet removed the v7.14.0 label Jul 12, 2021

pgayvallet closed this as completed Jul 5, 2024
