[Telemetry] collect event loop delays on server & browser #101283

Closed
Bamieh opened this issue Jun 3, 2021 · 9 comments
Labels
enhancement New value added to drive a business result Feature:Telemetry Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@Bamieh
Member

Bamieh commented Jun 3, 2021

Summary

As part of measuring Kibana performance, we want to monitor event loop delays.

This would help us detect how often customers face delays in computations and I/O. We can try to correlate this data with memory size, server/browser uptime and outgoing requests to get a better picture of the state of Kibana when the ramp-up starts to happen.

On the server

We have access to APIs from Node.js core to monitor the event loop:

perf_hooks.monitorEventLoopDelay()
perf_hooks.performance.eventLoopUtilization()
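
For illustration, a minimal sketch of how these two APIs could be wired together on the server. The function names `readEventLoopStats` and `stopMonitoring` are hypothetical and not part of Kibana; only the `perf_hooks` calls come from Node.js itself.

```javascript
const { monitorEventLoopDelay, performance } = require('perf_hooks');

// Sample the loop delay every 10 ms (the documented default resolution).
const monitor = monitorEventLoopDelay({ resolution: 10 });
monitor.enable();

// Baseline for event loop utilization; later readings are diffed against it.
const eluBaseline = performance.eventLoopUtilization();

// Hypothetical helper: read the current delay histogram and utilization.
function readEventLoopStats() {
  const elu = performance.eventLoopUtilization(eluBaseline);
  return {
    // Delay values are reported in nanoseconds.
    meanDelayNs: monitor.mean,
    maxDelayNs: monitor.max,
    p99DelayNs: monitor.percentile(99),
    // Fraction of time the loop was busy since the baseline.
    utilization: elu.utilization,
  };
}

// Hypothetical helper: stop sampling when the data is no longer needed.
function stopMonitoring() {
  monitor.disable();
}
```

The histogram only has meaningful values after the loop has run for a while, so a reading taken right after `enable()` is not representative.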

On the browser

In the browser we can use PerformanceTiming to read the timestamps (in milliseconds since the epoch) at which certain navigation and page-load milestones were reached:

connectEnd: 1622730377902
connectStart: 1622730377902
domComplete: 1622730378157
domContentLoadedEventEnd: 1622730378143
domContentLoadedEventStart: 1622730378143
domInteractive: 1622730378034
domLoading: 1622730377994
domainLookupEnd: 1622730377902
domainLookupStart: 1622730377902
fetchStart: 1622730377902
loadEventEnd: 1622730378157
loadEventStart: 1622730378157
navigationStart: 1622730377899
requestStart: 1622730377902
responseEnd: 1622730377982
responseStart: 1622730377902
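
Since these values are absolute timestamps, durations have to be derived by subtraction. A rough sketch of what that could look like; the function name and the choice of output fields are assumptions for illustration only:

```javascript
// Hypothetical helper: turn PerformanceTiming timestamps into durations (ms).
// `t` is any object with the PerformanceTiming fields listed above,
// for example window.performance.timing in the browser.
function summarizeNavigationTiming(t) {
  return {
    timeToFirstByteMs: t.responseStart - t.navigationStart,
    responseMs: t.responseEnd - t.responseStart,
    domInteractiveMs: t.domInteractive - t.navigationStart,
    domContentLoadedMs: t.domContentLoadedEventEnd - t.navigationStart,
    pageLoadMs: t.loadEventEnd - t.navigationStart,
  };
}
```

With the sample values above, this gives a page load time of 258 ms and a time to DOM interactive of 135 ms.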

We can also simulate perf_hooks.monitorEventLoopDelay() through JavaScript timers:

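// Most recent measured delay, in milliseconds.
let eventLoopDelay = 0;

function measureEventLoopDelay() {
  const intervalMs = 500;
  const timer = setInterval(() => {
    const last = window.performance.now();
    // setImmediate is non-standard in browsers; this assumes a polyfill is loaded.
    setImmediate(() => {
      eventLoopDelay = window.performance.now() - last;
    });
  }, intervalMs);

  // Return the timer ID so the caller can stop measuring with clearInterval().
  return timer;
}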

Details

setInterval callbacks are triggered at the beginning of an event loop cycle, while setImmediate callbacks are triggered at the end of it. The difference in timing between the two approximates the time it took for the cycle to complete.

@Bamieh Bamieh added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc enhancement New value added to drive a business result Feature:Telemetry v7.14.0 labels Jun 3, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@Bamieh Bamieh changed the title [Telemetry] collect event loop delays on client & browser [Telemetry] collect event loop delays on server & browser Jun 3, 2021
@mshustov
Contributor

mshustov commented Jun 3, 2021

Do you want to collect delays for every loop? or with a configurable interval? or calculate it lazily by request?
Nodejs already returns an IntervalHistogram. What value should we report? 95perc, maybe?

@Bamieh
Member Author

Bamieh commented Jun 7, 2021

Do you want to collect delays for every loop? or with a configurable interval? or calculate it lazily by request?

In terms of the usability of this metric, a snapshot of the current delay at the time usage is collected will not provide any useful insights.

The histogram returned by perf_hooks.monitorEventLoopDelay() will, once enabled, keep recording delays until disable() is called; reset() clears the data collected so far.

My plan is to monitor.reset() every 1 day and report an array of 1 day histograms:

{
  dailyEventLoopDelays: [
    {
      timestamp: '<timestamp>',
      min: 8314880,
      max: 2241855487,
      mean: 11560498.484671826,
      exceeds: 0,
      stddev: 23112618.446909714,
      percentiles: {
        0: 8314880,
        50: 10887168,
        75: 12468224,
        87.5: 12607488,
        93.75: 12615680,
        96.875: 12632064,
        98.4375: 12656640,
        99.21875: 12697600,
        99.609375: 13582336,
        99.8046875: 16637952,
        99.90234375: 21200896,
        99.951171875: 26902528,
        99.9755859375: 74121216,
        99.98779296875: 584581120,
        99.993896484375: 2239758336,
        100: 2239758336,
      },
    },
    ...
  ]
}

Daily granularity is consistent with the rest of the daily aggregated events we collect. This allows understanding what recent changes might be causing fluctuations in the delay on a per level basis.

On the telemetry cluster: the average of the daily averages should approximate the overall average delay in the Kibana process, which is useful for the snapshot usage we collect.
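
A caveat worth spelling out: the average of averages matches the overall mean exactly only when every day contributes the same number of samples. A small sketch of the difference, assuming each daily entry also carried a hypothetical `count` field (not part of the payload shown above):

```javascript
// Plain average of the daily means; ignores how many samples each day had.
function meanOfMeans(days) {
  if (days.length === 0) return NaN;
  return days.reduce((sum, d) => sum + d.mean, 0) / days.length;
}

// Sample-weighted mean; equals the mean over all samples of all days.
// `count` is a hypothetical per-day sample count.
function overallMean(days) {
  const total = days.reduce((sum, d) => sum + d.count, 0);
  if (total === 0) return NaN;
  return days.reduce((sum, d) => sum + d.mean * d.count, 0) / total;
}
```

For two days with means 10 and 20 and sample counts 1 and 3, the plain average is 15 while the weighted mean is 17.5. If the process runs continuously with a fixed resolution, the daily counts should be close to each other and the two values nearly coincide.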

Nodejs already returns an IntervalHistogram. What value should we report? 95perc, maybe?

I want to report the whole histogram along with the process total uptime to provide useful insight into the delays happening in the kibana process.

cc @thesmallestduck

@mshustov
Contributor

mshustov commented Jun 7, 2021

My plan is to monitor.reset() every 1 day and report an array of 1 day histograms

Is a high level of granularity really useful? I'd guess we are interested in a sub-set of data (mean, min, max, 50th, 75th, 95th, 99th)

In terms of the usability of this metric, a snapshot of the current delay at the time usage is collected will not provide any useful insights.

Let's say that's true, then what do we use in the browser? Is there a package that collects data in form of IntervalHistogram?

@Bamieh
Member Author

Bamieh commented Jun 7, 2021

I'd guess we are interested in a sub-set of data (mean, min, max, 50th, 75th, 95th, 99th)

True, I don't think we need all the levels provided by the Node.js performance histogram for our case. What you mentioned above should be enough.
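
As a rough sketch of what collecting that subset once a day could look like (the function name, the constant, and the exact output shape are assumptions, not the final implementation):

```javascript
// Percentiles suggested in this thread; adjust as needed.
const REPORTED_PERCENTILES = [50, 75, 95, 99];

// Hypothetical helper. `histogram` is anything with the perf_hooks
// histogram shape: min, max, mean, exceeds, stddev, percentile(p), reset().
function collectDailyEntry(histogram, now = new Date()) {
  const percentiles = {};
  for (const p of REPORTED_PERCENTILES) {
    percentiles[p] = histogram.percentile(p);
  }
  const entry = {
    timestamp: now.toISOString(),
    min: histogram.min,
    max: histogram.max,
    mean: histogram.mean,
    exceeds: histogram.exceeds,
    stddev: histogram.stddev,
    percentiles,
  };
  // Start a fresh window for the next day.
  histogram.reset();
  return entry;
}
```

Calling this from a daily timer with the histogram returned by monitorEventLoopDelay() would produce one entry per day for the dailyEventLoopDelays array.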

Let's say that's true, then what do we use in the browser? Is there a package that collects data in form of IntervalHistogram?

I'd leave the implementation details for the PR, but just to highlight my thought process: we can easily build our own IntervalHistogram without any libraries, using the same resolution we are using on the server (10 milliseconds by default).

Ideally I would be able to use the Node.js Histogram implementation, and then the browser would just send an array of averages to be calculated on the server side (we're sending the data anyway to be stored for the usage report).

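const perf_hooks = require('perf_hooks');

// create histogram
const histogram = perf_hooks.createHistogram();

// to update every x resolution:
histogram.record(1231);
// to collect the last day's data, read the values and then reset
// (recordDelta() only records the time elapsed since its previous call):
const { min, max, mean } = histogram;
histogram.reset();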
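
For the build-our-own option mentioned above, a minimal library-free sketch that would also run in the browser. The class name, method names and the nearest-rank percentile choice are assumptions for illustration; it keeps every sample in memory, which a real implementation would likely replace with bucketing.

```javascript
// Hypothetical browser-side stand-in for the Node.js histogram.
class SimpleHistogram {
  constructor() {
    this.reset();
  }

  // Record one measured delay (any consistent unit, e.g. milliseconds).
  record(value) {
    this.samples.push(value);
  }

  // Drop all samples, e.g. after the daily report has been built.
  reset() {
    this.samples = [];
  }

  // Summarize the samples: count, min, max, mean and selected percentiles.
  summary(percentiles = [50, 75, 95, 99]) {
    const count = this.samples.length;
    if (count === 0) return { count: 0 };
    const sorted = [...this.samples].sort((a, b) => a - b);
    const sum = sorted.reduce((acc, v) => acc + v, 0);
    const result = {};
    for (const p of percentiles) {
      // Nearest-rank percentile: smallest value with at least p% of samples at or below it.
      const rank = Math.max(1, Math.ceil((p / 100) * count));
      result[p] = sorted[rank - 1];
    }
    return {
      count,
      min: sorted[0],
      max: sorted[count - 1],
      mean: sum / count,
      percentiles: result,
    };
  }
}
```

Each measured delay from the timer-based loop would go through record(), and the daily report would call summary() followed by reset().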

In addition to the delays in the loop on the browser, I was thinking of collecting PerformanceTiming metrics provided by the window.performance API (although still experimental). These would provide more useful timers IMO.

@mshustov
Contributor

mshustov commented Jun 7, 2021

In addition to the delays in the loop on the browser, I was thinking of collecting PerformanceTiming metrics provided by the window.performance API (although still experimental). These would provide more useful timers IMO.

I agree it can be useful, but let's do it as a separate task?

@afharo
Member

afharo commented Aug 17, 2021

@Bamieh since the PR was merged, can we close this issue? Are there any pending items?

@Bamieh
Member Author

Bamieh commented Aug 19, 2021

@afharo The browser side of things is not implemented, only the server side was implemented.

@pgayvallet
Contributor

With our recent shift of priorities due to serverless, I don't think we will ever really need or want to collect browser-side event loop delay, so I'll go ahead and close this (but feel free to reopen if you think I shouldn't have)
