Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handles ns to ms conversion for event loop delay metrics #118447

Merged

Conversation

TinaHeiligers
Copy link
Contributor

@TinaHeiligers TinaHeiligers commented Nov 12, 2021

Resolves #116778

How this PR fixes the issue:

This PR converts the raw data in core's EventLoopDelayMonitor from ns to ms and reports the data in ms. We source the data from Node's native monitorEventLoopDelay histogram, that is in ns.

How was the bug introduced?

  • Event loop delays collection was moved from the kibana_usage_collection plugin to core. There were several other changes that resulted from this move, including changing the process event_loop_delay to read from the mean of the histogram which uses nanoseconds rather than milliseconds.

  • Monitoring gets kibana metrics from the status endpoint, which includes core’s metrics.
    On setup, monitoring initializes the bulkUploader with metrics from core and gets the kibana stats from there.

How was the metric calculated before?

Prior to this change, we were using bench.elapsed from hapi/hoek.
Reference for bench.elapsed internals: https://github.com/hapijs/hoek/blob/master/lib/bench.js#L19-L22.

Screenshots

Actual numbers from these screen shots will differ due to environment. Review relative orders of magnitude between Event loop delay.

7-15 stack deployment (cloud-staging)
cloud_7_15_stack_event_loop_delay
7-16 stack deployment (cloud-staging)
Cloud_staging_7-16-monitoring-UI
** This PR (running locally) **
post_fix_local_8-1

PR deployed on cloud here. Use the same auth creds as running this PR locally.

Checklist

Risk Matrix

Risk Probability Severity Mitigation/Notes
event_loop_delay metrics reported incorrectly. Low Medium Any downstream consumers of metrics data need to ensure that they take metrics data units into account.

Release Note

Fixes bug in event_loop_delay metrics units.

@TinaHeiligers TinaHeiligers added auto-backport Deprecated - use backport:version if exact versions are needed backport-v8.0.0 v7.16.1 v8.1.0 release_note:fix and removed auto-backport Deprecated - use backport:version if exact versions are needed labels Nov 12, 2021
Copy link
Contributor Author

@TinaHeiligers TinaHeiligers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self review

const now = Date.now();

return {
const mockedRawCollectedDataInNs = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mocks the raw data returned from Node's native API in nanoseconds.

return mockedRawCollectedDataInNs;
}

function createMockMonitorDataMsHistogram(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create mock data in ms returned from the monitor


return new MockEventLoopDelaysMonitor();
}

export const mocked = {
createHistogram: createMockHistogram,
createHistogram: createMockRawNsDataHistogram, // raw data as received from Node.js perf_hooks.monitorEventLoopDelay([options])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createHistogram is used to generate unit test input data that the monitor then converts from ns -> ms.

// This ensures that the wiring of the `collect` function is correct.
const mockedHistogram = mocked.createHistogram();
expect(histogramData).toEqual(mockedHistogram);

expect(histogramData.min).toEqual(nsToMs(mockedHistogram.min));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't come up with an elegant way to test all of this that was still explicit enough for us to glance over and realize there's a unit conversion taking place.

Since the units already threw us once, I'm erring on the side of caution here.

* Converts time metric from ns to ms
**/
export function nsToMs(metric: number) {
return moment.duration(metric / ONE_MILLISECOND_AS_NANOSECONDS).asMilliseconds();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm torn between using moment here and just treating the number as a number.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use the native API whenever possible. I'd remove moment usage as it doesn't bring any value.

mean,
exceeds,
stddev,
const collectedDataMs: IntervalHistogram = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation ended up being the easiest to read.

@@ -32,7 +32,7 @@ export const MONITOR_EVENT_LOOP_DELAYS_RESET = 24 * 60 * 60 * 1000;
export const MONITOR_EVENT_LOOP_DELAYS_START = 1 * 60 * 1000;

/**
* Mean event loop delay threshold for logging a warning.
* Mean event loop delay threshold in ms for logging a warning.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the unit here as it wasn't explicit.

@@ -59,19 +59,19 @@ export const eventLoopDelaysUsageSchema: MakeSchemaFrom<EventLoopDelaysUsageRepo
min: {
type: 'long',
_meta: {
description: 'The minimum recorded event loop delay.',
description: 'The minimum recorded event loop delay in ms.',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding units to telemetry field descriptions helps!

const meanDurationMs = moment
.duration(mean / ONE_MILLISECOND_AS_NANOSECONDS)
.asMilliseconds();
const { mean: meanMS } = eventLoopDelaysMonitor.collect();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data's already in ms by now.

@TinaHeiligers TinaHeiligers marked this pull request as ready for review November 13, 2021 17:27
@TinaHeiligers TinaHeiligers requested review from a team as code owners November 13, 2021 17:27
@TinaHeiligers
Copy link
Contributor Author

@elasticmachine merge upstream

@TinaHeiligers
Copy link
Contributor Author

@elasticmachine merge upstream

@TinaHeiligers TinaHeiligers added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Nov 15, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-core (Team:Core)

@TinaHeiligers TinaHeiligers added v8.0.0 auto-backport Deprecated - use backport:version if exact versions are needed and removed backport-v8.0.0 labels Nov 15, 2021
Copy link
Contributor

@pgayvallet pgayvallet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

/**
* Nanosecond to milisecond conversion unit
*/
export const ONE_MILLISECOND_AS_NANOSECONDS = 1000000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
export const ONE_MILLISECOND_AS_NANOSECONDS = 1000000;
export const ONE_MILLISECOND_AS_NANOSECONDS = 1_000_000;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely easier to parse by eye. Nice suggestion!

* Converts time metric from ns to ms
**/
export function nsToMs(metric: number) {
return moment.duration(metric / ONE_MILLISECOND_AS_NANOSECONDS).asMilliseconds();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use the native API whenever possible. I'd remove moment usage as it doesn't bring any value.

},
};

this.loopMonitor.enable();
return collectedData;
return collectedDataMs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not all data in the returned object are in milliseconds.

@TinaHeiligers TinaHeiligers enabled auto-merge (squash) November 15, 2021 18:20
@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💚 Backport successful

Status Branch Result
8.0
7.16

The backport PRs will be merged automatically after passing CI.

kibanamachine added a commit that referenced this pull request Nov 15, 2021
…118612)

Co-authored-by: Kibana Machine <[email protected]>

Co-authored-by: Christiane (Tina) Heiligers <[email protected]>
kibanamachine added a commit that referenced this pull request Nov 15, 2021
…118613)

Co-authored-by: Kibana Machine <[email protected]>

Co-authored-by: Christiane (Tina) Heiligers <[email protected]>
jloleysens added a commit to jloleysens/kibana that referenced this pull request Nov 16, 2021
…igrate-away-from-injected-css-js

* 'main' of github.com:elastic/kibana: (221 commits)
  [Reporting] Add log level to config (elastic#118319)
  [Metrics UI] Skip failing waffle chart color palette test (elastic#118527)
  [APM] chore: Unify naming of 'apm/scripts/**/*' with snake_case (elastic#118328)
  skip flaky suites (elastic#99581, elastic#118356, elastic#118474)
  [Alerting] Initial implementation of alerting task `cancel()` (elastic#114289)
  chore(NA): creates pkg_npm_types bazel rule (elastic#116465)
  skip flaky suite (elastic#116892)
  Bump chromedriver to 95.0.0 (elastic#116724)
  [Data visualizer] Improve design of expanded rows (elastic#118125)
  [APM] prefer ECS field names for HTTP and URL (elastic#118485)
  Update query_debugging_in_development_and_production.md (elastic#118491)
  [Uptime] adjust Elastic Synthetics integration functional tests (elastic#118163)
  [kbn/rule-data-utils] add submodules and require public use them (elastic#117963)
  [DOCS] Refresh APM correlation screenshots (elastic#116723) (elastic#118577)
  Handles ns to ms conversion for event loop delay metrics (elastic#118447)
  [Cases] Rename functional tests folder (elastic#118490)
  dummy commit
  skip flaky suite (elastic#118593)
  Improve workpad schema validation (elastic#97838)
  skip flaky suite (elastic#118584)
  ...

# Conflicts:
#	src/plugins/dashboard/public/application/embeddable/viewport/dashboard_viewport.tsx
mbondyra pushed a commit to mbondyra/kibana that referenced this pull request Nov 19, 2021
dmlemeshko pushed a commit that referenced this pull request Nov 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Deprecated - use backport:version if exact versions are needed release_note:fix Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc v7.16.0 v8.0.0 v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7.16 event loop delay value incorrect
6 participants