Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core.metrics] Add support for multiple processes in ops metrics & stats API; deprecate process field #109820

Merged
merged 14 commits into from
Sep 14, 2021

Conversation

Bamieh
Copy link
Member

@Bamieh Bamieh commented Aug 24, 2021

Summary

  • Deprecate process field
  • Decided not to add a name field for each process since we do not support multi-processes yet.
  • Move event loop delay histogram to core
  • Telemetry collectors to use core's histogram
  • Core.metrics use histogram.mean for event_loop_delay.
  • Core.metrics expose histogram to be used for stats and status api.
  • Updated histogram collection to include inistance_uuid since docker containers will always assign pid: 1 to the kibana process. This fixes a bug for multi kibana instances running on docker machines, or in the rare case more than one server assigns the kibana process the same pid.
  • Update unit tests
  • Add integration tests

/api/status

{
  "processes": 
     [
	    {
	      "memory": {
	        "heap": {
	          "total_in_bytes": 387874816,
	          "used_in_bytes": 332095280,
	          "size_limit": 12935233536
	        },
	        "resident_set_size_in_bytes": 540487680
	      },
	      "pid": 29897,
	      "event_loop_delay": 11417136.036613273,
	      "event_loop_delay_histogram": {
	        "min": 9469952,
	        "max": 19021823,
	        "mean": 11417136.036613273,
	        "exceeds": 0,
	        "stddev": 1132985.4515301601,
	        "fromTimestamp": "2021-08-09T12:24:51.190Z",
	        "lastUpdatedAt": "2021-08-09T12:24:56.185Z",
	        "percentiles": {
	          "50": 11247615,
	          "75": 12599295,
	          "95": 12656639,
	          "99": 14434303
	        }
	      },
	      "uptime_in_millis": 101736.272458
	    }
	  ],
  "process": {
      "memory": {
        "heap": {
          "total_in_bytes": 387874816,
          "used_in_bytes": 332095280,
          "size_limit": 12935233536
        },
        "resident_set_size_in_bytes": 540487680
      },
      "pid": 29897,
      "event_loop_delay": 11417136.036613273,
      "event_loop_delay_histogram": {
        "min": 9469952,
        "max": 19021823,
        "mean": 11417136.036613273,
        "exceeds": 0,
        "stddev": 1132985.4515301601,
        "fromTimestamp": "2021-08-09T12:24:51.190Z",
        "lastUpdatedAt": "2021-08-09T12:24:56.185Z",
        "percentiles": {
          "50": 11247615,
          "75": 12599295,
          "95": 12656639,
          "99": 14434303
        }
      },
      "uptime_in_millis": 101736.272458
    }
}

Closes #104031
Rebased changes from #107900 (there was a weird issue happening when running TS type checker on that branch).

@Bamieh Bamieh added v8.0.0 release_note:skip Skip the PR/issue when compiling release notes auto-backport Deprecated - use backport:version if exact versions are needed v7.16.0 labels Aug 24, 2021
@Bamieh Bamieh requested a review from a team as a code owner August 24, 2021 11:51
@Bamieh
Copy link
Member Author

Bamieh commented Aug 26, 2021

@elasticmachine merge upstream

@Bamieh Bamieh requested a review from lukeelmers September 11, 2021 12:12
@Bamieh Bamieh requested a review from a team as a code owner September 11, 2021 14:27
src/core/server/metrics/collectors/types.ts Outdated Show resolved Hide resolved
Comment on lines +46 to +48
public collect(): OpsProcessMetrics[] {
return [this.getCurrentPidMetrics()];
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(just thinking out loud, please ignore) I know this array format is better for the downstream indexation, but I'm still not a big fan compared to a Record<workerId, Data>

Comment on lines 48 to 50
"fromTimestamp",
"lastUpdatedAt",
"percentiles",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how far we want to go with unit testing (do we have IT test for this?), but maybe having the mocked monitorEventLoopDelay returning a mocked value to assert that we're properly using/mapping the correct values to our structure would make sense?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Youre right the wiring of this function is not tested. I've added a unit test to ensure that.

src/core/server/metrics/types.ts Outdated Show resolved Hide resolved
max: number;
// The mean of the recorded event loop delays.
mean: number;
// The number of times the event loop delay exceeded the maximum 1 hour event loop delay threshold.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 hour?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is taken from https://nodejs.org/api/perf_hooks.html#perf_hooks_histogram_exceeds

I've opened an issue in the node repo to clarify this metric and my understanding of it: nodejs/node#40096

Comment on lines 34 to +39
// @ts-expect-error
delete metrics.process.pid;
for (const process of metrics.processes) {
// @ts-expect-error
delete process.pid;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not directly related to the PR as this was done already, but I see we're cloneDeep'ing the metrics just to remove a few fields. Can't we avoid this costly call with just spreads and/or omits?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree we can improve this logic. I'd suggest we do it in a different RP though

@Bamieh Bamieh requested a review from pgayvallet September 13, 2021 12:51
@Bamieh
Copy link
Member Author

Bamieh commented Sep 13, 2021

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
core 997 1006 +9
Unknown metric groups

API count

id before after diff
core 2249 2265 +16

References to deprecated APIs

id before after diff
kibanaUsageCollection 0 1 +1
monitoring 1 2 +1
total +2

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Copy link
Member

@lukeelmers lukeelmers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a couple small questions, but overall the updates LGTM!

* Side Public License, v 1.
*/

/* eslint-disable dot-notation */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, why disabling dot-notation here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing private methods must be accessed by bracket notation which is not allowed by eslint in the project. example eventLoopDelaysMonitor['loopMonitor'].enable

/** Process related metrics */
/**
* Process related metrics.
* @deprecated use the processes field instead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a @removeBy for the new docs system? (Not sure if it works here)

*/
export interface IntervalHistogram {
// The first timestamp the interval timer kicked in for collecting data points.
fromTimestamp: string;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe worth commenting that these are iso 8601

Copy link
Contributor

@pgayvallet pgayvallet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@Bamieh Bamieh merged commit b74f154 into elastic:master Sep 14, 2021
@Bamieh Bamieh deleted the core/process_multi_process branch September 14, 2021 13:50
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Sep 14, 2021
…ats API; deprecate process field (elastic#109820)

Co-authored-by: Kibana Machine <[email protected]>
@kibanamachine
Copy link
Contributor

💚 Backport successful

Status Branch Result
7.x

This backport PR will be merged automatically after passing CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Deprecated - use backport:version if exact versions are needed release_note:skip Skip the PR/issue when compiling release notes v7.16.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[core.metrics] Add support for multiple processes in ops metrics & stats API; deprecate process field
5 participants