
Experimental self-profiling #2839

Merged
merged 14 commits into from
Nov 29, 2019

Conversation

axw
Member

@axw axw commented Oct 22, 2019

This PR adds experimental support for continuously profiling the server itself, and recording the data in Elasticsearch.

By default the feature is disabled, and must be explicitly enabled through the config file. A new /intake/v2/profile endpoint is registered when CPU or heap profiling is enabled.
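As a rough sketch of what enabling this looks like, something along these lines goes in the config file. The key names below are assumptions for illustration, not confirmed from this PR's diff:

```yaml
# Hypothetical apm-server.yml fragment -- key names are assumptions.
apm-server:
  instrumentation:
    profiling:
      cpu:
        enabled: true     # continuously gather CPU profiles of the server
        interval: 60s     # how often a CPU profile is gathered
        duration: 10s     # length of each CPU profile
      heap:
        enabled: true     # periodically record heap profiles
        interval: 60s
```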

We emit an event per profile sample, roughly as described in the proposal document. Some of the field names have changed. Each document includes a hash of the function names in the call stack. We use https://github.com/OneOfOne/xxhash for hashing, which is both faster and of higher quality than the FNV1a algorithm used in the prototype.
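The stack-hashing idea can be sketched as below. For portability this sketch uses the stdlib's FNV-1a; the PR itself swaps in github.com/OneOfOne/xxhash, which has an analogous streaming API (New64/WriteString/Sum64). Function names and the separator choice here are illustrative, not taken from the PR:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// stackHash returns a hash over the function names in a call stack,
// so identical stacks across samples map to the same identifier.
// The PR uses xxhash instead of FNV-1a; the structure is the same.
func stackHash(funcs []string) uint64 {
	h := fnv.New64a()
	for _, name := range funcs {
		h.Write([]byte(name))
		h.Write([]byte{0}) // separator so {"ab","c"} != {"a","bc"}
	}
	return h.Sum64()
}

func main() {
	stack := []string{"runtime.main", "main.main", "main.work"}
	fmt.Printf("%x\n", stackHash(stack))
}
```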

TODO:

  • complete and merge the Go agent branch
  • disable the profile endpoint by default
  • define response format for profile endpoint
  • generate a unique ID for each profile, so we can obtain samples for a specific profile
  • add tests

@axw axw force-pushed the server-profiling branch 3 times, most recently from 630c4a3 to dc31cca Compare November 21, 2019 03:38
@axw axw changed the base branch from 7.x to master November 21, 2019 03:39
@axw axw force-pushed the server-profiling branch 2 times, most recently from a77ebf9 to c0dfd59 Compare November 21, 2019 04:10
@axw axw marked this pull request as ready for review November 21, 2019 04:15
@axw axw force-pushed the server-profiling branch 6 times, most recently from 77f5824 to f7b934d Compare November 21, 2019 07:09
@codecov-io

codecov-io commented Nov 21, 2019

Codecov Report

Merging #2839 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #2839   +/-   ##
=======================================
  Coverage   78.47%   78.47%           
=======================================
  Files          89       89           
  Lines        4711     4711           
=======================================
  Hits         3697     3697           
  Misses       1014     1014

Contributor

@simitt simitt left a comment


Would you mind creating issues for the TODO comments and linking to them?

Some integration tests are missing. We usually have integration tests covering how the incoming payload is processed and transformed; you can look at existing integration tests where we capture the event before calling the libbeat publisher.
We also usually add a test to the system_tests, checking that actual ingestion works as expected (the ES template and the ingested event fit together, etc.).

Do the agent devs have a way to get the spec for how the payload has to look? Usually they pull the JSON spec and test against it, so they don't have to actually run the APM Server in their integration tests.

beater/config/config.go Outdated Show resolved Hide resolved
model/profile/_meta/fields.yml Outdated Show resolved Hide resolved
model/profile/_meta/fields.yml Show resolved Hide resolved
beater/api/profile/handler.go Outdated Show resolved Hide resolved
beater/api/profile/handler.go Outdated Show resolved Hide resolved
beater/api/profile/handler.go Outdated Show resolved Hide resolved
beater/api/profile/handler.go Outdated Show resolved Hide resolved
beater/api/profile/handler.go Outdated Show resolved Hide resolved
model/profile/_meta/fields.yml Show resolved Hide resolved
@axw
Member Author

axw commented Nov 22, 2019

Would you mind creating issues for the TODO comments and linking to them?

#2954
#2955

Some integration tests are missing. We usually have integration tests covering how the incoming payload is processed and transformed; you can look at existing integration tests where we capture the event before calling the libbeat publisher.
We also usually add a test to the system_tests, checking that actual ingestion works as expected (the ES template and the ingested event fit together, etc.).

👍 Looking into this now.

Do the agent devs have a way to get the spec for how the payload has to look? Usually they pull the JSON spec and test against it, so they don't have to actually run the APM Server in their integration tests.

Yes, it's protobuf encoded: https://github.com/google/pprof/blob/master/proto/profile.proto. If/when we open this feature up to applications as well as the server, then this will be better documented.

@axw
Member Author

axw commented Nov 22, 2019

@simitt I've added integration tests. I'll try and have a look at system tests next week.

```json
"profile": {
  "cpu.ns": 20000000,
  "duration": 5100778105,
  "id": "d58847c89c275eba",
```
Contributor

Shouldn't the id of a profile be unique, and the timestamp usually differ per profile? Is it possible to extract 2 or 3 unique profiles per example?

Member Author

A profile is just a collection of samples, and each document corresponds to one sample. I added the profile.id field so that we could group all samples by the profile that they came from.

I've since removed this field for a few reasons:

  • It's not needed right now
  • We can potentially use the @timestamp field instead (all samples from a profile will have the same timestamp value)
  • The field won't be very meaningful if we start performing additional aggregation of samples in the server
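Following the reasoning above, all samples belonging to one profile could be fetched by filtering on their shared timestamp. A hedged sketch of such an Elasticsearch query; the processor.event field name and the example timestamp are assumptions for illustration:

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "processor.event": "profile" } },
        { "term": { "@timestamp": "2019-11-27T03:32:00.000Z" } }
      ]
    }
  }
}
```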

@axw axw force-pushed the server-profiling branch from 0fc203d to 8c44037 Compare November 27, 2019 03:32
@axw
Member Author

axw commented Nov 27, 2019

Jenkins run the tests please

@axw
Member Author

axw commented Nov 27, 2019

We also usually add a test to the system_tests, checking that the actual ingestion works as expected (ES template + ingested event fit together, ..)

I added a couple of system tests that show that with appropriate config, the server will profile its own CPU and heap usage.

While I was doing that I found that there was an issue :) I'd missed adding "profile" as an event type to the idxmgmt package.

@simitt
Contributor

simitt commented Nov 27, 2019

Thanks a bunch for bringing this to the server!

@axw axw force-pushed the server-profiling branch from be3eec8 to ab5e8d9 Compare November 27, 2019 12:02
@axw axw force-pushed the server-profiling branch from a64bb70 to 270266e Compare November 29, 2019 08:57
@axw axw force-pushed the server-profiling branch from 270266e to c600819 Compare November 29, 2019 09:54
@axw axw merged commit 2b1f9e9 into elastic:master Nov 29, 2019
@axw axw deleted the server-profiling branch November 29, 2019 11:27
graphaelli pushed a commit to graphaelli/apm-server that referenced this pull request Dec 11, 2019
* vendor: add github.com/google/pprof/profile

* model/profile: add profile data model

* config: config for profiling the server

* beater: add profiling endpoint

* vendor: add github.com/OneOfOne/xxhash

Already being used by Beats.

* model: fix fields

* beater/api/profile: address review comments

* beater/config: split InstrumentationConfig out

* decoder: add LimitedReader

* beater/api/profile: use decoder.LimitedReader

* Add integration test

* idxmgmt: add "profile" event index

* tests/system: add system tests for profiling

* model/profile: remove profile.id

Not needed for now, and will get in the way of storage
optimisations (aggregating multiple profiles) later.
graphaelli added a commit that referenced this pull request Dec 11, 2019