cmd/pprofessor: introduce pprof-based tool #4017

axw · 2020-07-23T03:30:39Z

Motivation/summary

pprofessor is based on the pprof tool, using an alternative
"Fetcher" implementation that queries Elasticsearch to
aggregate continuous profiling samples recorded by APM Server.

I haven't gone through all of the pprof flags to make sure
they work, so some may not make sense. There are a couple of
flags added specifically for our use case:

* -service, which defines the service name to filter on.
* -start, which defines the start timestamp (or date math)
  from which to start aggregating. If specified along with
  -seconds, the latter controls the end time. If only -start
  is specified, then the end time is now. If only -seconds is
  specified, then the end time is now and the start time is
  that many seconds before now.
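
As a rough illustration of how -start and -seconds combine (a sketch, not the code in this PR): the `flags` type, `resolveRange` and `mustParse` are hypothetical names, date math such as now-1h is not handled, and the behaviour when neither flag is given is an assumption, since the description does not cover that case.

```go
package main

import (
	"fmt"
	"time"
)

// flags is a hypothetical holder for the two time-related options.
// A zero Start means -start was not given; Seconds <= 0 means -seconds
// was not given.
type flags struct {
	Start   time.Time
	Seconds int
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
// It exists only to keep the example short.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// resolveRange returns the aggregation window implied by the flags.
func resolveRange(now time.Time, f flags) (from, to time.Time) {
	d := time.Duration(f.Seconds) * time.Second
	switch {
	case !f.Start.IsZero() && f.Seconds > 0:
		// Both given: -seconds controls the end time.
		return f.Start, f.Start.Add(d)
	case !f.Start.IsZero():
		// Only -start: aggregate up to now.
		return f.Start, now
	case f.Seconds > 0:
		// Only -seconds: the window ends now.
		return now.Add(-d), now
	default:
		// Assumption: with neither flag there is no lower bound.
		return time.Time{}, now
	}
}

func main() {
	now := mustParse("2020-08-05T12:00:00Z")
	start := mustParse("2020-08-05T11:00:00Z")
	from, to := resolveRange(now, flags{Start: start, Seconds: 600})
	fmt.Println(from.Format(time.RFC3339), to.Format(time.RFC3339))
}
```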

We record the number of profiles and documents aggregated in
profile comments. All metrics (currently: cpu, inuse/allocated
heap objects/space) are aggregated into a single profile.

To aggregate duration correctly, we introduce the profile.id
field, which is a unique ID generated per profile, shared by
all sample docs derived from a profile.
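
To show why a per-profile ID matters for duration, here is a minimal sketch that assumes every sample doc repeats the duration of its parent profile. The `sampleDoc` type and `totalDuration` function are hypothetical, and the real tool aggregates in Elasticsearch rather than in memory. Summing per document would count a profile's duration once per sample; keying on the ID counts it once per profile.

```go
package main

import (
	"fmt"
	"time"
)

// sampleDoc is a hypothetical, pared-down view of one profile sample document.
type sampleDoc struct {
	ProfileID string        // shared by all samples derived from one profile
	Duration  time.Duration // duration of the parent profile, repeated per sample
}

// totalDuration adds each profile's duration exactly once, and also reports
// how many distinct profiles were seen.
func totalDuration(docs []sampleDoc) (total time.Duration, profiles int) {
	seen := make(map[string]bool)
	for _, d := range docs {
		if seen[d.ProfileID] {
			continue
		}
		seen[d.ProfileID] = true
		total += d.Duration
	}
	return total, len(seen)
}

func main() {
	docs := []sampleDoc{
		{ProfileID: "a", Duration: 10 * time.Second},
		{ProfileID: "a", Duration: 10 * time.Second},
		{ProfileID: "b", Duration: 10 * time.Second},
	}
	total, n := totalDuration(docs)
	fmt.Println(total, n) // prints "20s 2", not 30s
}
```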

No automated tests for now, and comes with no warranty or support.

Note that there is a known issue with how the Go Agent reports "alloc"
samples: it reports ever-increasing counters, rather than deltas. We
need to change this, so we can visualise the allocations within a defined
time range. See elastic/apm-agent-go#708
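
For illustration only, the difference between cumulative counters and deltas can be sketched as follows. The `deltas` function is hypothetical, assumes the counter never resets (for example on process restart), and is not how the agent change is or will be implemented. With deltas, the allocations in a time range are simply the sum of the values reported in that range.

```go
package main

import "fmt"

// deltas converts successive readings of an ever-increasing counter into
// per-interval increments. It assumes the counter never resets.
func deltas(cumulative []int64) []int64 {
	if len(cumulative) < 2 {
		return nil
	}
	out := make([]int64, 0, len(cumulative)-1)
	for i := 1; i < len(cumulative); i++ {
		out = append(out, cumulative[i]-cumulative[i-1])
	}
	return out
}

func main() {
	// Three successive "alloc" readings reported as running totals.
	fmt.Println(deltas([]int64{100, 150, 175})) // prints "[50 25]"
}
```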

Checklist

I have signed the Contributor License Agreement.
~~- [ ] I have updated CHANGELOG.asciidoc~~

I have considered changes for:
~~- [ ] documentation~~
~~- [ ] logging (add log lines, choose appropriate log selector, etc.)~~
~~- [ ] metrics and monitoring (create issue for Kibana team to add metrics to visualizations, e.g. Kibana#44001)~~

automated tests (add tests for the code changes, all unit tests pass locally)
~~- [ ] telemetry~~
~~- [ ] Elasticsearch Service (https://cloud.elastic.co)~~
~~- [ ] Elastic Cloud Enterprise (https://www.elastic.co/products/ece)~~
~~- [ ] Elastic Cloud on Kubernetes (https://www.elastic.co/elastic-cloud-kubernetes)~~

How to test these changes

Run APM Server with CPU and heap profiling enabled:

apm-server -E apm-server.instrumentation.enabled=true -E apm-server.instrumentation.profiling.cpu.enabled=true -E apm-server.instrumentation.profiling.cpu.interval=10s -E apm-server.instrumentation.profiling.heap.enabled=true

Run pprofessor:

go run ./cmd/pprofessor -start=now-1h -service=apm-server --http=:6060 http://admin:changeme@localhost:9200

Demo

Related issues

Closes #3828

apmmachine · 2020-07-23T03:34:28Z

💚 Build Succeeded

Build stats

Build Cause: [Pull request #4017 updated]
Start Time: 2020-08-05T03:52:09.483+0000
Duration: 46 min 26 sec

Test stats 🧪

Test	Results
Failed	0
Passed	3223
Skipped	153
Total	3376

Steps errors

Name: Compress
- Description: tar --exclude=coverage-files.tgz -czf coverage-files.tgz coverage
- Duration: 0 min 0 sec
- Start Time: 2020-08-05T04:05:23.112+0000

Name: Compress
- Description: tar --exclude=system-tests-linux-files.tgz -czf system-tests-linux-files.tgz system-tests
- Duration: 0 min 0 sec
- Start Time: 2020-08-05T04:29:41.230+0000

Name: Test Sync
- Description: ./script/jenkins/sync.sh
- Duration: 3 min 48 sec
- Start Time: 2020-08-05T04:02:19.840+0000

jalvz · 2020-08-04T09:45:37Z

This doesn't seem to work for me with the 7.8 stack:

http://admin:changeme@localhost:9200: [400 Bad Request] {"error":{"root_cause":[{"type":"x_content_parse_exception","reason":"[1:103] [composite] after doesn't support values of type: VALUE_NULL"}],"type":"x_content_parse_exception","reason":"[1:103] [composite] after doesn't support values of type: VALUE_NULL"},"status":400}

A few other notes/questions:

- Non obvious requirements for running this should be documented (min stack version, Graphviz, etc)
- The command semantics are not so intuitive to me:
  - Since this is mostly for us, I'd expect service to be hardcoded to apm-server, or at least default to it.
  - The behaviour of specifying only seconds is conventional? it doesn't look so useful..
  - I think the Elasticsearch URL should have its own flag, as other required arguments have it too.
- While not tested, I'd remove all non-essential code, like tls_ca handling.
- The expression "unique profile ID shared by all samples" is not so intuitive, consider profileKey, profileGroupID, or similar.
- It would be more coherent to use types for both Elasticsearch responses and queries or none, but not only responses.
- I think Elasticsearch types should live on their own package, or at least their own file.

axw · 2020-08-04T13:06:39Z

@jalvz thanks for the review!

> http://admin:changeme@localhost:9200: [400 Bad Request] {"error":{"root_cause":[{"type":"x_content_parse_exception","reason":"[1:103] [composite] after doesn't support values of type: VALUE_NULL"}],"type":"x_content_parse_exception","reason":"[1:103] [composite] after doesn't support values of type: VALUE_NULL"},"status":400}

That's probably because there's a composite aggregation on profile.id, which is introduced in this PR. You would need to build and run apm-server to use the tool. (Part of the reason for creating this tool is to surface missed data/requirements such as this.)

> Non obvious requirements for running this should be documented (min stack version, Graphviz, etc)

I can add a basic README. Until then: my intention was for the tool to be used with the same (branch) stack & server version. Otherwise, no additional dependencies apart from pprof's (which includes Graphviz).

> The command semantics are not so intuitive to me:
> Since this is mostly for us, I'd expect service to be hardcoded to apm-server, or at least default to it.

I'll set "apm-server" as the default.

> The behaviour of specifying only seconds is conventional? it doesn't look so useful..

That's one of the standard pprof flags. If you run go tool pprof -seconds=60 http://service:1234/debug/pprof/profile it'll profile for 60 seconds.

> I think the Elasticsearch URL should have its own flag, as other required arguments have it too.

This is part of the pprof tool: the "source" (i.e. Elasticsearch URL in this case) is a positional argument.

> While not tested, I'd remove all non-essential code, like tls_ca handling.

Sure, I'll remove it until it's needed.

> The expression "unique profile ID shared by all samples" is not so intuitive, consider profileKey, profileGroupID, or similar.

Are you referring to the wording in fields.yml? I'll try to clarify the description if that's what you mean. I don't think we should change the field name. It is the ID of the profile to which the samples belong. Parent/child (1:n) relationship.

> It would be more coherent to use types for both Elasticsearch responses and queries or none, but not only responses.

The reason for having struct types for responses is because we need to operate on the results. The requests are just JSON marshalled, so it doesn't really matter if they're structs or maps. I did what's simplest.
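
A minimal sketch of that split, for illustration rather than as the code in this PR: the request is a plain map that is only marshalled, while the response is decoded into a struct because its buckets are operated on. The aggregation name "profiles", the source name "profile_id", the `service.name` term filter and `decodeResponse` are all illustrative assumptions. The `after_key`, `buckets`, `key` and `doc_count` fields follow the documented shape of a composite aggregation response.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// compositeResponse models only the parts of a composite aggregation
// response that the caller needs to walk.
type compositeResponse struct {
	Aggregations struct {
		Profiles struct {
			AfterKey map[string]interface{} `json:"after_key"`
			Buckets  []struct {
				Key      map[string]interface{} `json:"key"`
				DocCount int64                  `json:"doc_count"`
			} `json:"buckets"`
		} `json:"profiles"`
	} `json:"aggregations"`
}

// decodeResponse unmarshals a raw search response body.
func decodeResponse(body []byte) (compositeResponse, error) {
	var resp compositeResponse
	err := json.Unmarshal(body, &resp)
	return resp, err
}

func main() {
	// The request is never inspected after it is built, so a map is enough.
	req := map[string]interface{}{
		"size": 0,
		"query": map[string]interface{}{
			"term": map[string]interface{}{"service.name": "apm-server"},
		},
	}
	body, err := json.Marshal(req)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))

	// A hand-written response, standing in for what Elasticsearch returns.
	raw := []byte(`{"aggregations":{"profiles":{"after_key":{"profile_id":"b"},
		"buckets":[{"key":{"profile_id":"a"},"doc_count":3},
		           {"key":{"profile_id":"b"},"doc_count":2}]}}}`)
	resp, err := decodeResponse(raw)
	if err != nil {
		panic(err)
	}
	for _, b := range resp.Aggregations.Profiles.Buckets {
		fmt.Println(b.Key["profile_id"], b.DocCount)
	}
}
```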

> I think Elasticsearch types should live on their own package, or at least their own file.

-1 on separate package, see https://dave.cheney.net/practical-go/presentations/qcon-china.html#_consider_fewer_larger_packages

If you think it's really worthwhile I can create a separate file, but I don't think it is. The types are used in exactly one place.

jalvz · 2020-08-04T13:41:20Z

> You would need to build and run apm-server to use the tool

Don't follow. Did make and make update, then ran with relevant flags. Anything else needed?

> no additional dependencies apart from pprof's (which includes Graphviz).

Hmm, I actually had to apt-install it...

axw · 2020-08-05T00:23:14Z

> > You would need to build and run apm-server to use the tool
>
> Don't follow. Did make and make update, then ran with relevant flags. Anything else needed?

Looking at the error message again, maybe I introduced a bug in between testing and proposing this PR. I'll look into it and come back.

> > no additional dependencies apart from pprof's (which includes Graphviz).
>
> Hmm, I actually had to apt-install it...

Sorry, what I meant was "no additional dependencies other than pprof's dependencies, among which is Graphviz" -- not that it's included within the pprof binary.

axw · 2020-08-05T03:52:48Z

> Looking at the error message again, maybe I introduced a bug in between testing and proposing this PR. I'll look into it and come back.

I hadn't tested under the scenario where there was an existing index/template without the profile.id field. If you remove your index and overwrite the template then it should work. Either way, I've added a bug fix which will avoid the error noted.
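
The shape of such a fix can be sketched as below, assuming the 400 came from sending "after": null on a page where no after_key had been returned. `compositeAgg`, the source name "profile_id" and the page size are hypothetical and not necessarily what the commit does. The "after" key is added only when there is a previous page to continue from.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// compositeAgg builds the body of a composite aggregation over profile.id.
// The source name and page size are illustrative.
func compositeAgg(after map[string]interface{}) map[string]interface{} {
	composite := map[string]interface{}{
		"size": 1000,
		"sources": []interface{}{
			map[string]interface{}{
				"profile_id": map[string]interface{}{
					"terms": map[string]interface{}{"field": "profile.id"},
				},
			},
		},
	}
	// Only paginate when the previous page returned an after_key. A nil map
	// would otherwise be marshalled as "after": null.
	if len(after) > 0 {
		composite["after"] = after
	}
	return composite
}

func main() {
	first, err := json.Marshal(compositeAgg(nil))
	if err != nil {
		panic(err)
	}
	next, err := json.Marshal(compositeAgg(map[string]interface{}{"profile_id": "abc"}))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(first)) // no "after" key on the first page
	fmt.Println(string(next))  // "after" carries the previous after_key
}
```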

I've added a README and (hopefully) clarified the meaning of profile.id.

jalvz

thanks for the readme and changes!

axw · 2020-08-05T08:34:51Z

Merging without a changelog entry, as this is not intended for end users.

* cmd/pprofessor: introduce pprof-based tool

pprofessor is based on the pprof tool, using an alternative
"Fetcher" implementation that queries Elasticsearch to
aggregate profile samples recorded by APM Server.

I haven't gone through all of the pprof flags to make sure
they work, so some may not make sense. There are a couple of
flags added specifically for our use case:

* -service, which defines the service name to filter on.
* -start, which defines the start timestamp from which to
  start aggregating. If specified along with -duration, the
  latter controls the end time. If only -start is specified,
  then the end time is now. If only -duration is specified,
  then the end time is now and the start time is duration
  seconds before now.

We record the number of profiles and documents aggregated in
profile comments. All metrics (currently: cpu, inuse/allocated
heap objects/space) are aggregated into a single profile.

To aggregate duration correctly, we introduce the `profile.id`
field, which is a unique ID generated per profile, shared by
all sample docs derived from a profile.

* cmd/pprofessor: set default service name

* cmd/pprofessor: remove -tls_ca handling

* cmd/pprofessor: rename Fetcher src param

* cmd/pprofessor: don't set nil "after" in composite

This handles the case where there's no `profile.id` keyword field,
e.g. because the template hasn't been overwritten.

* cmd/pprofessor: add minimal README

* model/profile: update wording for "profile.id"

Update to match the wording for "profile.duration".

axw force-pushed the pprofessor branch from a682e30 to cd7adf0 Compare July 23, 2020 03:31

axw force-pushed the pprofessor branch from cd7adf0 to e656a0b Compare July 23, 2020 03:50

axw force-pushed the pprofessor branch from e656a0b to f3060c7 Compare July 23, 2020 07:19

Merge branch 'master' into pprofessor

3539822

axw marked this pull request as ready for review July 23, 2020 09:27

axw requested a review from a team July 23, 2020 09:27

simitt reviewed Aug 4, 2020

cmd/pprofessor/fetcher.go (outdated, resolved)

axw added 3 commits August 4, 2020 21:07

cmd/pprofessor: set default service name

8a2d4b2

cmd/pprofessor: remove -tls_ca handling

424e8ea

cmd/pprofessor: rename Fetcher src param

800e70e

axw added 4 commits August 5, 2020 11:39

cmd/pprofessor: don't set nil "after" in composite

26c3dff

This handles the case where there's no `profile.id` keyword field, e.g. because the template hasn't been overwritten.

cmd/pprofessor: add minimal README

db97141

Merge branch 'master' into pprofessor

aef36fb

model/profile: update wording for "profile.id"

d7d2213

Update to match the wording for "profile.duration".

jalvz approved these changes Aug 5, 2020

axw merged commit f394d6f into elastic:master Aug 5, 2020

axw deleted the pprofessor branch August 5, 2020 08:35

axw mentioned this pull request Sep 7, 2020

[7.x] cmd/pprofessor: introduce pprof-based tool (#4017) #4160

Merged

jalvz added the v7.10.0 label Oct 12, 2020

jalvz added the test-plan label Oct 12, 2020

simitt self-assigned this Oct 14, 2020

simitt added the test-plan-ok label Oct 14, 2020

dgieselaar mentioned this pull request Feb 19, 2021

[APM] Profiling elastic/kibana#91818

Merged
