Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/pprofessor: introduce pprof-based tool #4017

Merged
merged 9 commits into from
Aug 5, 2020
Merged

Conversation

axw
Copy link
Member

@axw axw commented Jul 23, 2020

Motivation/summary

pprofessor is based on the pprof tool, using an alternative
"Fetcher" implementation that queries Elasticsearch to
aggregate continuous profiling samples recorded by APM Server.

I haven't gone through all of the pprof flags to make sure
they work, so some may not make sense. There are a couple of
flags added specifically for our use case:

  • -service, which defines the service name to filter on.
  • -start, which defines the start timestamp (or date math)
    from which to start aggregating. If specified along with
    -seconds, the latter controls the end time. If only -start
    is specified, then the end time is now. If only -seconds is
    specified, then the end time is now and the start time is duration
    seconds before now.

We record the number of profiles and documents aggregated in
profile comments. All metrics (currently: cpu, inuse/allocated
heap objects/space) are aggregated into a single profile.

To aggregate duration correctly, we introduce the profile.id
field, which is a unique ID generated per profile, shared by
all sample docs derived from a profile.

No automated tests for now, and comes with no warranty or support.

Note that there is a known issue with how the Go Agent reports "alloc"
samples: it reports ever-increasing counters, rather than deltas. We
need to change this, so we can visualise the allocations within a define
time range. See elastic/apm-agent-go#708

Checklist

I have considered changes for:
- [ ] documentation
- [ ] logging (add log lines, choose appropriate log selector, etc.)
- [ ] metrics and monitoring (create issue for Kibana team to add metrics to visualizations, e.g. Kibana#44001)

How to test these changes

  • Run APM Server with CPU and Heap profiling enabled in APM Server:

apm-server -E apm-server.instrumentation.enabled=true -E apm-server.instrumentation.profiling.cpu.enabled=true -E apm-server.instrumentation.profiling.cpu.interval=10s -E apm-server.instrumentation.profiling.heap.enabled=true

  • Run pprofessor:

go run ./cmd/pprofessor -start=now-1h -service=apm-server --http=:6060 http://admin:changeme@localhost:9200

Demo

pprofessor3

Related issues

Closes #3828

@apmmachine
Copy link
Contributor

apmmachine commented Jul 23, 2020

💚 Build Succeeded

Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: [Pull request #4017 updated]

  • Start Time: 2020-08-05T03:52:09.483+0000

  • Duration: 46 min 26 sec

Test stats 🧪

Test Results
Failed 0
Passed 3223
Skipped 153
Total 3376

Steps errors

Expand to view the steps failures

  • Name: Compress

    • Description: tar --exclude=coverage-files.tgz -czf coverage-files.tgz coverage

    • Duration: 0 min 0 sec

    • Start Time: 2020-08-05T04:05:23.112+0000

    • log

  • Name: Compress

    • Description: tar --exclude=system-tests-linux-files.tgz -czf system-tests-linux-files.tgz system-tests

    • Duration: 0 min 0 sec

    • Start Time: 2020-08-05T04:29:41.230+0000

    • log

  • Name: Test Sync

    • Description: ./script/jenkins/sync.sh

    • Duration: 3 min 48 sec

    • Start Time: 2020-08-05T04:02:19.840+0000

    • log

pprofessor is based on the pprof tool, using an alternative
"Fetcher" implementation that queries Elasticsearch to aggregate
profile samples recorded by APM Server.

I haven't gone through all of the pprof flags to make sure
they work, so some may not make sense. There are a couple of
flags added specifically for our use case:

* -service, which defines the service name to filter on.
* -start, which defines the start timestamp from which to
  start aggregating. If specified along with -duration, the
  latter  controls the end time. If only -start is specified,
  then the end time is now. If only -duration is specified,
  then the end time is now and the start time is duration
  seconds before now.

We record the number of profiles and documents aggregated in
profile comments. All metrics (currently: cpu, inuse/allocated
heap objects/space) are aggregated into a single profile.
To aggregate duration correctly, we introduce the `profile.id`
field, which is a unique ID generated per profile, shared by
all sample docs derived from a profile.
@axw axw marked this pull request as ready for review July 23, 2020 09:27
@axw axw requested a review from a team July 23, 2020 09:27
@jalvz
Copy link
Contributor

jalvz commented Aug 4, 2020

This doesn't seem to work for me with the 7.8 stack:

http://admin:changeme@localhost:9200: [400 Bad Request] {"error":{"root_cause":[{"type":"x_content_parse_exception","reason":"[1:103] [composite] after doesn't support values of type: VALUE_NULL"}],"type":"x_content_parse_exception","reason":"[1:103] [composite] after doesn't support values of type: VALUE_NULL"},"status":400}

A few other notes/questions:

  • Non obvious requirements for running this should be documented (min stack version, Graphviz, etc)
  • The command semantics are not so intuitive to me:
    Since this is mostly for us, I'd expect service to be hardcoded to apm-server, or at least default to it.
    The behaviour of specifying only seconds is conventional? it doesn't look so useful..
    I think the Elasticsearch URL should have its own flag, as other required arguments have it too.
  • While not tested, I'd remove all non-essential code, like tls_ca handling.
  • The expression "unique profile ID shared by all samples" is not so intuitive, consider profileKey, profileGroupID, or similar.
  • It would be more coherent to use types for both Elasticsearch responses and queries or none, but not only responses.
  • I think Elasticsearch types should live on their own package, or at least their own file.

@axw
Copy link
Member Author

axw commented Aug 4, 2020

@jalvz thanks for the review!

http://admin:changeme@localhost:9200: [400 Bad Request] {"error":{"root_cause":[{"type":"x_content_parse_exception","reason":"[1:103] [composite] after doesn't support values of type: VALUE_NULL"}],"type":"x_content_parse_exception","reason":"[1:103] [composite] after doesn't support values of type: VALUE_NULL"},"status":400}

That's probably because there's a composite aggregation on profile.id, which is introduced in this PR. You would need to build and run apm-server to use the tool. (Part of the reason for creating this tool is to surface missed data/requirements such as this.)

Non obvious requirements for running this should be documented (min stack version, Graphviz, etc)

I can add a basic README. Until then: my intention was for the tool to be used with the same (branch) stack & server version. Otherwise, no additional dependencies apart from pprof's (which includes Graphviz).

The command semantics are not so intuitive to me:
Since this is mostly for us, I'd expect service to be hardcoded to apm-server, or at least default to it.

I'll set "apm-server" as the default.

The behaviour of specifying only seconds is conventional? it doesn't look so useful..

That's one of the standard pprof flags. If you run go tool pprof -seconds=60 http://service:1234/debug/pprof/profile it'll profile for 60 seconds.

I think the Elasticsearch URL should have its own flag, as other required arguments have it too.

This is part of the pprof tool: the "source" (i.e. Elasticsearch URL in this case) is a positional argument.

While not tested, I'd remove all non-essential code, like tls_ca handling.

Sure, I'll remove it until it's needed.

The expression "unique profile ID shared by all samples" is not so intuitive, consider profileKey, profileGroupID, or similar.

Are you referring to the wording in fields.yml? I'll try to clarify the description if that's what you mean. I don't think we should change the field name. It is the ID of the profile to which the samples belong. Parent/child (1:n) relationship.

It would be more coherent to use types for both Elasticsearch responses and queries or none, but not only responses.

The reason for having struct types for responses is because we need to operate on the results. The requests are just JSON marshalled, so it doesn't really matter if they're structs or maps. I did what's simplest.

I think Elasticsearch types should live on their own package, or at least their own file.

-1 on separate package, see https://dave.cheney.net/practical-go/presentations/qcon-china.html#_consider_fewer_larger_packages

If you think it's really worthwhile I can create a separate file, but I don't think it is. The types are used in exactly one place.

@jalvz
Copy link
Contributor

jalvz commented Aug 4, 2020

You would need to build and run apm-server to use the tool

Don't follow. Did make and make update, then ran with relevant flags. Anything else needed?

no additional dependencies apart from pprof's (which includes Graphviz).

Hmm, I actually had to apt-install it...

@axw
Copy link
Member Author

axw commented Aug 5, 2020

You would need to build and run apm-server to use the tool

Don't follow. Did make and make update, then ran with relevant flags. Anything else needed?

Looking at the error message again, maybe I introduced a bug in between testing and proposing this PR. I'll look into it and come back.

no additional dependencies apart from pprof's (which includes Graphviz).

Hmm, I actually had to apt-install it...

Sorry, what I meant was "no additional dependencies other than pprof's dependencies, among which is Graphviz" -- not that it's included within the pprof binary.

axw added 4 commits August 5, 2020 11:39
This handles the case where there's no `profile.id`
keyword field, e.g. because the template hasn't been
overwritten.
Update to match the wording for "profile.duration".
@axw
Copy link
Member Author

axw commented Aug 5, 2020

Looking at the error message again, maybe I introduced a bug in between testing and proposing this PR. I'll look into it and come back.

I hadn't tested under the scenario where there was an existing index/template without the profile.id field. If you remove your index and overwrite the template then it should work. Either way, I've added a bug fix which will avoid the error noted.

I've added a README and (hopefully) clarified the meaning of profile.id.

Copy link
Contributor

@jalvz jalvz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the readme and changes!

@axw
Copy link
Member Author

axw commented Aug 5, 2020

Merging without a changelog entry, as this is not intended for end users.

@axw axw merged commit f394d6f into elastic:master Aug 5, 2020
@axw axw deleted the pprofessor branch August 5, 2020 08:35
axw added a commit to axw/apm-server that referenced this pull request Sep 7, 2020
* cmd/pprofessor: introduce pprof-based tool

pprofessor is based on the pprof tool, using an alternative
"Fetcher" implementation that queries Elasticsearch to aggregate
profile samples recorded by APM Server.

I haven't gone through all of the pprof flags to make sure
they work, so some may not make sense. There are a couple of
flags added specifically for our use case:

* -service, which defines the service name to filter on.
* -start, which defines the start timestamp from which to
  start aggregating. If specified along with -duration, the
  latter  controls the end time. If only -start is specified,
  then the end time is now. If only -duration is specified,
  then the end time is now and the start time is duration
  seconds before now.

We record the number of profiles and documents aggregated in
profile comments. All metrics (currently: cpu, inuse/allocated
heap objects/space) are aggregated into a single profile.
To aggregate duration correctly, we introduce the `profile.id`
field, which is a unique ID generated per profile, shared by
all sample docs derived from a profile.

* cmd/pprofessor: set default service name

* cmd/pprofessor: remove -tls_ca handling

* cmd/pprofessor: rename Fetcher src param

* cmd/pprofessor: don't set nil "after" in composite

This handles the case where there's no `profile.id`
keyword field, e.g. because the template hasn't been
overwritten.

* cmd/pprofessor: add minimal README

* model/profile: update wording for "profile.id"

Update to match the wording for "profile.duration".
axw added a commit to axw/apm-server that referenced this pull request Sep 7, 2020
* cmd/pprofessor: introduce pprof-based tool

pprofessor is based on the pprof tool, using an alternative
"Fetcher" implementation that queries Elasticsearch to aggregate
profile samples recorded by APM Server.

I haven't gone through all of the pprof flags to make sure
they work, so some may not make sense. There are a couple of
flags added specifically for our use case:

* -service, which defines the service name to filter on.
* -start, which defines the start timestamp from which to
  start aggregating. If specified along with -duration, the
  latter  controls the end time. If only -start is specified,
  then the end time is now. If only -duration is specified,
  then the end time is now and the start time is duration
  seconds before now.

We record the number of profiles and documents aggregated in
profile comments. All metrics (currently: cpu, inuse/allocated
heap objects/space) are aggregated into a single profile.
To aggregate duration correctly, we introduce the `profile.id`
field, which is a unique ID generated per profile, shared by
all sample docs derived from a profile.

* cmd/pprofessor: set default service name

* cmd/pprofessor: remove -tls_ca handling

* cmd/pprofessor: rename Fetcher src param

* cmd/pprofessor: don't set nil "after" in composite

This handles the case where there's no `profile.id`
keyword field, e.g. because the template hasn't been
overwritten.

* cmd/pprofessor: add minimal README

* model/profile: update wording for "profile.id"

Update to match the wording for "profile.duration".
axw added a commit that referenced this pull request Sep 7, 2020
* cmd/pprofessor: introduce pprof-based tool

pprofessor is based on the pprof tool, using an alternative
"Fetcher" implementation that queries Elasticsearch to aggregate
profile samples recorded by APM Server.

I haven't gone through all of the pprof flags to make sure
they work, so some may not make sense. There are a couple of
flags added specifically for our use case:

* -service, which defines the service name to filter on.
* -start, which defines the start timestamp from which to
  start aggregating. If specified along with -duration, the
  latter  controls the end time. If only -start is specified,
  then the end time is now. If only -duration is specified,
  then the end time is now and the start time is duration
  seconds before now.

We record the number of profiles and documents aggregated in
profile comments. All metrics (currently: cpu, inuse/allocated
heap objects/space) are aggregated into a single profile.
To aggregate duration correctly, we introduce the `profile.id`
field, which is a unique ID generated per profile, shared by
all sample docs derived from a profile.

* cmd/pprofessor: set default service name

* cmd/pprofessor: remove -tls_ca handling

* cmd/pprofessor: rename Fetcher src param

* cmd/pprofessor: don't set nil "after" in composite

This handles the case where there's no `profile.id`
keyword field, e.g. because the template hasn't been
overwritten.

* cmd/pprofessor: add minimal README

* model/profile: update wording for "profile.id"

Update to match the wording for "profile.duration".
@jalvz jalvz added the v7.10.0 label Oct 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pprof-based tool for visualising aggregated profiles
4 participants