
elastic-agent diagnostics pprof #28798

Merged

merged 20 commits into elastic:master from michel-laterman:elastic-agent-profile on Dec 14, 2021

Conversation

michel-laterman
Contributor

@michel-laterman michel-laterman commented Nov 3, 2021

What does this PR do?

Add a new Pprof gRPC call as well as two ways to gather pprof data:

  1. Collect all pprof data from the agent and the beats it runs with
elastic-agent diagnostics collect --pprof
  2. Collect one pprof profile from one application (route) with
elastic-agent diagnostics pprof [--pprof-application APP-NAME] [--pprof-route-key ROUTE-KEY] [--pprof-type TYPE]

The duration of the trace or profile data can be controlled with the pprof-duration flag, which defaults to 30s. The timeout for these commands can be controlled separately with the timeout flag; by default this value is 30s + pprof-duration. timeout should be larger than pprof-duration to give the command time to process the response data.

Why is it important?

This option allows operators to gather pprof data from all applications into the diagnostics archive to assist debugging.
It also adds the elastic-agent diagnostics pprof command so operators can gather one pprof type for debugging (specifically so it can be composed with other command-line tools).

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation PR here
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Run elastic-agent diagnostics collect --pprof and inspect the archive for pprof data.

Related issues

@michel-laterman michel-laterman added enhancement Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Nov 3, 2021
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Nov 3, 2021
@mergify
Contributor

mergify bot commented Nov 3, 2021

This pull request does not have a backport label. Could you fix it @michel-laterman? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, for example:

  • backport-v\d.\d.\d is the label to automatically backport to the 7.\d branch (\d is a digit)

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Nov 3, 2021
@elasticmachine
Collaborator

elasticmachine commented Nov 3, 2021

💚 Build Succeeded


Build stats

  • Start Time: 2021-12-14T04:56:30.581+0000

  • Duration: 100 min 0 sec

  • Commit: 4e6ecbb

Test stats 🧪

  • Failed: 0
  • Passed: 7136
  • Skipped: 16
  • Total: 7152

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@michel-laterman michel-laterman added the backport-v7.16.0 Automated backport with mergify label Nov 9, 2021
@mergify mergify bot removed the backport-skip Skip notification from the automated backport with mergify label Nov 9, 2021
@michel-laterman
Contributor Author

@ruflin here's a draft of adding pprof support.

I think we should consider adding a config option in order to serve the pprof endpoints on the standard port (I'm not sure why it's currently done purely via a command-line flag; are there historical reasons?)

Also let me know if I should split this PR up, as it's gotten pretty large.

@ruflin
Contributor

ruflin commented Nov 11, 2021

++ on also providing a config option. I always appreciate smaller PRs but leave it up to you whether the overhead is worth it.

@ruflin ruflin added the backport-v8.0.0 Automated backport with mergify label Nov 11, 2021
Contributor

@ruflin ruflin left a comment

Overall LGTM, would be good if @michalpristas could take a closer look at the code.

One thing that concerns me, though it is not only related to this PR, is that no tests were added and no tests changed, meaning we don't have any. I know testing the pprof part itself might be tricky, but I would hope at least some parts, such as creating the correct paths and building endpoints, could be tested.

cmd := &cobra.Command{
	Use:   "pprof",
	Short: "Collect pprof information from a running process.",
	Long:  "Collect pprof information from the elastic-agent or one of its processes and write to stdout or a file.",
Contributor

Is my understanding correct that by default this command will take 30s to execute? Maybe we should mention in the docs here that it is tied to pprof-duration?

innerCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
d := time.Second * 30
if pprof {
	d = 2 * pprofDur
Contributor

Why is it 2* ?

Contributor Author

I needed a context timeout that would be larger than the trace duration to give the agent time to process/zip the responses. I can change this to a static addition (i.e. 30s + pprofDur).

Contributor

Got it. Can you add this comment to the code so someone else touching it will know why this is there? Which approach you take (30s or 2x) I leave to you.

Contributor Author

I'll use the 30s+duration value, but I've added an option to specify the top-level command timeout, so it can be increased beyond this limit if needed. Hopefully with this we can avoid edge cases where it takes the agent over 30s to create an archive.

fmt.Fprintf(streams.Out, "pprof data written to %s\n", outFile)
return nil
}
streams.Out.Write(res.Result)
Contributor

Should a potential error here be handled?

}
endpoint = "npipe"
}
client := newSocketRequester(time.Second*5, n, rk, endpoint)
Contributor

@michalpristas Seems some quite fundamental changes here. Can you take a look?

Contributor

I see further down that most of the code is the same but split into multiple methods, which is a nice cleanup.

// retrieve elastic-agent pprof data if requested or application is unspecified.
if req.AppName == "" || req.AppName == "elastic-agent" {
	endpoint := beats.AgentDebugEndpoint(runtime.GOOS)
	c := newSocketRequester(dur*2, "elastic-agent", "", endpoint)
Contributor

One more place where we have *2. Curious to understand this better.

Contributor Author

I think the duration passed to newSocketRequester is an artifact of my development (at one point I was not making requests with contexts); I'll clean this up.

Change pprof timeouts from 2*pprofDur to 30s+pprofDur. Remove timeouts
from the socket requester client as cancellations for long running
requests will be handled by the passed ctx.
@michel-laterman
Contributor Author

@ruflin, I think the diagnostics commands are a good candidate for adding an e2e test. I'm not sure if we should validate the archive structure; I think just ensuring the command exited with 0 and that an archive was created may be good enough for now.

@ruflin
Contributor

ruflin commented Nov 22, 2021

👍 on adding an e2e test for the diagnostics command. Can you follow up on this in the e2e repo after this gets in?

Should this be marked as ready for review? There are some CI complaints.

Fix linting issues with new command. Add a timeout flag for when pprof
info is gathered. The flag lets users specify the command timeout value.
This value should be greater than pprof-duration, as the command needs
time to gather and process pprof data.
@michel-laterman michel-laterman changed the title Elastic-agent pprof enablement elastic-agent diagnostics pprof Nov 23, 2021
@michel-laterman michel-laterman marked this pull request as ready for review November 23, 2021 04:13
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

Contributor

@ruflin ruflin left a comment

LGTM

@ferullo Does endpoint have something similar to pprof which could also be read out in the future?

@@ -150,3 +150,4 @@
- Add diagnostics collect command to gather beat metadata, config, policy, and logs and bundle it into an archive. {pull}28461[28461]
- Add `KIBANA_FLEET_SERVICE_TOKEN` to Elastic Agent container. {pull}28096[28096]
- Enable pprof endpoints for beats processes. Allow pprof endpoints for elastic-agent if enabled. {pull}28983[28983]
- Add `--pprof` flag to `elastic-agent diagnostics` and an `elastic-agent pprof` command to allow operators to gather pprof data from the agent and beats running under it. {pull}28798[28798]
Contributor

Lets make sure we follow up with proper docs.

Contributor Author

Already in a PR.

}
programs, ok := routes.Get(rk)
if !ok {
s.logger.With("route_key", rk).Warn("Unable to retrieve route.")
Contributor

In the past we tried to use Warn as little as possible as it is not clear if the user needs to take action or not. What are the reasons we can't retrieve the route key?

Contributor Author

I'm not sure why we would be unable to retrieve routes in this case, but I thought logging any issue could help with debugging edge cases.

Contributor

👍

c := newSocketRequester(n, rk, endpoint)

// Launch a concurrent goroutine to gather all pprof endpoints from a socket.
for _, opt := range req.PprofType {
Contributor

We have here 3 for loops nested inside each other. Maybe we could extract some into methods which would make reviewing and potentially testing easier.

Contributor Author

I'll work on it.

@ruflin
Contributor

ruflin commented Nov 24, 2021

I did try to run this locally by running ./elastic-agent run. Likely because of this I get the following errors for pprof:

Get "http://unix/debug/pprof/profile?seconds=30": dial unix /tmp/elastic-agent/93926c22716b6d271825bfa3d3c9f0b3434619940c4a3ccba25d96ff2322946f.sock: connect: no such file or directory

Is it expected to also work if I have not installed elastic-agent but only run it?

After I run ./elastic-agent diagnostics collect --pprof, nothing happens for 30s, which of course is expected. But before we start the diagnostics run, could we print a message to stdout that collection is starting and will likely take x seconds?

@michel-laterman
Contributor Author

I haven't tried testing this when run is used instead of installing; I can get to it.

@mergify
Contributor

mergify bot commented Nov 30, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fix this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b elastic-agent-profile upstream/elastic-agent-profile
git merge upstream/master
git push upstream elastic-agent-profile

@michel-laterman
Contributor Author

Right now there's a bug with how the agent handles the agent.monitoring.pprof attribute. The output of NewMonitor

func NewMonitor(downloadConfig *artifact.Config, monitoringCfg *monitoringConfig.MonitoringConfig, logMetrics bool) *Monitor {

references the attribute in the file. However, the non-nil config passed to Reload
func (b *Monitor) Reload(rawConfig *config.Config) error {
has the (default) false setting. I'm trying to find out why this is happening.

@michel-laterman
Contributor Author

I think the issue with the pprof attribute is that on startup in fleet mode, the agent gets a policy update from fleet in

func (h *PolicyChange) Handle(ctx context.Context, a fleetapi.Action, acker store.FleetAcker) error {

The policy sent by fleet does not have a reference to the pprof attribute

Change the setting in elastic-agent from agent.monitoring.pprof to
agent.monitoring.pprof.enabled so that policy updates (such as the one
that occurs when the agent starts in fleet mode) do not use the
default false value if the user has injected the setting into fleet.yml
@@ -39,6 +46,5 @@ func DefaultConfig() *MonitoringConfig {
Port: defaultPort,
},
Namespace: defaultNamespace,
Pprof: false,
Contributor Author

@ruflin, I changed this from a default false behaviour to default nil so that a value inserted into fleet.yml would be used when the Reload function is called (so the setting is now applied to beats that the agent starts)

Contributor

Interesting, so the value only gets overwritten if it is not already set?

@narph This likely should also feed into your thoughts on how these files should work in the future.
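The nil-versus-false distinction works because a `*bool` field has three states (nil, true, false); a minimal sketch of the idea, where the type, field, and merge function are assumptions for illustration, not the PR's actual code:

```go
package main

import "fmt"

// MonitoringConfig sketches the *bool pattern discussed above: a nil
// Pprof means "not set", so a policy update that omits the field does
// not clobber a value the user put into fleet.yml.
type MonitoringConfig struct {
	Pprof *bool
}

// merge applies an update, keeping the existing value whenever the
// update leaves Pprof unset (nil).
func merge(existing, update *MonitoringConfig) *MonitoringConfig {
	out := &MonitoringConfig{Pprof: existing.Pprof}
	if update.Pprof != nil {
		out.Pprof = update.Pprof
	}
	return out
}

func main() {
	enabled := true
	fromFleetYml := &MonitoringConfig{Pprof: &enabled}
	policyUpdate := &MonitoringConfig{} // fleet policy omits pprof entirely
	merged := merge(fromFleetYml, policyUpdate)
	fmt.Println(*merged.Pprof) // true: the user's setting survives the update
}
```

With a plain `bool` field, the omitted value would unmarshal to the zero value false and silently disable the setting on every policy update.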

@michel-laterman michel-laterman removed the backport-v7.16.0 Automated backport with mergify label Dec 14, 2021
@michel-laterman michel-laterman merged commit f5e0ec4 into elastic:master Dec 14, 2021
@michel-laterman michel-laterman deleted the elastic-agent-profile branch December 14, 2021 20:01
mergify bot pushed a commit that referenced this pull request Dec 14, 2021
* Allow -httpprof to bind to sockets/pipes

* Enable pprof debug endpoint on socket for agent and beats

Force the elastic-agent and all beats that it starts to run the
http/pprof listener on a local socket.

* Add new Pprof command to control.proto

* Add pprof option to diagnostics collect

* Fix linting issues

* Add diagnostics pprof command; allow pprof to collect from agent

* Revert debug socket changes

* Cleanup timeout handling

Change pprof timeouts from 2*pprofDur to 30s+pprofDur. Remove timeouts
from the socket requester client as cancellations for long running
requests will be handled by the passed ctx.

* Fix linting issue add timeout flag

Fix linting issues with new command. Add a timeout flag for when pprof
info is gathered. The flag lets users specify the command timeout value.
This value should be greater than pprof-duration, as the command needs
time to gather and process pprof data.

* Add more command help text.

* Add CHANGELOG

* move spec collection for routes to fn

* add monitoringCfg reference to control server

* elastic-agent server only processes pprof requests when enabled

* Fix error message fix commands only on elastic-agent

* Add pprof fleet.yml, fix nil reference

* Change pprof setting name to monitoring.pprof.enabled

Change the setting in elastic-agent from agent.monitoring.pprof to
agent.monitoring.pprof.enabled so that policy updates (such as the one
that occurs when the agent starts in fleet mode) do not use the
default false value if the user has injected the setting into fleet.yml

(cherry picked from commit f5e0ec4)
michel-laterman added a commit that referenced this pull request Dec 15, 2021
* (commit list identical to the mergify backport above)

(cherry picked from commit f5e0ec4)

Co-authored-by: Michel Laterman <[email protected]>
Labels
backport-v8.0.0 Automated backport with mergify enhancement Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

3 participants