Remote agent pprof endpoints #6841

drewbailey · 2019-12-11T20:28:28Z

This PR adds server and client RPC endpoints that let operators generate pprof reports for any given node or server, provided they have the proper ACL privileges.

A new HTTP endpoint, /v1/agent/pprof/, behaves like the standard Go pprof endpoints (https://golang.org/pkg/net/http/pprof/) but forwards the request to remote nodes.
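
For anyone who wants to script against the endpoint, here is a minimal Go sketch of a client. It relies only on the path and `seconds` query parameter that appear in this PR (`/v1/agent/pprof/<kind>?seconds=N`). The helper names `pprofURL` and `fetchProfile` are hypothetical, `main` talks to a local stand-in server rather than a real Nomad agent, and ACL token handling is left out.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strconv"
)

// pprofURL builds the request URL for one profile kind, for example
// "profile", "trace", "heap" or "goroutine". Hypothetical helper.
func pprofURL(addr, kind string, seconds int) string {
	q := url.Values{}
	q.Set("seconds", strconv.Itoa(seconds))
	return addr + "/v1/agent/pprof/" + kind + "?" + q.Encode()
}

// fetchProfile copies the response body into w and returns the number of
// bytes written. Any status other than 200 is reported as an error.
func fetchProfile(c *http.Client, u string, w io.Writer) (int64, error) {
	resp, err := c.Get(u)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.Copy(w, resp.Body)
}

func main() {
	// Stand-in for an agent: it only echoes the requested path.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "stand-in response for %s", r.URL.Path)
	}))
	defer srv.Close()

	n, err := fetchProfile(srv.Client(), pprofURL(srv.URL, "profile", 1), io.Discard)
	fmt.Println("bytes received:", n, "error:", err)
}
```

Against a real agent, `srv.URL` would be replaced by the agent address and `io.Discard` by a file to write the profile into.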

TODO

Docs

api/agent.go

schmichael

Code looks great! Mostly docs/wording/style comments that are not blockers.

api/agent.go

client/agent_endpoint.go

client/agent_endpoint_test.go

command/agent/agent_endpoint.go

command/agent/agent_endpoint_test.go

website/source/api/agent.html.md

notnoop

This is quite meaty - great thinking through so many cases and conditions. I have many stylistic nitpicks though.

I'd be curious whether we have considered a streaming RPC approach, with command/agent/profile effectively invoking pprof.handles. We could have a wrapper RequestHandler that streams results to the HTTP server directly. The logging endpoints might be a pattern to follow here? Doing so would let us keep parity with how the pprof endpoints handle requests and avoid loading the entire profile in memory. I don't have a sense of how big the profiles would be (I guess memory-related ones can get very large in a busy cluster).
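
To make the memory argument concrete, here is a rough Go sketch of streaming through a fixed-size buffer instead of buffering the whole profile. It is not Nomad's RPC layer: `streamProfile`, `fakeProfile` and `chunkRecorder` are hypothetical stand-ins, and the 32 KiB buffer size is an arbitrary assumption.

```go
package main

import (
	"fmt"
	"io"
)

// chunkSize bounds how much profile data is held in memory at once.
const chunkSize = 32 * 1024

// streamProfile copies src to dst through one fixed-size buffer, so memory
// use does not grow with the size of the profile.
func streamProfile(dst io.Writer, src io.Reader) (int64, error) {
	buf := make([]byte, chunkSize)
	return io.CopyBuffer(dst, src, buf)
}

// fakeProfile is a stand-in source that yields `remaining` zero bytes.
type fakeProfile struct{ remaining int64 }

func (f *fakeProfile) Read(p []byte) (int, error) {
	if f.remaining == 0 {
		return 0, io.EOF
	}
	n := int64(len(p))
	if n > f.remaining {
		n = f.remaining
	}
	for i := range p[:n] {
		p[i] = 0
	}
	f.remaining -= n
	return int(n), nil
}

// chunkRecorder is a stand-in sink that records the total number of bytes
// and the largest single write it has seen.
type chunkRecorder struct {
	total    int64
	maxChunk int
}

func (c *chunkRecorder) Write(p []byte) (int, error) {
	c.total += int64(len(p))
	if len(p) > c.maxChunk {
		c.maxChunk = len(p)
	}
	return len(p), nil
}

func main() {
	src := &fakeProfile{remaining: 64 << 20} // a large payload
	sink := &chunkRecorder{}
	n, err := streamProfile(sink, src)
	fmt.Println("bytes streamed:", n, "largest chunk:", sink.maxChunk, "error:", err)
}
```

In an HTTP handler the destination would be the `http.ResponseWriter`, so each chunk could go out to the caller as soon as it is read.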

api/agent.go

client/agent_endpoint.go

command/agent/agent_endpoint_test.go

command/agent/agent_endpoint.go

command/agent/profile/pprof.go

command/agent/http.go

api/agent.go

command/agent/profile/pprof.go

command/agent/agent_endpoint.go

command/agent/profile/pprof.go

notnoop

lgtm - thanks.

command/agent/agent_endpoint.go

drewbailey · 2020-01-09T20:02:17Z

@notnoop I've been doing some testing on how large the profiles can get on a busy server.

On a t2.2xl server with 27/31 GB utilized (all pending Nomad jobs), I got the following results. Trace is by far the largest and grows with the duration of the request. I'm wondering if that's small enough to ease your concerns around streaming, or if it's something we should still consider doing in the near term.

cc @schmichael

trace of 55 seconds -> 72 MB profile

→ curl -v -o out.profile $NOMAD_ADDR/v1/agent/pprof/trace\?seconds=55
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 52.0.15.160:4646...
* TCP_NODELAY set
* Connected to nomad-server-lb-232516302.us-east-1.elb.amazonaws.com (52.0.15.160) port 4646 (#0)
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0> GET /v1/agent/pprof/trace?seconds=55 HTTP/1.1
> Host: nomad-server-lb-232516302.us-east-1.elb.amazonaws.com:4646
> User-Agent: curl/7.65.3
> Accept: */*
>
  0     0    0     0    0     0      0      0 --:--:--  0:00:55 --:--:--     0* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Disposition: attachment; filename="trace"
< Content-Type: application/octet-stream
< Date: Thu, 09 Jan 2020 19:36:52 GMT
< Vary: Accept-Encoding
< X-Content-Type-Options: nosniff
< transfer-encoding: chunked
< Connection: keep-alive
<
{ [14225 bytes data]
100 71.7M    0 71.7M    0     0  1263k      0 --:--:--  0:00:58 --:--:-- 18.6M
* Connection #0 to host nomad-server-lb-232516302.us-east-1.elb.amazonaws.com left intact

goroutine -> 8.4k

→ curl -v -o out.profile $NOMAD_ADDR/v1/agent/pprof/goroutine\?seconds=40
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 52.201.140.15:4646...
* TCP_NODELAY set
* Connected to nomad-server-lb-232516302.us-east-1.elb.amazonaws.com (52.201.140.15) port 4646 (#0)
> GET /v1/agent/pprof/goroutine?seconds=40 HTTP/1.1
> Host: nomad-server-lb-232516302.us-east-1.elb.amazonaws.com:4646
> User-Agent: curl/7.65.3
> Accept: */*
>
  0     0    0     0    0     0      0      0 --:--:--  0:00:18 --:--:--     0* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Disposition: attachment; filename="goroutine"
< Content-Type: application/octet-stream
< Date: Thu, 09 Jan 2020 19:54:16 GMT
< Vary: Accept-Encoding
< X-Content-Type-Options: nosniff
< Content-Length: 8406
< Connection: keep-alive
<
{ [6987 bytes data]
100  8406  100  8406    0     0    437      0  0:00:19  0:00:19 --:--:--  2114
* Connection #0 to host nomad-server-lb-232516302.us-east-1.elb.amazonaws.com left intact

heap -> 97k

→ curl -v -o out.profile $NOMAD_ADDR/v1/agent/pprof/heap\?seconds=40
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 52.0.15.160:4646...
* TCP_NODELAY set
* Connected to nomad-server-lb-232516302.us-east-1.elb.amazonaws.com (52.0.15.160) port 4646 (#0)
> GET /v1/agent/pprof/heap?seconds=40 HTTP/1.1
> Host: nomad-server-lb-232516302.us-east-1.elb.amazonaws.com:4646
> User-Agent: curl/7.65.3
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Disposition: attachment; filename="heap"
< Content-Type: application/octet-stream
< Date: Thu, 09 Jan 2020 19:55:02 GMT
< Vary: Accept-Encoding
< X-Content-Type-Options: nosniff
< transfer-encoding: chunked
< Connection: keep-alive
<
{ [2642 bytes data]
100 98361    0 98361    0     0   533k      0 --:--:-- --:--:-- --:--:--  533k
* Connection #0 to host nomad-server-lb-232516302.us-east-1.elb.amazonaws.com left intact

profile -> 9k

→ curl -v -o out.profile $NOMAD_ADDR/v1/agent/pprof/profile\?seconds=40
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 52.0.15.160:4646...
* TCP_NODELAY set
* Connected to nomad-server-lb-232516302.us-east-1.elb.amazonaws.com (52.0.15.160) port 4646 (#0)
> GET /v1/agent/pprof/profile?seconds=40 HTTP/1.1
> Host: nomad-server-lb-232516302.us-east-1.elb.amazonaws.com:4646
> User-Agent: curl/7.65.3
> Accept: */*
>
  0     0    0     0    0     0      0      0 --:--:--  0:00:39 --:--:--     0* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Disposition: attachment; filename="profile"
< Content-Type: application/octet-stream
< Date: Thu, 09 Jan 2020 19:56:44 GMT
< Vary: Accept-Encoding
< X-Content-Type-Options: nosniff
< Content-Length: 9028
< Connection: keep-alive
<
{ [9028 bytes data]
100  9028  100  9028    0     0    225      0  0:00:40  0:00:40 --:--:--  1874
* Connection #0 to host nomad-server-lb-232516302.us-east-1.elb.amazonaws.com left intact

schmichael · 2020-01-10T22:56:04Z

@drewbailey Can you create an issue for streaming traces (and make sure it has a link to the discussion here)? As discussed it's not something I think we need to prioritize, but it might make a good starter issue for someone wanting to learn the RPC internals. Or in the future if there's a tool that expects tracing to be a stream, it'd be important to update our implementation.

github-actions · 2023-01-21T02:16:01Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

drewbailey force-pushed the f-agent-pprof-acl branch 3 times, most recently from 4d58bb3 to 5a2eec2 Compare December 13, 2019 15:50

drewbailey marked this pull request as ready for review December 13, 2019 20:15

drewbailey requested a review from schmichael December 13, 2019 20:15

drewbailey force-pushed the f-agent-pprof-acl branch from b7924d5 to 9358f64 Compare December 16, 2019 15:06

notnoop self-requested a review December 16, 2019 15:46

schmichael approved these changes Dec 16, 2019

notnoop reviewed Dec 19, 2019

drewbailey force-pushed the f-agent-pprof-acl branch 3 times, most recently from df82bd4 to 92b5140 Compare December 20, 2019 18:51

drewbailey requested a review from notnoop January 6, 2020 15:29

notnoop approved these changes Jan 9, 2020

drewbailey force-pushed the f-agent-pprof-acl branch from 92b5140 to 5517743 Compare January 9, 2020 20:14

drewbailey added 13 commits January 9, 2020 15:15

agent pprof endpoints

240c0ee

wip, agent endpoint and client endpoint for pprof profiles agent endpoint test

test for known pprof endpoints

3575f17

Server request forwarding for Agent.Profile

fb1b4cd

Return rpc errors for profile requests, set up remote forwarding to target leader or server id for profile requests. server forwarding, endpoint tests

acl and debug test table

d077cfe

rename implementation method

warn when enabled debug is on when registering

c28e5ad

m -> a receiver name return codederrors, fix query

test pprof headers and profile methods

57dc0c6

tidy up, add comments clean up seconds param assignment

api agent endpoints

b0410a4

helper func to return serverPart based off of serverID

move shared structs out of client and into nomad

390e22e

region forwarding; prevent recursive forwards for impossible requests

3280755

prevent region forwarding loop, backfill tests fix failing test

rename forward func, add comment for why we forward

6e62434

api docs for agent/pprof

588b34c

RPC server EnableDebug option

d77b5ad

Passes in agent enable_debug config to nomad server and client configs. This allows for rpc endpoints to have more granular control if they should be enabled or not in combination with ACLs. enable debug on client test

prevent doubly wrapping with rpc error

11563dc

drewbailey added 8 commits January 9, 2020 15:15

provide helpful error, cleanup logic

db382d3

leave acl checking to rpc endpoints

a3f73b3

fix test expectation test wrapNonJSON

comments for api usage of agent profile

cd7652f

address pr feedback

1776458

Rename profile package to pprof

549045f

Address pr feedback, rename profile package to pprof to more accurately describe its purpose. Adds gc param for heap lookup profiles.

condense table test

2826508

adds qc param, address pr feedback

ad86438

refactor api profile methods

a58b8a5

comment why we ignore errors parsing params

drewbailey force-pushed the f-agent-pprof-acl branch from 5517743 to a58b8a5 Compare January 9, 2020 20:15

drewbailey merged commit ac0fef1 into master Jan 10, 2020

drewbailey deleted the f-agent-pprof-acl branch January 10, 2020 19:52

drewbailey mentioned this pull request Jan 13, 2020

Update agent pprof rpcs to streaming #6933

Open

github-actions bot locked as resolved and limited conversation to collaborators Jan 21, 2023
