new command: nomad debug captures a debug archive of cluster state #8244
command/debug_test.go
Outdated
require.FileExists(t, filepath.Join(path, "version", "agent-self.json"))

// Consul and Vault are only captured if they exist
_, err := osStat(filepath.Join(path, "version", "consul-agent-self.json"))
This is a testing gap, especially because there are a couple of variations in how tokens are passed in, plus http vs. https. I think the most straightforward way to test this is in the e2e cluster, where we have a running Consul and Vault.

// collectPprof captures pprof data for the node
func (c *DebugCommand) collectPprof(path, id string, client *api.Client) {
	opts := api.PprofOptions{Seconds: 1}
This is at least a little silly. It would be more useful to make this value configurable and maybe capture profiles at intervals, like we do when capturing state. There is a concern that only one profile can be run on an agent at a time, which causes some problems. I think this will still capture useful data, and I'm planning on improving this in a followup.
}

// startMonitors starts go routines for each node and client
func (c *DebugCommand) startMonitors(client *api.Client) {
Just calling out that I'd be interested to monitor (sorry) how this might affect the nodes performance-wise.
I'm curious about the same thing with pprof.
This is all looking really good so far. Would it make sense to handle some signals/interrupts to also close stopCh and gracefully stop the APIs?
This patch history is a mess; this PR will be squash merged so the false starts won't be inflicted on the future.
Yeah, that seems like a good improvement, especially thinking about the performance concerns in the API calls.
@drewbailey signal trapping implemented in 9746759
So excited for this!
command/debug.go
Outdated

go func() {
	<-sigCh
	close(c.stopCh)
Can't close this from 2 places safely. Use a context and pass around its cancel func for an easy way to get a channel that multiple goroutines can cancel without synchronization.
I did this. I've got the context and the cancel function stored directly in my parent struct (i.e. not as a pointer). This seems OK from the docs, but there's some potential for inadvertently copying the context struct. It does behave as expected.
This implements the nomad debug command, which creates a local tar archive of Nomad configuration, state and logs, in a format designed to help support address customer issues.
Part 1 of #8273