Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

new command: nomad debug captures a debug archive of cluster state #8244

Merged
merged 20 commits into from
Jun 25, 2020

Conversation

langmartin
Copy link
Contributor

@langmartin langmartin commented Jun 22, 2020

This implements the nomad debug command, which creates a local tar archive of Nomad configuration, state and logs, in a format designed to help support address customer issues.

  • debug command
    • make a tarball
    • command help text
    • use the monitor interface to capture DEBUG logs from nodes & servers
    • query consul and vault state directly
    • manifest

Part 1 of #8273

@langmartin langmartin added this to the 0.12.0-beta2 milestone Jun 24, 2020
@langmartin langmartin marked this pull request as ready for review June 24, 2020 20:09
@langmartin langmartin requested a review from schmichael June 24, 2020 20:09
require.FileExists(t, filepath.Join(path, "version", "agent-self.json"))

// Consul and Vault are only captured if they exist
_, err := osStat(filepath.Join(path, "version", "consul-agent-self.json"))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a testing gap, especially because there are couple of variations of getting tokens in and http vs. https. I think the most straightforward way to test this is in the e2e cluster where we have a running consul and vault.

@langmartin langmartin requested a review from drewbailey June 24, 2020 20:13

// collectPprof captures pprof data for the node
func (c *DebugCommand) collectPprof(path, id string, client *api.Client) {
opts := api.PprofOptions{Seconds: 1}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is at least a little silly, it would be more useful to both configure this value and maybe do profiles in intervals like we do capturing state. There is a concern that only one profile can be run on an agent at a time, which causes some problems. I think this will still capture useful data, and I'm planning on improving this in a followup.

}

// startMonitors starts go routines for each node and client
func (c *DebugCommand) startMonitors(client *api.Client) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just calling out that I'd be interested to monitor (sorry) how this might effect the nodes performance wise

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious about the same thing with pprof.

Copy link
Contributor

@drewbailey drewbailey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is all looking really good so far. Would it make sense to handle some signals /interrupts to also close stopCh to gracefully stop the apis?

@langmartin
Copy link
Contributor Author

This patch history is a mess, this pr will be squash merged so the false starts won't be inflicted on the future.

@langmartin
Copy link
Contributor Author

This is all looking really good so far. Would it make sense to handle some signals /interrupts to also close stopCh to gracefully stop the apis?

Yeah, that seems like a good improvement, especially thinking about the performance concerns in the api calls.

@langmartin
Copy link
Contributor Author

@drewbailey signal trapping implemented in 9746759

@langmartin langmartin mentioned this pull request Jun 24, 2020
Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So excited for this!

command/debug.go Outdated Show resolved Hide resolved
command/debug.go Outdated Show resolved Hide resolved
command/debug.go Outdated Show resolved Hide resolved
command/debug.go Outdated

go func() {
<-sigCh
close(c.stopCh)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't close this from 2 places safely. Use a context and pass around its cancel func for an easy way to get a channel that multiple goroutines can cancel without synchronization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this, I've got the context and the cancel function stored directly in my parent struct (i.e. not as a pointer). This seems ok from docs, but like there's some potential for inadvertently copying the context struct. It does behave as expected.

@langmartin langmartin merged commit 84072d3 into master Jun 25, 2020
@langmartin langmartin deleted the f-nomad-debug branch June 25, 2020 16:51
@github-actions
Copy link

github-actions bot commented Jan 1, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 1, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants