Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli: debug zip should continue past errors #23954

Closed
bdarnell opened this issue Mar 16, 2018 · 7 comments
Closed

cli: debug zip should continue past errors #23954

bdarnell opened this issue Mar 16, 2018 · 7 comments
Assignees
Milestone

Comments

@bdarnell
Copy link
Contributor

Some of the information in debug zip requires the cluster to be "up" and able to serve consistent reads. But much of it does not (for example, lower-level details like goroutine stacks), and these parts are often the most useful. The debug zip command currently hangs on clusters that are unavailable. Instead, it should try to gather as much as it can and log what it couldn't get. Every step in the data collection should have a timeout, and a timeout shouldn't prevent the rest of the process from being run.

@dianasaur323
Copy link
Contributor

Thanks Ben - I think we also want to add heap profiles to the debug zip, so perhaps we do these two issues together in one go.

@petermattis petermattis self-assigned this Mar 16, 2018
@petermattis petermattis added this to the 2.1 milestone Mar 16, 2018
@petermattis
Copy link
Collaborator

This should be relatively straightforward. I might try to get a PR out tomorrow.

@dianasaur323 Heap profiles have already been added to debug zip.

@dianasaur323
Copy link
Contributor

dianasaur323 commented Mar 17, 2018 via email

@petermattis
Copy link
Collaborator

Adding timeouts to the various RPCs is easy, but doesn't get us very far. The problem is that statusServer.Nodes does a scan of the node status keys which will block if the range containing those keys is unavailable. I think I need to do a larger restructuring so that we determine the node IDs via the Gossip endpoint.

@petermattis
Copy link
Collaborator

The problem with statusServer.Nodes will also affect the node status command. Do we keep an in-process cache of the NodeStatus descriptors? Note that gossip keeps a cache of the NodeDescriptor which is different. For debug zip we could get away with inspecting gossip, but node status actually uses some of the fields from NodeStatus.

@tbg
Copy link
Member

tbg commented Mar 26, 2018 via email

@petermattis
Copy link
Collaborator

@tschottdorf Thanks. Perhaps both debug zip and node status should use crdb_internal.gossip_nodes to get the list of node IDs (which internally uses gossip) and then use statusServer.Node to retrieve the node status. Seems workable.

windchan7 added a commit to windchan7/cockroach that referenced this issue Apr 4, 2018
Now made command `debug zip` continue past errors with a timeout
(based on peter's timeout commit).

Also dumped information in crdb_internal.gossip_nodes and gossip_liveness
to the output file.

Fixes cockroachdb#23954.

Release note: None
@windchan7 windchan7 self-assigned this Apr 4, 2018
windchan7 added a commit to windchan7/cockroach that referenced this issue Apr 4, 2018
Now made command `debug zip` continue past errors with a timeout
(based on peter's timeout commit).

Also dumped information in crdb_internal.gossip_nodes and gossip_liveness
to the output file.

Fixes cockroachdb#23954.

Release note: None
craig bot added a commit that referenced this issue Apr 4, 2018
24469: cli: `debug zip` with timeout, added dump for crdb_internal.gossip_* r=windchan7 a=windchan7

Now made command `debug zip` continue past errors with a timeout
(based on peter's timeout commit).

Also dumped information in crdb_internal.gossip_nodes and gossip_liveness
to the output file.

Fixes #23954.

Release note: None
@craig craig bot closed this as completed in #24469 Apr 4, 2018
tbg pushed a commit to tbg/cockroach that referenced this issue May 3, 2018
Now made command `debug zip` continue past errors with a timeout
(based on peter's timeout commit).

Also dumped information in crdb_internal.gossip_nodes and gossip_liveness
to the output file.

Fixes cockroachdb#23954.

Release note: None
craig bot pushed a commit that referenced this issue May 7, 2018
25276: cherrypick-2.0: cli: `debug zip` with timeout, added dump for crdb_internal.gossip_* r=bdarnell a=tschottdorf

Now made command `debug zip` continue past errors with a timeout
(based on peter's timeout commit).

Also dumped information in crdb_internal.gossip_nodes and gossip_liveness
to the output file.

Fixes #23954.
cc @cockroachdb/release 
Release note: None

Co-authored-by: Victor Chen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants