cli: `debug zip` should continue past errors #23954

bdarnell · 2018-03-16T14:26:46Z

Some of the information in debug zip requires the cluster to be "up" and able to serve consistent reads. But much of it does not (for example, lower-level details like goroutine stacks), and these parts are often the most useful. The debug zip command currently hangs on clusters that are unavailable. Instead, it should try to gather as much as it can and log what it couldn't get. Every step in the data collection should have a timeout, and a timeout shouldn't prevent the rest of the process from being run.

dianasaur323 · 2018-03-16T21:16:35Z

Thanks Ben - I think we also want to add heap profiles to the debug zip, so perhaps we do these two issues together in one go.

petermattis · 2018-03-16T23:25:33Z

This should be relatively straightforward. I might try to get a PR out tomorrow.

@dianasaur323 Heap profiles have already been added to debug zip.

dianasaur323 · 2018-03-17T16:55:01Z

Thanks for the heads up, woohoo!

petermattis · 2018-03-26T13:55:32Z

Adding timeouts to the various RPCs is easy, but doesn't get us very far. The problem is that statusServer.Nodes does a scan of the node status keys which will block if the range containing those keys is unavailable. I think I need to do a larger restructuring so that we determine the node IDs via the Gossip endpoint.

petermattis · 2018-03-26T19:27:23Z

The problem with statusServer.Nodes will also affect the node status command. Do we keep an in-process cache of the NodeStatus descriptors? Note that gossip keeps a cache of the NodeDescriptor which is different. For debug zip we could get away with inspecting gossip, but node status actually uses some of the fields from NodeStatus.

tbg · 2018-03-26T19:42:00Z

You may want to dump the `crdb_internal.gossip_*` tables instead. It's not all the same, but it's always available. They were introduced to address similar shortcomings in the cli status commands that also rely on `Nodes`.

petermattis · 2018-03-26T20:21:35Z

@tschottdorf Thanks. Perhaps both debug zip and node status should use crdb_internal.gossip_nodes to get the list of node IDs (which internally uses gossip) and then use statusServer.Node to retrieve the node status. Seems workable.

Now made command `debug zip` continue past errors with a timeout (based on peter's timeout commit). Also dumped information in crdb_internal.gossip_nodes and gossip_liveness to the output file. Fixes cockroachdb#23954. Release note: None

24469: cli: `debug zip` with timeout, added dump for crdb_internal.gossip_* r=windchan7 a=windchan7

Now made command `debug zip` continue past errors with a timeout (based on peter's timeout commit). Also dumped information in crdb_internal.gossip_nodes and gossip_liveness to the output file.

Fixes #23954.

Release note: None

25276: cherrypick-2.0: cli: `debug zip` with timeout, added dump for crdb_internal.gossip_* r=bdarnell a=tschottdorf

Now made command `debug zip` continue past errors with a timeout (based on peter's timeout commit). Also dumped information in crdb_internal.gossip_nodes and gossip_liveness to the output file.

Fixes #23954.

cc @cockroachdb/release

Release note: None

Co-authored-by: Victor Chen <[email protected]>

petermattis self-assigned this Mar 16, 2018

petermattis added this to the 2.1 milestone Mar 16, 2018

windchan7 mentioned this issue Apr 4, 2018

cli: debug zip with timeout, added dump for crdb_internal.gossip_* #24469

Merged

windchan7 self-assigned this Apr 4, 2018

craig bot closed this as completed in #24469 Apr 4, 2018

tbg mentioned this issue May 3, 2018

cherrypick-2.0: cli: debug zip with timeout, added dump for crdb_internal.gossip_* #25276

Merged
