cli: `debug zip` should continue past errors #23954
Thanks Ben - I think we also want to add heap profiles to the debug zip, so perhaps we do these two issues together in one go.
This should be relatively straightforward. I might try to get a PR out tomorrow. @dianasaur323 Heap profiles have already been added to `debug zip`.
Thanks for the heads up, woohoo!
Adding timeouts to the various RPCs is easy, but doesn't get us very far. The problem is that …
The problem with `statusServer.Nodes` will also affect the `node status` command. Do we keep an in-process cache of the `NodeStatus` descriptors? Note that gossip keeps a cache of the `NodeDescriptor`, which is different. For `debug zip` we could get away with inspecting gossip, but `node status` actually uses some of the fields from `NodeStatus`.
You may want to dump the `crdb_internal.gossip_*` tables instead. It's not
all the same, but it's always available. They were introduced to address
similar shortcomings in the cli status commands that also rely on `Nodes`.
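The suggestion above can be sketched as generating one query per gossip table and dumping the results. The table names (`crdb_internal.gossip_nodes`, `crdb_internal.gossip_liveness`) come from this thread; the query-building helper itself is hypothetical, not CockroachDB's actual code.

```go
// Hypothetical sketch: build one SELECT per crdb_internal gossip table.
// These tables are served from gossiped data, so they remain readable
// even when the cluster cannot serve consistent KV reads via Nodes.
package main

import "fmt"

// gossipDumpQueries returns a SELECT statement for each named
// crdb_internal table; the caller would run each query (ideally under
// a timeout) and write the rows into the debug zip.
func gossipDumpQueries(tables []string) []string {
	qs := make([]string, 0, len(tables))
	for _, t := range tables {
		qs = append(qs, fmt.Sprintf("SELECT * FROM crdb_internal.%s", t))
	}
	return qs
}

func main() {
	for _, q := range gossipDumpQueries([]string{"gossip_nodes", "gossip_liveness"}) {
		fmt.Println(q)
	}
}
```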
@tschottdorf Thanks. Perhaps both …
24469: cli: `debug zip` with timeout, added dump for crdb_internal.gossip_* r=windchan7 a=windchan7 Made the `debug zip` command continue past errors with a timeout (based on Peter's timeout commit). Also dumped crdb_internal.gossip_nodes and crdb_internal.gossip_liveness to the output file. Fixes #23954. Release note: None
25276: cherrypick-2.0: cli: `debug zip` with timeout, added dump for crdb_internal.gossip_* r=bdarnell a=tschottdorf Made the `debug zip` command continue past errors with a timeout (based on Peter's timeout commit). Also dumped crdb_internal.gossip_nodes and crdb_internal.gossip_liveness to the output file. Fixes #23954. cc @cockroachdb/release Release note: None Co-authored-by: Victor Chen <[email protected]>
Some of the information in `debug zip` requires the cluster to be "up" and able to serve consistent reads. But much of it does not (for example, lower-level details like goroutine stacks), and these parts are often the most useful. The `debug zip` command currently hangs on clusters that are unavailable. Instead, it should try to gather as much as it can and log what it couldn't get. Every step in the data collection should have a timeout, and a timeout shouldn't prevent the rest of the process from being run.