cli: cockroach zip will not complete if there are decommissioned nodes #43966

Closed
knz opened this issue Jan 14, 2020 · 6 comments · Fixed by #44064
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1 High impact: many users impacted, serious risk of high unavailability or data loss

Comments

@knz
Contributor

knz commented Jan 14, 2020

Reported by @roncrdb

Currently, if some nodes are fully decommissioned (i.e. also down), cockroach zip will still try (and fail) to connect to them and retrieve data.

Failing to do so, it reports noisy error messages like this:

debug/nodes/1/crdb_internal.node_statement_statistics.txt
^- resulted in dial tcp 10.69.129.60:26257: i/o timeout
debug/nodes/1/crdb_internal.node_txn_stats.txt
^- resulted in dial tcp 10.69.129.60:26257: i/o timeout
debug/nodes/1/details.json
^- resulted in rpc error: code = Unknown desc = unable to look up descriptor for n1
debug/nodes/1/gossip.json
^- resulted in rpc error: code = Unknown desc = unable to look up descriptor for n1

This stops cockroach debug zip from completing. A rolling restart may fix this issue, but it would be good to find the root cause of what is happening as well.
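To illustrate the idea of not dialing decommissioned nodes at all, here is a minimal Go sketch. It is not the actual cockroach zip code: the `nodeInfo` type, its `Decommissioned` field, the `liveNodes` helper, and the addresses for n2 and n3 are all hypothetical names and values made up for illustration.

```go
package main

import "fmt"

// nodeInfo is a hypothetical, simplified view of a cluster node.
// The real CockroachDB status types are richer and named differently.
type nodeInfo struct {
	ID             int
	Addr           string
	Decommissioned bool
}

// liveNodes returns the nodes worth dialing, dropping any node that is
// marked as decommissioned. Order of the remaining nodes is preserved.
func liveNodes(nodes []nodeInfo) []nodeInfo {
	var out []nodeInfo
	for _, n := range nodes {
		if n.Decommissioned {
			continue
		}
		out = append(out, n)
	}
	return out
}

func main() {
	// Illustrative data only: n1 stands in for the decommissioned node.
	nodes := []nodeInfo{
		{ID: 1, Addr: "10.69.129.60:26257", Decommissioned: true},
		{ID: 2, Addr: "10.0.0.2:26257"},
		{ID: 3, Addr: "10.0.0.3:26257"},
	}
	for _, n := range liveNodes(nodes) {
		fmt.Printf("collecting from n%d at %s\n", n.ID, n.Addr)
	}
}
```

Filtering up front would avoid the dial timeouts shown above, but it assumes the decommissioning status is available before any per-node request is made.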

@knz knz added A-cli C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. labels Jan 14, 2020
@knz knz changed the title from "cli: cockroach zip is annoyingly noisy about decommissioned node" to "cli: cockroach zip is annoyingly noisy about decommissioned nodes" Jan 14, 2020
@roncrdb roncrdb changed the title from "cli: cockroach zip is annoyingly noisy about decommissioned nodes" to "cli: cockroach zip will not complete if there are decommissioned nodes" Jan 14, 2020
@roncrdb roncrdb added S-1 High impact: many users impacted, serious risk of high unavailability or data loss and removed S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. labels Jan 14, 2020
@piyush-singh

Triaging with Ron now, we should fix this ASAP as it blocks support for any cluster with decommed nodes. Will have Ron bring this up at the SIG while I'm OOO.

@roncrdb

roncrdb commented Jan 14, 2020

Seems to be trivially reproducible. Started a local cluster with 4 nodes, decommissioned one, and tried to run debug zip; it failed when it got to the 3rd node, which was the decommissioned one. A file is created, but it seems to be writing the file completely before it zips it, so the file on my local machine only has encoded data in it.

@knz
Contributor Author

knz commented Jan 15, 2020

@roncrdb do you know if this is specific to 19.2/master, or does it also repro with 19.1/2.1?

If it also repros in other versions, I'll go for a simpler fix which can be more readily backported.

@roncrdb

roncrdb commented Jan 15, 2020

@knz I have not tested on 19.1/2.1

@roncrdb

roncrdb commented Jan 15, 2020

@knz tested this on a roachprod cluster: decommissioned a node on both 2.1.10 and 19.1.6, and both completed the debug zip file without the decommissioned node, as expected. So it does not fail. It does complain that it cannot connect to the node that is offline, but it skips that node and finishes creating the zip file. That is what 19.2 should do, but instead it fails.
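The skip-and-continue behavior described for 2.1.10 and 19.1.6 can be sketched in Go roughly as follows. This is an illustration under assumptions, not the cockroach zip implementation: `collectAll`, the `collect` callback, and `errTimeout` are hypothetical names, and the stand-in collector simply fails for n3 to mimic an unreachable node.

```go
package main

import (
	"errors"
	"fmt"
)

// errTimeout is a stand-in error for an unreachable node.
var errTimeout = errors.New("i/o timeout")

// collectAll calls collect for every node ID. A failure on one node is
// recorded and the loop moves on, so a single unreachable node cannot
// abort the whole run.
func collectAll(nodeIDs []int, collect func(nodeID int) error) (ok []int, failed map[int]error) {
	failed = make(map[int]error)
	for _, id := range nodeIDs {
		if err := collect(id); err != nil {
			failed[id] = err
			continue
		}
		ok = append(ok, id)
	}
	return ok, failed
}

func main() {
	// Stand-in collector: n3 plays the decommissioned node and always fails.
	collect := func(nodeID int) error {
		if nodeID == 3 {
			return errTimeout
		}
		return nil
	}
	ok, failed := collectAll([]int{1, 2, 3, 4}, collect)
	fmt.Println("collected from nodes:", ok)
	for id, err := range failed {
		// Report the failure next to the path it affected, then carry on.
		fmt.Printf("debug/nodes/%d\n^- resulted in %v\n", id, err)
	}
}
```

The design choice here is to treat a per-node error as data to report rather than as a reason to return early, which is what lets the archive finish with the remaining nodes.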

@knz
Contributor Author

knz commented Jan 16, 2020

Found the bug, will send PR out

3 participants