Docs for decommissioning and removing nodes #1876

Merged Sep 11, 2017 (2 commits)

Conversation

jseldess
Contributor

@jseldess jseldess commented Sep 1, 2017

  • Update docs on temporarily stopping a node.
  • Add docs on decommissioning and permanent removal of nodes, as well as recommissioning.
  • Update cockroach node docs.
  • Update command overview and sidenav.

Fixes #1496
Fixes #97

@jseldess jseldess requested a review from tbg September 1, 2017 20:27

@jseldess
Contributor Author

jseldess commented Sep 1, 2017

@tschottdorf, @bdarnell, I still have a bit of work to do, but I'd like your early feedback on how I've documented decommissioning and removing nodes. Please take a look at the changes to stop-a-node.md and the new decommission-a-node.md file.

HTML versions:

@tbg
Member

tbg commented Sep 3, 2017

Looks good! I don't fully understand where the docs sit in the greater scheme of things, but it seems that there's a bit of duplication that's likely to rot? Other than that, only two points:

  1. `./cockroach quit --decommission` is essentially `./cockroach node decommission <self> && ./cockroach quit`. That means you can use it to decommission and then stop a node in one step; it's not necessary to decommission the node first.
  2. Discussing the case in which multiple nodes are decommissioned would be good. Decommissioning them all at once is more efficient than doing so one after another, because it minimizes data movement.
  3. The diagrams are good!
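
A minimal shell sketch of the two points above, offered as an illustration rather than as the documented procedure. It assumes that `cockroach node decommission` accepts several node IDs in one call, and it uses `--insecure`, the `*.example.com` hosts, and node IDs 4 and 5 purely as placeholders. The helper names are hypothetical. By default the commands are only printed, so the sketch can be run without a cluster.

```shell
# Dry-run by default: commands are printed, not executed.
# Set COCKROACH=cockroach to run them against a real cluster.
COCKROACH="${COCKROACH:-echo cockroach}"

# Hypothetical helper: decommission several nodes in a single call,
# so their replicas are moved once rather than node by node.
# Usage: decommission_nodes <host-of-any-live-node> <node-id>...
decommission_nodes() {
  host="$1"; shift
  $COCKROACH node decommission "$@" --insecure --host="$host"
}

# Hypothetical helper: decommission a node and then stop it in one step,
# roughly `node decommission <self>` followed by `quit`.
# Usage: decommission_and_stop <host-of-node-to-remove>
decommission_and_stop() {
  $COCKROACH quit --decommission --insecure --host="$1"
}

decommission_nodes node1.example.com 4 5
decommission_and_stop node6.example.com
```

On a secure cluster, the `--insecure` placeholder would be replaced by whatever security flags the deployment uses.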

Reviewed 21 of 21 files at r1.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


_includes/cli/decommission-a-node.html, line 8 at r1 (raw file):

1. Confirm that there are enough nodes to take over the replicas from the node you want to remove. See [Considerations](decommission-a-node.html#consideration) for some example scenarios.

2. [Install the `cockroach` binary](install-cockroachdb.html) on a machine separate from the node.

This isn't necessary; you can do this from one of the machines itself.


v1.1/stop-a-node.md, line 9 at r1 (raw file):

<span class="version-tag">Changed in v1.1:</span> This page shows you how to use the `cockroach quit` [command](cockroach-commands.html) to either temporarily stop a node that you plan to restart or permanently remove a node that has already been [decommissioned](decommission-a-node.html).

Generally, you temporarily stop nodes during the process of [upgrading your cluster's version of CockroachDB](upgrade-cockroach-version.html), whereas you permanently remove nodes when downsizing a cluster.

or reacting to hardware failures.


Comments from Reviewable

@tbg
Member

tbg commented Sep 3, 2017

Oh, and perhaps a Considerations scenario that covers removing a node after a hardware failure would be interesting (i.e. use --wait=live since the node is already dead).
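
The dead-node case might look like the following sketch. It assumes the command is sent to a node that is still up, with `--insecure`, the host name, and node ID 4 as placeholders; the helper name is hypothetical. Commands are printed rather than executed unless `COCKROACH` is overridden.

```shell
# Dry-run by default: commands are printed, not executed.
# Set COCKROACH=cockroach to run them against a real cluster.
COCKROACH="${COCKROACH:-echo cockroach}"

# Hypothetical helper: decommission a node that is already down.
# The request goes to a live node ($1), never to the dead one ($2 is its ID).
# --wait=live is the flag value suggested in the comment above.
decommission_dead_node() {
  live_host="$1"
  dead_id="$2"
  $COCKROACH node decommission "$dead_id" --wait=live --insecure --host="$live_host"
}

decommission_dead_node node1.example.com 4
```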

@bdarnell
Contributor

bdarnell commented Sep 3, 2017

Reviewed 21 of 21 files at r1.
Review status: all files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


_includes/cli/decommission-a-node.html, line 10 at r1 (raw file):

2. [Install the `cockroach` binary](install-cockroachdb.html) on a machine separate from the node.

3. Run the [`cockroach node status`](view-node-details.html) command and identify the ID of the node you want to remove:

If the node is up, it's often easier to ask it for its ID than to scan the node status output in a large cluster: cockroach sql --security-flags --host=<node-to-be-removed> -e 'show node_id'. The node ID is also printed when the node starts up.
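
As a sketch of that suggestion: the helper below asks a running node for its own ID. The `--insecure` flag stands in for the `--security-flags` placeholder, and the host name and helper name are hypothetical. It only makes sense while the node is up. Commands are printed rather than executed unless `COCKROACH` is overridden.

```shell
# Dry-run by default: commands are printed, not executed.
# Set COCKROACH=cockroach to run them against a real cluster.
COCKROACH="${COCKROACH:-echo cockroach}"

# Hypothetical helper: ask a live node for its own node ID.
# Usage: node_id_of <host-of-node-to-be-removed>
node_id_of() {
  $COCKROACH sql --insecure --host="$1" -e 'show node_id'
}

node_id_of node4.example.com
```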


v1.1/decommission-a-node.md, line 13 at r1 (raw file):

<div id="toc"></div>

## Considerations

This should also discuss what it means to decommission a node that's already down (i.e. that this is what you'd do to remove permanently dead nodes from the UI).


v1.1/recommission-a-node.md, line 3 at r1 (raw file):

---
title: Recommission Nodes
summary: Learn why and how to temporarily stop a CockroachDB node.

Recommissioning is not about temporarily stopping a node, it's only for undoing a (mistaken) decommission. I'd include it on the decommission page instead of giving it its own page.


v1.1/remove-a-node.md, line 2 at r1 (raw file):

---
title: Remove a Node

"Removing" a node implies permanent (decommissioning) removal to me, whereas "stop" is very strongly associated with a temporary stop. I'd swap this doc with the stop-a-node one, so "stop" describes the temporary quit process and "remove a node" is the high-level guide about the two options.


v1.1/remove-a-node.md, line 7 at r1 (raw file):

---

To stop a CockroachDB node running in the background, run the `cockroach quit` [command](cockroach-commands.html) with appropriate flags. To stop a node running in the foreground, use **CTRL + C** or run `cockroach quit` from another shell.

Sending a signal to the process is also a valid option (for both foreground and background processes). This is the mechanism that most process managers would use.


v1.1/remove-a-node.md, line 9 at r1 (raw file):

To stop a CockroachDB node running in the background, run the `cockroach quit` [command](cockroach-commands.html) with appropriate flags. To stop a node running in the foreground, use **CTRL + C** or run `cockroach quit` from another shell.

The `quit` command allows in-flight requests to complete and then shuts down the node. Once a node has been offline for approximately 5 minutes, CockroachDB automatically rebalances replicas from the missing node, using unaffected replicas on other nodes as sources.

Not just the quit command - ctrl-c and signals also allow in-flight requests to complete.
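
For illustration, signal-based shutdown as a process manager might do it can be sketched as follows. The thread does not name a specific signal, so `SIGTERM` is an assumption here, and a `sleep` process stands in for the node so the sketch runs without CockroachDB installed.

```shell
# Stand-in for a node process; in practice this would be the cockroach process.
sleep 300 &
pid=$!

# Ask the process to shut down, as a process manager would.
# SIGTERM is an assumption; the discussion only says "a signal".
kill -TERM "$pid"

# Wait for the process to exit and capture its exit status.
status=0
wait "$pid" || status=$?
echo "process $pid exited with status $status"
```

With a real node, the wait step is where in-flight requests would be allowed to complete before the process exits.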


@tbg
Member

tbg commented Sep 5, 2017

Review status: all files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


_includes/cli/decommission-a-node.html, line 10 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If the node is up, it's often easier to ask it for its ID than to scan the node status output in a large cluster: cockroach sql --security-flags --host=<node-to-be-removed> -e 'show node_id'. The node ID is also printed when the node starts up.

... or it's printed in the admin ui, if what you know is the host it's running on.

Note that the node may be dead, in which case they shouldn't try to talk to the node.


@jseldess
Contributor Author

jseldess commented Sep 5, 2017

TFTR, @tschottdorf and @bdarnell. Will rework soon.

@jseldess
Contributor Author

jseldess commented Sep 5, 2017

_includes/cli/decommission-a-node.html, line 8 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This isn't necessary; you can do this from one of the machines itself.

Hmm, you can ssh onto the node, it's true, but I think I've been told by @mberhault or @bdarnell that it's best to recommend running client commands from elsewhere?


@jseldess
Contributor Author

jseldess commented Sep 5, 2017

v1.1/decommission-a-node.md, line 13 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This should also discuss what it means to decommission a node that's already down (i.e. that this is what you'd do to remove permanently dead nodes from the UI).

In that case, do you just run the cockroach node decommission command and the UI will catch on?


@jseldess
Contributor Author

jseldess commented Sep 5, 2017

v1.1/recommission-a-node.md, line 3 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Recommissioning is not about temporarily stopping a node, it's only for undoing a (mistaken) decommission. I'd include it on the decommission page instead of giving it its own page.

Sorry. This is just a stub page with incorrect copy. I'll remove it and add this content to the decommission page, as you suggest.


@jseldess
Contributor Author

jseldess commented Sep 5, 2017

v1.1/remove-a-node.md, line 2 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

"Removing" a node implies permanent (decommissioning) removal to me, whereas "stop" is very strongly associated with a temporary stop. I'd swap this doc with the stop-a-node one, so "stop" describes the temporary quit process and "remove a node" is the high-level guide about the two options.

Again, sorry. This is just a stub I left in place accidentally. I think I'll try to have one page, Stop or Remove a Node, cover both cases.


@bdarnell
Contributor

bdarnell commented Sep 5, 2017

Review status: all files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


v1.1/decommission-a-node.md, line 13 at r1 (raw file):

Previously, jseldess wrote…

In that case, do you just run the cockroach node decommission command and the UI will catch on?

Yes. (Just don't use --wait=all, or it won't finish)


@tbg
Member

tbg commented Sep 5, 2017

Review status: all files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


_includes/cli/decommission-a-node.html, line 8 at r1 (raw file):

Previously, jseldess wrote…

Hmm, you can ssh onto the node, it's true, but I think I've been told by @mberhault or @bdarnell that it's best to recommend running client commands from elsewhere?

Serious deployments would likely have a controller host, but generally I don't think it's necessary. @mberhault and @bdarnell are definitely the authority on what we want to recommend though.


@bdarnell
Contributor

bdarnell commented Sep 6, 2017

Review status: all files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


_includes/cli/decommission-a-node.html, line 8 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Serious deployments would likely have a controller host, but generally I don't think it's necessary. @mberhault and @bdarnell are definitely the authority on what we want to recommend though.

In general, I think it's fine for our instructions to demonstrate running the command on a node; we don't need to be didactic about this every time. However, because decommission is a command that is sometimes used on a downed node, it's probably a good idea to demonstrate this command on a node other than the one to be decommissioned.


@jseldess jseldess changed the title from "[WIP] Docs for decommissioning and removing nodes" to "Docs for decommissioning and removing nodes" Sep 7, 2017
@jseldess
Contributor Author

jseldess commented Sep 7, 2017

@tschottdorf and @bdarnell, please take another look.

  • stop-a-node.md now focuses on temporary stopping.
  • remove-a-node.md now focuses on decommissioning and node removal.
  • I expanded view-node-details.md to cover the decommission and recommission subcommands and flags. In a follow-up PR, I'll add more details about the response fields for those commands.

@jseldess
Contributor Author

jseldess commented Sep 7, 2017

Review status: 12 of 28 files reviewed at latest revision, 8 unresolved discussions, some commit checks pending.


v1.1/stop-a-node.md, line 9 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

or reacting to hardware failures.

Done.


_includes/cli/decommission-a-node.html, line 8 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

In general, I think it's fine for our instructions to demonstrate running the command on a node; we don't need to be didactic about this every time. However, because decommission is a command that is sometimes used on a downed node, it's probably a good idea to demonstrate this command on a node other than the one to be decommissioned.

Done.


_includes/cli/decommission-a-node.html, line 10 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

... or it's printed in the admin ui, if what you know is the host it's running on.

Note that the node may be dead, in which case they shouldn't try to talk to the node.

Using both of these methods now, in different places.


v1.1/decommission-a-node.md, line 13 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yes. (Just don't use --wait=all, or it won't finish)

Done.


@bdarnell
Contributor

bdarnell commented Sep 7, 2017

:lgtm:


Reviewed 14 of 21 files at r2, 2 of 2 files at r3.
Review status: all files reviewed at latest revision, 8 unresolved discussions, all commit checks successful.


@jseldess
Contributor Author

jseldess commented Sep 8, 2017

Decided to add descriptions for fields in cockroach node subcommand responses.

@jseldess jseldess merged commit 84dd0b5 into master Sep 11, 2017
@jseldess jseldess deleted the decommission-nodes branch September 11, 2017 02:52