cli: add command or flag to actively remove node from cluster #6198

Closed
cuongdo opened this issue Apr 21, 2016 · 23 comments · Fixed by #6780
Comments

@cuongdo
Contributor

cuongdo commented Apr 21, 2016

We need some kind of command or flag to actively remove a node from the cluster when there's no intention to bring that node back up.

At a minimum, this would:

  • Ensure that the node does not show up in the admin UI in the future
  • Migrate replicas off the stores on that node in a clean way
  • Prevent the stream of log entries that currently occurs when you kill a node. For example:
I160421 11:32:01.362135 /Users/cdo/go/src/google.golang.org/grpc/clientconn.go:317  Failed to dial cdo-cockroach.local:26259: grpc: timed out trying to connect; please retry.
W160421 11:32:10.651793 ts/db.go:98  error writing time series data: failed to send RPC: too many errors encountered (1 of 1 total): rpc error: code = 4 desc = context deadline exceeded
  • Shut down the CockroachDB instance on the node.
  • Probably log an entry in the event log
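
For illustration, here is a minimal sketch of what such a flow could look like on the command line, using the command names that were eventually added later in this thread (`node decommission` and `quit --decommission`); the node ID, addresses, and exact flags are placeholders and may differ by version:

```
# Mark node 4 for permanent removal and block until its replicas have been
# migrated to other nodes.
cockroach node decommission --wait=all --insecure --host=localhost:26257 4

# Then drain and shut down the CockroachDB process on the node being removed.
cockroach quit --decommission --insecure --host=<node4-address>
```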
@cuongdo cuongdo added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Apr 21, 2016
@tbg
Member

tbg commented Apr 21, 2016

This somewhat decomposes into a few more granular features:

  • drain a node (i.e. let all leader leases expire and don't obtain new
    ones, disable all queues, maybe prevent new SQL connections - but keep
    making Raft progress) and wait until the slate is clean
    Used for clean shutdown (and pretty much always part of any type of
    shutdown)
  • drain data off a node (i.e. have all or some subset of ranges migrate
    away, and wait until that has happened)
    Used for decommissioning, but could also be used to change hard-drives
    cleanly (i.e. drain data off a store).
  • drain for update/migration (see the {F,Unf}reeze proposal in [RFC] Proposer-evaluated KV #6166), and finally
  • decommission (permanently remove), which is "most of the above".
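
As a rough, hedged mapping of these pieces onto the CLI as it eventually evolved later in this thread (addresses and IDs are placeholders):

```
# Temporary drain: gracefully stop a node that is expected to come back; its
# replicas stay in place and catch up via Raft once it rejoins.
cockroach quit --insecure --host=<node-address>

# Permanent removal (decommission): migrate all replicas off the node first,
# then stop it.
cockroach node decommission --wait=all --insecure <nodeID>
```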


tbg added a commit to tbg/cockroach that referenced this issue May 16, 2016
Whether we ultimately want a command like that is not determined (see cockroachdb#6197),
but the code isn't currently working correctly (cockroachdb#6196) and it's set up in the
legacy way, which hinders the upcoming migration of the `quit` command.
A re-implementation of the code would likely want to erase the data on the
server side (and not from the cli).

cc @BramGruneir
cc @jseldess for documentation

Closes cockroachdb#6197, closes cockroachdb#6198.
tbg added a commit to tbg/cockroach that referenced this issue May 18, 2016
@tbg tbg closed this as completed in #6780 May 18, 2016
@kwando

kwando commented Oct 10, 2016

So... there is no way to retire a node in a cluster?

@tbg
Member

tbg commented Oct 10, 2016

There is (or at least should be), but this issue was closed prematurely.

@tbg tbg reopened this Oct 10, 2016
@kwando

kwando commented Oct 10, 2016

That's what I figured =)

@tbg tbg added this to the Q3 milestone Oct 10, 2016
@tbg
Member

tbg commented Oct 10, 2016

Yep, thanks for pointing this out.

More precisely, here are the intermediate steps for draining a node out of the cluster:

  • use ZoneConfigs to initiate moving the node's data elsewhere
  • wait until that has happened
  • turn off the node.

The first two steps are currently too manual, especially the waiting. Once we have them figured out, it might be worth packaging the result into a CLI command.
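
A sketch of that manual workflow, assuming the node being retired was started with a distinguishing attribute (`--attrs=retiring` here is purely illustrative; constraints match attributes or localities, not node IDs) and using the zone config CLI of this era, whose exact flags may differ between versions:

```
# Add a prohibited constraint to the default zone so the allocator rebalances
# replicas off any node carrying the hypothetical "retiring" attribute.
echo 'constraints: [-retiring]' | cockroach zone set .default -f -

# Wait until the admin UI shows the node holding no more replicas, then stop it.
cockroach quit --insecure --host=<retiring-node-address>
```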

@petermattis
Collaborator

Currently ZoneConfigs are the only way to drain a node, but I think that is an inappropriate use. We really want a simpler mechanism to mark a node as draining.

@tbg
Member

tbg commented Oct 10, 2016

Why is that an inappropriate use? ZoneConfigs describe what data lives where and in how many copies, and when I want all of a node's data to move somewhere else, it seems like exactly the appropriate use - building a second mechanism for a special case of that seems unwise.

Where value can be added is in making that process simple and intuitive, but nothing is required there that we don't want anyway. For example, we certainly want a tool that can tell you whether a given replication zone has all of its constraints satisfied (i.e. is everything replicated in the right way); the high-level version of that gives you the cluster's overall status. Such a tool would tell you exactly when it's safe to take down the node you're trying to get rid of.
Also, an explicit remove command can announce to the cluster that the node isn't coming back, which can avoid triggering certain alerts.

I agree that it's not clear how the ZoneConfig would in general support the removal of nodes (it seems easy enough to do in any standard situation, but harder if the zone config is completely trivial or if there are a gazillion zones). What else did you have in mind?

@petermattis
Collaborator

ZoneConfigs are settable by the user and there might be dozens or hundreds. Changing all of them in order to drain a node doesn't seem reasonable. Conceptually, what we'd want is something where there is an implicit -draining attribute on every ZoneConfig and marking a node as draining adds that attribute. We'd still go through the standard allocator/rebalancing logic. Re-reading this, perhaps we're saying the same thing.

@tbg
Member

tbg commented Oct 10, 2016

Ok, sure sounds like it. Glad we're on the same page.

@kaeverens

any progress on this? I'd like to cleanly remove a server from my cluster.

in the meantime, is it safe to simply turn one of the servers off? would the others eventually give up on it and stop trying to talk to it?

@kaeverens

kaeverens commented Dec 7, 2016

As a note, I see some work on an RFC for this:
https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/drain_modes.md

@bdarnell
Contributor

bdarnell commented Dec 8, 2016

Yes, it's safe to simply turn one of the servers off. The data will be re-replicated onto other nodes.

@dianasaur323 dianasaur323 modified the milestones: 1.0, Q3 Feb 9, 2017
@spencerkimball spencerkimball modified the milestones: Later, 1.0 Mar 30, 2017
@spencerkimball
Member

I'm moving this off of 1.0.

@tbg tbg assigned tbg and unassigned asubiotto May 31, 2017
@petermattis petermattis modified the milestones: 1.1, Later Jun 1, 2017
tbg added a commit to tbg/cockroach that referenced this issue Aug 1, 2017
Ignore the first commit -- that's cockroachdb#16968.

See cockroachdb#6198. This implements the "management portion" of that issue.
It leans heavily on @neeral's WIP PR cockroachdb#17157.

- add `node decommission [--wait=all|live|none] [nodeID1 nodeID2 ...]`
- add `cockroach quit --decommission`
- add a comprehensive acceptance test that puts all of these to good use.

It works surprisingly well, but as you'd expect there are kinks. Specifically,
in the acceptance test, the invocation `quit --decommission` tends to hang for
extended periods of time, sometimes "forever". In the most recent run, this was
traced to the fact that the lease holder for a replica remaining on
a decommissioning node had *no* ZoneConfig in gossip, which effectively
disables its leaseholder replication checks. It is not clear whether this is
related to decommissioning in the first place, though the leaseholder node was
itself decommissioned, recommissioned, and restarted when this occurred.

The acceptance test also requires at least four nodes to work, and it takes
around 10 minutes, so we may want to run only a reduced version during regular
CI, with the long one running nightly.

The invocation for the failing acceptance test is:

```
make acceptance TESTS=Decom TESTFLAGS='-v -show-logs -nodes=4' TESTTIMEOUT=20m
```

(If the test runs and fails with the localcluster shim complaining about
unexpected events, that's because I haven't yet had a chance to tell it about
the node we're intentionally `quit`ting; or rather, to test what I did to tell
it about that.)

cc @a-robinson

Closes cockroachdb#17157.
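
For reference, a hedged usage sketch of the commands added here, under my reading of the `--wait` modes (`all` blocks until the node has shed its replicas; `none` marks it as decommissioning and returns immediately); the node ID is a placeholder:

```
# Decommission node 4 and block until its data has been moved elsewhere.
cockroach node decommission --wait=all --insecure 4

# Or mark node 4 as decommissioning and return immediately; re-running the
# command later reports progress.
cockroach node decommission --wait=none --insecure 4
```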
tbg added a commit to tbg/cockroach that referenced this issue Aug 3, 2017
tbg added a commit to tbg/cockroach that referenced this issue Aug 4, 2017
tbg added a commit to tbg/cockroach that referenced this issue Aug 4, 2017
@EamonZhang
Contributor

mark

@tbg
Member

tbg commented Aug 7, 2017

Remaining here is the ui work (hiding nodes that are successfully decommissioned in the ui, at least after a while) and adding an event (that shows up in the ui) when a node is decommissioning/decommissioned.

See the last commit in #17419 for beginnings of this.

We'll also want another pass over the cli commands (#17419) before writing the documentation.

cc @jseldess

@jseldess
Contributor

jseldess commented Aug 7, 2017

Thanks for the update, @tschottdorf.

@cuongdo
Contributor Author

cuongdo commented Aug 7, 2017

@couchand can you help @tschottdorf get up to speed with the UI changes?

@tschottdorf we're spreading out UI work because we don't have nearly enough engineers conversant with the front end to centralize admin UI development for new features. I think this should be a gentle intro, given the minimal admin UI change needed for this.

@tbg
Member

tbg commented Aug 7, 2017

@cuongdo sgtm, but note that I still have version migrations on my plate and so I won't get to this for perhaps another two weeks.

@couchand
Contributor

couchand commented Aug 7, 2017

@tschottdorf @cuongdo I'm happy to take a few minutes this week to talk through what will be needed here (but it has to be this week, I'm out for a month after Thursday...).

Moving forward, let's work to get the implementation of UI pieces of major features started well in advance of feature freeze, to be sure that there's plenty of time for the design to bake. And ideally let's try to include some implementation details in RFCs where possible.

@cuongdo
Contributor Author

cuongdo commented Aug 8, 2017

@benesch has agreed to work on the UI component for this

@tbg
Member

tbg commented Aug 8, 2017

Thanks @benesch, appreciate it! You'll want to look at the ui-related commit in https://github.com/cockroachdb/cockroach/pull/17157/commits.

@tbg
Member

tbg commented Aug 31, 2017

I'll call this done, since the only remaining item is tracked in #17677, though the test is flaky (#17995).

@tbg tbg closed this as completed Aug 31, 2017