cli: add command or flag to actively remove node from cluster #6198
This somewhat decomposes into a few more granular features:
Whether we ultimately want a command like that has not been determined (see cockroachdb#6197), but the code isn't currently working correctly (cockroachdb#6196) and it's set up in the legacy way, which hinders the upcoming migration of the `quit` command. A re-implementation of the code would likely want to erase the data on the server side (and not from the cli).

cc @BramGruneir
cc @jseldess for documentation

Closes cockroachdb#6197, closes cockroachdb#6198.
So... there is no way to retire a node in a cluster?
There is (or at least should be), but this issue was closed prematurely.
That's what I figured =)
Yep, thanks for pointing this out. More precisely, here are the intermediate steps for draining a node from the cluster:
The first two steps are currently too manual, especially the waiting. Once we have them figured out, it might be worth packaging the result into a single command.
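The individual steps did not survive in this thread, but the "waiting" part of the manual process — watching replicas drain off the retiring node — can be sketched as a small shell loop. This is only illustrative: the insecure admin port (8080), the `/_status/vars` endpoint, and the `replicas` metric name are assumptions and may not match the CockroachDB version in question.

```
# Illustrative only: poll the retiring node's Prometheus-style endpoint and
# sum its per-store "replicas" gauges until they reach zero. Host, port and
# metric name are assumptions; adjust for your deployment.
while true; do
  count=$(curl -s http://retiring-node:8080/_status/vars |
          awk '/^replicas[{ ]/ {sum += $2} END {printf "%d\n", sum}')
  echo "replicas remaining on node: ${count}"
  [ "${count:-1}" -eq 0 ] && break
  sleep 30
done
```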
Why is that an inappropriate use? ZoneConfigs describe what data is where and how often, and when I want all data of a node to move somewhere else, it seems like exactly the appropriate use - building a second mechanism for a special case of that seems unwise. Where value can be added is in making that process simple and intuitive, but nothing is required there that we don't want anyway. For example, we certainly want a tool that can tell you whether a given replication zone has all of its constraints satisfied (i.e. is everything replicated in the right way), with the high-level version of that giving you the cluster's overall status. This tool would tell you exactly when it's safe to take down the node you're trying to get rid of.

I agree that it's not clear how the ZoneConfig would in general support the removal of nodes (it seems easy enough to do in any standard situation, but harder if the zone config is completely trivial or if there are a gazillion zones). What else did you have in mind?
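For concreteness, here is a rough sketch of how the ZoneConfig route might look in practice: tag the node you want to empty with a store attribute, then constrain the relevant replication zone(s) away from that attribute and let the rebalancer do the rest. The `--attrs` flag and the `zone set` subcommand existed at the time, but the negative-constraint YAML below is an assumption about the zone-config format and will not match every version.

```
# Sketch only -- zone-config syntax varies between CockroachDB versions.

# 1. The node to be retired is assumed to have been started with a
#    distinguishing store attribute, e.g.:
#      cockroach start --attrs=retiring --store=/mnt/data1 --join=<cluster>

# 2. Steer replicas away from stores carrying that attribute (hypothetical
#    constraint syntax):
cat <<EOF | cockroach zone set .default -f -
constraints: [-retiring]
EOF

# 3. Wait until the node holds no replicas (e.g. with a loop like the one
#    sketched earlier), then stop it.
```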
Ok, sure sounds like it. Glad we're on the same page.
Any progress on this? I'd like to cleanly remove a server from my cluster. In the meantime, is it safe to simply turn one of the servers off? Would the others eventually give up on it and stop trying to talk to it?
As a note, I see some work on an RFC on this:
Yes, it's safe to simply turn one of the servers off. The data will be re-replicated onto other nodes.
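Before wiping or repurposing the machine, it is worth confirming that the cluster is no longer under-replicated. A minimal sketch, assuming the `/_status/vars` endpoint and a `ranges_underreplicated` metric (both assumptions about the version in use):

```
# Illustrative check: sum the under-replicated-ranges gauge across the
# remaining nodes; only retire the hardware once this reports zero.
# Hostnames and the insecure HTTP port are placeholders.
total=0
for host in node1:8080 node2:8080 node3:8080; do
  n=$(curl -s "http://${host}/_status/vars" |
      awk '/^ranges_underreplicated[{ ]/ {sum += $2} END {printf "%d\n", sum}')
  total=$((total + n))
done
echo "under-replicated ranges: ${total}"
```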
I'm moving this off of 1.0.
Ignore the first commit -- that's cockroachdb#16968. See cockroachdb#6198.

This implements the management portion of node decommissioning. It leans heavily on @neeral's WIP PR cockroachdb#17157.

- add `node decommission [--wait=all|live|none] [nodeID1 nodeID2 ...]`
- add `cockroach quit --decommission`
- add a comprehensive acceptance test that puts all of these to good use.

It works surprisingly well, but as you'd expect there are kinks. Specifically, in the acceptance test, the invocation `quit --decommission` tends to hang for extended periods of time, sometimes "forever". In the most recent run, this was traced to the fact that the lease holder for a replica remaining on a decommissioning node had *no* ZoneConfig in gossip, which effectively disables its leaseholder replication checks. It is not clear whether this is related to decommissioning in the first place, though the leaseholder node was itself decommissioned, recommissioned, and restarted when this occurred.

The acceptance test also requires at least four nodes to work, and it takes around 10 minutes, so we may only want to run a reduced version during regular CI, with the long one running nightly. The invocation for the failing acceptance test is:

```
make acceptance TESTS=Decom TESTFLAGS='-v -show-logs -nodes=4' TESTTIMEOUT=20m
```

(If the test runs and fails with the localcluster shim complaining about unexpected events, that's because I haven't had a chance to tell it about the node we're intentionally `--quit`ting yet -- or rather, to test what I did to tell it about that.)

cc @a-robinson

Closes cockroachdb#17157.
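Taken together, the operator workflow this PR aims at looks roughly like the following. The command and flag spellings are the ones named in the description above; the host names and node ID are placeholders, and exact behaviour may differ in the released version.

```
# 1. Mark node 4 as decommissioning and wait until all of its replicas have
#    been moved to other live nodes (--wait=all per the description above).
cockroach node decommission 4 --wait=all --host=cockroach-1.example.com

# 2. Once draining is complete, shut the node down for good. Run against the
#    node being removed.
cockroach quit --decommission --host=cockroach-4.example.com
```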
mark
Remaining here is the UI work (hiding nodes that are successfully decommissioned in the UI, at least after a while) and adding an event (that shows up in the UI) when a node is decommissioning/decommissioned. See the last commit in #17419 for beginnings of this. We'll also want another pass over the CLI commands (#17419) before writing the documentation. cc @jseldess
Thanks for the update, @tschottdorf.
@couchand can you help @tschottdorf get up to speed with the UI changes? @tschottdorf we're spreading out UI work, because we don't have nearly enough engineers conversant with the front-end to centralize admin UI development for new features. I think this should be a gentle intro given the minimal admin UI change needed for this.
@cuongdo sgtm, but note that I still have version migrations on my plate and so I won't get to this for perhaps another two weeks.
@tschottdorf @cuongdo I'm happy to take a few minutes this week to talk through what will be needed here (but it has to be this week, I'm out for a month after Thursday...). Moving forward, let's work to get the implementation of UI pieces of major features started well in advance of feature freeze, to be sure that there's plenty of time for the design to bake. And ideally let's try to include some implementation details in RFCs where possible.
@benesch has agreed to work on the UI component for this.
Thanks @benesch, appreciate it! You'll want to look at the ui-related commit in https://github.com/cockroachdb/cockroach/pull/17157/commits.
We need some kind of command or flag to actively remove a node from the cluster when there's no intention to bring that node back up.
At a minimum, this would: