From 851b0bb043f7d80f40d37656137d031eca015de5 Mon Sep 17 00:00:00 2001
From: Gyu-Ho Lee
Date: Fri, 16 Dec 2016 11:44:45 -0800
Subject: [PATCH] Documentation: add FAQs on membership operation

Copy Anthony's answer from:

https://github.com/coreos/etcd/issues/6103
https://github.com/coreos/etcd/issues/6114
---
 Documentation/faq.md | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/Documentation/faq.md b/Documentation/faq.md
index bfa47e2567ff..6fa5b58d8731 100644
--- a/Documentation/faq.md
+++ b/Documentation/faq.md
@@ -62,6 +62,22 @@ With longer latencies, the default etcd configuration may cause frequent electio
 
 etcdctl provides a `snapshot` command to create backups. See [backup][backup] for more details.
 
+#### Always remove first when replacing a member?
+
+When replacing an etcd node, we recommend removing the old member first and then adding its replacement.
+
+etcd employs distributed consensus based on a quorum model; (n/2)+1 members (using integer division), a majority, must agree on a proposal before it can be committed to the cluster. These proposals include key-value updates and membership changes. This model rules out split-brain inconsistency. The downside is that permanent quorum loss is catastrophic.
+
+How this applies to membership: If a 3-member cluster has 1 downed member, it can still make forward progress because the quorum is 2 and 2 members are still live. However, adding a new member to a 3-member cluster will increase the quorum to 3 because 3 votes are required for a majority of 4 members. Since the quorum increased, this extra member buys nothing in terms of fault tolerance; the cluster is still one node failure away from being unrecoverable.
+
+Additionally, that new member is risky because it may turn out to be misconfigured or incapable of joining the cluster. In that case, there's no way to recover quorum because the cluster has two members down and two members up, but needs three votes to change membership to undo the botched membership addition. By default, etcd rejects member add attempts that could take down the cluster in this manner.
+
+On the other hand, if the downed member is removed from cluster membership first, the number of members becomes 2 and the quorum remains at 2. Following that removal by adding a new member will also keep the quorum steady at 2. So, even if the new node can't be brought up, it's still possible to remove the new member through quorum on the remaining live members.
+
+#### Why so strict about membership changes?
+
+etcd sets `strict-reconfig-check` in order to reject reconfiguration requests that would cause quorum loss. Abandoning quorum is really risky (especially when the cluster is already in a bad way). We're aware that losing quorum is painful, but bypassing quorum for membership changes could lead to full-fledged cluster inconsistency, and that would be even worse in many applications ("disk geometry corruption" being a candidate for most terrifying).
+
 ### Performance
 
 #### How should I benchmark etcd?
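
Reviewer note, not part of the patch: the quorum arithmetic behind the remove-first recommendation in the new FAQ entry can be checked with a small sketch. The helper names `quorum` and `has_quorum` are hypothetical and are not etcd APIs; the sketch assumes a simple majority rule over voting members and models only the 3-member scenario described in the FAQ text.

```python
def quorum(members: int) -> int:
    """Votes needed for a majority of `members` voting members."""
    return members // 2 + 1


def has_quorum(members: int, live: int) -> bool:
    """True if `live` reachable members can still commit proposals."""
    return live >= quorum(members)


# Starting point from the FAQ: 3 members, 1 of them down (2 live).
# Add first: membership grows to 4; if the new member never joins, 2 are live.
add_first_stuck = not has_quorum(members=4, live=2)

# Remove first: membership shrinks to 2 (both live), then grows back to 3.
after_remove = has_quorum(members=2, live=2)
after_add_failed = has_quorum(members=3, live=2)

print(add_first_stuck, after_remove, after_add_failed)  # True True True
```

In the add-first path the cluster needs 3 votes but has only 2 live members, so it cannot even vote to undo the addition. In the remove-first path the quorum stays at 2 throughout, so the two live members can still remove a replacement that fails to start.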