membership: prevent quorum loss from membership change #6420

sinsharat · 2016-09-13T17:21:15Z

Steps to reproduce:

Started a standalone member using the command ./etcd
This started listening on localhost:2379 for client URLs and on localhost:2380 for peer URLs.
Started another etcd member as below:
.\etcd.exe --initial-advertise-peer-urls http://localhost:22380 --listen-peer-urls http://localhost:22380 --advertise-client-urls http://localhost:22379 --listen-client-urls http://localhost:22379
Added the member to the cluster with a wrong peer URL by mistake: ./etcdctl member add newMember --peer-urls=https://localhost:22379
Got output as:
Member 6cba1075ac9d26b5 added to cluster cdf818194e3a8c32
Now trying to update the member to the correct URL using the command:
./etcdctl member update 6cba1075ac9d26b5 --peer-urls=https://localhost:22380
But getting the below error:
Error: context deadline exceeded

Let me know if I have misunderstood the feature and what I need to do to correct this.

Thanks.

gyuho · 2016-09-13T17:39:16Z

@sinsharat Try specifying the etcdctl --endpoints flag with the new member?

sinsharat · 2016-09-13T17:46:57Z

@gyuho my issue is that if someone adds a new member with an invalid URL by mistake, the entire cluster becomes unusable. Anything I try, even removing the newly added member, fails, because the only member that is up is constantly going into election and not handling any requests.
Would it not be better to have a timeout for adding a new member, so that if a member is added with a wrong URL and stays unstarted for a specific period, it is removed from the cluster automatically?
This would prevent the cluster from going into continuous election, and it would continue working normally after the specified period.
Thanks!

gyuho · 2016-09-13T17:55:41Z

Now I understand your issue. Yeah, I think it makes sense to have some timeout for membership change for the case you mentioned. In most cases, people still have quorum (e.g. add 1 member to 2-node cluster), so they can revert the membership change. But adding 1 member to a 1-node cluster can be problematic.

Defer to @xiang90 @heyitsanthony

heyitsanthony · 2016-09-13T18:46:58Z

@sinsharat Autoremove for 1->2 means the original node will have to make a membership change without quorum. Too easy to get into a split brain mess.

Could the etcdserver probe the peer for the 1->2 case to see if the new member is up before submitting the membership change?

sinsharat · 2016-09-13T18:55:15Z

@heyitsanthony yes, I totally agree. It would be fine if the member is not added blindly; instead the server keeps trying, and only once it is able to contact the new member does the member get added to the cluster. That would prevent the cluster from becoming unusable. Even though I described a two-member cluster, the same scenario can happen when a three-node cluster is intended and two new members are to be added to a single member.

gyuho · 2016-09-14T10:57:03Z

@heyitsanthony Do you have an easy solution in mind?

The member must first be added to the existing node in order to pass ValidateClusterAndAssignIDs at https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L279.

So in the current implementation, probing a new member before committing the membership change is impossible.

heyitsanthony · 2016-09-14T21:47:36Z

@gyuho no quick fix in mind. That bootstrap path would have to be changed a little for my suggestion to work. I believe (not 100% sure) the new node could force add itself to the remote peer list on validate, then start running without corrupting the cluster even if it hasn't been added yet.

xiang90 · 2016-11-10T18:03:50Z

I can see an easy fix here for the 1 -> 2 case. A one-member cluster can commit any proposal locally and immediately. So what we can do is let the one-member cluster buffer the conf change and commit it once the 2nd member contacts it. One problem is that the member might forget the configuration change request after a restart. To fix that, we need to persist the buffer somehow, which makes the problem more complicated. But I think we can do the 1st step to start with.
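
The buffering idea above could look roughly like the following. This is only a sketch under assumptions, not etcd code: the `singleNode` and `pendingAdd` types, their methods, and the rule of matching the contacting peer by URL are all hypothetical and invented for illustration.

```go
package main

import "fmt"

// pendingAdd is a hypothetical in-memory buffer entry for a member-add
// request that has not been committed yet. It is lost on restart, which
// is the persistence problem mentioned in the comment above.
type pendingAdd struct {
	peerURL string
}

// singleNode is a hypothetical model of a one-member cluster.
type singleNode struct {
	members []string    // peer URLs of committed members
	pending *pendingAdd // buffered conf change, nil if none
}

// RequestAdd buffers the membership change instead of committing it.
func (s *singleNode) RequestAdd(peerURL string) {
	s.pending = &pendingAdd{peerURL: peerURL}
}

// CancelAdd drops a buffered change. The cluster still has one member,
// so nothing has to be reverted through quorum.
func (s *singleNode) CancelAdd() {
	s.pending = nil
}

// OnPeerContact is called when a peer contacts this node. The buffered
// add is committed only if the contacting peer matches the buffered URL
// (an assumption of this sketch).
func (s *singleNode) OnPeerContact(peerURL string) bool {
	if s.pending == nil || s.pending.peerURL != peerURL {
		return false
	}
	s.members = append(s.members, peerURL)
	s.pending = nil
	return true
}

func main() {
	n := &singleNode{members: []string{"http://localhost:2380"}}

	// Wrong URL from the report: the real peer never matches, so the
	// cluster stays at one member and keeps serving requests.
	n.RequestAdd("https://localhost:22379")
	fmt.Println(n.OnPeerContact("http://localhost:22380"), len(n.members))

	// The mistake can be undone locally, then retried with the right URL.
	n.CancelAdd()
	n.RequestAdd("http://localhost:22380")
	fmt.Println(n.OnPeerContact("http://localhost:22380"), len(n.members))
}
```

With the wrong URL the add simply never commits, so quorum stays at 1 and the request can be cancelled.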

gyuho · 2016-11-10T18:05:52Z

@xiang90

once the 2nd member contacts it

How would we know this? Raft progress?

xiang90 · 2016-11-10T18:07:39Z

@gyuho

No. The newly added member will contact one existing member to get a list of existing members. For the one-member case, it can only contact that member.

xiang90 · 2017-03-28T21:04:21Z

do not block our release. moving to 3.3

justinsb · 2018-05-20T20:07:26Z

I'm likely misunderstanding, but don't Raft's cluster membership rules prevent this happening? In the example given originally, we wouldn't have a strict majority of the new (2 member) cluster and thus the cluster membership change must fail.

gyuho · 2018-05-21T22:32:17Z

@justinsb

we wouldn't have a strict majority of the new (2 member) cluster and thus the cluster membership change must fail.

Correct.

Once you add a new member to a single node cluster, the quorum number becomes 2. That is why the member add command to a single-node cluster cannot be reverted. That is, if member add was done wrong but still committed, you can not remove it with member remove command, now that the quorum is 2.

We want to prevent such member add requests, either by requiring timeouts or by some other means.
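
To make the quorum arithmetic above concrete, here is a small sketch. The function names `quorum` and `faultTolerance` are made up for this example; the formula is the usual strict majority of voting members.

```go
package main

import "fmt"

// quorum returns how many voting members must acknowledge a proposal
// in a cluster of n voting members (a strict majority).
func quorum(n int) int {
	return n/2 + 1
}

// faultTolerance returns how many voting members can be unavailable
// while the cluster can still commit proposals.
func faultTolerance(n int) int {
	return n - quorum(n)
}

func main() {
	for n := 1; n <= 5; n++ {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

Going from 1 to 2 members raises the quorum from 1 to 2 while the tolerated failures stay at 0, so an unreachable second member blocks every later proposal, including the member remove.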

gyuho · 2018-05-21T22:33:34Z

Moving to v3.5. We still have many other things planned for v3.4.

maxenglander · 2019-06-22T19:09:06Z

Would another viable solution for the 1->2 problem be to have server 1 first add server 2 as a non-voting member, replicate log entries to server 2 until server 2 is up to date, and only then complete the 1->2 quorum transition?

This is the approach the Raft paper recommends for addressing the challenge of new servers not having the full log, and thus being unable to accept new entries. I think this same approach would also solve the present issue of a new server being unavailable due to misconfiguration.
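
A rough sketch of that flow, under assumptions: the `cluster` and `member` types and their methods below are hypothetical and are not etcd's or raft's API. The point is only that a non-voting member does not count toward quorum and is promoted once its log has caught up.

```go
package main

import (
	"errors"
	"fmt"
)

// member is a hypothetical cluster member. matchIndex is the highest
// log index known to be replicated on it.
type member struct {
	name       string
	voter      bool
	matchIndex uint64
}

// cluster is a hypothetical view from the leader. lastIndex is the
// leader's last log index.
type cluster struct {
	members   []*member
	lastIndex uint64
}

// quorum counts voting members only; non-voting members are ignored.
func (c *cluster) quorum() int {
	voters := 0
	for _, m := range c.members {
		if m.voter {
			voters++
		}
	}
	return voters/2 + 1
}

// addNonVoter adds a member that receives log entries but has no vote.
func (c *cluster) addNonVoter(name string) *member {
	m := &member{name: name}
	c.members = append(c.members, m)
	return m
}

// promote turns a non-voter into a voter, but only once it has caught
// up with the leader's log.
func (c *cluster) promote(m *member) error {
	if m.matchIndex < c.lastIndex {
		return errors.New("member is not up to date, refusing to promote")
	}
	m.voter = true
	return nil
}

func main() {
	c := &cluster{
		members:   []*member{{name: "server1", voter: true, matchIndex: 10}},
		lastIndex: 10,
	}
	s2 := c.addNonVoter("server2")
	fmt.Println("quorum with non-voter:", c.quorum())

	// A misconfigured or unreachable server2 never catches up, so the
	// promotion is refused and quorum stays where it was.
	fmt.Println("promote too early:", c.promote(s2))

	s2.matchIndex = 10 // server2 has replicated the full log
	fmt.Println("promote when caught up:", c.promote(s2))
	fmt.Println("quorum after promotion:", c.quorum())
}
```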

jingyih · 2019-06-28T03:57:33Z

@maxenglander I agree. Non-voting member should solve this issue.

@gyuho @xiang90 Can we close this issue? Or do we want to fix this issue without using the non-voting member feature?

stale · 2020-04-07T00:51:05Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

sinsharat changed the title ~~etcdctl v3: Not able perform member update on unstarted member~~ etcdctl v3: Not able perform member update on unstarted member, as a result the started member is constantly going for election and timing out. Sep 13, 2016

gyuho added the area/performance label Sep 14, 2016

gyuho changed the title ~~etcdctl v3: Not able perform member update on unstarted member, as a result the started member is constantly going for election and timing out.~~ membership: prevent quorum loss from membership change Sep 15, 2016

ethernetdan mentioned this issue Oct 3, 2016

Only single member clusters can be used as Seeds coreos/etcd-operator#162

Closed

xiang90 added this to the v3.2.0 milestone Nov 10, 2016

gyuho mentioned this issue Jan 18, 2017

etcdctl member list ==> Failed to get leader: client: etcd cluster is unavailable or misconfigured #7171

Closed

xiang90 modified the milestones: v3.3.0, v3.2.0 Mar 28, 2017

luomiao mentioned this issue Jul 20, 2017

Shared plugin: etcd starting up functions. vmware-archive/vsphere-storage-for-docker#1611

Merged

gyuho modified the milestones: v3.4.0, v3.3.0 Sep 6, 2017

gyuho added area/raft type/feature and removed area/performance labels Feb 25, 2018

gyuho modified the milestones: etcd-v3.4, etcd-v3.5 May 21, 2018

JTrotta mentioned this issue Dec 17, 2018

Question: add member without change the quorum #10329

Closed

stale bot added the stale label Apr 7, 2020

stale bot closed this as completed Apr 28, 2020
