membership: prevent quorum loss from membership change #6420

Closed
sinsharat opened this issue Sep 13, 2016 · 17 comments

Comments

@sinsharat
Contributor

Steps to reproduce:

  1. Started a standalone member with the command ./etcd
    It listens on localhost:2379 for client URLs and on localhost:2380 for peer URLs.
  2. Started another etcd member as below:
    .\etcd.exe --initial-advertise-peer-urls http://localhost:22380 --listen-peer-urls http://localhost:22380 --advertise-client-urls http://localhost:22379 --listen-client-urls http://localhost:22379
  3. By mistake, added the member with the wrong peer URL: ./etcdctl member add newMember --peer-urls=https://localhost:22379
    Got output as:
    Member 6cba1075ac9d26b5 added to cluster cdf818194e3a8c32
  4. Now trying to update the member to the correct URL using the command:
    ./etcdctl member update 6cba1075ac9d26b5 --peer-urls=https://localhost:22380
    But getting the below error:
    Error: context deadline exceeded

Let me know if I have misunderstood the feature and what I need to do to correct this.

Thanks.

@sinsharat sinsharat changed the title etcdctl v3: Not able perform member update on unstarted member etcdctl v3: Not able perform member update on unstarted member, as a result the started member is constantly going for election and timing out. Sep 13, 2016
@gyuho
Contributor

gyuho commented Sep 13, 2016

@sinsharat Could you try specifying the etcdctl endpoints flag with the new member?

@sinsharat
Contributor Author

sinsharat commented Sep 13, 2016

@gyuho my issue is that if someone adds a new member with an invalid URL by mistake, the entire cluster becomes unusable. Anything I try, even removing the newly added member, fails, because the only member that is up keeps going into election and does not handle any requests.
Would it not be better to have a timeout for adding a new member, so that a member added by mistake with a wrong URL is removed from the cluster automatically if it stays unstarted for a specific period?
This would stop the cluster from going into continuous election, and it would resume working normally after the specified period.
Thanks!

@gyuho
Contributor

gyuho commented Sep 13, 2016

Now I understand your issue. Yeah, I think it makes sense to have some timeout for membership change for the case you mentioned. In most cases, people still have quorum (e.g. adding 1 member to a 2-node cluster), so they can revert the membership change. But adding 1 member to a 1-node cluster can be problematic.

Defer to @xiang90 @heyitsanthony

@heyitsanthony
Contributor

@sinsharat Autoremove for 1->2 means the original node will have to make a membership change without quorum. Too easy to get into a split brain mess.

Could the etcdserver probe the peer for the 1->2 case to see if the new member is up before submitting the membership change?
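A rough sketch of what such a probe could look like, written in Python purely for illustration. It is not etcd code: `peer_reachable` and `DEFAULT_PORTS` are hypothetical names, and the check only confirms that something accepts TCP connections at the peer URL, not that an etcd member is actually serving there.

```python
import socket
from urllib.parse import urlparse

# Fallback ports when the URL does not name one (assumption for this sketch).
DEFAULT_PORTS = {"http": 80, "https": 443}


def peer_reachable(peer_url: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the peer URL's host:port succeeds."""
    try:
        parsed = urlparse(peer_url)
        host = parsed.hostname
        port = parsed.port or DEFAULT_PORTS.get(parsed.scheme)
        if host is None or port is None:
            return False
        # Open and immediately close a connection; no data is exchanged.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except (OSError, ValueError):
        # Refused, timed out, unresolvable host, or malformed port.
        return False
```

A caller would run something like `peer_reachable("https://localhost:22380")` before proposing the membership change and refuse the add when it returns False.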

@sinsharat
Contributor Author

@heyitsanthony yes, I totally agree. It would be fine if the member is not added blindly: the server keeps trying, and only once it is able to contact the new member does the member get added to the cluster. That would prevent the cluster from becoming unusable. Although I described a two-member cluster, the same scenario can happen when a three-node cluster is intended and two new members are added to a single-member cluster.

@gyuho
Contributor

gyuho commented Sep 14, 2016

@heyitsanthony Do you have an easy solution in mind?

The member must first be added to the existing node in order to pass ValidateClusterAndAssignIDs at https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L279.

So in the current implementation, probing a new member before committing the membership change is impossible.

@heyitsanthony
Contributor

@gyuho no quick fix in mind. That bootstrap path would have to be changed a little for my suggestion to work. I believe (not 100% sure) the new node could force add itself to the remote peer list on validate, then start running without corrupting the cluster even if it hasn't been added yet.

@gyuho gyuho changed the title etcdctl v3: Not able perform member update on unstarted member, as a result the started member is constantly going for election and timing out. membership: prevent quorum loss from membership change Sep 15, 2016
@xiang90 xiang90 added this to the v3.2.0 milestone Nov 10, 2016
@xiang90
Contributor

xiang90 commented Nov 10, 2016

I can see an easy fix here for the 1 -> 2 case. A one-member cluster can commit any proposal locally and immediately. So what we can do here is let the one-member cluster buffer the conf change and commit it once the 2nd member contacts it. One problem is that the member might forget the configuration change request after a restart. To fix that, we need to persist the buffer somehow. This makes the problem more complicated. But I think we can do the 1st step to start with.
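The buffering idea can be illustrated with a toy model in Python. This is only a sketch under stated assumptions, not etcd or raft code: `SingleMemberCluster` and all of its methods are hypothetical, and "contact" is reduced to the new member presenting the peer URL it was registered with.

```python
class SingleMemberCluster:
    """Toy model of a one-member cluster that defers a member add."""

    def __init__(self):
        self.voters = {"member-1"}
        # Buffered conf changes: peer URL -> member name. Held in memory only.
        self.pending = {}

    def quorum(self) -> int:
        return len(self.voters) // 2 + 1

    def propose_add(self, name: str, peer_url: str) -> None:
        # Buffer the change instead of committing it, so quorum stays at 1.
        self.pending[peer_url] = name

    def on_peer_contact(self, peer_url: str) -> bool:
        # Commit only when the new member reaches us from the buffered URL.
        name = self.pending.pop(peer_url, None)
        if name is None:
            return False
        self.voters.add(name)
        return True

    def cancel_add(self, peer_url: str) -> bool:
        # Still possible because the lone voter keeps quorum until commit.
        return self.pending.pop(peer_url, None) is not None

    def restart(self) -> None:
        # The buffer is not persisted, so a restart forgets pending adds.
        self.pending.clear()
```

In this model a mistyped peer URL leaves the quorum at 1, so the pending add can still be cancelled. `restart` shows the drawback mentioned above: without persistence, the buffered request is lost.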

@gyuho
Contributor

gyuho commented Nov 10, 2016

@xiang90

once the 2nd member contacts it

How would we know this? Raft progress?

@xiang90
Contributor

xiang90 commented Nov 10, 2016

@gyuho

No. The newly added member will contact one existing member to get a list of existing members. For the one-member case, it can only contact that member.

@xiang90
Contributor

xiang90 commented Mar 28, 2017

This does not block our release. Moving to 3.3.

@justinsb
Contributor

I'm likely misunderstanding, but don't Raft's cluster membership rules prevent this from happening? In the example given originally, we wouldn't have a strict majority of the new (2 member) cluster and thus the cluster membership change must fail.

@gyuho
Contributor

gyuho commented May 21, 2018

@justinsb

we wouldn't have a strict majority of the new (2 member) cluster and thus the cluster membership change must fail.

Correct.

Once you add a new member to a single-node cluster, the quorum becomes 2. That is why a member add command against a single-node cluster cannot be reverted. That is, if the member add was done wrong but still committed, you cannot remove the member with the member remove command, now that the quorum is 2.

We want to prevent such a member add request, either by requiring time-outs or by some other means.
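The quorum arithmetic behind this can be written out as a small Python sketch. The helper names `quorum` and `can_commit` are made up for illustration and are not part of etcd.

```python
def quorum(cluster_size: int) -> int:
    """Strict majority of the configured members."""
    return cluster_size // 2 + 1


def can_commit(cluster_size: int, reachable_members: int) -> bool:
    """A proposal, including member remove, needs a quorum of live members."""
    return reachable_members >= quorum(cluster_size)


# 1 -> 2 with an unreachable new member: only 1 of 2 is up, quorum is 2.
stuck = not can_commit(cluster_size=2, reachable_members=1)

# 2 -> 3 with an unreachable new member: 2 of 3 are up, quorum is 2.
recoverable = can_commit(cluster_size=3, reachable_members=2)
```

Here `stuck` and `recoverable` are both True: the bad add can be reverted in a 2 -> 3 change, but not in a 1 -> 2 change.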

@gyuho
Contributor

gyuho commented May 21, 2018

Moving to v3.5. We still have many other things planned for v3.4.

@gyuho gyuho modified the milestones: etcd-v3.4, etcd-v3.5 May 21, 2018
@maxenglander

Would another viable solution for the 1->2 problem be to have server 1 first add server 2 as a non-voting member, replicate log entries to server 2 until server 2 is up to date, and only then complete the 1->2 quorum transition?

This is the approach the Raft paper recommends for addressing the challenge of new servers not having the full log, and thus being unable to accept new entries. I think this same approach would also solve the present issue of a new server being unavailable due to misconfiguration.
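As a rough illustration of that flow, here is a toy Python model. It is only a sketch, not etcd's implementation: `LearnerCluster` and its methods are hypothetical, and log replication is reduced to copying a single index.

```python
class LearnerCluster:
    """Toy model: non-voting members do not count toward quorum."""

    def __init__(self, voters):
        self.voters = set(voters)
        self.learners = {}   # name -> replicated log index, None if never reached
        self.log_index = 0

    def quorum(self) -> int:
        # Only voting members are counted.
        return len(self.voters) // 2 + 1

    def append(self) -> None:
        self.log_index += 1

    def add_learner(self, name: str) -> None:
        self.learners[name] = None

    def replicate(self, name: str) -> None:
        # Called only when the learner is actually reachable.
        self.learners[name] = self.log_index

    def promote(self, name: str) -> bool:
        # Promote to voter only once the learner has the full log.
        if name not in self.learners or self.learners[name] != self.log_index:
            return False
        del self.learners[name]
        self.voters.add(name)
        return True

    def remove_learner(self, name: str) -> None:
        self.learners.pop(name, None)
```

A misconfigured member never catches up, so `promote` keeps returning False, the quorum stays at 1, and the original member can still process `remove_learner`.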

@jingyih
Contributor

jingyih commented Jun 28, 2019

@maxenglander I agree. Non-voting member should solve this issue.

@gyuho @xiang90 Can we close this issue? Or do we want to fix this issue without using the non-voting member feature?

@stale

stale bot commented Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 7, 2020
@stale stale bot closed this as completed Apr 28, 2020