// , If one Consul cluster connects to another and both then remove-peer each other, the second Consul cluster becomes leaderless. #3218

Closed
v6 opened this issue Jul 3, 2017 · 4 comments
Labels
type/question Not an "enhancement" or "bug". Please post on discuss.hashicorp

Comments

@v6
Contributor

v6 commented Jul 3, 2017

I had a question and directed it to the Consul mailing list; it hasn't been addressed in the FAQ.

Here's a link to my question on the consul-tool mailing list:
https://groups.google.com/d/msg/consul-tool/Mnf71dzLi-k/cxOe1dnoAAAJ

I also asked on serverfault.com:

https://serverfault.com/questions/852520/how-do-i-completely-remove-a-node-from-a-consul-cluster

I will continue to post there, but an architect in my group mentioned that the issue about which I asked may be a bug. Perhaps it's worth asking here whether this is something you have observed before.

consul version for both Client and Server

Client: N/A
Server: 0.8.3

[nathan-basanese-zsh8@prd0consulserver2 ~]% consul version
Consul v0.8.3
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
[nathan-basanese-zsh8@prd0consulserver2 ~]%

consul info for both Production Server and Alpha Server

Production Server:

[nathan-basanese-zsh8@prd0consulserver2 ~]% consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 1
build:
	prerelease =
	revision = ea2a82b
	version = 0.8.3
consul:
	bootstrap = false
	known_datacenters = 2
	leader = false
	leader_addr = 192.100.100.101:8300
	server = true
raft:
	applied_index = 1217690
	commit_index = 1217690
	fsm_pending = 0
	last_contact = 49.686802ms
	last_log_index = 1217690
	last_log_term = 1423
	last_snapshot_index = 1212513
	last_snapshot_term = 771
	latest_configuration = [{Suffrage:Voter ID:192.176.100.1:8300 Address:192.176.100.1:8300} {Suffrage:Voter ID:192.176.100.3:8300 Address:192.176.100.3:8300} {Suffrage:Voter ID:192.100.100.101:8300 Address:192.100.100.101:8300} {Suffrage:Voter ID:192.100.100.102:8300 Address:192.100.100.102:8300} {Suffrage:Voter ID:192.100.100.103:8300 Address:192.100.100.103:8300}]
	latest_configuration_index = 1207557
	num_peers = 4
	protocol_version = 2
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 1423
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 84
	max_procs = 2
	os = linux
	version = go1.8.1
serf_lan:
	encrypted = false
	event_queue = 0
	event_time = 5
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 10
	members = 3
	query_queue = 0
	query_time = 1
serf_wan:
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 118
	members = 6
	query_queue = 0
	query_time = 1

Alpha Server:

[[email protected] ~]% consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 1
build:
	prerelease =
	revision = ea2a82b
	version = 0.8.3
consul:
	bootstrap = false
	known_datacenters = 2
	leader = false
	leader_addr =
	server = true
raft:
	applied_index = 2073575
	commit_index = 2073575
	fsm_pending = 0
	last_contact = 14m37.496640011s
	last_log_index = 2073575
	last_log_term = 1404
	last_snapshot_index = 2069604
	last_snapshot_term = 1404
	latest_configuration = [{Suffrage:Voter ID:192.176.100.1:8300 Address:192.176.100.1:8300} {Suffrage:Voter ID:192.176.100.2:8300 Address:192.176.100.2:8300} {Suffrage:Voter ID:192.176.100.3:8300 Address:192.176.100.3:8300}]
	latest_configuration_index = 1
	num_peers = 2
	protocol_version = 2
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Candidate
	term = 1538
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 147
	max_procs = 2
	os = linux
	version = go1.8.1
serf_lan:
	encrypted = false
	event_queue = 0
	event_time = 29
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 367
	members = 20
	query_queue = 0
	query_time = 1
serf_wan:
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 118
	members = 6
	query_queue = 0
	query_time = 1

Operating system and Environment details

CentOS 6.7
Linux 2.6.32-696.1.1.el6.x86_64 #1 SMP Tue Apr 11 17:13:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the Issue (and unexpected/desired result)

Reproduction steps

Create a 3-member Consul cluster, Alpha, with members on different subnets.
Create another 3-member Consul cluster, Production, with members on 2 subnets different from those used by Alpha.
Configure one member of the Production cluster with a member of Alpha in its start_join list (a sketch of such a configuration follows this list).
Reconfigure that Production member so it no longer joins the Alpha member, and have each cluster remove the other cluster's peers.
Observe that the first cluster, Alpha, becomes leaderless.
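
As context for the start_join and remove-peer steps above, here is a minimal sketch of the kind of agent configuration and command involved. The datacenter name and file path are hypothetical; the IP addresses are taken from the latest_configuration output above.

# Hypothetical config on one Production server (e.g. /etc/consul.d/server.json).
# The stray Alpha address in start_join is what cross-joins the two clusters.
cat > /etc/consul.d/server.json <<'EOF'
{
  "server": true,
  "datacenter": "production",
  "bootstrap_expect": 3,
  "data_dir": "/var/lib/consul",
  "start_join": [
    "192.100.100.102",
    "192.100.100.103",
    "192.176.100.1"
  ]
}
EOF

# Later, each cluster removed the other cluster's servers from its Raft configuration:
consul operator raft remove-peer -address="192.176.100.1:8300"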

If this is a bug, it is somewhat subtle. It may have to do with Raft, not Consul.

I have reproduced this problem three times in an effort to resolve it, even after deleting all raft & consul data from the affected servers.
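
For concreteness, a hedged sketch of what wiping that state might look like, assuming a data_dir of /var/lib/consul (hypothetical) and CentOS 6 style service management; the raft and serf subdirectories hold the peer set and the remembered gossip members respectively.

# Hypothetical cleanup on an affected server.
service consul stop
rm -rf /var/lib/consul/raft     # Raft log, snapshots, and peer configuration
rm -rf /var/lib/consul/serf     # Serf LAN/WAN snapshots of known members
service consul start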

Log Fragments or Link to gist

Include appropriate Client or Server log fragments. If the log is longer than a few dozen lines, please include the URL to the gist.

Please refer to the serverfault.com post: https://serverfault.com/questions/852520/how-do-i-completely-remove-a-node-from-a-consul-cluster

Or indicate which other logs or behavior you would want to observe.

@v6
Contributor Author

v6 commented Jul 3, 2017

// , Quickly removing each other's peers from all Consul servers with consul operator raft remove-peer seemed to work, but I wonder why this has to be done at the Raft level rather than with Consul's force-leave command.
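
For readers weighing the two commands: force-leave operates on the gossip (Serf) member list and relies on an elected leader to reconcile that change into the Raft configuration, while operator raft remove-peer explicitly asks the cluster to drop a server from the Raft peer set that quorum is computed from. A minimal sketch, with the node name taken from the shell prompts above and the address from the latest_configuration output:

# Gossip level: mark a failed member as "left" so it is no longer retried.
consul members
consul force-leave prd0consulserver2

# Raft level: inspect and edit the voter set directly.
consul operator raft list-peers
consul operator raft remove-peer -address="192.176.100.1:8300"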

@pearkes added the type/question label on Jul 24, 2018
@pearkes
Contributor

pearkes commented Jul 24, 2018

It looks like this got answered in the comment above and in the mailing list thread, so I'm going to close it. Thanks for the questions, and feel free to comment back if I've missed anything.

@pearkes closed this as completed on Jul 24, 2018
@v6
Contributor Author

v6 commented Jan 14, 2019

// , Eh, I'm mostly wondering about how this happens in the first place.

@cbednarski

cbednarski commented Jan 14, 2019

@v6 Raft peers that do not leave gracefully persist in the cluster for up to 72 hours, and peers will attempt to heartbeat and reconnect to the dead nodes during that time. Any peer that has a record for that node will resurrect it if it can reestablish a connection, even if the node was wiped (the peer still remembers it).

The problem you described is caused by an operator error joining a node to the wrong cluster. To avoid it, do not join nodes to the wrong clusters. This can be prevented by using different gossip encryption keys for each cluster and by firewalling LAN ports between the clusters (only WAN ports should be open).
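
A minimal sketch of those mitigations, assuming Consul's default ports (Serf LAN 8301, Serf WAN 8302, server RPC 8300); the file path is hypothetical, the key shown is only a placeholder, and the subnet in the firewall rule is the Alpha subnet from this issue.

# Generate a distinct gossip key per cluster (run once for Alpha, once for Production).
consul keygen

# Hypothetical agent config fragment adding the cluster's own key:
cat > /etc/consul.d/encrypt.json <<'EOF'
{
  "encrypt": "cg8StVXbQJ0gPvMd9o7yrg=="
}
EOF

# Block Serf LAN (8301, TCP and UDP) between the clusters; keep WAN (8302) and
# RPC (8300) open only where cross-datacenter federation is actually intended.
iptables -A INPUT -p tcp --dport 8301 -s 192.176.100.0/24 -j DROP
iptables -A INPUT -p udp --dport 8301 -s 192.176.100.0/24 -j DROP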

Specific behavior of the leader election depends on how many server nodes were present, and how many of those are alive and reachable at any given time. For example, if you start with 3 servers and add 2 more, the cluster size is now 5 and you must have at least 3 servers (not 2) to elect a new leader. Conversely, if you reduce a cluster from 3 servers to 2 and then lose either of the remaining servers, you will be unable to elect a new leader because you can no longer form a majority.
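
To make the arithmetic concrete, Raft quorum for N voting servers is floor(N/2) + 1 (this is the standard deployment-table math, not anything specific to this issue):

Servers   Quorum   Failures tolerated
1         1        0
2         2        0
3         2        1
5         3        2
7         4        3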

Consul's design expects nodes to recover after a network outage or node failure. If you replace rather than restore a majority of your nodes, you may grow the voter set (and with it the quorum size) and get stuck without enough live members to elect a new leader. In that case you will need to intervene manually to fix the problem.
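
For reference, the documented form of that manual intervention is outage recovery with a raft/peers.json file. A hedged sketch for Raft protocol 2 (the version the consul info output above reports), using the Alpha server addresses from this issue and a hypothetical data_dir of /var/lib/consul:

# On every surviving Alpha server, with the agents stopped:
service consul stop

# Write the intended peer set. For Raft protocol 2 this is a JSON array of
# "IP:port" strings (Raft protocol 3 uses objects with id/address instead).
cat > /var/lib/consul/raft/peers.json <<'EOF'
["192.176.100.1:8300", "192.176.100.2:8300", "192.176.100.3:8300"]
EOF

# Restart the agents; they ingest peers.json on boot and hold an election
# among exactly this set of peers.
service consul start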
