Leader election taking between 3 and 15 minutes in OpenShift deployment when any single node - even a non-leader - is evicted, starting from version 1.13.x #15231
Comments
After digging around a bit more, this is very likely caused by hashicorp/raft#524
My understanding is that we currently intend to release 1.13.4 (which includes PR #15175) in the window of Nov 30 - Dec 2. If I become aware of that changing substantially, I'll post here. Feel free to reply here if you haven't heard anything by the end of Dec 2 and 1.13.4 hasn't been released yet. I'll leave this open until one of the posters on this issue has confirmed that 1.13.4 improves their situation.
Just to confirm, this fix is also already in 1.14.0, right? The changelog didn't specifically mention it, but it looks like the version was bumped in the go.mod for that version.
Yes - the fix is already in 1.14.0! Here's how I checked: that fix went into the raft version bump you mentioned, and the changelog entries from that PR reflect it.
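For anyone who wants to repeat that check, here is a minimal sketch, assuming the usual release tags on the Consul repository and the module path github.com/hashicorp/raft in its go.mod (not a transcript of how the commenter actually verified it):

```shell
# Clone only the tagged release and look at which hashicorp/raft version it pins.
git clone --depth 1 --branch v1.14.0 https://github.com/hashicorp/consul.git
grep 'github.com/hashicorp/raft ' consul/go.mod
# Compare the printed raft version against the release that contains hashicorp/raft#524.
```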
From a quick test, this issue does indeed look to be resolved with 1.14.0.
Finally got around to deploying and testing 1.13.4 today (sorry for the wait) and can confirm everything is stable again now. On node loss a new leader is elected once, usually within <2 seconds, and stays leader as long as no other eviction happens. Issue can be closed!
Thanks @TheDevOps for confirming! |
Overview of the Issue
We are running a 3-node Consul server deployment within an on-prem OpenShift cluster using the official Helm chart from https://github.com/hashicorp/consul-k8s/tree/main/charts/consul
We noticed that after updating Consul to version 1.13.1, the loss of any single node (even a non-leader node, e.g. because of a node eviction or a rolling update of the StatefulSet) makes the whole Consul cluster unstable: it constantly loses and re-elects a leader for anywhere between 3 and 15 minutes before it becomes stable again, at which point it remains stable until yet another node is lost.
Up to version 1.12.5 this only took around 2 to 10 seconds at most, which our clients are prepared for with short-term fallback caches, so it had zero impact for us.
Because of this it is currently impossible for us to update beyond 1.12.5, since at any time a single replica could be evicted and the whole cluster would become unstable for an unfeasibly long time, impacting our Java clients.
Reproduction Steps
One caveat right away: this happens in an on-prem OpenShift cluster, so I cannot entirely rule out that it is not immediately reproducible everywhere.
To reproduce, the following tools are used:
First, provide a Helm values file adjusted for an OpenShift deployment with some security rules and so on; see values-consul-helm.yaml.txt (it had to be renamed to .yaml.txt to upload it...).
Note: there are two placeholders, <url> and <secretname>, inside for obvious reasons; adjust them as needed, the rest is exactly as used for our tests.
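If it helps, a minimal sketch of filling in those placeholders before installing; the substituted values are purely illustrative examples, not values from our environment:

```shell
# Replace the two placeholders with environment-specific values (example values only).
sed -i 's|<url>|consul.example.internal|g; s|<secretname>|consul-gossip-secret|g' values-consul-helm.yaml
```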
Next, run the Helm install command (a sketch of a typical command follows the note below) to deploy the cluster, which then shows up as running in OpenShift,
and in the Consul UI as a healthy cluster state where in this example "consul-helm-server-1" became the leader.
(Note: I forgot to take the screenshot before deleting one node, namely "consul-helm-server-2", so its IP no longer matches, since OpenShift assigned a new one when recreating it after the delete. Immediately after the initial deployment it obviously did match.)
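Since the exact command is not reproduced above, here is a minimal sketch of the kind of install that was run, assuming the official HashiCorp chart repository, a release name of consul-helm (matching the server pod names), and a namespace called consul; the concrete flags are assumptions, not copied from our setup:

```shell
# Add the official HashiCorp chart repository (assumed; an internal mirror works the same way).
helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update

# Install the chart with the adjusted values file from above (renamed back to .yaml).
# The release name "consul-helm" lines up with the pod names consul-helm-server-0/1/2.
helm install consul-helm hashicorp/consul \
  --namespace consul --create-namespace \
  --values values-consul-helm.yaml
```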
Originally, the first time we noticed the issue was when, due to a disk issue, OpenShift automatically evicted one replica from a node and we suddenly encountered issues in our Java clients that persisted for multiple minutes. So, in order to reproduce this, "force" an eviction of one Consul server by simply deleting any node, even a non-leader; in the case of the attached logs in the "Log Fragments" section this was done for "consul-helm-server-2", which was a follower, but it works for any node, including the old leader.
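To simulate that eviction on a test cluster, a minimal sketch, assuming the pod and namespace names used above (both are assumptions):

```shell
# "Force" the eviction by deleting one follower pod; OpenShift recreates it automatically.
oc delete pod consul-helm-server-2 -n consul

# Watch the pod come back and the cluster go through its election churn.
oc get pods -n consul -w
```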
Now the deleted node gets recreated by OpenShift, starts, and tries to trigger a leader election that initially does not work because it is not yet a voter. Eventually it becomes a voter, and from then on the cluster keeps electing a leader, losing it immediately again, and starting over. This continues for anywhere between 3 and up to 15 minutes, a seemingly random number of times, until eventually the cluster elects a leader and stays stable again.
Deleting another node repeats all of this, and there is never any improvement.
In further tests we also found the same issue to start happening when updating from 1.12.5 to 1.13.3 via the OpenShift rolling-update mechanism, but as explained above it can be reproduced even with a completely new cluster without any clients connected.
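The rolling-update variant can be triggered the same way; a minimal sketch, assuming the chart's global.image value controls the server image and using the same release and namespace names as above (all of these are assumptions):

```shell
# Bump the Consul image on an existing 1.12.5 release; OpenShift then replaces
# the server pods one at a time, which is enough to hit the issue.
helm upgrade consul-helm hashicorp/consul \
  --namespace consul \
  --values values-consul-helm.yaml \
  --set global.image=hashicorp/consul:1.13.3
```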
In the "log fragments" section I've attached logs from all 3 consul server nodes as well as the output of "consul operator raft list-peers -stale" on one node several times during the recovery process
Consul info for both Client and Server
Client info
Server info
Operating system and Environment details
Log Fragments
The following are debug-level logs from all 3 Consul server nodes during the failover process; a sketch of one way to collect them follows the file list.
consul-helm-server-0.log
consul-helm-server-1.log
consul-helm-server-2.log
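For completeness, a minimal sketch of how such per-pod debug logs can be collected, assuming the same pod and namespace names as above (enabling debug logging itself is done through the chart values and is not shown here):

```shell
# Capture the logs of each Consul server pod during the failover window.
for i in 0 1 2; do
  oc logs "consul-helm-server-$i" -n consul > "consul-helm-server-$i.log"
done
```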
The following log is the output of the "consul operator raft list-peers -stale" command, run multiple times before, during and after the whole failover
consul-list-peers.log
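A minimal sketch of how that output can be captured repeatedly during the failover, assuming the command is run via oc exec against one server pod (pod and namespace names are assumptions):

```shell
# Poll the raft peer list every few seconds while the failover is happening.
while true; do
  date
  oc exec consul-helm-server-0 -n consul -- consul operator raft list-peers -stale
  sleep 5
done | tee consul-list-peers.log
```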
If you need anything else please just let me know!