Leader Election taking between 3 to 15 minutes in openshift deployment when any single node - even non leader - is evicted starting from version 1.13.x #15231

Closed
TheDevOps opened this issue Nov 2, 2022 · 8 comments

TheDevOps commented Nov 2, 2022

Overview of the Issue

We are running a 3 node Consul Server deployment within an onprem openshift cluster using the official helm chart from https://github.com/hashicorp/consul-k8s/tree/main/charts/consul
We noticed that after updating Consul to version 1.13.1, the loss of any single node, even a non-leader (e.g. because of a node eviction or a rolling update of the StatefulSet), makes the whole Consul cluster unstable: it constantly loses and re-elects a leader for anywhere between 3 and 15 minutes before it becomes stable again. It then remains stable until yet another node is lost.
Up to version 1.12.5 this took only around 2 to 10 seconds at most. Our clients are prepared for that with short-term fallback caches, so it had zero impact on us.
Because of this we currently cannot update beyond 1.12.5: a replica could be evicted at any time, leaving the whole cluster unstable for an unfeasibly long time and impacting our Java clients.

Reproduction Steps

To be upfront: this happens in an on-prem OpenShift cluster, and I cannot rule out that it is not reproducible everywhere.

To reproduce the following tools are used:

  • openshift cluster running version 4.10.34 with kubernetes version v1.23.5+8471591 (happens also for openshift 4.9 and kubernetes v1.22.8+9e95cb9)
  • oc client version 4.11.0-0.okd-2022-08-20-022919
  • helm version v3.10.0
  • consul version 1.13.3 (happens also with 1.13.1 or 1.14.0-beta1, but does NOT happen with 1.12.2 or 1.12.5)
  • consul helm chart from https://github.com/hashicorp/consul-k8s/tree/main/charts/consul with version 0.49.0
  • Make sure there are no previous persistent volumes left in openshift and the installation happens on a completely blank state

First provide a helm values file adjusted for an OpenShift deployment with some security rules and so on; see values-consul-helm.yaml.txt (I had to rename it to .yaml.txt to upload it).
Note: it contains two placeholders, <url> and <secretname>, for obvious reasons. Adjust them as needed; the rest is exactly as used for our tests.

Next run

helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update
helm upgrade --install -f values-consul-helm.yaml consul-helm hashicorp/consul -n <namespace>

to deploy the cluster which then shows in openshift as

image

and in the Consul UI as a healthy cluster, where in this example "consul-helm-server-1" became the leader

image

(Note: I forgot to take the screenshot before deleting one node, namely "consul-helm-server-2", so the IP no longer matches, since OpenShift assigned a new one when recreating it. Immediately after the initial deployment it did match.)

We first noticed the issue when OpenShift automatically evicted one replica from a node because of a disk issue, and our Java clients suddenly encountered problems that persisted for multiple minutes. To reproduce this, "force" an eviction of one Consul server by simply deleting any node, even a non-leader. For the attached logs in the "Log Fragments" section this was done for "consul-helm-server-2", which was a follower, but it works for any node, including the current leader.

The deleted node is then recreated by OpenShift, starts, and tries to start a leader election, which initially fails because it is not yet a voter. Eventually it becomes a voter, and from then on the cluster keeps electing a leader, losing it immediately, and starting over. This repeats a seemingly random number of times, for anywhere between 3 and 15 minutes, until the cluster eventually elects a leader and stays stable again.

image

Deleting another node keeps repeating this and there is never any improvement.
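
To put a number on the flapping described above, the leader can be sampled once per second and the changes counted. The following is only a minimal sketch, not something used in the original tests: the function names `poll_leader` and `count_leader_changes` are made up for illustration, and it assumes `curl` is available and that a Consul server's HTTP API is reachable at the address passed in.

```shell
# Count how often the reported leader changes in a stream of observations.
# Input: one observation per line on stdin (a leader address, or an empty
# line for "no leader"). Output: the number of transitions between lines.
count_leader_changes() {
  awk '
    NR > 1 && ($0 "") != (prev "") { changes++ }  # force string comparison
    { prev = $0 }
    END { print changes + 0 }
  '
}

# Sample the leader once per second.
# $1: base URL of a Consul server's HTTP API (assumption: reachable from here)
# $2: number of samples to take
poll_leader() {
  addr="$1"; samples="$2"; i=0
  while [ "$i" -lt "$samples" ]; do
    # /v1/status/leader returns a JSON string such as "10.128.13.223:8300",
    # or "" while there is no leader; a failed request also yields "".
    leader=$(curl -s --max-time 2 "$addr/v1/status/leader" | tr -d '"')
    printf '%s\n' "$leader"
    i=$((i + 1))
    sleep 1
  done
}
```

For example, `poll_leader http://localhost:8500 900 | count_leader_changes` would sample for roughly 15 minutes; the URL here is an assumption (8500 is Consul's default HTTP port) and needs adjusting to however the server is exposed. On a healthy failover the count should stay very small, while the behaviour described here would show many changes.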

In further tests we found that the same issue also starts happening when updating from 1.12.5 to 1.13.3 via OpenShift's rolling update mechanism, but as explained above it can be reproduced even with a completely new cluster without any clients connected.

In the "Log Fragments" section I've attached logs from all 3 Consul server nodes, as well as the output of "consul operator raft list-peers -stale" run on one node several times during the recovery process.
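
For anyone repeating the list-peers capture, each table can be reduced to a single summary line, which makes the changes over time easier to compare. This is a hedged sketch rather than the method used for the attached log: `summarize_peers` is a hypothetical helper, and it assumes the column order Node, ID, Address, State, Voter, RaftProtocol, which may differ in other Consul versions.

```shell
# Summarize one `consul operator raft list-peers` table read from stdin.
# Assumed columns: Node ID Address State Voter RaftProtocol
summarize_peers() {
  awk '
    NR == 1 { next }                  # skip the header row
    $4 == "leader" { leader = $1 }    # remember the node reporting as leader
    $5 == "true"   { voters++ }       # count peers that are already voters
    END {
      printf "leader=%s voters=%d\n", (leader == "" ? "none" : leader), voters
    }
  '
}
```

It could be fed from inside a server pod, for example `consul operator raft list-peers -stale | summarize_peers`, run in a loop with a timestamp in front of each line. A recreated server that has not yet been promoted shows up as a lower voter count.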

Consul info for both Client and Server

Client info
We are not using actual consul clients but only java applications using https://github.com/Ecwid/consul-api as client. 
Since the issue can be reproduced even without a single client connecting I don't think this is relevant for the problem anyway
Server info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease = 
        revision = b29e5894
        version = 1.13.3
        version_metadata = 
consul:
        acl = disabled
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 10.128.13.223:8300
        server = true
raft:
        applied_index = 218
        commit_index = 218
        fsm_pending = 0
        last_contact = 57.449617ms
        last_log_index = 218
        last_log_term = 38
        last_snapshot_index = 0
        last_snapshot_term = 0
        latest_configuration = [{Suffrage:Voter ID:2e3fe218-6337-8425-a4a4-32ca569dacc2 Address:10.129.38.31:8300} {Suffrage:Voter ID:a784393f-e56f-55d7-23d4-38d90764eaea Address:10.128.13.223:8300} {Suffrage:Voter ID:545cf64c-1a8f-7997-d354-cb381c8feacc Address:10.131.11.181:8300}]
        latest_configuration_index = 0
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 38
runtime:
        arch = amd64
        cpu_count = 4
        goroutines = 129
        max_procs = 4
        os = linux
        version = go1.18.1
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 31
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 11
        members = 3
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 7
        members = 3
        query_queue = 0
        query_time = 1

Operating system and Environment details

  • Openshift 4.10.34 with kubernetes version v1.23.5+8471591
  • CRI-O (1.23.3) container runtime
  • Red Hat Enterprise Linux CoreOS 410.84.202209161756-0 (kernel 4.18.0-305.62.1.el8_4.x86_64) compute nodes
  • VMware ESXi as host system
  • Consul server 1.13.3
  • Consul helm chart 0.49.0
  • helm 3.10.0

Log Fragments

The following logs are debug level logs from all 3 consul server nodes during the failover process

consul-helm-server-0.log
consul-helm-server-1.log
consul-helm-server-2.log

The following log is the output of the "consul operator raft list-peers -stale" command multiple times before, during and after the whole failover

consul-list-peers.log

If you need anything else please just let me know!

@TheDevOps
Author

After digging around a bit more this is very likely caused by hashicorp/raft#524
Can someone confirm or deny this, and maybe also provide a timeline for when, and in which Consul versions, the raft fix will be included?

@nvx

nvx commented Nov 7, 2022

It looks like PR #15175 merged bd3451f into the release/1.13.x branch already, so I'd expect the next point release (1.13.4) to contain the fix. No idea when that's scheduled to come out mind you - hopefully sooner rather than later due to the severity of this bug.

@jkirschner-hashicorp
Contributor

My understanding is that we currently intend to release 1.13.4 (which includes PR #15175) in the window of Nov 30 - Dec 2. If I become aware of that changing substantially, I'll post here. Feel free to reply here if you haven't heard anything by the end of Dec 2 and 1.13.4 hasn't been released yet.

I'll leave this open until one of the posters on this issue has confirmed that 1.13.4 improves their situation.

@nvx

nvx commented Nov 21, 2022

Just to confirm, this fix is also already in 1.14.0 right? The changelog didn't specifically mention it, but it looks like the version was bumped in the go.mod for that version.
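
The go.mod check mentioned here can be sketched as a small filter that prints the pinned hashicorp/raft version from a go.mod file. This is only an illustration: `raft_version` is a hypothetical helper, and it does not say which raft version actually contains the fix; that still has to be compared against the raft changelog.

```shell
# Print the version of github.com/hashicorp/raft required by a go.mod file
# read from stdin. Handles both entries inside a require ( ... ) block and
# single-line "require <module> <version>" entries.
raft_version() {
  awk '
    $1 == "github.com/hashicorp/raft" { print $2; exit }
    $1 == "require" && $2 == "github.com/hashicorp/raft" { print $3; exit }
  '
}
```

Run against a checkout of the relevant release tag, for example `raft_version < go.mod`. The exact module match avoids picking up related modules such as raft-boltdb.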

@jkirschner-hashicorp
Contributor

Yes - the fix is already in 1.14.0! Here's how I checked:

That fix went into main with this PR: #14897

Looking at the changelog entries from that PR, like raft: Fix a race condition where the snapshot file is closed without being opened, I see them in the Consul 1.14.0 release.

@TheDevOps
Author

From a quick test, this issue does indeed look resolved with 1.14.0.
For now we would still prefer to wait for 1.13.4 rather than move to the new major version this close to the end of the year, which is a pretty critical time for us. Dec 2 at the latest works well for us.
I'll update once 1.13.4 is released and tested, probably on Dec 5, but since the issue no longer happens with 1.14.0 in the test setup, I'm very positive it will be fixed in 1.13.4 as well.

@TheDevOps
Author

Finally got around to deploying and testing 1.13.4 today (sorry for the wait) and can confirm everything is stable again. On loss of a node, a new leader is elected once, usually within <2 seconds, and stays leader as long as no other eviction happens. Issue can be closed!

@david-yu
Contributor

david-yu commented Dec 5, 2022

Thanks @TheDevOps for confirming!

@david-yu david-yu closed this as completed Dec 5, 2022