Best way to monitor cluster leader presence/absence #10733

Abhimanyu-Jana · 2021-07-30T01:30:05Z

Our cluster leader monitoring is based on the consul_raft_leader metric exported by the Prometheus Consul exporter:

https://github.com/prometheus/consul_exporter

As I understand it, the exporter gets the leader from the /v1/status/leader endpoint.

However, there were many instances where queries to Consul (/v1/catalog/nodes or /v1/catalog/services) failed with HTTP 500 / "No cluster leader", even though /v1/status/leader on each individual node reported that there was a leader.

Is consul_raft_leader or /v1/status/leader still a good way to monitor for leader presence/absence?

jkirschner-hashicorp · 2021-07-30T11:32:35Z

Copying additional info here from your post on the linked repo: prometheus/consul_exporter#208

What did you do?
Set up monitoring for presence/absence of the cluster leader using the consul_raft_leader metric.

What did you expect to see?
When external queries to the Consul cluster fail with HTTP 500 or a "No cluster leader" error, we expect the consul_raft_leader value to change from 1 to 0.

What did you see instead? Under which circumstances?
The consul_raft_leader value remains 1 despite obvious issues with cluster health. We can confirm this from logs that show the "No cluster leader" errors, as well as from the "consul operator raft list-peers" command.

Environment
Linux

consul_exporter version:
0.7.1

Consul version:
Consul v1.8.3

Prometheus version:
N/A

Prometheus configuration file:
N/A

Logs:

Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)

jkirschner-hashicorp · 2021-08-02T22:30:55Z

Hi @Abhimanyu-Jana,

This seems like a bug - there shouldn't be a discrepancy between the leader status endpoint and what the rest of the cluster thinks.

To help us explore this, can you provide us with some additional information?

Are there any reproduction steps you can provide?
What's the output of consul info from the client agent and the server agent when this condition occurs?
Information about the OS, Architecture, and any other information you can provide about the environment.
Log fragments: Include appropriate Client or Server log fragments. If the log is longer than a few dozen lines, please include the URL to the gist of the log instead of posting it in the issue. Use -log-level=TRACE on the client and server to capture the maximum log detail.

Abhimanyu-Jana · 2021-08-03T01:57:44Z

Thank you for your response. We'll look into getting this info ASAP

dnephin · 2021-08-03T23:09:41Z

I had a quick look into this. At first we thought it might have been fixed by #8408, but I now suspect that the underlying issue that prompted #8404 is the same as this one, and that unfortunately that change probably does not fix it.

I wonder if this problem might have been fixed by #9487. That change was backported into Consul v1.8.8. Before that change, networking problems between a client and a server could cause "No cluster leader" errors for RPC requests, even when raft still had a leader.

Does that seem like it might be the cause of the problem? Would upgrading to 1.8.8 be an option, to see if the errors change to "Raft leader not found in server lookup mapping"?

nahsi · 2021-08-13T06:33:48Z

> Our monitoring for cluster leader is based on consul_raft_leader exported by the prometheus consul exporter
> https://github.com/prometheus/consul_exporter

Why do you need an outdated third-party exporter when Consul natively supports metrics in Prometheus format?

Abhimanyu-Jana · 2021-08-20T03:50:00Z

@nahsi Because of the issue described in #5140.

However, you raised a valid point. In theory, the native metrics are probably all that's needed for monitoring, without having to use the exporter.

Can you confirm whether the behaviour described in #5140 was addressed in later versions?
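For anyone trying the native route, here is a rough Python sketch of reading a single gauge out of the agent's Prometheus-format output (GET /v1/agent/metrics?format=prometheus). The helper names read_gauge and needs_alert are made up for illustration, and using consul_autopilot_healthy as the signal is only an assumption: it tracks overall server health, not leader presence specifically, so check which metric actually fits your version and setup.

```python
import math


def read_gauge(metrics_text, name):
    """Return the first sample value for `name` in Prometheus text format, or None."""
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            # Sample with labels: name{label="value"} 1.0
            metric_name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            # Sample without labels: name 1.0
            metric_name, _, rest = line.partition(" ")
        if metric_name != name:
            continue
        fields = rest.split()
        if fields:
            return float(fields[0])
    return None


def needs_alert(value):
    """Treat a missing, NaN, or non-1 gauge as a reason to alert."""
    return value is None or math.isnan(value) or value != 1.0
```

Usage would be something like needs_alert(read_gauge(body, "consul_autopilot_healthy")), where body is the text returned by the metrics endpoint. Treating a missing sample as an alert is a deliberate choice here, since a server that cannot report the gauge is not known to be healthy.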

Abhimanyu-Jana · 2021-08-20T03:53:32Z

By the way, I was able to reproduce this the other day on a broken cluster:

[consul_node1] $ /usr/sbin/consul operator raft list-peers -http-addr=http://$(hostname):8500
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)

[consul_node1] $ curl -s http://$(hostname):8500/v1/status/leader | jq
"<consul_node3>:8300"
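The two commands above suggest a possible workaround for monitoring: instead of trusting /v1/status/leader alone, also issue a request that needs a working leader and compare the two. Below is a minimal Python sketch of that idea. It assumes /v1/operator/raft/configuration (the endpoint behind "consul operator raft list-peers") as the second probe; the function names, the verdict strings, and the injectable get parameter are hypothetical, and ACL tokens and TLS are left out.

```python
import json
import urllib.error
import urllib.request


def http_get(url, timeout=5.0):
    """Return (status_code, body) without raising on HTTP error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses, e.g. 500 "No cluster leader".
        return err.code, err.read().decode("utf-8", "replace")
    except urllib.error.URLError as err:
        # Connection-level failure; 0 is used here as a "no response" marker.
        return 0, str(err.reason)


def check_leader(base_url, get=http_get):
    """Compare the status endpoint with a request that requires a leader."""
    status_code, status_body = get(base_url + "/v1/status/leader")
    reported = ""
    if status_code == 200:
        try:
            value = json.loads(status_body)
        except ValueError:
            value = ""
        # The endpoint returns a JSON string; empty means no known leader.
        reported = value if isinstance(value, str) else ""

    raft_code, raft_body = get(base_url + "/v1/operator/raft/configuration")
    usable = raft_code == 200

    if reported and usable:
        verdict = "ok"
    elif reported:
        # The situation reproduced above: a leader is reported,
        # but a leader-dependent request fails.
        verdict = "mismatch"
    else:
        verdict = "no_leader"
    return {
        "verdict": verdict,
        "reported_leader": reported,
        "raft_status": raft_code,
        "raft_error": "" if usable else raft_body.strip(),
    }
```

Calling check_leader("http://localhost:8500") against each server and alerting on any verdict other than "ok" would have flagged the broken cluster above, since the second request returns 500 there. Keeping raft_error in the result also makes it easy to see which error message the agent returned.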

jkirschner-hashicorp · 2021-08-20T13:26:23Z

@Abhimanyu-Jana : #5140 was resolved by PR #9198 (in Nov 2020).

jkirschner-hashicorp added the theme/telemetry and type/question labels Jul 30, 2021

jkirschner-hashicorp added the waiting-reply label Aug 2, 2021

dnephin mentioned this issue Aug 3, 2021 in "Modify Status Leader API to return Status 500 when there is no leader" #8408 (closed)
