Best way to monitor cluster leader presence/absence #10733
Comments
Copying additional info here from your post on the linked repo: prometheus/consul_exporter#208
What did you do?
What did you expect to see?
What did you see instead? Under which circumstances?
Environment
consul_exporter version:
Consul version:
Prometheus version:
Prometheus configuration file:
Logs:
Hi @Abhimanyu-Jana, This seems like a bug - there shouldn't be a discrepancy between the leader status endpoint and what the rest of the cluster thinks. To help us explore this, can you provide us with some additional information?
Thank you for your response. We'll look into getting this info ASAP.
I had a quick look into this. At first we thought it might have been fixed by #8408, but I now suspect that the underlying issue that prompted #8404 is the same as this one, and that change unfortunately probably does not fix the issue.
I wonder if this problem might have been fixed by #9487, which was backported into Consul v1.8.8. Before that change, networking problems between a client and a server could cause "No cluster leader" errors for RPC requests, even when raft still had a leader. Does that seem like it might be the cause of the problem? Would upgrading to 1.8.8 be an option, to see if the errors change to "Raft leader not found in server lookup mapping"?
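To tell the two failure modes apart after an upgrade, the body of each failed request could be classified. This is a minimal Python sketch, assuming the two error strings quoted above appear verbatim in the HTTP 500 response body; the function name and the returned labels are hypothetical, not part of Consul.

```python
def classify_leader_error(status_code: int, body: str) -> str:
    """Label a Consul HTTP response by the leader-related error it carries.

    Assumption: the error text quoted in this thread is present in the
    response body. Matching is by case-insensitive substring.
    """
    if status_code != 500:
        return "not-a-500"
    text = body.lower()
    # Message expected after the change discussed above, when raft has a
    # leader but the server cannot map it to a known server.
    if "raft leader not found in server lookup mapping" in text:
        return "leader-lookup-failed"
    # Message seen in the original report.
    if "no cluster leader" in text:
        return "no-cluster-leader"
    return "other-error"
```

Counting the two labels separately over time would show whether the errors actually change after the upgrade.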
Why do you need an outdated third-party exporter when Consul supports metrics in Prometheus format natively?
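For reference, the agent's native metrics can be read from /v1/agent/metrics?format=prometheus. The sketch below fetches that endpoint and pulls out the samples for one metric name. It assumes the agent has Prometheus output enabled, that sample lines carry no trailing timestamp, and that the caller knows which metric to watch; `find_samples`, `fetch_metrics`, and the metric names used in testing are hypothetical.

```python
import urllib.request


def fetch_metrics(base_url: str, timeout: float = 5.0) -> str:
    """Fetch the agent's metrics in Prometheus text format."""
    url = f"{base_url}/v1/agent/metrics?format=prometheus"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", "replace")


def find_samples(exposition: str, name: str) -> dict:
    """Return {sample_key: value} for sample lines whose metric name is `name`.

    Assumption: each sample line is "<name>{labels} <value>" with no
    trailing timestamp, so the value is the text after the last space.
    """
    samples = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        key, _, value = line.rpartition(" ")
        metric = key.split("{", 1)[0]
        if metric == name:
            samples[key] = float(value)
    return samples
```

In practice Prometheus would scrape the endpoint directly; the parser is only useful for ad hoc checks from a shell or a health script.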
By the way, I was able to reproduce this the other day on a broken cluster:
[consul_node1] $ /usr/sbin/consul operator raft list-peers -http-addr=http://$(hostname):8500
[consul_node1] $ curl -s http://$(hostname):8500/v1/status/leader | jq
@Abhimanyu-Jana: #5140 was resolved by PR #9198 (in Nov 2020).
Our monitoring for the cluster leader is based on the consul_raft_leader metric exported by the Prometheus Consul exporter:
https://github.com/prometheus/consul_exporter
As I understand it, the exporter gets the leader from the /v1/status/leader endpoint.
However, there were many instances where queries to Consul (/v1/catalog/nodes or /v1/catalog/services) failed with HTTP 500 / "No cluster leader", even though /v1/status/leader reported a leader on each individual node.
Is consul_raft_leader or /v1/status/leader still a good way to monitor for leader presence/absence?
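One way to catch the discrepancy described above is to probe both endpoints on the same node and compare the results. The following is a rough Python sketch, not a vetted monitoring check: it assumes /v1/status/leader returns a JSON string holding the leader address (empty when there is no leader) and that a failing catalog query returns a non-200 status; `fetch`, `diagnose`, and `probe` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request
from typing import Tuple


def fetch(url: str, timeout: float = 5.0) -> Tuple[int, str]:
    """Return (HTTP status, body). HTTP error responses are returned, not raised.

    Connection failures (urllib.error.URLError) still propagate to the caller.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", "replace")


def leader_from_status(body: str) -> str:
    """Extract the leader address from a /v1/status/leader body, or ""."""
    try:
        value = json.loads(body)
    except json.JSONDecodeError:
        return ""
    return value if isinstance(value, str) else ""


def diagnose(status_body: str, catalog_code: int, catalog_body: str) -> str:
    """Compare what the status endpoint says with how a catalog query behaves."""
    if not leader_from_status(status_body):
        return "no-leader"
    if catalog_code == 200:
        return "healthy"
    # The status endpoint reports a leader, but the catalog query failed,
    # e.g. with HTTP 500 / "No cluster leader" as in this report.
    return "discrepancy"


def probe(base_url: str) -> str:
    """Run both requests against one agent, e.g. probe("http://localhost:8500")."""
    _, status_body = fetch(f"{base_url}/v1/status/leader")
    catalog_code, catalog_body = fetch(f"{base_url}/v1/catalog/nodes")
    return diagnose(status_body, catalog_code, catalog_body)
```

Running `probe` against every node and alerting on "discrepancy" or "no-leader" would cover the case where consul_raft_leader alone reports a leader while catalog requests fail.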