Best way to monitor cluster leader presence/absence #10733

Open
Abhimanyu-Jana opened this issue Jul 30, 2021 · 8 comments
Labels
theme/telemetry Anything related to telemetry or observability type/question Not an "enhancement" or "bug". Please post on discuss.hashicorp waiting-reply Waiting on response from Original Poster or another individual in the thread

Comments

@Abhimanyu-Jana

Our monitoring for the cluster leader is based on consul_raft_leader, exported by the Prometheus Consul exporter:

https://github.com/prometheus/consul_exporter

As I understand it, the exporter gets the leader from the /v1/status/leader endpoint.

However, there were many instances where queries to Consul (v1/catalog/nodes or v1/catalog/services) failed with HTTP 500 / "No cluster leader", even though /v1/status/leader on each individual node reported a leader.

Is consul_raft_leader or /v1/status/leader still a good way to monitor for leader presence/absence?
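
One way to catch this kind of discrepancy is to probe both the status endpoint and a request that fails with "No cluster leader" in the situation described above, and to flag the case where the two disagree. The following is only a minimal sketch, not something taken from consul_exporter or Consul: the function names (`classify`, `fetch`, `probe`) and the state labels are made up for illustration, and it assumes an agent reachable over plain HTTP without ACL tokens.

```python
import json
import urllib.error
import urllib.request


def classify(leader_body: str, catalog_status: int, catalog_body: str) -> str:
    """Combine both observations into one state label (labels are hypothetical)."""
    try:
        leader = json.loads(leader_body)
    except ValueError:
        leader = ""
    # /v1/status/leader returns a JSON string such as "host:8300", or "" if none.
    has_leader = isinstance(leader, str) and leader != ""
    if has_leader and catalog_status == 200:
        return "healthy"
    if has_leader and "No cluster leader" in catalog_body:
        # The case from this issue: a leader is reported but queries still fail.
        return "leader_reported_but_unusable"
    if not has_leader:
        return "no_leader"
    return "catalog_error"


def fetch(url: str, timeout: float = 5.0):
    """Return (status_code, body); HTTP error responses are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", "replace")


def probe(base_url: str) -> str:
    """Query one agent and classify what it reports."""
    _, leader_body = fetch(base_url + "/v1/status/leader")
    status, body = fetch(base_url + "/v1/catalog/nodes")
    return classify(leader_body, status, body)
```

Calling `probe("http://127.0.0.1:8500")` against each server would then return one label per node. Connection failures (`urllib.error.URLError`) are deliberately left to propagate so that an unreachable agent is not mistaken for a missing leader.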

@jkirschner-hashicorp jkirschner-hashicorp added theme/telemetry Anything related to telemetry or observability type/question Not an "enhancement" or "bug". Please post on discuss.hashicorp labels Jul 30, 2021
@jkirschner-hashicorp
Contributor

Copying additional info here from your post on the linked repo: prometheus/consul_exporter#208


What did you do?
Set up monitoring for presence/absence of the cluster leader using the consul_raft_leader metric.

What did you expect to see?
When external queries to the Consul cluster fail with HTTP 500 or a "No cluster leader" error, we expect to see the consul_raft_leader value change from 1 to 0.

What did you see instead? Under which circumstances?
The consul_raft_leader value remains 1 despite obvious issues with cluster health. We can confirm this from logs that show the "No cluster leader" errors, as well as from the "consul operator raft list-peers" command.

Environment
Linux

consul_exporter version:
0.7.1

Consul version:
Consul v1.8.3

Prometheus version:
N/A

Prometheus configuration file:
N/A

Logs:

Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)

@jkirschner-hashicorp
Contributor

Hi @Abhimanyu-Jana,

This seems like a bug - there shouldn't be a discrepancy between the leader status endpoint and what the rest of the cluster thinks.

To help us explore this, can you provide us with some additional information?

  • Are there any reproduction steps you can provide?
  • What's the output of consul info from the client agent and the server agent when this condition occurs?
  • Information about the OS, Architecture, and any other information you can provide about the environment.
  • Log fragments: Include appropriate Client or Server log fragments. If the log is longer than a few dozen lines, please include the URL to the gist of the log instead of posting it in the issue. Use -log-level=TRACE on the client and server to capture the maximum log detail.

@jkirschner-hashicorp jkirschner-hashicorp added the waiting-reply Waiting on response from Original Poster or another individual in the thread label Aug 2, 2021
@Abhimanyu-Jana
Author

Thank you for your response. We'll look into getting this info ASAP

@dnephin
Contributor

dnephin commented Aug 3, 2021

I had a quick look into this. At first we thought it might have been fixed by #8408. I now suspect it's more likely that the underlying issue that prompted #8404 is the same as this one, and that the change unfortunately probably does not fix it.

I wonder if this problem might have been fixed by #9487. That change was backported into Consul v1.8.8. Before that change, networking problems between a client and a server could cause "No cluster leader" errors for RPC requests, even when Raft still had a leader.

Does that seem like it might be the cause of the problem? Would upgrading to 1.8.8 be an option to see if the errors change to "Raft leader not found in server lookup mapping" ?
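
To see which of the two failure modes a cluster is hitting before and after such an upgrade, the error messages can be bucketed by text. This is a rough sketch: only the two quoted error strings come from this thread, while the function names and bucket labels are invented here for illustration.

```python
from collections import Counter

# Error strings as quoted in this thread.
NO_LEADER = "No cluster leader"
LEADER_NOT_IN_LOOKUP = "Raft leader not found in server lookup mapping"


def bucket_error(message: str) -> str:
    """Map one error or log line to a bucket label (labels are hypothetical)."""
    if LEADER_NOT_IN_LOOKUP in message:
        return "raft_leader_not_in_lookup"
    if NO_LEADER in message:
        return "no_cluster_leader"
    return "other"


def count_buckets(lines) -> Counter:
    """Count how many lines fall into each bucket."""
    return Counter(bucket_error(line) for line in lines)
```

Feeding it agent log lines or HTTP error bodies collected over a time window shows which error dominates.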

@nahsi

nahsi commented Aug 13, 2021

> Our monitoring for cluster leader is based on consul_raft_leader exported by the prometheus consul exporter
> https://github.com/prometheus/consul_exporter

Why do you need an outdated third-party exporter when Consul natively supports metrics in Prometheus format?

@Abhimanyu-Jana
Author

Abhimanyu-Jana commented Aug 20, 2021

@nahsi because of the issue described in #5140

However, you raise a valid point. In theory, these metrics are probably all that's needed for monitoring, without having to use the exporter.

Can you confirm if this behaviour described in #5140 was addressed in later versions?
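
For reference, reading a value from the agent's native Prometheus output could look roughly like the sketch below. It rests on assumptions: the agent must have `prometheus_retention_time` set in its `telemetry` configuration so that `/v1/agent/metrics?format=prometheus` returns data, the helper names are hypothetical, and the parser only handles the simple `name{labels} value` line shape rather than the full exposition format.

```python
import urllib.request


def parse_samples(text: str, metric: str):
    """Return (labels_text, value) pairs for samples whose name equals `metric`."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name, _, tail = line.partition("{")
            labels, _, rest = tail.rpartition("}")
        else:
            name, _, rest = line.partition(" ")
            labels = ""
        if name != metric:
            continue
        fields = rest.split()
        if fields:
            # fields[0] is the value; an optional timestamp may follow it.
            samples.append((labels, float(fields[0])))
    return samples


def fetch_metrics(base_url: str, timeout: float = 5.0) -> str:
    """Fetch the agent's metrics in Prometheus text format."""
    url = base_url + "/v1/agent/metrics?format=prometheus"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", "replace")
```

Usage would be something like `parse_samples(fetch_metrics("http://127.0.0.1:8500"), "consul_raft_leader_lastContact")`. That metric name is my assumption about how the Raft last-contact metric is rendered, so check the agent's actual output for the names it exposes.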

@Abhimanyu-Jana
Author

Abhimanyu-Jana commented Aug 20, 2021

Btw, I was able to reproduce this the other day on a broken cluster:

[consul_node1] $ /usr/sbin/consul operator raft list-peers -http-addr=http://$(hostname):8500
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)

[consul_node1] $ curl -s http://$(hostname):8500/v1/status/leader | jq
"<consul_node3>:8300"

@jkirschner-hashicorp
Contributor

@Abhimanyu-Jana : #5140 was resolved by PR #9198 (in Nov 2020).


4 participants