return 500 status when any node from cluster is unavailable #30

Closed
crysis-ps opened this issue May 14, 2020 · 14 comments

@crysis-ps

I have a Proxmox cluster whose nodes all run version 5.4-13.
With exporter version 1.1.2 everything was OK: the 'http://host:9221/pve' URL returned the summary cluster status and node info.

But after updating the exporter to version 1.2.0 I ran into trouble.
When one of the cluster nodes is unavailable, the exporter returns a 500 status page and no metrics at all.
The page http://host:9221/pve?target=proxmox-08 (for example) says '595 Errors during connection establishment'.
So whenever one of my cluster nodes is unavailable, the exporter shows no metrics, only an error page.
The previous version 1.1.2 works fine and shows metrics; there, the unavailable node simply gets a 'pve_up' metric of '0'.
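
For reference, the 1.1.2 output for an unreachable node looks roughly like this (the exact label set is an assumption, recalled from memory):

pve_up{id="node/proxmox-08"} 0.0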

@znerol
Member

znerol commented May 14, 2020

Thank you for taking the time to report this issue.

Since version 1.2.1, pve exporter writes stack traces to stderr if something goes wrong. Can you please update to the latest version and then post the stack trace if a 500 is returned?

@znerol
Member

znerol commented May 14, 2020

Looking through the changes, I suspect that this might be a problem introduced in #22. It is possible that the lxc and qemu config cannot be accessed if a node is down. If this is the case, then we'd need to filter out nodes which are down after calling self._pve.nodes.get() in ClusterNodeConfigCollector.
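
A minimal sketch of that filtering idea (the helper names and the 'status' field in the /nodes response are assumptions on my part, not the actual collector code; pve is a proxmoxer.ProxmoxAPI instance):

# Query guest configs only from cluster members that report as online,
# so a single down node no longer aborts the whole collection.
def online_nodes(pve):
    # Keep only node entries the cluster currently reports as online.
    return [n for n in pve.nodes.get() if n.get('status') == 'online']

def collect_guest_configs(pve):
    # Yield (node name, vm data) pairs for qemu guests on reachable nodes only.
    for node in online_nodes(pve):
        for vmdata in pve.nodes(node['node']).qemu.get():
            yield node['node'], vmdata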

@znerol
Member

znerol commented May 14, 2020

I opened PR #31, which might fix the problem. I also attached the source distribution and a Python wheel, so it is easier for you to test whether the fix works.

@crysis-ps
Author

crysis-ps commented May 18, 2020

Hello, thanks for the fast answer!
I just found time to test: I ran the latest 1.2.1 from pip and tried rebooting a node,
and the exporter returned these errors:

Exception thrown while rendering view
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/pve_exporter/http.py", line 101, in view
    return self._views[endpoint](**params)
  File "/usr/local/lib/python3.7/site-packages/pve_exporter/http.py", line 53, in on_pve
    output = collect_pve(self._config[module], target)
  File "/usr/local/lib/python3.7/site-packages/pve_exporter/collector.py", line 300, in collect_pve
    return generate_latest(registry)
  File "/usr/local/lib/python3.7/site-packages/prometheus_client-0.7.1-py3.7.egg/prometheus_client/exposition.py", line 90, in generate_latest
    for metric in registry.collect():
  File "/usr/local/lib/python3.7/site-packages/prometheus_client-0.7.1-py3.7.egg/prometheus_client/registry.py", line 75, in collect
    for metric in collector.collect():
  File "/usr/local/lib/python3.7/site-packages/pve_exporter/collector.py", line 271, in collect
    for vmdata in self._pve.nodes(node['node']).qemu.get():
  File "/usr/local/lib/python3.7/site-packages/proxmoxer/core.py", line 105, in get
    return self(args)._request("GET", params=params)
  File "/usr/local/lib/python3.7/site-packages/proxmoxer/core.py", line 94, in _request
    resp.reason, resp.content))
proxmoxer.core.ResourceException: 595 Errors during connection establishment, proxy handshake: Connection timed out - b''
127.0.0.1 - - [18/May/2020 20:08:49] "GET /pve?target=hw-proxmox-01 HTTP/1.1" 500 -

And later the same traceback again, ending with a different connection error:

proxmoxer.core.ResourceException: 595 Errors during connection establishment, proxy handshake: No route to host - b''
127.0.0.1 - - [18/May/2020 20:12:44] "GET /pve?target=hw-proxmox-01 HTTP/1.1" 500 -

@crysis-ps
Author

I also tried installing from the sources in PR #31 on a local dev machine, and it seems to work fine. With one node unavailable the exporter still shows metrics, and pve_up is 0 for that node.
Thank you!

@znerol
Member

znerol commented May 18, 2020

I just published 1.2.2 which should fix this problem.

Thanks again for the report.

@znerol znerol closed this as completed May 18, 2020
@gigelu

gigelu commented Jul 21, 2020

Hello

I have a similar problem: when a node is offline, the exporter sometimes returns a 500 error.
Not every time; it's kind of random, and I couldn't figure out why.
The error comes from a KeyError exception in Python:

Jul 21 22:58:32 node1 pve_exporter[79606]: Exception thrown while rendering view
Jul 21 22:58:32 node1 pve_exporter[79606]: Traceback (most recent call last):
Jul 21 22:58:32 node1 pve_exporter[79606]: File "/usr/local/lib/python3.7/dist-packages/pve_exporter/http.py", line 101, in view
Jul 21 22:58:32 node1 pve_exporter[79606]: return self._views[endpoint](**params)
Jul 21 22:58:32 node1 pve_exporter[79606]: File "/usr/local/lib/python3.7/dist-packages/pve_exporter/http.py", line 53, in on_pve
Jul 21 22:58:32 node1 pve_exporter[79606]: output = collect_pve(self._config[module], target)
Jul 21 22:58:32 node1 pve_exporter[79606]: File "/usr/local/lib/python3.7/dist-packages/pve_exporter/collector.py", line 302, in collect_pve
Jul 21 22:58:32 node1 pve_exporter[79606]: return generate_latest(registry)
Jul 21 22:58:32 node1 pve_exporter[79606]: File "/usr/local/lib/python3.7/dist-packages/prometheus_client/exposition.py", line 106, in generate_latest
Jul 21 22:58:32 node1 pve_exporter[79606]: for metric in registry.collect():
Jul 21 22:58:32 node1 pve_exporter[79606]: File "/usr/local/lib/python3.7/dist-packages/prometheus_client/registry.py", line 82, in collect
Jul 21 22:58:32 node1 pve_exporter[79606]: for metric in collector.collect():
Jul 21 22:58:32 node1 pve_exporter[79606]: File "/usr/local/lib/python3.7/dist-packages/pve_exporter/collector.py", line 108, in collect
Jul 21 22:58:32 node1 pve_exporter[79606]: label_values = [str(node[key]) for key in labels]
Jul 21 22:58:32 node1 pve_exporter[79606]: File "/usr/local/lib/python3.7/dist-packages/pve_exporter/collector.py", line 108, in
Jul 21 22:58:32 node1 pve_exporter[79606]: label_values = [str(node[key]) for key in labels]
Jul 21 22:58:32 node1 pve_exporter[79606]: KeyError: 'ip'

I changed the file to print the nodes dict and the output was:

Jul 21 22:58:50 node1 pve_exporter[79606]: [{'name': 'node2', 'level': '', 'ip': '192.168.100.11', 'local': 0, 'nodeid': 2, 'id': 'node/node2'}, {'local': 0, 'nodeid': 3, 'id': 'node/node3', 'level': '', 'name': 'node3'}, {'level': '', 'ip': '192.168.100.10', 'local': 1, 'nodeid': 1, 'name': 'node1', 'id': 'node/node1'}]

The offline node (node3) doesn't have the ip key.
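
For illustration, a defensive variant of the failing comprehension (just a sketch, not necessarily the right fix; the exact label list is an assumption based on the trace and the node dict above):

# Fall back to an empty string when a node entry lacks a label key,
# e.g. the missing 'ip' key on the offline node3.
labels = ['id', 'name', 'ip', 'level', 'local', 'nodeid']
node = {'local': 0, 'nodeid': 3, 'id': 'node/node3', 'level': '', 'name': 'node3'}
label_values = [str(node.get(key, '')) for key in labels]  # no KeyError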

@znerol znerol reopened this Jul 22, 2020
@znerol
Member

znerol commented Jul 22, 2020

Thanks @gigelu for the trace. I will have a look later.

@znerol
Member

znerol commented Jul 25, 2020

It looks like the ip attribute is added conditionally to the /cluster/status API response (code). Still, I wonder why the ip key is sometimes missing, since I'd expect a host name to be resolvable even if the host is down.

The PVE docs used to stress the importance of adding the names and IPs of all nodes to /etc/hosts on every member. Nowadays they seem to only recommend it. @gigelu did you add host names and ips to your /etc/hosts file?
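
(For reference, such an /etc/hosts entry set on every member would look roughly like this, using the addresses from the node list above; node3's address does not appear in this thread, so it is left as a placeholder:)

192.168.100.10 node1
192.168.100.11 node2
<node3-address> node3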

@znerol
Member

znerol commented Jul 25, 2020

@gigelu I opened PR #41, which drops the ip and local labels from the pve_node_info metric. I attached a wheel over there to make it easier for you to test it. Note that I've dropped support for Python 2 in #42, so please use Python 3 to test this. It would be much appreciated if you could report back whether the new version fixes the problem for you.

@gigelu

gigelu commented Jul 26, 2020

Sorry for the late response.

did you add host names and ips to your /etc/hosts file?

No, I didn't (this is a test cluster). But adding them didn't fix the problem.

I attached a wheel over there in order to make it easier for you to test it.

I wasn't able to install from those files (the source files were missing).

Yes, changing the labels to a list without the ip key fixes the error.

Later edit: from reading the linked Proxmox file, shouldn't you remove the level key instead of local? Although in my tests only the ip key was ever missing, never level.

@znerol
Member

znerol commented Jul 26, 2020

Later edit: from reading the linked Proxmox file, shouldn't you remove the level key instead of local? Although in my tests only the ip key was ever missing, never level.

Good point. The cheap answer: I've never had reports of a KeyError caused by the level key. The long answer: it looks like extract_node_stats always returns a Perl hash, and if the level key is missing from it, the API method linked above sets it to an empty string.

The reason why I am tempted to remove the local key is the following: when monitoring a cluster, it is desirable to collect metrics from all (or at least multiple) cluster members. Imagine you have a cluster with nodes a, b, c, and you collect metrics from all of them. Then you get the following results from a:

pve_node_info{id="node/a", local="1", ...} 1.0
pve_node_info{id="node/b", local="0", ...} 1.0
pve_node_info{id="node/c", local="0", ...} 1.0

From b you get:

pve_node_info{id="node/a", local="0", ...} 1.0
pve_node_info{id="node/b", local="1", ...} 1.0
pve_node_info{id="node/c", local="0", ...} 1.0

The rest of the labels, denoted by ..., are the same no matter which node they were scraped from. Prometheus treats each unique combination of labels as a separate time series. In my opinion there is no value in treating pve_node_info records differently based on whether they originate on the node being scraped or on another cluster member. In fact, in my own Prometheus config I simply drop the local label in metric_relabel_configs for exactly that reason. Since we are already introducing a breaking change by dropping the ip label, I figured this would be a good opportunity to drop the local one as well (and also drop support for Python 2) and then bump the version to 2.0.
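
For reference, dropping the label on the Prometheus side looks roughly like this (a sketch; the job name and target are placeholders):

scrape_configs:
  - job_name: pve
    static_configs:
      - targets: ['pve-exporter:9221']
    metric_relabel_configs:
      - action: labeldrop
        regex: local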

@gigelu

gigelu commented Jul 26, 2020

I understand now, thanks for the explanation.

@znerol
Member

znerol commented Nov 2, 2020

Fix in #41 is part of 2.0.1. Closing.
