return 500 status when any node from cluster is unavailable #30

Closed
crysis-ps opened this issue May 14, 2020 · 14 comments

@crysis-ps

I have a Proxmox cluster whose nodes all run version 5.4-13.
With exporter version 1.1.2 everything was OK: the 'http://host:9221/pve' URL returned the summary cluster status and node info.

But after updating the exporter to version 1.2.0 I ran into trouble.
When one of the cluster nodes is unavailable, the exporter returns a 500 status page and no metrics at all.
The page http://host:9221/pve?target=proxmox-08 (for example) says '595 Errors during connection establishment'.
So whenever one of my cluster nodes is unavailable, the exporter shows no metrics, only an error page.
The previous version 1.1.2 works fine and shows metrics; there, the unavailable node simply gets a 'pve_up' metric of '0'.
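
For reference, the 1.1.2 output for an unreachable node looks roughly like this (the exact label set is an assumption, recalled from memory):

pve_up{id="node/proxmox-08"} 0.0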

@znerol
Member

znerol commented May 14, 2020

Thank you for taking the time to report this issue.

Since version 1.2.1, pve exporter writes stack traces to stderr if something goes wrong. Can you please update to the latest version and then post the stack trace if a 500 is returned?

@znerol
Member

znerol commented May 14, 2020

Looking through the changes, I suspect that this might be a problem introduced in #22. It is possible that the lxc and qemu config cannot be accessed if a node is down. If this is the case, then we'd need to filter out nodes which are down after calling self._pve.nodes.get() in ClusterNodeConfigCollector.
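
A minimal sketch of that filtering idea (the helper names and the 'status' field in the /nodes response are assumptions on my part, not the actual collector code; pve is a proxmoxer.ProxmoxAPI instance):

# Query guest configs only from cluster members that report as online,
# so a single down node no longer aborts the whole collection.
def online_nodes(pve):
    # Keep only node entries the cluster currently reports as online.
    return [n for n in pve.nodes.get() if n.get('status') == 'online']

def collect_guest_configs(pve):
    # Yield (node name, vm data) pairs for qemu guests on reachable nodes only.
    for node in online_nodes(pve):
        for vmdata in pve.nodes(node['node']).qemu.get():
            yield node['node'], vmdata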

@znerol
Member

znerol commented May 14, 2020

I opened PR #31, which might fix the problem. I also attached the source distribution and a Python wheel, so it is easier for you to test whether the fix works.

@crysis-ps
Author

crysis-ps commented May 18, 2020

Hello, thanks for the fast answer!
I just found time to test: I ran the latest 1.2.1 from pip and tried rebooting a node,
and the exporter returned these errors:

Exception thrown while rendering view
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/pve_exporter/http.py", line 101, in view
    return self._views[endpoint](**params)
  File "/usr/local/lib/python3.7/site-packages/pve_exporter/http.py", line 53, in on_pve
    output = collect_pve(self._config[module], target)
  File "/usr/local/lib/python3.7/site-packages/pve_exporter/collector.py", line 300, in collect_pve
    return generate_latest(registry)
  File "/usr/local/lib/python3.7/site-packages/prometheus_client-0.7.1-py3.7.egg/prometheus_client/exposition.py", line 90, in generate_latest
    for metric in registry.collect():
  File "/usr/local/lib/python3.7/site-packages/prometheus_client-0.7.1-py3.7.egg/prometheus_client/registry.py", line 75, in collect
    for metric in collector.collect():
  File "/usr/local/lib/python3.7/site-packages/pve_exporter/collector.py", line 271, in collect
    for vmdata in self._pve.nodes(node['node']).qemu.get():
  File "/usr/local/lib/python3.7/site-packages/proxmoxer/core.py", line 105, in get
    return self(args)._request("GET", params=params)
  File "/usr/local/lib/python3.7/site-packages/proxmoxer/core.py", line 94, in _request
    resp.reason, resp.content))
proxmoxer.core.ResourceException: 595 Errors during connection establishment, proxy handshake: Connection timed out - b''
127.0.0.1 - - [18/May/2020 20:08:49] "GET /pve?target=hw-proxmox-01 HTTP/1.1" 500 -

And later the same traceback again, ending with a different connection error:

proxmoxer.core.ResourceException: 595 Errors during connection establishment, proxy handshake: No route to host - b''
127.0.0.1 - - [18/May/2020 20:12:44] "GET /pve?target=hw-proxmox-01 HTTP/1.1" 500 -

@crysis-ps
Author

I also tried installing from the sources in PR #31 on a local dev machine, and it seems to work fine. With one node unavailable the exporter still shows metrics, and pve_up is 0 for that node.
Thank you!

@znerol
Member

znerol commented May 18, 2020

I just published 1.2.2 which should fix this problem.

Thanks again for the report.

@znerol znerol closed this as completed May 18, 2020
@gigelu

gigelu commented Jul 21, 2020

Hello

I have a similar problem: when a node is offline, the exporter sometimes returns a 500 error.
Not every time; it's kind of random, and I couldn't figure out why.
The error comes from a KeyError exception in Python:

Jul 21 22:58:32 node1 pve_exporter[79606]: Exception thrown while rendering view
Jul 21 22:58:32 node1 pve_exporter[79606]: Traceback (most recent call last):
Jul 21 22:58:32 node1 pve_exporter[79606]: File "/usr/local/lib/python3.7/dist-packages/pve_exporter/http.py", line 101, in view
Jul 21 22:58:32 node1 pve_exporter[79606]: return self._views[endpoint](**params)
Jul 21 22:58:32 node1 pve_exporter[79606]: File "/usr/local/lib/python3.7/dist-packages/pve_exporter/http.py", line 53, in on_pve
Jul 21 22:58:32 node1 pve_exporter[79606]: output = collect_pve(self._config[module], target)
Jul 21 22:58:32 node1 pve_exporter[79606]: File "/usr/local/lib/python3.7/dist-packages/pve_exporter/collector.py", line 302, in collect_pve
Jul 21 22:58:32 node1 pve_exporter[79606]: return generate_latest(registry)
Jul 21 22:58:32 node1 pve_exporter[79606]: File "/usr/local/lib/python3.7/dist-packages/prometheus_client/exposition.py", line 106, in generate_latest
Jul 21 22:58:32 node1 pve_exporter[79606]: for metric in registry.collect():
Jul 21 22:58:32 node1 pve_exporter[79606]: File "/usr/local/lib/python3.7/dist-packages/prometheus_client/registry.py", line 82, in collect
Jul 21 22:58:32 node1 pve_exporter[79606]: for metric in collector.collect():
Jul 21 22:58:32 node1 pve_exporter[79606]: File "/usr/local/lib/python3.7/dist-packages/pve_exporter/collector.py", line 108, in collect
Jul 21 22:58:32 node1 pve_exporter[79606]: label_values = [str(node[key]) for key in labels]
Jul 21 22:58:32 node1 pve_exporter[79606]: File "/usr/local/lib/python3.7/dist-packages/pve_exporter/collector.py", line 108, in
Jul 21 22:58:32 node1 pve_exporter[79606]: label_values = [str(node[key]) for key in labels]
Jul 21 22:58:32 node1 pve_exporter[79606]: KeyError: 'ip'

I changed the file to print the nodes dict and the output was:

Jul 21 22:58:50 node1 pve_exporter[79606]: [{'name': 'node2', 'level': '', 'ip': '192.168.100.11', 'local': 0, 'nodeid': 2, 'id': 'node/node2'}, {'local': 0, 'nodeid': 3, 'id': 'node/node3', 'level': '', 'name': 'node3'}, {'level': '', 'ip': '192.168.100.10', 'local': 1, 'nodeid': 1, 'name': 'node1', 'id': 'node/node1'}]

The offline node (node3) doesn't have the ip key.
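
For illustration, a defensive variant of the failing comprehension (just a sketch, not necessarily the right fix; the exact label list is an assumption based on the trace and the node dict above):

# Fall back to an empty string when a node entry lacks a label key,
# e.g. the missing 'ip' key on the offline node3.
labels = ['id', 'name', 'ip', 'level', 'local', 'nodeid']
node = {'local': 0, 'nodeid': 3, 'id': 'node/node3', 'level': '', 'name': 'node3'}
label_values = [str(node.get(key, '')) for key in labels]  # no KeyError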

@znerol znerol reopened this Jul 22, 2020
@znerol
Member

znerol commented Jul 22, 2020

Thanks @gigelu for the trace. I will have a look later.

@znerol
Member

znerol commented Jul 25, 2020

It looks like the ip attribute is added conditionally to the /cluster/status API response (code). Still, I wonder why the ip key is sometimes missing, since I'd expect a host name to be resolvable even if the host is down.

The PVE docs used to stress the importance of adding the names and IPs of all nodes to /etc/hosts on every member. Nowadays they seem to only recommend it. @gigelu did you add host names and ips to your /etc/hosts file?
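
(For reference, such an /etc/hosts entry set on every member would look roughly like this, using the addresses from the node list above; node3's address does not appear in this thread, so it is left as a placeholder:)

192.168.100.10 node1
192.168.100.11 node2
<node3-address> node3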

@znerol
Member

znerol commented Jul 25, 2020

@gigelu I opened PR #41, which drops the ip and local labels from the pve_node_info metric. I attached a wheel over there to make it easier for you to test it. Note that I've dropped support for Python 2 in #42, so please use Python 3 to test this. It would be much appreciated if you could report back whether the new version fixes the problem for you.

@gigelu

gigelu commented Jul 26, 2020

Sorry for the late response.

did you add host names and ips to your /etc/hosts file?

No, I didn't (this is a test cluster). But adding them didn't fix the problem.

I attached a wheel over there in order to make it easier for you to test it.

I wasn't able to install from those files (the source files were missing).

Yes, changing the labels to a list without the ip key fixes the error.

Later edit: from reading the linked Proxmox file, shouldn't you remove the level key instead of local? Although in my tests only the ip key was ever missing, never level.

@znerol
Member

znerol commented Jul 26, 2020

Later edit: from reading the linked Proxmox file, shouldn't you remove the level key instead of local? Although in my tests only the ip key was ever missing, never level.

Good point. The cheap answer: I've never had reports of a KeyError caused by the level key. The long answer: it looks like extract_node_stats always returns a Perl hash, and if the level key is missing from it, the API method linked above sets it to an empty string.

The reason why I am tempted to remove the local key is the following: when monitoring a cluster, it is desirable to collect metrics from all (or at least multiple) cluster members. Imagine you have a cluster with nodes a, b, c, and you collect metrics from all of them. Then you get the following results from a:

pve_node_info{id="node/a", local="1", ...} 1.0
pve_node_info{id="node/b", local="0", ...} 1.0
pve_node_info{id="node/c", local="0", ...} 1.0

From b you get:

pve_node_info{id="node/a", local="0", ...} 1.0
pve_node_info{id="node/b", local="1", ...} 1.0
pve_node_info{id="node/c", local="0", ...} 1.0

The rest of the labels, denoted by ..., are the same no matter which node they were scraped from. Prometheus treats each unique combination of labels as a separate time series. In my opinion there is no value in treating pve_node_info records differently based on whether they originate on the node being scraped or on another cluster member. In fact, in my own Prometheus config I simply drop the local label in metric_relabel_configs for exactly that reason. Since we are already introducing a breaking change by dropping the ip label, I figured this would be a good opportunity to drop the local one as well (and also drop support for Python 2) and then bump the version to 2.0.
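
For reference, dropping the label on the Prometheus side looks roughly like this (a sketch; the job name and target are placeholders):

scrape_configs:
  - job_name: pve
    static_configs:
      - targets: ['pve-exporter:9221']
    metric_relabel_configs:
      - action: labeldrop
        regex: local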

@gigelu

gigelu commented Jul 26, 2020

I understand now, thanks for the explanation.

@znerol
Member

znerol commented Nov 2, 2020

Fix in #41 is part of 2.0.1. Closing.
