[feature request] probe (module scoped) failure reason metric #1077

domcyrus · 2023-06-06T08:57:43Z

probe (module scoped) failure reason metric

Current Situation

Currently, the blackbox_exporter provides some general metrics such as probe_success and probe_duration_seconds that apply universally to all modules. Additionally, specific modules like the http module / prober offer their own metrics like probe_http_status_code, which help monitor the availability and performance of http endpoints. However, when a probe fails, it can be challenging to pinpoint the exact cause of the failure without manually inspecting blackbox_exporter logs or attempting to reproduce the error, if possible.

Proposal

To address this issue, we propose adding a new metric called probe_($MODULE)_failure_reason to the blackbox_exporter. This metric would provide more detailed information about the reasons behind probe failures. It would include a label named "reason" with descriptive and enumerable values such as "dns-resolution-error", "http-timeout", or "ssl-certificate-validation-failed", among others. Currently, these failures can only be inferred from the logged errors.

Benefits

The introduction of the probe_($MODULE)_failure_reason metric would significantly enhance troubleshooting capabilities. In most cases, users would be able to identify the root cause of a probe failure without manual log inspection or additional testing. Moreover, this new metric would make it easier to set up alerts and notifications tailored to specific failure scenarios.

Contribution

We believe that incorporating the probe_($MODULE)_failure_reason metric would be a valuable enhancement for the blackbox_exporter, improving its usability and effectiveness. We would be happy to contribute to the development of this feature and provide feedback on its implementation.

Thank you for considering our proposal. If this is something that would be OK to go forward with, we'd love to contribute the functionality to blackbox_exporter.

druanoor · 2023-06-19T12:02:42Z

Relates to this: #1062

Signed-off-by: Marco Cadetg <[email protected]>

slrtbtfs · 2024-12-03T11:59:11Z

Hi, I'm a bit confused as to why this is marked closed as completed, as it doesn't seem to have been merged.

That being said, I'd be very happy to see a feature like this in blackbox exporter and am thankful for the work you put into it!

beorn7 · 2024-12-03T15:12:48Z

> Hi, I'm a bit confused as to why this is marked closed as completed,

I guess because that is the default if you simply hit the "close" button. :)

Maybe the actual meaning was "I won't have time to work on this anymore" or "I never got feedback from the maintainers, so I gave up."

domcyrus · 2024-12-03T16:27:13Z

> > Hi, I'm a bit confused as to why this is marked closed as completed,
>
> I guess because that is the default if you simply hit the "close" button. :)

Yes, sorry this is what I did.

> Maybe the actual meaning was "I won't have time to work on this anymore" or "I never got feedback from the maintainers, so I gave up."

It was the second, and therefore I thought that it might just not be interesting or needed by anyone. I guess if that is not the case, it's still possible to reopen it.

slrtbtfs · 2024-12-03T16:57:41Z

Thanks for reopening!

I think this feature would be great to have and would significantly improve the service my team can provide.

One minor change I would propose is to call the resulting metric just probe_failure_reason instead of probe_{module}_failure_reason. This makes it easier on the Prometheus side to write queries that work for multiple blackbox modules.

@roidelapluie @mem @electron0zero Would you, in principle, be interested in accepting contributions for such a feature?

(feeling a bit bad to ping all the maintainers, but that was suggested on the prometheus-developers mailing list, so I hope it's OK)

SuperQ · 2024-12-03T17:17:30Z

I think this would be a useful metric.

There was a request on Slack about having a probe_timeout bool metric to indicate if a probe timeout is reached.

This implementation may also be useful.

SuperQ · 2024-12-03T17:18:26Z

Minor nit, I would probably call it probe_failure_info. Not sure we need a per-module variation.

slrtbtfs · 2024-12-12T16:20:45Z

I did a draft implementation of this in #1334 and would appreciate some feedback on whether you think the general approach taken there is viable.

domcyrus pushed a commit to domcyrus/blackbox_exporter that referenced this issue Oct 19, 2023

http failure reason prometheus#1077

6516a08

domcyrus mentioned this issue Oct 23, 2023

Add http failure reason metric #1077 #1139

Closed

domcyrus pushed a commit to domcyrus/blackbox_exporter that referenced this issue Oct 23, 2023

http failure reason prometheus#1077

7d22e65

Signed-off-by: Marco Cadetg <[email protected]>

domcyrus closed this as completed Nov 21, 2024

SuperQ reopened this Dec 3, 2024

slrtbtfs mentioned this issue Dec 12, 2024

Implement probe_failure_info Metric #1334

Draft
