Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more labels to prometheus output format #862

Closed
sebhoss opened this issue Dec 19, 2023 · 4 comments · Fixed by #864
Closed

Add more labels to prometheus output format #862

sebhoss opened this issue Dec 19, 2023 · 4 comments · Fixed by #864

Comments

@sebhoss
Copy link
Contributor

sebhoss commented Dec 19, 2023

Describe the feature:

The prometheus output format should include references to test cases in order to allow more fine grained alerting. The current implementation only contains a summary of test cases for each resource type and a summary across all tests. Since test cases can have different criticality levels, neither of those summaries are sufficient to decide whether a person should be called right now (potentially in the middle of the night) or not.

This has been previously discussed in #607 and my understanding is that it was postponed, but not rejected. There is a valid concern for storage cost, so I don't think this should/must be enabled by default, but rather should be an format option for the prometheus output.

Describe the solution you'd like

The metric goss_tests_outcomes_total should contain additional labels to uniquely identify a single test case or a different metric should be introduced that does exactly that.

Describe alternatives you've considered

My current workaround is an annotation I placed on the prometheus alert for goss tests that links to a runbook which tells the poor soul who has to deal with this to do:

$ ssh ...
$ goss --gossfile ... validate

This returns the failing tests and they can decide whether to fix it right now or go back to bed.

@aelsabbahy
Copy link
Member

@petemounce would love your thoughts on this?

@petemounce
Copy link
Collaborator

A format option makes sense to me if this is done at all. Should be off by default.

However, there's a potential workaround to needing this depending on the environment's observability capabilities.

@sebhoss the way we handled dealing with goss failure was twofold:

  • goss hosted a health check. If too many failures happened, the machine was terminated and replaced by auto scaling. No human was woken up; we filed a ticket-priority alert if this happened "too often" (as in; "important but not paging")
  • we logged the goss output centrally (with an output that was quiet on success, verbose on failure). The ticket contained the log query to ascertain the failing test(s). We had a way to make the health check pull the machine from service vs terminate it. It was then daytime support work to debug whatever was worth debugging to lower the machine-replacement rate.

@sebhoss
Copy link
Contributor Author

sebhoss commented Jan 7, 2024

Thanks for the feedback!

I'm totally fine with this feature being disabled by default. That said, I think even your setup @petemounce would benefit from it: Instead of terminating machines by counting all failures, you could count pager-priority failures only since there can be failures that do not really impact service availability and terminating machines because of those might not be what you want.

Since this feature is kinda blocking us from adopting goss for more machines/tests in our org, how about I open up a PR and we discuss technical details over there?

@petemounce
Copy link
Collaborator

Sure, PR welcome. Please include test coverage for the new code path?

(One of our guiding principles is that a machine failure should never wake someone up; the system should be resilient to that, and sand is cheaper than carbon to run. We replace them "quickly enough" that there's no overall impact. We page if service is impacted, which can happen for a number of reasons, machine taint and replacement being one).

sebhoss added a commit to sebhoss/goss that referenced this issue Jan 13, 2024
This adds a 'resource_id' label to each test-specific metric if the 'verbose' format option was enabled.

Since we only have access to format options in the 'Output' method, we are creating metrics during the first run of that method.

Metrics can only be registered once for any given set of labels, so we create a custom registry and use the existence of that as a check whether we need to create/register metrics. This has the added benefit of simplifying the test a tiny bit, and it exposes even less metrics on the /health endpoint.

fixes goss-org#862

Signed-off-by: Sebastian Hoß <[email protected]>
sebhoss added a commit to sebhoss/goss that referenced this issue Jan 13, 2024
This adds a 'resource_id' label to each test-specific metric if the 'verbose' format option was enabled.

Since we only have access to format options in the 'Output' method, we are creating metrics during the first run of that method.

Metrics can only be registered once for any given set of labels, so we create a custom registry and use the existence of that as a check whether we need to create/register metrics. This has the added benefit of simplifying the test a tiny bit, and it exposes even less metrics on the /health endpoint.

fixes goss-org#862

Signed-off-by: Sebastian Hoß <[email protected]>
aelsabbahy pushed a commit that referenced this issue Jan 17, 2024
* add more labels to prometheus output

This adds a 'resource_id' label to each test-specific metric if the 'verbose' format option was enabled.

Since we only have access to format options in the 'Output' method, we are creating metrics during the first run of that method.

Metrics can only be registered once for any given set of labels, so we create a custom registry and use the existence of that as a check whether we need to create/register metrics. This has the added benefit of simplifying the test a tiny bit, and it exposes even less metrics on the /health endpoint.

fixes #862

Signed-off-by: Sebastian Hoß <[email protected]>

* mention prometheus in docs for verbose flag

Signed-off-by: Sebastian Hoß <[email protected]>

---------

Signed-off-by: Sebastian Hoß <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants