
Add RabbitMQ Alerts 🚨 #667

Merged: 5 commits merged into main on Apr 27, 2021
Conversation

@ansd (Member) commented on Apr 27, 2021

Add Prometheus rules for RabbitMQ and cluster operator.
Add Alertmanager config for Slack notifications.
Add RabbitMQ-Alerts Grafana dashboard.

Blog post on https://blog.rabbitmq.com/ will follow soon.

In future, we might want to move all dashboards from https://github.com/rabbitmq/rabbitmq-server/tree/master/deps/rabbitmq_prometheus/docker/grafana/dashboards into this repository and create GitHub actions to publish dashboards on https://grafana.com/orgs/rabbitmq.

Add Prometheus rules for RabbitMQ and cluster operator.
Add Alertmanager config for Slack notifications.
Add RabbitMQ-Alerts Grafana dashboard.

Co-authored-by: Gerhard Lazu <[email protected]>
Co-authored-by: Michal Kuratczyk <[email protected]>
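
(For readers who haven't used the PrometheusRule CRD before, a minimal sketch of the shape such a rule takes. The alert name, expression, threshold, and labels below are illustrative only and are not copied from the rules added in this PR:)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rabbitmq-alerts            # illustrative name
  labels:
    release: prom                  # only needed if your Prometheus spec.ruleSelector uses matchLabels
spec:
  groups:
  - name: rabbitmq
    rules:
    - alert: RabbitMQHighMemoryUsage   # illustrative alert, not one of the exact rules in this PR
      expr: rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes > 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        description: RabbitMQ node {{ $labels.rabbitmq_node }} is using more than 90% of its memory limit.
        summary: Check memory usage of the node and the remediation steps for this alert.

Prometheus Operator picks up PrometheusRule objects that match its ruleSelector and loads the contained groups as ordinary Prometheus alerting rules.
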
@coro (Contributor) left a comment

Looks amazing!
[screenshot omitted]

I love that you include remediation/next steps in each of the alerts; I think that's very useful for someone hitting these alerts and not knowing how to fix them.

I'm curious what you would/did use to test these alerts, and if it's worth sharing that tooling.
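
(An editorial aside for readers with the same question: one common way to exercise alerting rules, though not necessarily what was used for this PR, is promtool's unit-test format, run with promtool test rules. The file name, metric, alert name, and labels below are purely illustrative:)

rule_files:
- rabbitmq-alerts.yml            # illustrative path to the rule file under test
evaluation_interval: 1m
tests:
- interval: 1m
  input_series:
  # pretend the memory alarm metric has been 1 for the whole test window
  - series: 'rabbitmq_alarms_memory_used_watermark{namespace="default", rabbitmq_cluster="demo"}'
    values: '1+0x20'
  alert_rule_test:
  - eval_time: 10m
    alertname: MemoryAlarm       # illustrative alert name
    exp_alerts:
    - exp_labels:
        severity: warning
        namespace: default
        rabbitmq_cluster: demo

Running promtool test rules against such a file asserts that the alert fires with the expected labels after the given evaluation time.
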

observability/quickstart.sh (resolved review thread)
{
  "matcher": {
    "id": "byName",
    "options": "PersistentVolumeClaim"
  }
}
@coro (Contributor):

If I understand correctly, this provides additional metadata about this specific alert if it fires. So if I have lots of different alerts, only one type will populate this column, and the rest will have an empty entry:
[screenshot of the alerts table]

I'm fine with this for now, but there may come a time when we want to display metadata for many more alerts in the future. I wonder if there's any other way of displaying just opaque metadata, which can be filled with whatever is relevant to the alert in question.

As I say, it's only a thought and I'm curious to hear your opinion. No need to change this now if you'd rather not.

@ansd (Member, author):

Good point. The renaming from alert label (i.e. the metadata) to table column name takes place in the dashboard's renameByName mapping:

"renameByName": {
"Time": "Time of Query",
"Value": "Number of Times Alert started",
"alertname": "Alert Name",
"container": "",
"instance": "Instance (scraped endpoint)",
"namespace": "Namespace",
"persistentvolumeclaim": "PersistentVolumeClaim",
"pod": "Pod",
"rabbitmq_cluster": "RabbitMQ Cluster",
"rabbitmq_node": "RabbitMQ node",
"severity": "Severity"
}
In your table output above, only some alerts have a persistentvolumeclaim label. If an alert doesn't have that label, its cell in the table is empty.

I think if we observe that the table contains too many columns, we can either remap or omit certain columns, or introduce an opaque metadata label as you suggest. As of now, the user can also filter the table by alert type (via the drop-down at the top).

observability/quickstart.sh (resolved review thread)
Comment on lines +5 to +6
Check the `spec` of your installed Prometheus custom resource.
In this example, let's assume your Prometheus spec contains the following fields:
@coro (Contributor):

I'm confused about the purpose of this tutorial (admittedly going in rather fresh-faced). Perhaps a README one level up would be useful to explain what the three directories are for?

Essentially, my user experience has been: run the quickstart, then investigate the other directories under observability/prometheus. My Prometheus spec doesn't have a podMonitorSelector or a serviceMonitorSelector, but a probeSelector, so as a user I'm unclear how this README relates to what I've just set up.

@ansd (Member, author):

The installation experience differs depending on where you start:

  1. Prometheus is not installed, or
  2. Prometheus is already installed via Prometheus Operator, or
  3. Prometheus is already installed, but not via Prometheus Operator

By doing the quickstart, you started from option 1.

Your Prometheus spec has an empty podMonitorSelector and serviceMonitorSelector:

kubectl -n kube-prometheus get prometheus -o yaml

- apiVersion: monitoring.coreos.com/v1
  kind: Prometheus
...
  spec:
    ...
    podMonitorNamespaceSelector: {}
    podMonitorSelector: {}
    ...
    probeNamespaceSelector: {}
    probeSelector:
      matchLabels:
        release: prom
    ...
    serviceMonitorNamespaceSelector: {}
    serviceMonitorSelector: {}
    ...
    version: v2.26.0

I left these selectors empty on purpose via

--set "prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false" \
--set "prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false" \

to keep things as simple as possible in the quickstart. So you don't need to add any labels in option 1.

However, if you were to apply these YAML files starting from option 2, you'll likely need to add these labels.
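
(A hypothetical sketch of what adding such a label looks like, assuming a Prometheus spec whose podMonitorSelector.matchLabels is release: prom; the object name, metrics port, and pod selector below are illustrative and not taken from this repository's manifests:)

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: rabbitmq
  labels:
    release: prom                     # must match spec.podMonitorSelector.matchLabels of your Prometheus
spec:
  podMetricsEndpoints:
  - port: prometheus                  # illustrative: the name of the metrics port exposed by your RabbitMQ pods
  selector:
    matchLabels:
      app.kubernetes.io/component: rabbitmq   # illustrative pod selector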

The blog post (currently in review) will make these 3 options clearer.

For now, I added probeSelectorNilUsesHelmValues=false to the quickstart so that there is no matchLabels selector at all in the Prometheus spec.

I also included a comment in the YAML files noting that the labels only need to be added if a matchLabels selector is defined, e.g.:

# If labels are defined in spec.serviceMonitorSelector.matchLabels of your deployed Prometheus object, make sure to include them here.

I can add more READMEs if you think that would make it clearer. Personally, I'd rather link to the blog post from within https://github.com/rabbitmq/cluster-operator/blob/alerts/observability/README.md once it has been reviewed.

@coro (Contributor):

Happy to wait on the blog post.

@ansd merged commit 8836b1f into main on Apr 27, 2021.
@ansd deleted the alerts branch on April 27, 2021 at 16:08.
ansd added a commit to rabbitmq/rabbitmq-website that referenced this pull request on May 5, 2021:
This commit adds docs on how to integrate RabbitMQ scrape targets,
alerting rules, and Grafana dashboards with an existing Prometheus.

We describe the integration for a Prometheus installed by Prometheus
Operator (i.e. CRDs ServiceMonitor, PodMonitor, PrometheusRule are available)
and for a Prometheus not installed by Prometheus Operator.

Remove the section to scrape RabbitMQ by annotations since that approach
is not recommended anymore.

This commit tries to remove duplicate YAML definitions from the docs
and instead links to the YAML definitions already present in the
cluster-operator repo.

Relates to rabbitmq/cluster-operator#667 and
rabbitmq/cluster-operator#676
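
(For the second case mentioned in that commit, a Prometheus not installed by Prometheus Operator, a minimal sketch of what a hand-written scrape configuration might look like; the job name, port, and relabeling choices below are illustrative assumptions rather than the exact config from the docs:)

scrape_configs:
- job_name: rabbitmq                     # illustrative job name
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # keep only pods that expose the rabbitmq_prometheus port (15692 by default)
  - source_labels: [__meta_kubernetes_pod_container_port_number]
    regex: "15692"
    action: keep
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod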