Add RabbitMQ Alerts 🚨 #667
Add Prometheus rules for RabbitMQ and the cluster operator.
Add Alertmanager config for Slack notifications.
Add RabbitMQ-Alerts Grafana dashboard.

Co-authored-by: Gerhard Lazu <[email protected]>
Co-authored-by: Michal Kuratczyk <[email protected]>
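For context, here is a minimal sketch of what a Slack receiver in an Alertmanager config looks like. The channel and webhook URL are placeholders, and this is not necessarily the exact config added in this PR:

```yaml
# Sketch of an Alertmanager Slack receiver; channel and api_url are placeholders.
route:
  receiver: slack
receivers:
  - name: slack
    slack_configs:
      - channel: '#rabbitmq-alerts'                  # placeholder channel
        api_url: https://hooks.slack.com/services/XXX  # placeholder webhook URL
        send_resolved: true
```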
```json
{
  "matcher": {
    "id": "byName",
    "options": "PersistentVolumeClaim"
```
If I understand correctly, this provides additional metadata about this specific alert if it fires. So if I have lots of different alerts, only one type will populate this column, and the rest will have an empty entry.
I'm fine with this for now, but there may come a time when we want to display metadata for many more alerts. I wonder if there's any other way of displaying just opaque metadata, which can be filled with whatever is relevant to the alert in question.
As I say, it's only a thought, and I'm curious to hear your opinions. No need to change this now.
Good point. The renaming from alert label (i.e. the metadata) to table column name takes place in cluster-operator/observability/grafana/dashboards/rabbitmq-alerts.yml, lines 380 to 392 in ae6e74e:
"renameByName": { | |
"Time": "Time of Query", | |
"Value": "Number of Times Alert started", | |
"alertname": "Alert Name", | |
"container": "", | |
"instance": "Instance (scraped endpoint)", | |
"namespace": "Namespace", | |
"persistentvolumeclaim": "PersistentVolumeClaim", | |
"pod": "Pod", | |
"rabbitmq_cluster": "RabbitMQ Cluster", | |
"rabbitmq_node": "RabbitMQ node", | |
"severity": "Severity" | |
} |
Only alerts that carry the label persistentvolumeclaim will populate that column. If an alert doesn't have such a label, the cell in the table will be empty.
I think if we observe that the table contains too many columns, we can either remap or omit certain columns, or introduce an opaque metadata label as you suggest. As of now, the user can also filter the table by alert type (via the drop-down at the top).
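For illustration, here is a minimal sketch of an alerting rule whose label set includes persistentvolumeclaim: the labels on a firing alert are what the renameByName transformation above turns into table columns. The alert name, expression, and threshold below are hypothetical and not rules shipped in this PR; the kubelet's volume-stats metrics are assumed as the source of the label.

```yaml
# Hypothetical rule for illustration only; not one of the rules in this PR.
# The `persistentvolumeclaim` label comes from the kubelet_volume_stats_* metrics,
# so it is carried on the firing alert and surfaces as the "PersistentVolumeClaim" column.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rabbitmq-pvc-usage-example
spec:
  groups:
    - name: rabbitmq
      rules:
        - alert: ExamplePersistentVolumeFillingUp  # hypothetical alert name
          expr: |
            kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"persistence-.+"}
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"persistence-.+"}
            < 0.1
          for: 10m
          labels:
            severity: warning
```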
Check the `spec` of your installed Prometheus custom resource.
In this example, let's assume your Prometheus spec contains the following fields:
I'm confused about the purpose of this tutorial (admittedly going in rather fresh-faced). Perhaps a README one level up would be useful in explaining what the three directories are for?
Essentially, my user experience has been to run the quickstart, then investigate the other directories under observability/prometheus. My Prometheus spec doesn't have a podMonitorSelector or a serviceMonitorSelector, but a probeSelector, so as a user I'm unclear how this README relates to what I've just set up.
The installation experience differs depending on where you start:
- Prometheus is not installed, or
- Prometheus is already installed via Prometheus Operator, or
- Prometheus is already installed, but not via Prometheus Operator
By doing the quickstart, you started from option 1.
Your Prometheus spec has an empty podMonitorSelector and serviceMonitorSelector:
```shell
kubectl -n kube-prometheus get prometheus -o yaml
```
```yaml
- apiVersion: monitoring.coreos.com/v1
  kind: Prometheus
  ...
  spec:
    ...
    podMonitorNamespaceSelector: {}
    podMonitorSelector: {}
    ...
    probeNamespaceSelector: {}
    probeSelector:
      matchLabels:
        release: prom
    ...
    serviceMonitorNamespaceSelector: {}
    serviceMonitorSelector: {}
    ...
    version: v2.26.0
```
I left these selectors empty on purpose (an empty selector matches all PodMonitors/ServiceMonitors in the selected namespaces) via cluster-operator/observability/quickstart.sh, lines 31 to 32 in 279e6a6:
--set "prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false" \ | |
--set "prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false" \ |
However, if you were to apply these YAML files starting from option 2, you'll likely need to add these labels.
The blog post (currently in review) will make these three options clearer.
For now, I added probeSelectorNilUsesHelmValues=false to the quickstart so that there is no matchLabels selector at all in the Prometheus spec.
I also included a comment in the YAML files that the labels only need to be added if a matchLabels selector is defined, e.g.:

```yaml
# If labels are defined in spec.serviceMonitorSelector.matchLabels of your deployed Prometheus object, make sure to include them here.
```
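To make that comment concrete, here is a minimal PodMonitor sketch, assuming a Prometheus spec whose podMonitorSelector.matchLabels contains release: prom. The monitor name, metrics port, and pod selector are illustrative assumptions, not definitions from this PR:

```yaml
# Illustrative PodMonitor; the metadata label is the point here.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: rabbitmq
  labels:
    release: prom  # needed only if spec.podMonitorSelector.matchLabels defines this label
spec:
  podMetricsEndpoints:
    - port: prometheus  # assumed name of the metrics port on the RabbitMQ pods
  selector:
    matchLabels:
      app.kubernetes.io/component: rabbitmq  # assumed pod label
```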
I can add more READMEs if you think that would make it clearer.
I would personally rather link to the blog post from within https://github.com/rabbitmq/cluster-operator/blob/alerts/observability/README.md once it has been reviewed.
Happy to wait on the blog post.
This commit adds docs on how to integrate RabbitMQ scrape targets, alerting rules, and Grafana dashboards with an existing Prometheus. We describe the integration for a Prometheus installed by Prometheus Operator (i.e. CRDs ServiceMonitor, PodMonitor, PrometheusRule are available) and for a Prometheus not installed by Prometheus Operator. Remove the section to scrape RabbitMQ by annotations since that approach is not recommended anymore. This commit tries to remove duplicate YAML definitions from the docs and instead links to the YAML definitions already present in the cluster-operator repo. Relates to rabbitmq/cluster-operator#667 and rabbitmq/cluster-operator#676
Add Prometheus rules for RabbitMQ and cluster operator.
Add Alertmanager config for Slack notifications.
Add RabbitMQ-Alerts Grafana dashboard.
Blog post on https://blog.rabbitmq.com/ will follow soon.
In the future, we might want to move all dashboards from https://github.com/rabbitmq/rabbitmq-server/tree/master/deps/rabbitmq_prometheus/docker/grafana/dashboards into this repository and create GitHub Actions to publish dashboards to https://grafana.com/orgs/rabbitmq.