
Add RabbitMQ Alerts 🚨 #667

Merged: 5 commits merged into main on Apr 27, 2021
Conversation

@ansd (Member) commented on Apr 27, 2021

Add Prometheus rules for RabbitMQ and cluster operator.
Add Alertmanager config for Slack notifications.
Add RabbitMQ-Alerts Grafana dashboard.

Blog post on https://blog.rabbitmq.com/ will follow soon.

In future, we might want to move all dashboards from https://github.com/rabbitmq/rabbitmq-server/tree/master/deps/rabbitmq_prometheus/docker/grafana/dashboards into this repository and create GitHub actions to publish dashboards on https://grafana.com/orgs/rabbitmq.

Add Prometheus rules for RabbitMQ and cluster operator.
Add Alertmanager config for Slack notifications.
Add RabbitMQ-Alerts Grafana dashboard.

Co-authored-by: Gerhard Lazu <[email protected]>
Co-authored-by: Michal Kuratczyk <[email protected]>
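
(For readers who haven't used the PrometheusRule CRD before, a minimal sketch of the shape such a rule takes. The alert name, expression, threshold, and labels below are illustrative only and are not copied from the rules added in this PR:)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rabbitmq-alerts            # illustrative name
  labels:
    release: prom                  # only needed if your Prometheus spec.ruleSelector uses matchLabels
spec:
  groups:
  - name: rabbitmq
    rules:
    - alert: RabbitMQHighMemoryUsage   # illustrative alert, not one of the exact rules in this PR
      expr: rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes > 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        description: RabbitMQ node {{ $labels.rabbitmq_node }} is using more than 90% of its memory limit.
        summary: Check memory usage of the node and the remediation steps for this alert.

Prometheus Operator picks up PrometheusRule objects that match its ruleSelector and loads the contained groups as ordinary Prometheus alerting rules.
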
@coro (Contributor) left a comment

Looks amazing!
[screenshot omitted]

I love that you include remediation/next steps in each of the alerts; I think that's very useful for someone hitting these alerts and not knowing how to fix them.

I'm curious what you would/did use to test these alerts, and if it's worth sharing that tooling.
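
(An editorial aside for readers with the same question: one common way to exercise alerting rules, though not necessarily what was used for this PR, is promtool's unit-test format, run with promtool test rules. The file name, metric, alert name, and labels below are purely illustrative:)

rule_files:
- rabbitmq-alerts.yml            # illustrative path to the rule file under test
evaluation_interval: 1m
tests:
- interval: 1m
  input_series:
  # pretend the memory alarm metric has been 1 for the whole test window
  - series: 'rabbitmq_alarms_memory_used_watermark{namespace="default", rabbitmq_cluster="demo"}'
    values: '1+0x20'
  alert_rule_test:
  - eval_time: 10m
    alertname: MemoryAlarm       # illustrative alert name
    exp_alerts:
    - exp_labels:
        severity: warning
        namespace: default
        rabbitmq_cluster: demo

Running promtool test rules against such a file asserts that the alert fires with the expected labels after the given evaluation time.
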

observability/quickstart.sh (resolved review thread)
{
  "matcher": {
    "id": "byName",
    "options": "PersistentVolumeClaim"
  }
}
@coro (Contributor):

If I understand correctly, this provides additional metadata about this specific alert if it fires. So if I have lots of different alerts, only one type will populate this column, and the rest will have an empty entry:
[screenshot of the alerts table]

I'm fine with this for now, but there may come a time when we want to display metadata for many more alerts in the future. I wonder if there's any other way of displaying just opaque metadata, which can be filled with whatever is relevant to the alert in question.

As I say, it's only a thought and I'm curious to hear your opinion. No need to change this now if you'd rather not.

@ansd (Member, author):

Good point. The renaming from alert label (i.e. the metadata) to table column name takes place in the dashboard's renameByName mapping:

"renameByName": {
"Time": "Time of Query",
"Value": "Number of Times Alert started",
"alertname": "Alert Name",
"container": "",
"instance": "Instance (scraped endpoint)",
"namespace": "Namespace",
"persistentvolumeclaim": "PersistentVolumeClaim",
"pod": "Pod",
"rabbitmq_cluster": "RabbitMQ Cluster",
"rabbitmq_node": "RabbitMQ node",
"severity": "Severity"
}
In your table output above, only some alerts have a persistentvolumeclaim label. If an alert doesn't have that label, its cell in the table is empty.

I think if we observe that the table contains too many columns, we can either remap or omit certain columns, or introduce an opaque metadata label as you suggest. As of now, the user can also filter the table by alert type (via the drop-down at the top).

observability/quickstart.sh (resolved review thread)
Comment on lines +5 to +6
Check the `spec` of your installed Prometheus custom resource.
In this example, let's assume your Prometheus spec contains the following fields:
@coro (Contributor):

I'm confused about the purpose of this tutorial (admittedly going in rather fresh-faced). Perhaps a README one level up would be useful to explain what the three directories are for?

Essentially, my user experience has been: run the quickstart, then investigate the other directories under observability/prometheus. My Prometheus spec doesn't have a podMonitorSelector or a serviceMonitorSelector, but a probeSelector, so as a user I'm unclear how this README relates to what I've just set up.

@ansd (Member, author):

The installation experience differs depending on where you start:

  1. Prometheus is not installed, or
  2. Prometheus is already installed via Prometheus Operator, or
  3. Prometheus is already installed, but not via Prometheus Operator

By doing the quickstart, you started from option 1.

Your Prometheus spec has an empty podMonitorSelector and serviceMonitorSelector:

kubectl -n kube-prometheus get prometheus -o yaml

- apiVersion: monitoring.coreos.com/v1
  kind: Prometheus
...
  spec:
    ...
    podMonitorNamespaceSelector: {}
    podMonitorSelector: {}
    ...
    probeNamespaceSelector: {}
    probeSelector:
      matchLabels:
        release: prom
    ...
    serviceMonitorNamespaceSelector: {}
    serviceMonitorSelector: {}
    ...
    version: v2.26.0

I left these selectors empty on purpose via

--set "prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false" \
--set "prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false" \

to keep things as simple as possible in the quickstart. So you don't need to add any labels in option 1.

However, if you were to apply these YAML files starting from option 2, you'll likely need to add these labels.
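
(A hypothetical sketch of what adding such a label looks like, assuming a Prometheus spec whose podMonitorSelector.matchLabels is release: prom; the object name, metrics port, and pod selector below are illustrative and not taken from this repository's manifests:)

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: rabbitmq
  labels:
    release: prom                     # must match spec.podMonitorSelector.matchLabels of your Prometheus
spec:
  podMetricsEndpoints:
  - port: prometheus                  # illustrative: the name of the metrics port exposed by your RabbitMQ pods
  selector:
    matchLabels:
      app.kubernetes.io/component: rabbitmq   # illustrative pod selector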

The blog post (currently in review) will make these 3 options clearer.

For now, I added probeSelectorNilUsesHelmValues=false to the quickstart so that there is no matchLabels selector at all in the Prometheus spec.

I also included a comment in the YAML files noting that the labels only need to be added if a matchLabels selector is defined, e.g.:

# If labels are defined in spec.serviceMonitorSelector.matchLabels of your deployed Prometheus object, make sure to include them here.

I can add more READMEs if you think that would make it clearer. Personally, I'd rather link to the blog post from within https://github.com/rabbitmq/cluster-operator/blob/alerts/observability/README.md once it has been reviewed.

@coro (Contributor):

Happy to wait on the blog post.

@ansd merged commit 8836b1f into main on Apr 27, 2021.
@ansd deleted the alerts branch on April 27, 2021 at 16:08.
ansd added a commit to rabbitmq/rabbitmq-website that referenced this pull request on May 5, 2021:
This commit adds docs on how to integrate RabbitMQ scrape targets,
alerting rules, and Grafana dashboards with an existing Prometheus.

We describe the integration for a Prometheus installed by Prometheus
Operator (i.e. CRDs ServiceMonitor, PodMonitor, PrometheusRule are available)
and for a Prometheus not installed by Prometheus Operator.

Remove the section to scrape RabbitMQ by annotations since that approach
is not recommended anymore.

This commit tries to remove duplicate YAML definitions from the docs
and instead links to the YAML definitions already present in the
cluster-operator repo.

Relates to rabbitmq/cluster-operator#667 and
rabbitmq/cluster-operator#676
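
(For the second case mentioned in that commit, a Prometheus not installed by Prometheus Operator, a minimal sketch of what a hand-written scrape configuration might look like; the job name, port, and relabeling choices below are illustrative assumptions rather than the exact config from the docs:)

scrape_configs:
- job_name: rabbitmq                     # illustrative job name
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # keep only pods that expose the rabbitmq_prometheus port (15692 by default)
  - source_labels: [__meta_kubernetes_pod_container_port_number]
    regex: "15692"
    action: keep
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod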