fixed the LowDiskWatermarkPredicted Alert #1290

Conversation

sherifkayad
Contributor

This closes #

Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed

Summary Of Changes

The current LowDiskWatermarkPredicted alert rule can fail to evaluate with the Prometheus error "multiple matches for labels: grouping labels must ensure unique matches".

Joining on both pod and instance, rather than on instance alone, fixes the issue.
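To illustrate why the matching labels matter, here is a minimal Python sketch of a many-to-one join. It is a simplified model of group_left matching, not Prometheus's implementation. The function name, label values, and the "stale series" scenario are hypothetical, and it assumes pod moves out of the group_left list once it is used as a matching label.

```python
from typing import Dict, List

Labels = Dict[str, str]

ERR = "multiple matches for labels: grouping labels must ensure unique matches"


def join_group_left(left: List[Labels], right: List[Labels],
                    on: List[str], include: List[str]) -> List[Labels]:
    """Simplified model of `left * on(<on>) group_left(<include>) right`."""
    # Index the "one" side by the values of the matching labels.
    right_by_key = {}
    for r in right:
        key = tuple(r.get(name, "") for name in on)
        if key in right_by_key:
            raise ValueError("duplicate series on the right-hand side")
        right_by_key[key] = r

    seen = set()
    results = []
    for l in left:
        r = right_by_key.get(tuple(l.get(name, "") for name in on))
        if r is None:
            continue  # no partner on the right side: the series is dropped
        out = dict(l)
        # group_left labels are copied from the right side, overwriting
        # whatever the left-hand series carried under the same name.
        for name in include:
            if name in r:
                out[name] = r[name]
            else:
                out.pop(name, None)
        signature = frozenset(out.items())
        if signature in seen:
            raise ValueError(ERR)
        seen.add(signature)
        results.append(out)
    return results


# Hypothetical data: two disk-space series share an instance but differ in pod.
LEFT = [
    {"instance": "10.0.0.1:15692", "pod": "rabbit-0"},
    {"instance": "10.0.0.1:15692", "pod": "rabbit-0-stale"},
]
RIGHT = [
    {"instance": "10.0.0.1:15692", "pod": "rabbit-0",
     "rabbitmq_cluster": "rabbit", "rabbitmq_node": "rabbit@rabbit-0"},
]

# Joining on instance only: pod is overwritten from the right side, so both
# left-hand series collapse into the same result labels.
try:
    join_group_left(LEFT, RIGHT, on=["instance"],
                    include=["rabbitmq_cluster", "rabbitmq_node", "pod"])
except ValueError as exc:
    print("on (instance):", exc)

# Joining on instance and pod: pod is kept as-is and results stay unique.
fixed = join_group_left(LEFT, RIGHT, on=["instance", "pod"],
                        include=["rabbitmq_cluster", "rabbitmq_node"])
print("on (instance, pod):", fixed)
```

In this model, matching on instance alone lets two left-hand series end up with identical labels after pod is copied over, which is one way the error above can arise. Matching on both labels keeps each result distinct.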

Additional Context

The full error log from Prometheus:

ts=2023-03-14T15:17:25.487Z caller=manager.go:636 level=warn component="rule manager" file=/etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0/rabbitmq-rabbitmq-alerting-rules-2acff1e4-8959-4af2-bd88-725641ae175d.yaml group=rabbitmq name=LowDiskWatermarkPredicted index=1 msg="Evaluating rule failed" rule="alert: LowDiskWatermarkPredicted\nexpr: (predict_linear(rabbitmq_disk_space_available_bytes[1d], 60 * 60 * 24) * on\n  (instance) group_left (rabbitmq_cluster, rabbitmq_node, pod) rabbitmq_identity_info\n  < rabbitmq_disk_space_available_limit_bytes * on (instance) group_left (rabbitmq_cluster,\n  rabbitmq_node, pod) rabbitmq_identity_info) and (count_over_time(rabbitmq_disk_space_available_limit_bytes[2h]\n  offset 22h) * on (instance) group_left (rabbitmq_cluster, rabbitmq_node, pod) rabbitmq_identity_info\n  > 0)\nfor: 1h\nlabels:\n  forwardToDynatrace: \"true\"\n  rulesgroup: rabbitmq\n  severity: warning\nannotations:\n  description: |\n    The predicted free disk space in 24 hours from now is `{{ $value | humanize1024 }}B`\n    in RabbitMQ node `{{ $labels.rabbitmq_node }}`, pod `{{ $labels.pod }}`,\n    RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}`, namespace `{{ $labels.namespace }}`.\n  summary: |\n    Based on the trend of available disk space over the past 24 hours, it's predicted that, in 24 hours from now, a disk alarm will be triggered since the free disk space will drop below the free disk space limit.\n    This alert is reported for the partition where the RabbitMQ data directory is stored.\n    When the disk alarm will be triggered, all publishing connections across all cluster nodes will be blocked.\n    See\n    https://www.rabbitmq.com/alarms.html,\n    https://www.rabbitmq.com/disk-alarms.html,\n    https://www.rabbitmq.com/production-checklist.html#resource-limits-disk-space,\n    https://www.rabbitmq.com/persistence-conf.html,\n    https://www.rabbitmq.com/connection-blocked.html.\n" err="multiple matches for labels: grouping labels must ensure unique matches"

Local Testing

Please ensure you run the unit, integration and system tests before approving the PR.

To run the unit and integration tests:

$ make unit-tests integration-tests

You will need to target a k8s cluster and have the operator deployed for running the system tests.

For example, for a Kubernetes context named dev-bunny:

$ kubectx dev-bunny
$ make destroy deploy-dev
# wait for operator to be deployed
$ make system-tests

@sherifkayad
Contributor Author

What can be done to get this guy to green?

@sherifkayad sherifkayad force-pushed the observability-lowdiskwatermarkpredicted-alert-fix branch from b590b74 to 2f00954 Compare March 27, 2023 09:20
@DanielePalaia
Contributor

@sherifkayad it seems to be an issue in our pipeline. Your fix looks good, so I will merge it. Thank you.

@DanielePalaia DanielePalaia merged commit 47ff9c2 into rabbitmq:main Mar 27, 2023
@sherifkayad sherifkayad deleted the observability-lowdiskwatermarkpredicted-alert-fix branch March 27, 2023 10:59