Alerting: Support per series state change tracking for queries that return multiple series #6041
Comments
Just a +1 for this. Our use case for this feature is that we have ~150 boxes at different locations, and it would be really useful if we could set up an alert like "if memory usage > 75% of total". |
Implementation details will likely overlap with #6557 I believe. |
I realize this is a more suitable issue for my comment in #6685 (comment) I have another slightly different use case for this. I have multiple series in a graph: https://snapshot.raintank.io/dashboard/snapshot/V6cQW9mzHHfnDJHKgNKbyyCfzp4nZSBy When the first reaches the alert level, everything goes smooth, but when the second reaches the alert level, it doesn't flip the alert state, and thus doesn't trigger anything. It'd be useful though :) An idea how to go about this: a per-alert "notification id", which can use variables to distinguish between the desired alerts. In my graph above, I'd set the notification id to |
@lgierth see also #6553 (comment) for how I'd like to see this sort of thing work. |
Would absolutely love this |
I was thinking that if we support multiple alerts per graph, we would have a solution for this issue. |
Updated the description. Feedback on how to implement this would be appreciated. |
I would like to share the way I got to work around this.
|
Hi pdu, do you have time to share some of your script or any generic script to add or delete a dashboard? I've contemplated your approach, but would appreciate any head-start you're able to share. Best, Greg. |
+1 |
+1 |
@gregorsini sorry I just saw the message, please refer to https://github.com/pdu/grafana_ec2_monitoring |
Hi, just checking whether there is any chance of getting this feature in the near future. Because template variables cannot be used in alerting, I have created a separate dashboard for alerting only, where I don't use any template variables but manage multiple series within a single graph by tagging. So this feature would be very useful in my scenario. Thanks, |
+1 |
@bergquist Is there any progress in this? Best regards, |
@bergquist : Kindly update on the status of this feature. |
I do not understand how this functionality is still not available 3 years later. Not being able to have alarms for each series is a problem. If the CPU alarm goes off, it will not trigger again while it is still active, even if another computer develops a CPU problem of its own. It is not reasonable. |
I am another one that does not understand why you cannot have alarms for each series. |
And please stop considering workarounds like repeated notifications. There are already too many notifications; we cannot add even more as a workaround for this missing feature. |
+1 for this |
+1 |
+1 |
basically this feature will bring Prometheus (or cortex) alert engine functionality to grafana. |
+1 |
Grafana Cortex alerting UI +1 |
we so need this |
Any chance someone from Grafana could update on the state of this? |
+1... |
bump! we also could really use this :) |
any further update ? |
+1 from me too. We are currently getting updates on alerts by using "Send reminders" in the notification channel settings, but this is obviously not ideal. This problem combined with #16662 really cripples alerts for us. |
Is there going to be an update on this? |
My use case for this is we have multiple microservices running in Kubernetes and we calculate the replica utilisation for each service using a prometheus query. We get an alert if the replica utilisation is not 100% for more than 5 minutes (ie we get an alert when a microservice in our team's Kubernetes namespace dies and can't restart). Currently, we only get a single alert when the first microservice dies and then get an OK alert when all services are back up. But in between a second, third or fourth microservice might have died and we don't get an additional alert for these. Ideally we would like a fail/OK alert for each microservice (ie a fail/OK for each series in the chart). I could create a separate panel/chart/alert for each series but we are likely to have 20/50/100 microservices in future and this would mean copy/pasting Grafana JSON every time I add a new microservice. Currently, using a single panel/chart/alert, the list of applications is dynamically calculated via a prometheus query and we don't need to make any changes to Grafana config when we add more microservices to our Kubernetes namespace. |
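The behaviour requested above (one fail/OK notification per series instead of one per alert rule) could be sketched roughly as follows. This is only an illustration, not Grafana code: the function names, the service names `cart` and `search`, and the "below 100% means alerting" rule are assumptions made for the example, and the "for more than 5 minutes" condition is left out for brevity.

```python
from typing import Dict, List, Tuple


def evaluate(utilisation: Dict[str, float], threshold: float = 100.0) -> Dict[str, str]:
    """Map each series to "alerting" or "ok" (hypothetical rule: below threshold alerts)."""
    return {
        name: ("alerting" if value < threshold else "ok")
        for name, value in utilisation.items()
    }


def diff_series_states(previous: Dict[str, str], current: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (series, new_state) for every series whose state changed."""
    changes = []
    for series in sorted(current):
        new_state = current[series]
        # Assumption: a series seen for the first time counts as previously "ok".
        if previous.get(series, "ok") != new_state:
            changes.append((series, new_state))
    return changes


# Four consecutive evaluations of a query that returns one series per service.
samples = [
    {"cart": 100.0, "search": 100.0},
    {"cart": 50.0, "search": 100.0},   # cart dies
    {"cart": 50.0, "search": 0.0},     # search dies while cart is still down
    {"cart": 100.0, "search": 0.0},    # cart recovers, search is still down
]

state: Dict[str, str] = {}
notifications: List[Tuple[str, str]] = []
for sample in samples:
    current = evaluate(sample)
    notifications.extend(diff_series_states(state, current))
    state = current

print(notifications)
```

With state tracked per series, the second failure and the first recovery each produce their own notification, which a single rule-level state would hide. Series that disappear from the query result are simply ignored in this sketch.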
Example use case: Then, there are two possibilities, in the current grafana version, as far as I know:
This functionality to support per series state change would solve this in a nice way, at least for multiple use cases of mine. I can't think of any case where alerting per series would cause too many alerts. We currently use another alerting solution for these use cases, as Grafana can't handle them in a scalable way (creating one graph per series is not scalable in several cases, for example when new series are created/deleted dynamically). |
@bergquist Can you update us on this topic? You said that you guys were working on redesigning the alert system in and writing up a design doc, but I didn't see any follow up on that. Did that doc include this in some way? |
Bumping into this. This feature would be useful for connecting Grafana alerting with other supervision tools and for keeping users updated about what's really going on in their infrastructure. I'm pretty sure this is not complicated to do, because the "Test button" already does the job: even if the alert is already in the alerting state, it displays up-to-date information.
/api/alerts: {
"id": 13,
"dashboardId": 17,
"dashboardUid": "___",
"dashboardSlug": "supervision",
"panelId": 38,
"name": "my_alert",
"state": "alerting",
"newStateDate": "2021-02-04T08:55:03Z",
"evalDate": "0001-01-01T00:00:00Z",
"evalData": {
"evalMatches": [
{
"metric": "controller-2",
"tags": {
"__name__": "up",
"address_ori": "controller-2:7",
"instance": "controller-2",
"job": "node-exporter",
"type": "controller"
},
"value": 0
}
]
},
"executionError": "",
"url": "/d/____/supervision"
},
"Test button" result: {
"firing": true,
"state": "pending",
"conditionEvals": "true = true",
"timeMs": "7.959ms",
"matches": [
{
"metric": "controller-0",
"value": 0
},
{
"metric": "controller-1",
"value": 0
},
{
"metric": "controller-2",
"value": 0
}
]
}
One (awful) workaround is to use /api/alerts/:id/pause to suspend and reactivate the alert, so that evalMatches is updated... If you don't want to add this as a "base functionality", would it be possible to have a /api/alerts/:id/recalculate endpoint that updates evalMatches when asked? Edit: digging a little into the code, I think I found where to make the changes: |
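To make the gap between the two payloads in the comment above concrete, here is a small sketch that compares the stored `evalMatches` with the fresh "Test button" `matches` by metric name. The helper names are hypothetical and this is not how Grafana does it internally; it only shows that the information needed to detect newly alerting series is already present in the two responses.

```python
from typing import Dict, List, Set


def metric_names(matches: List[dict]) -> Set[str]:
    """Collect the "metric" field of every match entry."""
    return {match["metric"] for match in matches}


def match_changes(stored: List[dict], fresh: List[dict]) -> Dict[str, List[str]]:
    """Report which series started or stopped matching (hypothetical helper)."""
    old, new = metric_names(stored), metric_names(fresh)
    return {"added": sorted(new - old), "resolved": sorted(old - new)}


# Values copied from the comment: /api/alerts still lists only controller-2 ...
stored_eval_matches = [{"metric": "controller-2", "value": 0}]
# ... while the "Test button" result reports all three controllers.
fresh_matches = [
    {"metric": "controller-0", "value": 0},
    {"metric": "controller-1", "value": 0},
    {"metric": "controller-2", "value": 0},
]

changes = match_changes(stored_eval_matches, fresh_matches)
print(changes)
```

A non-empty `added` list is exactly the situation that currently goes unnoticed while the alert stays in the alerting state.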
The new beta version of alerting in Grafana 8 (opt-in with "ngalert" feature toggle) supports "multi-dimensional" alerting based on labels and often in combination with Server Side Expressions. So one can have multiple alert instances from a single rule. Each instance (based on the set of labels) has its own state. For example: Would create alert (instances) per device,instance,job: The exception is the "classic condition" operation within SSE, which is not per series and behaves like the pre-8 dashboard alerting conditions. Demos etc regarding the new alerting V8 will be in the Grafacon session (online streaming) on June 16, 2021: https://grafana.com/go/grafanaconline/2021/alerting/ |
I don't really understand how a completely new feature closes this issue. It is currently in alpha and does not really use the same alerting mechanism that we are all used to (it separates alerts from graphs after creation). It is also strange that this feature closes issue #11849, where the alerts API is not updated after the "alerting" state. |
Currently, we don't update the alert message if the alert query returns new/other series than the first one.
Let's say that one alert executes and has an eval match on "serie 1".
Next time the alert executes "serie 1" is fine but "serie 2" is alerting.
This will go unnoticed in our current implementation.
When an alert is triggered we should create a key based on the alerting series. The next time the same alert is triggered we can compare the state and the key created by the series.
If the data source supports dimensions, we should include those when creating the key.
One suggestion for how to implement such key creation:
ex
Should become
serie 1;serie 2
Should become
web-front-01{server=web-front-01,application=web};web-front-02{server=web-front-02,application=web}
Keys might seem long but I don't think it matters that much. Perhaps we could find a way of shortening them by just storing a hash of it or something like that.
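The key creation described above could look roughly like this. It is a minimal sketch, not the actual Grafana implementation: the function names are hypothetical, and sorting both the series and their tags is an assumption added here so that the key stays stable when the data source returns them in a different order (which means the tags come out in a different order than in the example above). SHA-256 is likewise just one possible choice for the "store a hash" idea.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

# A series is a name plus optional dimensions (tags).
Series = Tuple[str, Optional[Dict[str, str]]]


def series_key(name: str, tags: Optional[Dict[str, str]] = None) -> str:
    """Build the key part for one series, e.g. name{k=v,k=v}."""
    if not tags:
        return name
    # Assumption: tags are sorted so the key does not depend on tag order.
    rendered = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{name}{{{rendered}}}"


def alert_key(series: List[Series]) -> str:
    """Join the per-series keys with ';' (sorted, so result order is irrelevant)."""
    return ";".join(sorted(series_key(name, tags) for name, tags in series))


def hashed_key(key: str) -> str:
    """Shorten a long key by storing only its SHA-256 hex digest."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


plain = alert_key([("serie 1", None), ("serie 2", None)])
tagged = alert_key([
    ("web-front-02", {"server": "web-front-02", "application": "web"}),
    ("web-front-01", {"server": "web-front-01", "application": "web"}),
])
print(plain)
print(tagged)
print(hashed_key(tagged))
```

Comparing the key (or its hash) from the previous evaluation with the current one is then enough to tell that the set of alerting series has changed, even though the alert state itself is still "alerting".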