Alerting: Support per series state change tracking for queries that return multiple series #6041

Closed
bergquist opened this issue Sep 14, 2016 · 48 comments
Labels
area/alerting/evaluation Issues when evaluating alerts area/alerting Grafana Alerting type/feature-request

Comments

@bergquist
Contributor

bergquist commented Sep 14, 2016

Currently, we don't update the alert message if the alert query returns new or different series than the first one.

Let's say that an alert executes and has an eval match on "serie 1".
The next time the alert executes, "serie 1" is fine but "serie 2" is alerting.
This will go unnoticed in our current implementation.

When an alert is triggered we should create a key based on the alerting series. The next time the same alert is triggered we can compare the state and the key created by the series.

If the data source supports dimensions, we should include those when creating the key.

One suggestion for how to implement such key creation, for example:

{ "serie 1", datapoints: ... },
{ "serie 2", datapoints: ... }

Should become serie 1;serie 2

{ "web-front-01", tags: {"server": "web-front-01", "application": "web"}, datapoints: ... },
{ "web-front-02", tags: {"server": "web-front-02", "application": "web"}, datapoints: ... }

Should become web-front-01{server=web-front-01,application=web};web-front-02{server=web-front-02,application=web}

Keys might seem long but I don't think it matters that much. Perhaps we could find a way of shortening them by just storing a hash of it or something like that.
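
A minimal sketch of the key creation described above, in Python rather than Grafana's own code. The `name` and `tags` field names and the `series_key` and `short_key` helpers are assumptions made for illustration. Sorting the per-series parts and using SHA-1 for the shortened form are also choices of this sketch, not something the proposal specifies.

```python
import hashlib


def series_key(series):
    """Build one key from a list of alerting series.

    Each series is a dict with a "name" and optional "tags" (both field
    names are assumptions made for this sketch).
    """
    parts = []
    for s in series:
        tags = s.get("tags") or {}
        if tags:
            # Keep the tag order given by the data source.
            tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
            parts.append(f"{s['name']}{{{tag_str}}}")
        else:
            parts.append(s["name"])
    # Sort so the key does not depend on the order the series are returned in.
    return ";".join(sorted(parts))


def short_key(series):
    """Fixed-length variant: store a hash instead of the full key."""
    return hashlib.sha1(series_key(series).encode("utf-8")).hexdigest()
```

Comparing the key from the previous evaluation with the key from the current one would then show whether the set of alerting series changed, even when the overall alert state did not.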

@bergquist bergquist added area/alerting Grafana Alerting area/alerting/evaluation Issues when evaluating alerts labels Sep 14, 2016
@torkelo torkelo changed the title Alerting: Update state and eval_data if alerting eval_match changes. Alerting: Support per series state change tracking for queries that return multiple series Dec 12, 2016
@klausenbusk
Contributor

Just a +1 for this.

Our use case for this feature is that we have ~150 boxes at different locations, and it would be really useful if we could set up an alert like "if memory usage > 75% of total".
With the current implementation that wouldn't work very well, as multiple boxes will only trigger one alert.

@pdf

pdf commented Jan 14, 2017

Implementation details will likely overlap with #6557 I believe.

@ghost

ghost commented Jan 14, 2017

I realize this is a more suitable issue for my comment in #6685 (comment)

I have another slightly different use case for this. I have multiple series in a graph: https://snapshot.raintank.io/dashboard/snapshot/V6cQW9mzHHfnDJHKgNKbyyCfzp4nZSBy

When the first reaches the alert level, everything goes smoothly, but when the second reaches the alert level, it doesn't flip the alert state, and thus doesn't trigger anything. It'd be useful though :)

An idea how to go about this: a per-alert "notification id", which can use variables to distinguish between the desired alerts. In my graph above, I'd set the notification id to {{host}}.
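
A rough sketch of how such a notification id could be rendered, assuming each series carries a dict of tags. The `{{name}}` placeholder handling and the `render_notification_id` and `group_by_notification_id` helpers are hypothetical and do not correspond to an existing Grafana feature.

```python
import re


def render_notification_id(template, tags):
    """Replace {{name}} placeholders with values from a series' tags."""
    def lookup(match):
        # Unknown variables are left untouched so misconfiguration stays visible.
        return str(tags.get(match.group(1), match.group(0)))
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lookup, template)


def group_by_notification_id(template, series):
    """Map each rendered notification id to the series names that share it."""
    groups = {}
    for s in series:
        notification_id = render_notification_id(template, s["tags"])
        groups.setdefault(notification_id, []).append(s["name"])
    return groups
```

With the template `{{host}}`, series from different hosts end up in different groups, so each host could have its own alert state and notifications.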

@pdf

pdf commented Jan 14, 2017

@lgierth see also #6553 (comment) for how I'd like to see this sort of thing work.

@yannispanousis

Would absolutely love this

@ddhirajkumar

ddhirajkumar commented Mar 14, 2017

I was thinking that if we support multiple alerts per graph, we would have a solution for this issue.

@bergquist
Contributor Author

Updated the description. Feedback on how to implement this would be appreciated.

@pdu

pdu commented Nov 29, 2017

I would like to share the way I got to work around this.

  1. Manually create a dashboard to monitor one EC2 instance. I created separate panels in the dashboard to monitor CPU, memory, disk, network rx/tx/rx_error/tx_error, etc.
  2. Export the dashboard as a JSON template file.
  3. Write a cron script to automatically create/delete dashboards:
    3.1. get the EC2 list
    3.2. get the dashboard list
    3.3. call the Grafana API to add the missing dashboards based on the JSON template file
    3.4. call the Grafana API to delete the dashboard if the EC2 instance has been terminated
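
The reconciliation in step 3 could be sketched roughly as below. This is not the script from this comment: `plan_dashboard_changes` is a hypothetical helper, the mapping from instance ID to dashboard UID is an assumption, and the HTTP calls to the EC2 and Grafana APIs are left out.

```python
def plan_dashboard_changes(instance_ids, dashboards):
    """Decide which per-instance dashboards to create and which to delete.

    instance_ids: IDs of running EC2 instances (step 3.1).
    dashboards:   mapping of instance ID -> dashboard UID (step 3.2); how a
                  dashboard is tied to an instance is an assumption here.
    """
    running = set(instance_ids)
    existing = set(dashboards)
    # Step 3.3: instances that have no dashboard yet.
    to_create = sorted(running - existing)
    # Step 3.4: dashboards whose instance is no longer running.
    to_delete = sorted(dashboards[i] for i in existing - running)
    return to_create, to_delete
```

The cron script would then create one dashboard from the JSON template for each ID in `to_create` and delete each UID in `to_delete`.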

@gregorsini

Hi pdu, do you have time to share some of your script or any generic script to add or delete a dashboard? I've contemplated your approach, but would appreciate any head-start you're able to share.

Best, Greg.

@shurshun

shurshun commented Dec 7, 2017

+1

@whidrasl

whidrasl commented Dec 8, 2017

+1

@pdu

pdu commented Dec 13, 2017

@gregorsini sorry I just saw the message, please refer to https://github.com/pdu/grafana_ec2_monitoring

@ashuw018

ashuw018 commented Feb 2, 2018

Hi, just checking whether there is any chance of getting this feature in the near future. Due to the lack of template variable support in alerting, I have created a separate dashboard for alerting only, where I am not using any template variables but am managing multiple series within a single graph by tagging. So this feature would be very useful in my scenario.

Thanks,

@karimcitoh

+1

@micw

micw commented Sep 21, 2019

@bergquist Is there any progress in this?

Best regards,
Michael.

@manhojviknesh

@bergquist : Kindly update on the status of this feature.
It would be extremely handy if we have alert triggers for every condition separately instead of alerting on single condition as a whole.

@fernandobeltranjsc

I do not understand how this functionality is not available 3 years later, it is a problem not being able to have alarms for each series. If the CPU alarm goes off, it will no longer trigger while it is still active even if another computer has another CPU problem. It is not reasonable.

@mgiammarco

I am another one that does not understand why you cannot have alarms for each series.
Over these years I have moved 7 of my customers to the Influx stack, which has this feature. You have lost 7 potential customers. Worse, when I explained to them that Grafana does not support this feature, they replied: "it is not a useful product" (the real sentence was far worse).

@mgiammarco

And please stop considering workarounds like repeated notifications. There are already too many notifications; we cannot add even more as a workaround for the missing feature.

@cqcn1991

cqcn1991 commented Apr 30, 2020

+1 for this
Really looking for this feature

@erkexzcx

erkexzcx commented Jun 4, 2020

+1

@lmondoux

+1

@vbichov

vbichov commented Jul 8, 2020

Basically, this feature would bring Prometheus (or Cortex) alert engine functionality to Grafana.
I've been waiting for this feature for more than a year now.
Any chance this will be implemented?

@czd890

czd890 commented Jul 15, 2020

+1

@ghost

ghost commented Jul 15, 2020

Grafana Cortex alerting UI +1

@johntdyer
Contributor

we so need this

@janbrunrasmussen

Any chance someone from Grafana could update on the state of this?

@wiardvanrij

+1...

@bbl232

bbl232 commented Jul 31, 2020

bump! we also could really use this :)

@leoowu

leoowu commented Sep 18, 2020

any further update ?

@knmorgan

+1 from me too.

We are currently getting updates on alerts by using "Send reminders" in the notification channel settings, but this is obviously not ideal. This problem combined with #16662 really cripples alerts for us.

@rmccarthy-ellevation

Is there going to be an update on this?

@uklance

uklance commented Jan 6, 2021

My use case for this is we have multiple microservices running in Kubernetes and we calculate the replica utilisation for each service using a prometheus query. We get an alert if the replica utilisation is not 100% for more than 5 minutes (ie we get an alert when a microservice in our team's Kubernetes namespace dies and can't restart).

Currently, we only get a single alert when the first microservice dies and then get an OK alert when all services are back up. But in between a second, third or fourth microservice might have died and we don't get an additional alert for these. Ideally we would like a fail/OK alert for each microservice (ie a fail/OK for each series in the chart).
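
The per-series fail/OK behaviour described here could look roughly like the sketch below. It is an illustration only, not how Grafana evaluates alerts: `evaluate_series`, the state names and the threshold function are all assumptions, and series that disappear between runs are simply dropped without a notification.

```python
def evaluate_series(previous, current_values, is_alerting):
    """Compare each series with its previous state and report changes.

    previous:       dict of series name -> "alerting" or "ok" from the last run.
    current_values: dict of series name -> latest value from this run.
    is_alerting:    function deciding whether a value breaches the threshold.
    Returns (new_states, notifications).
    """
    new_states = {}
    notifications = []
    for name, value in sorted(current_values.items()):
        state = "alerting" if is_alerting(value) else "ok"
        new_states[name] = state
        # A series seen for the first time is treated as previously "ok".
        if previous.get(name, "ok") != state:
            notifications.append((name, state))
    return new_states, notifications
```

Because state is kept per series, a second microservice dropping below 100% produces its own notification even while the first one is still alerting, and each service gets its own OK when it recovers.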

I could create a separate panel/chart/alert for each series but we are likely to have 20/50/100 microservices in future and this would mean copy/pasting Grafana JSON every time I add a new microservice. Currently, using a single panel/chart/alert, the list of applications is dynamically calculated via a prometheus query and we don't need to make any changes to Grafana config when we add more microservices to our Kubernetes namespace.

@danielfariati

danielfariati commented Jan 20, 2021

Example use case:
Suppose you have more than 1,000 different databases.
Suppose you have an alert for RDS burst balance (CloudWatch).
An alert is triggered because one of the databases has a low burst balance.
Even if you fix it right away, it can take a long time for the burst balance to go up again.

Then, there are two possibilities, in the current grafana version, as far as I know:

  1. With reminders off: You will not be notified if other databases have a low burst balance in the meantime, as the alert will not be triggered again;
  2. With reminders on: You will be notified several times, even if there are no new alerts, which can lead to information overload (you will stop several times to read the alerts and then discover that nothing has changed);

This functionality to support per series state change would solve this in a nice way, at least for multiple use cases of mine.
We currently use another alerting solution for those cases, but this is also not ideal, as centralizing our graphs / alerts would be much better.

I can't think of any case where alerting per series would cause too many alerts.
If you created an alert, it is because you want to know if something happened.
If you are not interested in knowing that, then it probably means that your alert is not useful or is not configured with the right parameters.
Also, if you grouped / created series for an alert, it is probably because the series matter.
Otherwise, you would create the alert without any sort of grouping.
Even then, it is nothing that an option to enable per-series support or keep the current implementation wouldn't fix.

We currently use another alerting solution for these use cases, as Grafana can't handle them in a scalable way (creating one graph per series is not scalable in several cases, for example when new series are created/deleted dynamically).
But I would love to see this functionality in grafana, so we could centralize our alerts.

@danielfariati

@bergquist Can you update us on this topic? You said that you guys were working on redesigning the alert system and writing up a design doc, but I didn't see any follow-up on that. Did that doc include this in some way?

@florian-forestier

florian-forestier commented Feb 4, 2021

Bumping into this. This feature would be useful to connect Grafana alerting with other supervision tools, and to keep users updated about what's really going on in their infrastructure.

I'm pretty sure this is not a complicated thing to do, because the "Test button" already does the job: even if the alert is already in the alerting state, it will display up-to-date information.

/api/alerts :

    {
        "id": 13,
        "dashboardId": 17,
        "dashboardUid": "___",
        "dashboardSlug": "supervision",
        "panelId": 38,
        "name": "my_alert",
        "state": "alerting",
        "newStateDate": "2021-02-04T08:55:03Z",
        "evalDate": "0001-01-01T00:00:00Z",
        "evalData": {
            "evalMatches": [
                {
                    "metric": "controller-2",
                    "tags": {
                        "__name__": "up",
                        "address_ori": "controller-2:7",
                        "instance": "controller-2",
                        "job": "node-exporter",
                        "type": "controller"
                    },
                    "value": 0
                }
            ]
        },
        "executionError": "",
        "url": "/d/____/supervision"
    }

"Test button" result :

{
  "firing": true,
  "state": "pending",
  "conditionEvals": "true = true",
  "timeMs": "7.959ms",
  "matches": [
    {
      "metric": "controller-0",
      "value": 0
    },
    {
      "metric": "controller-1",
      "value": 0
    },
    {
      "metric": "controller-2",
      "value": 0
    }
  ]
}

One (awful) workaround is to use /api/alerts/:id/pause to suspend and reactivate the alert, so that evalMatches is updated...

If you don't want to add this as a "base functionality", would it be possible to have a /api/alerts/:id/recalculate endpoint, which would update evalMatches when asked?
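
To make the gap concrete, here is a small sketch that compares the stored `evalMatches` with the matches returned by the "Test button". It only works on the two JSON shapes shown above; `diff_eval_matches` is a hypothetical helper, not part of Grafana, and the recalculate endpoint is only a proposal.

```python
def diff_eval_matches(stored_matches, fresh_matches):
    """Return (added, removed) metric names between two lists of matches."""
    stored = {m["metric"] for m in stored_matches}
    fresh = {m["metric"] for m in fresh_matches}
    return sorted(fresh - stored), sorted(stored - fresh)
```

With the data above, the stored alert only knows about `controller-2`, while the test result also contains `controller-0` and `controller-1`, so both would be reported as added.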

Edit: digging a little into the code, I think I found where to make the changes: pkg/services/alerting/result_handler.go, line 50; but I presume that line 61 will return an error (because bus.Dispatch seems to be able to return an error when oldState=newState, as described on line 67).

@kylebrandt
Contributor

kylebrandt commented Jun 8, 2021

The new beta version of alerting in Grafana 8 (opt-in with "ngalert" feature toggle) supports "multi-dimensional" alerting based on labels and often in combination with Server Side Expressions. So one can have multiple alert instances from a single rule. Each instance (based on the set of labels) has its own state.

For example:

image

This would create alert instances per device, instance, job:

image

The exception is the "classic condition" operation within SSE, which is not per series and behaves like the pre-8 dashboard alerting conditions.

Demos etc regarding the new alerting V8 will be in the Grafacon session (online streaming) on June 16, 2021: https://grafana.com/go/grafanaconline/2021/alerting/

@pere3

pere3 commented Aug 30, 2021

I don't really understand how a completely new feature (which is currently in alpha and doesn't really use the same alerting mechanism we are all used to, since it separates alerts from graphs after creation) closes this issue.

It is also strange how this feature closes issue #11849, where the alerts API is not updated after the "alerting" state.
