
Notify for alerts failed #1683

Closed
fshabir opened this issue Dec 25, 2018 · 9 comments

fshabir commented Dec 25, 2018

Do you want to request a feature or report a bug?

Bug

What did you do?

We are running a 2-node Alertmanager cluster. It had been running fine for the past few months, until recently. We noticed that after adding a few new receivers and route configurations and reloading the configuration (kill -s SIGHUP), Alertmanager immediately emitted messages like this:

2018-12-23T10:14:50.812179+00:00 prometheus1 alertmanager: level=info ts=2018-12-23T10:14:50.811699345Z caller=main.go:322 msg="Loading configuration file" file=/opt/alertmanager/alertmanager.yml
2018-12-23T10:14:50.817745+00:00 prometheus1 alertmanager: level=error ts=2018-12-23T10:14:50.817560618Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="context canceled"
2018-12-23T10:14:50.817893+00:00 prometheus1 alertmanager: level=error ts=2018-12-23T10:14:50.817580974Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="context canceled"
2018-12-23T10:14:50.817981+00:00 prometheus1 alertmanager: level=error ts=2018-12-23T10:14:50.817600149Z caller=dispatch.go:280 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="context canceled"
2018-12-23T10:14:50.818057+00:00 prometheus1 alertmanager: level=error ts=2018-12-23T10:14:50.817597885Z caller=dispatch.go:280 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="context canceled"

What did you expect to see?

As mentioned, we have a 2-node cluster, and we did not see any such error messages on the second server after reloading its configuration. We had not seen these errors before, and the messages do not mention the alert or receiver that is failing. This happened twice, each time right after reloading Alertmanager.

I have already taken a look at #282 and #346, but I think this is different: firstly, it happened only right after reloads, and secondly, the error messages are not of much help because they do not mention which alert failed to be dispatched.

Environment

  • System information:

    Linux 4.4.110-1.el7.elrepo.x86_64 x86_64

  • Alertmanager version:

    alertmanager, version 0.15.2 (branch: HEAD, revision: d19fae3)
    build user: root@3101e5b68a55
    build date: 20180814-10:53:39
    go version: go1.10.3

  • Prometheus version:

    prometheus, version 2.4.3 (branch: HEAD, revision: 167a4b4e73a8eca8df648d2d2043e21bdb9a7449)
    build user: root@1e42b46043e9
    build date: 20181004-08:42:02
    go version: go1.11.1

  • Alertmanager configuration file:

    Standard configuration, working well otherwise.

  • Prometheus configuration file:

    Standard configuration, working well otherwise.

  • Logs:
2018-12-23T10:14:50.812179+00:00 prometheus1 alertmanager: level=info ts=2018-12-23T10:14:50.811699345Z caller=main.go:322 msg="Loading configuration file" file=/opt/alertmanager/alertmanager.yml
2018-12-23T10:14:50.817745+00:00 prometheus1 alertmanager: level=error ts=2018-12-23T10:14:50.817560618Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="context canceled"
2018-12-23T10:14:50.817893+00:00 prometheus1 alertmanager: level=error ts=2018-12-23T10:14:50.817580974Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="context canceled"
2018-12-23T10:14:50.817981+00:00 prometheus1 alertmanager: level=error ts=2018-12-23T10:14:50.817600149Z caller=dispatch.go:280 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="context canceled"
2018-12-23T10:14:50.818057+00:00 prometheus1 alertmanager: level=error ts=2018-12-23T10:14:50.817597885Z caller=dispatch.go:280 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="context canceled"
2018-12-24T06:13:27.917149+00:00 prometheus1 alertmanager: level=info ts=2018-12-24T06:13:27.916675987Z caller=main.go:322 msg="Loading configuration file" file=/opt/alertmanager/alertmanager.yml
2018-12-24T06:13:27.921093+00:00 prometheus1 alertmanager: level=error ts=2018-12-24T06:13:27.920944281Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="context canceled"
2018-12-24T06:13:27.921225+00:00 prometheus1 alertmanager: level=error ts=2018-12-24T06:13:27.92097041Z caller=dispatch.go:280 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="context canceled"
2018-12-25T01:17:26.809024+00:00 prometheus1 alertmanager: level=info ts=2018-12-25T01:17:26.808452365Z caller=main.go:322 msg="Loading configuration file" file=/opt/alertmanager/alertmanager.yml
@geekodour (Contributor)

Can you check whether the alertmanager_notifications_failed_total metric changes when this happens?

An alert would look like rate(alertmanager_notifications_failed_total[1m]) > 0; you need to scrape Alertmanager's /metrics endpoint to get this metric.
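For reference, a minimal Prometheus alerting rule built around that expression might look like the sketch below; the group name, alert name, and the 5m hold period are illustrative choices, not something from this thread.

groups:
  - name: alertmanager-meta            # hypothetical rule group name
    rules:
      - alert: AlertmanagerNotificationsFailing
        expr: rate(alertmanager_notifications_failed_total[1m]) > 0
        for: 5m                        # require the failure rate to persist before firing
        labels:
          severity: warning
        annotations:
          summary: "Alertmanager notifications are failing on {{ $labels.instance }}"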


fshabir commented Dec 25, 2018

Thank you @geekodour, we will start monitoring that metric (alertmanager_notifications_failed_total).

But the question still remains: we need to find the root cause (why this happened out of the blue on configuration reloads), and we do not know the name of the alert that didn't get dispatched. Any ideas?

tzz (Contributor) commented Dec 28, 2018

I get the same errors on reload, and alertmanager_notifications_failed_total has been 0 for the past 12 hours. It would be very helpful to have more detail in these messages.

@simonpasquier (Member)

My explanation is that when the configuration is reloaded, Alertmanager stops the notification dispatcher, which in turn cancels its context to stop the running aggregation groups. So the logs are "expected", but we need to do a better job of not logging this as an error (though real errors still need to be logged as such).
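As a rough illustration (a generic Go sketch, not Alertmanager's actual code), this is the pattern being described: stopping the dispatcher cancels a context, and any notification still in flight returns context.Canceled, which then shows up in the logs as an error.

package main

import (
	"context"
	"fmt"
	"time"
)

// notify stands in for a pipeline stage delivering to a receiver.
func notify(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Second): // pretend the receiver is slow
		return nil
	case <-ctx.Done():
		return ctx.Err() // context.Canceled -> "Error on notify" / "Notify for alerts failed"
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	done := make(chan error, 1)
	go func() { done <- notify(ctx) }()

	// A configuration reload stops the old dispatcher, which cancels its context.
	time.Sleep(100 * time.Millisecond)
	cancel()

	fmt.Println("notify result:", <-done) // prints: notify result: context canceled
}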


julienlau commented Nov 10, 2020

Maybe related to #1307?

You mention that routes changed; did you also make any changes to DNS?

@XI1062-abhisheksinghal

caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="gmail-notifications/email[0]: notify retry canceled after 2 attempts: create SMTP client: EOF"

I am getting the above error when triggering Gmail email alerts from Prometheus via Alertmanager. Any help?
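Not the poster's actual configuration, but in practice "create SMTP client: EOF" is often caused by a port/TLS mismatch on the smarthost (for example, pointing it at the implicit-TLS port 465 when the client expects STARTTLS). A typical Gmail receiver sketch, with the addresses and app password as placeholders, looks roughly like this:

receivers:
  - name: gmail-notifications
    email_configs:
      - to: you@example.com                  # placeholder address
        from: you@example.com                # placeholder address
        smarthost: smtp.gmail.com:587        # STARTTLS port
        auth_username: you@example.com
        auth_identity: you@example.com
        auth_password: <gmail-app-password>  # an app password, not the account password
        require_tls: true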

@pratnaparkhi2000

@XI1062-abhisheksinghal I got the same error and don't know how to proceed. Let me know if it gets resolved!

@XI1062-abhisheksinghal

No, it is still not resolved. Did you find a solution, @pratnaparkhi2000?

@pratnaparkhi2000

No, I couldn't find a solution, so I switched to Slack notifications instead.
