
Error on notify: context deadline exceeded #282

Closed
kavyakul opened this issue Mar 28, 2016 · 15 comments

Comments

@kavyakul

If I group my alerts in such a way that 4 alerts fall into one group, I get the following error:

ERRO[0300] Error on notify: context deadline exceeded source=notify.go:152
ERRO[0300] Notify for 1 alerts failed: context deadline exceeded source=dispatch.go:238

But if I modify the grouping so that each group has only one alert, then things move smoothly.

Please help me correct where I am going wrong.
alertmanager.conf:

  group_by: ['severity', 'alertname', 'instance']
  group_wait: 1m
  group_interval: 5m
  repeat_interval: 30m
  receiver: alerta
  routes:
  - match_re:
      service: container
    receiver: mail
  - match:
      service: jira_chk
    receiver: JIRA

@fabxc
Contributor

fabxc commented Mar 28, 2016

That seems highly unlikely to be related to the grouping. context deadline exceeded simply means the request timed out.
The error message shows that only one of the four alerts failed. So, contrary to your expectation, the 4 alerts are not in the same group.

The most likely thing here is that notification requests to one of your receivers are timing out. Your configuration change then causes no requests to be sent to that receiver in the first place.

For further debugging we need the full configuration (before AND after) and the label sets of the four alerts.

@kavyakul
Author

The difference between the old and new configuration is the "group_by" and "repeat_interval" values.
Old alertmanager.conf:

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'mailhostX:25'
  smtp_from: 'XYZ@ABC'

route:
  group_by: ['severity', 'alertname']
  group_wait: 1m
  group_interval: 5m
  repeat_interval: 333h
  receiver: alerta
  routes:
  - match_re:
      service: container
    receiver: mail
  - match:
      service: jira_chk
    receiver: JIRA

inhibit_rules:
- source_match:
    severity: 'low'
  target_match:
    severity: 'warning'
  # Apply inhibition if the alertname is the same.
  equal: ['alertname', 'cluster', 'service']

receivers:
- name: 'alerta'
  webhook_configs:
  - url: 'http://<IP>:8995/api/webhooks/prometheus'
    send_resolved: true

- name: 'mail'
  email_configs:
  - to: 'kk@ABC'
    send_resolved: true

- name: 'JIRA'
  webhook_configs:
  - url: 'http://<IP>:8955/issues/projects/LP'
    send_resolved: true

New alertmanager.conf:

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'mailhostX:25'
  smtp_from: 'XYZ@ABC'

route:
  group_by: ['severity', 'alertname', 'instance']
  group_wait: 1m
  group_interval: 5m
  repeat_interval: 30m
  receiver: alerta
  routes:
  - match_re:
      service: container
    receiver: mail
  - match:
      service: jira_chk
    receiver: JIRA

inhibit_rules:
- source_match:
    severity: 'low'
  target_match:
    severity: 'warning'
  # Apply inhibition if the alertname is the same.
  equal: ['alertname', 'cluster', 'service']

receivers:
- name: 'alerta'
  webhook_configs:
  - url: 'http://<IP>:8995/api/webhooks/prometheus'
    send_resolved: true

- name: 'mail'
  email_configs:
  - to: 'kk@ABC'
    send_resolved: true

- name: 'JIRA'
  webhook_configs:
  - url: 'http://<IP>:8955/issues/projects/LP'
    send_resolved: true

Complete Error:
ERRO[239100] Error on notify: context deadline exceeded source=notify.go:152
ERRO[239100] Notify for 4 alerts failed: context deadline exceeded source=dispatch.go:238
WARN[239100] Notify attempt 1 failed: Post http://:8955/issues/projects/LP: dial tcp :8955: getsockopt: connection refused source=notify.go:193
ERRO[239101] Error on notify: context deadline exceeded source=notify.go:152
ERRO[239101] Notify for 1 alerts failed: context deadline exceeded source=dispatch.go:238

With the old configuration I have 5 alerts for the jira_chk receiver: 4 in one group and 1 in another.

@rohit01

rohit01 commented Aug 4, 2016

I am also facing the same issue. Does Alertmanager retry failed notifications?

@rohit01

rohit01 commented Aug 4, 2016

Here are my logs (using latest stable version - v0.3.0):

{"level":"warning","msg":"Notify attempt 2 failed: context deadline exceeded","source":"notify.go:193","time":"2016-08-04T03:02:01Z"}
{"level":"warning","msg":"Notify attempt 3 failed: context deadline exceeded","source":"notify.go:193","time":"2016-08-04T03:02:01Z"}
{"level":"error","msg":"Error on notify: context deadline exceeded","source":"notify.go:152","time":"2016-08-04T03:02:01Z"}
{"level":"error","msg":"Notify for 1 alerts failed: context deadline exceeded","source":"dispatch.go:238","time":"2016-08-04T03:02:01Z"}
{"level":"warning","msg":"Notify attempt 1 failed: Post https://events.pagerduty.com/generic/2010-04-15/create_event.json: dial tcp: lookup events.pagerduty.com on 10.14.0.2:53: read udp 10.14.22.55:44485-\u003e10.14.0.2:53: i/o timeout","source":"notify.go:193","time":"2016-08-04T03:02:02Z"}
{"level":"warning","msg":"Notify attempt 2 failed: context deadline exceeded","source":"notify.go:193","time":"2016-08-04T03:02:11Z"}
{"level":"warning","msg":"Notify attempt 3 failed: context deadline exceeded","source":"notify.go:193","time":"2016-08-04T03:02:11Z"}
{"level":"error","msg":"Error on notify: context deadline exceeded","source":"notify.go:152","time":"2016-08-04T03:02:11Z"}
{"level":"error","msg":"Notify for 1 alerts failed: context deadline exceeded","source":"dispatch.go:238","time":"2016-08-04T03:02:11Z"}
{"level":"warning","msg":"Notify attempt 1 failed: Post https://events.pagerduty.com/generic/2010-04-15/create_event.json: dial tcp: lookup events.pagerduty.com on 10.14.0.2:53: read udp 10.14.22.55:53401-\u003e10.14.0.2:53: i/o timeout","source":"notify.go:193","time":"2016-08-04T03:02:12Z"}
{"level":"warning","msg":"Notify attempt 2 failed: context deadline exceeded","source":"notify.go:193","time":"2016-08-04T03:02:21Z"}
{"level":"warning","msg":"Notify attempt 3 failed: context deadline exceeded","source":"notify.go:193","time":"2016-08-04T03:02:21Z"}
{"level":"error","msg":"Error on notify: context deadline exceeded","source":"notify.go:152","time":"2016-08-04T03:02:21Z"}
{"level":"error","msg":"Notify for 1 alerts failed: context deadline exceeded","source":"dispatch.go:238","time":"2016-08-04T03:02:21Z"}
{"level":"warning","msg":"Notify attempt 1 failed: Post https://events.pagerduty.com/generic/2010-04-15/create_event.json: dial tcp: lookup events.pagerduty.com on 10.14.0.2:53: read udp 10.14.22.55:47474-\u003e10.14.0.2:53: i/o timeout","source":"notify.go:193","time":"2016-08-04T03:02:22Z"}

@fabxc
Contributor

fabxc commented Aug 4, 2016

@rohit01 yes, Alertmanager retries failed notifications.

@fabxc
Contributor

fabxc commented Aug 4, 2016

@rohit01 could you share your configuration as well? Did it also happen after you changed something, or was it there from the beginning?

@rohit01

rohit01 commented Aug 4, 2016

It started happening on its own one day without any config change. All of a sudden, we stopped getting any alerts. We did random things like restarting the container and making minor config changes, and it started working again with intermittent failures.
We were on version 0.1.0, we tried 0.1.1, and finally moved to v0.3.0 (yesterday).

Here's the configuration:

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: '****'
  smtp_from: '****'

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['severity', 'alertname', 'instance', 'job']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 3s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 3s

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h

  # A default receiver
  receiver: 'notifications'

  # All the above attributes are inherited by all child routes and can be
  # overwritten on each.

  # The child route trees.
  routes:
  # This route performs a regular expression match on alert labels to
  # catch alerts that are related to a list of services.
    - match_re:
        severity: ^(critical|warning)$
      receiver: 'notifications'
      # Child Routes (having greater precedence)
      routes:
      - match:
          severity: critical
        receiver: 'notifications-and-pagerduty'

# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # Apply inhibition if the following matches.
    equal: ['CITY', 'SERVICE', 'instance']

  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # Apply inhibition if the following matches.
    equal: ['alertname', 'instance', 'job']

receivers:
  - name: 'notifications'
    # Send email
    email_configs:
      - to: '<email 1>'
        require_tls: false
      - to: '<email 2>'
        require_tls: false


  - name: 'notifications-and-pagerduty'
    # Trigger pagerduty
    pagerduty_configs:
      - service_key: <hidden>
    # Send email
    email_configs:
      - to: '<email 1>'
        require_tls: false
      - to: '<email 2>'
        require_tls: false

@fabxc
Contributor

fabxc commented Aug 4, 2016

This is worrying as it's two cases. But OTOH in both cases the error messages indicate a timeout when sending the notification (not anywhere in our code before or after). So I'm wondering if it's just an unlucky coincidence.

Have you verified whether you can curl the respective notification endpoint (PagerDuty in your case @rohit01) within the timeout?
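
For example, a minimal check like the following (run from the Alertmanager host) would show whether the PagerDuty endpoint from the logs above answers within a short deadline. This is only an illustration, not Alertmanager code; the 3-second timeout is an assumption, and an equivalent curl with --max-time would do the same.

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical reachability check: does the endpoint complete an HTTP
	// round trip within a short timeout? The 3s value is an assumption.
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Post(
		"https://events.pagerduty.com/generic/2010-04-15/create_event.json",
		"application/json",
		nil, // empty body; we only care whether the endpoint answers in time
	)
	if err != nil {
		fmt.Println("endpoint not reachable within timeout:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("endpoint reachable, status:", resp.Status)
}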

@rohit01

rohit01 commented Aug 4, 2016

@fabxc What is the timeout value?

I was unable to find it in the code.

@fabxc
Contributor

fabxc commented Aug 5, 2016

@rohit01 I was not quite precise earlier.

The general timeout is the grouping interval, which you can think of as the "update frequency" for a group of alerts. You have set yours very low; it should generally be on the order of 1 minute or more, as it is meant to avoid getting spammed. PagerDuty, for that matter, will not even page you again and has issues updating the data (last time I checked), so there's no value in having it that low here.
It also means your effective request timeout to PagerDuty is only 3 seconds.
The timeout is assigned here:

alertmanager/dispatch.go

Lines 314 to 316 in 7b20b77

// Give the notifications time until the next flush to
// finish before terminating them.
ctx, cancel := context.WithTimeout(ag.ctx, timeout)
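
To make that concrete, here is a simplified sketch (not Alertmanager's actual code) of what a 3-second group_interval means for a receiver that takes longer than that to respond; the slow test server below is a stand-in for PagerDuty or any webhook.

package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// A stand-in receiver that takes 5 seconds to answer (e.g. a slow webhook).
	slow := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(5 * time.Second)
	}))
	defer slow.Close()

	// As in dispatch.go above, the notification gets a context whose deadline
	// is the time until the next flush, i.e. roughly one group_interval.
	groupInterval := 3 * time.Second // group_interval: 3s from the config
	ctx, cancel := context.WithTimeout(context.Background(), groupInterval)
	defer cancel()

	req, _ := http.NewRequestWithContext(ctx, http.MethodPost, slow.URL, nil)
	if _, err := http.DefaultClient.Do(req); err != nil {
		// After ~3s this prints an error wrapping "context deadline exceeded".
		fmt.Println("notify failed:", err)
		return
	}
	fmt.Println("notify succeeded within", groupInterval)
}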

Now for your error logs on the failed retry attempts. This seems to be an issue here:
https://github.com/prometheus/alertmanager/blob/master/notify/notify.go#L191-L199

We retry notifications with backoff based on a ticker. We attempt to receive a signal from that ticker, but want to abort if the context has timed out. In the code we just do a plain select over both signals, so the choice between them is random.
If we retry with a timed-out context, the notification immediately fails. The code should check the context first, before waiting on whichever signal appears first.
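
In Go terms, the problematic pattern looks roughly like this. This is only a sketch of the shape of the problem and the suggested fix, not the actual notify.go code, and the backoff and timeout values are assumptions:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryNotify retries send with a ticker-based backoff until it succeeds or
// the context expires. Sketch only; not Alertmanager's implementation.
func retryNotify(ctx context.Context, send func(context.Context) error) error {
	tick := time.NewTicker(2 * time.Second) // backoff interval assumed
	defer tick.Stop()

	for {
		// Problem described above: if the context has already expired AND the
		// ticker has fired, a plain select picks a case at random, so we may
		// start another attempt with a dead context that fails immediately.
		select {
		case <-tick.C:
			// Suggested fix: check the context first, before sending again.
			if ctx.Err() != nil {
				return ctx.Err()
			}
			if err := send(ctx); err == nil {
				return nil
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	err := retryNotify(ctx, func(ctx context.Context) error {
		return errors.New("receiver unreachable") // stand-in for a failing POST
	})
	fmt.Println("notify gave up:", err) // e.g. "context deadline exceeded"
}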

Does that make sense? So your problem would most likely disappear if you set group_interval to a value of 3-5 minutes.

Sorry about addressing your issue super late, @kavyakul. Your error logs also simply indicate a failure to reach the notification endpoint: dial tcp :8955: getsockopt: connection refused. Have you checked whether you can manually curl it from the Alertmanager host?

fabxc added a commit that referenced this issue Aug 5, 2016
This addresses the misleading error messages reported in #212
Explanation: #282 (comment)

Fixes #282
fabxc closed this as completed in #446 on Aug 5, 2016
@rohit01

rohit01 commented Aug 5, 2016

@fabxc Thanks for the explanation. We were running with group_interval as 1s as well. Now I think we know the reason for our pain :)

@fabxc
Contributor

fabxc commented Aug 5, 2016

Yes, I suppose we could think about making changes to have a timeout larger than group_interval to allow for "immediate updates". But I believe this at best enables self-harm for users, and I cannot see any practical value.

@brian-brazil what's your opinion?

@brian-brazil
Contributor

I'd not consider any group interval under a minute to be sane. I think our current behaviour is fine.

If you want some form of automated action faster than that, then you want a dedicated realtime control system because humans can't respond that fast.

@Keshav0690

Hi,
I have a similar issue.
I am getting context deadline exceeded, and I am able to curl the endpoint from the Alertmanager host.

Here is my alert rules config:

alert.rules

groups:
- name: sample
  rules:
  - alert: prometheus_remote_storage_succeeded_samples_total_alert
    expr: sum(prometheus_remote_storage_succeeded_samples_total{instance="prometheus:9090"}) == 0
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: Instance {{ $labels.instance }} cannot store sample to remote storage
- name: sample1
  rules:
  - alert: sample_alert
    expr: testmetric == 0
    labels:
      severity: warning
      instance: i1
    annotations:
      summary: this is sample summary for {{ $labels.instance }}

alertmanager.yaml

global:
  resolve_timeout: 5m

route:
  group_by: ['instance']
  group_wait: 1m
  group_interval: 1m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:

Alert:
See Attachment

Can connect to endpoint:
sh-4.2# curl -v http://json-server.2304650011.svc:3030/EspServiceRequest/
* About to connect() to json-server.2304650011.svc port 3030 (#0)
*   Trying 192.168.81.205...
* Connected to json-server.2304650011.svc (192.168.81.205) port 3030 (#0)
> GET /EspServiceRequest/ HTTP/1.1
> User-Agent: curl/7.29.0
> Host: json-server.2304650011.svc:3030
> Accept: */*

(Screenshot attached: screen shot 2018-09-19 at 4 30 45 pm)

Error:

level=error ts=2018-09-19T23:31:48.278890325Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="context deadline exceeded"
level=error ts=2018-09-19T23:31:48.278990617Z caller=dispatch.go:280 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="context deadline exceeded"
level=error ts=2018-09-19T23:31:58.279279537Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="context deadline exceeded"
level=error ts=2018-09-19T23:31:58.279346534Z caller=dispatch.go:280 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="context deadline exceeded"

Could you please let me know of any solution for this?

Regards,
Keshav Sharma

@brummetj

brummetj commented Oct 25, 2018

I'm also having the same issue here.

My Alertmanager and Prometheus are both in a Kubernetes cluster, deployed with Helm using the stable/prometheus chart.

Here is my configuration for Alertmanager.

alertmanagerFiles:
  alertmanager.yml:
    global:
      resolve_timeout: 1h
      slack_api_url: <secret>
    
    route:
      group_by: ['instance','alertname']
      group_wait: 1s
      group_interval: 3m
      receiver: default
      repeat_interval: 30m

    receivers:
      - name: 'default'
        slack_configs:
          - channel: '@me'
            send_resolved: false
            text: 'text'

and here are the logs

level=info ts=2018-10-25T23:10:17.313219036Z caller=cluster.go:587 component=cluster msg="gossip settled; proceeding" elapsed=10.000852659s
level=error ts=2018-10-25T23:17:19.970316317Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="context deadline exceeded"
level=error ts=2018-10-25T23:17:19.970394998Z caller=dispatch.go:280 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="context deadline exceeded"
level=error ts=2018-10-25T23:17:19.987694626Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="context deadline exceeded"
level=error ts=2018-10-25T23:17:19.987740048Z caller=dispatch.go:280 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="context deadline exceeded"
