
Error on notify: context deadline exceeded #282

Closed
kavyakul opened this issue Mar 28, 2016 · 15 comments

Comments

@kavyakul

If I group my alerts in such a way that 4 alerts fall into one group, I get the following error:

ERRO[0300] Error on notify: context deadline exceeded source=notify.go:152
ERRO[0300] Notify for 1 alerts failed: context deadline exceeded source=dispatch.go:238

But if I modify the grouping so that each group has only one alert, then things move smoothly.

Please help me correct where I am going wrong.
alertmanager.conf:

  group_by: ['severity', 'alertname', 'instance']
  group_wait: 1m
  group_interval: 5m
  repeat_interval: 30m
  receiver: alerta
  routes:
  - match_re:
      service: container
    receiver: mail
  - match:
      service: jira_chk
    receiver: JIRA

@fabxc
Contributor

fabxc commented Mar 28, 2016

That seems highly unlikely to be related to the grouping. context deadline exceeded simply means the request timed out.
The error message shows that only one of the four alerts failed. So, contrary to your expectation, the 4 alerts are not in the same group.

The most likely thing here is that notification requests to one of your receivers are timing out. Your configuration change then causes no requests to be sent to that receiver in the first place.

For further debugging we need the full configuration (before AND after) and the label sets of the four alerts.

@kavyakul
Author

The difference between the old and new configuration is the "group_by" and "repeat_interval" values.
Old alertmanager.conf:

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'mailhostX:25'
  smtp_from: 'XYZ@ABC'

route:
  group_by: ['severity', 'alertname']
  group_wait: 1m
  group_interval: 5m
  repeat_interval: 333h
  receiver: alerta
  routes:
  - match_re:
      service: container
    receiver: mail
  - match:
      service: jira_chk
    receiver: JIRA

inhibit_rules:
- source_match:
    severity: 'low'
  target_match:
    severity: 'warning'
  # Apply inhibition if the alertname is the same.
  equal: ['alertname', 'cluster', 'service']

receivers:
- name: 'alerta'
  webhook_configs:
  - url: 'http://<IP>:8995/api/webhooks/prometheus'
    send_resolved: true

- name: 'mail'
  email_configs:
  - to: 'kk@ABC'
    send_resolved: true

- name: 'JIRA'
  webhook_configs:
  - url: 'http://<IP>:8955/issues/projects/LP'
    send_resolved: true

New alertmanager.conf:

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'mailhostX:25'
  smtp_from: 'XYZ@ABC'

route:
  group_by: ['severity', 'alertname', 'instance']
  group_wait: 1m
  group_interval: 5m
  repeat_interval: 30m
  receiver: alerta
  routes:
  - match_re:
      service: container
    receiver: mail
  - match:
      service: jira_chk
    receiver: JIRA

inhibit_rules:
- source_match:
    severity: 'low'
  target_match:
    severity: 'warning'
  # Apply inhibition if the alertname is the same.
  equal: ['alertname', 'cluster', 'service']

receivers:
- name: 'alerta'
  webhook_configs:
  - url: 'http://<IP>:8995/api/webhooks/prometheus'
    send_resolved: true

- name: 'mail'
  email_configs:
  - to: 'kk@ABC'
    send_resolved: true

- name: 'JIRA'
  webhook_configs:
  - url: 'http://<IP>:8955/issues/projects/LP'
    send_resolved: true

Complete Error:
ERRO[239100] Error on notify: context deadline exceeded source=notify.go:152
ERRO[239100] Notify for 4 alerts failed: context deadline exceeded source=dispatch.go:238
WARN[239100] Notify attempt 1 failed: Post http://:8955/issues/projects/LP: dial tcp :8955: getsockopt: connection refused source=notify.go:193
ERRO[239101] Error on notify: context deadline exceeded source=notify.go:152
ERRO[239101] Notify for 1 alerts failed: context deadline exceeded source=dispatch.go:238

With the old configuration I have 5 alerts for the jira_chk receiver: 4 in one group and 1 in another.

@rohit01

rohit01 commented Aug 4, 2016

I am also facing the same issue. Does Alertmanager retry failed notifications?

@rohit01

rohit01 commented Aug 4, 2016

Here are my logs (using latest stable version - v0.3.0):

{"level":"warning","msg":"Notify attempt 2 failed: context deadline exceeded","source":"notify.go:193","time":"2016-08-04T03:02:01Z"}
{"level":"warning","msg":"Notify attempt 3 failed: context deadline exceeded","source":"notify.go:193","time":"2016-08-04T03:02:01Z"}
{"level":"error","msg":"Error on notify: context deadline exceeded","source":"notify.go:152","time":"2016-08-04T03:02:01Z"}
{"level":"error","msg":"Notify for 1 alerts failed: context deadline exceeded","source":"dispatch.go:238","time":"2016-08-04T03:02:01Z"}
{"level":"warning","msg":"Notify attempt 1 failed: Post https://events.pagerduty.com/generic/2010-04-15/create_event.json: dial tcp: lookup events.pagerduty.com on 10.14.0.2:53: read udp 10.14.22.55:44485-\u003e10.14.0.2:53: i/o timeout","source":"notify.go:193","time":"2016-08-04T03:02:02Z"}
{"level":"warning","msg":"Notify attempt 2 failed: context deadline exceeded","source":"notify.go:193","time":"2016-08-04T03:02:11Z"}
{"level":"warning","msg":"Notify attempt 3 failed: context deadline exceeded","source":"notify.go:193","time":"2016-08-04T03:02:11Z"}
{"level":"error","msg":"Error on notify: context deadline exceeded","source":"notify.go:152","time":"2016-08-04T03:02:11Z"}
{"level":"error","msg":"Notify for 1 alerts failed: context deadline exceeded","source":"dispatch.go:238","time":"2016-08-04T03:02:11Z"}
{"level":"warning","msg":"Notify attempt 1 failed: Post https://events.pagerduty.com/generic/2010-04-15/create_event.json: dial tcp: lookup events.pagerduty.com on 10.14.0.2:53: read udp 10.14.22.55:53401-\u003e10.14.0.2:53: i/o timeout","source":"notify.go:193","time":"2016-08-04T03:02:12Z"}
{"level":"warning","msg":"Notify attempt 2 failed: context deadline exceeded","source":"notify.go:193","time":"2016-08-04T03:02:21Z"}
{"level":"warning","msg":"Notify attempt 3 failed: context deadline exceeded","source":"notify.go:193","time":"2016-08-04T03:02:21Z"}
{"level":"error","msg":"Error on notify: context deadline exceeded","source":"notify.go:152","time":"2016-08-04T03:02:21Z"}
{"level":"error","msg":"Notify for 1 alerts failed: context deadline exceeded","source":"dispatch.go:238","time":"2016-08-04T03:02:21Z"}
{"level":"warning","msg":"Notify attempt 1 failed: Post https://events.pagerduty.com/generic/2010-04-15/create_event.json: dial tcp: lookup events.pagerduty.com on 10.14.0.2:53: read udp 10.14.22.55:47474-\u003e10.14.0.2:53: i/o timeout","source":"notify.go:193","time":"2016-08-04T03:02:22Z"}

@fabxc
Contributor

fabxc commented Aug 4, 2016

@rohit01 yes, Alertmanager retries failed notifications.

@fabxc
Contributor

fabxc commented Aug 4, 2016

@rohit01 could you share your configuration as well? Did it also happen after you changed something, or was it there from the beginning?

@rohit01

rohit01 commented Aug 4, 2016

It started happening on its own one day without any config change. All of a sudden, we stopped getting any alerts. We did random things like restarting the container and making minor config changes, and it started working again with intermittent failures.
We were on version 0.1.0, we tried 0.1.1, and finally moved to v0.3.0 (yesterday).

Here's the configuration:

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: '****'
  smtp_from: '****'

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['severity', 'alertname', 'instance', 'job']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 3s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 3s

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h

  # A default receiver
  receiver: 'notifications'

  # All the above attributes are inherited by all child routes and can be
  # overwritten on each.

  # The child route trees.
  routes:
  # This route performs a regular expression match on alert labels to
  # catch alerts that are related to a list of services.
    - match_re:
        severity: ^(critical|warning)$
      receiver: 'notifications'
      # Child Routes (having greater precedence)
      routes:
      - match:
          severity: critical
        receiver: 'notifications-and-pagerduty'

# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # Apply inhibition if the following matches.
    equal: ['CITY', 'SERVICE', 'instance']

  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # Apply inhibition if the following matches.
    equal: ['alertname', 'instance', 'job']

receivers:
  - name: 'notifications'
    # Send email
    email_configs:
      - to: '<email 1>'
        require_tls: false
      - to: '<email 2>'
        require_tls: false


  - name: 'notifications-and-pagerduty'
    # Trigger pagerduty
    pagerduty_configs:
      - service_key: <hidden>
    # Send email
    email_configs:
      - to: '<email 1>'
        require_tls: false
      - to: '<email 2>'
        require_tls: false

@fabxc
Contributor

fabxc commented Aug 4, 2016

This is worrying as it's two cases. But OTOH in both cases the error messages indicate a timeout when sending the notification (not anywhere in our code before or after). So I'm wondering if it's just an unlucky coincidence.

Have you verified whether you can curl the respective notification endpoint (PagerDuty in your case @rohit01) within the timeout?
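
For example, a minimal check like the following (run from the Alertmanager host) would show whether the PagerDuty endpoint from the logs above answers within a short deadline. This is only an illustration, not Alertmanager code; the 3-second timeout is an assumption, and an equivalent curl with --max-time would do the same.

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical reachability check: does the endpoint complete an HTTP
	// round trip within a short timeout? The 3s value is an assumption.
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Post(
		"https://events.pagerduty.com/generic/2010-04-15/create_event.json",
		"application/json",
		nil, // empty body; we only care whether the endpoint answers in time
	)
	if err != nil {
		fmt.Println("endpoint not reachable within timeout:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("endpoint reachable, status:", resp.Status)
}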

@rohit01

rohit01 commented Aug 4, 2016

@fabxc What is the timeout value?

I was unable to find it in the code.

@fabxc
Contributor

fabxc commented Aug 5, 2016

@rohit01 I was not quite precise earlier.

The general timeout is the grouping interval, which you can think of as the "update frequency" for a group of alerts. You have set yours very low; it should generally be on the order of 1 minute or more, as it is meant to avoid getting spammed. PagerDuty, for that matter, will not even page you again and has issues updating the data (last time I checked), so there's no value in having it that low here.
It also means your effective request timeout to PagerDuty is only 3 seconds.
The timeout is assigned here:

alertmanager/dispatch.go

Lines 314 to 316 in 7b20b77

// Give the notifications time until the next flush to
// finish before terminating them.
ctx, cancel := context.WithTimeout(ag.ctx, timeout)
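
To make that concrete, here is a simplified sketch (not Alertmanager's actual code) of what a 3-second group_interval means for a receiver that takes longer than that to respond; the slow test server below is a stand-in for PagerDuty or any webhook.

package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// A stand-in receiver that takes 5 seconds to answer (e.g. a slow webhook).
	slow := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(5 * time.Second)
	}))
	defer slow.Close()

	// As in dispatch.go above, the notification gets a context whose deadline
	// is the time until the next flush, i.e. roughly one group_interval.
	groupInterval := 3 * time.Second // group_interval: 3s from the config
	ctx, cancel := context.WithTimeout(context.Background(), groupInterval)
	defer cancel()

	req, _ := http.NewRequestWithContext(ctx, http.MethodPost, slow.URL, nil)
	if _, err := http.DefaultClient.Do(req); err != nil {
		// After ~3s this prints an error wrapping "context deadline exceeded".
		fmt.Println("notify failed:", err)
		return
	}
	fmt.Println("notify succeeded within", groupInterval)
}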

Now for your error logs on the failed retry attempts. This seems to be an issue here:
https://github.com/prometheus/alertmanager/blob/master/notify/notify.go#L191-L199

We retry notifications with backoff based on a ticker. We attempt to receive a signal from that ticker, but want to abort if the context has timed out. In the code we just do a plain select over both signals, so the choice between them is random.
If we retry with a timed-out context, the notification immediately fails. The code should check the context first, before waiting on whichever signal appears first.
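
In Go terms, the problematic pattern looks roughly like this. This is only a sketch of the shape of the problem and the suggested fix, not the actual notify.go code, and the backoff and timeout values are assumptions:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryNotify retries send with a ticker-based backoff until it succeeds or
// the context expires. Sketch only; not Alertmanager's implementation.
func retryNotify(ctx context.Context, send func(context.Context) error) error {
	tick := time.NewTicker(2 * time.Second) // backoff interval assumed
	defer tick.Stop()

	for {
		// Problem described above: if the context has already expired AND the
		// ticker has fired, a plain select picks a case at random, so we may
		// start another attempt with a dead context that fails immediately.
		select {
		case <-tick.C:
			// Suggested fix: check the context first, before sending again.
			if ctx.Err() != nil {
				return ctx.Err()
			}
			if err := send(ctx); err == nil {
				return nil
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	err := retryNotify(ctx, func(ctx context.Context) error {
		return errors.New("receiver unreachable") // stand-in for a failing POST
	})
	fmt.Println("notify gave up:", err) // e.g. "context deadline exceeded"
}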

Does that make sense? So your problem would most likely disappear if you set group_interval to a value of 3-5 minutes.

Sorry about addressing your issue super late, @kavyakul. Your error logs also simply indicate a failure to reach the notification endpoint: dial tcp :8955: getsockopt: connection refused. Have you checked whether you can manually curl it from the Alertmanager host?

fabxc added a commit that referenced this issue Aug 5, 2016
This addresses the misleading error messages reported in #212
Explanation: #282 (comment)

Fixes #282
fabxc closed this as completed in #446 on Aug 5, 2016
@rohit01

rohit01 commented Aug 5, 2016

@fabxc Thanks for the explanation. We were running with group_interval as 1s as well. Now I think we know the reason for our pain :)

@fabxc
Contributor

fabxc commented Aug 5, 2016

Yes, I suppose we could think about making changes to have a timeout larger than group_interval to allow for "immediate updates". But I believe this at best enables self-harm for users, and I cannot see any practical value.

@brian-brazil what's your opinion?

@brian-brazil
Contributor

I'd not consider any group interval under a minute to be sane. I think our current behaviour is fine.

If you want some form of automated action faster than that, then you want a dedicated realtime control system because humans can't respond that fast.

@Keshav0690

Hi,
I have a similar issue.
I am getting context deadline exceeded, and I am able to curl the endpoint from the Alertmanager host.

Here is my alert rules config:

alert.rules

groups:
- name: sample
  rules:
  - alert: prometheus_remote_storage_succeeded_samples_total_alert
    expr: sum(prometheus_remote_storage_succeeded_samples_total{instance="prometheus:9090"}) == 0
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: Instance {{ $labels.instance }} cannot store sample to remote storage
- name: sample1
  rules:
  - alert: sample_alert
    expr: testmetric == 0
    labels:
      severity: warning
      instance: i1
    annotations:
      summary: this is sample summary for {{ $labels.instance }}

alertmanager.yaml

global:
  resolve_timeout: 5m

route:
  group_by: ['instance']
  group_wait: 1m
  group_interval: 1m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:

Alert:
See Attachment

Can connect to endpoint:
sh-4.2# curl -v http://json-server.2304650011.svc:3030/EspServiceRequest/
* About to connect() to json-server.2304650011.svc port 3030 (#0)
*   Trying 192.168.81.205...
* Connected to json-server.2304650011.svc (192.168.81.205) port 3030 (#0)
> GET /EspServiceRequest/ HTTP/1.1
> User-Agent: curl/7.29.0
> Host: json-server.2304650011.svc:3030
> Accept: */*

(Screenshot attached: screen shot 2018-09-19 at 4 30 45 pm)

Error:

level=error ts=2018-09-19T23:31:48.278890325Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="context deadline exceeded"
level=error ts=2018-09-19T23:31:48.278990617Z caller=dispatch.go:280 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="context deadline exceeded"
level=error ts=2018-09-19T23:31:58.279279537Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="context deadline exceeded"
level=error ts=2018-09-19T23:31:58.279346534Z caller=dispatch.go:280 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="context deadline exceeded"

Could you please let me know of any solution for this?

Regards,
Keshav Sharma

@brummetj

brummetj commented Oct 25, 2018

I'm also having the same issue here.

My Alertmanager and Prometheus are both in a Kubernetes cluster, deployed with Helm using the stable/prometheus chart.

Here is my configuration for Alertmanager.

alertmanagerFiles:
  alertmanager.yml:
    global:
      resolve_timeout: 1h
      slack_api_url: <secret>
    
    route:
      group_by: ['instance','alertname']
      group_wait: 1s
      group_interval: 3m
      receiver: default
      repeat_interval: 30m

    receivers:
      - name: 'default'
        slack_configs:
          - channel: '@me'
            send_resolved: false
            text: 'text'

and here are the logs

level=info ts=2018-10-25T23:10:17.313219036Z caller=cluster.go:587 component=cluster msg="gossip settled; proceeding" elapsed=10.000852659s
level=error ts=2018-10-25T23:17:19.970316317Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="context deadline exceeded"
level=error ts=2018-10-25T23:17:19.970394998Z caller=dispatch.go:280 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="context deadline exceeded"
level=error ts=2018-10-25T23:17:19.987694626Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="context deadline exceeded"
level=error ts=2018-10-25T23:17:19.987740048Z caller=dispatch.go:280 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="context deadline exceeded"
