
[Part 2] How to setup alertmanager and send alerts ? · ashish.one #3

Closed
utterances-bot opened this issue Oct 31, 2019 · 27 comments


@utterances-bot

[Part 2] How to setup alertmanager and send alerts ? · ashish.one


https://ashish.one/blogs/setup-alertmanager/


Hi Ashish!

Thanks for sharing your knowledge! This 2nd part is very complete and the links to the example pages are so useful!

Thanks again!
defabiouy


Thanks for a precise post; it helped me configure alerts properly. However, I cannot find Part 3 of this. Would you be able to point me towards that?

Owner

Hi @akanshadureja, thanks for your words. I am working on Part 3. Until then, I can try to resolve your doubts :)


Thanks a lot for the response, Ashish :) I am able to connect the Prometheus data source with Grafana. I am trying to figure out if there is a way to connect Grafana alerts with Alertmanager to configure threshold-based alerts.

Owner

Hey Akansha, as far as I have researched this, I haven't found such an integration. In the past I was also looking for the same thing, where I could simply set my alert rules in the Grafana UI.

So there are two alerting options we have:

  1. Prometheus Alertmanager
  2. Grafana's own alerting

I'd recommend going with Prometheus Alertmanager. It provides more flexible features like grouping, batching, etc.

Keep Grafana only for data visualization.
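For reference, grouping and batching in Prometheus Alertmanager are configured on the route tree in alertmanager.yml. A minimal sketch (the receiver name, labels, and timings below are only placeholders, not values from the post):

    route:
      receiver: 'default-notifications'    # placeholder receiver defined under receivers:
      group_by: ['alertname', 'instance']  # alerts sharing these labels are batched into one notification
      group_wait: 30s                      # wait this long before the first notification for a new group
      group_interval: 5m                   # wait this long before sending updates for an existing group
      repeat_interval: 3h                  # re-send a notification for an alert that is still firing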

And I just noticed I already released Part 3, where I show how you can create your own custom exporters, but it is not about Grafana <---> Prometheus.

Owner

@akankshadureja I am live with Part 4 (Setup Grafana with Prometheus). You can check it here: https://ashish.one/blogs/setup-grafana-with-prometheus/


Hi Ashish, this really simplified the process for me. I am just starting my switch to cloud admin.

So just to be clear: we have to install all the exporters, like the JMX exporter and node exporter, plus Alertmanager, on the machine our application is running on / the machine we want to monitor,

and I have my Prometheus and Grafana running on a separate machine where I can define the alerting rules and just point them at the private IP of my application machine. Is that correct? Thank you.

Owner

Hey @Cryptopanda07, sorry for the delay in replying.

  1. Your alert file alert.rules.yml should be present on the same server where your Prometheus service is running, because you need to specify the alert rule file path in prometheus.yml. You have to specify all rules in the alert.rules.yml file only. You can check the Setup Alerts heading above.

  2. In prometheus.yml, you can specify the private IP of the machine where your Alertmanager is running. A rough sketch of both points follows below this list.
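Assuming Alertmanager listens on its default port 9093, the relevant parts of prometheus.yml would look roughly like this (the rule file path and the private IP below are placeholders):

    # on the Prometheus server
    rule_files:
      - /etc/prometheus/alert.rules.yml   # alert rules stay on the Prometheus machine

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - 10.0.0.5:9093           # private IP of the machine running Alertmanager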

Let me know if I have understood your doubts correctly.

Thanks


I might have confused you; my question was:

I have 3 Kafka brokers, 1 ZooKeeper node, and 1 admin instance (5 instances total).

On my ZooKeeper and Kafka instances I have the node exporter and the JMX exporter running, which expose metrics for my Prometheus to scrape.

My Prometheus and Grafana are running on my admin instance (a different AZ).

Should my Alertmanager be running on all instances (Kafka & ZooKeeper), OR, since Alertmanager is only used to fire alerts, should it run on the admin machine ONLY?

So Prometheus will be catching, e.g., "instance down" from the node exporter, alert the Alertmanager, and then Alertmanager fires the alert.

Owner

You do not need to set up Alertmanager on all instances (Kafka & ZooKeeper). Like Prometheus, Alertmanager is also a standalone service. In your case, you should run it on the admin instance only (though you can run it on any instance, just not on all of them). Just specify your Alertmanager's URL and port in the prometheus.yml file, as I explained in the blog.

The flow is:

Prometheus runs at a specific interval -> it pulls the metrics -> evaluates the alert rules -> if an alert condition is true, it forwards the alert to Alertmanager -> Alertmanager pushes the alert to the various channels.

So if Prometheus catches "instance down" from the node exporter, Prometheus will forward it to Alertmanager (youralertmanagerurl:9093) and then Alertmanager will broadcast the alert on the various channels.
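To check the last hop of that flow on its own, you can push a hand-made test alert straight into Alertmanager with the amtool binary that ships with it (the label values are made up; point the URL at wherever your Alertmanager is listening):

    # fire a synthetic alert named test_alert with two example labels
    amtool alert add test_alert severity=warning instance=demo:9100 \
      --alertmanager.url=http://localhost:9093

If the notification shows up on your channel, the Alertmanager side is fine and the problem is on the Prometheus rule-evaluation side.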


Thank you so much, this is what I wanted to clarify.
I am on the final step of firing my alerts right now :D


Hey Ashish, a quick one: although all services are up and running, Alertmanager is not firing any alerts to my Slack, or even showing anything when I go to localhost:9093.

Following is my rules.yml:

"/prometheus/rules.yml":
content: |
groups:
- name: AllInstances
rules:
- alert: InstanceDown
# Condition for alerting
expr: up == 0
for: 1m
# Annotation - additional informational labels to store more information
annotations:
title: 'Instance {{ $labels.instance }} down'
description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute.'
# Labels - additional labels to be attached to the alert
labels:
severity: 'critical'
owner: ec2-user
group: ec2-user
mode: '000644'

And following is my prometheus.yml:

"/prometheus/prometheus.yml":
content: !Sub
- |
global:
scrape_interval: 10s
evaluation_interval: 10s

                              rule_files:
                               - /prometheus/rules.yml

                              alerting:
                                alertmanagers:
                                - static_configs:
                                   - targets:
                                      - localhost:9093

                              scrape_configs:
                               - job_name: 'kafka'
                                 static_configs:
                                  - targets:
                                    - ${kafka_1}:8080
                                    - ${kafka_2}:8080
                                    - ${kafka_3}:8080
                               - job_name: 'kafka machine node'
                                 static_configs:
                                  - targets:
                                    - ${kafka_1}:9100
                                    - ${kafka_2}:9100
                                    - ${kafka_3}:9100

Following is my alertmanager.yml:

"/prometheus/alertmanager/alertmanager.yml":
content: |
global:
resolve_timeout: 1m
slack_api_url: 'my hook api here

                            route:
                             receiver: 'slack-notifications'

                            receivers:
                            - name: 'slack-notifications'
                              slack_configs:
                                   - channel: '#sysops-test'
                                     send_resolved: true
                        owner: ec2-user
                        group: ec2-user
                        mode: '000644'

Owner

@Cryptopanda07 Here are some resources which will help you write unit tests for your alert rules and also test your Alertmanager configuration.

prometheus/alertmanager#437

https://gist.github.com/cherti/61ec48deaaab7d288c9fcf17e700853a

https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/
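As an illustration of the unit-testing docs above, a minimal promtool test for the InstanceDown rule from earlier in this thread could look roughly like this (the file names and the demo instance label are assumptions; run it with promtool test rules alerts_test.yml):

    # alerts_test.yml -- assumes the alert rules are saved as rules.yml next to this file
    rule_files:
      - rules.yml

    evaluation_interval: 1m

    tests:
      - interval: 1m
        # simulate a target that stays down
        input_series:
          - series: 'up{job="kafka machine node", instance="demo:9100"}'
            values: '0 0 0 0'
        alert_rule_test:
          - eval_time: 3m
            alertname: InstanceDown
            exp_alerts:
              - exp_labels:
                  severity: critical
                  job: kafka machine node
                  instance: demo:9100
                exp_annotations:
                  title: 'Instance demo:9100 down'
                  description: 'demo:9100 of job kafka machine node has been down for more than 1 minute.'

The Alertmanager configuration file itself can be validated with amtool check-config /prometheus/alertmanager/alertmanager.yml.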

If you still haven't found anything, please go through the Alertmanager and Prometheus logs; you will get some leads there.


Thank you so much for your help! I'll go through everything.


Hi Ashish, I am using Prometheus Alertmanager to send email notifications. I am now able to get email notifications to the admin, but I need to send email notifications to the customers when their pod memory limit or CPU usage limit is reached. Can you please help me?

@Cryptopanda07

Hello Sridhar, I believe you need to specify the Gmail/email config in the alertmanager.yml file separately and change rules.yml to specify what you want the alert for, something like the sketch below.
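For reference, per-receiver email settings live under email_configs in alertmanager.yml. A rough sketch with placeholder addresses and SMTP host (not values from this thread):

    receivers:
      - name: 'admin-email'
        email_configs:
          - to: 'admin@example.com'            # placeholder recipient
            from: 'alertmanager@example.com'   # placeholder sender
            smarthost: 'smtp.example.com:587'  # placeholder SMTP relay
            auth_username: 'alertmanager@example.com'
            auth_password: 'app-password-here' # placeholder credential
            send_resolved: true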

Owner

@sridhar551 If you want to send alerts to your customers directly, then you would have to rewrite the alertmanager.yml file too frequently. For example, if alerts are raised for 10 servers (let's say high disk usage), then you need to send 10 alerts to 10 different users, for which you would need to rewrite the alertmanager.yml file and restart the service each time, which is not a good thing.

Alertmanager will send the alert event only to the admin (or a single user). You need to accept that event, and then you can route the alert to your users.

In your use case, I would suggest configuring a webhook with Alertmanager. Whenever an alert is generated, Alertmanager will send the payload to your HTTP endpoint. From there you can add your business logic to send the email to your users.

For webhook_config you can refer to the link below:
https://prometheus.io/docs/alerting/latest/configuration/#webhook_config
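A rough sketch of such a webhook receiver; the receiver name and endpoint URL are placeholders for your own notification service:

    route:
      receiver: 'customer-email-webhook'

    receivers:
      - name: 'customer-email-webhook'
        webhook_configs:
          - url: 'http://my-notifier.internal:8080/alerts'  # your HTTP endpoint (placeholder)
            send_resolved: true

Alertmanager will POST a JSON payload containing the firing (and resolved) alerts with their labels and annotations to that URL, and your service can decide which customer to email.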

@Cryptopanda07


I agree! It's easy to set up a webhook.

You can also route the alerts to Slack, add your customers to the Slack workspace, and give them access to a separate alerts channel there. They will be able to see the alerts.

However, sending alerts to customers makes no sense. Why would you want your customers to know your setup is on fire? :P

Owner

@Cryptopanda07 Yes, Slack is also a good option.

And there can be use cases where you need to send alerts to your customers. Let's say you are a hosting provider offering droplets/servers, like DigitalOcean, and you want to offer alerting services on hardware usage.

In that case you need to send an alert to your client.


Hi Ashish,

I have set up Alertmanager with the configuration below. But Alertmanager is matching only the first match_re entry and sending alerts to the TX team; it does not match the second match_re entry and routes those alerts to the default route, i.e. the UX team. I thought "continue: true" would do it, but it is not working. Is there any issue with the configuration? The Alertmanager version I am using is 0.21.

global:
  smtp_smarthost: 'localhost:25'
  smtp_from: '[email protected]'
  smtp_require_tls: false

route:
  group_by: ['instance', 'alert']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'UX team'
  routes:
    - match_re:
        job: ^(Windows Servers 1|ECS Windows Set 1|ECS Windows Set 2)$
      receiver: 'TX team'
      repeat_interval: '5h'
      continue: true
    - match_re:
        job: ^(windows_1|CDD Windows Servers|Win Servers)$
      receiver: 'Windows Team'
      repeat_interval: '5h'
      continue: true

receivers:


Hi Team,
It's good to hear that the new Loki release supports alert configuration through Alertmanager.
I am working with my team so that Digivalet can deploy Grafana-Loki-Promtail as a centralized logging system, but our team is facing a few challenges. I am not sure whether it's a bug or our team's fault.
My scenario is that I am running Grafana Loki on 192.168.126.167 and a Promtail client on 192.168.126.168.
1> The Promtail client is sending my HTTPD logs to Loki.
2> I have installed Alertmanager on 192.168.126.167:9093.
3> I have defined a rule file to trigger an alert whenever the incoming log rate is more than 5 lines per second.
4> When Loki invokes the rule file, it gives output as follows:

5> Feb 09 06:09:06 centos 7.linux vm images.local loki[3394]: level=info ts=2021-02-09T11:09:06.883855921Z caller=metrics.go:83 org_id=1 traceID=5a9b9e046985fa05 latency=fast query="sum(count_over_time({filename="/var/log/httpd/access_log"}[1s])) > 5" query_type=metric range_type=instant length=0s step=0s duration=28.679653ms status=200 throughput=0B total_bytes=0B
6> Here the range type is instant, and I believe that when the query type is instant it doesn't return anything.
7> Help us find a way to change the query type from instant to range.
Please find below the config files for Loki, Alertmanager, Promtail, and rules1.yaml.
######################### Promtail.yml #########################

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://192.168.126.167:3100/loki/api/v1/push
    tenant_id: 1

scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'

  - job_name: httpd
    entry_parser: raw
    static_configs:
      - targets:
          - localhost
        labels:
          job: httpd
          path: /var/log/httpd/*log
    pipeline_stages:
      - match:
          selector: '{job="httpd"}'
          stages:
            - regex:
                expression: '^(?P<remote_addr>[\w.]+) - (?P<remote_user>[^ ]*) \[(?P<time_local>.*)\] "(?P<method>[^ ]*) (?P<request>[^ ]*) (?P<protocol>[^ ]*)" (?P<status>[\d]+) (?P<body_bytes_sent>[\d]+) "(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"?'
            - labels:
                remote_addr:
                remote_user:
                time_local:
                method:
                request:
                protocol:
                status:
                body_bytes_sent:
                http_referer:
                http_user_agent:
############################### LOKI.YML ###############################

auth_enabled: true

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_transfer_retries: 0

schema_config:
  configs:
    - from: 2018-04-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

ruler:
  storage:
    type: local
    local:
      directory: /tmp/loki/rules
  rule_path: /tmp/scratch
  alertmanager_url: http://192.168.126.167:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true

storage_config:
  boltdb:
    directory: /tmp/loki/index
  filesystem:
    directory: /tmp/loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
############################ RULES1.YAML ############################

groups:
  - name: rate-alerting
    rules:
      - alert: HighLogRate
        expr: sum(count_over_time({filename="/var/log/httpd/access_log"}[1s])) > 5
        for: 1m
        labels:
          severity: warning
        annotations:
          title: "High LogRate Alert"
          description: "something is logging a lot"
############################ Alertmanager.yml ############################

global:
  resolve_timeout: 1m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: Slack-Notifications

receivers:
  - name: 'Slack-Notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/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'
        channel: '#loki-alert-test'
        send_resolved: true


AbhinJames commented Aug 16, 2021

Hey Ashish,

I have a small doubt. I am working on Alertmanager to send alerts via email, but apparently I had to include an extra pair of "{{" to finally be able to substitute values.

For example:
summary: "{{ "{{ $labels.instance}}" }}'s computer {{ "{{ $labels.instance_hostname }}" }} / {{ "{{ $labels.instance }}" }} has used {{ "{{ $value }}" }}% of space in Volume C "
I guess this might be because of the differences between the Alertmanager and Prometheus template interpreters.

I now want to apply printf "%.2f" to {{ "{{ $value }}" }} to allow only 2 digits after the decimal point. Any idea how to change the syntax?

I appreciate anyone's help.

Owner

Hi @AbhinJames,

You can try the {{ $value | printf "%.2f" }} expression; it will round the value to two decimal places.
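In a plain Prometheus rule annotation that would look like the line below (the wording is only illustrative); with the double-brace escaping you described, only the inner $value expression gains the printf pipe:

    annotations:
      summary: 'Volume C on {{ $labels.instance }} is {{ $value | printf "%.2f" }}% full'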


Hi,

While configuring this I was getting mails, but for the last few days I have not been getting Alertmanager emails.

The Alertmanager status is showing: level=error ts=2021-11-25T07:43:00.086Z caller=dispatch.go:310 component=dispatcher msg="Notify for alerts failed" num_alerts=2 err="email/email[0]: notify retry canceled after 7"
What does this error mean?

Owner

@maheshkapil Can you confirm your SMTP relay is working from the same server?
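One quick way to check the relay by hand from the Alertmanager host, assuming the smarthost is a local MTA on port 25 (the addresses below are placeholders):

    # write a minimal test message and hand it to the relay via curl's SMTP support
    printf 'Subject: alertmanager smtp test\n\nhello from the alertmanager host\n' > /tmp/msg.txt
    curl --url 'smtp://localhost:25' \
         --mail-from 'alertmanager@example.com' \
         --mail-rcpt 'you@example.com' \
         --upload-file /tmp/msg.txt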

@maheshkapil

maheshkapil commented Nov 28, 2021 via email

Owner

@maheshkapil
Can you try to find more details by setting the log level to debug for Alertmanager?
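For example, something along these lines when starting the service (the config path is a placeholder):

    alertmanager --config.file=/etc/alertmanager/alertmanager.yml --log.level=debug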

You can also refer to these similar issues:
prometheus-operator/prometheus-operator#1660
prometheus/alertmanager#1683

Repository owner locked and limited conversation to collaborators Jun 23, 2022
ashishtiwari1993 converted this issue into discussion #15 on Jun 23, 2022

