Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds: Active time interval #2779

Merged
merged 49 commits into from
Mar 4, 2022
Merged

Conversation

dubyte
Copy link
Contributor

@dubyte dubyte commented Dec 2, 2021

This is a PR that implements active time interval as in: #2778

The point is to use active_time_interval to express that alertmanager only sent alerts in the interval mentioned.
so in other words active_time_interval mutes in the complement of mute_time_interval

@dubyte
Copy link
Contributor Author

dubyte commented Dec 2, 2021

Example:

  routes:
  - receiver: devs
    mute_time_intervals:
      - offhours
      - holidays
    continue: true
  - receiver: monitoring
    active_time_intervals: 
      - offhours
      - holidays

time_intervals:
- name: offhours
  time_intervals:
  # Monday to Friday 5pm - 9:00 in UTC
  - weekdays: ['monday:friday']
    times:
    - start_time: '00:00'
      end_time: '13:00'
    - start_time: '21:00'
      end_time: '24:00'
  - weekdays: ['saturday', 'sunday']
- name: holidays
  time_intervals:
  - months: ['december']
    days_of_month: ['24']
  - months: ['december']
    days_of_month: ['25:31']
  - months: ['january']
    days_of_month: ['1:2']
  - ...

@dubyte dubyte force-pushed the active_time_interval branch from 1bca2ae to 10218f0 Compare December 2, 2021 04:35
@dubyte dubyte changed the title Active time interval Adds: Active time interval Dec 2, 2021
Copy link
Member

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it makes sense.

notify/notify.go Outdated Show resolved Hide resolved
notify/notify.go Outdated Show resolved Hide resolved
dispatch/route.go Outdated Show resolved Hide resolved
config/config.go Outdated Show resolved Hide resolved
config/config.go Outdated Show resolved Hide resolved
docs/configuration.md Show resolved Hide resolved
notify/notify.go Show resolved Hide resolved
notify/notify.go Outdated Show resolved Hide resolved
notify/notify.go Outdated Show resolved Hide resolved
notify/notify.go Outdated Show resolved Hide resolved
notify/notify.go Outdated Show resolved Hide resolved
config/config.go Outdated Show resolved Hide resolved
docs/configuration.md Outdated Show resolved Hide resolved
docs/configuration.md Outdated Show resolved Hide resolved
docs/configuration.md Outdated Show resolved Hide resolved
docs/configuration.md Show resolved Hide resolved
docs/configuration.md Outdated Show resolved Hide resolved
notify/notify.go Show resolved Hide resolved
Copy link
Member

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few nits, otherwise lgtm

docs/configuration.md Outdated Show resolved Hide resolved
docs/configuration.md Outdated Show resolved Hide resolved
docs/configuration.md Outdated Show resolved Hide resolved
docs/configuration.md Outdated Show resolved Hide resolved
docs/configuration.md Outdated Show resolved Hide resolved
docs/configuration.md Outdated Show resolved Hide resolved
cmd/alertmanager/main.go Outdated Show resolved Hide resolved
cmd/alertmanager/main.go Outdated Show resolved Hide resolved
@dubyte
Copy link
Contributor Author

dubyte commented Jan 26, 2022

Thank you @simonpasquier

@dubyte dubyte force-pushed the active_time_interval branch 2 times, most recently from fd7cd46 to 8563331 Compare January 26, 2022 13:21
Copy link
Member

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me (minus 2 last nits). I'd ask @roidelapluie to give it a review to make sure that I haven't missed anything critical. Thanks!

@simonpasquier
Copy link
Member

could you also rebase on top of the main branch (it should fix the CI failure with mixins)?

dubyte and others added 14 commits January 26, 2022 18:09
Signed-off-by: Sinuhe Tellez <[email protected]>
Signed-off-by: Sinuhe Tellez <[email protected]>
Co-authored-by: Simon Pasquier <[email protected]>
Signed-off-by: Sinuhe Tellez <[email protected]>
Co-authored-by: Simon Pasquier <[email protected]>
Signed-off-by: Sinuhe Tellez <[email protected]>
Adds doc for a helper function

Co-authored-by: Simon Pasquier <[email protected]>
Signed-off-by: Sinuhe Tellez <[email protected]>
Co-authored-by: Simon Pasquier <[email protected]>
Signed-off-by: Sinuhe Tellez <[email protected]>
Co-authored-by: Simon Pasquier <[email protected]>
Signed-off-by: Sinuhe Tellez <[email protected]>
Co-authored-by: Simon Pasquier <[email protected]>
Signed-off-by: Sinuhe Tellez <[email protected]>
Co-authored-by: Simon Pasquier <[email protected]>
Signed-off-by: Sinuhe Tellez <[email protected]>
clyang82 and others added 3 commits January 26, 2022 18:09
* Support http_config for amtool

Co-authored-by: Julien Pivotto <[email protected]>
Co-authored-by: Simon Pasquier <[email protected]>
Signed-off-by: clyang82 <[email protected]>
Signed-off-by: Sinuhe Tellez <[email protected]>
Since the TopicARN field is a template string, it's safer to check for
the ".fifo" suffix in the rendered string.

Signed-off-by: Simon Pasquier <[email protected]>
Signed-off-by: Sinuhe Tellez <[email protected]>
This change also adds unit tests for SNS configuration.

Signed-off-by: Simon Pasquier <[email protected]>
Signed-off-by: Sinuhe Tellez <[email protected]>
@dubyte dubyte force-pushed the active_time_interval branch from d9435e6 to d7d335e Compare January 26, 2022 23:09
@xkilian
Copy link

xkilian commented Feb 1, 2022

An issue has been identified related to this PR, @simonpasquier @roidelapluie , during testing with @dubyte

If an active alert is muted, the only effect currently, is to mute notifications.
But the alert is still considered active, which causes duplicate alertes to be created in the alertmanager/karma UI and alertmanager API. This is due to the use of the "continue" keywork and the fact intervals are evaluated at the notification phase. There is no way to distinguish between a real alert and a muted alert.

Suggested paths to correct this:

  • Add a new possible state: active, silenced, inhibited, muted
    • change the state of the alert to muted
    • add the muted state to the @State LABEL.
    • add muted to the API filters when querying for alerts
  • MAYBE: muted and unmuted alerts that are equivalent should be treated as the same alert for repeat-interval
  • alerts should not clear/trigger at each time_interval cross.
  • alertmanager UI should permit to filter between active/silenced/inhibited/muted alerts

Or

  • add a LABEL to the alert, to indicated this alert is muted (like state=muted or muted=true); This could be a temporary fix while solution 1 is fleshed out.

Or

  • consider active_interval and mute_interval in the matching stage and not in the notification processing stage. Though this may lead to clearing+re-firing alerts with no way to deal with them. So I do not think this is the right path.

@dubyte
Copy link
Contributor Author

dubyte commented Feb 14, 2022

Hi
Agreed that there is no way to know if the notification is muted -for certain receivers- or not, so I propose the next: #2832

If all agreed I would like to move the discussion about the api change to that feature request. So we can close this PR.
The tests show that the feature works for mute and active time intervals. So we are not blocked.

Thank you

@simonpasquier
Copy link
Member

But the alert is still considered active, which causes duplicate alertes to be created in the alertmanager/karma UI and alertmanager API. This is due to the use of the "continue" keywork and the fact intervals are evaluated at the notification phase. There is no way to distinguish between a real alert and a muted alert.

Correct but the state of an alert (muted/active) is always relative to an alert group. I agree with @dubyte that it should be addressed at the API/UI level.

@simonpasquier
Copy link
Member

@benridley would you have time to review this PR by any chance (at least the UX)?

notify/notify.go Outdated Show resolved Hide resolved
notify/notify.go Outdated Show resolved Hide resolved
@benridley
Copy link
Contributor

Besides the minor typo above, this looks great. Thanks @dubyte, I think this will be a nice addition to AM.

@dubyte
Copy link
Contributor Author

dubyte commented Feb 20, 2022

Thanks @benridley

Copy link
Member

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry @dubyte I had 2 last pending comments that I forgot to submit. Hopefully they are minor and easy to fix :)

docs/configuration.md Outdated Show resolved Hide resolved
docs/configuration.md Outdated Show resolved Hide resolved
dubyte and others added 2 commits February 22, 2022 18:00
Fix typo in documentation

Co-authored-by: Simon Pasquier <[email protected]>
Signed-off-by: Sinuhe Tellez <[email protected]>
Co-authored-by: Simon Pasquier <[email protected]>
Signed-off-by: Sinuhe Tellez <[email protected]>
@dubyte dubyte force-pushed the active_time_interval branch from 0ab7210 to 17a5f5e Compare February 22, 2022 23:01
Copy link
Member

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@bollmann
Copy link

Hi,

I'm trying to use this feature and I'm defining my time intervals and corresponding routes as follows:

    time_intervals:
      - name: workinghours
        time_intervals:
          - times:
              - start_time: 08:00
                end_time: 18:00
            weekdays:
              - monday
              - tuesday
              - wednesday
              - thursday
              - friday
...
    route:
      routes:
        - receiver: oncall-pager
          matchers:
            - severity="workinghours"
          active_time_intervals:
            - workinghours

Furthermore, I have a prometheus rule with label severity="workinghours" defined. With this configuration, I would expect my prometheus rule to only be active during workinghours, i.e., Monday to Friday from 8am until 6pm. However, for some reason my prometheus rule also fires outside of the hours 8am until 6pm. That is, it seems to be active even when it should not be.

Did I do something wrong here in the configuration?

@xkilian
Copy link

xkilian commented Feb 13, 2023

Here is a working example for you.

It is a best practice to define your alertes with severity levels. We use 0-5, 5 being the most critical.
The group label defines the administrative group responsible for handling the alert.

So based on the group, severity and time of day the alert will be routed to the appropriate receiver configuration. We use distinct receiver configurations as the email template (in our case) need to be different for each case. Ex. Offhours is an email to a specific box. Critical daytime hours will open a ticket automatically with a ticketing app. And severity 3 will be handled with an email regardless of time of day.

What you were doing wrong is not setting the mute_time_intervals. Which is when to NOT send a notification (ex. off-hours), Then the active_time_interval indicates what to actually DO during offhours. This is documented, but may not be clear at first read.

    - matchers:
        - group= system-admins
      receiver: receiver-system-admins
      routes:
        - matchers:
            - severity = 3
          receiver: receiver-system-admins-email
        - matchers:
            - "severity =~ 4|5"
          receiver: receiver-system-admins-critical-daytime
          mute_time_intervals: 
            - offhours
            - holidays
          continue: true
        - matchers:
            - "severity =~ 4|5"
            - env = PR
          receiver: receiver-system-admins-critical-offhours
          active_time_intervals: 
            - offhours
            - holidays       

Hope this helps

@xkilian
Copy link

xkilian commented Feb 13, 2023

As a bonus, you can see how to do a regex OR statement for the severity (had me stumped for a while) with the new matchers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants