Alertmanager routing tree doesn't respect the active time interval (?) #3249

Open
bollmann opened this issue Feb 13, 2023 · 11 comments

Comments

@bollmann

bollmann commented Feb 13, 2023

Hi devs,

I'm trying to use the recently added time_intervals and active_time_intervals features to enable an alert to go to our pager only during working hours. I'm defining my time interval and corresponding route as follows:

    time_intervals:
      - name: workinghours
        time_intervals:
          - times:
              - start_time: 08:00
                end_time: 18:00
            weekdays:
              - monday
              - tuesday
              - wednesday
              - thursday
              - friday
...
    route:
      routes:
        - receiver: oncall-pager
          matchers:
            - severity="workinghours"
          active_time_intervals:
            - workinghours

Furthermore, I have a prometheus rule with label severity="workinghours" defined. With this configuration, I would expect my prometheus rule to only be active during working hours, i.e., Monday to Friday from 8am until 6pm. However, for some reason my prometheus rule also fires and moreover gets routed to our pager outside of the hours 8am until 6pm. That is, I get paged even when I shouldn't get paged according to the above-defined time interval workinghours.

Did I do something wrong here in the configuration? Or might this be a problem with the K8s prometheus-operator (or the kube-prometheus-stack helm chart) through which I'm using this new alertmanager feature?

At the moment, I'm using the following versions:

alertmanager: v0.24.0
prometheus-operator: v0.62.0
kube-prometheus-stack helm chart: 44.4.1

Originally posted by @bollmann in #2779 (comment)

@bollmann changed the title from "Active time interval is not respected by the alertmanager" to "Alertmanager routing tree doesn't respect the active time interval (?)" on Feb 13, 2023
@simonpasquier
Member

Have you checked the final configuration generated by the operator? I've tried a similar example on my local machine and it works. I'd enable --log.level=debug to see if anything pops up from the logs.
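
As a rough sketch of the debug-logging suggestion: when Alertmanager is managed by prometheus-operator, the flag is normally not passed by hand but set through the `logLevel` field of the Alertmanager custom resource. The resource name and namespace below are placeholders, not values taken from this issue.

```yaml
# Sketch only: raise the Alertmanager log level through the operator.
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main              # placeholder name (assumption)
  namespace: monitoring   # placeholder namespace (assumption)
spec:
  # The operator turns this into --log.level=debug on the Alertmanager container.
  logLevel: debug
```

With the kube-prometheus-stack chart, the equivalent setting would presumably go under the chart's Alertmanager spec values rather than into a hand-written resource.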

@cbryant42

You've defined the times for the hours you expect, but by default I believe the config uses UTC, so I would make sure the times you have are correct in UTC. The v0.25.0 release added support for time zones, which has made my life easier.
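
A minimal sketch of the time-zone suggestion, assuming an upgrade to Alertmanager v0.25.0 or later (the reporter's v0.24.0 would instead need the times converted to UTC by hand). The `location` value here is only an example zone, not one mentioned by the reporter.

```yaml
# Sketch only: the `location` field requires Alertmanager v0.25.0 or later.
time_intervals:
  - name: workinghours
    time_intervals:
      - times:
          - start_time: "08:00"
            end_time: "18:00"
        weekdays: ["monday:friday"]
        # Example zone (assumption); replace with your own IANA time zone name.
        # Without `location`, the times above are interpreted as UTC.
        location: "Europe/Berlin"
```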

@dtwilliamsWork

I'm experiencing a similar issue using the following config:

global:
  resolve_timeout: 5m

route:
  receiver: default
  group_by:
  - alertname

  routes:

  - receiver: DevOutOfHours
    matchers:
    - namespace=~".+dev|.+uat"
    active_time_intervals:
    - outofhours
    - weekends
    mute_time_intervals:
    - officehours

  - receiver: TeamA
    matchers:
    - label_team="TeamA"

  - receiver: TeamB
    matchers:
    - label_team="TeamB"

receivers:
- name: DevOutOfHours
- name: default
- name: TeamA
- name: TeamB

time_intervals:
- name: officehours
  time_intervals:
  - weekdays: ['monday:friday']
  - times:
    - start_time: "08:00"
      end_time: "20:00"
    location: Europe/London
- name: outofhours
  time_intervals:
  - weekdays: ['monday:friday']
  - times:
    - start_time: "00:00"
      end_time: "08:00"
    - start_time: "20:00"
      end_time: "23:59"
    location: Europe/London

- name: weekends
  time_intervals:
  - weekdays: [saturday, sunday]
    location: Europe/London

I've tried copying and pasting this into https://prometheus.io/webtools/alerting/routing-tree-editor/ and using { namespace="app-dev", label_team="TeamA"} as the test label set, but my alerts keep routing to DevOutOfHours instead of TeamA during office hours.

I only want the dev and uat namespace alerts to be notified during working hours, not out of hours and weekends.

@cbryant42

It looks to me like the matching is working fine. Matching is done top-down, so the DevOutOfHours route is matching first. Perhaps give the configuration doc a re-read to make sure you are familiar with the routing and configuration options. It seems like you may be confused about what active_time_intervals and mute_time_intervals do (I believe you have them inverted).

To me, it seems like you would want to set up sub-routes for the team routes. My general structure for each team's routing looks like this: I have the default route that matches only the team name, then I have sub-routes that match more specific alerts, etc. In your case, this would look like:

- receiver: TeamA
  matchers:
  - label_team="TeamA"
  routes:
  # Sub-route for dev/uat alerts, limited to office hours.
  - receiver: TeamA
    matchers:
    - namespace=~".+dev|.+uat"
    mute_time_intervals:
    - outofhours
    - weekends
    active_time_intervals:
    - officehours

This matches any alert with label_team="TeamA"; if the alert also matches namespace=~".+dev|.+uat", Alertmanager will match the sub-route and notify only during the times specified. Under this model, you can completely remove the DevOutOfHours route.

Let me know if that all makes sense, and solves your problem!

@dtwilliamsWork

thanks for the reply, I'll give it a go.

I was using the example found here https://prometheus.io/docs/alerting/latest/configuration/#example

  # All alerts with the service=inhouse-service label match this sub-route
  # the route will be active only during offhours and holidays time intervals.
  - receiver: 'on-call-pager'
    matchers:
      - service="inhouse-service"
    active_time_intervals:
      - offhours
      - holidays

I would expect my DevOutOfHours route only to be active out of hours and move on to the next route during office hours. Maybe I've understood it wrong.

@dtwilliamsWork

just saw this bit: "Additionally, the root node cannot have any active times."

let me try it with different subroutes

@dtwilliamsWork

not having much luck. Shouldn't the below just route to the default receiver? It seems like the time_intervals aren't having any impact. Running it using { namespace="app-dev", label_team="TeamA"} routes to TeamA, when I assume it should only do so during the weekends. I've tried adding an additional route within the receiver, but it didn't like it.


global:
  resolve_timeout: 5m

route:
  receiver: default
  group_by:
  - alertname

  routes:

  - receiver: TeamA
    matchers:
    - label_team="TeamA"
    active_time_intervals:
      - weekends

  - receiver: TeamB
    matchers:
    - label_team="TeamB"
    active_time_intervals:
      - weekends

receivers:
- name: default
- name: TeamA
- name: TeamB

time_intervals:
- name: officehours
  time_intervals:
  - weekdays: ['monday:friday']
  - times:
    - start_time: "08:00"
      end_time: "20:00"
    location: Europe/London
- name: outofhours
  time_intervals:
  - weekdays: ['monday:friday']
  - times:
    - start_time: "00:00"
      end_time: "08:00"
    - start_time: "20:00"
      end_time: "23:59"
    location: Europe/London

- name: weekends
  time_intervals:
  - weekdays: [saturday, sunday]
    location: Europe/London

@dtwilliamsWork

this doesn't work either. This should route to TeamC on weekends and TeamD in office hours, but it always routes to TeamC. Are my time_intervals set correctly??


global:
  resolve_timeout: 5m

route:
  receiver: default
  group_by:
  - alertname

  routes:

  - receiver: TeamA
    matchers:
    - label_team="TeamA"
    routes:
      - receiver: TeamC
        matchers:
        - namespace=~".+dev|.+uat"
        active_time_intervals:
          - weekends
      - receiver: TeamD
        matchers:
        - namespace=~".+dev|.+uat"
        active_time_intervals:
          - officehours

  - receiver: TeamB
    matchers:
    - label_team="TeamB"
    routes:
      - receiver: TeamD
        matchers:
        - namespace=~".+dev|.+uat"
        active_time_intervals:
          - weekends

receivers:
- name: default
- name: TeamA
- name: TeamB
- name: TeamC
- name: TeamD

time_intervals:
- name: officehours
  time_intervals:
  - weekdays: ['monday:friday']
  - times:
    - start_time: "08:00"
      end_time: "20:00"
    location: Europe/London
- name: outofhours
  time_intervals:
  - weekdays: ['monday:friday']
  - times:
    - start_time: "00:00"
      end_time: "08:00"
    - start_time: "20:00"
      end_time: "23:59"
    location: Europe/London

- name: weekends
  time_intervals:
  - weekdays: [saturday, sunday]
    location: Europe/London

@cbryant42

It's possible there is a formatting issue with the time_intervals. Yours do look different than my own. I believe the routes look correct.

I would check using amtool and the online routing tree editor to confirm. I find this tool extremely useful: https://www.prometheus.io/webtools/alerting/routing-tree-editor/

Here's an example of one of my own time intervals:

time_intervals:
- name: interval_1
  time_intervals:
  - times:
    - start_time: '02:45'
      end_time: '23:45'
    weekdays: ['monday:friday']
    location: 'America/Chicago'

@lordievader

I think the subtle difference between the config of @cbryant42 and @dtwilliamsWork is a '-'.

Take for example (A):

time_intervals:
- name: officehours
  time_intervals:
  - weekdays: ['monday:friday']
  - times:
    - start_time: "08:00"
      end_time: "20:00"
    location: Europe/London

And (B):

time_intervals:
- name: officehours
  time_intervals:
  - weekdays: ['monday:friday']
    times:
    - start_time: "08:00"
      end_time: "20:00"
    location: Europe/London

In (A) there are two time interval definitions.

  1. The weekdays Monday through Friday.
  2. The time ranges from 8 o'clock till 20 o'clock.

In (B) there is one time interval defined, the weekdays Monday through Friday and the time range between 8 o'clock and 20 o'clock.

I think Alertmanager combines multiple time intervals with a logical OR rather than an AND.

I had the same problem and noticed this subtle difference. I still need to verify that my hunch is correct though ;)
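
Applying form (B) to the intervals posted earlier in this thread might look like the sketch below. It assumes the OR reading above is right, which appears consistent with the configuration documentation: all fields inside one list entry must match, while separate entries are alternatives. Names and times are copied from the earlier config.

```yaml
# Sketch only: each named interval has ONE list entry, so the weekdays,
# times and location fields are combined (AND) instead of being alternatives.
time_intervals:
  - name: officehours
    time_intervals:
      - weekdays: ["monday:friday"]
        times:
          - start_time: "08:00"
            end_time: "20:00"
        location: "Europe/London"
  - name: outofhours
    time_intervals:
      - weekdays: ["monday:friday"]
        # Several ranges inside one `times` list are alternatives within the entry.
        times:
          - start_time: "00:00"
            end_time: "08:00"
          - start_time: "20:00"
            end_time: "23:59"
        location: "Europe/London"
```

With form (A), the lone `weekdays` entry would match all day Monday through Friday, which could explain why the intervals seemed to have no effect.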

@benjy44

benjy44 commented Mar 27, 2024

I struggled with this as well. Something that was unclear to me is that the doc states something important for active_time_intervals (and similarly for mute_time_intervals):

The route will send notifications only when active, but otherwise acts normally (including ending the route-matching process if the `continue` option is not set).

I was thinking that using active_time_intervals would make Alertmanager ignore the route entirely outside the interval, but it only mutes the notifications. The route is still active at all times and will show up in the Alertmanager UI when the alert triggers.
Only the notifications are paused outside the time interval (or during the interval with mute_time_intervals).

Hope that clarifies something for some people like me.
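
To illustrate the `continue` remark in the quoted doc, here is a hedged sketch based on the DevOutOfHours config posted earlier in this thread. It is a fragment only: the receivers and time intervals are assumed to be defined as in that config.

```yaml
# Sketch only: a time-limited route that does not end route matching.
route:
  receiver: default
  routes:
    # Matches dev/uat alerts at all times, but notifies only out of hours.
    - receiver: DevOutOfHours
      matchers:
        - namespace=~".+dev|.+uat"
      active_time_intervals:
        - outofhours
        - weekends
      # Without this, matching stops here even while the route is inactive,
      # so the alert is silently dropped during office hours.
      continue: true
    # Reached by dev/uat alerts only because of `continue: true` above.
    - receiver: TeamA
      matchers:
        - label_team="TeamA"
```

One consequence of this layout is that out of hours a matching alert notifies both DevOutOfHours and TeamA, so it may still need a time restriction on the team route, depending on what is wanted.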
