
Un-inhibiting alarms #1255

Closed

Conversation

dmitrypapka1

No description provided.

@dmitrypapka1
Author

This is a fix for #1153.

I'm not sure, though, how the changes in cmd/alertmanager/main.go ended up in my PR.

@fabxc
Contributor

fabxc commented Feb 28, 2018

Thanks for looking into this!
Could you elaborate on which condition you debugged that gets fixed by this?

We are subscribing to incoming alerts, but that does not handle alerts that are never updated and instead resolve via timeout. This is what the comment indicates:

// As alerts can also time out without an update, we never
// handle new resolved alerts but invalidate the cache on read.

So adding an unset there does not solve all cases. That's why I went with skipping over resolved alerts when we check whether a rule has source alerts:

// The cache might be stale and contain resolved alerts.
if a.Resolved() {
	continue
}

The memory gets cleaned up periodically here:

// gc clears out resolved alerts from the source cache.
func (r *InhibitRule) gc() {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	for fp, a := range r.scache {
		if a.Resolved() {
			delete(r.scache, fp)
		}
	}
}
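
For context, a loop along the following lines is what would drive that gc periodically. This is a minimal sketch, not the actual code; the interval and the runGC name are illustrative.

// Illustrative only: a ticker-driven loop that periodically
// garbage-collects resolved alerts from each rule's source cache.
func (ih *Inhibitor) runGC(ctx context.Context) {
	ticker := time.NewTicker(15 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			for _, r := range ih.rules {
				r.gc()
			}
		case <-ctx.Done():
			return
		}
	}
}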

Due to the way we skip over resolved alerts on reads, by my reasoning the source of #1153 cannot be that we are not tracking resolved alerts correctly. Instead, my thinking is that the reason must be that an alert does not get resolved.

I might be missing something crucial of course.

@dmitrypapka1
Author

Hello,

We are subscribing to incoming alerts, but that does not handle alerts that are never updated and instead resolve via timeout.

Please correct me if I am wrong. After debugging the running app, I found that alerts which are no longer "firing" are resolved, i.e. the condition a.Resolved() returns true for them.

For example:

If there is an alarm and the condition for its activation is met, it will be "firing". Once the condition is no longer met, the alert becomes inactive and a.Resolved() returns true.
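
For reference, Resolved() boils down to a check on the alert's EndsAt timestamp; roughly (a sketch of the logic in types/types.go, not a verbatim copy):

// An alert counts as resolved once its end time is set and
// lies in the past.
func (a *Alert) Resolved() bool {
	if a.EndsAt.IsZero() {
		return false
	}
	return !a.EndsAt.After(time.Now())
}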

As far as I understand, this is the moment when we need to invalidate the cache for dependent alarms: when alarm X becomes inactive, if it inhibits alarm Y, we need to "un-inhibit" Y.
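
Concretely, the change I have in mind is roughly of this shape (a sketch against the subscription loop in inhibit.go, reusing the scache and mutex from the snippets above, not the exact PR diff):

// Sketch: when a subscribed source alert arrives already resolved,
// drop it from the rule's source cache so it no longer inhibits
// its targets.
if a.Resolved() {
	r.mtx.Lock()
	delete(r.scache, a.Fingerprint())
	r.mtx.Unlock()
	continue
}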

So adding an unset there does not solve all cases.

Could you please give an example of a case (condition) that would not be solved?

Instead, my thinking is that the reason must be that an alert does not get resolved.

Whenever an alert becomes inactive after having been active, it is successfully resolved. At least I see it as resolved in the debugger.

@stuartnelson3
Contributor

@fabxc do you have time to finish reviewing this or would you like me to take it?

@dmitrypapka1
Author

dmitrypapka1 commented Apr 7, 2018

Hello guys, any update on this?

@stuartnelson3
Contributor

#1309 also tries to fix this issue, and according to the test it seems to work, but I haven't gotten a clear description of why it does.

I'll take a look at your PR and comments shortly; sorry for the long delay.

@@ -138,7 +141,7 @@ func (ih *Inhibitor) Stop() {
 	}
 }

-// Mutes returns true iff the given label set is muted.
+// Mutes returns true if the given label set is muted.
Contributor

"iff" is not a typo; it means "if and only if".

@stuartnelson3
Contributor

stuartnelson3 commented Apr 13, 2018

Could you please give an example of a case (condition) that would not be solved?

The code path that had unset() added only receives updates from incoming alerts; an alert doesn't go through this code path if it resolves via timeout (resolve_timeout in the config file).
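
In other words, a timed-out alert is resolved by the server rather than by an incoming update. On ingestion, an alert without an explicit end time gets one derived from resolve_timeout, roughly like this (a sketch of the idea, not the exact API code):

// Sketch: if the sender never supplies EndsAt, derive it from
// resolve_timeout. The alert then "resolves" once that time passes,
// without any further update arriving on the subscription channel.
if alert.EndsAt.IsZero() {
	alert.Timeout = true
	alert.EndsAt = now.Add(resolveTimeout)
}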

Looking through the code, I came to the same conclusion as @fabxc:

As part of the inhibit stage, the provided muter is checked to see if an alert's labels are muted: https://github.com/prometheus/alertmanager/blob/master/notify/notify.go#L352-L364

Checking if an alert's labels are muted relies on reading the internal cache of alerts that match user-defined inhibition rules.

Mutes():
https://github.com/prometheus/alertmanager/blob/master/inhibit/inhibit.go#L147-L150

If an alert in this internal cache is Resolved(), then it doesn't contribute to marking an alert as being muted, since it gets skipped in the loop that checks for matches: https://github.com/prometheus/alertmanager/blob/master/inhibit/inhibit.go#L222-L240
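
Putting those pieces together, Mutes() is conceptually along these lines (a condensed sketch of the inhibit.go logic, not a verbatim copy):

// Sketch: a label set is muted if some rule's target matchers match
// it and the rule still has a non-resolved source alert with matching
// "equal" labels in its cache.
func (ih *Inhibitor) Mutes(lset model.LabelSet) bool {
	for _, r := range ih.rules {
		if r.TargetMatchers.Match(lset) && r.hasEqual(lset) {
			return true
		}
	}
	return false
}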

Please correct me if I am wrong. After debugging the running app, I found that alerts which are no longer "firing" are resolved, i.e. the condition a.Resolved() returns true for them.

Have you run a debugger in the Inhibit stage of the pipeline? This line here is where alerts are being incorrectly filtered out: https://github.com/prometheus/alertmanager/blob/master/notify/notify.go#L357
From reading the code, it looks like the only way an alert would be filtered is if r.hasEqual(lset) returns true, which shouldn't happen for a Resolved() alert.
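
That filter is the mute stage of the notification pipeline, which is essentially (a condensed sketch of notify.go):

// Sketch: drop every alert whose labels the muter reports as muted;
// only the remaining alerts continue down the pipeline.
func (n *MuteStage) Exec(ctx context.Context, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	var filtered []*types.Alert
	for _, a := range alerts {
		if !n.muter.Mutes(a.Labels) {
			filtered = append(filtered, a)
		}
	}
	return ctx, filtered, nil
}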

Hey ...

https://github.com/prometheus/alertmanager/blob/master/inhibit/inhibit.go#L84-L88

An alert comes in; if it's resolved, we skip updating the internal cache with that alert. So even though it's the "same alert", with the same fingerprint, maybe it's not being merged correctly in provider/mem/mem.go:
https://github.com/prometheus/alertmanager/blob/master/provider/mem/mem.go#L180-L183
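
To illustrate the concern (a hypothetical sketch, not the actual mem.go code): when a stored alert and an incoming resolved alert share a fingerprint, the merge has to carry the newer EndsAt forward, otherwise the cached copy never looks resolved.

// Hypothetical: if merging kept the old EndsAt, the cached alert
// would stay "firing" even after a resolving update arrived.
func merge(stored, incoming *types.Alert) *types.Alert {
	res := *stored
	if incoming.UpdatedAt.After(stored.UpdatedAt) {
		res.EndsAt = incoming.EndsAt // the newer, resolving update wins
		res.UpdatedAt = incoming.UpdatedAt
	}
	return &res
}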

This could explain why, in the other PR, removing the continue when a.Resolved() == true fixes the issue: it forces the internal cache to be updated with the resolved version of the alert:

https://github.com/prometheus/alertmanager/pull/1309/files#diff-cd1a9b949c420e5761d4a5e5db8fd215L84

EDIT:
adding @simonpasquier

@stuartnelson3
Contributor

Fixed in #1331.
