
Un-inhibiting alarms #1255

Closed

Conversation

dmitrypapka1

No description provided.

@dmitrypapka1
Author

This is a fix for #1153.

I'm not sure, though, how the changes in cmd/alertmanager/main.go ended up in my PR.

@fabxc
Contributor

fabxc commented Feb 28, 2018

Thanks for looking into this!
Could you elaborate on which condition you debugged that gets fixed by this?

We are subscribing to incoming alerts, but that does not handle alerts that are never updated and instead resolve via timeout. This is what the comment indicates:

// As alerts can also time out without an update, we never
// handle new resolved alerts but invalidate the cache on read.

So adding an unset there does not solve all cases. That's why I went with skipping over resolved alerts when we check whether a rule has source alerts:

// The cache might be stale and contain resolved alerts.
if a.Resolved() {
	continue
}

The memory gets cleaned up periodically here:

// gc clears out resolved alerts from the source cache.
func (r *InhibitRule) gc() {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	for fp, a := range r.scache {
		if a.Resolved() {
			delete(r.scache, fp)
		}
	}
}
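
For context, a loop along the following lines is what would drive that gc periodically. This is a minimal sketch, not the actual code; the interval and the runGC name are illustrative.

// Illustrative only: a ticker-driven loop that periodically
// garbage-collects resolved alerts from each rule's source cache.
func (ih *Inhibitor) runGC(ctx context.Context) {
	ticker := time.NewTicker(15 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			for _, r := range ih.rules {
				r.gc()
			}
		case <-ctx.Done():
			return
		}
	}
}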

Due to the way we skip over resolved alerts on reads, by my reasoning the source of #1153 cannot be that we are not tracking resolved alerts correctly. Instead, my thinking is that the reason must be that an alert does not get resolved.

I might be missing something crucial of course.

@dmitrypapka1
Author

Hello,

We are subscribing to incoming alerts, but that does not handle alerts that are never updated and instead resolve via timeout.

Please correct me if I am wrong. After debugging the running app, I found that alerts which are no longer "firing" are resolved, i.e. the condition a.Resolved() returns true for them.

For example:

If there is an alarm and the condition for its activation is met, it will be "firing". Once the condition is no longer met, the alert becomes inactive and a.Resolved() returns true.
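
For reference, Resolved() boils down to a check on the alert's EndsAt timestamp; roughly (a sketch of the logic in types/types.go, not a verbatim copy):

// An alert counts as resolved once its end time is set and
// lies in the past.
func (a *Alert) Resolved() bool {
	if a.EndsAt.IsZero() {
		return false
	}
	return !a.EndsAt.After(time.Now())
}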

As far as I understand, this is the moment when we need to invalidate the cache for dependent alarms: when alarm X becomes inactive, if it inhibits alarm Y, we need to "un-inhibit" Y.
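
Concretely, the change I have in mind is roughly of this shape (a sketch against the subscription loop in inhibit.go, reusing the scache and mutex from the snippets above, not the exact PR diff):

// Sketch: when a subscribed source alert arrives already resolved,
// drop it from the rule's source cache so it no longer inhibits
// its targets.
if a.Resolved() {
	r.mtx.Lock()
	delete(r.scache, a.Fingerprint())
	r.mtx.Unlock()
	continue
}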

So adding an unset there does not solve all cases.

Could you please give an example of a case (condition) that would not be solved?

Instead, my thinking is that the reason must be that an alert does not get resolved.

Whenever an alert becomes inactive after having been active, it is successfully resolved. At least I see it as resolved in the debugger.

@stuartnelson3
Contributor

@fabxc do you have time to finish reviewing this or would you like me to take it?

@dmitrypapka1
Author

dmitrypapka1 commented Apr 7, 2018

Hello guys, any update on this?

@stuartnelson3
Contributor

#1309 also tries to fix this issue, and according to the test it seems to work, but I haven't gotten a clear description of why it does.

I'll take a look at your PR and comments shortly; sorry for the long delay.

@@ -138,7 +141,7 @@ func (ih *Inhibitor) Stop() {
 	}
 }

-// Mutes returns true iff the given label set is muted.
+// Mutes returns true if the given label set is muted.
Contributor

"iff" is not a typo; it means "if and only if".

@stuartnelson3
Contributor

stuartnelson3 commented Apr 13, 2018

Could you please give an example of a case (condition) that would not be solved?

The code path that had unset() added only receives updates from incoming alerts; an alert doesn't go through this code path if it resolves via timeout (resolve_timeout in the config file).
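
In other words, a timed-out alert is resolved by the server rather than by an incoming update. On ingestion, an alert without an explicit end time gets one derived from resolve_timeout, roughly like this (a sketch of the idea, not the exact API code):

// Sketch: if the sender never supplies EndsAt, derive it from
// resolve_timeout. The alert then "resolves" once that time passes,
// without any further update arriving on the subscription channel.
if alert.EndsAt.IsZero() {
	alert.Timeout = true
	alert.EndsAt = now.Add(resolveTimeout)
}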

Looking through the code, I came to the same conclusion as @fabxc:

As part of the inhibit stage, the provided muter is checked to see if an alert's labels are muted: https://github.com/prometheus/alertmanager/blob/master/notify/notify.go#L352-L364

Checking if an alert's labels are muted relies on reading the internal cache of alerts that match user-defined inhibition rules.

Mutes():
https://github.com/prometheus/alertmanager/blob/master/inhibit/inhibit.go#L147-L150

If an alert in this internal cache is Resolved(), then it doesn't contribute to marking an alert as being muted, since it gets skipped in the loop that checks for matches: https://github.com/prometheus/alertmanager/blob/master/inhibit/inhibit.go#L222-L240
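
Putting those pieces together, Mutes() is conceptually along these lines (a condensed sketch of the inhibit.go logic, not a verbatim copy):

// Sketch: a label set is muted if some rule's target matchers match
// it and the rule still has a non-resolved source alert with matching
// "equal" labels in its cache.
func (ih *Inhibitor) Mutes(lset model.LabelSet) bool {
	for _, r := range ih.rules {
		if r.TargetMatchers.Match(lset) && r.hasEqual(lset) {
			return true
		}
	}
	return false
}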

Please correct me if I am wrong. After debugging the running app, I found that alerts which are no longer "firing" are resolved, i.e. the condition a.Resolved() returns true for them.

Have you run a debugger in the Inhibit stage of the pipeline? This line here is where alerts are being incorrectly filtered out: https://github.com/prometheus/alertmanager/blob/master/notify/notify.go#L357
From reading the code, it looks like the only way an alert would be filtered is if r.hasEqual(lset) returns true, which shouldn't happen for a Resolved() alert.
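
That filter is the mute stage of the notification pipeline, which is essentially (a condensed sketch of notify.go):

// Sketch: drop every alert whose labels the muter reports as muted;
// only the remaining alerts continue down the pipeline.
func (n *MuteStage) Exec(ctx context.Context, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	var filtered []*types.Alert
	for _, a := range alerts {
		if !n.muter.Mutes(a.Labels) {
			filtered = append(filtered, a)
		}
	}
	return ctx, filtered, nil
}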

Hey ...

https://github.com/prometheus/alertmanager/blob/master/inhibit/inhibit.go#L84-L88

An alert comes in; if it's resolved, we skip updating the internal cache with that alert. So even though it's the "same alert", with the same fingerprint, maybe it's not being merged correctly in provider/mem/mem.go:
https://github.com/prometheus/alertmanager/blob/master/provider/mem/mem.go#L180-L183
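
To illustrate the concern (a hypothetical sketch, not the actual mem.go code): when a stored alert and an incoming resolved alert share a fingerprint, the merge has to carry the newer EndsAt forward, otherwise the cached copy never looks resolved.

// Hypothetical: if merging kept the old EndsAt, the cached alert
// would stay "firing" even after a resolving update arrived.
func merge(stored, incoming *types.Alert) *types.Alert {
	res := *stored
	if incoming.UpdatedAt.After(stored.UpdatedAt) {
		res.EndsAt = incoming.EndsAt // the newer, resolving update wins
		res.UpdatedAt = incoming.UpdatedAt
	}
	return &res
}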

This could explain why, in the other PR, removing the continue when a.Resolved() == true fixes the issue: it forces the internal cache to be updated with the resolved version of the alert:

https://github.com/prometheus/alertmanager/pull/1309/files#diff-cd1a9b949c420e5761d4a5e5db8fd215L84

EDIT:
adding @simonpasquier

@stuartnelson3
Contributor

Fixed in #1331.
