Documentation: Further improve etcdMembersDown alert #12177

ironcladlou · 2020-07-27T21:23:28Z

The default window for the etcdMembersDown network failure rate function was
recently changed to 1 minute. While this helps detect an etcd recovery more
quickly, it depends on scrape intervals of <= 15s to collect sufficient data
points for the rate function. In practice, an interval of >= 30s is more
typical, which makes the rate function less accurate.

This patch increases the window to 2m, which is a compromise between the
original value of 3m and the 1m change introduced with 2aa5684, and should
accommodate more typical scrape intervals.
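As a rough illustration of the trade-off: Prometheus's `rate()` needs at least two samples inside its range window, and more samples give a steadier estimate. The sketch below counts how many samples land in a window, assuming perfectly regular scrapes with none missed. The function name `samples_in_window` is hypothetical and is not part of the mixin.

```python
def samples_in_window(window_seconds: int, scrape_interval_seconds: int) -> int:
    """Samples that land in a range window at a steady scrape interval.

    Assumption: scrapes are perfectly regular and none are missed, so this
    is a best case. A single failed scrape lowers the count by one.
    """
    return window_seconds // scrape_interval_seconds


# Compare the 1m, 2m and 3m windows at 15s and 30s scrape intervals.
for window in (60, 120, 180):
    for interval in (15, 30):
        n = samples_in_window(window, interval)
        print(f"window={window}s scrape={interval}s -> {n} samples")
```

With a 30s scrape interval, a 1m window holds only two samples, the bare minimum for `rate()`, so one missed scrape leaves nothing to compute. A 2m window at 30s holds four, the same number a 1m window gets at 15s.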

To offset the window change and to further improve the chance that the alert
will only fire when etcd is truly dead, this patch changes the for clause from
3m to 10m. The rationale is as follows:

1. There can be significant variance in durations following a reboot before
   etcd is scraped and detected as available.
2. A conservative trigger like 10m seems less likely to produce a false alarm
   in the face of such variance.
3. In this alerting situation, if the outage is real, it seems unlikely that an
   additional 7 minutes of delay before (for example) paging somebody will make
   a significant impact on the overall response.
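The effect of the longer `for` clause can be sketched with a small simulation. This is a simplified model, not Prometheus code: it assumes the rule is evaluated every 30s, and the 5-minute reboot blip is a made-up scenario. The function name `first_fire_time` is hypothetical.

```python
def first_fire_time(condition, for_seconds, eval_interval_seconds=30):
    """Return the time in seconds at which the alert fires, or None.

    condition: list of booleans, one per rule evaluation (True means the
    alert expression matched at that evaluation).
    The alert becomes pending at the first True evaluation and fires once
    the expression has stayed True for at least `for_seconds`. Any False
    evaluation resets the pending state.
    """
    pending_since = None
    for i, active in enumerate(condition):
        now = i * eval_interval_seconds
        if not active:
            pending_since = None
            continue
        if pending_since is None:
            pending_since = now
        if now - pending_since >= for_seconds:
            return now
    return None


# Hypothetical reboot: members look down for 10 evaluations (~5 minutes).
reboot_blip = [True] * 10 + [False] * 10
# Hypothetical real outage: members stay down for 20 minutes.
real_outage = [True] * 40

print(first_fire_time(reboot_blip, for_seconds=180))  # fires with for: 3m
print(first_fire_time(reboot_blip, for_seconds=600))  # stays silent with for: 10m
print(first_fire_time(real_outage, for_seconds=600) -
      first_fire_time(real_outage, for_seconds=180))  # extra delay in seconds
```

Under these assumptions the blip pages with `for: 3m` but not with `for: 10m`, while a real outage is reported 420 seconds (7 minutes) later than before.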


ironcladlou · 2020-07-27T21:23:54Z

cc @hexfusion @retroflexer @wking

Documentation/etcd-mixin/mixin.libsonnet

Documentation/etcd-mixin/test.yaml

ironcladlou · 2020-07-31T11:59:04Z

@wking @paulfantom @hexfusion I have a couple of outstanding questions about the work I did here and would appreciate any feedback.

Documentation/etcd-mixin/test.yaml


hexfusion · 2020-07-31T19:06:46Z

Thank you @ironcladlou @wking LGTM

Includes etcd-io/etcd#12177.

ironcladlou mentioned this pull request Jul 27, 2020

*: Bump mixins for: Tweak etcdMembersDown to reduce false negatives openshift/cluster-monitoring-operator#853

Closed

2 tasks

wking reviewed Jul 27, 2020

Documentation/etcd-mixin/mixin.libsonnet (outdated, resolved)

wking reviewed Jul 29, 2020

Documentation/etcd-mixin/mixin.libsonnet (outdated, resolved)

ironcladlou force-pushed the etcdmembersdown-tweak branch from 298862c to a1af762 Compare July 30, 2020 14:03

ironcladlou commented Jul 30, 2020

Documentation/etcd-mixin/test.yaml (outdated, resolved)

wking reviewed Jul 31, 2020

Documentation/etcd-mixin/test.yaml (outdated, resolved)

ironcladlou force-pushed the etcdmembersdown-tweak branch from a1af762 to cd3df73 Compare July 31, 2020 13:27

hexfusion merged commit 1af6d61 into etcd-io:master Jul 31, 2020

ironcladlou added a commit to ironcladlou/cluster-monitoring-operator that referenced this pull request Jul 31, 2020

*: Bump mixins for: Tweak etcdMembersDown to reduce false negatives

17111bf

Includes etcd-io/etcd#12177.
