Documentation: Further improve etcdMembersDown alert #12177

Merged
merged 1 commit into from
Jul 31, 2020

Conversation

ironcladlou
Contributor

The default window for the etcdMembersDown network failure rate function was
recently changed to 1 minute. While this helps detect an etcd recovery more
quickly, it depends on scrape intervals of <= 15s to collect sufficient data
points for the rate function. In practice, an interval of >= 30s is more
typical, which makes the rate function less accurate.

This patch increases the window to 2m, which is a compromise between the
original value of 3m and the 1m change introduced with 2aa5684, and should
accommodate more typical scrape intervals.
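
As a rough illustration of the window/scrape-interval tradeoff, the sketch below counts how many evenly spaced samples are guaranteed to land in a rate window. The helper name `min_samples_in_window` is hypothetical and not part of etcd or Prometheus, and the model assumes perfectly regular scrapes with no jitter or missed scrapes, so real clusters may see fewer samples.

```python
def min_samples_in_window(window_s: int, scrape_interval_s: int) -> int:
    """Lower bound on the number of evenly spaced samples inside a window.

    Hypothetical helper for illustration only. It assumes scrapes arrive
    exactly every scrape_interval_s seconds, which real scrapes do not.
    """
    return window_s // scrape_interval_s


if __name__ == "__main__":
    # Windows discussed in this PR: 1m (recent change), 2m (this patch), 3m (original).
    for window_s in (60, 120, 180):
        for scrape_interval_s in (15, 30):
            samples = min_samples_in_window(window_s, scrape_interval_s)
            print(f"window={window_s}s scrape={scrape_interval_s}s -> at least {samples} samples")
```

Under these assumptions, a 1m window with a 30s scrape interval guarantees only 2 samples, which is the minimum `rate()` needs to return a value at all, while a 2m window with the same interval gives the same 4 samples that a 1m window gets at 15s.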

To offset the window change and to further improve the chance that the alert
will only fire when etcd is truly dead, this patch changes the `for` clause from
3m to 10m. The rationale is as follows:

  1. There can be significant variance in durations following a reboot before etcd
    is scraped and detected as available.

  2. A conservative trigger like 10m seems less likely to produce a false alarm in
    the face of such variance.

  3. In this alerting situation, if the outage is real, it seems unlikely that an
    additional 7 minutes of delay before (for example) paging somebody will make a
    significant impact on the overall response.
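
The effect of the longer `for` clause can be sketched with a small simulation. This is a simplified model of how an alert moves from pending to firing once its expression has stayed active for the full `for` duration; the function name `first_firing_time`, the 60s evaluation interval, and the 5-minute reboot blip are all assumptions chosen for illustration, not values taken from this PR.

```python
from typing import Optional, Sequence


def first_firing_time(
    active: Sequence[bool], eval_interval_s: int, for_s: int
) -> Optional[int]:
    """Return the time in seconds at which the alert first fires, or None.

    active[i] is whether the alert expression was true at evaluation i.
    The alert fires once the expression has been continuously true for
    at least for_s seconds; any false evaluation resets the pending state.
    """
    pending_since: Optional[int] = None
    for i, is_active in enumerate(active):
        now = i * eval_interval_s
        if not is_active:
            pending_since = None  # condition cleared, start over
            continue
        if pending_since is None:
            pending_since = now  # alert enters the pending state
        if now - pending_since >= for_s:
            return now
    return None


if __name__ == "__main__":
    # Hypothetical reboot: expression true for about 5 minutes, then recovers.
    reboot = [True] * 6 + [False] * 10
    # Hypothetical real outage: expression stays true.
    outage = [True] * 20

    for name, series in (("reboot", reboot), ("outage", outage)):
        for for_s in (180, 600):
            fired = first_firing_time(series, eval_interval_s=60, for_s=for_s)
            print(f"{name}: for={for_s}s -> fires at {fired}")
```

In this model the 5-minute reboot blip fires the alert with a 3m `for` clause but never fires with 10m, while a sustained outage fires in both cases, 7 minutes later with the 10m setting.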

Please read https://github.com/etcd-io/etcd/blob/master/CONTRIBUTING.md#contribution-flow.

@ironcladlou
Contributor Author

cc @hexfusion @retroflexer @wking

@ironcladlou ironcladlou force-pushed the etcdmembersdown-tweak branch from 298862c to a1af762 Compare July 30, 2020 14:03
@ironcladlou
Contributor Author

@wking @paulfantom @hexfusion I have a couple of outstanding questions about the work I did here and would appreciate any feedback.

@ironcladlou ironcladlou force-pushed the etcdmembersdown-tweak branch from a1af762 to cd3df73 Compare July 31, 2020 13:27
@hexfusion
Contributor

Thank you @ironcladlou @wking LGTM

@hexfusion hexfusion merged commit 1af6d61 into etcd-io:master Jul 31, 2020
ironcladlou added a commit to ironcladlou/cluster-monitoring-operator that referenced this pull request Jul 31, 2020
4 participants