Alert on high duration of high IO rate
This also fixes hkgsuperset.ooni.io missing from `dom0` and `hkg`
inventory groups. See #155 and #226
darkk committed Oct 31, 2018
1 parent 030d887 commit 4a52e2d
Showing 3 changed files with 20 additions and 0 deletions.
2 changes: 2 additions & 0 deletions ansible/inventory
@@ -7,6 +7,7 @@ ath
ber
tpo
bigv
fra

[gh:children]
wdc # technically it's i95.net, Radio Free Asia network, but GH has some boxes there
@@ -48,6 +49,7 @@ ooni-explorer-next.test.ooni.io
wiki.ooni.io
labs.ooni.io
hkgjump.ooni.nu
hkgsuperset.ooni.io

[ams]
explorer.ooni.io
10 changes: 10 additions & 0 deletions ansible/inventory-check.yml
@@ -2,6 +2,16 @@
---
- import_playbook: ansible-version.yml

- hosts: all
  gather_facts: false
  tasks:
    - name: ensure that all inventory hosts are rooted to dom0
      assert:
        that:
          - groups.all | difference(groups.dom0) | length == 0
        msg: "Hosts in inventory not rooted to dom0: {{ groups.all | difference(groups.dom0) | sort | join(' ') }}"
      run_once: true

- hosts: all
  vars:
    ansible_become: false # root is not required here, also it's not `become: false` as variable declared for `all` has precedence over directive :-/
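
The `assert` task added above boils down to a set difference between the `all` and `dom0` groups. A minimal Python sketch of that logic follows; the `groups` dictionary is hypothetical sample data rather than the real inventory, and `unrooted_hosts` is an illustrative name that does not exist in the repository.

```python
def unrooted_hosts(groups):
    """Return the sorted hosts that are in groups['all'] but not in groups['dom0']."""
    return sorted(set(groups["all"]) - set(groups["dom0"]))


# Hypothetical sample data: one host was left out of the dom0 tree.
groups = {
    "all": ["explorer.ooni.io", "hkgjump.ooni.nu", "hkgsuperset.ooni.io"],
    "dom0": ["explorer.ooni.io", "hkgjump.ooni.nu"],
}

missing = unrooted_hosts(groups)
# Same message layout as the `msg:` field of the assert task.
message = "Hosts in inventory not rooted to dom0: " + " ".join(missing)
print(message)
```

With this sample data the check would fail and name `hkgsuperset.ooni.io`, which is the kind of omission the commit message describes.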
8 changes: 8 additions & 0 deletions ansible/roles/prometheus/files/alert_rules.yml
@@ -27,6 +27,14 @@ groups:
    annotations:
      summary: '{{ $labels.instance }} %iowait over 90%'

  # the difference between node_disk_{io,read,write}_time_ms is not clear, `io` is NOT `read + write`, it may be greater, it may be less...
  # All the nodes have `node_disk_io_time_ms`, but it can be verified with expr: (sum without(device) (node_disk_io_time_ms{job="node"}) or up{job="node"}) == 1
  - alert: IOHigh
    expr: irate(node_disk_io_time_ms{device!~"(nbd[0-9]+|dm-[0-9]+|ram[0-9]+|sr[0-9]+|md[0-9]+)"}[1m]) > 800
    for: 2h
    annotations:
      summary: '{{ $labels.instance }}/{{ $labels.device }} spends {{ $value }}ms/s in IO over 2 hours'

  - alert: CPUHigh
    expr: sum without (mode, cpu) (irate(node_cpu{mode!="idle"}[1m])) > 0.75
    for: 8h
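
To make the `IOHigh` expression more concrete: `irate` takes the last two samples in the range and returns a per-second rate, so for a counter measured in milliseconds the result is milliseconds of IO time per second, and `> 800` means the device was busy for more than roughly 80% of that interval. The sketch below is a simplified Python model, not Prometheus code. It ignores counter resets and does not model the `for: 2h` clause, and the 15-second sample spacing and counter values are assumed for illustration.

```python
import re

# Same device pattern as in the alert; Prometheus anchors regex matchers,
# which is modelled here with fullmatch().
EXCLUDED = re.compile(r"(nbd[0-9]+|dm-[0-9]+|ram[0-9]+|sr[0-9]+|md[0-9]+)")


def is_monitored(device):
    """True if the device passes the `device!~"..."` matcher."""
    return EXCLUDED.fullmatch(device) is None


def irate(samples):
    """Per-second rate from the last two (timestamp_s, counter_ms) samples.

    Simplification: counter resets are not handled.
    """
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


def io_high(samples, threshold_ms_per_s=800):
    """True if the instantaneous IO time rate exceeds the alert threshold."""
    return irate(samples) > threshold_ms_per_s


# Hypothetical samples taken 15 seconds apart.
samples = [(0.0, 1000.0), (15.0, 13750.0)]
rate = irate(samples)
print(rate, io_high(samples), is_monitored("sda"), is_monitored("dm-0"))
```

In the real rule the condition additionally has to hold continuously for two hours before the alert fires, which filters out short bursts of disk activity.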