Monitoring epic, Oct 2017 … Sep 2018 #226

darkk · 2018-09-05T13:34:35Z

This ticket tracks enhancements to OONI infra monitoring made between Oct 2017 and Sep 2018.

darkk · 2018-09-28T12:01:07Z

dockerd monitoring has at least one metric that sounds interesting: engine_daemon_container_actions_seconds_count{action="start"}, whose rate gives the rate of container restarts. Also, docker seems to implement backoff while restarting, so the restart loop is not a "naive" busy loop.
But the Prometheus exporter in dockerd has to be enabled explicitly.
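
To illustrate how a start counter turns into a restart rate, here is a minimal Python sketch. It assumes the counter has already been scraped into (timestamp, value) pairs; the function name `restart_rate` and the sample data are hypothetical, and the arithmetic is a simplification of what a Prometheus `rate()` query does (it does not extrapolate to the window edges).

```python
def restart_rate(samples):
    """Per-second increase of a monotonically increasing counter.

    samples: list of (unix_timestamp, counter_value) pairs, oldest first.
    A drop in value is treated as a counter reset (e.g. dockerd restarted).
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter starts again from zero, so the whole
        # current value counts as new increase.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical samples of the action="start" counter, one minute apart.
starts = [(0, 10), (60, 13), (120, 16)]
print(restart_rate(starts) * 60, "container starts per minute")
```

A steady non-zero value here would suggest a container stuck in a restart loop, although the backoff mentioned above means the rate should decay over time.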

darkk · 2018-09-29T18:01:21Z

monitor number of systemd service restarts-per-minute

That's impossible with node_exporter 0.16.0, as it exports no metrics that can express that event. It seems the unreleased version of node_exporter has a similar feature, but it's not in a "stable" release yet. Frequent restarters are not "active" services, though, so it's probably okay.

monitor number of systemd services in failed state

node_systemd_units in node_exporter 0.16.0 shows that; the exporter also exposes the is-system-running meta self-test as node_systemd_system_running.

Exporting systemd metrics should be done carefully, as most units are useless (*.device, *.mount and so on).
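
As a rough sketch of that kind of filtering, the Python snippet below keeps only unit names with selected suffixes. The set of suffixes worth keeping (service, socket, timer) is an assumption made for illustration, not something decided in this ticket, and `interesting_units` is a hypothetical helper.

```python
import re

# Assumption: only these unit types describe supervised workloads worth exporting.
KEEP_UNIT = re.compile(r".+\.(service|socket|timer)")


def interesting_units(unit_names):
    """Drop *.device, *.mount and other unit types that are not in KEEP_UNIT."""
    return [name for name in unit_names if KEEP_UNIT.fullmatch(name)]


units = ["sshd.service", "dev-sda1.device", "tmp.mount", "docker.socket"]
print(interesting_units(units))
```

A similar regular expression could presumably be given to the exporter's own unit filter, if the deployed version offers one.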

darkk · 2018-09-29T18:12:58Z

investigate running docker containers as systemd services #220

It seems this may be useful for graceful shutdown #172

hellais · 2018-10-12T14:37:28Z

monitor number of systemd service restarts-per-minute

I also commented in #220, but posting it here too:

https://www.robustperception.io/alerting-on-crash-loops-with-prometheus

This does require that the to-be-monitored process uses a Prometheus-aware library. This is the case for gorush.
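
The idea can be sketched in Python as follows, assuming the library exports a start-time gauge such as process_start_time_seconds and that its recent samples are available as a list. The helper names and the threshold of three restarts are hypothetical; counting how often the gauge changes mirrors what a `changes()` query does in Prometheus.

```python
def count_changes(values):
    """Number of times consecutive samples of a gauge differ."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur != prev)


def is_crash_looping(start_times, threshold=3):
    """True if the process start time changed at least `threshold` times.

    threshold is an illustrative value, not a recommendation.
    """
    return count_changes(start_times) >= threshold


# Hypothetical samples: the start time changed twice within the window.
samples = [1539350000, 1539350000, 1539350400, 1539350400, 1539350900]
print(count_changes(samples), is_crash_looping(samples))
```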

SuperQ · 2018-10-12T14:40:07Z

I'll be releasing a new node_exporter with the systemd restarts metric in the next week or so.

darkk · 2018-10-12T15:33:26Z

@hellais I would love to see it done in a more generic way, as gorush is just a single service out of a dozen. That's why I aimed for systemd & docker metrics.

SuperQ · 2018-10-12T15:41:01Z

@darkk IMO, directly instrumenting is extremely valuable. Even just adding the default Prometheus client_golang for example gives you a number of process metrics, including Go internals, process CPU, RSS, etc.

It also gives you the standard Prometheus up that can be used to discover down tasks.

Systemd is quite good at supervising failing processes, so this signal is a useful "generic" alert. Individual systemd units are not monitored yet, as the utility of that data is unclear. See also #220 and #226.

darkk · 2018-10-31T15:40:12Z

node_systemd_system_running

For some reason many hosts lacked the dbus package, which is needed to access systemd as an unprivileged user (while all other systemd-enabled hosts had it!). I'm recording the list here for historical purposes:

amsapi.ooni.nu
amsdataproxy.ooni.nu
amsmetadb.ooni.nu
analytics.ooni.io
get.ooni.io
hkgcollectora.ooni.nu
hkgjump.ooni.nu
hkgmetadb.infra.ooni.io
hkgwebconnectivitya.ooni.nu
jupyter.ooni.io
labs.ooni.io
msg.ooni.io
run.ooni.io
shwdc.ooni.io
ssdams.infra.ooni.io
wiki.ooni.io

See #226 and #189

darkk added a commit that referenced this issue Oct 31, 2018

Alert on anomalously high LA, CPU, RX, TX and low RAM

030d887

That should highlight resource exhaustion and possible malicious activity. See #101, #135 and #155, all tracked under the #226 umbrella.

darkk added a commit that referenced this issue Oct 31, 2018

Alert on high duration of high IO rate

4a52e2d

This also fixes hkgsuperset.ooni.io missing from `dom0` and `hkg` inventory groups. See #155 and #226

darkk added a commit that referenced this issue Oct 31, 2018

Add AlertmanagerNotificationsFailing alert

e7a5b2e

See #226 and #189

darkk mentioned this issue Oct 31, 2018

More monitoring & alerting from 2018-epic #238

Merged

darkk closed this as completed in eaeb1c1 Oct 31, 2018
