Monitoring epic, Oct 2017 … Sep 2018 #226

Closed
31 tasks done
darkk opened this issue Sep 5, 2018 · 8 comments

Comments

@darkk
Contributor

darkk commented Sep 5, 2018

This ticket tracks enhancements to OONI infra monitoring made from Oct 2017 through Sep 2018.

@darkk
Contributor Author

darkk commented Sep 28, 2018

dockerd monitoring has at least one metric that looks interesting: engine_daemon_container_actions_seconds_count{action="start"}, whose rate approximates the rate of container restarts. Docker also seems to implement backoff while restarting, so a restart loop is not a "naive" busy loop.
Note that the Prometheus exporter in dockerd has to be enabled explicitly.
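
To make the "counter becomes a rate" idea concrete, here is a minimal sketch in Python. It reads the counter from two scrapes in the Prometheus text exposition format and turns the difference into starts per minute. The scrape contents, the 60-second interval, and the helper names `read_counter` and `starts_per_minute` are assumptions made for illustration; in practice a PromQL `rate()` over the same counter would do this on the Prometheus side.

```python
import re

# One sample line of the Prometheus text exposition format:
#   metric_name{label="value",...} 123
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def read_counter(text, name, labels):
    """Return the first sample matching name and labels, or None."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        m = SAMPLE_RE.match(line)
        if not m or m.group("name") != name:
            continue
        found = dict(LABEL_RE.findall(m.group("labels") or ""))
        if all(found.get(k) == v for k, v in labels.items()):
            return float(m.group("value"))
    return None


def starts_per_minute(prev, curr, interval_seconds):
    # A counter that went down means the daemon restarted and the counter
    # was reset, so only the value accumulated since the reset is counted.
    delta = curr - prev if curr >= prev else curr
    return delta * 60.0 / interval_seconds


# Hypothetical scrapes taken 60 seconds apart.
SCRAPE_1 = '''
# first scrape
engine_daemon_container_actions_seconds_count{action="create"} 4
engine_daemon_container_actions_seconds_count{action="start"} 10
'''
SCRAPE_2 = '''
# second scrape
engine_daemon_container_actions_seconds_count{action="create"} 4
engine_daemon_container_actions_seconds_count{action="start"} 16
'''

METRIC = "engine_daemon_container_actions_seconds_count"
before = read_counter(SCRAPE_1, METRIC, {"action": "start"})
after = read_counter(SCRAPE_2, METRIC, {"action": "start"})
print(starts_per_minute(before, after, 60))  # 6.0
```

A steady non-zero value here, on a host where no deployments are happening, would suggest that some container is being restarted repeatedly.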

@darkk
Contributor Author

darkk commented Sep 29, 2018

monitor number of systemd service restarts-per-minute

That's impossible with node_exporter 0.16.0, since it exports no metrics that can express that event. It seems an unreleased version of node_exporter has a similar feature, but it's not in a "stable" release yet. However, frequent restarters are not "active" services, so it's probably okay.

monitor number of systemd services in failed state

node_systemd_units in node_exporter 0.16.0 shows that, and the exporter also publishes the result of the is-system-running self-test as node_systemd_system_running.

Exporting systemd metrics should be done carefully, as most units are not useful to monitor (*.device, *.mount and so on).
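
As a rough illustration of that filtering, the sketch below drops noisy unit types by suffix before looking for failed units. The suffix list beyond `.device` and `.mount`, the sample unit names and states, and the helper names are assumptions, not data from any real host.

```python
# Unit types that are rarely worth alerting on. Only ".device" and ".mount"
# come from the comment above; the rest of the tuple is an assumption.
NOISY_SUFFIXES = (".device", ".mount", ".scope", ".slice", ".swap")


def is_interesting(unit_name):
    """True for units that are worth exporting, judged by unit type."""
    return not unit_name.endswith(NOISY_SUFFIXES)


def failed_units(unit_states):
    """unit_states is an iterable of (unit_name, state) pairs."""
    return sorted(
        name
        for name, state in unit_states
        if state == "failed" and is_interesting(name)
    )


# Hypothetical unit states, for illustration only.
UNITS = [
    ("docker.service", "active"),
    ("gorush.service", "failed"),
    ("dev-sda1.device", "active"),
    ("mnt-backup.mount", "failed"),
    ("ssh.service", "active"),
]

print(failed_units(UNITS))  # ['gorush.service']
```

The failed mount is ignored on purpose here; whether that is the right call depends on how much a host relies on that mount.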

@darkk
Contributor Author

darkk commented Sep 29, 2018

investigate running docker containers as systemd services #220

It seems this may be useful for graceful shutdown (#172).

@hellais
Member

hellais commented Oct 12, 2018

monitor number of systemd service restarts-per-minute

I also commented in #220, but posting it here too:

https://www.robustperception.io/alerting-on-crash-loops-with-prometheus

This does require that the to-be-monitored process uses a Prometheus-aware client library. This is the case for gorush.
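
A minimal sketch of the crash-loop idea, assuming it relies on the process_start_time_seconds metric that Prometheus client libraries expose: every restart changes that value, so counting value changes within a window (what the PromQL `changes()` function does) counts restarts. The sample values, the threshold, and the function names below are hypothetical.

```python
def count_changes(samples):
    """Count how often a value differs from the sample right before it."""
    return sum(1 for prev, curr in zip(samples, samples[1:]) if curr != prev)


def is_crash_looping(start_times, max_restarts):
    """True if the process restarted more than max_restarts times in the window."""
    return count_changes(start_times) > max_restarts


# Hypothetical process_start_time_seconds samples from one alerting window.
STABLE = [1539331200.0] * 6
FLAPPING = [
    1539331200.0,
    1539331200.0,
    1539331260.0,  # restart
    1539331320.0,  # restart
    1539331320.0,
    1539331400.0,  # restart
]

print(count_changes(STABLE), is_crash_looping(STABLE, max_restarts=2))
print(count_changes(FLAPPING), is_crash_looping(FLAPPING, max_restarts=2))
```

Restarts that happen between two scrapes without a sample in between are seen as a single change, so the count is a lower bound.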

@SuperQ
Contributor

SuperQ commented Oct 12, 2018

I'll be releasing a new node_exporter with the systemd restarts metric in the next week or so.

@darkk
Contributor Author

darkk commented Oct 12, 2018

@hellais I would love to see it done in a more generic way, as gorush is just a single service out of a dozen. That's why I aimed for systemd & docker metrics.

@SuperQ
Contributor

SuperQ commented Oct 12, 2018

@darkk IMO, direct instrumentation is extremely valuable. Even just adding the default Prometheus client_golang, for example, gives you a number of process metrics, including Go internals, process CPU, RSS, etc.

It also gives you the standard Prometheus up that can be used to discover down tasks.

darkk added a commit that referenced this issue Oct 31, 2018
That should highlight resource exhaustion and possible malicious
activity. See #101, #135 and #155 umbrelled under #226.
darkk added a commit that referenced this issue Oct 31, 2018
This also fixes hkgsuperset.ooni.io missing from `dom0` and `hkg`
inventory groups. See #155 and #226
darkk added a commit that referenced this issue Oct 31, 2018
Systemd is quite good at supervising failing processes, so this signal
is a useful "generic" alert.  Individual systemd units are not monitored
yet as the utility of that data is unclear. See also #220 and #226.
@darkk
Contributor Author

darkk commented Oct 31, 2018

node_systemd_system_running

For some reason, many hosts lacked the dbus package, which is needed to access systemd as an unprivileged user (all other systemd-enabled hosts had it!). I'm recording the list here for historical purposes:

  • amsapi.ooni.nu
  • amsdataproxy.ooni.nu
  • amsmetadb.ooni.nu
  • analytics.ooni.io
  • get.ooni.io
  • hkgcollectora.ooni.nu
  • hkgjump.ooni.nu
  • hkgmetadb.infra.ooni.io
  • hkgwebconnectivitya.ooni.nu
  • jupyter.ooni.io
  • labs.ooni.io
  • msg.ooni.io
  • run.ooni.io
  • shwdc.ooni.io
  • ssdams.infra.ooni.io
  • wiki.ooni.io

3 participants