Use systemd-notify for worker heartbeating #20840

Merged
merged 6 commits into ManageIQ:master from agrare:use_sd_notify_systemd_heartbeating on Jan 4, 2021

Conversation

agrare
Member

@agrare agrare commented Nov 24, 2020

Notify the systemd service manager when a worker has finished starting up and when it is shutting down, and poke the watchdog for heartbeating.

Systemd has a watchdog facility that tracks the status of services via "liveness check-ins" performed through the sd_notify(3) API.

A watchdog timeout can be specified in the service file: https://www.freedesktop.org/software/systemd/man/systemd.service.html#WatchdogSec=

WatchdogSec=

Configures the watchdog timeout for a service. The watchdog is activated when the start-up is completed. The service must call sd_notify(3) regularly with "WATCHDOG=1" (i.e. the "keep-alive ping"). If the time between two such calls is larger than the configured time, then the service is placed in a failed state and it will be terminated with SIGABRT (or the signal specified by WatchdogSignal=).

When a worker is killed due to a heartbeat timeout, SIGABRT is delivered:

Nov 24 13:57:27 desktop systemd[2415]: Started [email protected].
Nov 24 13:59:27 desktop systemd[2415]: [email protected]: Watchdog timeout (limit 2min)!
Nov 24 13:59:27 desktop systemd[2415]: [email protected]: Killing process 89351 (ruby) with signal SIGABRT.
Nov 24 13:59:27 desktop systemd[2415]: [email protected]: Main process exited, code=killed, status=6/ABRT
Nov 24 13:59:27 desktop systemd[2415]: [email protected]: Failed with result 'watchdog'
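
For reference, the worker side boils down to a few calls to the sd_notify gem this PR adds; a minimal sketch (the work loop and shutdown check below are placeholders, not the actual worker code):

```ruby
require "sd_notify"

# Tell systemd the worker has finished starting up (completes Type=notify startup).
SdNotify.ready

# Heartbeat loop: ping the watchdog at least once per WatchdogSec interval.
loop do
  do_work                      # placeholder for the worker's real unit of work
  SdNotify.watchdog            # sends "WATCHDOG=1" over the notify socket
  break if shutdown_requested? # placeholder shutdown check
end

# Tell systemd the worker is shutting down cleanly.
SdNotify.stopping
```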

TODO

  • Adjust WATCHDOG_USEC= based on dynamic queue_timeout
  • Disable MiqServer heartbeat checking for systemd/podified

@agrare agrare changed the title Use systemd-notify for worker heartbeating [WIP] Use systemd-notify for worker heartbeating Nov 24, 2020
@agrare
Member Author

agrare commented Nov 24, 2020

So I've been trying to replace as much of what MiqServer does with native systemd utilities as possible, and worker status/liveness checks are definitely one such place.

We do need to keep in mind how this will play with "bring your own image". We don't want workers to have to inherit from core worker classes in the future the way that we do now. Moving these core features out of MiqServer allows for more flexibility in implementation.

This approach can put more onus on the byoi author because they are responsible for their own heartbeating, but it is similar to a kubernetes pod bringing its own liveness check.

Another way to do this would be for MiqServer to scan the heartbeat files and call sd_notify_pid on the workers' behalf. We would have to find the PID from the GUID, which has led to race conditions in the past, and having something time-critical run from a core process just to avoid using the sd_notify API feels like extra overhead. It also doesn't solve the starting/stopping timeout components of sd_notify. There are Go and Python packages for sd_notify (https://github.com/coreos/go-systemd/blob/master/daemon/sdnotify.go, https://pypi.org/project/sdnotify/), and lacking a package you can always just write to the notifier socket, so the barrier to entry is very low.

@agrare agrare force-pushed the use_sd_notify_systemd_heartbeating branch from 93a0621 to 1ecb58e Compare November 24, 2020 20:02
@@ -142,6 +142,7 @@ def starting_worker_record

def started_worker_record
reload_worker_record
@worker.sd_notify_started if @worker.systemd_worker?
Member Author

I don't love doing this here; it feels like we're going up to the worker instance for all of these, so maybe moving all of this into @worker.started! or something would be better? I'm just testing having systemd check the starting/stopping/watchdog timeouts, and this seemed to be the place where we note that something is started.

Member

I'm hoping we can find all the places we're doing systemd/podified-related checks and move them up instead of leaving them deep in methods like this. I don't know if we'll be able to do this until we have more implemented differently in each process management system.

@agrare agrare force-pushed the use_sd_notify_systemd_heartbeating branch from 1ecb58e to 17ed39c Compare November 25, 2020 13:20
@Fryguy
Member

Fryguy commented Nov 25, 2020

I am also concerned about "bring your own image". While there are packages to do this in various languages, it feels like it might introduce complexities. For example, I assume FFI is a component of many of those packages, and so that complicates building the image, making sure ffi is up to date, etc.

In podified, we expect the worker to write to a file for a heartbeat, which is dead simple, and then separately we bring the liveness check script which essentially just checks the heartbeat file. In appliances I was thinking of a similar liveness check (perhaps the same one?) that the orchestrator could run and then notify systemd (I think this is nearly the same thing you said in the last paragraph of #20840 (comment)).
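
(To illustrate the shape of that check, not the actual script: it's essentially "is the heartbeat file fresh enough?", something like the sketch below, where the path handling and max age are made up for the example.)

```ruby
# Illustrative liveness check only -- not the real podified script.
# Exit 0 if the worker's heartbeat file was touched recently, 1 otherwise.
heartbeat_file = ARGV[0] # e.g. the worker's tmp/<guid>.hb file
max_age        = 120     # seconds; made up for the example

exit 1 unless heartbeat_file && File.exist?(heartbeat_file)
exit((Time.now - File.mtime(heartbeat_file)) <= max_age ? 0 : 1)
```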

However, I didn't think that processes could notify on behalf of other processes. In fact, this was the reason that I was against using systemd-notify ultimately. But if a process can notify on behalf of another that changes everything.

@Fryguy
Member

Fryguy commented Nov 25, 2020

We would have to find the PID from the GUID which has led to race conditions in the past and having something time-critical run from a core process just to avoid using the sd_notify api feels like extra overhead.

Do we? I thought the heartbeat file had the PID already in it. If we are scanning heartbeat files, then I don't think we need anything.

@agrare
Member Author

agrare commented Nov 25, 2020

For example, I assume FFI is a component of many of those packages, and so that complicates building the image, making sure ffi is up to date, etc.

Actually none of these use FFI, as sd_notify isn't a binary library; you just write to a socket whose path systemd passes in an env var (see "lacking a package you can always just write to the notifier socket so the barrier to entry is very low").
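
To make that concrete, with no gem at all it's just a UNIX datagram socket write (a sketch; the real protocol also handles abstract-namespace socket paths, which this skips):

```ruby
require "socket"

# Send a notification to systemd using only the NOTIFY_SOCKET env var.
def sd_notify(message)
  path = ENV["NOTIFY_SOCKET"]
  return if path.nil? || path.empty? # not running under systemd with NotifyAccess

  sock = Socket.new(:UNIX, :DGRAM)
  sock.connect(Socket.pack_sockaddr_un(path))
  sock.write(message)
ensure
  sock&.close
end

sd_notify("READY=1")    # startup complete
sd_notify("WATCHDOG=1") # keep-alive ping
sd_notify("STOPPING=1") # clean shutdown
```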

In podified, we expect the worker to write to a file for a heartbeat, which is dead simple, and then separately we bring the liveness check

True, bring your own image might mean bring your own liveness check as well, e.g. a provider could bring a podified template and a service file for a worker that is completely opaque to us. I'm trying to further decouple from core as much as possible.

Do we? I thought the heartbeat file had the PID already in it. If we are scanning heartbeat files, then I don't think we need anything.

A heartbeat file is just tmp/#{guid}.hb and contains only a timestamp, so we'd have to look up the PID by GUID, which is exactly the race condition we had before while a worker was restarting. If we do go this route we really should change the heartbeat file to include the PID.

What we could do is have a bash script that heartbeats for the worker, similar to how podified calls a script we provide without miq_server being involved. I'm just hesitant to add more worker management logic into miq_server beyond dictating what should be running.

@Fryguy
Member

Fryguy commented Nov 30, 2020

For example, I assume FFI is a component of many of those packages, and so that complicates building the image, making sure ffi is up to date, etc.

Actually none of these use FFI, as sd_notify isn't a binary library; you just write to a socket whose path systemd passes in an env var (see "lacking a package you can always just write to the notifier socket so the barrier to entry is very low").

Oh that's huge... I read the docs and completely missed that bit. If that's the case, writing to a socket is simple enough that I'm good with that approach.

@Fryguy
Member

Fryguy commented Nov 30, 2020

Do we? I thought the heartbeat file had the PID already in it. If we are scanning heartbeat files, then I don't think we need anything.

A heartbeat file is just tmp/#{guid}.hb and contains only a timestamp, so we'd have to look up the PID by GUID, which is exactly the race condition we had before while a worker was restarting. If we do go this route we really should change the heartbeat file to include the PID.

I agree...we may even want to write the GUID into the heartbeat file
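
Purely as a hypothetical sketch of that direction (today the file holds only a timestamp), the worker could write something like:

```ruby
require "json"
require "time"

# Hypothetical heartbeat payload: include the PID and GUID so a reader never
# has to resolve GUID -> PID and race against a restarting worker.
worker_guid = ENV["WORKER_GUID"] # placeholder for however the worker knows its GUID

heartbeat = {
  "guid"      => worker_guid,
  "pid"       => Process.pid,
  "timestamp" => Time.now.utc.iso8601
}

File.write("tmp/#{worker_guid}.hb", heartbeat.to_json)
```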

@Fryguy
Member

Fryguy commented Nov 30, 2020

What we could do is have a bash script that heartbeats for the worker, which is similar to how podified calls a script we provide and miq_server isn't involved. I'm just hesitant to add more worker management logic into miq_server beyond dictating what should be running.

It's really important for the worker to do its own heartbeating. If an external entity heartbeats for it, all it is really checking is the existence of a process, whereas a worker doing its own heartbeating is also finding issues like infinite loops or unexpected long running calls.

@agrare agrare force-pushed the use_sd_notify_systemd_heartbeating branch 2 times, most recently from 645d816 to e6f6355 Compare December 3, 2020 14:02
@agrare agrare requested a review from kbrock as a code owner December 3, 2020 14:02
@agrare agrare force-pushed the use_sd_notify_systemd_heartbeating branch 2 times, most recently from 26185c0 to 49de68a Compare December 14, 2020 19:52
@agrare agrare force-pushed the use_sd_notify_systemd_heartbeating branch from 49de68a to 01da008 Compare December 14, 2020 19:57
@miq-bot
Member

miq-bot commented Dec 14, 2020

Checked commits agrare/manageiq@d1242f2~...01da008 with ruby 2.6.3, rubocop 0.82.0, haml-lint 0.35.0, and yamllint
5 files checked, 2 offenses detected

app/models/miq_worker/systemd_common.rb

  • ❗ - Line 123, Col 5 - Rails/Delegate - Use delegate to define delegations.
  • ❗ - Line 127, Col 5 - Rails/Delegate - Use delegate to define delegations.

# worker monitoring (liveness, memory threshold) is handled by those
# systems. Only when workers are run as standalone processes does MiqServer
# have to monitor the workers itself.
return if podified? || systemd?
Member Author

So looking at this method, it:

  1. Updates the MiqWorker#last_heartbeat from the tmp/guid.hb heartbeat files
  2. Validates the worker heartbeat and memory limits [ref]

After this PR neither of these will be necessary, either on k8s (no access to heartbeat files, and mem/cpu limits are handled by k8s) or on systemd (systemd-notify for heartbeating and cgroups for mem/cpu limits).

Member Author

Side note, in a lot of places we ask if podified? and if systemd? but we have no way of asking if we're using the third monitor type, process. Ideally here I'd like to say return unless worker_monitor_type_process? since we have three worker monitor types: k8s, systemd, and legacy/process.

Member

Side note 2, I am seeing a few categories of "worker_monitor_types" that we make decisions on:

  1. external_liveness_monitor?
  2. external_readiness_monitor?
  3. external_resource_limit_monitor?
  4. isolated processes: decisions about where a piece of work can run (you can't get process information for workers from the server, or vice versa, in pods).

I'm sure there are others but maybe we can come up with good names for these concerns.

Member Author

This is a good point. I've been assuming that any "external worker monitor", i.e. k8s or systemd, will take care of all or most of the "every 15 seconds make sure the workers are okay" work for us, but some may provide most of it and not all. I'm thinking that when we move to a more pluggable worker monitor class these won't be questions that have to be asked but rather methods that are implemented or not, which should clean things up nicely (:crossed_fingers:)

Unsure of readiness vs liveness, is readiness the "are there enough resources to start this new worker" check?

Can you expand on "decisions about where a piece of work can run"? Is that "make MiqServer more like k8s and schedule workers across the appliances"? If so, yeah, that is a big one that I haven't even started to tackle yet; I think this is going to be a fundamental rethink of how we schedule workers (per server vs per region). We "got away with it" on podified because we went to a single MiqServer, but with actual appliance VMs maybe the worker counts should be at the region level and divvied up by the master MiqServer?

Member

I'm thinking when we move to a more pluggable worker monitor class these won't be questions that have to be asked but rather methods that are implemented or not and should clean things up nicely

👍

Unsure of readiness vs liveness, is readiness the "are there enough resources to start this new worker" check?

liveness == heartbeat check
readiness == rails server is available (port 3000) for puma based workers
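
(For illustration, a readiness probe of that shape is just "can I open port 3000?"; the host and port below are assumptions for the example, not the actual probe.)

```ruby
require "socket"

# Illustrative readiness probe for a puma-based worker: succeed only if the
# Rails server is accepting connections on port 3000.
begin
  TCPSocket.new("127.0.0.1", 3000).close
  exit 0
rescue SystemCallError
  exit 1
end
```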

Can you expand on "decisions about where a piece of work can run"?

There are places that have expectations that the server and workers reside on the same process space / filesystem. For example: #20835

The workaround for these has been to make the right process fetch the environmental information for itself instead of doing it for someone else. Generally, it's been places where the assumption was that we needed to run on the miq server. Previously, any worker on that server was fine, but that doesn't work in pods, so it makes sense to make everything queue things for the correct process instead of assuming one process can see other processes. Like this: jrafanie@1c3d8a7

Another example: #20290

Member Author

There are places that have expectations that the server and workers reside on the same process space / filesystem. For example

Ahh yes, okay, I see what you mean: basically anything doing "worker management" outside of MiqServer on the assumption that it can do so via the PID (processInfo(pid) in the first example and kill(pid) in the second).

Member

Exactly, and I believe we had a similar change for embedded ansible since that relied on the filesystem having the ansible repo checked out. It might have changed since, but the idea was that you can't assume the "server with embedded ansible" can access the locally checked out ansible repo, only the process that's guaranteed to have checked it out. I don't know what to call this, but the isolation of process space (kill/processInfo(pid)) and filesystems (the ansible repo as an example) is what I'm describing.

@agrare agrare changed the title [WIP] Use systemd-notify for worker heartbeating Use systemd-notify for worker heartbeating Dec 14, 2020
@miq-bot miq-bot removed the wip label Dec 14, 2020
@Fryguy
Member

Fryguy commented Jan 4, 2021

LGTM...will let @jrafanie merge.

@@ -161,6 +161,7 @@ end

group :systemd, :optional => true do
gem "dbus-systemd", "~>1.1.0", :require => false
gem "sd_notify", "~>0.1.0", :require => false
Member

FYI, "A pure-Ruby implementation of sd_notify(3) that can be used to communicate state changes of Ruby programs to systemd." (no other dependencies and no C extensions 👏 )

@jrafanie jrafanie merged commit df00edf into ManageIQ:master Jan 4, 2021
@jrafanie jrafanie self-assigned this Jan 4, 2021
@agrare agrare deleted the use_sd_notify_systemd_heartbeating branch January 4, 2021 20:28