Use systemd-notify for worker heartbeating #20840

Merged
merged 6 commits into ManageIQ:master from agrare:use_sd_notify_systemd_heartbeating on Jan 4, 2021

Conversation

agrare
Member

@agrare agrare commented Nov 24, 2020

Notify the systemd service manager when a worker has finished starting up and when it is shutting down, and poke the watchdog for heartbeating.

Systemd has a watchdog facility that tracks the status of services via "liveness check-ins" performed through the sd_notify(3) API.

A watchdog timeout can be specified in the service file: https://www.freedesktop.org/software/systemd/man/systemd.service.html#WatchdogSec=

WatchdogSec=

Configures the watchdog timeout for a service. The watchdog is activated when the start-up is completed. The service must call sd_notify(3) regularly with "WATCHDOG=1" (i.e. the "keep-alive ping"). If the time between two such calls is larger than the configured time, then the service is placed in a failed state and it will be terminated with SIGABRT (or the signal specified by WatchdogSignal=).

When a worker is killed due to a heartbeat timeout, SIGABRT is delivered:

Nov 24 13:57:27 desktop systemd[2415]: Started [email protected].
Nov 24 13:59:27 desktop systemd[2415]: [email protected]: Watchdog timeout (limit 2min)!
Nov 24 13:59:27 desktop systemd[2415]: [email protected]: Killing process 89351 (ruby) with signal SIGABRT.
Nov 24 13:59:27 desktop systemd[2415]: [email protected]: Main process exited, code=killed, status=6/ABRT
Nov 24 13:59:27 desktop systemd[2415]: [email protected]: Failed with result 'watchdog'
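
For reference, the worker side boils down to a few calls to the sd_notify gem this PR adds; a minimal sketch (the work loop and shutdown check below are placeholders, not the actual worker code):

```ruby
require "sd_notify"

# Tell systemd the worker has finished starting up (completes Type=notify startup).
SdNotify.ready

# Heartbeat loop: ping the watchdog at least once per WatchdogSec interval.
loop do
  do_work                      # placeholder for the worker's real unit of work
  SdNotify.watchdog            # sends "WATCHDOG=1" over the notify socket
  break if shutdown_requested? # placeholder shutdown check
end

# Tell systemd the worker is shutting down cleanly.
SdNotify.stopping
```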

TODO

  • Adjust WATCHDOG_USEC= based on dynamic queue_timeout
  • Disable MiqServer heartbeat checking for systemd/podified

@agrare agrare changed the title Use systemd-notify for worker heartbeating [WIP] Use systemd-notify for worker heartbeating Nov 24, 2020
@agrare
Member Author

agrare commented Nov 24, 2020

So I've been trying to replace as much of what MiqServer does with native systemd utilities as possible, and worker status/liveness checks are definitely one such place.

We do need to keep in mind how this will play with "bring your own image". We don't want workers to have to inherit from core worker classes in the future the way that we do now. Moving these core features out of MiqServer allows for more flexibility in implementation.

This approach can put more onus on the byoi author because they are responsible for their own heartbeating, but it is similar to a kubernetes pod bringing its own liveness check.

Another way to do this would be for MiqServer to scan the heartbeat files and call sd_notify_pid on the workers' behalf. We would have to find the PID from the GUID, which has led to race conditions in the past, and having something time-critical run from a core process just to avoid using the sd_notify API feels like extra overhead. It also doesn't solve the starting/stopping timeout components of sd_notify. There are Go and Python packages for sd_notify (https://github.com/coreos/go-systemd/blob/master/daemon/sdnotify.go, https://pypi.org/project/sdnotify/), and lacking a package you can always just write to the notifier socket, so the barrier to entry is very low.

@agrare agrare force-pushed the use_sd_notify_systemd_heartbeating branch from 93a0621 to 1ecb58e Compare November 24, 2020 20:02
@@ -142,6 +142,7 @@ def starting_worker_record

def started_worker_record
reload_worker_record
@worker.sd_notify_started if @worker.systemd_worker?
Member Author

I don't love doing this here; it feels like we're going up to the worker instance for all of these, so maybe moving all of this into @worker.started! or something would be better? I'm just testing having systemd check the starting/stopping/watchdog timeouts, and this seemed to be the place where we note that something is started.

Member

I'm hoping we can find all the places we're doing systemd/podified-related checks and move them up instead of leaving them deep in methods like this. I don't know if we'll be able to do this until we have more implemented differently in each process management system.

@agrare agrare force-pushed the use_sd_notify_systemd_heartbeating branch from 1ecb58e to 17ed39c Compare November 25, 2020 13:20
@Fryguy
Member

Fryguy commented Nov 25, 2020

I am also concerned about "bring your own image". While there are packages to do this in various languages, it feels like it might introduce complexities. For example, I assume FFI is a component of many of those packages, and so that complicates building the image, making sure ffi is up to date, etc.

In podified, we expect the worker to write to a file for a heartbeat, which is dead simple, and then separately we bring the liveness check script which essentially just checks the heartbeat file. In appliances I was thinking of a similar liveness check (perhaps the same one?) that the orchestrator could run and then notify systemd (I think this is nearly the same thing you said in the last paragraph of #20840 (comment)).
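
(To illustrate the shape of that check, not the actual script: it's essentially "is the heartbeat file fresh enough?", something like the sketch below, where the path handling and max age are made up for the example.)

```ruby
# Illustrative liveness check only -- not the real podified script.
# Exit 0 if the worker's heartbeat file was touched recently, 1 otherwise.
heartbeat_file = ARGV[0] # e.g. the worker's tmp/<guid>.hb file
max_age        = 120     # seconds; made up for the example

exit 1 unless heartbeat_file && File.exist?(heartbeat_file)
exit((Time.now - File.mtime(heartbeat_file)) <= max_age ? 0 : 1)
```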

However, I didn't think that processes could notify on behalf of other processes. In fact, this was the reason that I was against using systemd-notify ultimately. But if a process can notify on behalf of another that changes everything.

@Fryguy
Member

Fryguy commented Nov 25, 2020

We would have to find the PID from the GUID which has led to race conditions in the past and having something time-critical run from a core process just to avoid using the sd_notify api feels like extra overhead.

Do we? I thought the heartbeat file had the PID already in it. If we are scanning heartbeat files, then I don't think we need anything.

@agrare
Member Author

agrare commented Nov 25, 2020

For example, I assume FFI is a component of many of those packages, and so that complicates building the image, making sure ffi is up to date, etc.

Actually none of these use FFI, as sd_notify isn't a binary library; you just write to a socket whose path systemd passes in an env var (see "lacking a package you can always just write to the notifier socket so the barrier to entry is very low").
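
To make that concrete, with no gem at all it's just a UNIX datagram socket write (a sketch; the real protocol also handles abstract-namespace socket paths, which this skips):

```ruby
require "socket"

# Send a notification to systemd using only the NOTIFY_SOCKET env var.
def sd_notify(message)
  path = ENV["NOTIFY_SOCKET"]
  return if path.nil? || path.empty? # not running under systemd with NotifyAccess

  sock = Socket.new(:UNIX, :DGRAM)
  sock.connect(Socket.pack_sockaddr_un(path))
  sock.write(message)
ensure
  sock&.close
end

sd_notify("READY=1")    # startup complete
sd_notify("WATCHDOG=1") # keep-alive ping
sd_notify("STOPPING=1") # clean shutdown
```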

In podified, we expect the worker to write to a file for a heartbeat, which is dead simple, and then separately we bring the liveness check

True, bring your own image might mean bring your own liveness check as well, e.g. a provider could bring a podified template and a service file for a worker that is completely opaque to us. I'm trying to further decouple from core as much as possible.

Do we? I thought the heartbeat file had the PID already in it. If we are scanning heartbeat files, then I don't think we need anything.

A heartbeat file is just tmp/#{guid}.hb and contains only a timestamp, so we'd have to look up the PID by GUID, which is exactly the race condition we had before while a worker was restarting. If we do go this route we really should change the heartbeat file to include the PID.

What we could do is have a bash script that heartbeats for the worker, similar to how podified calls a script we provide without miq_server being involved. I'm just hesitant to add more worker management logic into miq_server beyond dictating what should be running.

@Fryguy
Member

Fryguy commented Nov 30, 2020

For example, I assume FFI is a component of many of those packages, and so that complicates building the image, making sure ffi is up to date, etc.

Actually none of these use FFI, as sd_notify isn't a binary library; you just write to a socket whose path systemd passes in an env var (see "lacking a package you can always just write to the notifier socket so the barrier to entry is very low").

Oh that's huge... I read the docs and completely missed that bit. If that's the case, writing to a socket is simple enough that I'm good with that approach.

@Fryguy
Member

Fryguy commented Nov 30, 2020

Do we? I thought the heartbeat file had the PID already in it. If we are scanning heartbeat files, then I don't think we need anything.

A heartbeat file is just tmp/#{guid}.hb and contains only a timestamp, so we'd have to look up the PID by GUID, which is exactly the race condition we had before while a worker was restarting. If we do go this route we really should change the heartbeat file to include the PID.

I agree...we may even want to write the GUID into the heartbeat file
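
Purely as a hypothetical sketch of that direction (today the file holds only a timestamp), the worker could write something like:

```ruby
require "json"
require "time"

# Hypothetical heartbeat payload: include the PID and GUID so a reader never
# has to resolve GUID -> PID and race against a restarting worker.
worker_guid = ENV["WORKER_GUID"] # placeholder for however the worker knows its GUID

heartbeat = {
  "guid"      => worker_guid,
  "pid"       => Process.pid,
  "timestamp" => Time.now.utc.iso8601
}

File.write("tmp/#{worker_guid}.hb", heartbeat.to_json)
```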

@Fryguy
Member

Fryguy commented Nov 30, 2020

What we could do is have a bash script that heartbeats for the worker, which is similar to how podified calls a script we provide and miq_server isn't involved. I'm just hesitant to add more worker management logic into miq_server beyond dictating what should be running.

It's really important for the worker to do its own heartbeating. If an external entity heartbeats for it, all it is really checking is the existence of a process, whereas a worker doing its own heartbeating is also finding issues like infinite loops or unexpected long running calls.

@agrare agrare force-pushed the use_sd_notify_systemd_heartbeating branch 2 times, most recently from 645d816 to e6f6355 Compare December 3, 2020 14:02
@agrare agrare requested a review from kbrock as a code owner December 3, 2020 14:02
@agrare agrare force-pushed the use_sd_notify_systemd_heartbeating branch 2 times, most recently from 26185c0 to 49de68a Compare December 14, 2020 19:52
@agrare agrare force-pushed the use_sd_notify_systemd_heartbeating branch from 49de68a to 01da008 Compare December 14, 2020 19:57
@miq-bot
Member

miq-bot commented Dec 14, 2020

Checked commits agrare/manageiq@d1242f2~...01da008 with ruby 2.6.3, rubocop 0.82.0, haml-lint 0.35.0, and yamllint
5 files checked, 2 offenses detected

app/models/miq_worker/systemd_common.rb

  • ❗ - Line 123, Col 5 - Rails/Delegate - Use delegate to define delegations.
  • ❗ - Line 127, Col 5 - Rails/Delegate - Use delegate to define delegations.

# worker monitoring (liveness, memory threshold) is handled by those
# systems. Only when workers are run as standalone processes does MiqServer
# have to monitor the workers itself.
return if podified? || systemd?
Member Author

So looking at this method, it:

  1. Updates the MiqWorker#last_heartbeat from the tmp/guid.hb heartbeat files
  2. Validates the worker heartbeat and memory limits [ref]

After this PR neither of these will be necessary, either on k8s (no access to heartbeat files, and mem/cpu limits are handled by k8s) or on systemd (systemd-notify for heartbeating and cgroups for mem/cpu limits).

Member Author

Side note, in a lot of places we ask if podified? and if systemd? but we have no way of asking if we're using the third monitor type, process. Ideally here I'd like to say return unless worker_monitor_type_process? since we have three worker monitor types: k8s, systemd, and legacy/process.

Member

Side note 2, I am seeing a few categories of "worker_monitor_types" that we make decisions on:

  1. external_liveness_monitor?
  2. external_readiness_monitor?
  3. external_resource_limit_monitor?
  4. isolated processes: decisions about where a piece of work can run (you can't get process information for workers from the server, or vice versa, in pods).

I'm sure there are others but maybe we can come up with good names for these concerns.

Member Author

This is a good point. I've been assuming that any "external worker monitor", i.e. k8s or systemd, will take care of all or most of the "every 15 seconds make sure the workers are okay" work for us, but some may provide most of it and not all. I'm thinking that when we move to a more pluggable worker monitor class these won't be questions that have to be asked but rather methods that are implemented or not, which should clean things up nicely (:crossed_fingers:)

Unsure of readiness vs liveness, is readiness the "are there enough resources to start this new worker" check?

Can you expand on "decisions about where a piece of work can run"? Is that "make MiqServer more like k8s and schedule workers across the appliances"? If so, yeah, that is a big one that I haven't even started to tackle yet; I think this is going to be a fundamental rethink of how we schedule workers (per server vs per region). We "got away with it" on podified because we went to a single MiqServer, but with actual appliance VMs maybe the worker counts should be at the region level and divvied up by the master MiqServer?

Member

I'm thinking when we move to a more pluggable worker monitor class these won't be questions that have to be asked but rather methods that are implemented or not and should clean things up nicely

👍

Unsure of readiness vs liveness, is readiness the "are there enough resources to start this new worker" check?

liveness == heartbeat check
readiness == rails server is available (port 3000) for puma based workers
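
(For illustration, a readiness probe of that shape is just "can I open port 3000?"; the host and port below are assumptions for the example, not the actual probe.)

```ruby
require "socket"

# Illustrative readiness probe for a puma-based worker: succeed only if the
# Rails server is accepting connections on port 3000.
begin
  TCPSocket.new("127.0.0.1", 3000).close
  exit 0
rescue SystemCallError
  exit 1
end
```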

Can you expand on "decisions about where a piece of work can run"?

There are places that have expectations that the server and workers reside on the same process space / filesystem. For example: #20835

The workaround for these has been to make the right process fetch the environmental information for itself instead of doing it for someone else. Generally, it's been places where the assumption was that we needed to run on the miq server. Previously, any worker on that server was fine, but that doesn't work in pods, so it makes sense to make everything queue things for the correct process instead of assuming one process can see other processes. Like this: jrafanie@1c3d8a7

Another example: #20290

Member Author

There are places that have expectations that the server and workers reside on the same process space / filesystem. For example

Ahh yes, okay, I see what you mean: basically anything doing "worker management" outside of MiqServer on the assumption that it can do so via the PID (processInfo(pid) in the first example and kill(pid) in the second).

Member

Exactly, and I believe we had a similar change for embedded ansible since that relied on the filesystem having the ansible repo checked out. It might have changed since, but the idea was that you can't assume the "server with embedded ansible" can access the locally checked out ansible repo, only the process that's guaranteed to have checked it out. I don't know what to call this, but the isolation of process space (kill/processInfo(pid)) and filesystems (the ansible repo as an example) is what I'm describing.

@agrare agrare changed the title [WIP] Use systemd-notify for worker heartbeating Use systemd-notify for worker heartbeating Dec 14, 2020
@miq-bot miq-bot removed the wip label Dec 14, 2020
@Fryguy
Member

Fryguy commented Jan 4, 2021

LGTM...will let @jrafanie merge.

@@ -161,6 +161,7 @@ end

group :systemd, :optional => true do
gem "dbus-systemd", "~>1.1.0", :require => false
gem "sd_notify", "~>0.1.0", :require => false
Member

FYI, "A pure-Ruby implementation of sd_notify(3) that can be used to communicate state changes of Ruby programs to systemd." (no other dependencies and no C extensions 👏 )

@jrafanie jrafanie merged commit df00edf into ManageIQ:master Jan 4, 2021
@jrafanie jrafanie self-assigned this Jan 4, 2021
@agrare agrare deleted the use_sd_notify_systemd_heartbeating branch January 4, 2021 20:28