Use systemd-notify for worker heartbeating #20840
Conversation
So I've been trying to replace as much of what MiqServer does with native systemd utilities as possible, and worker status/liveness checks are definitely one such place.

We do need to keep in mind how this will play with "bring your own image". We don't want workers to have to inherit from core worker classes in the future the way that we do now, and moving these core features out of MiqServer allows for more flexibility in implementation. This approach does put more onus on the BYOI author, because they are responsible for their own heartbeating, but it is similar to a kubernetes pod bringing its own liveness check.

Another way to do this would be for MiqServer to scan the heartbeat files and sd_notify_pid on the workers' behalf. We would have to find the PID from the GUID, which has led to race conditions in the past, and having something time-critical run from a core process just to avoid using the sd_notify API feels like extra overhead. It also doesn't solve the starting/stopping timeout components of sd_notify.

There are go and python packages for sd_notify (https://github.com/coreos/go-systemd/blob/master/daemon/sdnotify.go, https://pypi.org/project/sdnotify/), and lacking a package you can always just write to the notifier socket, so the barrier to entry is very low.
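(For illustration, a minimal Ruby sketch of that "just write to the notifier socket" idea; this is not the code in this PR, and the sd_notify helper name is only an example. systemd puts the notification socket path in the NOTIFY_SOCKET environment variable and accepts plain-text datagrams such as READY=1.)

require "socket"

# Send a single sd_notify datagram; no-op when not running under systemd.
def sd_notify(state)
  path = ENV["NOTIFY_SOCKET"]
  return if path.nil? || path.empty?
  # Abstract sockets (path starting with "@") need extra handling, omitted here.

  sock = Socket.new(:UNIX, :DGRAM)
  begin
    sock.connect(Socket.pack_sockaddr_un(path))
    sock.write(state)
  ensure
    sock.close
  end
end

sd_notify("READY=1")    # worker finished starting up
sd_notify("WATCHDOG=1") # heartbeat: poke the watchdog
sd_notify("STOPPING=1") # worker is shutting down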
force-pushed from 93a0621 to 1ecb58e
app/models/miq_worker/runner.rb (outdated)
@@ -142,6 +142,7 @@ def starting_worker_record

  def started_worker_record
    reload_worker_record
    @worker.sd_notify_started if @worker.systemd_worker?
Don't love doing this here, it feels like we're going up to the worker instance for all of these, so maybe moving all of this into @worker.started! or something would be better? Just testing having systemd check the starting/stopping/watchdog timeouts, and this seemed to be the place where we note that something is started.
I'm hoping we can find all the places we're doing systemd/podified related checks and move them up instead of deep in methods like this. I don't know if we'll be able to do this until we have more implemented differently in each process management system.
force-pushed from 1ecb58e to 17ed39c
I am also concerned about bring your own image. While there are packages to do this for various languages, it feels like it might introduce complexities. For example, I assume FFI is a component of many of those packages, and so that complicates building the image, making sure ffi is up to date, etc.

In podified, we expect the worker to write to a file for a heartbeat, which is dead simple, and then separately we bring the liveness check script which essentially just checks the heartbeat file. In appliances I was thinking a similar liveness check (perhaps the same one?) that the orchestrator could run and then notify systemd (I think this is nearly the same thing you said in the last paragraph of #20840 (comment)).

However, I didn't think that processes could notify on behalf of other processes. In fact, this was the reason that I was ultimately against using systemd-notify. But if a process can notify on behalf of another, that changes everything.
Do we? I thought the heartbeat file had the PID already in it. If we are scanning heartbeat files, then I don't think we need anything.
Actually none of these use FFI, as sd_notify isn't a binary library; you just write to a socket whose path systemd passes in an env var (see "lacking a package you can always just write to the notifier socket so the barrier to entry is very low.")
True, bring your own image might mean bring your own liveness check as well, e.g. a provider could bring a podified template and a service file for a worker that is completely opaque to us. Trying to further decouple from core as much as possible.
A heartbeat file is just a timestamp the worker writes periodically. What we could do is have a bash script that heartbeats for the worker, which is similar to how podified calls a script we provide and miq_server isn't involved. I'm just hesitant to add more worker management logic into miq_server beyond dictating what should be running.
Oh that's huge... I read the docs and completely missed that bit. If that's the case, writing to a file descriptor is simple enough that I'm good with that approach.
I agree...we may even want to write the GUID into the heartbeat file.
It's really important for the worker to do its own heartbeating. If an external entity heartbeats for it, all it is really checking is the existence of a process, whereas a worker doing its own heartbeating also catches issues like infinite loops or unexpectedly long-running calls.
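(For context, a rough sketch of the heartbeat-file pattern described in this thread; the file path, env var, and timeout below are illustrative, not the project's actual values.)

require "fileutils"

HEARTBEAT_FILE = "tmp/#{ENV["WORKER_GUID"]}.hb".freeze # hypothetical per-worker file
HEARTBEAT_TIMEOUT = 120                                # seconds, hypothetical

# Worker side: called from the worker's own loop, so a hung worker
# (infinite loop, stuck long-running call) stops heartbeating too.
def heartbeat!
  FileUtils.touch(HEARTBEAT_FILE)
end

# Liveness-check side: a separate script/orchestrator only reads the mtime.
def alive?
  File.exist?(HEARTBEAT_FILE) &&
    (Time.now - File.mtime(HEARTBEAT_FILE)) < HEARTBEAT_TIMEOUT
end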
force-pushed from 645d816 to e6f6355
Notify the systemd supervisor when a worker is finished starting up, when it is shutting down, and poke the watchdog for heartbeating
force-pushed from 26185c0 to 49de68a
force-pushed from 49de68a to 01da008
Checked commits agrare/manageiq@d1242f2~...01da008 with ruby 2.6.3, rubocop 0.82.0, haml-lint 0.35.0, and yamllint: app/models/miq_worker/systemd_common.rb
# worker monitoring (liveness, memory threshold) is handled by those
# systems. Only when workers are run as standalone processes does MiqServer
# have to monitor the workers itself.
return if podified? || systemd?
So looking at this method, it:
- Updates the MiqWorker#last_heartbeat from the tmp/guid.hb heartbeat files
- Validates the worker heartbeat and memory limits [ref]

After this PR neither of these will be necessary on k8s (no access to heartbeat files, and mem/cpu limits handled by k8s) or on systemd (systemd-notify for heartbeating and cgroups for mem/cpu limits).
Side note, in a lot of places we ask if podified? and if systemd? but we have no way of asking if we're using the third monitor type, process. Ideally here I'd like to say return unless worker_monitor_type_process? since we have three worker monitor types: k8s, systemd, and legacy/process.
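(A hypothetical sketch of what that suggestion could look like; the method names are illustrative only, not existing code.)

def worker_monitor_type
  return :kubernetes if podified?
  return :systemd    if systemd?

  :process # legacy: MiqServer monitors the workers itself
end

def worker_monitor_type_process?
  worker_monitor_type == :process
end

# The guard above would then read:
# return unless worker_monitor_type_process?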
Side note 2, I am seeing a few categories of "worker_monitor_types" that we make decisions on:
- external_liveness_monitor?
- external_readiness_monitor?
- external_resource_limit_monitor?
- isolated processes: decisions about where a piece of work can run: can't get process information for workers from the server or vice versa in pods.
I'm sure there are others but maybe we can come up with good names for these concerns.
This is a good point. I've been assuming that any "external worker monitor", aka k8s or systemd, will take care of all or most of the "every 15 seconds make sure the workers are okay" for us, but some may provide most but not all. I'm thinking when we move to a more pluggable worker monitor class these won't be questions that have to be asked but rather methods that are implemented or not, which should clean things up nicely (:crossed_fingers:)

Unsure of readiness vs liveness, is readiness the "are there enough resources to start this new worker" check?

Can you expand on "decisions about where a piece of work can run"? Is that "make MiqServer more like k8s and schedule workers across the appliances"? If so, yeah, that is a big one that I haven't even started to tackle yet; I think this is going to be a fundamental rethink of how we schedule workers (per server vs per region). We "got away with it" on podified because we went to a single MiqServer, but with actual appliance VMs maybe the worker counts should be at the region level and divvied up by the master MiqServer?
I'm thinking when we move to a more pluggable worker monitor class these won't be questions that have to be asked but rather methods that are implemented or not and should clean things up nicely
👍
Unsure of readiness vs liveness, is readiness the "are there enough resources to start this new worker" check?
liveness == heartbeat check
readiness == rails server is available (port 3000) for puma based workers
Can you expand on "decisions about where a piece of work can run"?
There are places that have expectations that the server and workers reside on the same process space / filesystem. For example: #20835
The workaround for these has been to make the right process fetch the environmental information for itself instead of doing it for someone else. Generally, it's been places where the assumption was that we needed to run on the miq server. Previously, any worker on that server was fine, but that doesn't work in pods, so it makes sense to make everything queue things for the correct process instead of assuming one process can see other processes. Like this: jrafanie@1c3d8a7
Another example: #20290
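(As a hedged illustration of "queue it for the process that owns the resource": something along these lines, where the class/method names and option values are made up and only the general MiqQueue.put shape is real.)

# Hypothetical example: queue the work for the server that can actually see
# the worker's PID instead of reaching into another process's space.
MiqQueue.put(
  :class_name  => "MiqWorker",
  :method_name => "collect_process_info", # hypothetical method name
  :instance_id => worker.id,
  :server_guid => worker.miq_server.guid  # target the owning server
)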
There are places that have expectations that the server and workers reside on the same process space / filesystem. For example
Ahh yes, okay, I see what you mean: basically anything doing "worker management" outside of MiqServer on the assumption that it can act on the PID directly (processInfo(pid) in the first example and kill(pid) in the second).
Exactly, and I believe we had a similar change for embedded ansible since that relied on the filesystem having the ansible repo checked out. It might have changed since, but the idea was that you can't assume the "server with embedded ansible" can access the locally checked out ansible repo, only the process that's guaranteed to have checked it out. I don't know what to call this, but the process isolation of process space (kill/processInfo(pid)) and filesystems (the ansible repo as an example) is what I'm describing.
LGTM...will let @jrafanie merge.
@@ -161,6 +161,7 @@ end

 group :systemd, :optional => true do
   gem "dbus-systemd", "~>1.1.0", :require => false
   gem "sd_notify", "~>0.1.0", :require => false
FYI, "A pure-Ruby implementation of sd_notify(3) that can be used to communicate state changes of Ruby programs to systemd." (no other dependencies and no C extensions 👏 )
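(A sketch of how the gem's API could be wired into a worker loop under a Type=notify unit; shutdown_requested? and do_work are hypothetical placeholders, and this is not the code added by this PR.)

require "sd_notify"

SdNotify.ready # READY=1: tell systemd the worker finished starting up

until shutdown_requested? # hypothetical shutdown flag
  do_work                 # hypothetical stand-in for the worker's real work
  SdNotify.watchdog       # WATCHDOG=1: heartbeat, must arrive within WatchdogSec
end

SdNotify.stopping # STOPPING=1: tell systemd the worker is shutting down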
Notify the systemd service manager when a worker is finished starting up, when it is shutting down, and poke the watchdog for heartbeating
Systemd has a watchdog system which tracks the status of services via "liveness check-ins" made through the systemd-notify API.
A watchdog timeout can be specified in the service file: https://www.freedesktop.org/software/systemd/man/systemd.service.html#WatchdogSec=
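(A hypothetical unit-file fragment showing where the watchdog timeout lives; the values and the exact options used by this PR's service files may differ.)

[Service]
# systemd waits for READY=1 before the unit counts as started
Type=notify
# the worker must send WATCHDOG=1 at least this often
WatchdogSec=120
# signal delivered on watchdog timeout (SIGABRT is the default)
WatchdogSignal=SIGABRT
Restart=on-failure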
When a worker is killed due to a heartbeat timeout, a SIGABRT is delivered:
TODO