
avoid scheduling jobs on compute nodes that are not cleaned up #6616

Merged
14 commits merged into flux-framework:master from the issue#6590 branch on Mar 1, 2025

Conversation

@garlick (Member) commented Feb 7, 2025

Problem: after a broker restart, "stuck" housekeeping, epilog, prolog, or job shell tasks might still be running, but flux is unaware of them. New work may then be scheduled on the node even though those lingering tasks may indicate a problem or may still be holding resources.

When those things are run under systemd, we have the machinery for finding and tracking them readily at hand.

This PR does the following:

  • modifies sdbus so it can be loaded twice, once for the user and once for the system systemd instance
  • adds an sdmon monitoring module that tracks running flux units on both buses
  • has sdmon join an sdmon.idle broker group at startup, once it has verified that no units are running
  • modifies the resource module so it monitors sdmon.idle instead of broker.online when configured to use systemd

The net effect is that nodes that require cleanup remain offline until the node is cleaned up.
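
For orientation, all of this is gated on the systemd configuration; a minimal sketch of enabling it in a system instance might look like the following (the conf.d path and file name are illustrative assumptions; the systemd.enable key is the one referenced in the commit messages below):

  # sketch: turn on systemd integration so sdmon is loaded and resource
  # monitors the sdmon group instead of broker.online
  printf '[systemd]\nenable = true\n' | sudo tee /etc/flux/system/conf.d/systemd.toml
  sudo systemctl restart flux   # assuming the system instance runs as the flux.service unit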

sdmon also logs the systemd units it has found. Here's an example where I kill -9 a broker while housekeeping is running, then start it back up again

[  +0.192069] sdmon.err[7]: [email protected] needs cleanup - resources are offline
...
[ +26.900866] sdmon.err[7]: cleanup complete - resources are online

Before cleanup is complete, flux resource status reports

      STATE UP NNODES NODELIST
       avail  ✔      7 picl[0-6]
      avail*  ✗      1 picl7

Seems like this does the bare minimum to resolve #6590

This does seem a bit thin in the tooling department. The possibilities are pretty broad, so for now I wanted to get this posted and get feedback on the way the resource module is tied into sdmon using broker groups.

@grondo (Contributor) commented Feb 7, 2025

Wow, nice! I don't have any qualms with using an sdmon.idle broker group for the current implementation.

It seems like eventually we'd want various subsystems to be able to recapture their state from what sdmon has found (for example, the job execution system could reconnect to running jobs after a restart, or terminate jobs that are not supposed to be running, and the job manager could do something similar for prolog/epilog and housekeeping). Any thoughts on how that might work? I realize bringing that up is a bit premature, but it could inform the solution here as a stepping stone. (I guess one thought is that as state is able to be recaptured, this would reduce the list of things that prevent a broker from joining sdmon.idle.)

Also, since the sdmon.idle group is never left once joined (except in the case of a broker restart, I'm assuming?), I'm wondering if there's a better term. I can't think of anything though, and it doesn't seem important enough to worry about now (especially since it isn't really user/admin visible).

@garlick (Member, Author) commented Feb 7, 2025

Maybe sdmon.online is a better name for the group. (It started out as an idle group that was kept up to date and caused resource to post idle/busy events. But then I realized it could be a lot simpler, and I didn't revisit the group name.)

@garlick (Member, Author) commented Feb 7, 2025

Renamed the group and fixed a spelling error in a test caught in CI.

This still needs a test for the resource module portion of the proposed change so I'll leave it WIP for the moment.

@garlick changed the title from "WIP avoid scheduling jobs on compute nodes that are not cleaned up" to "avoid scheduling jobs on compute nodes that are not cleaned up" on Feb 7, 2025
@garlick (Member, Author) commented Feb 8, 2025

I added the missing test, so I'll drop the WIP.

One thing I should do before we merge this, though, is make sure the systemd shipped with RHEL 8 allows sdbus to authenticate to it with flux credentials. I'll try to test that on fluke.

@garlick (Member, Author) commented Feb 8, 2025

Yep, that worked:

2025-02-07T16:41:43.412549-08:00 sdbus.info[0]: sd_bus_open_system: connected

@garlick (Member, Author) commented Feb 8, 2025

(I guess one thought is that as state is able to be recaptured, this would reduce the list of things that prevent a broker from joining sdmon.idle.)

Right, I like that way of thinking about it.

Hmm, we should also be trying to capture the state of any units that have completed but weren't reaped, and put that in a lost+found or something. I need to refresh my memory on what happens to that state for the cases we're discussing here (the templated system units and transient user units). That could be a follow-on PR.

But anyway, yeah, if a running unit can be reclaimed, we could let the node join sdmon.online before it terminates.

@garlick (Member, Author) commented Feb 28, 2025

Since #6662 is likely not feasible to get into our next release, it seems like this could be considered as is.

@garlick (Member, Author) commented Feb 28, 2025

Oh we should make sure this doesn't cause problems for

I will be offline for today but will double check that soon

@grondo (Contributor) left a comment

This one looks good to me. I'm excited by the prospects of the sdmon module!

My only comment is that it would be nice if ranks offline due to sdmon could somehow bubble up the errors to a central location (e.g. something like flux overlay errors) so admins can easily tell the difference between "offline due to flux not started" vs "offline due to orphaned work". However, I'm not thinking of an easy way to provide that (or maybe there is already a way I'm not seeing) and it certainly shouldn't hold up this PR.

@grondo (Contributor) commented Feb 28, 2025

Oh we should make sure this doesn't cause problems for

What problems do you foresee there? The shrink should only affect jobs which aren't running sdmon, but I'm probably missing something obvious. (Just thought maybe I can do some testing today while you're out.)

@garlick (Member, Author) commented Feb 28, 2025

Just wanted to double check that this wasn't going to cause a problem:

  • modify the resource module so it monitors sdmon.idle instead of broker.online when configured to use systemd

@grondo (Contributor) commented Feb 28, 2025

I did install this PR branch onto my test cluster and performed some quick checks of shrink functionality, and did not notice any effects. Since batch and alloc jobs won't have systemd.enable set, these instances won't even have the sdmon module loaded, so as far as shrink in the resource acquisition protocol is concerned, there's no effect.

I also tested that the resource module notices a rank suddenly disconnecting in the system instance, and this results in the expected behavior in the child. I wasn't really sure what else to check.

@garlick (Member, Author) commented Mar 1, 2025

I think this is OK since sdmon.online (formerly sdmon.idle) is only joined, never left.

@garlick (Member, Author) commented Mar 1, 2025

Good thought on adding something to flux overlay status. Maybe we could somehow annotate brokers that are online but not present in the sdmon.online set. The lack of tools does seem to be the weak link here.

Another thought is to build a tool around sdmon that can do things like list orphaned units and signal them.

@garlick (Member, Author) commented Mar 1, 2025

Well, I'll go ahead and set MWP here to keep things moving along. Thanks!

garlick added 6 commits March 1, 2025 19:55
Problem: a comment has an extra "to" that makes the sentence incorrect.

Drop the extra word.
Problem: the timer used by sdbus_connect() is hard to modify because
of the embedded error handling.

Extract a function for building the user bus path for the error log.
Now the timer is a bit simpler.
Problem: the sdbus module is hardwired to connect to a systemd
user instance, but Flux now has "work" running in the systemd
system instance as well (prolog, epilog, housekeeping).

Add a "system" module option which directs sdbus to connect
to the systemd system instance instead.  Future commits will allow
a second instance of the sdbus module to be loaded with this option
so access to both systemd instances can be handled concurrently.
Problem: the sdbus system option has no coverage.

Amend the 2407-sdbus.t test with a simple test of "system mode".
Problem: the sdbus module can only be loaded once because it
uses an explicit service name.

Drop the outdated MOD_NAME() symbol declaration.
Register methods in a way that lets the default service name change.
Update the self-contacting "subscribe" composite RPC to determine
the topic string to contact programmatically.

Now the module can be loaded as many times as we like using e.g.
  flux module load --name NEWNAME sdbus
Problem: there are no tests for loading sdbus under a different name.

Modify the system test to load sdbus under the name "sdbus-sys" in system
mode instead of reloading the module.  Show that it works for listing
units in the system instance.
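
Taken together with the previous commit, this makes a sequence like the following possible (a sketch based on the test; "sdbus-sys" is the name used there and "system" is the new module option):

  # sketch: load sdbus twice, once per systemd instance
  flux module load sdbus                           # user instance (default)
  flux module load --name sdbus-sys sdbus system   # second copy, connected to the system instance
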
garlick added 8 commits March 1, 2025 19:55
Problem: when the system is configured to use systemd, sdbus is only
loaded for the systemd user instance.

Load sdbus-sys as well.
Problem: some libsdexec RPCs can now be directed to different
services to reach the systemd system or user instance.

Add a service parameter to the following functions:
  sdexec_list_units()
  sdexec_property_get()
  sdexec_property_get_all()
  sdexec_property_changed()

Update sdexec.
Update tests.
Problem: there is no mechanism to track systemd units across
a broker restart.

Add a broker module that creates and maintains a list of running
flux systemd units.

This monitors two instances of systemd:
- the user one, running as user flux (where jobs are run)
- the system one (where housekeeping, prolog, epilog run)

A list of units matching flux unit globs is requested at initialization,
and a subscription to property updates on those globs is obtained.
After the initial list, monitoring is driven solely by property updates.

Join the sdmon.online broker group once the node is demonstrably idle.
This lets the resource module on rank 0 notice compute nodes that need
cleanup at restart and withhold them from the scheduler.  Once the group
is joined, sdmon does not explicitly leave it.  It implicitly leaves the
group if sdmon is unloaded or the node goes offline/lost.

If there are running units at startup, log this information at LOG_ERR
level, and again when the units are cleaned up, e.g.

 [email protected] needs cleanup - resources are offline
 cleanup complete - resources are online

In the future, this module's role could be expanded to support tooling
for listing running work and obtaining runtime information such as
pids and cgroup resource parameters.  It could also play a role in
informing other flux components about work that should be re-attached
after a full or partial restart, when support for that is added.
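
Roughly speaking, the system-instance units sdmon watches are the ones an admin could also inspect by hand, along the lines of the following (the glob is illustrative, not necessarily the exact pattern sdmon subscribes to):

  # sketch: housekeeping/prolog/epilog run as templated system units,
  # e.g. [email protected]; job shells run as transient units
  # in the user instance of user flux and would be listed there analogously
  systemctl list-units 'flux-*'
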
Problem: the sdmon module is not loaded by default.

Load it if systemd.enable = true in the configuration.
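
A quick way to confirm the module came up after enabling that key might be (a sketch):

  flux module list | grep sdmon   # sdmon should appear when systemd.enable = true
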
Problem: the monitor subsystem of the resource module needs to
know whether the "sdmon.online" broker group will be populated.

Parse the enable key from [systemd].
Pass the whole resource_config struct to the monitor subsystem
instead of just monitor_force_up.
Problem: nodes are not checked for untracked running work when a
Flux instance starts up.

This might happen, for example, if
- job-exec deems job shell(s) unkillable
- housekeeping/prolog/epilog gets stuck on a hung file system
- the broker exits without proper shutdown

When systemd is enabled, the new sdmon module joins the 'sdmon.online'
broker group on startup.  However, if there are any running flux units,
this is delayed until those units are no longer running.

Change the resource module so that it monitors sdmon.online instead of
broker.online when systemd is enabled.  This will withhold "busy" nodes
from the scheduler until they become idle.

Fixes flux-framework#6590
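
From an operator's point of view, the resulting state is visible with existing commands, for example (a sketch, following the log and output shown in the PR description):

  flux resource status      # a node awaiting cleanup stays offline (shown as "avail*" / ✗)
  flux dmesg | grep sdmon   # e.g. "[email protected] needs cleanup - resources are offline"
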
Problem: there is no test coverage for the sdmon module.

Add a new sharness script.
Problem: there is no test coverage for the resource module's
behavior when systemd is configured and sdmon is providing
sdmon.online.

Add a sharness script for that.
mergify bot merged commit 5ee36b4 into flux-framework:master on Mar 1, 2025
33 checks passed
codecov bot commented Mar 1, 2025

Codecov Report

Attention: Patch coverage is 74.31907% with 66 lines in your changes missing coverage. Please review.

Project coverage is 83.86%. Comparing base (ff486ca) to head (6b86e5f).
Report is 15 commits behind head on master.

Files with missing lines           Patch %    Lines
src/modules/sdmon/sdmon.c          71.87%     54 Missing ⚠️
src/modules/sdbus/sdbus.c          57.14%      6 Missing ⚠️
src/modules/sdbus/connect.c        82.60%      4 Missing ⚠️
src/common/libsdexec/property.c    77.77%      2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master    #6616    +/-   ##
========================================
  Coverage   83.85%   83.86%            
========================================
  Files         533      534     +1     
  Lines       88695    88929   +234     
========================================
+ Hits        74379    74584   +205     
- Misses      14316    14345    +29     
Files with missing lines           Coverage Δ
src/common/libsdexec/list.c        92.85% <100.00%> (+50.54%) ⬆️
src/modules/resource/monitor.c     71.75% <100.00%> (+0.39%) ⬆️
src/modules/resource/resource.c    86.56% <100.00%> (+0.13%) ⬆️
src/modules/sdbus/main.c           69.23% <100.00%> (ø)
src/modules/sdbus/subscribe.c      70.00% <100.00%> (+7.50%) ⬆️
src/modules/sdexec/sdexec.c        68.21% <ø> (+0.48%) ⬆️
src/common/libsdexec/property.c    71.59% <77.77%> (+0.85%) ⬆️
src/modules/sdbus/connect.c        77.27% <82.60%> (+9.27%) ⬆️
src/modules/sdbus/sdbus.c          69.42% <57.14%> (-0.84%) ⬇️
src/modules/sdmon/sdmon.c          71.87% <71.87%> (ø)

... and 10 files with indirect coverage changes

@garlick deleted the issue#6590 branch on March 3, 2025 13:49

Successfully merging this pull request may close these issues.

after an emergency restart, flux doesn't know about user processes that are still running