
avoid scheduling jobs on compute nodes that are not cleaned up #6616

Merged
14 commits merged into flux-framework:master from the issue#6590 branch on Mar 1, 2025

Conversation

@garlick (Member) commented Feb 7, 2025

Problem: after a broker restart, "stuck" housekeeping, epilog, prolog, or job shell tasks might still be running, but flux is unaware of them. New work may then be scheduled on the node even though those lingering tasks may indicate a problem or may still be holding resources.

When those things are run under systemd, we have the machinery for finding and tracking them readily at hand.

This PR does the following:

  • modifies sdbus so it can be loaded twice, once for the user and once for the system systemd instance
  • adds an sdmon monitoring module that tracks running flux units on both buses
  • has sdmon join an sdmon.idle broker group at startup, once it has verified that no units are running
  • modifies the resource module so it monitors sdmon.idle instead of broker.online when configured to use systemd

The net effect is that nodes that require cleanup remain offline until the node is cleaned up.
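
For orientation, all of this is gated on the systemd configuration; a minimal sketch of enabling it in a system instance might look like the following (the conf.d path and file name are illustrative assumptions; the systemd.enable key is the one referenced in the commit messages below):

  # sketch: turn on systemd integration so sdmon is loaded and resource
  # monitors the sdmon group instead of broker.online
  printf '[systemd]\nenable = true\n' | sudo tee /etc/flux/system/conf.d/systemd.toml
  sudo systemctl restart flux   # assuming the system instance runs as the flux.service unit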

sdmon also logs the systemd units it has found. Here's an example where I kill -9 a broker while housekeeping is running, then start it back up again

[  +0.192069] sdmon.err[7]: [email protected] needs cleanup - resources are offline
...
[ +26.900866] sdmon.err[7]: cleanup complete - resources are online

Before cleanup is complete, flux resource status reports

      STATE UP NNODES NODELIST
       avail  ✔      7 picl[0-6]
      avail*  ✗      1 picl7

Seems like this does the bare minimum to resolve #6590

This does seem a bit thin in the tooling department. The possibilities are pretty broad, so for now I wanted to get this posted and get feedback on the way the resource module is tied into sdmon using broker groups.

@grondo (Contributor) commented Feb 7, 2025

Wow, nice! I don't have any qualms with using an sdmon.idle broker group for the current implementation.

It seems like eventually we'd want various subsystems to be able to recapture their state from what sdmon has found (for example, the job execution system could reconnect to running jobs after a restart, or terminate jobs that are not supposed to be running, and the job manager could do something similar for prolog/epilog and housekeeping). Any thoughts on how that might work? I realize bringing that up is a bit premature, but it could inform the solution here as a stepping stone. (I guess one thought is that as state is able to be recaptured, this would reduce the list of things that prevent a broker from joining sdmon.idle.)

Also, since the sdmon.idle group is never left once joined (except in the case of a broker restart, I'm assuming?), I'm wondering if there's a better term. I can't think of anything though, and it doesn't seem important enough to worry about now (especially since it isn't really user/admin visible).

@garlick (Member, Author) commented Feb 7, 2025

Maybe sdmon.online is a better name for the group. (It started out as an idle group that was kept up to date and caused resource to post idle/busy events. But then I realized it could be a lot simpler, and I didn't revisit the group name.)

@garlick (Member, Author) commented Feb 7, 2025

Renamed the group and fixed a spelling error in a test caught in CI.

This still needs a test for the resource module portion of the proposed change so I'll leave it WIP for the moment.

@garlick changed the title from "WIP avoid scheduling jobs on compute nodes that are not cleaned up" to "avoid scheduling jobs on compute nodes that are not cleaned up" on Feb 7, 2025
@garlick (Member, Author) commented Feb 8, 2025

I added the missing test, so I'll drop the WIP.

One thing I should do before we merge this, though, is make sure the systemd shipped with RHEL 8 allows sdbus to authenticate to it with flux credentials. I'll try to test that on fluke.

@garlick (Member, Author) commented Feb 8, 2025

Yep, that worked:

2025-02-07T16:41:43.412549-08:00 sdbus.info[0]: sd_bus_open_system: connected

@garlick (Member, Author) commented Feb 8, 2025

(I guess one thought is that as state is able to be recaptured, this would reduce the list of things that prevent a broker from joining sdmon.idle.)

Right, I like that way of thinking about it.

Hmm, we should also be trying to capture the state of any units that have completed but weren't reaped, and put that in a lost+found or something. I need to refresh my memory on what happens to that state for the cases we're discussing here (the templated system units and transient user units). That could be a follow-on PR.

But anyway, yeah, if a running unit can be reclaimed, we could let the node join sdmon.online before it terminates.

@garlick (Member, Author) commented Feb 28, 2025

Since #6662 is likely not feasible to get into our next release, it seems like this could be considered as is.

@garlick (Member, Author) commented Feb 28, 2025

Oh we should make sure this doesn't cause problems for

I will be offline for today but will double check that soon

@grondo (Contributor) left a comment

This one looks good to me. I'm excited by the prospects of the sdmon module!

My only comment is that it would be nice if ranks offline due to sdmon could somehow bubble up the errors to a central location (e.g. something like flux overlay errors) so admins can easily tell the difference between "offline due to flux not started" vs "offline due to orphaned work". However, I'm not thinking of an easy way to provide that (or maybe there is already a way I'm not seeing) and it certainly shouldn't hold up this PR.

@grondo (Contributor) commented Feb 28, 2025

Oh we should make sure this doesn't cause problems for

What problems do you foresee there? The shrink should only affect jobs which aren't running sdmon, but I'm probably missing something obvious. (Just thought maybe I can do some testing today while you're out.)

@garlick (Member, Author) commented Feb 28, 2025

Just wanted to double check that this wasn't going to cause a problem:

  • modify the resource module so it monitors sdmon.idle instead of broker.online when configured to use systemd

@grondo (Contributor) commented Feb 28, 2025

I did install this PR branch onto my test cluster and performed some quick checks of shrink functionality, and did not notice any effects. Since batch and alloc jobs won't have systemd.enable set, these instances won't even have the sdmon module loaded, so as far as shrink in the resource acquisition protocol is concerned, there's no effect.

I also tested that the resource module notices a rank suddenly disconnecting in the system instance, and this results in the expected behavior in the child. I wasn't really sure what else to check.

@garlick (Member, Author) commented Mar 1, 2025

I think this is OK since sdmon.online (formerly sdmon.idle) is only joined, never left.

@garlick (Member, Author) commented Mar 1, 2025

Good thought on adding something to flux overlay status. Maybe we could somehow annotate brokers that are online but not present in the sdmon.online set. The lack of tools does seem to be the weak link here.

Another thought is to build a tool around sdmon that can do things like list orphaned units and signal them.

@garlick (Member, Author) commented Mar 1, 2025

Well, I'll go ahead and set MWP here to keep things moving along. Thanks!

garlick added 6 commits March 1, 2025 19:55
Problem: a comment has an extra "to" that makes the sentence incorrect.

Drop the extra word.
Problem: the timer used by sdbus_connect() is hard to modify because
of the embedded error handling.

Extract a function for building the user bus path for the error log.
Now the timer is a bit simpler.
Problem: the sdbus module is hardwired to connect to a systemd
user instance, but Flux now has "work" running in the systemd
system instance as well (prolog, epilog, housekeeping).

Add a "system" module option which directs sdbus to connect
to the systemd system instance instead.  Future commits will allow
a second instance of the sdbus module to be loaded with this option
so access to both systemd instances can be handled concurrently.
Problem: the sdbus system option has no coverage.

Amend the 2407-sdbus.t test with a simple test of "system mode".
Problem: the sdbus module can only be loaded once because it
uses an explicit service name.

Drop the outdated MOD_NAME() symbol declaration.
Register methods in a way that lets the default service name change.
Update the self-contacting "subscribe" composite RPC to determine
the topic string to contact programmatically.

Now the module can be loaded as many times as we like using e.g.
  flux module load --name NEWNAME sdbus
Problem: there are no tests for loading sdbus under a different name.

Modify the system test to load sdbus under the name "sdbus-sys" in system
mode instead of reloading the module.  Show that it works for listing
units in the system instance.
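
Taken together with the previous commit, this makes a sequence like the following possible (a sketch based on the test; "sdbus-sys" is the name used there and "system" is the new module option):

  # sketch: load sdbus twice, once per systemd instance
  flux module load sdbus                           # user instance (default)
  flux module load --name sdbus-sys sdbus system   # second copy, connected to the system instance
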
garlick added 8 commits March 1, 2025 19:55
Problem: when the system is configured to use systemd, sdbus is only
loaded for the systemd user instance.

Load sdbus-sys as well.
Problem: some libsdexec RPCs can now be directed to different
services to reach the systemd system or user instance.

Add a service parameter to the following functions:
  sdexec_list_units()
  sdexec_property_get()
  sdexec_property_get_all()
  sdexec_property_changed()

Update sdexec.
Update tests.
Problem: there is no mechanism to track systemd units across
a broker restart.

Add a broker module that creates and maintains a list of running
flux systemd units.

This monitors two instances of systemd:
- the user one, running as user flux (where jobs are run)
- the system one (where housekeeping, prolog, epilog run)

A list of units matching flux unit globs is requested at initialization,
and a subscription to property updates on those globs is obtained.
After the initial list, monitoring is driven solely by property updates.

Join the sdmon.online broker group once the node is demonstrably idle.
This lets the resource module on rank 0 notice compute nodes that need
cleanup at restart and withhold them from the scheduler.  Once the group
is joined, sdmon does not explicitly leave it.  It implicitly leaves the
group if sdmon is unloaded or the node goes offline/lost.

If there are running units at startup, log this information at LOG_ERR
level, and again when the units are cleaned up, e.g.

 [email protected] needs cleanup - resources are offline
 cleanup complete - resources are online

In the future, this module's role could be expanded to support tooling
for listing running work and obtaining runtime information such as
pids and cgroup resource parameters.  It could also play a role in
informing other flux components about work that should be re-attached
after a full or partial restart, when support for that is added.
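
Roughly speaking, the system-instance units sdmon watches are the ones an admin could also inspect by hand, along the lines of the following (the glob is illustrative, not necessarily the exact pattern sdmon subscribes to):

  # sketch: housekeeping/prolog/epilog run as templated system units,
  # e.g. [email protected]; job shells run as transient units
  # in the user instance of user flux and would be listed there analogously
  systemctl list-units 'flux-*'
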
Problem: the sdmon module is not loaded by default.

Load it if systemd.enable = true in the configuration.
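
A quick way to confirm the module came up after enabling that key might be (a sketch):

  flux module list | grep sdmon   # sdmon should appear when systemd.enable = true
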
Problem: the monitor subsystem of the resource module needs to
know whether the "sdmon.online" broker group will be populated.

Parse the enable key from [systemd].
Pass the whole resource_config struct to the monitor subsystem
instead of just monitor_force_up.
Problem: nodes are not checked for untracked running work when a
Flux instance starts up.

This might happen, for example, if
- job-exec deems job shell(s) unkillable
- housekeeping/prolog/epilog gets stuck on a hung file system
- the broker exits without proper shutdown

When systemd is enabled, the new sdmon module joins the 'sdmon.online'
broker group on startup.  However, if there are any running flux units,
this is delayed until those units are no longer running.

Change the resource module so that it monitors sdmon.online instead of
broker.online when systemd is enabled.  This will withhold "busy" nodes
from the scheduler until they become idle.

Fixes flux-framework#6590
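
From an operator's point of view, the resulting state is visible with existing commands, for example (a sketch, following the log and output shown in the PR description):

  flux resource status      # a node awaiting cleanup stays offline (shown as "avail*" / ✗)
  flux dmesg | grep sdmon   # e.g. "[email protected] needs cleanup - resources are offline"
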
Problem: there is no test coverage for the sdmon module.

Add a new sharness script.
Problem: there is no test coverage for the resource module's
behavior when systemd is configured and sdmon is providing
sdmon.online.

Add a sharness script for that.
mergify bot merged commit 5ee36b4 into flux-framework:master on Mar 1, 2025
33 checks passed
codecov bot commented Mar 1, 2025

Codecov Report

Attention: Patch coverage is 74.31907% with 66 lines in your changes missing coverage. Please review.

Project coverage is 83.86%. Comparing base (ff486ca) to head (6b86e5f).
Report is 15 commits behind head on master.

Files with missing lines           Patch %    Lines
src/modules/sdmon/sdmon.c          71.87%     54 Missing ⚠️
src/modules/sdbus/sdbus.c          57.14%      6 Missing ⚠️
src/modules/sdbus/connect.c        82.60%      4 Missing ⚠️
src/common/libsdexec/property.c    77.77%      2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master    #6616    +/-   ##
========================================
  Coverage   83.85%   83.86%            
========================================
  Files         533      534     +1     
  Lines       88695    88929   +234     
========================================
+ Hits        74379    74584   +205     
- Misses      14316    14345    +29     
Files with missing lines           Coverage Δ
src/common/libsdexec/list.c        92.85% <100.00%> (+50.54%) ⬆️
src/modules/resource/monitor.c     71.75% <100.00%> (+0.39%) ⬆️
src/modules/resource/resource.c    86.56% <100.00%> (+0.13%) ⬆️
src/modules/sdbus/main.c           69.23% <100.00%> (ø)
src/modules/sdbus/subscribe.c      70.00% <100.00%> (+7.50%) ⬆️
src/modules/sdexec/sdexec.c        68.21% <ø> (+0.48%) ⬆️
src/common/libsdexec/property.c    71.59% <77.77%> (+0.85%) ⬆️
src/modules/sdbus/connect.c        77.27% <82.60%> (+9.27%) ⬆️
src/modules/sdbus/sdbus.c          69.42% <57.14%> (-0.84%) ⬇️
src/modules/sdmon/sdmon.c          71.87% <71.87%> (ø)

... and 10 files with indirect coverage changes

@garlick deleted the issue#6590 branch on March 3, 2025 13:49

Successfully merging this pull request may close these issues.

after an emergency restart, flux doesn't know about user processes that are still running