Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix prolog/epilog timeout handling #6677

Merged
merged 4 commits into from
Mar 4, 2025
Merged

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Mar 3, 2025

This PR adds a probable fix for #6644 (epilog doesn't time out), plus some other bits of cleanup.

grondo added 3 commits March 3, 2025 07:19
Problem: A comment in the perilog jobtap plugin is outdated.

Fix the comment to describe actual plugin behavior.
Problem: When a prolog/epilog timeout is configured, the perilog.so
plugin delays the start of the timeout timer until the bulk-exec
`on_start` callback is called. But this could result in the timer
never being started if one or more tasks get stuck starting.

Start the timeout timer immediately on launch of the prolog or
epilog tasks. This removes the need for the start_cb(), so remove
that function.

Fixes flux-framework#6644
Problem: The perilog plugin raises an error if the kill-timeout
is set for the epilog, but there is no reason any longer to disallow
this, in fact it may be useful in the future.

Remove the code that prevented kill-timeout from being set for the
epilog. Update one test.
Copy link
Member

@chu11 chu11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't know the perilog too well but seems to make sense to me. Just some English stuff in the docs.

seconds to wait until any nodes with prolog tasks that are still
active will be drained. The drain reason will include the string
"canceled then timed out". The default is 60.
(optional, float) If a the prolog times out, or a job exception is raised
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both here and in epilog, "If a the" -> "If the"?

Comment on lines 135 to 136
(optional, bool) A boolean indicating whether a fatal job exception is
raised while the prolog is active terminates the prolog. The default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both here and in epilog, first sentence seems off to me with the "while the prolog is active terminates the prolog". Perhaps reorder the sentence, "A boolean indicating if a prolog should be terminated if a fatal job exception is raised while the prolog is active."?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, this was an edit issue - two versions of the sentence remain grafted together here. Anyway, I like your suggestion so let's use that.

Problem: The documentation of the prolog and epilog configuration
values in flux-config-job-manager(5) is unclear in many ways and
incorrect in some ways.

Amend the documentation for clarity and correctness.
@grondo
Copy link
Contributor Author

grondo commented Mar 3, 2025

Thanks! I've fixed the wording as suggested and will set MWP.

@mergify mergify bot merged commit 621c7b8 into flux-framework:master Mar 4, 2025
35 checks passed
Copy link

codecov bot commented Mar 4, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.85%. Comparing base (7c6746e) to head (13f4f73).
Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6677      +/-   ##
==========================================
- Coverage   83.86%   83.85%   -0.01%     
==========================================
  Files         534      534              
  Lines       88939    88931       -8     
==========================================
- Hits        74588    74576      -12     
- Misses      14351    14355       +4     
Files with missing lines Coverage Δ
src/modules/job-manager/plugins/perilog.c 82.43% <100.00%> (-0.21%) ⬇️

... and 9 files with indirect coverage changes

@grondo grondo deleted the issue#6644 branch March 4, 2025 00:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants