fix prolog/epilog timeout handling #6677
Problem: A comment in the perilog jobtap plugin is outdated. Fix the comment to describe actual plugin behavior.
Problem: When a prolog/epilog timeout is configured, the perilog.so plugin delays the start of the timeout timer until the bulk-exec `on_start` callback is called. But this could result in the timer never being started if one or more tasks get stuck starting. Start the timeout timer immediately on launch of the prolog or epilog tasks. This removes the need for the start_cb(), so remove that function. Fixes flux-framework#6644
Problem: The perilog plugin raises an error if kill-timeout is set for the epilog, but there is no longer any reason to disallow this; in fact, it may be useful in the future. Remove the code that prevented kill-timeout from being set for the epilog. Update one test.
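The timer change described in the second commit message above can be illustrated with a toy model. This is only a sketch under stated assumptions, not flux-core code: `PerilogProc`, `launch_deferred`, and `launch_immediate` are hypothetical names, and "armed" simply stands in for the timeout timer having been started. It assumes, as the commit message describes, that the `on_start` callback does not fire when a task gets stuck starting.

```python
from dataclasses import dataclass


@dataclass
class PerilogProc:
    """Toy stand-in for one prolog/epilog run (hypothetical, not flux-core code)."""

    timeout: float            # configured timeout in seconds; 0 means no timeout
    all_tasks_started: bool   # False models a task stuck during startup
    timer_armed: bool = False

    def arm_timer(self) -> None:
        # Only arm the timer when a timeout is actually configured.
        if self.timeout > 0:
            self.timer_armed = True


def launch_deferred(proc: PerilogProc) -> PerilogProc:
    """Old behavior: the timer is armed only from the on_start callback."""
    if proc.all_tasks_started:  # assumed: callback never fires if a task is stuck
        proc.arm_timer()
    return proc


def launch_immediate(proc: PerilogProc) -> PerilogProc:
    """New behavior: the timer is armed at launch, regardless of task startup."""
    proc.arm_timer()
    return proc


stuck_old = launch_deferred(PerilogProc(timeout=30.0, all_tasks_started=False))
stuck_new = launch_immediate(PerilogProc(timeout=30.0, all_tasks_started=False))
print(stuck_old.timer_armed, stuck_new.timer_armed)  # prints: False True
```

In this model, a stuck task leaves the deferred variant with no timer at all, which matches the symptom reported in #6644; arming at launch removes the dependency on the start callback.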
I don't know the perilog too well, but this seems to make sense to me. Just some English stuff in the docs.
doc/man5/flux-config-job-manager.rst
Outdated
seconds to wait until any nodes with prolog tasks that are still
active will be drained. The drain reason will include the string
"canceled then timed out". The default is 60.
(optional, float) If a the prolog times out, or a job exception is raised
both here and in epilog, "If a the" -> "If the"?
doc/man5/flux-config-job-manager.rst
Outdated
(optional, bool) A boolean indicating whether a fatal job exception is
raised while the prolog is active terminates the prolog. The default
both here and in epilog, first sentence seems off to me with the "while the prolog is active terminates the prolog". Perhaps reorder the sentence, "A boolean indicating if a prolog should be terminated if a fatal job exception is raised while the prolog is active."?
Oops, this was an edit issue - two versions of the sentence remain grafted together here. Anyway, I like your suggestion so let's use that.
Problem: The documentation of the prolog and epilog configuration values in flux-config-job-manager(5) is unclear in many ways and incorrect in some ways. Amend the documentation for clarity and correctness.
Thanks! I've fixed the wording as suggested and will set MWP.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6677      +/-   ##
==========================================
- Coverage   83.86%   83.85%   -0.01%
==========================================
  Files         534      534
  Lines       88939    88931       -8
==========================================
- Hits        74588    74576      -12
- Misses      14351    14355       +4
This PR adds a probable fix for #6644 (epilog doesn't time out), plus some other bits of cleanup.