raise non-fatal exception on epilog failure #6669
LGTM!
I think you could argue that an epilog failure should be a fatal job exception. For example, if an epilog action is ensuring that job output is on stable storage.
However, I'm sure there are arguments both ways and this is certainly a step in the right direction!
I did consider that; however, it did not seem to be the common case, and a fatal epilog exception would mask the actual exit code of the job itself. Maybe this should be controlled by the epilog itself or by configuration in the future.
Problem: The generation of the prolog exception message is embedded in the `finish` event callback and is specific to the job prolog. This makes it not reusable for the job epilog. Abstract the error message generation into an exception_errmsg() function which can be reused for the epilog.
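As a rough illustration of the refactoring described above, the sketch below shows one way a shared message helper could serve both the prolog and the epilog. The name `perilog_errmsg()`, its parameters, and the message wording are all hypothetical; the actual `exception_errmsg()` in this PR may look quite different.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the exception_errmsg() helper named in the
 * commit message. The real signature and message text may differ.
 *
 * name:      "prolog" or "epilog"
 * exit_code: exit code of the script, or -1 if it was killed by a signal
 * signo:     terminating signal number (only used when exit_code < 0)
 * buf, len:  caller-provided buffer that receives the message
 *
 * Returns buf so the call can be used directly as an argument.
 */
const char *perilog_errmsg (const char *name,
                            int exit_code,
                            int signo,
                            char *buf,
                            size_t len)
{
    assert (name != NULL && buf != NULL && len > 0);

    if (exit_code >= 0)
        snprintf (buf, len, "%s exited with exit code %d", name, exit_code);
    else
        snprintf (buf, len, "%s killed by signal %d", name, signo);
    return buf;
}
```

Because the script name is a parameter rather than a hard-coded string, the same function can build the note for either a prolog or an epilog exception.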
Problem: Job epilog failures do not raise job exceptions. This made sense when the epilog was used for administrative purposes, but now that most of that has moved to housekeeping, the epilog is meant to include cleanup necessary for job completion, and therefore an epilog failure should be reflected in the job eventlog and the user notified. Raise a non-fatal job exception on epilog failure. A non-fatal exception is used so that it does not influence the job status while still allowing the exception to be logged. Fixes flux-framework#6249
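The difference between a fatal and a non-fatal exception can be sketched with a toy model. This is not flux-core code: `struct toy_job`, `toy_job_raise()`, and `toy_job_failed()` are hypothetical names, and the only Flux convention assumed here is that an exception with severity 0 is fatal while higher severities are not.

```c
#include <stdbool.h>

/* Toy model of a job record (hypothetical, not a flux-core type). */
struct toy_job {
    int exit_code;          /* exit code reported by the job's tasks */
    int exception_count;    /* every exception is logged, fatal or not */
    bool fatal_exception;   /* set only by a severity 0 exception */
};

/* Record an exception. Assumes the Flux convention that severity 0
 * is fatal and any higher severity is informational.
 */
void toy_job_raise (struct toy_job *job, int severity)
{
    job->exception_count++;
    if (severity == 0)
        job->fatal_exception = true;
}

/* A job counts as failed if it saw a fatal exception or its tasks
 * exited non-zero. A non-fatal exception alone does not change this.
 */
bool toy_job_failed (const struct toy_job *job)
{
    return job->fatal_exception || job->exit_code != 0;
}
```

In this model, a job whose tasks exit 0 and whose epilog then raises a severity 1 exception still reports success, with the exception counted in its log. Raising severity 0 instead would mark the job failed regardless of its own exit code, which is the masking concern raised in the review discussion.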
Problem: Nothing in the testsuite ensures that an epilog failure results in a job exception. Add a test to t2274-manager-perilog-per-rank.t to ensure a job exception is logged when the epilog fails.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6669      +/-   ##
==========================================
- Coverage   83.86%   83.86%   -0.01%
==========================================
  Files         533      533
  Lines       88678    88683       +5
==========================================
+ Hits        74367    74371       +4
- Misses      14311    14312       +1
This PR adds a missing non-fatal exception to the job eventlog on epilog failure.