Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raise non-fatal exception on epilog failure #6669

Merged
merged 3 commits into from
Feb 28, 2025

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Feb 27, 2025

This PR adds a missing exception (non fatal) to the job eventlog on epilog failure.

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I think you could argue that an epilog failure should be a fatal job exception. For example, if an epilog action is ensuring that job output is on stable storage.

However, I'm sure there are arguments both ways and this is certainly a step in the right direction!

@grondo
Copy link
Contributor Author

grondo commented Feb 28, 2025

I think you could argue that an epilog failure should be a fatal job exception. For example, if an epilog action is ensuring that job output is on stable storage.

I did consider that, however it seemed like that is not the common case, and the fatal epilog exception would mask the actual exit code of the job itself. Maybe this should somehow be controlled by the epilog itself or configuration in the future.

Problem: The generation of the prolog exception message is embedded
in the `finish` event callback and is specific to the job prolog. This
makes it not reusable for the job epilog.

Abstract the error message generation into an exception_errmsg()
function which can be reused for the epilog.
Problem: Job epilog failures do not raise job exceptions. This made
sense when the epilog was used for administrative purposes, but now
that most of that has moved to housekeeping, the epilog is meant to
include cleanup necessary for job completion, and therefore an epilog
failure should be reflected in the job eventlog and the user notified.

Raise a non-fatal job exception on epilog failure. A non-fatal
exception is used so that it does influence the job status while
still allowing the exception to be logged.

Fixes flux-framework#6249
Problem: Nothing in the testsuite ensures that an epilog failure
results in a job exception.

Add a test to t2274-manager-perilog-per-rank.t to ensure a job
exception is logged when the epilog fails.
@mergify mergify bot merged commit c57d3b1 into flux-framework:master Feb 28, 2025
35 checks passed
Copy link

codecov bot commented Feb 28, 2025

Codecov Report

Attention: Patch coverage is 92.30769% with 2 lines in your changes missing coverage. Please review.

Project coverage is 83.86%. Comparing base (26e39f2) to head (513cc87).
Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
src/modules/job-manager/plugins/perilog.c 92.30% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6669      +/-   ##
==========================================
- Coverage   83.86%   83.86%   -0.01%     
==========================================
  Files         533      533              
  Lines       88678    88683       +5     
==========================================
+ Hits        74367    74371       +4     
- Misses      14311    14312       +1     
Files with missing lines Coverage Δ
src/modules/job-manager/plugins/perilog.c 82.68% <92.30%> (+0.13%) ⬆️

... and 3 files with indirect coverage changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants