[8.0] fix: sets jobStatus=Failed/Payload failed iff the job was running #7302

Merged
merged 1 commit into from
Nov 30, 2023

fstagni
Contributor

@fstagni fstagni commented Nov 22, 2023

This is a solution to https://lblogbook.cern.ch/Operations/37159

Since #6970, the result of (Inner) ComputingElement.submit() is correctly handled by the JobAgent. A job is normally set to "JobStatus=Failed/Payload failed" because of some catastrophic worker node (WN) failure outside DIRAC's control. That is fine, but a race condition is still possible: the JobWrapper can reschedule the job at time X, for example with

rescheduleResult = rescheduleFailedJob(jobID, "Working Directory already exists")

after which the job starts going through the Optimizers again. If, at time X+Y, the job is then set to "JobStatus=Failed/Payload failed", we can end up with the following job logging:

[...]
JobManager              Received     Job Rescheduled           Unknown                       2023-11-10 15:29:41
JobPath                 Checking     JobSanity                 Unknown                       2023-11-10 15:29:41
JobSanity               Checking     InputData                 Unknown                       2023-11-10 15:29:41
InputData               Checking     AncestorFiles             Unknown                       2023-11-10 15:29:41
AncestorFiles           Checking     AncestorFiles             Unknown                       2023-11-10 15:29:56
JobScheduling           Waiting      Pilot Agent Submission    Unknown                       2023-11-10 15:33:51
Matcher                 Matched      Assigned                  Unknown                       2023-11-10 15:36:11
[email protected]     Matched      Job Received by Agent     Unknown                       2023-11-10 15:36:11
[email protected]     Matched      Submitting To CE          Unknown                       2023-11-10 15:36:11
JobWrapper              Running      Job Initialization        Unknown                       2023-11-10 15:36:16
JobWrapper              Running      Downloading InputSandbox  Unknown                       2023-11-10 15:36:17
JobWrapper              Running      Input Data Resolution     Unknown                       2023-11-10 15:36:19
JobWrapper              Running      Input Data Resolution     Failed Input Data Resolution  2023-11-10 15:37:53
JobWrapper              Rescheduled  Input Data Resolution     Failed Input Data Resolution  2023-11-10 15:37:53
JobManager              Received     Job Rescheduled           Unknown                       2023-11-10 15:37:53
JobPath                 Checking     JobSanity                 Unknown                       2023-11-10 15:37:53
JobSanity               Checking     InputData                 Unknown                       2023-11-10 15:37:53
InputData               Checking     AncestorFiles             Unknown                       2023-11-10 15:37:53
AncestorFiles           Checking     AncestorFiles             Unknown                       2023-11-10 15:37:56
[email protected]     Failed       Payload failed            Unknown                       2023-11-10 15:38:05
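
The anomaly in the logging above is the final record: "Failed/Payload failed" arrives while the rescheduled job is in "Checking", not "Running". As a rough illustration only, the sketch below flags such records in an abbreviated copy of that history. The helper `suspicious_failures`, the `Record` tuple layout, and the shortened source name `JobAgent` are assumptions made for this example and are not part of DIRAC.

```python
from typing import List, Tuple

# (source, status, minor_status): a simplified, hypothetical record layout
Record = Tuple[str, str, str]


def suspicious_failures(history: List[Record]) -> List[int]:
    """Return indices of 'Failed' records whose preceding status was not 'Running'."""
    flagged = []
    for i in range(1, len(history)):
        _, status, _ = history[i]
        _, previous_status, _ = history[i - 1]
        if status == "Failed" and previous_status != "Running":
            flagged.append(i)
    return flagged


# Abbreviated from the tail of the job logging shown above
history = [
    ("JobWrapper", "Running", "Input Data Resolution"),
    ("JobWrapper", "Rescheduled", "Input Data Resolution"),
    ("JobManager", "Received", "Job Rescheduled"),
    ("JobPath", "Checking", "JobSanity"),
    ("JobSanity", "Checking", "InputData"),
    ("InputData", "Checking", "AncestorFiles"),
    ("AncestorFiles", "Checking", "AncestorFiles"),
    ("JobAgent", "Failed", "Payload failed"),
]

print(suspicious_failures(history))  # prints [7]
```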

This PR changes the JobAgent: to avoid overriding perfectly valid states, the status is updated if and only if the job was Running.
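
The idea behind the change can be sketched as a guarded status update. This is a minimal illustration under stated assumptions, not the code from this PR: `job_status` is a hypothetical in-memory stand-in for the job state store, and `set_payload_failed` is a made-up helper name.

```python
from typing import Dict

# Hypothetical in-memory job state store (job ID -> major status); not a DIRAC API
job_status: Dict[int, str] = {}


def set_payload_failed(job_id: int) -> bool:
    """Mark the job as Failed only if it is currently Running.

    Returns True if the status was updated, False if it was left untouched.
    """
    if job_status.get(job_id) != "Running":
        # The job was rescheduled (or is in another valid state): do not override it
        return False
    job_status[job_id] = "Failed"
    return True


job_status[1] = "Running"   # payload really died while running
job_status[2] = "Checking"  # job already rescheduled, back in the Optimizers

print(set_payload_failed(1), job_status[1])  # prints: True Failed
print(set_payload_failed(2), job_status[2])  # prints: False Checking
```

In a distributed system the check and the update would have to happen atomically, or on the side that owns the job state, for the guard to close the race completely; the sketch ignores that aspect.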

BEGINRELEASENOTES

*WMS
FIX: JobAgents will set jobStatus=Failed/Payload failed if and only if the job was previously Running

ENDRELEASENOTES

@fstagni fstagni requested a review from atsareg as a code owner November 22, 2023 15:42
@DIRACGridBot DIRACGridBot added the alsoTargeting:integration Cherry pick this PR to integration after merge label Nov 22, 2023
@fstagni fstagni force-pushed the 80_fixes72 branch 2 times, most recently from a797a63 to 4754439 Compare November 24, 2023 09:39
@fstagni fstagni merged commit 70538bb into DIRACGrid:rel-v8r0 Nov 30, 2023
25 checks passed
@DIRACGridBot DIRACGridBot added the sweep:done All sweeping actions have been done for this PR label Nov 30, 2023
DIRACGridBot pushed a commit to DIRACGridBot/DIRAC that referenced this pull request Nov 30, 2023
@DIRACGridBot

Sweep summary

Sweep ran in https://github.com/DIRACGrid/DIRAC/actions/runs/7044080845

Successful:

  • integration
