
Jobs stuck in matched when using SingularityCE #6885

Closed
chrisburr opened this issue Mar 6, 2023 · 4 comments · Fixed by #6970
chrisburr (Member) commented Mar 6, 2023

In LHCb we've been seeing jobs getting stuck in Matched when using SingularityCE, instead of being rescheduled when issues occur while launching the container.

I suspect it's caused by the interplay between the PoolCE with an inner SingularityCE.

@fstagni fstagni added this to the v8.0 milestone Mar 23, 2023
aldbr (Contributor) commented Apr 11, 2023

I'm not entirely sure about that, but from what I have observed so far, the JobAgent is unable to get the result of a job submission from a PoolCE.

As the PoolCE manages a pool of processes, it cannot return the status of a submission synchronously, so the PoolCE always returns S_OK() to the JobAgent.

https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/Resources/Computing/PoolComputingElement.py#L149
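The behaviour can be illustrated with a minimal, self-contained sketch (this is not DIRAC code: a ThreadPoolExecutor stands in for the PoolCE's pool of processes, and `run_payload`, `finalize_job` and `submit_job` are illustrative names). The submission call reports success immediately, and the real outcome only reaches a done-callback, which merely logs it:

```python
from concurrent.futures import ThreadPoolExecutor

def run_payload(job_id):
    # Stand-in for launching the container; raises on failure.
    if job_id == 1:
        raise RuntimeError(f"container failed for job {job_id}")
    return f"job {job_id} done"

def finalize_job(future):
    # Callback analogous to the PoolCE's finalizeJob: a failure is only
    # logged here; the caller of submit_job never sees it.
    exc = future.exception()
    if exc is not None:
        print(f"ERROR: {exc}")

def submit_job(pool, job_id):
    future = pool.submit(run_payload, job_id)
    future.add_done_callback(finalize_job)
    # Always reports success: the actual outcome is not known yet.
    return {"OK": True}

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = [submit_job(pool, j) for j in (1, 2)]
    # Both submissions report OK even though job 1 failed.
    print(all(r["OK"] for r in results))  # prints: True
```

This is exactly the gap described above: from the JobAgent's point of view every submission looks successful, so error-handling paths are never taken.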

The results of the submissions are handled by a callback function named PoolCE.finalizeJob(), which only logs an error message in case of failure.

https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/Resources/Computing/PoolComputingElement.py#L201

Thus, the following blocks of code do not seem to be executed:

https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py#L307-L315
https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py#L660-L676

A suggestion:

  • The JobAgent would pass the jobID to PoolCE.submitJob().
  • The PoolCE would add the jobID to the PoolCE.taskResults.
  • In each cycle, the JobAgent would check PoolCE.taskResults, retrieve the job IDs of the failed submissions, and handle the failures properly.
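The suggested mechanism could be sketched like this (hypothetical throughout: the simplified `PoolCE`/`JobAgent` classes, `task_results` and `check_submissions` are illustrative names, not the actual DIRAC API, and a thread pool stands in for the real process pool):

```python
from concurrent.futures import ThreadPoolExecutor

class PoolCE:
    def __init__(self, workers=2):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.task_results = {}  # job_id -> result, filled by the callback

    def submit_job(self, job_id, payload):
        future = self.pool.submit(payload)
        # Bind job_id so the callback can record which job finished.
        future.add_done_callback(lambda f, j=job_id: self._finalize(j, f))
        return {"OK": True}  # submission outcome is still unknown here

    def _finalize(self, job_id, future):
        # Record the real outcome instead of only logging it.
        exc = future.exception()
        self.task_results[job_id] = {"OK": exc is None, "Message": str(exc or "")}

class JobAgent:
    def __init__(self, ce):
        self.ce = ce
        self.failed = []

    def check_submissions(self):
        # Each cycle: drain finished results, collect failed job IDs.
        for job_id in list(self.ce.task_results):
            result = self.ce.task_results.pop(job_id)
            if not result["OK"]:
                self.failed.append(job_id)  # e.g. reschedule the job

if __name__ == "__main__":
    ce = PoolCE()
    ce.submit_job(1, lambda: 1 / 0)   # this payload fails
    ce.submit_job(2, lambda: "done")
    ce.pool.shutdown(wait=True)       # wait for both payloads to finish
    agent = JobAgent(ce)
    agent.check_submissions()
    print(agent.failed)  # prints: [1]
```

With this shape, the JobAgent no longer depends on a synchronous return value: it polls the shared results structure once per cycle and can reschedule the failed jobs.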

This solution would make the JobAgent more dependent on the PoolCE, or at least on its structure.
That is probably not a problem, since the PoolCE is a generic component that can always be used.

fstagni (Contributor) commented Apr 11, 2023

Good analysis; I don't see anything incorrect in it. It might be possible to verify it with a test, which would also be useful when coding the solution.
If you are right, then there is no interplay issue with SingularityCE: everything is on the PoolCE side.

Regarding your suggestion: IMHO, PoolCE might even become the default "inner CE" (if it is bug-free...). The reason we kept it as the non-default one was precisely these possible bugs.

aldbr (Contributor) commented Apr 17, 2023

Another thing I am not entirely sure I understand: is there a mechanism that marks jobs as FAILED if the JobAgent does not finish cleanly?
I don't see one, but a comment in the code seems to indicate that there is: https://github.com/DIRACGrid/DIRAC/blob/integration/src/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py#L663

chrisburr (Member, Author) commented

That was true at the time the comment was written; the mechanism was removed in df805d5.

aldbr linked a pull request (Apr 18, 2023) that will close this issue.
fstagni closed this as completed Jun 21, 2023.