[8.0] fix the interactions between the Matcher and the PoolCE #6970
The number of changes are large enough that should be tested in Jenkins before merging. Have you already done it by chance?
error = "JobWrapper execution error"
return S_ERROR(f"Failed to run InProcess: {result['Message']}")

retCode = result["Value"][0]
The `elif` was meaningful.
Could you explain why? I am not sure I understand.
Now we return `S_ERROR` when `result["OK"]` is `False`.
Yes I did actually, you can find it in |
Is it a problem?
Now I don't recall immediately all, but it's correct even if weird.
Normally not, but it's not to be removed from v7r3.
I am not sure I understand what you mean...
Seems correct.
Maybe to catch exceptions thrown by CEs...?
where?
The different abstract classes seem like the better solution.
I agree, and at a minimum it should go to v8.0.
Just to clarify on this:
Keep this PR to v8.0. As soon as https://jenkins-dirac.web.cern.ch/ comes back (it's down, I opened a ticket) I will verify it myself. The abstract class changes are for later of course.
Not really a problem, but just heterogeneous. IMHO, the
Well, actually I think this is okay.
Just to let you know: in this PR, if the submission fails (meaning it is not the fault of the payload), then the job is rescheduled.
On the one hand, these exceptions should probably be handled by the CEs.
Here: https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py#L755
Discussion with @fstagni: If the IIUC, the |
I was thinking about adding an option to stop the |
Looks OK to me.
Update:
I did not modify the
I reintroduced the
I added the following option: https://github.com/DIRACGrid/DIRAC/pull/6970/files#diff-81052c54bdd87d4f640519488aff350cadab94a15eb0aadd9d30166ab9d23dabR71 If there are too many CE errors, then the |
Can you rebase this one? For running the tests in Jenkins.
Fully verified with several Jenkins tests.
Sweep summary Sweep ran in https://github.com/DIRACGrid/DIRAC/actions/runs/5324587368 Successful:
This PR aims at solving the following issue: #6885
- [...] `JobAgent`.
- [...] (`InProcess`, `Singularity`, `Pool`).
- `Pilot3_submitAndMatch_Py3-client` Jenkins test

A few details about the implementation:
- The `JobAgent` deals with both synchronous (`InProcess`, `Singularity`) and asynchronous (`Pool`) submissions. A CE parameter called `AsyncSubmission` allows the `JobAgent` to distinguish between the two types.
- Inner CEs performing synchronous submission return `S_OK` if the submission was ok, and `S_ERROR` if an error occurred during the submission. If the payload fails, a `PayloadFailed = code returned` entry is added to `S_OK`.
  - In `Singularity`, I added the `PayloadFailed` key in the result dict and deleted the `ReschedulePayload` key (if `S_ERROR` is returned by `Singularity`, the `JobAgent` knows that this was not a `PayloadFailed` problem and reschedules the job).
- Inner CEs performing asynchronous submission always return `S_OK(CE-specific ID)`. Then results are reported in `computingElement.taskResults`.
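To make the convention above a bit more concrete, here is a minimal sketch of how a caller could interpret a submission result. It is not the `JobAgent` code: `S_OK` and `S_ERROR` are small stand-ins mirroring DIRAC's return dictionaries so the snippet runs on its own, `interpret_submission` and `interpret_task_result` are hypothetical helpers, and the layout of `taskResults` (task ID mapped to a result dict) is an assumption.

```python
# Stand-ins mirroring DIRAC's S_OK / S_ERROR return dictionaries,
# so that this sketch runs without DIRAC installed.
def S_OK(value=None):
    return {"OK": True, "Value": value}


def S_ERROR(message=""):
    return {"OK": False, "Message": message}


def interpret_submission(result, async_submission):
    """Hypothetical helper: map a CE submission result to an action for the caller."""
    if not result["OK"]:
        # The submission itself failed, so the payload is not at fault:
        # the job should be rescheduled.
        return "reschedule"
    if async_submission:
        # Asynchronous CE (e.g. Pool): S_OK only carries a CE-specific ID,
        # the outcome is reported later in taskResults.
        return "wait_for_task_results"
    if "PayloadFailed" in result:
        # Synchronous CE: the submission worked but the payload failed.
        return "payload_failed"
    return "payload_done"


def interpret_task_result(task_results, task_id):
    """Hypothetical helper: look up an asynchronous task in taskResults.

    Assumes taskResults maps a task ID to a result dict shaped like a
    synchronous submission result.
    """
    if task_id not in task_results:
        return "pending"
    return interpret_submission(task_results[task_id], async_submission=False)
```

For example, a synchronous result carrying `PayloadFailed` maps to `"payload_failed"`, whereas any `S_ERROR` maps to `"reschedule"` regardless of the CE type.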
A few points that could be discussed:

- The `Singularity` CE does not seem to properly report errors occurring in `JobWrapper` (from what I understand, it is reported as `PayloadFailed`).
- The `InProcess` CE does not seem to always return payload failures properly (e.g. exit code < 0), I am a bit lost with the logic here: https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/Resources/Computing/InProcessComputingElement.py#L86-L105
- Is the `Sudo` CE still used?
- The `JobAgent` does nothing if the job is not rescheduled, is this expected? https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py#L320
- In the current code `JobAgent` does not reschedule jobs if `_submitJob()` returns `S_ERROR`: https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py#L308
- There is a huge `try/except` block in the `JobAgent`. Do we really expect to get an exception from that? Could you provide me with an example?
- Why was the `direct` parameter added to `rescheduleFailedJob()`?
- At some point, we would probably need to define different abstract classes (with type hinting) to separate the different kinds of CEs we have, because it is becoming a bit messy:
  - `ARC` and `HTCondor`
  - `InProcess` and `Singularity`
  - `Pool`
- Another possible (tricky) solution to have homogeneous CE interfaces: inner CEs (`InProcess`, `Singularity`, `Pool`) could return the status of the submission (`S_OK/S_ERROR`) but not the exit code of the payload. They would have a `getJobStatus()` method that would return the status of the job if available in `taskResults`, the same way it is done for the "remote" CEs. Any opinion about this?

This PR should probably go to v8.1.
BEGINRELEASENOTES
*Resources
NEW: add a test for the PoolCE to highlight the fact that submission failures cannot be handled by the caller with the return value
*WorkloadManagement
FIX: management of the job status in JobAgent
NEW: add a test for the JobAgent to make sure the status of the submissions is correctly handled
*docs
NEW: documentation about ComputingElement
ENDRELEASENOTES