
[8.0] fix the interactions between the Matcher and the PoolCE #6970

Merged
merged 5 commits into from
Jun 20, 2023

Conversation


@aldbr aldbr commented Apr 13, 2023

This PR aims at solving the following issue: #6885

  • Add a test highlighting the issue that we face when the submission process in an inner CE goes wrong: the error is not handled by the JobAgent.
  • Propose an implementation to solve the issue.
  • Add an integration test to make sure the implementation works fine (tests with InProcess, Singularity, Pool).
  • Apply to the Pilot3_submitAndMatch_Py3-client Jenkins test

A few details about the implementation:

  • The JobAgent deals with both synchronous (InProcess, Singularity) and asynchronous (Pool) submissions. A CE parameter called AsyncSubmission allows the JobAgent to distinguish between the two types.

  • Inner CEs performing synchronous submission return S_OK if the submission succeeded, and S_ERROR if an error occurred during the submission. If the payload fails, a `PayloadFailed = <exit code>` entry is added to the S_OK result.

    • In Singularity, I added the PayloadFailed key in the result dict and deleted the ReschedulePayload key (if S_ERROR is returned by Singularity, the JobAgent knows that this was not a PayloadFailed problem and reschedules the job).
  • Inner CEs performing asynchronous submission always return S_OK(CE-specific ID). Then results are reported in computingElement.taskResults.
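The conventions above could be consumed by a caller roughly like this (a minimal sketch; `interpret_submission` and the helper names are illustrative, not the actual JobAgent code — only the S_OK/S_ERROR/`PayloadFailed` shapes come from the description above):

```python
# Sketch of how a caller could interpret the submission conventions
# described above. The dict shapes mimic DIRAC's S_OK/S_ERROR style;
# function names are illustrative, not the actual JobAgent code.

def s_ok(value=None):
    return {"OK": True, "Value": value}

def s_error(message):
    return {"OK": False, "Message": message}

def interpret_submission(result, async_submission):
    """Classify a CE submission result per the conventions above."""
    if not result["OK"]:
        # Synchronous CEs return S_ERROR on submission failure:
        # the job should be rescheduled.
        return "reschedule"
    if async_submission:
        # Asynchronous CEs (Pool) return S_OK(CE-specific ID); actual
        # outcomes arrive later in computingElement.taskResults.
        return "pending"
    if "PayloadFailed" in result:
        # The submission went fine but the payload itself failed.
        return "payload_failed"
    return "done"
```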

A few points that could be discussed:

This PR should probably go to v8.1.

BEGINRELEASENOTES
*Resources
NEW: add a test for the PoolCE to highlight the fact that submission failures cannot be handled by the caller with the return value
*WorkloadManagement
FIX: management of the job status in JobAgent
NEW: add a test for the JobAgent to make sure the statuses of the submissions are correctly handled
*docs
NEW: documentation about ComputingElement
ENDRELEASENOTES

@DIRACGridBot DIRACGridBot added the alsoTargeting:rel-v8r0 Cherry pick this PR to rel-v8r0 after merge label Apr 13, 2023
@aldbr aldbr force-pushed the rel-v7r3_FIX_MatcherPoolCE branch from bcb95bc to 4044167 Compare April 18, 2023 08:18
@aldbr aldbr changed the base branch from rel-v7r3 to rel-v8r0 April 18, 2023 08:19
@aldbr aldbr changed the title [v7r3] fix the interactions between the Matcher and the PoolCE [v8.0] fix the interactions between the Matcher and the PoolCE Apr 18, 2023
@aldbr aldbr added alsoTargeting:integration Cherry pick this PR to integration after merge and removed alsoTargeting:rel-v8r0 Cherry pick this PR to rel-v8r0 after merge labels Apr 18, 2023
@aldbr aldbr linked an issue Apr 18, 2023 that may be closed by this pull request
@aldbr aldbr force-pushed the rel-v7r3_FIX_MatcherPoolCE branch 8 times, most recently from c1d8e2c to 01de156 Compare April 20, 2023 12:49
@aldbr aldbr changed the title [v8.0] fix the interactions between the Matcher and the PoolCE [8.0] fix the interactions between the Matcher and the PoolCE Apr 20, 2023
@aldbr aldbr force-pushed the rel-v7r3_FIX_MatcherPoolCE branch 2 times, most recently from d2aeed7 to f4ec8dc Compare April 20, 2023 16:28
@aldbr aldbr marked this pull request as ready for review April 20, 2023 16:29

@fstagni fstagni left a comment


The number of changes is large enough that this should be tested in Jenkins before merging. Have you already done it by chance?

src/DIRAC/Resources/Computing/ComputingElement.py (outdated review thread)
error = "JobWrapper execution error"
return S_ERROR(f"Failed to run InProcess: {result['Message']}")

retCode = result["Value"][0]
Contributor


The elif was meaningful.

Contributor Author


Could you explain why? I am not sure I understand.
Now we return S_ERROR when result["OK"] is False.


aldbr commented May 8, 2023

The number of changes is large enough that this should be tested in Jenkins before merging. Have you already done it by chance?

Yes I did actually, you can find it in Pilot3_submitAndMatch_Py3-client#78.

@aldbr aldbr force-pushed the rel-v7r3_FIX_MatcherPoolCE branch from f4ec8dc to 571772e Compare May 8, 2023 03:54

fstagni commented May 10, 2023

A few points that could be discussed:

* The `Singularity` CE does not seem to properly report errors occurring in `JobWrapper` (from what I understand, it is reported as `PayloadFailed`).

Is it a problem?

* The `InProcess` CE does not seem to always return payload failures properly (e.g. exit code < 0), I am a bit lost with the logic here: https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/Resources/Computing/InProcessComputingElement.py#L86-L105

I don't immediately recall all the details, but it's correct even if weird.

* Is the `Sudo` CE still used?

Normally not, but it is not going to be removed from v7r3.

* The `JobAgent` does nothing if the job is not rescheduled, is this expected? https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py#L320

I am not sure I understand what you mean...

* In the current code `JobAgent` does not reschedule jobs if `_submitJob()` returns `S_ERROR`: https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py#L308

Seems correct.

* There is a huge `try/except` block in the `JobAgent`. Do we really expect to get an exception from that? Could you provide me with an example?

Maybe to catch exceptions thrown by CEs...?

* Why was the `direct` parameter added to `rescheduleFailedJob()`?

where?

* At some point, we would probably need to define different abstract classes (with type hinting) to separate the different kinds of CEs we have, because it is becoming a bit messy:
  
  * Remote CEs such as `ARC` and `HTCondor`
  * Inner CEs such as `InProcess` and `Singularity`
  * Inner Pool CEs such as `Pool`

* Another possible (tricky) solution to have homogeneous CE interfaces: inner CEs  (`InProcess`, `Singularity`, `Pool`) could return the status of the submission (`S_OK/S_ERROR`) but not the exit code of the payload. They would have a `getJobStatus()` method that would return the status of the job if available in `taskResults`, the same way it is done for the "remote" CEs. Any opinion about this?

The different abstract classes seem like the better solution.

This PR should probably go to v8.1.

I agree, and at a minimum it should go to v8.0.


fstagni commented May 15, 2023

Just to clarify on this:

The different abstract classes seem like the better solution.

This PR should probably go to v8.1.

I agree, and at a minimum it should go to v8.0.

Keep this PR to v8.0, as soon as https://jenkins-dirac.web.cern.ch/ comes back (it's down, I opened a ticket) I will verify it myself.

The abstract class changes are for later, of course.


aldbr commented May 16, 2023

The Singularity CE does not seem to properly report errors occurring in JobWrapper (from what I understand, it is reported as PayloadFailed).

Is it a problem?

Not really a problem, but just heterogeneous.
In the InProcessCE, this is handled differently: if an issue occurs in the JobWrapper, then the submission is marked as failed, whereas in the SingularityCE the submission will be marked as OK, but the payload as failed.
Then, the JobAgent handles these two cases differently: if the submission is marked as failed, the job is rescheduled.

IMHO, the InProcessCE should mark the payload as failed if an error occurs in the JobWrapper, since the JobWrapper is the payload.
And if we introduce an error in the JobWrapper, the error will appear everywhere, so rescheduling the job would be pointless I guess.
What do you think?

The JobAgent does nothing if the job is not rescheduled, is this expected? https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py#L320

I am not sure I understand what you mean...

Well, actually I think this is okay.
I meant: what if the JobAgent fails to reschedule the job (https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py#L777)?
It gets the message from S_ERROR returned by _rescheduleFailedJob and displays it.

In the current code JobAgent does not reschedule jobs if _submitJob() returns S_ERROR: https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py#L308

Seems correct.

Just to let you know: in this PR, if the submission fails (meaning it is not the fault of the payload), then the job is rescheduled.

There is a huge try/except block in the JobAgent. Do we really expect to get an exception from that? Could you provide me with an example?

Maybe to catch exceptions thrown by CEs...?

On the one hand, these exceptions should probably be handled by the CEs.
On the other hand, we will probably miss some.
Thus, the question is: should we catch them in the JobAgent to terminate gracefully? I would say probably yes, so as not to penalize the running jobs.
I removed the try/except block but I am going to restore it, unless you think it is better without it.
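The narrowed try/except described here could look roughly like this (a sketch under assumptions: `submit_job_safely` and the dummy `s_error` helper are illustrative names, not the actual JobAgent code; only the idea of converting an unexpected CE exception into a CE error comes from the discussion):

```python
import logging

log = logging.getLogger("JobAgent")

def s_error(message):
    # Mimics DIRAC's S_ERROR return-value convention.
    return {"OK": False, "Message": message}

def submit_job_safely(compute_element, executable_file, proxy):
    """Wrap only the CE submission in try/except, so that an unexpected
    exception raised by a CE is turned into a CE error (and the job can
    be rescheduled) instead of killing the agent and its running jobs.
    Names are illustrative, not the actual JobAgent code."""
    try:
        return compute_element.submitJob(executable_file, proxy)
    except Exception as exc:  # deliberately broad: catch what CEs miss
        log.exception("Exception during submission")
        return s_error(f"CE exception during submission: {exc}")
```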

Why was the direct parameter added to rescheduleFailedJob()?

where?

Here: https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py#L755
Because with the changes I made, rescheduleFailedJob() with direct=True is not used anymore and I wonder whether it should be used or not.


aldbr commented May 19, 2023

In the InProcessCE, this is handled differently: if an issue occurs in the JobWrapper, then the submission is marked as failed, whereas in the SingularityCE the submission will be marked as OK, but the payload as failed.
Then, the JobAgent handles these two cases differently: if the submission is marked as failed, the job is rescheduled.

IMHO, the InProcessCE should mark the payload as failed if an error occurs in the JobWrapper, since the JobWrapper is the payload.
And if we introduce an error in the JobWrapper, the error will appear everywhere, so rescheduling the job would be pointless I guess.
What do you think?

Discussion with @fstagni:
The JobWrapperTemplate might return an error if it cannot download the inputs for instance. Therefore, the SingularityCE should mark the submission as failed in this case.
According to you, the SingularityCE is already handling this correctly. I double checked and I still don't see how.

If the JobWrapperTemplate returns 2 for instance, the SingularityCE will be able to find the returned code and will mark the payload as failed: https://github.com/DIRACGrid/DIRAC/pull/6970/files#diff-774a7f7f7ef17f765067a8eb0e66ba659de3450779d80bd6b5555a73eefc3801L345

IIUC, the SingularityCE would only reschedule the job if it cannot get the JobWrapperTemplate exit code: https://github.com/DIRACGrid/DIRAC/pull/6970/files#diff-774a7f7f7ef17f765067a8eb0e66ba659de3450779d80bd6b5555a73eefc3801L338
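As I read it, the behaviour amounts to something like this (a sketch of my understanding only, not the actual SingularityCE code; `classify_wrapper_outcome` and the helpers are hypothetical names):

```python
def s_ok(value=None):
    return {"OK": True, "Value": value}

def s_error(message):
    return {"OK": False, "Message": message}

def classify_wrapper_outcome(ret_code):
    """Sketch of the SingularityCE behaviour discussed above: if the
    JobWrapperTemplate exit code is retrievable and non-zero, the
    submission is OK but the payload is flagged as failed; if the exit
    code cannot be retrieved at all, S_ERROR is returned and the
    JobAgent reschedules the job."""
    if ret_code is None:
        # Could not get the wrapper exit code: submission-level error,
        # so the job gets rescheduled.
        return s_error("Could not retrieve the JobWrapper exit code")
    result = s_ok()
    if ret_code != 0:
        # e.g. the wrapper returned 2: the payload is marked as failed.
        result["PayloadFailed"] = ret_code
    return result
```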


aldbr commented May 19, 2023

I was thinking about adding an option to stop the JobAgent if it fails to submit n times (meaning there is a problem with the worker node).
It would prevent a pilot from fetching and failing a large number of jobs.
What do you think?
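A minimal sketch of such an option (the class and threshold names are hypothetical, not an actual CS option or JobAgent attribute — only the stop-after-n-failures idea is from the comment above):

```python
class SubmissionFailureGuard:
    """Stop the JobAgent after n consecutive submission failures,
    which usually indicates a problem with the worker node itself.
    Hypothetical helper: names do not come from the DIRAC code base."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def record(self, submission_ok):
        """Record one submission outcome; return True when the agent
        should stop fetching further jobs."""
        if submission_ok:
            self.failures = 0  # a success resets the streak
        else:
            self.failures += 1
        return self.failures >= self.max_failures
```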


fstagni commented May 19, 2023

I was thinking about adding an option to stop the JobAgent if it fails to submit n times (meaning there is a problem with the worker node). It would prevent a pilot from fetching and failing a large number of jobs. What do you think?

Looks OK to me.

@aldbr aldbr force-pushed the rel-v7r3_FIX_MatcherPoolCE branch 4 times, most recently from d78555c to 69978f3 Compare June 1, 2023 07:14

aldbr commented Jun 7, 2023

Update:

The Singularity CE does not seem to properly report errors occurring in JobWrapper (from what I understand, it is reported as PayloadFailed).

I did not modify the SingularityCE.

I removed the try/except block but I am going to restore it, unless you think it is better without it.

I reintroduced the try/except block, but it is just wrapping the submission to the CE.
If an exception occurs, it is processed as a CE error such as: https://github.com/DIRACGrid/DIRAC/pull/6970/files#diff-81052c54bdd87d4f640519488aff350cadab94a15eb0aadd9d30166ab9d23dabR655-R662

I was thinking about adding an option to stop the JobAgent if it fails to submit n times (meaning there is a problem with the worker node).

I added the following option: https://github.com/DIRACGrid/DIRAC/pull/6970/files#diff-81052c54bdd87d4f640519488aff350cadab94a15eb0aadd9d30166ab9d23dabR71

If there are too many CE errors, then the JobAgent is stopped such as: https://github.com/DIRACGrid/DIRAC/pull/6970/files#diff-81052c54bdd87d4f640519488aff350cadab94a15eb0aadd9d30166ab9d23dabR707-R709


fstagni commented Jun 16, 2023

Can you rebase this one? For running the tests in Jenkins.

@aldbr aldbr force-pushed the rel-v7r3_FIX_MatcherPoolCE branch from 69978f3 to 96a144d Compare June 20, 2023 11:34

@fstagni fstagni left a comment


Fully verified with several Jenkins tests.

@fstagni fstagni merged commit 37a2629 into DIRACGrid:rel-v8r0 Jun 20, 2023
@DIRACGridBot DIRACGridBot added the sweep:done All sweeping actions have been done for this PR label Jun 20, 2023
DIRACGridBot pushed a commit to DIRACGridBot/DIRAC that referenced this pull request Jun 20, 2023
@DIRACGridBot

Sweep summary

Sweep ran in https://github.com/DIRACGrid/DIRAC/actions/runs/5324587368

Successful:

  • integration


Successfully merging this pull request may close these issues.

Jobs stuck in matched when using SingularityCE
3 participants