[8.0] Getting more details about failed/aborted pilots from HTCondor #7069

Merged: 5 commits, Nov 7, 2023


@aldbr aldbr commented Jun 21, 2023

This PR:

  • allows operators to retrieve output data even if the pilots fail or are aborted:

    • by default, HTCondor does not mark pilots as Failed if they don't terminate correctly.
    • the on_exit_hold option puts pilots on hold if they fail (on_exit_hold_subcode tags them with a specific code so that they can be identified).
    • from there, the output data can be transferred if needed, and DIRAC marks the pilots as Failed.
    • periodic_remove eventually deletes these pilots.
  • fixes pilots wrongly marked as Aborted:

    • some pilots were marked as Aborted in DIRAC while they were only temporarily Held in HTCondor because their inputs were being spooled.
    • these pilots are now marked as Waiting.
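
The behaviour described above can be sketched roughly as follows. This is an illustration, not the code from this PR: the function names, the subcode value 55, the two-day retention period, the exact submit expressions and the status mapping are all assumptions. The JobStatus values and the HoldReasonCode values (3 when on_exit_hold or periodic_hold fires, 16 while input files are being spooled) are those listed in the HTCondor job ClassAd documentation.

```python
# Hypothetical sketch: names and values below are illustrative, not DIRAC's.
HELD = 5                    # HTCondor JobStatus value for a held job
HOLD_REASON_ON_EXIT = 3     # HoldReasonCode: on_exit_hold / periodic_hold fired
HOLD_REASON_SPOOLING = 16   # HoldReasonCode: input files are being spooled
FAILED_SUBCODE = 55         # assumed value passed to on_exit_hold_subcode

# Assumed mapping for the non-held HTCondor states.
BASE_STATUS = {1: "Waiting", 2: "Running", 3: "Aborted", 4: "Done"}


def build_hold_options(keep_days=2, subcode=FAILED_SUBCODE):
    """Return submit-description lines that hold failed pilots, then remove them."""
    keep_seconds = keep_days * 24 * 3600
    return "\n".join(
        [
            # Hold the job if it did not exit with code 0 (or has no exit code).
            "on_exit_hold = ExitCode =!= 0",
            # Tag those holds so they can be told apart from other held jobs.
            f"on_exit_hold_subcode = {subcode}",
            # Remove held jobs once they have been held for keep_seconds.
            f"periodic_remove = (JobStatus == {HELD}) && "
            f"(time() - EnteredCurrentStatus > {keep_seconds})",
        ]
    )


def pilot_status(job_status, hold_code=None, hold_subcode=None):
    """Translate HTCondor job attributes into a DIRAC-style pilot status."""
    if job_status != HELD:
        return BASE_STATUS.get(job_status, "Unknown")
    if hold_code == HOLD_REASON_SPOOLING:
        # Temporary hold while inputs are spooled: the pilot is still waiting.
        return "Waiting"
    if hold_code == HOLD_REASON_ON_EXIT and hold_subcode == FAILED_SUBCODE:
        # Held by on_exit_hold with our subcode: the pilot failed.
        return "Failed"
    return "Aborted"
```

With these assumptions, `pilot_status(5, hold_code=16)` gives `"Waiting"` rather than `"Aborted"`, and a pilot held by on_exit_hold with the chosen subcode is reported as `"Failed"` while its output can still be fetched before periodic_remove deletes it.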

BEGINRELEASENOTES
*Resources
FIX: getting more details about failed/aborted pilots from HTCondor
ENDRELEASENOTES

@DIRACGridBot DIRACGridBot added the alsoTargeting:integration label (Cherry pick this PR to integration after merge) Jun 21, 2023
@aldbr aldbr force-pushed the rel-v8r0_FIX_HTCondorCEStatus branch 2 times, most recently from 12ca458 to 1ab913c Compare June 22, 2023 13:55
@aldbr aldbr changed the title [8.0] A few fixes for HTCondor [8.0] Getting more details about failed/aborted pilots from HTCondor Jun 22, 2023
@aldbr aldbr force-pushed the rel-v8r0_FIX_HTCondorCEStatus branch from 0ab51ff to f9fd4d5 Compare June 23, 2023 07:11
@aldbr aldbr marked this pull request as ready for review June 23, 2023 07:44
@aldbr aldbr requested review from fstagni and atsareg as code owners June 23, 2023 07:44

fstagni commented Jun 27, 2023

Please fix the conflict.

@aldbr aldbr force-pushed the rel-v8r0_FIX_HTCondorCEStatus branch from f9fd4d5 to 2e963c4 Compare June 28, 2023 15:21
@fstagni fstagni closed this Jul 5, 2023
@fstagni fstagni reopened this Jul 5, 2023
@aldbr aldbr force-pushed the rel-v8r0_FIX_HTCondorCEStatus branch 2 times, most recently from a89508b to 71254ad Compare July 18, 2023 08:58
@andresailer andresailer self-requested a review July 27, 2023 08:50
4 resolved review threads on src/DIRAC/Resources/Computing/BatchSystems/Condor.py (outdated)
@aldbr aldbr force-pushed the rel-v8r0_FIX_HTCondorCEStatus branch 2 times, most recently from de00232 to 1a1415b Compare August 1, 2023 07:42
@aldbr aldbr closed this Sep 22, 2023
@aldbr aldbr reopened this Sep 22, 2023
@aldbr aldbr force-pushed the rel-v8r0_FIX_HTCondorCEStatus branch from 5c2487a to aa812d3 Compare September 25, 2023 08:42
@fstagni fstagni requested a review from andresailer October 5, 2023 08:28
andresailer previously approved these changes Oct 5, 2023

@andresailer andresailer left a comment

Sorry you were waiting for me for so long.

Emoji responses and marking discussions as resolved don't trigger notifications.

So I was expecting explicit responses to comments with "Done", or a final: "I changed things along the lines you proposed" to have another look.

@aldbr aldbr force-pushed the rel-v8r0_FIX_HTCondorCEStatus branch from aa812d3 to b5afceb Compare October 6, 2023 06:52

aldbr commented Oct 6, 2023

Indeed, my fault.
Next time, I will comment or click on the "re-request review" button.

Thanks.

@aldbr aldbr closed this Oct 12, 2023
@aldbr aldbr reopened this Oct 12, 2023
@fstagni fstagni merged commit a04d867 into DIRACGrid:rel-v8r0 Nov 7, 2023
40 checks passed
@DIRACGridBot DIRACGridBot added the sweep:done label (All sweeping actions have been done for this PR) Nov 7, 2023
DIRACGridBot pushed a commit to DIRACGridBot/DIRAC that referenced this pull request Nov 7, 2023
@DIRACGridBot

Sweep summary

Sweep ran in https://github.com/DIRACGrid/DIRAC/actions/runs/6782583263

Successful:

  • integration

Labels: alsoTargeting:integration, sweep:done
4 participants