Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[integration] Remote Pilot Logger to Tornado #6208

Merged

Conversation

martynia
Copy link
Contributor

@martynia martynia commented Jun 24, 2022

BEGINRELEASENOTES
This is a draft of a pilot remote logging system to a Tornado Web server - Dirac side.

*Workload Management System
CHANGE: PilotWrapper has been modified to enable sending messages to a Tornado server.
NEW: The remote pilot logger uses a string.io buffer which, when full, flushes messages to Tornado. If enabled, remote logging happens in parallell to the existing pilot logging system. The buffer can be used indirectly by calling log.debug or log.info messages or directly by writing to it. The latter method is used within pilot command subprocesses, where records emitted by Dirac (not Pilot) logger are simply fed into the buffer and sent to Tornado. This requires no changes to logging from those subprocesses.
NEW TornadoPilotLoggingHandler loads a plugin to dispatch messages received by the server. As an example, a FileCacheLoginPlugin is implemented. It writes log records to a file, one per pilot. When a last record is written, or a pilot command exits with a code != 0, a log file is "finalised" and collected by an agent (PilotLoggingAgent).
NEW add a possibility to read a pilot stamp from DIRAC_PILOT_STAMP environment variable if supplied by a CE. This is the same stamp (PilotStamp) as stored in PilotAgents table. This allows to link log files to pilots.
CHANGE PilotCStoJSONSynchronizer to dump an entire Operations section of the CS.

ENDRELEASENOTES

It is foreseen that a MQ logging agent is developed which converts log records into JSON and ship them to a MQ Service.
Notes:
Tornado URL should be provided in Dirac CS.
PilotCStoJSONSynchronizer could also be used to store (parts of ?) the URL in pilot.json

@martynia martynia changed the title [integration] Remote Pilot Logger to Tonado [integration] Remote Pilot Logger to Tornado Jun 24, 2022
@martynia martynia force-pushed the integration_janusz_pilotlogsWrapper_dev branch from 405b86a to 5a6f587 Compare June 27, 2022 08:32
Copy link
Contributor

@fstagni fstagni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an initial review.

Apart from the documentation, that can come later, it is missing unit and integration tests.

@martynia martynia force-pushed the integration_janusz_pilotlogsWrapper_dev branch 2 times, most recently from 571c1c9 to 50ce8c9 Compare July 11, 2022 13:06
@martynia martynia force-pushed the integration_janusz_pilotlogsWrapper_dev branch from 50ce8c9 to 94a4a63 Compare August 5, 2022 10:07
@@ -244,6 +244,29 @@ def _getPilotOptionsPerSetup(self, setup, pilotDict):
return queueOptionRes
queuesDict[queue] = queueOptionRes["Value"]
pilotDict["Setups"][setup]["Logging"]["Queues"] = queuesDict
elif "loggingRESTService" in pilotDict["Setups"][setup]:
self.log.debug(
"Getting option of ", "/DIRAC/Setups/%s/%s" % (setup, pilotDict["Setups"][setup]["loggingRESTService"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does not look like a suitable location.

Copy link
Contributor Author

@martynia martynia Aug 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we really need this (a section in pilot.json). It was there in the original code. Now we are sending logs to Tornado, and in the second step are selecting a plugin by name (TornadoPilotLoggingHandler does this) from the CS directly (MQ or File cache) and we don't need this section. I would revert to the original code here.

Copy link
Contributor

@fstagni fstagni Aug 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need it for logging before DIRAC is installed.

@andresailer
Copy link
Contributor

Getting a proxy on demand and executing a function or code block:
https://dirac.readthedocs.io/en/latest/CodeDocumentation/Core/Utilities/Proxy.html

@fstagni fstagni changed the base branch from integration to rel-v8r0 September 15, 2022 07:51
@fstagni fstagni changed the title [integration] Remote Pilot Logger to Tornado [8.0] Remote Pilot Logger to Tornado Sep 15, 2022
@DIRACGridBot DIRACGridBot added the alsoTargeting:integration Cherry pick this PR to integration after merge label Sep 21, 2022
@martynia martynia force-pushed the integration_janusz_pilotlogsWrapper_dev branch from 28ea921 to f203f0e Compare October 18, 2022 15:06
Comment on lines 44 to 52
self.setup = gConfig.getValue("/DIRAC/Setup", None)
if self.setup is None:
self.log.error("Setup is not defined in the configuration")
return S_ERROR("Setup is not defined in the configuration")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this test should be here.

pilotOptions.append("-z ")
# Pilot Logging defined? This enables the extended (possibly remote) logger.
# We need the URL and an optional flag which allows fine tuning on VO by VO basis:
pilotLogging = opsHelper.getValue("/Pilot/RemoteLogging", True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I would start with

Suggested change
pilotLogging = opsHelper.getValue("/Pilot/RemoteLogging", True)
pilotLogging = opsHelper.getValue("/Pilot/RemoteLogging", False)

src/DIRAC/WorkloadManagementSystem/Agent/SiteDirector.py Outdated Show resolved Hide resolved
@martynia martynia force-pushed the integration_janusz_pilotlogsWrapper_dev branch from f203f0e to 010991b Compare October 31, 2022 10:30
@martynia martynia changed the base branch from rel-v8r0 to integration October 31, 2022 10:38
@martynia martynia changed the title [8.0] Remote Pilot Logger to Tornado [integration] Remote Pilot Logger to Tornado Oct 31, 2022
@fstagni
Copy link
Contributor

fstagni commented Nov 4, 2022

For the failing PilotWrapper tests: you need to change this line https://github.com/DIRACGrid/DIRAC/blob/integration/tests/Integration/WorkloadManagementSystem/Test_GenerateAndExecutePilotWrapper.py#L56 from

    location="diracproject.web.cern.ch/diracproject/tars/Pilot/DIRAC/master/,wrong.cern.ch",

to

    location="diracproject.web.cern.ch/diracproject/tars/Pilot/DIRAC/devel/,wrong.cern.ch",

and then we would need to merge your Pilot PR (to devel branch of course). This might break this test and the DIRAC certification setup but that's not a big deal.

@martynia
Copy link
Contributor Author

martynia commented Nov 4, 2022

Well, it takes pilot code from tars not from git. How would I make refresh the tars ? It's still the old code.

@fstagni
Copy link
Contributor

fstagni commented Nov 7, 2022

The tars are updated every night by this action: https://github.com/DIRACGrid/Pilot/actions/workflows/nightly.yml

@martynia
Copy link
Contributor Author

There is something strange happening here. When examining the pilot.tar I can clearly see the new code. This tar does not have (among other things) MessageSender.py, since this file is now obsolete. By when un-taring the bundle the file is listed. Is it possible that we are still unpacking the old file ?? but are claiming otherwise ...

@fstagni
Copy link
Contributor

fstagni commented Nov 18, 2022

There is something strange happening here. When examining the pilot.tar I can clearly see the new code. This tar does not have (among other things) MessageSender.py, since this file is now obsolete. By when un-taring the bundle the file is listed. Is it possible that we are still unpacking the old file ?? but are claiming otherwise ...

I don't see anything wrong ...

@martynia
Copy link
Contributor Author

Well, so why in the (failing) wrapper test the untar lists more files than there actually are in the tar which is listed in that test and is here: https://diracproject.web.cern.ch/diracproject/tars/Pilot/DIRAC/devel/ ?

@fstagni
Copy link
Contributor

fstagni commented Jun 26, 2023

There are still test failures after rebasing. Not sure they are related to my changes, though.

Run these locally and you should them failing:

src/DIRAC/WorkloadManagementSystem/JobWrapper/test/Test_JobWrapper.py::test_InputData PASSED [ 90%]
src/DIRAC/WorkloadManagementSystem/JobWrapper/test/Test_JobWrapper.py::test_performChecks PASSED [ 90%]
src/DIRAC/WorkloadManagementSystem/JobWrapper/test/Test_JobWrapper.py::test_execute[/bin/ls-None-None-Application Finished Successfully] PASSED [ 90%]
src/DIRAC/WorkloadManagementSystem/JobWrapper/test/Test_JobWrapper.py::test_execute[script-OK.sh-None-src/DIRAC/WorkloadManagementSystem/JobWrapper/test/-Application Finished Successfully] FAILED [ 90%]
src/DIRAC/WorkloadManagementSystem/JobWrapper/test/Test_JobWrapper.py::test_execute[script.sh-111-src/DIRAC/WorkloadManagementSystem/JobWrapper/test/-Application Finished With Errors0] FAILED [ 90%]
src/DIRAC/WorkloadManagementSystem/JobWrapper/test/Test_JobWrapper.py::test_execute[script.sh-111-src/DIRAC/WorkloadManagementSystem/JobWrapper/test/-Application Finished With Errors1] FAILED [ 90%]
src/DIRAC/WorkloadManagementSystem/JobWrapper/test/Test_JobWrapper.py::test_execute[script-RESC.sh-None-src/DIRAC/WorkloadManagementSystem/JobWrapper/test/-Going to reschedule job] FAILED [ 90%]
src/DIRAC/WorkloadManagementSystem/JobWrapper/test/Test_JobWrapper.py::test_execute[src/DIRAC/WorkloadManagementSystem/scripts/dirac_jobexec.py-src/DIRAC/WorkloadManagementSystem/JobWrapper/test/jobDescription.xml -o /DIRAC/Setup=Test-None-Application Finished Successfully] FAILED [ 90%]

@martynia
Copy link
Contributor Author

Hi,
The errors I can see when run locally in a Docker container, following the manual (https://dirac.readthedocs.io/en/latest/DeveloperGuide/CodeTesting/index.html) are the same as on GitHub. The test cannot locate a file (multiple files in fact) under src/DIRAC/WorkloadManagementSystem/JobWrapper/test/. The files are there. The test itself hasn't changed for a couple of months and it is not my code. There must be a fundamental issue with paths setting, not the code itself. Even the test in 8.0 looks the same.

@martynia
Copy link
Contributor Author

The cwd the test is running in is /home/dirac. I would like to know if it is intended in this test, before I prepend src/ with DIRAC/.

@martynia
Copy link
Contributor Author

@aldbr is this your code ? Can you help me with failing tests ?

@fstagni
Copy link
Contributor

fstagni commented Jun 28, 2023

I checked out your branch, and, if I run only this (py)test:

pytest src/DIRAC/WorkloadManagementSystem/JobWrapper/test/Test_JobWrapper.py

it works fine.

If instead I run pytest src/DIRAC/WorkloadManagementSystem/ (so all the tests found in this package) it fails with

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpw43m4wms/src/DIRAC/Core/Base/API.py

This is because in one of the unit tests you have added in this PR you are moving to a temporary directory. Please fix it one way or the other.

@martynia
Copy link
Contributor Author

I fixed the code avoiding using os.chdir(). Now pylint is complaining in AuthHandler:
https://github.com/martynia/DIRAC/blob/87acd4748909128d96c527f56f1e4a61395281b7/src/DIRAC/FrameworkSystem/API/AuthHandler.py#L247

@fstagni
Copy link
Contributor

fstagni commented Jun 29, 2023

I fixed the code avoiding using os.chdir(). Now pylint is complaining in AuthHandler: https://github.com/martynia/DIRAC/blob/87acd4748909128d96c527f56f1e4a61395281b7/src/DIRAC/FrameworkSystem/API/AuthHandler.py#L247

Yes that's independent from you!

@marianne013
Copy link
Contributor

@fstagni @atsareg @cburr Are any of you able to re-trigger the CI test on this request, please?

@andresailer
Copy link
Contributor

Sure

@andresailer
Copy link
Contributor

PS: you mean all or just the failed ones?

@marianne013
Copy link
Contributor

PS: you mean all or just the failed ones?

Didn't even know you could do that :-D

@andresailer
Copy link
Contributor

You mean that I could do that, or that one could to that?
And I still don't know if all or just the failed ones. But the failed ones failed again.

@marianne013
Copy link
Contributor

Sowohl als auch ....
Just realized that @martynia hasn't rebased, so the test indeed fail again. Sorry about that.

@martynia martynia force-pushed the integration_janusz_pilotlogsWrapper_dev branch from 87acd47 to 8cb436c Compare July 19, 2023 14:10
@martynia
Copy link
Contributor Author

I have rebased and tests are still failing, in another place, though.

@marianne013
Copy link
Contributor

I have rebased and tests are still failing, in another place, though.

Yes, this looks like a failure elsewhere in the DIRAC code. I can't tell though whose it is, though I feel an urge to tag arbitrary Chrises....

@marianne013
Copy link
Contributor

marianne013 commented Jul 20, 2023

@chrisburr I think the test failure is due to an upgrade to python 3.11:
https://docs.python.org/3/library/importlib.resources.html
read_text is now deprecated, so maybe that's the issue ?
Tagging the other Chris @chaen as well, as I don't have everyone's holiday schedules in my head.

#7112 is me trying to fix this.

@cburr
Copy link

cburr commented Jul 20, 2023

Hi, you all have the wrong Chris Burr, this is the guy you want: @chrisburr

@martynia martynia force-pushed the integration_janusz_pilotlogsWrapper_dev branch from 8cb436c to 5a2af95 Compare July 21, 2023 09:41
@martynia martynia force-pushed the integration_janusz_pilotlogsWrapper_dev branch from 5a2af95 to 9d437d8 Compare July 25, 2023 09:38
@fstagni
Copy link
Contributor

fstagni commented Jul 26, 2023

OK. Now just add a note in https://github.com/DIRACGrid/DIRAC/wiki/DIRAC-8.1 on what is needed to activate this and we are ready to merge this PR, and to test it in the DIRAC certification setup.

@martynia
Copy link
Contributor Author

I have added a preliminary configuration info to the Wiki.

@fstagni fstagni merged commit a18d85f into DIRACGrid:integration Jul 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants