Failure in StepChain jobs can result in misnamed output files #9633
The system seems to have been full of such problems for at least the past week. Here is the JSON (unpickled) version of the job report processed by JobAccountant in submit3: other files for that same job can also be found under the same directory. From a quick look at the wmagentJob.log, there is something very suspicious:
where it actually reports the very same output file (e.g. MINIAODSIMoutput.root) for the same step (cmsRun3). In short, there are two critical errors with this job:
Just fixed submit3, submit4, vocms0250, vocms0253 and vocms0281. If it keeps happening, we need to bring it to the top of the list, make a proper fix (or at least a temporary JobAccountant fix) and get all the agents patched.
There is another wave of such failures in many agents. Will try to get a workaround in place to at least keep the agents stable.
Based on the debugging done this morning, I have some interesting observations to make. First, let me share a dump of a job report: from submit7, for this job: now the facts:
Having said that, there are at least two issues to be further investigated and fixed here:
Even though we have a workaround for this (well, better call that a hack!), we should make sure it gets dealt with in the coming months: if not in Q4, then at the very beginning of Q1/2022.
Hi @amaltaro It seems to me the most important question to answer here is why |
It's been a while since I looked into this, but I believe the last 3 bullets in the initial description go straight to the problems identified here.
Ok! Upon a proper formatting of the error we've got 3 months ago from this particular WMCore/src/python/WMCore/WMSpec/Steps/Executors/CMSSW.py Lines 287 to 301 in 3ff3441
And I can also see how the error code from the
Which is (even if the method called at the time the variable is overwritten is expected to return the same [1]
And indeed, the first thing this method
WMCore/src/python/WMCore/FwkJobReport/Report.py Lines 364 to 365 in 5f7408c
Would/may lead to error miscount and skip the rest of the execution of the current method's code and directly return the so set to 0 FYI @amaltaro |
And if we look at the dump of the last 25 lines of the
Which to me sounds like the process has been killed by the worker.
Hi Todor, your comment regarding the I definitely think that code needs further review and some minor changes. One example is, I'm not 100% convinced that a subprocess.call exit code 0 means that the cmsRun process had an exit code equal to 0 as well (from getStepExitCodeAndMessage()). However, that logic in |
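The doubt about `subprocess.call` can be illustrated in isolation. The sketch below is not the WMCore code path; it uses a hypothetical wrapper script (standing in for whatever sits between the executor and the payload) that runs a failing inner process. When the wrapper does not propagate the inner exit code, `subprocess.call` returns 0 even though the inner process failed. The exit code 65 is an arbitrary value chosen for the illustration.

```python
import subprocess
import sys

# Hypothetical wrapper: it runs an inner process that exits with 65
# and, unless told otherwise, finishes normally afterwards.
WRAPPER = (
    "import subprocess, sys\n"
    "inner = subprocess.call([sys.executable, '-c', 'import sys; sys.exit(65)'])\n"
    "print('inner exit code:', inner)\n"
)


def run_wrapper(propagate):
    """Run the wrapper and return the exit code seen by the caller."""
    script = WRAPPER
    if propagate:
        # Only in this variant does the wrapper hand the inner code back.
        script += "sys.exit(inner)\n"
    return subprocess.call([sys.executable, "-c", script])


swallowed = run_wrapper(propagate=False)   # wrapper exits 0, failure is hidden
propagated = run_wrapper(propagate=True)   # wrapper exits with the inner code

print("caller sees (not propagated):", swallowed)
print("caller sees (propagated):", propagated)
```

If something similar happens between the executor and cmsRun, a zero return code alone would not be enough to declare the step successful, and the exit code recorded in the framework job report would need to be checked as well.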
And this seems no longer to be specific to StepChain workflows! We just had another flood of duplicate LFNs in the RelVal agent (vocms0259), and while running that script to remove the job reports which create duplicate LFNs, I found this TaskChain workflow: with the following list of duplicate LFNs:
as can be seen in the report dictionary under:
The output of removeDupJobAccountant.py is available at https://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9633/dupPickles.json and, in short, it looks like the output section is duplicated(?), in addition to having the output file named after the output module.
When the message above was posted, we apparently had an issue with that script which was not considering unique jobs, thus sometimes reporting false duplicate files. That issue was fixed by Kenyi around a month ago. Now that we had many agents with JobAccountant down, I see that this issue is back in WMAgent 2.0.2.patchX and it's apparently affecting StepChain. Example of a bad job output is:
I restarted the agents many times today, so we should reconsider this issue for this quarter and try to get this bug properly understood and fixed before a new stable WMAgent release is made in June.
Impact of the bug
WMAgent (seen in submit3 so far)
Describe the bug
It still needs further debugging, but it looks like if a StepChain job fails in a specific way (or with some specific exit code), its job report comes back with the standard local output file names, such as:
which is wrong and should never be inserted into the database, not least because that file is no longer available as soon as the job exits the worker node.
As soon as we get 2 jobs failing the same way, we hit a duplicate LFN in the database, which then crashes JobAccountant.
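The collision can be sketched as follows. This is a minimal illustration, not the removeDupJobAccountant.py script and not the real job report layout: the `job_id` and `lfns` keys, the job ids, and the `/store/...` path are all hypothetical. The function groups file names by the distinct jobs that reported them, so the same job seen twice is not counted as a duplicate.

```python
from collections import defaultdict


def find_duplicate_lfns(job_reports):
    """Return the file names reported by more than one distinct job.

    Each report is a dict with hypothetical keys 'job_id' and 'lfns'.
    """
    jobs_by_lfn = defaultdict(set)
    for report in job_reports:
        for lfn in report["lfns"]:
            # A set keeps each job id once, even if its report is read twice.
            jobs_by_lfn[lfn].add(report["job_id"])
    return {
        lfn: sorted(jobs)
        for lfn, jobs in jobs_by_lfn.items()
        if len(jobs) > 1
    }


reports = [
    {"job_id": 1, "lfns": ["MINIAODSIMoutput.root"]},
    {"job_id": 2, "lfns": ["MINIAODSIMoutput.root"]},
    {"job_id": 2, "lfns": ["MINIAODSIMoutput.root"]},  # same job read again
    {"job_id": 3, "lfns": ["/store/unmerged/example/file.root"]},  # made-up path
]

duplicates = find_duplicate_lfns(reports)
print(duplicates)
```

Two failed jobs that both report the bare local name end up under the same key, which is the situation that would trip a uniqueness constraint on the LFN.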
How to reproduce it
Still to be investigated
Expected behavior
Such file names should never get returned to the agent. There are many ways to see/resolve this issue though:
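One possible guard, shown only as a sketch: reject any reported output whose name does not look like a logical file name before it reaches the database. The check below assumes that valid LFNs start with `/store/`; both that prefix and the helper names `looks_like_lfn` and `split_outputs` are assumptions made for this illustration, not existing WMCore functions.

```python
def looks_like_lfn(name, prefix="/store/"):
    """Heuristic check: a string under the assumed LFN prefix ending in .root."""
    return (
        isinstance(name, str)
        and name.startswith(prefix)
        and name.endswith(".root")
    )


def split_outputs(file_names):
    """Separate plausible LFNs from names that should not be recorded."""
    valid, rejected = [], []
    for name in file_names:
        if looks_like_lfn(name):
            valid.append(name)
        else:
            rejected.append(name)
    return valid, rejected


valid, rejected = split_outputs([
    "MINIAODSIMoutput.root",              # local output module name
    "/store/unmerged/example/file.root",  # made-up LFN for the example
])
print("valid:", valid)
print("rejected:", rejected)
```

A check like this could live either on the worker node side, before the report is returned, or in the agent before the file is inserted; where it belongs is a design decision this sketch does not settle.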
Additional context and error message
Log dump from the script to remove duplicate files in JobAccountant can be found here (notice it has multiple jobs):
http://amaltaro.web.cern.ch/amaltaro/Issue_9633/dupPickles.json