Failure in StepChain jobs can result in misnamed output files #9633

Open
amaltaro opened this issue Apr 14, 2020 · 13 comments · Fixed by #10774
Comments

@amaltaro
Contributor

amaltaro commented Apr 14, 2020

Impact of the bug
WMAgent (seen in submit3 so far)

Describe the bug
It still needs further debugging, but it looks like if a StepChain job fails in a specific way (or with some specific exit code), its job report comes back with the standard local output file names, such as:

      "/store/unmerged/Run3Summer19DRPremix/QCD_Pt-15_IsoTrkFilter_Pt-30_TuneCP5_14TeV-pythia8/GEN-SIM-RECO/2021ScenarioRECO_106X_mcRun3_2021_realistic_v3-v3/230001/RECOSIMoutput.root",

which is wrong and should never be inserted into the database, also because as soon as the job leaves the worker node that file is no longer available.

As soon as we get 2 jobs failing the same way, we hit a duplicate LFN in the database, which then crashes JobAccountant.

How to reproduce it
Still to be investigated

Expected behavior
Such file names should never get returned to the agent. There are many ways to see/resolve this issue though:

  1. transient file names in a StepChain job should not appear in the job report
  2. it should return a local worker-node path (instead of the usual LFN/PFN)
  3. we might have to add some logic to JobAccountant to skip such files from database insertion (they will represent failed jobs anyway); see the sketch below
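If we went with the third option, a JobAccountant-side guard could look roughly like the minimal sketch below. It is only illustrative: the skipTransientOutputs name, the fileList/outputModules arguments and the dict layout are assumptions, not existing WMCore code.

import os
import logging

def skipTransientOutputs(fileList, outputModules):
    """Hypothetical helper: drop files whose basename is just '<outputModule>.root'."""
    # transient on-node names that should have been replaced by unique LFNs before stage-out
    transientNames = {mod + ".root" for mod in outputModules}
    kept = []
    for fwjrFile in fileList:
        lfn = fwjrFile.get("lfn", "")
        if os.path.basename(lfn) in transientNames:
            logging.warning("Skipping transient output file: %s", lfn)
            continue
        kept.append(fwjrFile)
    return kept

# e.g. skipTransientOutputs(files, ["RECOSIMoutput", "MINIAODSIMoutput"])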

Additional context and error message
Log dump from the script to remove duplicate files in JobAccountant can be found here (notice it has multiple jobs):
http://amaltaro.web.cern.ch/amaltaro/Issue_9633/dupPickles.json

@amaltaro amaltaro changed the title Crash in StepChain jobs can result in misnamed output files Failure in StepChain jobs can result in misnamed output files Apr 14, 2020
@amaltaro amaltaro added this to the July_2020 milestone Jun 15, 2020
@amaltaro
Contributor Author

The system seems to be full of such problems for - at least - the past week. Here is the json (unpickled) version of the job report processed by JobAccountant in submit3:
http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9633/badPickle.json

Other files for that same job can also be found under the same directory.

Having a quick look at wmagentJob.log, there is something very suspicious:

2021-04-09 07:49:47,781:INFO:PerformanceMonitor:PSS: 3577811; RSS: 3625756; PCPU: 759; PMEM: 2.7
2021-04-09 07:52:15,003:INFO:Report:addOutputFile method called with outputModule: MINIAODSIMoutput, aFile: None
2021-04-09 07:52:15,004:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2021-04-09 07:52:15,004:INFO:Report:addOutputFile method called with outputModule: RECOSIMoutput, aFile: None
2021-04-09 07:52:15,005:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2021-04-09 07:52:15,016:CRITICAL:CMSSW:Error running cmsRun
{'arguments': ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun3/cmsRun3-main.sh', '', u'slc7_amd64_gcc900', 'scramv1', 'CMSSW', 'CMSSW_11_2_2_patch1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '', '', '']}
CMSSW Return code: 0

2021-04-09 07:52:15,016:CRITICAL:CMSSW:Error message: None
2021-04-09 07:52:15,209:INFO:Report:addOutputFile method called with outputModule: MINIAODSIMoutput, aFile: None
2021-04-09 07:52:15,210:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2021-04-09 07:52:15,210:INFO:Report:addOutputFile method called with outputModule: RECOSIMoutput, aFile: None
2021-04-09 07:52:15,211:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2021-04-09 07:52:15,228:INFO:CMSSW:Steps.Executors.CMSSW.post called
2021-04-09 07:52:15,234:INFO:ExecuteMaster:StepName: cmsRun3, StepType: CMSSW, with result: 50116

where it actually reports the very same output file (e.g. MINIAODSIMoutput.root) for the same step (cmsRun3).

In short, there are two critical errors with this job:

  1. it returns the output file named after the output module instead of one created by the uuid algorithm (an illustration of the expected renaming follows below)
  2. it returns the same file multiple times for the same cmsRun step
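For context, the expected behavior is that the transient, module-named file gets renamed to a unique LFN before it leaves the worker node. A rough sketch of that intent, using Python's standard uuid module (the actual WMCore naming helper may differ, and the paths below are only examples):

import os
import uuid

def uniqueLFN(lfnBase, moduleFileName):
    """Illustrative only: derive a unique LFN from the transient module-named file."""
    ext = os.path.splitext(moduleFileName)[1]  # e.g. ".root"
    return "%s/%s%s" % (lfnBase.rstrip("/"), uuid.uuid4().hex, ext)

# uniqueLFN(".../GEN-SIM-RECO/.../230001", "RECOSIMoutput.root")
# -> ".../230001/<32-char-hex>.root" instead of ".../230001/RECOSIMoutput.root"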

@amaltaro amaltaro removed this from the July_2020 milestone Apr 12, 2021
@amaltaro
Contributor Author

Just fixed submit3, submit4, vocms0250, vocms0253 and vocms0281. If it keeps happening, we need to bring it to the top of the list, make a proper fix (or at least a temporary JobAccountant fix) and get all the agents patched.

@amaltaro amaltaro self-assigned this May 13, 2021
@amaltaro
Contributor Author

There is another wave of such failures in many agents. Will try to get a workaround in place to at least keep the agents stable.

@amaltaro
Contributor Author

amaltaro commented May 13, 2021

Based on the debugging done this morning, I have some interesting observations to make. First, let me share a dump of a job report:
http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9633/job_1019852.json

from submit7, for this job:
/storage/local/data1/cmsdataops/srv/wmagent/v1.4.7.patch2/install/wmagent/JobCreator/JobCache/cmsunified_task_TSG-Run3Winter21GS-00057__v1_T_210429_192814_4663/TSG-Run3Winter21GS-00057_0/JobCollection_65415_0/job_1019852/Report.1.pkl

now the facts:

  • every single step had a step status=0, meaning the job was successful
  • cmsRun3 actually had a hard crash, reported under the errors attribute (thus, this job should actually be marked as failed!)
  • only the cmsRun3 step reports output files, under the output attribute.
  • the cmsRun3 step actually reports the same output file twice (causing duplication!)
  • the names of the output files aren't unique - because the job failed before we could stage them out - another source of duplication in the agent
  • output files don't have any location associated with them, so the job is actually marked as failed, see:
    # Make sure every file has a valid location
    # see https://github.com/dmwm/WMCore/issues/9353
    for fwjrFile in fileList:
        # T0 has analysis file without any location, see:
        # https://github.com/dmwm/WMCore/issues/9497
        if not fwjrFile.get("locations") and fwjrFile.get("lfn", "").endswith(".root"):
            logging.warning("The following file doesn't have any location: %s", fwjrFile)
            jobSuccess = False
            break

Having said that, there are at least a few issues to be further investigated and fixed here:

  • why is the job considered successful if there was a crash in one of the cmsRun steps (apparently because only the step status value is taken into account...)? I think we should add a check for the errors attribute as well; a sketch of such a check follows this list.
  • why does the FJR report the same file twice?
  • if possible, we should not report output files named after their output module name. The best would be to guarantee that from the worker node; the second option is to have a check in the JobAccountant component.
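As a concrete (hypothetical) version of the first bullet, a success check over a json-dumped job report could combine the step status with the errors attribute. The dict layout below is simplified from the job_1019852.json dump linked above, and the jobSucceeded name is made up:

def jobSucceeded(fwjr):
    """Hypothetical check: succeed only if every step has status 0 AND no errors."""
    for step in fwjr.get("steps", {}).values():
        if step.get("status", 1) != 0:
            return False
        # a step can report status=0 and still carry entries under 'errors' (this bug)
        if step.get("errors"):
            return False
    return True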

@amaltaro
Contributor Author

Even though we have a workaround for this (well, better to call it a hack!), we should make sure it gets dealt with in the coming months. If not in Q4, then at the very beginning of Q1/2022.

@todor-ivanov
Contributor

Hi @amaltaro, it seems to me the most important question to answer here is why the cmsRun3 job failure was not accounted for and the step was marked as successful. All the rest sound like consequences. Even the LFN duplication, which seems to be just a default value used when the actual output was not provided (obviously because of the broken cmsRun3). So in this case I would say the duplicate error actually protects us (in a very uncomfortable way) rather than hurting us.

@amaltaro
Contributor Author

It's been a while since I looked into this, but I believe the last 3 bullets in the initial description go straight to the problems identified here.
I can't rule out the possibility of a duplicate LFN coming from a cmsRun step that was mistakenly marked as successful. This requires real debugging and tests to get a full understanding of the issue (I definitely do not have it now!)

@todor-ivanov
Contributor

todor-ivanov commented Oct 13, 2021

Ok! After properly formatting the error we got 3 months ago from this particular cmsRun3 job [1], we can see it is actually a WMException, and I am now confident that the error is thrown from this piece of code:

if returnCode != 0:
    argsDump = {'arguments': args}
    msg = "Error running cmsRun\n%s\n" % argsDump
    try:
        self.report.parse(jobReportXML, stepName=self.stepName)
        (returnCode, returnMessage) = self.report.getStepExitCodeAndMessage(stepName=self.stepName)
        msg += "CMSSW Return code: %s\n" % returnCode
    except Exception as ex:
        # If report parsing fails, report linux exit code
        msg += "Linux Return code: %s\n" % returnCode
    finally:
        logging.critical(msg)
        logging.critical("Error message: %s", returnMessage)
        self._setStatus(returnCode, returnMessage)
        raise WMExecutionFailure(returnCode, "CmsRunFailure", msg)

And I can also see how the error code from the cmsRun process got lost - we are basically overwriting the returnCode variable with which we entered the try/except block at the line:

(returnCode, returnMessage) = self.report.getStepExitCodeAndMessage(stepName=self.stepName)

Even if the method called there is expected to return the same returnCode and just populate the message properly in the tuple, overwriting the variable is a definite recipe for a bug. A sketch of how the overwrite could be avoided follows.
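A minimal sketch of how that overwrite could be avoided, reusing the names from the quoted snippet (this is just an illustration, not the actual fix that went into WMCore): only adopt the code parsed from the report when it is itself non-zero, and initialize returnMessage defensively so the finally block always has a value to log.

if returnCode != 0:
    argsDump = {'arguments': args}
    msg = "Error running cmsRun\n%s\n" % argsDump
    returnMessage = None
    try:
        self.report.parse(jobReportXML, stepName=self.stepName)
        stepCode, returnMessage = self.report.getStepExitCodeAndMessage(stepName=self.stepName)
        if stepCode != 0:
            # only trust the parsed report if it carries a real error code
            returnCode = stepCode
        msg += "CMSSW Return code: %s\n" % returnCode
    except Exception:
        # if report parsing fails, report the linux exit code
        msg += "Linux Return code: %s\n" % returnCode
    finally:
        logging.critical(msg)
        logging.critical("Error message: %s", returnMessage)
        self._setStatus(returnCode, returnMessage)
        raise WMExecutionFailure(returnCode, "CmsRunFailure", msg)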

[1]

 Adding last 25 lines of CMSSW stdout:
 Time Summary: 
 - Min event:   4.25431
 - Max event:   70.563
 - Avg event:   17.8879
 - Total loop:  1478.21
 - Total init:  60.2161
 - Total job:   1550.93
 - EventSetup Lock:   0
 - EventSetup Get:   0
 Event Throughput: 0.435662 ev/s
 CPU Summary: 
 - Total loop:  11614.3
 - Total init:  53.2713
 - Total extra: 0
 - Total job:   11681.7
 Processing Summary: 
 - Number of Events:  644
 - Number of Global Begin Lumi Calls:  1
 - Number of Global Begin Run Calls: 1



Fatal system signal has occurred during exit
Complete
process id is 2406 status is 134
<@========== WMException Start ==========@>
Exception Class: CmsRunFailure
Message: Error running cmsRun
{'arguments': ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun3/cmsRun3-main.sh', '', u'slc7_amd64_gcc900', 'scramv1', 'CMSSW', 'CMSSW_11_2_2_patch1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '', '', '']}
CMSSW Return code: 0

	ModuleName : WMCore.WMSpec.Steps.WMExecutionFailure
	MethodName : __init__
	ClassInstance : None
	FileName : /srv/job/WMCore.zip/WMCore/WMSpec/Steps/WMExecutionFailure.py
	ClassName : None
	LineNumber : 18
	ErrorNr : 0

Traceback: 

<@---------- WMException End ----------@>

@todor-ivanov
Contributor

todor-ivanov commented Oct 13, 2021

And indeed, the first thing this getStepExitCodeAndMessage method does is set a default returnCode of 0, and only then does it enter the rest of its logic (see the method in FwkJobReport/Report.py).

So any further failure or misbehavior during the calls to the next two methods:
reportStep = self.retrieveStep(stepName)
errorCount = getattr(reportStep.errors, "errorCount", 0)

may lead to an error miscount, skip the rest of the method's logic, and directly return the returnCode that was just set to 0, hence overwriting the value with which we entered the outer/calling try/except block, as I mentioned in my previous comment. A schematic reconstruction of this failure mode is shown below.
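To make the failure mode explicit, here is a schematic reconstruction (not the actual WMCore method body) of how defaulting to 0 can mask the real exit code when no errors are attached to the step:

def getStepExitCodeAndMessage(report, stepName):
    """Schematic only: shows why a default of 0 can leak out as the final code."""
    returnCode = 0
    returnMessage = None
    reportStep = report.retrieveStep(stepName)
    errorCount = getattr(reportStep.errors, "errorCount", 0)
    if errorCount == 0:
        # nothing recorded under 'errors' => we hand back 0,
        # even if cmsRun itself crashed hard
        return returnCode, returnMessage
    # ... otherwise pick the code/message from the recorded errors ...
    return returnCode, returnMessage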

FYI @amaltaro

@todor-ivanov
Contributor

todor-ivanov commented Oct 13, 2021

And if we look at the dump of the last 25 lines of the CMSSW stdout (from the error message in the comments above), we notice:

Fatal system signal has occurred during exit
Complete
process id is 2406 status is 134

Which to me sounds like the process has been killed by the worker (status 134 corresponds to 128 + 6, i.e. SIGABRT).

@amaltaro
Contributor Author

Hi Todor, your comment regarding the returnCode in the Steps/Executors/CMSSW.py module is very good. I have already seen a problem related to that code in testbed a few weeks ago, but didn't manage to proceed and collect further information.

I definitely think that code needs further review and some minor changes. One example: I'm not 100% convinced that a subprocess.call exit code of 0 means the cmsRun process had an exit code equal to 0 as well (as obtained from getStepExitCodeAndMessage()); the snippet below illustrates the concern.
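Purely as an illustration of that concern (the wrapper script path is made up, and the real behaviour depends on how cmsRun3-main.sh propagates its child's status): subprocess.call only reports the wrapper's exit status, so a non-zero cmsRun status can be lost if the wrapper does not end with something like exit $?.

import subprocess

# The executor launches a bash wrapper, so we only see the wrapper's status here.
returnCode = subprocess.call(["/bin/bash", "cmsRun3-main.sh"])  # path is illustrative
print("wrapper exit code:", returnCode)  # 0 does not prove that cmsRun returned 0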

However, that logic in getStepExitCodeAndMessage from FwkJobReport/Report.py seems to be okay to me. If there is a failure within that method, the exception would be raised and propagated upstream, failing the whole CMSSW parsing on the worker node.

@amaltaro
Contributor Author

And this seems no longer to be specific to StepChain workflows!

We just had another flood of duplicate LFNs in the RelVal agent (vocms0259), and while running that script to remove the job reports which create duplicate LFNs, I found this TaskChain workflow:
pdmvserv_RVCMSSW_12_3_0_pre6PromptCosmics__RelVal_2021GR5_220311_120949_2532

with the following list of duplicate LFNs:

    "dup_lfns": [
      "/store/unmerged/CMSSW_12_3_0_pre6/ExpressCosmics/DQMIO/123X_dataRun3_Prompt_v5_RelVal_2021GR5-v1/10019/DQMoutput.root",
      "/store/unmerged/CMSSW_12_3_0_pre6/ExpressCosmics/DQMIO/123X_dataRun3_Prompt_v5_RelVal_2021GR5-v1/10019/DQMoutput.root",
      "/store/unmerged/CMSSW_12_3_0_pre6/ExpressCosmics/RECO/123X_dataRun3_Prompt_v5_RelVal_2021GR5-v1/10019/RECOoutput.root",
      "/store/unmerged/CMSSW_12_3_0_pre6/ExpressCosmics/RECO/123X_dataRun3_Prompt_v5_RelVal_2021GR5-v1/10019/RECOoutput.root"
    ]

as can be seen in the report dictionary under:

/data/srv/wmagent/v1.5.7.patch1/install/wmagentpy3/JobCreator/JobCache/pdmvserv_RVCMSSW_12_3_0_pre6PromptCosmics__RelVal_2021GR5_220311_120949_2532/RECOCOSDPROMPTRUN3/JobCollection_21251_0/job_199838/Report.0.pkl

One can find the output of the removeDupJobAccountant.py here: https://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9633/dupPickles.json

In short, it looks like the output section is duplicated(?), in addition to the output files being named after their output module. Conceptually, the duplicate check that trips JobAccountant boils down to the sketch below.
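Illustrative only (this is not the actual removeDupJobAccountant.py, and the dict layout is simplified from the json dumps linked above): count how often each output LFN appears across a set of job reports and flag anything seen more than once.

from collections import Counter

def findDuplicateLFNs(reportDicts):
    """Hypothetical helper: return LFNs reported more than once across job reports."""
    counts = Counter()
    for report in reportDicts:
        for step in report.get("steps", {}).values():
            for moduleFiles in step.get("output", {}).values():
                for fwjrFile in moduleFiles:
                    lfn = fwjrFile.get("lfn")
                    if lfn:
                        counts[lfn] += 1
    return [lfn for lfn, n in counts.items() if n > 1]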

@amaltaro
Contributor Author

When the message above was posted, we apparently had an issue with that script which was not considering unique jobs, thus sometimes reporting false duplicate files. That issue was fixed by Kenyi around a month ago.

Now that we have had many agents with JobAccountant down, I see that this issue is back in WMAgent 2.0.2.patchX and it's apparently affecting StepChain. An example of bad job output is:

Duplicate files are:
['/store/unmerged/Run3Winter21DRMiniAOD/DYJetsToMuMu_M-50_TuneCP5_14TeV-madgraphMLM-pythia8/ALCARECO/TkAlZMuMu-designGT_112X_mcRun3_2021_realistic_v16_ext1-v2/2820025/ALCARECOStreamTkAlZMuMu.root', '/store/unmerged/Run3Winter21DRMiniAOD/DYJetsToMuMu_M-50_TuneCP5_14TeV-madgraphMLM-pythia8/AODSIM/designGT_112X_mcRun3_2021_realistic_v16_ext1-v2/2820025/AODSIMoutput.root', '/store/unmerged/Run3Winter21DRMiniAOD/DYJetsToMuMu_M-50_TuneCP5_14TeV-madgraphMLM-pythia8/MINIAODSIM/designGT_112X_mcRun3_2021_realistic_v16_ext1-v2/2820025/MINIAODSIMoutput.root']

I restarted the agents many times today, so we should reconsider this issue for this quarter and try to get this bug properly understood and fixed before a new WMAgent stable release is made, in June.

@amaltaro amaltaro removed their assignment May 12, 2022