Failure in StepChain jobs can result in misnamed output files #9633

Open
amaltaro opened this issue Apr 14, 2020 · 13 comments · Fixed by #10774
Comments

@amaltaro
Contributor

amaltaro commented Apr 14, 2020

Impact of the bug
WMAgent (seen in submit3 so far)

Describe the bug
It still needs further debugging, but it looks like if a StepChain job fails in a specific way (or with some specific exit code), its job report comes back with the standard local output file names, such as:

      "/store/unmerged/Run3Summer19DRPremix/QCD_Pt-15_IsoTrkFilter_Pt-30_TuneCP5_14TeV-pythia8/GEN-SIM-RECO/2021ScenarioRECO_106X_mcRun3_2021_realistic_v3-v3/230001/RECOSIMoutput.root",

which is wrong and should never be inserted into the database, also because as soon as the job leaves the worker node that file is no longer available.

As soon as we get 2 jobs failing the same way, we hit a duplicate LFN in the database, which then crashes JobAccountant.

How to reproduce it
Still to be investigated

Expected behavior
Such file names should never get returned to the agent. There are many ways to see/resolve this issue though:

  1. transient file names in a StepChain job should not appear in the job report
  2. it should return a local worker-node path (instead of the usual LFN/PFN)
  3. we might have to add some logic to JobAccountant to skip such files from database insertion (they will represent failed jobs anyway); see the sketch below
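If we went with the third option, a JobAccountant-side guard could look roughly like the minimal sketch below. It is only illustrative: the skipTransientOutputs name, the fileList/outputModules arguments and the dict layout are assumptions, not existing WMCore code.

import os
import logging

def skipTransientOutputs(fileList, outputModules):
    """Hypothetical helper: drop files whose basename is just '<outputModule>.root'."""
    # transient on-node names that should have been replaced by unique LFNs before stage-out
    transientNames = {mod + ".root" for mod in outputModules}
    kept = []
    for fwjrFile in fileList:
        lfn = fwjrFile.get("lfn", "")
        if os.path.basename(lfn) in transientNames:
            logging.warning("Skipping transient output file: %s", lfn)
            continue
        kept.append(fwjrFile)
    return kept

# e.g. skipTransientOutputs(files, ["RECOSIMoutput", "MINIAODSIMoutput"])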

Additional context and error message
Log dump from the script to remove duplicate files in JobAccountant can be found here (notice it has multiple jobs):
http://amaltaro.web.cern.ch/amaltaro/Issue_9633/dupPickles.json

@amaltaro amaltaro changed the title Crash in StepChain jobs can result in misnamed output files Failure in StepChain jobs can result in misnamed output files Apr 14, 2020
@amaltaro amaltaro added this to the July_2020 milestone Jun 15, 2020
@amaltaro
Contributor Author

The system seems to be full of such problems for - at least - the past week. Here is the json (unpickled) version of the job report processed by JobAccountant in submit3:
http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9633/badPickle.json

Other files for that same job can also be found under the same directory.

Having a quick look at wmagentJob.log, there is something very suspicious:

2021-04-09 07:49:47,781:INFO:PerformanceMonitor:PSS: 3577811; RSS: 3625756; PCPU: 759; PMEM: 2.7
2021-04-09 07:52:15,003:INFO:Report:addOutputFile method called with outputModule: MINIAODSIMoutput, aFile: None
2021-04-09 07:52:15,004:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2021-04-09 07:52:15,004:INFO:Report:addOutputFile method called with outputModule: RECOSIMoutput, aFile: None
2021-04-09 07:52:15,005:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2021-04-09 07:52:15,016:CRITICAL:CMSSW:Error running cmsRun
{'arguments': ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun3/cmsRun3-main.sh', '', u'slc7_amd64_gcc900', 'scramv1', 'CMSSW', 'CMSSW_11_2_2_patch1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '', '', '']}
CMSSW Return code: 0

2021-04-09 07:52:15,016:CRITICAL:CMSSW:Error message: None
2021-04-09 07:52:15,209:INFO:Report:addOutputFile method called with outputModule: MINIAODSIMoutput, aFile: None
2021-04-09 07:52:15,210:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2021-04-09 07:52:15,210:INFO:Report:addOutputFile method called with outputModule: RECOSIMoutput, aFile: None
2021-04-09 07:52:15,211:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2021-04-09 07:52:15,228:INFO:CMSSW:Steps.Executors.CMSSW.post called
2021-04-09 07:52:15,234:INFO:ExecuteMaster:StepName: cmsRun3, StepType: CMSSW, with result: 50116

where it actually reports the very same output file (e.g. MINIAODSIMoutput.root) for the same step (cmsRun3).

In short, there are two critical errors with this job:

  1. it returns the output file named after the output module instead of one created by the uuid algorithm (an illustration of the expected renaming follows below)
  2. it returns the same file multiple times for the same cmsRun step
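For context, the expected behavior is that the transient, module-named file gets renamed to a unique LFN before it leaves the worker node. A rough sketch of that intent, using Python's standard uuid module (the actual WMCore naming helper may differ, and the paths below are only examples):

import os
import uuid

def uniqueLFN(lfnBase, moduleFileName):
    """Illustrative only: derive a unique LFN from the transient module-named file."""
    ext = os.path.splitext(moduleFileName)[1]  # e.g. ".root"
    return "%s/%s%s" % (lfnBase.rstrip("/"), uuid.uuid4().hex, ext)

# uniqueLFN(".../GEN-SIM-RECO/.../230001", "RECOSIMoutput.root")
# -> ".../230001/<32-char-hex>.root" instead of ".../230001/RECOSIMoutput.root"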

@amaltaro amaltaro removed this from the July_2020 milestone Apr 12, 2021
@amaltaro
Contributor Author

Just fixed submit3, submit4, vocms0250, vocms0253 and vocms0281. If it keeps happening, we need to bring it to the top of the list, make a proper fix (or at least a temporary JobAccountant fix) and get all the agents patched.

@amaltaro amaltaro self-assigned this May 13, 2021
@amaltaro
Contributor Author

There is another wave of such failures in many agents. Will try to get a workaround in place to at least keep the agents stable.

@amaltaro
Contributor Author

amaltaro commented May 13, 2021

Based on the debugging done this morning, I have some interesting observations to make. First, let me share a dump of a job report:
http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9633/job_1019852.json

from submit7, for this job:
/storage/local/data1/cmsdataops/srv/wmagent/v1.4.7.patch2/install/wmagent/JobCreator/JobCache/cmsunified_task_TSG-Run3Winter21GS-00057__v1_T_210429_192814_4663/TSG-Run3Winter21GS-00057_0/JobCollection_65415_0/job_1019852/Report.1.pkl

now the facts:

  • every single step had a step status=0, meaning the job was successful
  • cmsRun3 actually had a hard crash, reported under the errors attribute (thus, this job should actually be marked as failed!)
  • only the cmsRun3 step reports output files, under the output attribute.
  • the cmsRun3 step actually reports the same output file twice (causing duplication!)
  • the names of the output files aren't unique - because the job failed before we could stage them out - another source of duplication in the agent
  • output files don't have any location associated with them, so the job is actually marked as failed, see:
    # Make sure every file has a valid location
    # see https://github.com/dmwm/WMCore/issues/9353
    for fwjrFile in fileList:
        # T0 has analysis file without any location, see:
        # https://github.com/dmwm/WMCore/issues/9497
        if not fwjrFile.get("locations") and fwjrFile.get("lfn", "").endswith(".root"):
            logging.warning("The following file doesn't have any location: %s", fwjrFile)
            jobSuccess = False
            break

Having said that, there are at least a few issues to be further investigated and fixed here:

  • why is the job considered successful if there was a crash in one of the cmsRun steps (apparently because only the step status value is taken into account...)? I think we should add a check for the errors attribute as well; a sketch of such a check follows this list.
  • why does the FJR report the same file twice?
  • if possible, we should not report output files named after their output module name. The best would be to guarantee that from the worker node; the second option is to have a check in the JobAccountant component.
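As a concrete (hypothetical) version of the first bullet, a success check over a json-dumped job report could combine the step status with the errors attribute. The dict layout below is simplified from the job_1019852.json dump linked above, and the jobSucceeded name is made up:

def jobSucceeded(fwjr):
    """Hypothetical check: succeed only if every step has status 0 AND no errors."""
    for step in fwjr.get("steps", {}).values():
        if step.get("status", 1) != 0:
            return False
        # a step can report status=0 and still carry entries under 'errors' (this bug)
        if step.get("errors"):
            return False
    return True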

@amaltaro
Contributor Author

Even though we have a workaround for this (well, better to call it a hack!), we should make sure it gets dealt with in the coming months. If not in Q4, then at the very beginning of Q1/2022.

@todor-ivanov
Contributor

Hi @amaltaro, it seems to me the most important question to answer here is why the cmsRun3 job failure was not accounted for and the step was marked as successful. All the rest sound like consequences. Even the LFN duplication, which seems to be just a default value used when the actual output was not provided (obviously because of the broken cmsRun3). So in this case I would say the duplicate error actually protects us (in a very uncomfortable way) rather than hurting us.

@amaltaro
Contributor Author

It's been a while since I looked into this, but I believe the last 3 bullets in the initial description go straight to the problems identified here.
I can't rule out the possibility of a duplicate LFN coming from a cmsRun step that was mistakenly marked as successful. This requires real debugging and tests to get a full understanding of the issue (I definitely do not have it now!)

@todor-ivanov
Contributor

todor-ivanov commented Oct 13, 2021

Ok! After properly formatting the error we got 3 months ago from this particular cmsRun3 job [1], we can see it is actually a WMException, and I am now confident that the error is thrown from this piece of code:

if returnCode != 0:
    argsDump = {'arguments': args}
    msg = "Error running cmsRun\n%s\n" % argsDump
    try:
        self.report.parse(jobReportXML, stepName=self.stepName)
        (returnCode, returnMessage) = self.report.getStepExitCodeAndMessage(stepName=self.stepName)
        msg += "CMSSW Return code: %s\n" % returnCode
    except Exception as ex:
        # If report parsing fails, report linux exit code
        msg += "Linux Return code: %s\n" % returnCode
    finally:
        logging.critical(msg)
        logging.critical("Error message: %s", returnMessage)
        self._setStatus(returnCode, returnMessage)
        raise WMExecutionFailure(returnCode, "CmsRunFailure", msg)

And I can also see how the error code from the cmsRun process got lost - we are basically overwriting the returnCode variable with which we entered the try/except block at the line:

(returnCode, returnMessage) = self.report.getStepExitCodeAndMessage(stepName=self.stepName)

Even if the method called there is expected to return the same returnCode and just populate the message properly in the tuple, overwriting the variable is a definite recipe for a bug. A sketch of how the overwrite could be avoided follows.
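A minimal sketch of how that overwrite could be avoided, reusing the names from the quoted snippet (this is just an illustration, not the actual fix that went into WMCore): only adopt the code parsed from the report when it is itself non-zero, and initialize returnMessage defensively so the finally block always has a value to log.

if returnCode != 0:
    argsDump = {'arguments': args}
    msg = "Error running cmsRun\n%s\n" % argsDump
    returnMessage = None
    try:
        self.report.parse(jobReportXML, stepName=self.stepName)
        stepCode, returnMessage = self.report.getStepExitCodeAndMessage(stepName=self.stepName)
        if stepCode != 0:
            # only trust the parsed report if it carries a real error code
            returnCode = stepCode
        msg += "CMSSW Return code: %s\n" % returnCode
    except Exception:
        # if report parsing fails, report the linux exit code
        msg += "Linux Return code: %s\n" % returnCode
    finally:
        logging.critical(msg)
        logging.critical("Error message: %s", returnMessage)
        self._setStatus(returnCode, returnMessage)
        raise WMExecutionFailure(returnCode, "CmsRunFailure", msg)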

[1]

 Adding last 25 lines of CMSSW stdout:
 Time Summary: 
 - Min event:   4.25431
 - Max event:   70.563
 - Avg event:   17.8879
 - Total loop:  1478.21
 - Total init:  60.2161
 - Total job:   1550.93
 - EventSetup Lock:   0
 - EventSetup Get:   0
 Event Throughput: 0.435662 ev/s
 CPU Summary: 
 - Total loop:  11614.3
 - Total init:  53.2713
 - Total extra: 0
 - Total job:   11681.7
 Processing Summary: 
 - Number of Events:  644
 - Number of Global Begin Lumi Calls:  1
 - Number of Global Begin Run Calls: 1



Fatal system signal has occurred during exit
Complete
process id is 2406 status is 134
<@========== WMException Start ==========@>
Exception Class: CmsRunFailure
Message: Error running cmsRun
{'arguments': ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun3/cmsRun3-main.sh', '', u'slc7_amd64_gcc900', 'scramv1', 'CMSSW', 'CMSSW_11_2_2_patch1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '', '', '']}
CMSSW Return code: 0

	ModuleName : WMCore.WMSpec.Steps.WMExecutionFailure
	MethodName : __init__
	ClassInstance : None
	FileName : /srv/job/WMCore.zip/WMCore/WMSpec/Steps/WMExecutionFailure.py
	ClassName : None
	LineNumber : 18
	ErrorNr : 0

Traceback: 

<@---------- WMException End ----------@>

@todor-ivanov
Contributor

todor-ivanov commented Oct 13, 2021

And indeed, the first thing this getStepExitCodeAndMessage method does is set a default returnCode of 0, and only then does it enter the rest of its logic (see the method in FwkJobReport/Report.py).

So any further failure or misbehavior during the calls to the next two methods:
reportStep = self.retrieveStep(stepName)
errorCount = getattr(reportStep.errors, "errorCount", 0)

may lead to an error miscount, skip the rest of the method's logic, and directly return the returnCode that was just set to 0, hence overwriting the value with which we entered the outer/calling try/except block, as I mentioned in my previous comment. A schematic reconstruction of this failure mode is shown below.
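To make the failure mode explicit, here is a schematic reconstruction (not the actual WMCore method body) of how defaulting to 0 can mask the real exit code when no errors are attached to the step:

def getStepExitCodeAndMessage(report, stepName):
    """Schematic only: shows why a default of 0 can leak out as the final code."""
    returnCode = 0
    returnMessage = None
    reportStep = report.retrieveStep(stepName)
    errorCount = getattr(reportStep.errors, "errorCount", 0)
    if errorCount == 0:
        # nothing recorded under 'errors' => we hand back 0,
        # even if cmsRun itself crashed hard
        return returnCode, returnMessage
    # ... otherwise pick the code/message from the recorded errors ...
    return returnCode, returnMessage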

FYI @amaltaro

@todor-ivanov
Contributor

todor-ivanov commented Oct 13, 2021

And if we look at the dump of the last 25 lines of the CMSSW stdout (from the error message in the comments above), we notice:

Fatal system signal has occurred during exit
Complete
process id is 2406 status is 134

Which to me sounds like the process has been killed by the worker (status 134 corresponds to 128 + 6, i.e. SIGABRT).

@amaltaro
Contributor Author

Hi Todor, your comment regarding the returnCode in the Steps/Executors/CMSSW.py module is very good. I have already seen a problem related to that code in testbed a few weeks ago, but didn't manage to proceed and collect further information.

I definitely think that code needs further review and some minor changes. One example: I'm not 100% convinced that a subprocess.call exit code of 0 means the cmsRun process had an exit code equal to 0 as well (as obtained from getStepExitCodeAndMessage()); the snippet below illustrates the concern.
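Purely as an illustration of that concern (the wrapper script path is made up, and the real behaviour depends on how cmsRun3-main.sh propagates its child's status): subprocess.call only reports the wrapper's exit status, so a non-zero cmsRun status can be lost if the wrapper does not end with something like exit $?.

import subprocess

# The executor launches a bash wrapper, so we only see the wrapper's status here.
returnCode = subprocess.call(["/bin/bash", "cmsRun3-main.sh"])  # path is illustrative
print("wrapper exit code:", returnCode)  # 0 does not prove that cmsRun returned 0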

However, that logic in getStepExitCodeAndMessage from FwkJobReport/Report.py seems to be okay to me. If there is a failure within that method, the exception would be raised and propagated upstream, failing the whole CMSSW parsing on the worker node.

@amaltaro
Contributor Author

And this seems no longer to be specific to StepChain workflows!

We just had another flood of duplicate LFNs in the RelVal agent (vocms0259), and while running that script to remove the job reports which create duplicate LFNs, I found this TaskChain workflow:
pdmvserv_RVCMSSW_12_3_0_pre6PromptCosmics__RelVal_2021GR5_220311_120949_2532

with the following list of duplicate LFNs:

    "dup_lfns": [
      "/store/unmerged/CMSSW_12_3_0_pre6/ExpressCosmics/DQMIO/123X_dataRun3_Prompt_v5_RelVal_2021GR5-v1/10019/DQMoutput.root",
      "/store/unmerged/CMSSW_12_3_0_pre6/ExpressCosmics/DQMIO/123X_dataRun3_Prompt_v5_RelVal_2021GR5-v1/10019/DQMoutput.root",
      "/store/unmerged/CMSSW_12_3_0_pre6/ExpressCosmics/RECO/123X_dataRun3_Prompt_v5_RelVal_2021GR5-v1/10019/RECOoutput.root",
      "/store/unmerged/CMSSW_12_3_0_pre6/ExpressCosmics/RECO/123X_dataRun3_Prompt_v5_RelVal_2021GR5-v1/10019/RECOoutput.root"
    ]

as can be seen in the report dictionary under:

/data/srv/wmagent/v1.5.7.patch1/install/wmagentpy3/JobCreator/JobCache/pdmvserv_RVCMSSW_12_3_0_pre6PromptCosmics__RelVal_2021GR5_220311_120949_2532/RECOCOSDPROMPTRUN3/JobCollection_21251_0/job_199838/Report.0.pkl

One can find the output of the removeDupJobAccountant.py here: https://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9633/dupPickles.json

In short, it looks like the output section is duplicated(?), in addition to the output files being named after their output module. Conceptually, the duplicate check that trips JobAccountant boils down to the sketch below.
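Illustrative only (this is not the actual removeDupJobAccountant.py, and the dict layout is simplified from the json dumps linked above): count how often each output LFN appears across a set of job reports and flag anything seen more than once.

from collections import Counter

def findDuplicateLFNs(reportDicts):
    """Hypothetical helper: return LFNs reported more than once across job reports."""
    counts = Counter()
    for report in reportDicts:
        for step in report.get("steps", {}).values():
            for moduleFiles in step.get("output", {}).values():
                for fwjrFile in moduleFiles:
                    lfn = fwjrFile.get("lfn")
                    if lfn:
                        counts[lfn] += 1
    return [lfn for lfn, n in counts.items() if n > 1]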

@amaltaro
Contributor Author

When the message above was posted, we apparently had an issue with that script which was not considering unique jobs, thus sometimes reporting false duplicate files. That issue was fixed by Kenyi around a month ago.

Now that we have had many agents with JobAccountant down, I see that this issue is back in WMAgent 2.0.2.patchX and it's apparently affecting StepChain. An example of bad job output is:

Duplicate files are:
['/store/unmerged/Run3Winter21DRMiniAOD/DYJetsToMuMu_M-50_TuneCP5_14TeV-madgraphMLM-pythia8/ALCARECO/TkAlZMuMu-designGT_112X_mcRun3_2021_realistic_v16_ext1-v2/2820025/ALCARECOStreamTkAlZMuMu.root', '/store/unmerged/Run3Winter21DRMiniAOD/DYJetsToMuMu_M-50_TuneCP5_14TeV-madgraphMLM-pythia8/AODSIM/designGT_112X_mcRun3_2021_realistic_v16_ext1-v2/2820025/AODSIMoutput.root', '/store/unmerged/Run3Winter21DRMiniAOD/DYJetsToMuMu_M-50_TuneCP5_14TeV-madgraphMLM-pythia8/MINIAODSIM/designGT_112X_mcRun3_2021_realistic_v16_ext1-v2/2820025/MINIAODSIMoutput.root']

I restarted the agents many times today, so we should reconsider this issue for this quarter and try to get this bug properly understood and fixed before a new WMAgent stable release is made, in June.

@amaltaro amaltaro removed their assignment May 12, 2022