Workflows with no errors #11771

Closed
lucalavezzo opened this issue Oct 24, 2023 · 5 comments · Fixed by #11786

@lucalavezzo

Impact of the bug
Makes error handling for P&R very hard, as we rely on WM for error reports.

Describe the bug
Some workflows (e.g. https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_BPH-RunIISummer20UL18GEN-00237) do not seem to report any errors, even though we do see errors in the job log and in the WorkloadSummary (https://cmsweb.cern.ch/couchdb/workloadsummary/_design/WorkloadSummary/_show/histogramByWorkflow/pdmvserv_task_BPH-RunIISummer20UL18GEN-00237__v1_T_230926_104347_8763).

@jenimal

jenimal commented Oct 25, 2023

If you look at WMStats, it shows 73 failures, but when you click on the "L" panel, no failures show up to dig into, so we are running blind.

@khurtado
Contributor

khurtado commented Oct 27, 2023

There are indeed jobs failing. For the workflow example above, the condor monitoring also shows these errors:

https://es-cms.cern.ch/dashboards/app/discover#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-2M,to:now))&_a=(columns:!(_source),filters:!(),index:'0f3117a0-d2e6-11ed-bdae-e3cf8c07f2fc',interval:auto,query:(language:kuery,query:'WMAgent_RequestName:"pdmvserv_task_BPH-RunIISummer20UL18GEN-00237__v1_T_230926_104347_8763"%20and%20ExitCode:8001'),sort:!())

So there is an exception ("Other CMS exception"), exit code 8001:

 An exception of category 'ExternalFailed' occurred while
   [0] Processing global begin LuminosityBlock run: 1 luminosityBlock: 1080
   [1] Calling method for module ExternalGeneratorFilter/'generator'
Exception Message:
failed waiting for external process

Somehow, the job report is still produced, but it has no steps because of the exception above.

The JobAccountant tries to load the job report and notices the bad pkl reports:

2023-10-17 19:45:54,864:140693643872000:INFO:AccountantWorker:Handling /storage/local/data1/cmsdataops/srv/wmagent/v2.2.5/install/wmagentpy3/JobCreator/JobCache/pdmvserv_task_BPH-RunIISummer20UL18GEN-00237__v1_T_230926_104347_8763/BPH-RunIISummer20UL18GEN-00237_0/JobCollection_27562_0/job_307022/Report.0.pkl
2023-10-17 19:45:54,866:140693643872000:WARNING:AccountantWorker:Job 307022 , bad jobReport, failing job
2023-10-17 19:45:54,869:140693643872000:INFO:AccountantWorker:Handling /storage/local/data1/cmsdataops/srv/wmagent/v2.2.5/install/wmagentpy3/JobCreator/JobCache/pdmvserv_task_BPH-RunIISummer20UL18GEN-00237__v1_T_230926_104347_8763/BPH-RunIISummer20UL18GEN-00237_0/JobCollection_27562_0/job_307023/Report.0.pkl
2023-10-17 19:45:54,871:140693643872000:WARNING:AccountantWorker:Job 307023 , bad jobReport, failing job
2023-10-18 06:15:26,684:140693643872000:ERROR:AccountantWorker:FwkJobReport with no steps: /storage/local/data1/cmsdataops/srv/wmagent/v2.2.5/install/wmagentpy3/JobCreator/JobCache/pdmvserv_task_BPH-RunIISummer20UL18GEN-00237__v1_T_230926_104347_8763/BPH-RunIISummer20UL18GEN-00237_0/JobCollection_30739_0/job_345960/Report.0.pkl
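
For reference, here is a rough sketch (untested against this agent) of how one could inspect one of those pickled reports on the agent node to confirm it really has no steps. It assumes the WMCore environment is sourced, and that the Report API exposes the load()/listSteps() calls the same way AccountantWorker uses them:

from WMCore.FwkJobReport.Report import Report

# Shorten to the real path on the agent; "..." is just a placeholder here.
reportPath = ".../JobCreator/JobCache/.../JobCollection_27562_0/job_307023/Report.0.pkl"

report = Report()
report.load(reportPath)                        # unpickle the framework job report
print("steps in report:", report.listSteps())  # [] would explain "FwkJobReport with no steps"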

Could the JobAccountant errors somehow be coupled with the fact that WMStats does not show the job errors in the "L" panel?

The WMAgent log itself shows the following for job 307023:

$ /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/p/pdmvserv_task_BPH-RunIISummer20UL18GEN-00237__v1_T_230926_104347_8763/JobCluster_307/Job_307023.tar.bz2

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH
/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/bin/python3 -m WMCore.WMRuntime.ScriptInvoke WMTaskSpace.cmsRun1 SetupCMSSWPset

2023-10-18 00:34:46,249:INFO:CMSSW:RUNNING SCRAM SCRIPTS
2023-10-18 00:34:46,249:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'slc7_amd64_gcc700', 'scramv1', 'CMSSW', 'CMSSW_10_6_30_patch1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']
2023-10-18 00:39:18,864:INFO:PerformanceMonitor:PSS: 161746; RSS: 288936; PCPU: 92.1; PMEM: 0.1
2023-10-18 00:40:21,761:INFO:Report:addOutputFile method called with outputModule: RAWSIMoutput, aFile: None
2023-10-18 00:40:21,761:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2023-10-18 00:40:21,768:ERROR:XMLParser:Tried to divide by zero doing storage statistics report parsing.
2023-10-18 00:40:21,768:ERROR:XMLParser:Either you aren't reading and writing data, or you aren't reporting it.
2023-10-18 00:40:21,768:ERROR:XMLParser:Not adding any storage performance info to report.
2023-10-18 00:40:21,769:CRITICAL:CMSSW:Error running cmsRun
{'arguments': ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'slc7_amd64_gcc700', 'scramv1', 'CMSSW', 'CMSSW_10_6_30_patch1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']}
CMSSW Return code: 8001

2023-10-18 00:40:21,769:CRITICAL:CMSSW:Error message: An exception of category 'ExternalFailed' occurred while
   [0] Processing global begin LuminosityBlock run: 1 luminosityBlock: 2
   [1] Calling method for module ExternalGeneratorFilter/'generator'
Exception Message:
failed waiting for external process
2023-10-18 00:40:21,780:INFO:CMSSW:Step cmsRun1: Chirp_WMCore_cmsRun_ExitCode 8001
2023-10-18 00:40:21,804:INFO:CMSSW:Step cmsRun1: Chirp_WMCore_cmsRun1_ExitCode 8001
2023-10-18 00:40:21,852:CRITICAL:CMSSW:CMSDefaultHandler Diagnostic Handler invoked
2023-10-18 00:40:21,855:WARNING:Report:Exit code: 8001 has been already added to the job report
2023-10-18 00:40:21,855:INFO:Report:addOutputFile method called with outputModule: RAWSIMoutput, aFile: None
2023-10-18 00:40:21,855:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2023-10-18 00:40:21,859:ERROR:XMLParser:Tried to divide by zero doing storage statistics report parsing.
2023-10-18 00:40:21,859:ERROR:XMLParser:Either you aren't reading and writing data, or you aren't reporting it.
2023-10-18 00:40:21,860:ERROR:XMLParser:Not adding any storage performance info to report.
2023-10-18 00:40:21,860:WARNING:Report:Exit code: 8001 has been already added to the job report

I couldn't find anything relevant in the reqmon logs.

Tagging @amaltaro

@lucalavezzo
Author

Hi @khurtado, thank you very much for the detailed investigation. It would of course be great to resolve this, but since this issue does seem to happen specifically for the error

 An exception of category 'ExternalFailed' occurred while
   [0] Processing global begin LuminosityBlock run: 1 luminosityBlock: 1080
   [1] Calling method for module ExternalGeneratorFilter/'generator'
Exception Message:
failed waiting for external process

we can try to submit some ACDCs for this and see if they work, and then apply that solution to the rest of them.

@haozturk

haozturk commented Nov 3, 2023

@amaltaro Here is the workflow we looked at in the meeting: pdmvserv_task_PPS-Run3Summer22EEMiniAODv4-00177__v1_T_231101_141258_2773

Others: http://cmsworkflow-frontend.cern.ch/search?search=exitCode%20=%20-2 (you need to be on the CERN network)

@amaltaro
Contributor

amaltaro commented Nov 3, 2023

As discussed in the meeting today, I decided to look into the same workflow that Hasan pasted above, which was executed only by vocms0255.
I confirm that I do see around 10 jobs processed by JobAccountant (different from the single job reported in WMStats); some have failed, e.g.:

2023-11-03 06:24:05,435:140461312231168:INFO:AccountantWorker:Handling /data/srv/wmagent/v2.2.5/install/wmagentpy3/JobCreator/JobCache/pdmvserv_task_PPS-Run3Summer22EEMiniAODv4-00177__v1_T_231101_141258_2773/PPS-Run3Summer22EEMiniAODv4-00177_0/JobCollection_257301_0/job_2393084/Report.0.pkl
2023-11-03 06:24:05,437:140461312231168:WARNING:AccountantWorker:Job 2393084 , bad jobReport, failing job

and one or more jobs succeeded, but had this strange (unknown to me) message:

2023-11-03 07:48:31,373:140461312231168:INFO:AccountantWorker:Handling /data/srv/wmagent/v2.2.5/install/wmagentpy3/JobCreator/JobCache/pdmvserv_task_PPS-Run3Summer22EEMiniAODv4-00177__v1_T_231101_141258_2773/PPS-Run3Summer22EEMiniAODv4-00177_0/LogCollectForPPS-Run3Summer22EEMiniAODv4-00177_0/JobCollection_259900_0/job_2406134/Report.0.pkl
2023-11-03 07:48:31,375:140461312231168:INFO:AccountantWorker:Job 2406134 , handle successful job
2023-11-03 07:48:31,391:140461312231168:ERROR:AccountantWorker:job report is missing for updating log archive mapping
 Input file list
 [{'lfn': '/store/unmerged/logs/prod/2023/11/3/pdmvserv_task_PPS-Run3Summer22EEMiniAODv4-00177__v1_T_231101_141258_2773/PPS-Run3Summer22EEMiniAODv4-00177_0/0000/0/77d15e8e-5963-4d50-8859-90bd21b83460-9-0-logArchive.tar.gz', 'size': 0, 'events': 0, 'checksums': {}, 'runs': set(), 'merged': False, 'last_event': 0, 'first_event': 0, 'locations': set(), 'parents': set(), 'pfn': None, 'branches': [], 'catalog': None, 'guid': None, 'input_source_class': None, 'module_label': None, 'input_type': None},

Now, what worries me even more is that JobAccountant keeps trying to upload documents to CouchDB that exceed the configured size limit, hence raising an exception like:

2023-11-03 23:02:21,868:140461312231168:WARNING:CMSCouch:queue larger than 250 records, committing
2023-11-03 23:02:23,699:140461312231168:ERROR:ChangeState:Error updating job in couch: Error type: CouchRequestTooLargeError, Status code: 413, Reason: Request Entity Too Large, Data: "Document has 61425196 bytes and it's too large to be 
accepted by CouchDB. Check the CouchDB configuration to see the current value under 'couchdb.max_document_size' (default is 8M bytes)."
Traceback (most recent call last):
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 133, in makeRequest
    result, status, reason, cached = JSONRequests.makeRequest(
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 185, in makeRequest
    result, response = self.makeRequest_pycurl(uri, data, verb, headers)
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 202, in makeRequest_pycurl
    response, result = self.reqmgr.request(uri, data, headers, verb=verb,
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/Utils/PortForward.py", line 68, in portMangle
    return callFunc(callObj, url, *args, **kwargs)
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 353, in request
    raise exc
http.client.HTTPException: url=http://localhost:5984/wmagent_jobdump%2Ffwjrs/_bulk_docs/, code=413, reason=Request Entity Too Large, headers={'Connection': 'close', 'Date': 'Fri, 03 Nov 2023 22:02:23 GMT', 'Server': 'CouchDB/3.2.2 (Erlang
 OTP/22)', 'Cache-Control': 'must-revalidate', 'Content-Length': '51', 'Content-Type': 'application/json', 'X-Couch-Request-ID': '3e1a52900a', 'X-CouchDB-Body-Time': '70'}, result=b'{"error":"document_too_large","reason":"451940-0"}\n'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/JobStateMachine/ChangeState.py", line 207, in propagate
    self.recordInCouch(jobs, newstate, oldstate, updatesummary)
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/JobStateMachine/ChangeState.py", line 404, in recordInCouch
    self.fwjrdatabase.queue(fwjrDocument, timestamp=True, callback=discardConflictingDocument)
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 251, in queue
    self.commit(viewlist=viewlist, callback=callback)
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 317, in commit
    retval = self.post(uri, data)
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 154, in post
    return self.makeRequest(uri, data, 'POST', incoming_headers,
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 137, in makeRequest
    self.checkForCouchError(getattr(e, "status", None),
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 169, in checkForCouchError
    raise CouchRequestTooLargeError(reason, data, result, status)
WMCore.Database.CMSCouch.CouchRequestTooLargeError: Error type: CouchRequestTooLargeError, Status code: 413, Reason: Request Entity Too Large, Data: "Document has 61425196 bytes and it's too large to be accepted by CouchDB. Check the CouchDB configuration to see the current value under 'couchdb.max_document_size' (default is 8M bytes)."
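
For reference, a quick way to double-check the limit that the 413 above refers to is to query the node-local config API of the agent CouchDB (sketch below; CouchDB normally requires admin credentials for the config endpoints, so authentication may need to be added):

import json
import urllib.request

# CouchDB 3.x node config endpoint on the agent (localhost:5984, as in the log above).
url = "http://localhost:5984/_node/_local/_config/couchdb/max_document_size"
with urllib.request.urlopen(url) as resp:
    limit = json.load(resp)  # returned as a JSON string, e.g. "8000000" (bytes)
    print("max_document_size:", limit)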

This issue was supposedly fixed by PR #11502.

My feeling is that failing to inject this document over and over is making the CouchDB replication unstable.
I am going to investigate the following now:

  1. why #11502 isn't automatically resolving this issue by setting the fwjr attribute to an empty string (or dict); see the sketch after this list
  2. which workflow is actually creating a 60 MB(!!!) JSON document.
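
As a starting point for item (1), here is a rough sketch of the kind of guard that could cap the document before ChangeState queues it for CouchDB. The 'fwjr' key name and the 8 MB default are assumptions based on the discussion above, not an excerpt of #11502:

import json

# CouchDB's default max_document_size is 8M bytes (per the error above);
# the agent value may differ, so treat this as a configurable knob.
MAX_DOC_SIZE = 8 * 1000 * 1000

def shrinkOversizedFwjr(fwjrDocument, limit=MAX_DOC_SIZE):
    """If the serialized document exceeds the limit, drop the report payload."""
    docSize = len(json.dumps(fwjrDocument).encode("utf-8"))
    if docSize > limit:
        fwjrDocument["fwjr"] = {}             # assumed key holding the framework job report
        fwjrDocument["fwjrTruncated"] = True  # hypothetical marker for later debugging
    return fwjrDocument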
