Workflows with no errors #11771

Closed
lucalavezzo opened this issue Oct 24, 2023 · 5 comments · Fixed by #11786

@lucalavezzo

Impact of the bug
Makes error handling for P&R very hard, as we rely on WM for error reports.

Describe the bug
Some workflows (e.g. https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_BPH-RunIISummer20UL18GEN-00237) do not seem to report any errors, even though we do see errors in the job log and in the WorkloadSummary (https://cmsweb.cern.ch/couchdb/workloadsummary/_design/WorkloadSummary/_show/histogramByWorkflow/pdmvserv_task_BPH-RunIISummer20UL18GEN-00237__v1_T_230926_104347_8763).

@jenimal

jenimal commented Oct 25, 2023

If you look at WMStats, it shows 73 failures, but when you click on the "L" panel, no failures show up to dig into, so we are running blind.

@khurtado
Contributor

khurtado commented Oct 27, 2023

There are indeed jobs failing. For the workflow example above, the condor monitoring also shows these errors:

https://es-cms.cern.ch/dashboards/app/discover#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-2M,to:now))&_a=(columns:!(_source),filters:!(),index:'0f3117a0-d2e6-11ed-bdae-e3cf8c07f2fc',interval:auto,query:(language:kuery,query:'WMAgent_RequestName:"pdmvserv_task_BPH-RunIISummer20UL18GEN-00237__v1_T_230926_104347_8763"%20and%20ExitCode:8001'),sort:!())

So there is an exception ("Other CMS exception"), exit code 8001:

 An exception of category 'ExternalFailed' occurred while
   [0] Processing global begin LuminosityBlock run: 1 luminosityBlock: 1080
   [1] Calling method for module ExternalGeneratorFilter/'generator'
Exception Message:
failed waiting for external process

Somehow, the job report is still produced, but it has no steps because of the exception above.

The JobAccountant tries to load the job report and notices the bad pkl reports:

2023-10-17 19:45:54,864:140693643872000:INFO:AccountantWorker:Handling /storage/local/data1/cmsdataops/srv/wmagent/v2.2.5/install/wmagentpy3/JobCreator/JobCache/pdmvserv_task_BPH-RunIISummer20UL18GEN-00237__v1_T_230926_104347_8763/BPH-RunIISummer20UL18GEN-00237_0/JobCollection_27562_0/job_307022/Report.0.pkl
2023-10-17 19:45:54,866:140693643872000:WARNING:AccountantWorker:Job 307022 , bad jobReport, failing job
2023-10-17 19:45:54,869:140693643872000:INFO:AccountantWorker:Handling /storage/local/data1/cmsdataops/srv/wmagent/v2.2.5/install/wmagentpy3/JobCreator/JobCache/pdmvserv_task_BPH-RunIISummer20UL18GEN-00237__v1_T_230926_104347_8763/BPH-RunIISummer20UL18GEN-00237_0/JobCollection_27562_0/job_307023/Report.0.pkl
2023-10-17 19:45:54,871:140693643872000:WARNING:AccountantWorker:Job 307023 , bad jobReport, failing job
2023-10-18 06:15:26,684:140693643872000:ERROR:AccountantWorker:FwkJobReport with no steps: /storage/local/data1/cmsdataops/srv/wmagent/v2.2.5/install/wmagentpy3/JobCreator/JobCache/pdmvserv_task_BPH-RunIISummer20UL18GEN-00237__v1_T_230926_104347_8763/BPH-RunIISummer20UL18GEN-00237_0/JobCollection_30739_0/job_345960/Report.0.pkl
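
For reference, here is a rough sketch (untested against this agent) of how one could inspect one of those pickled reports on the agent node to confirm it really has no steps. It assumes the WMCore environment is sourced, and that the Report API exposes the load()/listSteps() calls the same way AccountantWorker uses them:

from WMCore.FwkJobReport.Report import Report

# Shorten to the real path on the agent; "..." is just a placeholder here.
reportPath = ".../JobCreator/JobCache/.../JobCollection_27562_0/job_307023/Report.0.pkl"

report = Report()
report.load(reportPath)                        # unpickle the framework job report
print("steps in report:", report.listSteps())  # [] would explain "FwkJobReport with no steps"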

Could the JobAccountant errors somehow be coupled with the fact that WMStats does not show the job errors in the "L" panel?

The WMAgent log itself shows the following for job 307023:

$ /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/p/pdmvserv_task_BPH-RunIISummer20UL18GEN-00237__v1_T_230926_104347_8763/JobCluster_307/Job_307023.tar.bz2

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH
/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/bin/python3 -m WMCore.WMRuntime.ScriptInvoke WMTaskSpace.cmsRun1 SetupCMSSWPset

2023-10-18 00:34:46,249:INFO:CMSSW:RUNNING SCRAM SCRIPTS
2023-10-18 00:34:46,249:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'slc7_amd64_gcc700', 'scramv1', 'CMSSW', 'CMSSW_10_6_30_patch1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']
2023-10-18 00:39:18,864:INFO:PerformanceMonitor:PSS: 161746; RSS: 288936; PCPU: 92.1; PMEM: 0.1
2023-10-18 00:40:21,761:INFO:Report:addOutputFile method called with outputModule: RAWSIMoutput, aFile: None
2023-10-18 00:40:21,761:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2023-10-18 00:40:21,768:ERROR:XMLParser:Tried to divide by zero doing storage statistics report parsing.
2023-10-18 00:40:21,768:ERROR:XMLParser:Either you aren't reading and writing data, or you aren't reporting it.
2023-10-18 00:40:21,768:ERROR:XMLParser:Not adding any storage performance info to report.
2023-10-18 00:40:21,769:CRITICAL:CMSSW:Error running cmsRun
{'arguments': ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'slc7_amd64_gcc700', 'scramv1', 'CMSSW', 'CMSSW_10_6_30_patch1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']}
CMSSW Return code: 8001

2023-10-18 00:40:21,769:CRITICAL:CMSSW:Error message: An exception of category 'ExternalFailed' occurred while
   [0] Processing global begin LuminosityBlock run: 1 luminosityBlock: 2
   [1] Calling method for module ExternalGeneratorFilter/'generator'
Exception Message:
failed waiting for external process
2023-10-18 00:40:21,780:INFO:CMSSW:Step cmsRun1: Chirp_WMCore_cmsRun_ExitCode 8001
2023-10-18 00:40:21,804:INFO:CMSSW:Step cmsRun1: Chirp_WMCore_cmsRun1_ExitCode 8001
2023-10-18 00:40:21,852:CRITICAL:CMSSW:CMSDefaultHandler Diagnostic Handler invoked
2023-10-18 00:40:21,855:WARNING:Report:Exit code: 8001 has been already added to the job report
2023-10-18 00:40:21,855:INFO:Report:addOutputFile method called with outputModule: RAWSIMoutput, aFile: None
2023-10-18 00:40:21,855:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2023-10-18 00:40:21,859:ERROR:XMLParser:Tried to divide by zero doing storage statistics report parsing.
2023-10-18 00:40:21,859:ERROR:XMLParser:Either you aren't reading and writing data, or you aren't reporting it.
2023-10-18 00:40:21,860:ERROR:XMLParser:Not adding any storage performance info to report.
2023-10-18 00:40:21,860:WARNING:Report:Exit code: 8001 has been already added to the job report

I couldn't find anything relevant in the reqmon logs.

Tagging @amaltaro

@lucalavezzo
Author

Hi @khurtado, thank you very much for the detailed investigation. It would of course be great to resolve this, but since this issue does seem to happen specifically for the error

 An exception of category 'ExternalFailed' occurred while
   [0] Processing global begin LuminosityBlock run: 1 luminosityBlock: 1080
   [1] Calling method for module ExternalGeneratorFilter/'generator'
Exception Message:
failed waiting for external process

we can try to submit some ACDCs for this and see if they work, and then apply that solution to the rest of them.

@haozturk

haozturk commented Nov 3, 2023

@amaltaro Here is the workflow we looked at in the meeting: pdmvserv_task_PPS-Run3Summer22EEMiniAODv4-00177__v1_T_231101_141258_2773

Others: http://cmsworkflow-frontend.cern.ch/search?search=exitCode%20=%20-2 (you need to be on the CERN network)

@amaltaro
Contributor

amaltaro commented Nov 3, 2023

As discussed in the meeting today, I decided to look into the same workflow that Hasan pasted above, which was executed only by vocms0255.
I confirm that I do see around 10 jobs processed by JobAccountant (different from the single job reported in WMStats); some have failed, e.g.:

2023-11-03 06:24:05,435:140461312231168:INFO:AccountantWorker:Handling /data/srv/wmagent/v2.2.5/install/wmagentpy3/JobCreator/JobCache/pdmvserv_task_PPS-Run3Summer22EEMiniAODv4-00177__v1_T_231101_141258_2773/PPS-Run3Summer22EEMiniAODv4-00177_0/JobCollection_257301_0/job_2393084/Report.0.pkl
2023-11-03 06:24:05,437:140461312231168:WARNING:AccountantWorker:Job 2393084 , bad jobReport, failing job

and one or more jobs succeeded, but had this strange (unknown to me) message:

2023-11-03 07:48:31,373:140461312231168:INFO:AccountantWorker:Handling /data/srv/wmagent/v2.2.5/install/wmagentpy3/JobCreator/JobCache/pdmvserv_task_PPS-Run3Summer22EEMiniAODv4-00177__v1_T_231101_141258_2773/PPS-Run3Summer22EEMiniAODv4-00177_0/LogCollectForPPS-Run3Summer22EEMiniAODv4-00177_0/JobCollection_259900_0/job_2406134/Report.0.pkl
2023-11-03 07:48:31,375:140461312231168:INFO:AccountantWorker:Job 2406134 , handle successful job
2023-11-03 07:48:31,391:140461312231168:ERROR:AccountantWorker:job report is missing for updating log archive mapping
 Input file list
 [{'lfn': '/store/unmerged/logs/prod/2023/11/3/pdmvserv_task_PPS-Run3Summer22EEMiniAODv4-00177__v1_T_231101_141258_2773/PPS-Run3Summer22EEMiniAODv4-00177_0/0000/0/77d15e8e-5963-4d50-8859-90bd21b83460-9-0-logArchive.tar.gz', 'size': 0, 'events': 0, 'checksums': {}, 'runs': set(), 'merged': False, 'last_event': 0, 'first_event': 0, 'locations': set(), 'parents': set(), 'pfn': None, 'branches': [], 'catalog': None, 'guid': None, 'input_source_class': None, 'module_label': None, 'input_type': None},

Now, what worries me even more is that JobAccountant keeps trying to upload documents to CouchDB that exceed the configured size limit, hence raising an exception like:

2023-11-03 23:02:21,868:140461312231168:WARNING:CMSCouch:queue larger than 250 records, committing
2023-11-03 23:02:23,699:140461312231168:ERROR:ChangeState:Error updating job in couch: Error type: CouchRequestTooLargeError, Status code: 413, Reason: Request Entity Too Large, Data: "Document has 61425196 bytes and it's too large to be 
accepted by CouchDB. Check the CouchDB configuration to see the current value under 'couchdb.max_document_size' (default is 8M bytes)."
Traceback (most recent call last):
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 133, in makeRequest
    result, status, reason, cached = JSONRequests.makeRequest(
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 185, in makeRequest
    result, response = self.makeRequest_pycurl(uri, data, verb, headers)
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 202, in makeRequest_pycurl
    response, result = self.reqmgr.request(uri, data, headers, verb=verb,
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/Utils/PortForward.py", line 68, in portMangle
    return callFunc(callObj, url, *args, **kwargs)
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 353, in request
    raise exc
http.client.HTTPException: url=http://localhost:5984/wmagent_jobdump%2Ffwjrs/_bulk_docs/, code=413, reason=Request Entity Too Large, headers={'Connection': 'close', 'Date': 'Fri, 03 Nov 2023 22:02:23 GMT', 'Server': 'CouchDB/3.2.2 (Erlang
 OTP/22)', 'Cache-Control': 'must-revalidate', 'Content-Length': '51', 'Content-Type': 'application/json', 'X-Couch-Request-ID': '3e1a52900a', 'X-CouchDB-Body-Time': '70'}, result=b'{"error":"document_too_large","reason":"451940-0"}\n'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/JobStateMachine/ChangeState.py", line 207, in propagate
    self.recordInCouch(jobs, newstate, oldstate, updatesummary)
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/JobStateMachine/ChangeState.py", line 404, in recordInCouch
    self.fwjrdatabase.queue(fwjrDocument, timestamp=True, callback=discardConflictingDocument)
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 251, in queue
    self.commit(viewlist=viewlist, callback=callback)
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 317, in commit
    retval = self.post(uri, data)
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 154, in post
    return self.makeRequest(uri, data, 'POST', incoming_headers,
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 137, in makeRequest
    self.checkForCouchError(getattr(e, "status", None),
  File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 169, in checkForCouchError
    raise CouchRequestTooLargeError(reason, data, result, status)
WMCore.Database.CMSCouch.CouchRequestTooLargeError: Error type: CouchRequestTooLargeError, Status code: 413, Reason: Request Entity Too Large, Data: "Document has 61425196 bytes and it's too large to be accepted by CouchDB. Check the CouchDB configuration to see the current value under 'couchdb.max_document_size' (default is 8M bytes)."
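
For reference, a quick way to double-check the limit that the 413 above refers to is to query the node-local config API of the agent CouchDB (sketch below; CouchDB normally requires admin credentials for the config endpoints, so authentication may need to be added):

import json
import urllib.request

# CouchDB 3.x node config endpoint on the agent (localhost:5984, as in the log above).
url = "http://localhost:5984/_node/_local/_config/couchdb/max_document_size"
with urllib.request.urlopen(url) as resp:
    limit = json.load(resp)  # returned as a JSON string, e.g. "8000000" (bytes)
    print("max_document_size:", limit)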

This issue was supposedly fixed by PR #11502.

My feeling is that failing to inject this document over and over is making the CouchDB replication unstable.
I am going to investigate the following now:

  1. why #11502 isn't automatically resolving this issue by setting the fwjr attribute to an empty string (or dict); see the sketch after this list
  2. which workflow is actually creating a 60 MB(!!!) JSON document.
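
As a starting point for item (1), here is a rough sketch of the kind of guard that could cap the document before ChangeState queues it for CouchDB. The 'fwjr' key name and the 8 MB default are assumptions based on the discussion above, not an excerpt of #11502:

import json

# CouchDB's default max_document_size is 8M bytes (per the error above);
# the agent value may differ, so treat this as a configurable knob.
MAX_DOC_SIZE = 8 * 1000 * 1000

def shrinkOversizedFwjr(fwjrDocument, limit=MAX_DOC_SIZE):
    """If the serialized document exceeds the limit, drop the report payload."""
    docSize = len(json.dumps(fwjrDocument).encode("utf-8"))
    if docSize > limit:
        fwjrDocument["fwjr"] = {}             # assumed key holding the framework job report
        fwjrDocument["fwjrTruncated"] = True  # hypothetical marker for later debugging
    return fwjrDocument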
