Workflows with no errors #11771
Comments
If you look at WMStats, it shows 73 failures, but when you click on the L, no failures show up to dig into, so we are running blind.
There are indeed jobs failing. In the workflow example above, the condor monitor also shows these errors: So, there is an exception ("Other CMS exception"): 8001
Somehow, the job report is still produced, but it has no steps because of the exception above. The JobAccountant tries to load the job report and notices the bad pkl reports:
Could the JobAccounting errors be somehow coupled with the fact that WMStats does not show the job errors in the "L" panel? The WMAgent log itself shows the following for job 307023:
I couldn't find anything relevant in the reqmon logs. Tagging @amaltaro
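For anyone who wants to scan an agent for the same symptom (a job report that loads but carries no steps), here is a minimal sketch. It does not use the WMCore report classes; the `steps` field and the `report_has_steps` helper are hypothetical and assume the pickled object is either a dict or an object exposing a `steps` list, so adapt it to the real report layout before relying on it.

```python
import pickle
from pathlib import Path


def report_has_steps(pkl_path):
    """Return True if the pickled report loads and lists at least one step.

    Assumption (not taken from WMCore): the report is either a dict with a
    "steps" key or an object with a "steps" attribute.
    """
    try:
        with Path(pkl_path).open("rb") as handle:
            report = pickle.load(handle)
    except Exception:
        # Unreadable, truncated, or otherwise unloadable reports count as bad.
        return False

    if isinstance(report, dict):
        steps = report.get("steps", [])
    else:
        steps = getattr(report, "steps", [])
    return bool(steps)
```

Running this over the report files of the failed jobs would give a quick list of the ones that came back empty.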
Hi @khurtado, thank you very much for the detailed investigation. It would of course be great to resolve this, but since it does seem that this issue happens specifically for the error
We can try to submit some ACDCs for this and see if they work, and then apply that solution to the rest of them.
@amaltaro Here is the workflow we looked at in the meeting: Others: http://cmsworkflow-frontend.cern.ch/search?search=exitCode%20=%20-2 (You need to be on the CERN network)
As discussed in the meeting today, I decided to look into the same workflow that Hasan pasted above, which was executed only by vocms0255.
and one/some jobs succeeded, but had this strange (unknown to me) message:
Now, what worries me even more is that JobAccountant is permanently trying to upload documents to couch that exceed the configured size limit, hence raising an exception like:
This issue was supposedly fixed by this PR: #11502. My feeling is that failing to inject this document over and over is making the CouchDB replication unstable.
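To illustrate the kind of guard that would stop the same oversized document from being retried forever, here is a rough sketch. It is not the JobAccountant code nor the change from the PR above; `document_size_bytes`, `exceeds_limit` and the `max_bytes` parameter are hypothetical names, and the size is only approximated by the length of the JSON serialization.

```python
import json


def document_size_bytes(doc):
    """Approximate the upload size of a document as its UTF-8 JSON length."""
    return len(json.dumps(doc).encode("utf-8"))


def exceeds_limit(doc, max_bytes):
    """Return True if the serialized document is larger than max_bytes.

    max_bytes stands in for whatever size limit is configured; it is an
    assumption here, not a value taken from the agent configuration.
    """
    return document_size_bytes(doc) > max_bytes
```

A caller could check `exceeds_limit(doc, max_bytes)` once before the upload and, when it is true, trim or skip the document and log it, instead of hitting the same exception on every cycle.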
Impact of the bug
Makes error handling for P&R very hard, as we rely on WM for error reports.
Describe the bug
Some workflows (e.g. https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_BPH-RunIISummer20UL18GEN-00237) do not seem to have any errors. We do see some errors in the job log and WorkloadSummary (https://cmsweb.cern.ch/couchdb/workloadsummary/_design/WorkloadSummary/_show/histogramByWorkflow/pdmvserv_task_BPH-RunIISummer20UL18GEN-00237__v1_T_230926_104347_8763).