HLT crash in run-368566 (BasicSingleVertexState error from PFTauPrimaryVertexProducer)
#41914
A new Issue was created by @missirol Marino Missiroli. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign hlt |
New categories assigned: hlt @missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks |
assign reconstruction FYI @cms-sw/pf-l2 @cms-sw/tau-pog-l2 |
New categories assigned: reconstruction @mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Since when does CMS allow a failure to invert a matrix to abort a production job? |
There were two other instances of this in run 370497:
|
Unfortunately, in this case too we do not have the corresponding error-stream file. @smorovic (DAQ) explained that the file was deleted too early [1], and he provided the full log of the corresponding [1]
[2]
|
If I understood correctly, this same error happened again in run
@smorovic Maybe in this case the streamers were not deleted? It would be really useful to have them to debug this issue... |
Still: it should NOT throw, just report an error! |
Unfortunately, it seems this time it is also gone: |
While I agree that disrupting processing is nasty, I wonder whether we would ever have uncovered the underlying issue without this. I am a bit hesitant to remove the throw. |
An error message would do it. Throwing is against CMS coding rules. |
I would agree, but unfortunately error messages are completely ignored at HLT, and I presume also during the Offline processing. |
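To make the trade-off discussed above concrete, here is a minimal, self-contained sketch of the two options: throwing when a matrix cannot be inverted versus reporting the failure and skipping the candidate. It is not the actual BasicSingleVertexState or PFTauPrimaryVertexProducer code. The names `Sym2`, `tryInvert`, `invertOrThrow` and `processCandidate`, the 2x2 matrix, and the tolerance are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <iostream>
#include <optional>
#include <stdexcept>

// Hypothetical 2x2 symmetric matrix standing in for a vertex error matrix.
struct Sym2 {
  double xx, xy, yy;
};

// Returns the inverse, or std::nullopt if the matrix is singular,
// numerically close to singular, or contains non-finite entries.
// relTol is an illustrative tolerance, not a value taken from CMSSW.
std::optional<Sym2> tryInvert(const Sym2& m, double relTol = 1e-12) {
  assert(relTol >= 0.0);
  const double det = m.xx * m.yy - m.xy * m.xy;
  const double scale = std::abs(m.xx * m.yy) + m.xy * m.xy;
  if (!std::isfinite(det) || std::abs(det) <= relTol * scale) {
    return std::nullopt;
  }
  return Sym2{m.yy / det, -m.xy / det, m.xx / det};
}

// Option A: throw. If nothing catches the exception, the job aborts.
Sym2 invertOrThrow(const Sym2& m) {
  const auto inv = tryInvert(m);
  if (!inv) {
    throw std::runtime_error("could not invert matrix");
  }
  return *inv;
}

// Option B: report and skip. The caller keeps running; the failure is
// logged and counted so that it can still be monitored afterwards.
bool processCandidate(const Sym2& cov, unsigned& nFailures) {
  const auto inv = tryInvert(cov);
  if (!inv) {
    ++nFailures;
    std::cerr << "matrix inversion failed; skipping candidate\n";
    return false;
  }
  // ... the inverse would be used here ...
  return true;
}
```

With option B, a caller could write `unsigned nFailures = 0; processCandidate(cov, nFailures);` and check the counter at the end of the job. The counter is there because, as noted above, plain error messages tend to go unnoticed at HLT, so a failure that does not abort the job needs some other way of being surfaced.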
there's some effort in cleaning things up in #41456 |
Ah, thanks for the pointer. |
Let me summarize the investigation we have done from the BeamSpot point of view:
This was checked by running "offline" (re-running the HLT) on the same LS, reading the conditions from the HLT GT: the ESProducer does not produce any fake BS. @smorovic is there a way to check whether there were any Frontier/squid errors or warnings during the processing of these LSs? |
The last case has many different CMSSW log errors and warnings, and I am not sure what to search for. The log file is here in my .cms home: However, I think a failure to read condition data would crash the process (it's a fatal error), as far as we know from previous experience with Frontier services going down. That would appear in F3mon as an exception, a stack trace, or a Fatal log level. Also, I found a bug causing files to be deleted early and prepared PRs with a fix: |
As far as I know, this precise exception never happened in 2024. |
Okay. :) |
In run-368566 (pp collisions, release CMSSW_13_0_7), DAQ reported a CMSSW crash at HLT not seen previously, to my knowledge [link to HLT elog]. Metadata and exception message can be found in [1].
Unfortunately, the error-stream file could not be recovered this time, so there is no reproducer right now. I'm opening the issue anyway, in case experts have feedback, or in case this happens again.
FYI: @cms-sw/hlt-l2 @silviodonato @fwyzard @mzarucki @trtomei
[1]