Segmentation violation in SectorProcessorShower::process #42185
Comments
A new Issue was created by @iarspider . @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign l1 |
New categories assigned: l1 @epalencia,@aloeliger you have been requested to review this Pull request/Issue and eventually sign? Thanks |
urgent |
Hi, I'm looking at this, but I don't know the reason for the crash yet. Would you be able to give me more information? Maybe about which RelVals are crashing? Thanks! |
For the RelVals that crash and their logs, please have a look at the most recent Integration Builds at https://cmssdt.cern.ch/SDT/html/cmssdt-ib/#/ib/CMSSW_13_2_X |
I tried to reproduce the crash locally, using the same file from one of the crashes and running the modified How can I run the workflow in an identical way to what crashed? |
Hello @eyigitba, you can run it from the same cmssw environment used in the latest IBs by following these steps:
Let me know if that helps! |
I ran one of the failing workflows in the debugger. The crash happens here
the value of cmssw/DataFormats/L1Trigger/interface/BXVector.icc Lines 208 to 211 in f60e501
and |
Tracing further back, the
which is happening here
The hard coded values are clearly wrong for handling this case. |
Thank you @Dr15Jones for the detailed investigation. I think the origin of the problem is clear now. The bug was already present in the code of BXVector, but it only surfaced after merging #42176 because this line now pushes into the exact bx of the digi, while before that PR it was always pushed at bx=0. @eyigitba I think that a possible quick fix could be putting a protection into BXVector::push_back so that only if Still, I don't understand the role of L135-L139 in BXVector.icc: cmssw/DataFormats/L1Trigger/interface/BXVector.icc Lines 135 to 139 in f60e501
itrs_ is already set up in the constructor; what is the purpose of moving it by one in the following bx's if you push data_ for a previous bx?
|
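To illustrate the kind of protection being proposed here, the following is a minimal sketch of a BX-indexed container whose `push_back` rejects out-of-range bunch crossings instead of writing past its storage. `SimpleBxVector` is a hypothetical stand-in written for this example; it is not the CMSSW `BXVector`, and its layout (one inner vector per BX) is an assumption chosen for clarity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a BX-indexed container.
// It is NOT the CMSSW BXVector; names and layout are illustrative only.
template <typename T>
class SimpleBxVector {
public:
  SimpleBxVector(int bxFirst, int bxLast) : bxFirst_(bxFirst), bxLast_(bxLast) {
    assert(bxFirst <= bxLast);
    perBx_.resize(static_cast<std::size_t>(bxLast - bxFirst + 1));
  }

  // Stores the object only if bx lies in [bxFirst, bxLast].
  // Returns false (and stores nothing) for an out-of-range bx.
  bool push_back(int bx, const T& object) {
    if (bx < bxFirst_ || bx > bxLast_) {
      return false;
    }
    perBx_[static_cast<std::size_t>(bx - bxFirst_)].push_back(object);
    return true;
  }

  // Number of objects stored for a given bx; 0 if bx is out of range.
  std::size_t size(int bx) const {
    if (bx < bxFirst_ || bx > bxLast_) {
      return 0;
    }
    return perBx_[static_cast<std::size_t>(bx - bxFirst_)].size();
  }

private:
  int bxFirst_;
  int bxLast_;
  std::vector<std::vector<T>> perBx_;
};
```

With a window of [-2, 2], `push_back(3, x)` returns `false` and leaves the container unchanged, so a digi with an unexpected BX is dropped rather than corrupting memory.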
Thanks @Dr15Jones and @perrotta for the further information on this. I now see where the problem is. I didn't realize that there was data with BX values outside the [-2,2] range. I can add the protection to I think this shouldn't cause any issues with other workflows since we didn't see this crash before. Thanks @aandvalenzuela for the instructions. I'll test the code with this workflow. |
I think the underlying issue of #41645 is correlated with this. @aloeliger FYI |
@mmusich Am I reading this right that WF |
I think so (or at least we're getting events with seemingly corrupt L1T data, with BX values outside the expected range). |
That's curious. The corrupt data only started showing up online in 2023. I think this is evidence that it's not a uGT firmware issue, but a direct unpacker failure for muons/muon showers. |
@eyigitba I think some inspection of the muon/muon shower unpackers, and of how the BX is assigned to them, is necessary. |
@aloeliger , this is for sure something that needs to happen for muon showers. I don't think the problem is in muons, since we didn't touch anything there. However, for the muon showers, #38941 changed how the unpackers/emulators work on the CSC side, which apparently causes these issues to appear on the L1T side. We can also ping @dinyar here in case he has any insight on possible problems with muon unpackers. |
@eyigitba would it be possible to provide a quick and possibly not so dirty solution for the problem at hand? We must close 13_2_0_pre3 (the last open pre) and if the problem persists we will be forced to revert #42176. I have the impression that adding a protection into |
@perrotta I just discussed a quick solution with him. Either he or I should have it available quickly. Let me ask and then I can give an ETA. |
Okay. I'll push a quick bx boundary check. Should be available within 30 minutes at a quick guess? I'll update when I have it. |
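As a rough illustration of what a caller-side bx boundary check could look like, the sketch below filters a list of digis down to those whose BX falls inside an allowed window before they are handed to any BX-indexed storage. The `ShowerDigi` struct and `keepInBxWindow` function are hypothetical names invented for this example; the [-2, 2] window is the range mentioned earlier in this thread, and the actual fix may differ.

```cpp
#include <cassert>
#include <vector>

// Hypothetical digi type for illustration; not a CMSSW data format.
struct ShowerDigi {
  int bx;
  int payload;
};

// Returns only the digis whose bx lies in [minBx, maxBx].
// Out-of-range digis are skipped rather than passed on.
std::vector<ShowerDigi> keepInBxWindow(const std::vector<ShowerDigi>& digis, int minBx, int maxBx) {
  assert(minBx <= maxBx);
  std::vector<ShowerDigi> kept;
  for (const auto& digi : digis) {
    if (digi.bx >= minBx && digi.bx <= maxBx) {
      kept.push_back(digi);
    }
  }
  return kept;
}
```

Filtering at the call site keeps the container untouched, at the cost of silently dropping corrupt data, so logging the skipped BX values may be worthwhile when tracking down the unpacker issue.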
Thanks @aloeliger . My connection is not great right now, and for some reason I can't stay connected to lxplus. |
please close |
We observe multiple RelVal failures in CMSSW_13_2_X_2023-07-04 IBs (all platforms). Example of crash log
Full log: link
Looks like #42176 is the culprit