-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Segfault in MuonShowerInformationFiller in SKYLAKEAVX512_X #33663
Comments
assign reconstruction |
A new Issue was created by @makortel Matti Kortelainen. @Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
with LessPhi defined ascmssw/RecoMuon/MuonIdentification/interface/MuonShowerInformationFiller.h Lines 156 to 163 in 89aac0a
|
@ArnabPurohit @trocino @makortel |
So far no, this is the only instance. I was first thinking if NaN or Inf could cause it, but the code is calling I'm not sure if it is worth to put much effort here unless the problem shows up again. I also realize I have no idea what (vector) architecture the job was ran on. |
in the log it says it is using 'haswell" aka AVX2. |
do they realize that they are calling atan2 each time? (same with all those perp and mag)? |
actually it's worse as globalPosition is not cached, so it also makes a toGlobal transformation each time |
so there is a whole set of pointer-chasing + virtualTable + SSE like operation (+atan2) involved in each comparison |
I'd start by caching the values to be sorted. It's at least conceivable that there are different code paths that swap the ordering of two entries, maybe due to different vectorization at different stages of the sort. |
attn @cms-sw/muon-pog-l2 |
type muon |
Occurred again in CMSSW_12_6_SKYLAKEAVX512_X_2022-09-15-2300 el8_amd64_gcc10 workflow 136.844 step 3
|
Hi, This issue is happening again in CMSSW_12_6_SKYLAKEAVX512_X_2022-11-20-2300 IB on platform el8_amd64_gcc11 workflow 136.844. |
I finally got annoyed enough to put this to the test by replacing the call to
with usage replacing the call to
In my checks this produces identical results to the standard stable sort, only it doesn't crash on skylake and it takes 1/3 the time. The |
After poking at this code some more, I think there's another issue. In this block of code, cmssw/RecoMuon/MuonIdentification/src/MuonShowerInformationFiller.cc Lines 916 to 935 in 4fc972d
it's iterating over muonCorrelatedHits and calling findThetaCluster() . That routine,cmssw/RecoMuon/MuonIdentification/src/MuonShowerInformationFiller.cc Lines 275 to 297 in 4fc972d
sorts the input muonRecHits collection, which is the same collection that is being iterated over--so the collection being iterated over is getting reordered on every iteration. The result is that it is pretty much guaranteed that the loop over the collection is not actually looping over every member. I think a proper fix will have to make a copy of the muonRecHits so that the it isn't iterating over a collection that gets reordered on every iteration. (This was found when I tried a sort replacement that invalidated outstanding iterators on the collection...) |
segfault in 136.844 seems to be fixed, so I think this issue can be closed. |
@cmsbuild, please close |
Workflow 136.844 step 3 segfaults in CMSSW_12_0_SKYLAKEAVX512_X_2021-05-06-2300
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc900/CMSSW_12_0_SKYLAKEAVX512_X_2021-05-06-2300/pyRelValMatrixLogs/run/136.844_RunCharmonium2017B+RunCharmonium2017B+HLTDR2_2017+RECODR2_2017reHLT_skimCharmonium_Prompt+HARVEST2017/step3_RunCharmonium2017B+RunCharmonium2017B+HLTDR2_2017+RECODR2_2017reHLT_skimCharmonium_Prompt+HARVEST2017.log#/
The text was updated successfully, but these errors were encountered: