heap-buffer-overflow HcalDDDSimConstants::getPhiCons() #39480
A new Issue was created by @dan131riley Dan Riley. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign geometry |
New categories assigned: geometry @mdhildreth,@ianna,@Dr15Jones,@makortel,@bsunanda,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Looks similar to #39445 even if the stack trace is different. |
It's plausible they are related |
So I was able to replicate the problem with a debug build under gdb. The crash I see happens on line 173 of cmssw/Geometry/HcalCommonData/src/HcalDDDRecConstants.cc (lines 171 to 173 in 8faa021),
but the problem originates on line 171 as the value of For the problem I see, the values are being deduced by looking at the HcalID of values read from the pileup where the raw value of the id is |
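So the values used to fill ietaMap come from the HcalParametersESModule ES module (configured with fromDD4hep = True),
which gets its data from DD4hep. The values pulled from the loop filling the structure are nEta = 16 and hpar->etagroup[i] = 1 for i = 0 to 15.
The DD4hep description appears to come from the DDDetectorESProducer module, which reads Geometry/CMSCommonData/data/dd4hep/cmsExtendedGeometry2026D88.xml.
Is this the correct geometry file for this pileup? |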
Hi
I believe they are using a pileup file from Run 1, 2, or 3 for Phase 2, where the ieta value is restricted to 16. I shall add a protection to give a more realistic diagnostic. The ieta value for this channel is 28, whereas the maximum ieta value should be 16 (there is no HE in Run 4). Thanks, Chris, for providing the diagnosis.
Sunanda
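To illustrate the kind of protection mentioned above, here is a minimal sketch (in Python rather than the actual C++, and not the code that went into CMSSW). It assumes a per-ring table with nEta entries and a 1-based |ieta|; the names `eta_group` and `EtaRangeError` are hypothetical. The idea is only that the range is checked before the table is indexed, so an ieta of 28 against a 16-entry table yields a diagnostic instead of an out-of-range read.

```python
class EtaRangeError(ValueError):
    """Raised when an |ieta| value falls outside the geometry's eta table."""


def eta_group(etagroup, ieta):
    """Return the eta group for a 1-based |ieta|, checking the range first.

    etagroup: one entry per eta ring (nEta entries in total).
    The 1-based convention is an assumption made for this sketch.
    """
    n_eta = len(etagroup)
    if not 1 <= ieta <= n_eta:
        raise EtaRangeError(
            f"|ieta| = {ieta} is outside 1..{n_eta}; "
            "the input may have been produced with a different geometry"
        )
    return etagroup[ieta - 1]


# Values as reported in this thread: nEta = 16, every etagroup entry is 1.
etagroup = [1] * 16

print(eta_group(etagroup, 16))  # last valid ring

try:
    eta_group(etagroup, 28)     # the channel discussed above
except EtaRangeError as err:
    print(err)
```

The message deliberately points at a possible geometry mismatch, since that is the diagnosis being discussed here; the real fix may well look different.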
On 28 October 2022 23:23, Chris Jones wrote:
So the values used to fill ietaMap come from this ES module
>>> print(process.hcalParameters.dumpPython())
cms.ESProducer("HcalParametersESModule",
    appendToDataLabel = cms.string(''),
    fromDD4hep = cms.bool(True)
)
which gets its data from DD4hep
The values pulled from the loop filling the structure are
nEta 16 i 0 hpar->etagroup[i] 1
nEta 16 i 1 hpar->etagroup[i] 1
nEta 16 i 2 hpar->etagroup[i] 1
nEta 16 i 3 hpar->etagroup[i] 1
nEta 16 i 4 hpar->etagroup[i] 1
nEta 16 i 5 hpar->etagroup[i] 1
nEta 16 i 6 hpar->etagroup[i] 1
nEta 16 i 7 hpar->etagroup[i] 1
nEta 16 i 8 hpar->etagroup[i] 1
nEta 16 i 9 hpar->etagroup[i] 1
nEta 16 i 10 hpar->etagroup[i] 1
nEta 16 i 11 hpar->etagroup[i] 1
nEta 16 i 12 hpar->etagroup[i] 1
nEta 16 i 13 hpar->etagroup[i] 1
nEta 16 i 14 hpar->etagroup[i] 1
nEta 16 i 15 hpar->etagroup[i] 1
The DD4Hep description appears to come from this module
>>> print(process.DDDetectorESProducer.dumpPython())
cms.ESSource("DDDetectorESProducer",
    appendToDataLabel = cms.string(''),
    confGeomXMLFiles = cms.FileInPath('Geometry/CMSCommonData/data/dd4hep/cmsExtendedGeometry2026D88.xml')
)
Is this the correct geometry file for this pileup?
|
Umm, I'd expect runTheMatrix workflows to have consistent pileup files (@cms-sw/pdmv-l2). The workflows 20834.911 and 23634.0 (that are the ones failing after the re-numbering of phase 2 workflows) do not have pileup. (checking workflows 39434.911 and 38634.0, that were mentioned in the issue description, from 12_5_X shows they don't have pileup either) |
After #39920 workflow 20834.911 step 2 fails with an exception
(as I noted above, this workflow does not include pileup) |
The configuration of the mixing module in the job that is failing is
|
So in this workflow, step 1 does the GEN+SIM step and then step 2 (where the problem is seen) does the DIGI step. It looks like the DIGI step thinks the geometry of the detector is different from what the SIM step thinks. I checked the parameters passed to cmsDriver.py in both steps, and they seem consistent: step 1
step 2
|
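For anyone wanting to repeat the consistency check between the two steps, a small sketch of how it could be automated. It compares the values of a few cmsDriver.py options (`--geometry`, `--era`, `--conditions`) between two command lines. The helper names are hypothetical, and the two command strings below are placeholders, not the actual commands of this workflow.

```python
import shlex


def option_values(command, names):
    """Collect values of selected '--name value' or '--name=value' options."""
    tokens = shlex.split(command)
    found = {}
    for i, tok in enumerate(tokens):
        for name in names:
            if tok == name and i + 1 < len(tokens):
                found[name] = tokens[i + 1]
            elif tok.startswith(name + "="):
                found[name] = tok.split("=", 1)[1]
    return found


def mismatches(cmd_a, cmd_b, names=("--geometry", "--era", "--conditions")):
    """Return {option: (value_a, value_b)} for options whose values differ."""
    a = option_values(cmd_a, names)
    b = option_values(cmd_b, names)
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


# Placeholder commands; GEOM_A and ERA_A stand in for the real values.
step1 = "cmsDriver.py step1 --geometry GEOM_A --era ERA_A -s GEN,SIM"
step2 = "cmsDriver.py step2 --geometry GEOM_A --era ERA_A -s DIGI"

print(mismatches(step1, step2))  # an empty dict means the options agree
```

An empty result only says the command-line options agree; it does not rule out a mismatch introduced elsewhere, which is what this thread is trying to pin down.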
IMHO, we must move this and all other constants as records to DB and pick them up from a GT. |
The pileup of Phase-2 is controlled by So, if you are running runTheMatrix with 12_6_0_pre4, it will pick MinBias GS from CMSSW_12_3_0_pre5,
If GS from Should this be updated, and see if we still have issue? |
The failing workflows do not use pileup, so I don't see how their update would help. (written that, there may be other reasons to justify the update) |
Hi @makortel, I think from the workflow point of view, i.e. to get a consistent geometry for the pileup workflows, this may be good to update. |
Could the heap-buffer-overflow in |
This exception happens in the SIM step (step 1) and not in the DIGI step (step 2). So clearly the eta assignment goes wrong already at the time it is assigned. Does it happen for all CMSSW architectures, and only for the Phase 2 DD4hep case? |
I think my conclusion in the last ORP was wrong, having re-read the original report by @dan131riley #39480 (comment) and also the comment from @makortel #39480 (comment). The issue is also seen in a DDD workflow, i.e. in 23634.0 (a NoPU D95 workflow). #39445 (comment) reports that in normal IBs the issue appears randomly. One should try to run on an ASAN IB. |
urgent |
The exception (introduced in #39920) appears to be occurring only in workflow 20834.911 step 2, i.e. Phase 2 D88 DD4hep without pileup, and pretty consistently across all IB flavors. The ASAN failure (mentioned in the description of this issue) appears to be occurring in the following workflows in a random fashion (i.e. each of the workflows fails in some ASAN IBs, but not in all of them)
* 20834.911 step 1, i.e. Phase 2 D88 DD4hep without pileup
* 22034.0 step 1, i.e. Phase 2 D91 DDD without pileup
* 23634.0 step 1, i.e. Phase 2 D95 DDD without pileup
(note that an execution not triggering the ASAN failure does not mean all would be well, it could easily be that the ASAN's instrumentation just fails to catch a read from incorrect address). |
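To make the last point concrete, a toy model of a redzone-based checker. This is not how ASAN is implemented; it is only a sketch, with a made-up redzone width and heap layout, of why a read slightly past the end of a buffer gets reported while a read far past the end can land inside another live allocation and go unreported.

```python
REDZONE = 4  # hypothetical redzone width, in elements


def build_heap(sizes):
    """Lay out allocations back to back, each followed by a redzone.

    Returns (tags, starts): tags[addr] is 'data' or 'redzone',
    and starts[k] is the first address of allocation k.
    """
    tags, starts = [], []
    for size in sizes:
        starts.append(len(tags))
        tags.extend(["data"] * size)
        tags.extend(["redzone"] * REDZONE)
    return tags, starts


def checked_read(tags, base, index):
    """Return 'reported' if the access touches a redzone, else 'silent'."""
    addr = base + index
    if addr < 0 or addr >= len(tags):
        return "reported"
    return "reported" if tags[addr] == "redzone" else "silent"


# A 16-entry table followed by an unrelated 32-entry buffer.
tags, starts = build_heap([16, 32])
table = starts[0]

print(checked_read(tags, table, 15))  # silent: last valid element
print(checked_read(tags, table, 16))  # reported: just past the end, in the redzone
print(checked_read(tags, table, 27))  # silent: far past the end, inside the next buffer
```

In the toy layout the third read returns someone else's data without any report, which is one way an ASAN run can pass for some executions and fail for others as the real heap layout varies.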
Dear All,
Most likely I found the bug. But this bug has been there from the very first days of CMSSW and I believe it has affected all workflows from Run1 onward. It only gave wrong results and did not cause the crash till we removed HE. I am trying to make a PR out of this. Regards
Sunanda
On 02 November 2022 18:37, Matti Kortelainen wrote:
@bsunanda
The exception (introduced in #39920) appears to be occurring only in workflow 20834.911 step 2, i.e. Phase 2 D88 DD4Hep without pileup, and pretty consistently across all IB flavors.
The ASAN failure (mentioned in the description of this issue) appears to be occurring in the following workflows in a random fashion (i.e. each of the workflow fails in some ASAN IBs, but not in all of them)
* 20834.911 step1, i.e. Phase2 D88 DD4Hep without pileup
* 22034.0 step 1, i.e. Phase2 D91 DDD without pileup
* 23634.0 step 1, i.e. Phase2 D95 DDD without pileup
(note that an execution not triggering the ASAN failure does not mean all would be well, it could easily be that the ASAN's instrumentation just fails to catch a read from incorrect address).
|
This problem got fixed by #39967 |
@cmsbuild, please close |
Log from 39434.911, also seen in 38634.0:
https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/el8_amd64_gcc11/CMSSW_12_6_ASAN_X_2022-09-21-1100/pyRelValMatrixLogs/run/39434.911_TTbar_14TeV+2026D88_DD4hep+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal/step1_TTbar_14TeV+2026D88_DD4hep+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal.log