Failure in L1FPGATrackProducer/'l1tTTTracksFromExtendedTrackletEmulation' in 14_0_0 #44306
A new Issue was created by @srimanob. @rappoccio, @antoniovilela, @smuzaffar, @makortel, @Dr15Jones, @sextonkennedy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign l1 |
New categories assigned: l1 @epalencia,@aloeliger you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@Jingyan95 - can you have a look at this please? Is it somehow related to #41357 ? |
assign upgrade |
New categories assigned: upgrade @srimanob,@subirsarkar you have been requested to review this Pull request/Issue and eventually sign? Thanks |
I should point out that I believe this is the Tracker group's code rather than L1Trigger |
I was able to fix the first exception in #44427. If there is a recipe for the exception in PU relvals, I can try to debug that one as well. |
Hi @aehart |
I was able to reproduce the exception seen in PU relvals by copying the PSet.pkl from one of the job logs. I traced this to a numerical stability issue, which I fixed in #44471. Once that is merged, I think this issue is resolved, as far as I can see. |
We still see the issue in CMSSW_14_1_0_pre3 where #44471 was merged (see release log), see reports in |
Note that something looks very strange to me: we don't see this issue at all in 14_0_6, while we see many failed jobs in 14_1_0_pre3. The only issue I see in 14_0_6 is the TripleMU_i84 NULL pointer, about which I contacted L1T separately. From my check, I don't see this L1FPGATrackProducer/l1tTTTracksFrom.. failure there at all. |
@srimanob Is there a recipe for how to reproduce the crash? It is difficult for us to debug otherwise. A note on the releases: there are some possibly relevant PRs that were included in 14_1_0_pre3 but are not in 14_0_6. |
Hi @skinnari Here is the recipe,
|
With the private production, I confirm that the crash seems to appear in 14_1 only; I don't see it when I produce the sample with 14_0_6. |
I was able to reproduce the crash locally in 14_1_0_pre3, and with debugging symbols, the backtrace points to this line:
I can't see how this line could be the cause though, so I guess there is some kind of memory mismanagement somewhere else that is the actual cause. I will keep playing with it… |
The line
makes use of a variable-length array, which is NOT supported by the C++ standard (although almost all compilers support it as an extension). The problem is that this can use a lot of stack memory and can exceed the space allowed on the stack. I suggest switching to a dynamic container to see if that solves the problem. |
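To illustrate the kind of change being suggested, here is a minimal sketch. It is not the actual CMSSW code: the function name `sumOfBinIndices` and the parameter `nBins` are hypothetical, chosen only to show a run-time-sized scratch buffer moved from the stack (as a variable-length array) to the heap (as a `std::vector`).

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a routine that needs a scratch buffer whose
// size is only known at run time (names are illustrative, not from CMSSW).
//
// Non-standard form (a compiler extension, allocated on the stack):
//     int scratch[nBins];
// Standard form used below: a std::vector, whose elements live on the heap.
long long sumOfBinIndices(std::size_t nBins) {
  std::vector<int> scratch(nBins);               // heap storage, size set at run time
  std::iota(scratch.begin(), scratch.end(), 0);  // fill with 0, 1, ..., nBins - 1
  return std::accumulate(scratch.begin(), scratch.end(), 0LL);
}
```

With the vector, a large `nBins` costs heap memory rather than stack space, so the buffer size is no longer bounded by the stack limit of the thread that runs the module.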
This seems to be a good suggestion. I switched If this seems like a reasonable fix, I can open a PR right away. |
Thanks very much @Dr15Jones @aehart for the suggestion and the test. Do you somehow understand why it does not happen in 14_0? Are we just about at the limit in 14_1 due to some modules? In any case, it seems to be on the safe side if you make the backport to 14_0, right? Thx. |
That's my guess, although it could also be related to removing the bins used in the
There's no harm in backporting this to 14_0, so I can do that as well.
I've only checked the L1Trigger/TrackFinding* packages by recompiling them with the |
Just for fun, here is a table of all variable-length arrays in L1Trigger in CMSSW_14_1_0_pre3. I leave it to the experts of other subpackages to fix them, but hopefully this is a useful starting point.
|
I observe a failure in L1FPGATrackProducer/'l1tTTTracksFromExtendedTrackletEmulation' in 14_0_0 in recent NoPU relvals,
https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_RVCMSSW_14_0_0QCD_FlatPt_15_3000HS_14__STD_2026D98_noPU_240226_205544_77
To reproduce the issue with CMSSW_14_0_0:
Input file = /eos/cms/store/relval/CMSSW_14_0_0/RelValQCD_FlatPt_15_3000HS_14/GEN-SIM/140X_mcRun4_realistic_v1_STD_2026D98_noPU-v1/2580000/1af0b992-5804-40e9-911f-933e5c413f97.root
and cmsDriver on step2:
cmsDriver.py step2 -s DIGI:pdigi_valid,L1TrackTrigger,L1,DIGI2RAW,HLT:@relval2026 --conditions auto:phase2_realistic_T25 --datatier GEN-SIM-DIGI-RAW -n 10 --eventcontent FEVTDEBUGHLT --geometry Extended2026D98 --era Phase2C17I13M9 --python DigiTrigger_2026D98.py --no_exec --filein file:step1.root --fileout file:step2.root --nThreads 8 --nStreams 1 --customise_commands "process.source.lumisToProcess = cms.untracked.VLuminosityBlockRange('1:432-1:432') \n process.source.eventsToProcess = cms.untracked.VEventRange('1:43160-1:43169')"
CMSSW_14_0_0 already includes a (temporary) fix for the track jet eta, #43922; see the release report at https://github.com/cms-sw/cmssw/releases/CMSSW_14_0_0
Note that there is also an issue in PU relvals, for example in
https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_RVCMSSW_14_0_0DisplacedSUSY_14TeV__STD_2026D98_PU_240302_001633_312
https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_RVCMSSW_14_0_0QCD_Pt15To7000_Flat_14__STD_2026D98_PU_240302_001955_88
https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_RVCMSSW_14_0_0QCD_Pt_1800_2400_14__STD_2026D98_PU_240302_001939_1360
with error
I have not reproduced the PU error yet, as that requires mixing samples.