Failure in L1FPGATrackProducer/'l1tTTTracksFromExtendedTrackletEmulation' in 14_0_0 #44306

Open
srimanob opened this issue Mar 4, 2024 · 24 comments

Comments

@srimanob
Contributor

srimanob commented Mar 4, 2024

I observe a failure in L1FPGATrackProducer/'l1tTTTracksFromExtendedTrackletEmulation' in 14_0_0 in recent NoPU relvals,
https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_RVCMSSW_14_0_0QCD_FlatPt_15_3000HS_14__STD_2026D98_noPU_240226_205544_77

Fatal Exception (Exit code: 8022)
An exception of category 'FatalRootError' occurred while
[0] Processing Event run: 1 lumi: 432 event: 43164 stream: 2
[1] Running path 'L1TrackTrigger_step'
[2] Calling method for module L1FPGATrackProducer/'l1tTTTracksFromExtendedTrackletEmulation'
Additional Info:
[a] Fatal Root Error: @SUB=operator=(const TMatrixT &)
matrices not compatible

To reproduce the issue with CMSSW_14_0_0:
Input file = /eos/cms/store/relval/CMSSW_14_0_0/RelValQCD_FlatPt_15_3000HS_14/GEN-SIM/140X_mcRun4_realistic_v1_STD_2026D98_noPU-v1/2580000/1af0b992-5804-40e9-911f-933e5c413f97.root

and cmsDriver on step2:
cmsDriver.py step2 -s DIGI:pdigi_valid,L1TrackTrigger,L1,DIGI2RAW,HLT:@relval2026 --conditions auto:phase2_realistic_T25 --datatier GEN-SIM-DIGI-RAW -n 10 --eventcontent FEVTDEBUGHLT --geometry Extended2026D98 --era Phase2C17I13M9 --python DigiTrigger_2026D98.py --no_exec --filein file:step1.root --fileout file:step2.root --nThreads 8 --nStreams 1 --customise_commands "process.source.lumisToProcess = cms.untracked.VLuminosityBlockRange('1:432-1:432') \n process.source.eventsToProcess = cms.untracked.VEventRange('1:43160-1:43169')"

CMSSW_14_0_0 already includes a (temporary) fix for the track jet eta, #43922; see the release notes at https://github.com/cms-sw/cmssw/releases/CMSSW_14_0_0

Note that there is also an issue in PU relvals, for example in
https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_RVCMSSW_14_0_0DisplacedSUSY_14TeV__STD_2026D98_PU_240302_001633_312
https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_RVCMSSW_14_0_0QCD_Pt15To7000_Flat_14__STD_2026D98_PU_240302_001955_88
https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_RVCMSSW_14_0_0QCD_Pt_1800_2400_14__STD_2026D98_PU_240302_001939_1360
with error

[0] Processing Event run: 1 lumi: 45 event: 2235 stream: 0
[1] Running path 'L1TrackTrigger_step'
[2] Calling method for module L1FPGATrackProducer/'l1tTTTracksFromExtendedTrackletEmulation'
Additional Info:
[a] Fatal Root Error: @SUB=TDecompLU::DecomposeLUCrout
matrix is singular

I have not yet reproduced the PU error, as doing so requires mixing samples.

@cmsbuild
Contributor

cmsbuild commented Mar 4, 2024

cms-bot internal usage

@cmsbuild
Contributor

cmsbuild commented Mar 4, 2024

A new Issue was created by @srimanob.

@rappoccio, @antoniovilela, @smuzaffar, @makortel, @Dr15Jones, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@srimanob
Contributor Author

srimanob commented Mar 4, 2024

assign l1

@cmsbuild
Contributor

cmsbuild commented Mar 4, 2024

New categories assigned: l1

@epalencia,@aloeliger you have been requested to review this Pull request/Issue and eventually sign? Thanks

@aloeliger
Contributor

@BenjaminRS

@srimanob changed the title from "Fail in L1FPGATrackProducer/'l1tTTTracksFromExtendedTrackletEmulation' in 14_0_0" to "Failure in L1FPGATrackProducer/'l1tTTTracksFromExtendedTrackletEmulation' in 14_0_0" on Mar 4, 2024
@BenjaminRS
Contributor

@Jingyan95 - can you have a look at this please? Is it somehow related to #41357 ?

@makortel
Contributor

makortel commented Mar 4, 2024

assign upgrade

@cmsbuild
Contributor

cmsbuild commented Mar 4, 2024

New categories assigned: upgrade

@srimanob,@subirsarkar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@BenjaminRS
Contributor

I should point out that I believe this is the Tracker group's code rather than L1Trigger

@aehart
Contributor

aehart commented Mar 15, 2024

I was able to fix the first exception in #44427.

If there is a recipe for the exception in PU relvals, I can try to debug that one as well.

@srimanob
Contributor Author

Hi @aehart
Thanks. For PU, it is a bit difficult to reproduce starting from GEN-SIM, since the pileup events are mixed in randomly. The way to do it seems to be to re-run L1 on top of RAW (produced with L1 skipped the first time, so that the issue is still there).

@aehart
Contributor

aehart commented Mar 19, 2024

I was able to reproduce the exception seen in PU relvals by copying the PSet.pkl from one of the job logs. I traced this to a numerical stability issue, which I fixed in #44471.

Once that is merged, I think this issue is resolved, as far as I can see.
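
For readers unfamiliar with this class of failure, here is a minimal sketch of how a caller can detect a (near-)singular matrix before inverting it, instead of letting the decomposition abort. It is not the change made in #44471 and does not use ROOT: the helper `safeInverse`, the `Mat2` alias, and the tolerance `relTol` are all hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <optional>

using Mat2 = std::array<std::array<double, 2>, 2>;

// Hypothetical helper (not from the tracklet code): invert a 2x2 matrix, or
// return std::nullopt when the determinant is negligible compared with the
// products that enter it, i.e. when the matrix is numerically singular.
std::optional<Mat2> safeInverse(const Mat2& m, double relTol = 1e-12) {
  const double det = m[0][0] * m[1][1] - m[0][1] * m[1][0];
  // Reference scale: the larger of the two products forming the determinant.
  const double scale = std::max(std::abs(m[0][0] * m[1][1]),
                                std::abs(m[0][1] * m[1][0]));
  if (scale == 0.0 || std::abs(det) <= relTol * scale) {
    return std::nullopt;  // caller decides how to handle the degenerate case
  }
  Mat2 inv{};
  inv[0][0] = m[1][1] / det;
  inv[0][1] = -m[0][1] / det;
  inv[1][0] = -m[1][0] / det;
  inv[1][1] = m[0][0] / det;
  return inv;
}
```

A relative rather than absolute tolerance is used so that the check does not depend on the overall magnitude of the matrix entries.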

@srimanob
Contributor Author

srimanob commented May 6, 2024

We still see the issue in CMSSW_14_1_0_pre3 where #44471 was merged (see release log), see reports in
#44471 (comment)
#44471 (comment)

@srimanob
Contributor Author

srimanob commented May 6, 2024

Note that something looks very strange to me: we don't see this issue at all in 14_0_6, while we see many failed jobs in 14_1_0_pre3. The only issue I see in 14_0_6 is the TripleMU_i84 NULL pointer, about which I contacted L1T separately. From my check, I don't see this L1FPGATrackProducer/l1tTTTracksFrom.. failure there at all.

@skinnari
Contributor

skinnari commented May 7, 2024

@srimanob is there a recipe for how to reproduce the crash? it is difficult for us to debug otherwise.

a note on the releases -- there are some possibly relevant PRs that were included in 14_1_0_pre3, that are not in 14_0_6.

@srimanob
Contributor Author

srimanob commented May 7, 2024

Hi @skinnari

Here is the recipe,

cmsrel CMSSW_14_1_0_pre3
cd CMSSW_14_1_0_pre3/src/
cmsenv
cmsDriver.py step2 -s DIGI:pdigi_valid,L1TrackTrigger,L1,DIGI2RAW,HLT:@relval2026 --conditions auto:phase2_realistic_T33 --datatier GEN-SIM-DIGI-RAW -n -1 --eventcontent FEVTDEBUGHLT --geometry Extended2026D110 --era Phase2C17I13M9 --python step2.py --no_exec --filein file:step1.root --fileout file:step2.root --nThreads 8 --nStreams 2
ln -s /eos/cms/store/group/offcomp_upgrade-sw/srimanob/L1T/1410pre3-debug/step1-13.root ./step1.root
cmsRun step2.py

@srimanob
Contributor Author

srimanob commented May 8, 2024

With a private production, I confirm that the crash seems to appear only in 14_1; I don't see it when I produce the sample with 14_0_6.

@aehart
Contributor

aehart commented May 8, 2024

I was able to reproduce the crash locally in 14_1_0_pre3, and with debugging symbols, the backtrace points to this line:

I can't see how this line could be the cause though, so I guess there is some kind of memory mismanagement somewhere else that is the actual cause. I will keep playing with it…

@Dr15Jones
Contributor

> I can't see how this line could be the cause though, so I guess there is some kind of memory mismanagement somewhere else that is the actual cause. I will keep playing with it…

The line

bool dupMap[numStublists][numStublists]; // Ends up symmetric

makes use of a variable-sized array, which is NOT supported by the C++ standard (but almost all compilers support it). The problem is that this can use a lot of stack memory and can exceed the space allowed on the stack. Switching to a dynamic container to see if that solves the problem.
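
The suggested change can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual patch: the function `buildDupMap` and the `isDuplicate` predicate are hypothetical, and only the names `dupMap` and `numStublists` come from the line quoted above. A `std::vector` stores its elements on the heap, so the size of the table no longer counts against the stack limit.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: builds the symmetric duplicate map on the heap.
// Stands in for `bool dupMap[numStublists][numStublists];`, which is a
// variable-length array placed on the stack.
std::vector<std::vector<bool>> buildDupMap(
    std::size_t numStublists,
    const std::function<bool(std::size_t, std::size_t)>& isDuplicate) {
  // Every entry starts out false; std::vector allocates the storage on the heap.
  std::vector<std::vector<bool>> dupMap(
      numStublists, std::vector<bool>(numStublists, false));
  for (std::size_t i = 0; i < numStublists; ++i) {
    for (std::size_t j = i + 1; j < numStublists; ++j) {
      if (isDuplicate(i, j)) {
        // Fill both halves so that the map ends up symmetric.
        dupMap[i][j] = true;
        dupMap[j][i] = true;
      }
    }
  }
  return dupMap;
}
```

Element access keeps the same `dupMap[i][j]` syntax as the C-style array, so code that reads the map would not need to change in this sketch.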

@aehart
Contributor

aehart commented May 8, 2024

> Switching to a dynamic container to see if that solves the problem.

This seems to be a good suggestion. I switched dupMap (and also noMerge) from C-style arrays to vectors:
CMSSW_14_1_0_pre3...aehart:cmssw:2ecd340123eb2efb73108d92bf8a799d4563362a
With this, the job that previously crashed is able to run to completion.

If this seems like a reasonable fix, I can open a PR right away.

@srimanob
Contributor Author

srimanob commented May 8, 2024

Thanks very much @Dr15Jones @aehart for the suggestion and the test. Do you somehow understand why it does not happen in 14_0? Are we just about at the limit in 14_1 due to some modules?

(1) However, it seems to be on the safe side if you make the backport to 14_0, right?
(2) Is this the only place that uses a variable-sized array in L1T code? Could this be reviewed and fixed overall? @aloeliger @epalencia

Thx.

@aehart
Contributor

aehart commented May 8, 2024

> Do you somehow understand why it does not happen in 14_0? Are we just about at the limit in 14_1 due to some modules?

That's my guess, although it could also be related to removing the bins used in the PurgeDuplicate class (f68c199). The value of numStublists would be smaller in each of the bins we had before, so these problematic arrays would also be smaller. I haven't tested that this is why we don't see this problem in 14_0, but I think it makes sense.
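
The scaling argument can be made concrete with a small sketch. Because the array is `numStublists` × `numStublists`, its footprint grows quadratically, so merging several bins into one can increase it sharply. The stack budget below (`kAssumedStackBytes`, 8 MiB) is a hypothetical figure for illustration only, not the limit actually in effect for CMSSW jobs, and the helper names are likewise made up.

```cpp
#include <cstddef>

// Bytes needed by `bool dupMap[n][n]`: n * n elements of sizeof(bool) each.
constexpr std::size_t dupMapBytes(std::size_t numStublists) {
  return numStublists * numStublists * sizeof(bool);
}

// Hypothetical stack budget, used only for illustration (8 MiB).
constexpr std::size_t kAssumedStackBytes = 8u * 1024u * 1024u;

// True if the array alone would stay below the assumed budget.
constexpr bool fitsOnStack(std::size_t numStublists) {
  return dupMapBytes(numStublists) < kAssumedStackBytes;
}
```

Doubling `numStublists` quadruples the footprint, which would be consistent with the crash appearing only once the per-bin splitting was removed; as noted above, this explanation has not been tested.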

> (1) However, it seems to be on the safe side if you make the backport to 14_0, right?

There's no harm in backporting this to 14_0, so I can do that as well.

> (2) Is this the only place that uses a variable-sized array in L1T code? Could this be reviewed and fixed overall?

I've only checked the L1Trigger/TrackFinding* packages by recompiling them with the -Werror=vla flag, but there seem to be no more instances of this particular problem there.

@aehart
Contributor

aehart commented May 8, 2024

> I've only checked the L1Trigger/TrackFinding* packages by recompiling them with the -Werror=vla flag, but there seem to be no more instances of this particular problem there.

Just for fun, here is a table of all variable-length arrays in L1Trigger in CMSSW_14_1_0_pre3. I leave it to the experts of other subpackages to fix them, but hopefully this is a useful starting point.

| File name | Line number | Name of offending array |
| --- | --- | --- |
| L1Trigger/DTTriggerPhase2/src/MuonPathAssociator.cc | 1004 | useFit |
| L1Trigger/DTTriggerPhase2/src/MuonPathAssociator.cc | 122 | useFitSL1 |
| L1Trigger/DTTriggerPhase2/src/MuonPathAssociator.cc | 125 | useFitSL3 |
| L1Trigger/L1TCaloLayer1/src/UCTRegion.cc | 132 | activeTower |
| L1Trigger/L1TGlobal/plugins/GenToInputProducer.cc | 264 | idxMu |
| L1Trigger/L1TGlobal/plugins/GenToInputProducer.cc | 265 | muPtSorted |
| L1Trigger/L1TGlobal/plugins/GenToInputProducer.cc | 333 | idxEg |
| L1Trigger/L1TGlobal/plugins/GenToInputProducer.cc | 334 | egPtSorted |
| L1Trigger/L1TGlobal/plugins/GenToInputProducer.cc | 364 | idxTau |
| L1Trigger/L1TGlobal/plugins/GenToInputProducer.cc | 365 | tauPtSorted |
| L1Trigger/L1TGlobal/src/CorrCondition.cc | 369 | InvDeltaRSqLUT |
| L1Trigger/L1TGlobal/src/CorrCondition.cc | 370 | temp_InvDeltaRSq |
| L1Trigger/L1THGCal/src/backend/HGCalClusteringImpl.cc | 253 | isSeed |
| L1Trigger/L1THGCal/src/backend/HGCalClusteringImpl.cc | 372 | toRemove |
| L1Trigger/L1THGCal/src/backend/HGCalClusteringImpl.cc | 44 | isSeed |
| L1Trigger/L1TTrackMatch/plugins/L1TrackJetEmulatorProducer.cc | 150 | epbins_default |
| L1Trigger/L1TTrackMatch/plugins/L1TrackJetEmulatorProducer.cc | 196 | epbins |
| L1Trigger/L1TTrackMatch/plugins/L1TrackJetProducer.cc | 140 | epbins_default |
| L1Trigger/L1TTrackMatch/plugins/L1TrackJetProducer.cc | 179 | epbins |
| L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_hybrid_sort_ref.h | 290 | work |
| L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_hybrid_sort_ref.h | 304 | halfsorted |
| L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_hybrid_sort_ref.h | 304 | work |
| L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_hybrid_sort_ref.h | 333 | tomerge |
| L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_sort_ref.h | 113 | OutTmp |
| L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_sort_ref.h | 128 | outTmp2 |
| L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_sort_ref.h | 67 | out2 |
| L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_sort_ref.h | 70 | out3 |
| L1Trigger/Phase2L1ParticleFlow/src/L1TCorrelatorLayer1PatternFileWriter.cc | 325 | ret |
| L1Trigger/TrackFindingTracklet/src/PurgeDuplicate.cc | 194 | dupMap |
| L1Trigger/TrackFindingTracklet/src/PurgeDuplicate.cc | 202 | noMerge |

@srimanob
Contributor Author

srimanob commented May 8, 2024

Thanks @aehart
I opened GitHub issue #44937 to follow up, so we can close this one once no crashes appear in the relvals (i.e. in the next pre-release).
