HLT crashes in `HLTMuonL1TFilter::hltFilter` (#44940)
Comments
A new Issue was created by @mmusich. @smuzaffar, @rappoccio, @makortel, @Dr15Jones, @sextonkennedy, @antoniovilela: can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here.
assign hlt
New categories assigned: hlt. @Martin-Grunewald, @mmusich: you have been requested to review this Pull request/Issue and eventually sign? Thanks.
(stack-trace excerpts pointing at `HLTMuonL1TFilter::hltFilter`)
One more instance in run 380531: […]
I ran run 380466 on an HLT machine with a GPU and various thread/stream configurations, w/o any crash.
Indeed, quoting myself: […]

@mmusich ok, sorry. Reading "eos" and "offline" I thought you ran on an lxplus-like machine w/o GPU.
I did run on […]
Would running valgrind be feasible?
Running multi-arch with a GPU, I got: […]
Hmh, according to https://valgrind.org/info/platforms.html the amd64/linux target should support instructions "up to and including AVX2". OK, I found this in the release notes of 3.23 (we use 3.22): https://valgrind.org/docs/manual/dist.news.html […]

@smuzaffar, could we update valgrind to 3.23 (at least in 14_1_X; 14_0_X could be useful too)?
@makortel, cms-sw/cmsdist#9185 updates valgrind to 3.23.0 for 14.1.X.
valgrind managed to run with the standard release and 1 GPU. Plenty of those: […] Actually, many of these as well: […]
Not obvious at first glance.
Following https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Find-a-memory-corruption-bug […] and then a segfault in […]
so I set […]
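(For context: the technique on that jemalloc wiki page is "junk" filling. Assuming a jemalloc built with fill support and run with `MALLOC_CONF="junk:true"`, allocations are filled with `0xa5` bytes and freed memory with `0x5a` bytes, so reads of stale memory return recognizable garbage instead of plausible values. A minimal sketch of what that surfaces; the `LD_PRELOAD` path is illustrative, and the use-after-free is deliberate, for demonstration only:

```cpp
#include <cstdio>
#include <cstdlib>

// Run under a fill-enabled jemalloc, e.g.:
//   LD_PRELOAD=/usr/lib64/libjemalloc.so MALLOC_CONF="junk:true" ./a.out
// The read below is deliberately a use-after-free: with junk filling on,
// it tends to print 0x5a5a5a5a (jemalloc's free-fill byte) instead of 42 --
// exactly the kind of "clear junk" printout mentioned in this thread.
int main() {
  unsigned* p = static_cast<unsigned*>(std::malloc(sizeof(unsigned)));
  *p = 42u;
  std::free(p);
  std::printf("0x%08x\n", *p);  // undefined behavior, for demonstration only
  return 0;
}
```
)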
BTW, I added this: […] It prints at the place where valgrind reports the issue, and the content is clearly junk and is not reproducible. […]
For reference: `cmssw/HLTrigger/Muon/plugins/HLTMuonL1TFilter.cc`, lines 105 to 109 @ `ab4ccb0`, where: `cmssw/HLTrigger/Muon/plugins/HLTMuonL1TFilter.cc`, lines 27 to 28 @ `ab4ccb0`,
and the crashing module configuration:

```python
process.hltL1sCDC = cms.EDFilter( "HLTL1TSeed",
saveTags = cms.bool( True ),
L1SeedsLogicalExpression = cms.string( "L1_CDC_SingleMu_3_er1p2_TOP120_DPHI2p618_3p142" ),
L1ObjectMapInputTag = cms.InputTag( "hltGtStage2ObjectMap" ),
L1GlobalInputTag = cms.InputTag( "hltGtStage2Digis" ),
L1MuonInputTag = cms.InputTag( 'hltGtStage2Digis','Muon' ),
L1MuonShowerInputTag = cms.InputTag( 'hltGtStage2Digis','MuonShower' ),
L1EGammaInputTag = cms.InputTag( 'hltGtStage2Digis','EGamma' ),
L1JetInputTag = cms.InputTag( 'hltGtStage2Digis','Jet' ),
L1TauInputTag = cms.InputTag( 'hltGtStage2Digis','Tau' ),
L1EtSumInputTag = cms.InputTag( 'hltGtStage2Digis','EtSum' ),
L1EtSumZdcInputTag = cms.InputTag( 'hltGtStage2Digis','EtSumZDC' )
)
process.hltL1fL1sCDCL1Filtered0 = cms.EDFilter( "HLTMuonL1TFilter",
saveTags = cms.bool( True ),
CandTag = cms.InputTag( 'hltGtStage2Digis','Muon' ),
PreviousCandTag = cms.InputTag( "hltL1sCDC" ),
MaxEta = cms.double( 2.5 ),
MinPt = cms.double( 0.0 ),
MaxDeltaR = cms.double( 0.3 ),
MinN = cms.int32( 1 ),
CentralBxOnly = cms.bool( False ),
SelectQualities = cms.vint32( )
)
```

@cms-sw/l1-l2 FYI
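(For orientation, my own sketch rather than code from the thread: `hltL1sCDC` (`HLTL1TSeed`) stores refs to the L1 muons it accepted into its filter product, and `hltL1fL1sCDCL1Filtered0` (`HLTMuonL1TFilter`) reads them back via `PreviousCandTag` and dereferences them. A toy model of that handoff, with stand-in types instead of the CMSSW classes, showing why a ref whose index comes from a mismatched object map can read out of bounds:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

struct ToyMuon { double pt, eta, phi; };

// Toy stand-in for an edm::Ref: a collection pointer plus an index ("key").
// Like the real thing, dereferencing does not bounds-check.
struct ToyMuonRef {
  const std::vector<ToyMuon>* coll;
  std::size_t key;
  const ToyMuon& operator*() const { return (*coll)[key]; }
};

int main() {
  // The unpacker produced one muon in BX 0 ...
  std::vector<ToyMuon> unpackedMuons{{3.5, 0.8, 1.2}};

  // ... but a mismatched emulator object map claims index 2 fired the seed.
  ToyMuonRef good{&unpackedMuons, 0};
  ToyMuonRef bad{&unpackedMuons, 2};

  std::cout << "good ref: pt = " << (*good).pt << '\n';
  // Dereferencing 'bad' ((*bad).pt) would read past the end of the vector:
  // the kind of invalid read that surfaces as a segfault once the
  // downstream filter uses the ref.
  std::cout << "bad ref: key = " << bad.key
            << " but collection size = " << unpackedMuons.size() << '\n';
  return 0;
}
```
)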
Running UBSAN found this: […]
The only recent change I know of from the L1 side for muons is the OMTF->GMT unconstrained pT update. I think that involved an unpacker update, however. I assume that is in […] for HLT.
And this is ASAN, which aborts after finding the error: […]
I read from the stack trace that you used […]
It seems there are known mismatches between the L1T firmware and emulator for CDC seeds (I don't know if that is related to these crashes, but it might be). I don't know anything about these mismatches (when they started, what is causing them, what the plan to fix them is). @elfontan, could you please clarify?
I'm using this:

```bash
#!/bin/bash
hltGetConfiguration run:381147 \
--globaltag 140X_dataRun3_HLT_v3 \
--no-prescale \
--no-output \
--max-events 1 \
--paths HLT_CDC_L2cosmic_10_er1p0_v* \
--input root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381147/run381147_ls0202_index000187_fu-c2b05-29-01_pid2159904.root \
> hlt.py
cat <<@EOF >> hlt.py
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.source.skipEvents = cms.untracked.uint32( 56 )
del process.MessageLogger
process.load("FWCore.MessageLogger.MessageLogger_cfi")
@EOF
cmsRun hlt.py &> hlt.log
```
This dodges the issue. Not sure the warning is accurate, and whether or not this should be implemented regardless of the root cause of the problem (if so, the same check would probably have to be added for other L1T objects in that same EDFilter).

```diff
diff --git a/HLTrigger/HLTfilters/plugins/HLTL1TSeed.cc b/HLTrigger/HLTfilters/plugins/HLTL1TSeed.cc
index 699a170d60d..6fae44e83bb 100644
--- a/HLTrigger/HLTfilters/plugins/HLTL1TSeed.cc
+++ b/HLTrigger/HLTfilters/plugins/HLTL1TSeed.cc
@@ -950,6 +950,13 @@ bool HLTL1TSeed::seedsL1TriggerObjectMaps(edm::Event& iEvent, trigger::TriggerFi
<< "\nNo muons added to filterproduct." << endl;
} else {
for (std::list<int>::const_iterator itObj = listMuon.begin(); itObj != listMuon.end(); ++itObj) {
+ if (*itObj < 0 or unsigned(*itObj) >= muons->size(0)) {
+ edm::LogWarning("HLTL1TSeed")
+ << "Invalid index from the L1ObjectMap (L1uGT emulator), will be ignored (l1t::MuonBxCollection):"
+ << " index=" << *itObj << " (size of unpacked L1T objects in BX0 = " << muons->size(0) << ")";
+ continue;
+ }
+
// Transform to index for Bx = 0 to begin of BxVector
unsigned int index = muons->begin(0) - muons->begin() + *itObj;
```
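(To unpack the arithmetic in the patched snippet, here is a minimal, self-contained model; `FlatBxVector` is a stand-in, not the CMSSW `BXVector`, and assumes the usual layout of one flat storage vector with per-BX ranges. Objects for all bunch crossings live in a single flat vector, `begin(0) - begin()` is the offset of BX 0, and the object-map index is added on top, so an emulator index that is negative or >= `size(0)` lands outside the BX-0 range:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

struct FlatBxVector {
  std::vector<int> storage;          // all objects, BX -2..+2, concatenated
  std::vector<std::size_t> bxStart;  // 6 offsets delimiting the 5 BX ranges

  std::size_t beginOffset(int bx) const { return bxStart[bx + 2]; }
  std::size_t size(int bx) const { return bxStart[bx + 3] - bxStart[bx + 2]; }
};

int main() {
  // BX -2/-1 empty, BX 0 holds {10, 11}, BX +1 holds {20}, BX +2 empty.
  FlatBxVector muons{{10, 11, 20}, {0, 0, 0, 2, 3, 3}};

  int objMapIndex = 2;  // index as reported by the emulator object map

  // The transformation from the snippet above: offset of BX 0 in the flat
  // vector (begin(0) - begin()) plus the object-map index.
  std::size_t index = muons.beginOffset(0) + objMapIndex;

  // Only 2 objects were unpacked in BX 0, so objMapIndex == 2 points past
  // the BX-0 range (here into BX +1; for a larger index, past the buffer).
  if (objMapIndex < 0 || static_cast<std::size_t>(objMapIndex) >= muons.size(0)) {
    std::cout << "invalid object-map index " << objMapIndex << " (BX-0 size = "
              << muons.size(0) << "), skipping as in the patch\n";
  } else {
    std::cout << "flat index " << index << " -> " << muons.storage[index] << '\n';
  }
  return 0;
}
```
)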
Here's my rough understanding of the underlying issue. Some of this might be inaccurate; an L1T expert should comment.

The example from #44940 (comment) shows […] (patch) […] The "object map" returns indices […] Based on the above, I don't see how […]
I think the patch proposed by @missirol should be implemented ASAP, at least to monitor the frequency of this misbehavior.
BTW: should a new, more specific issue be opened against L1TSeed or the L1TMuon unpacker? (Or the cosmic HLT?)
This is tracked at https://its.cern.ch/jira/browse/CMSHLT-3216.
The patch in #44940 (comment) is implemented in #45047 (14_1_X) and #45048 (14_0_X). In the near future, maybe a better patch would ensure that […]
Can […]? For the out-of-time muons, would it be useful to be able to tag them? Or would the HLT reconstruction not be able to use them anyway?
From what I understand, not in the current implementation, because […]

This, I don't really know (I would guess the HLT reconstruction would not be able to use them, but I might be wrong).
I agree, this would be a consistency fix, and would cover most use cases. Triggers looking at BX≠0 should be rare special cases which would need specific treatment anyway.
This is tracked at https://its.cern.ch/jira/browse/CMSHLT-3218.
Can we stick to GitHub for issues, instead of splitting them between GitHub and JIRA (which, unlike GH, has a horrible user interface)?
My understanding is that we are using GitHub for discussing s/w issues and JIRA for HLT configuration changes. So, in short: no.
OK, then. Feel free to enjoy the crappy user interface and the lack of feedback.
To be honest, I am not enjoying it at all. But if we want to move everything to GitHub (at least for the HLT-related items that directly rely on cmssw, e.g. menus, tests, etc. - broadly speaking, the "HLT configurations" and "STORM tasks" components) and not JIRA, that's a decision to be taken at coordination level (which is above my pay grade). It could be discussed elsewhere.
Solutions proposed (technically avoiding the crash online): […]

A CMSHLT JIRA ticket to discuss the next steps is open at https://its.cern.ch/jira/browse/CMSHLT-3216.
This issue is fully signed and ready to be closed. |
@cmsbuild, please close
Original issue description (by @mmusich):

This issue is to document several crashes related to `HLTMuonL1TFilter::hltFilter` that happened during:

- CMSSW_14_0_5_patch1 (…)
- CMSSW_14_0_6_MULTIARCHS (…)
- CMSSW_14_0_6_MULTIARCHS (…)
- CMSSW_14_0_7_MULTIARCHS (…)

In all occurrences there is a segmentation fault mentioning `HLTMuonL1TFilter::hltFilter` in the stack trace, e.g.: […]

We have tried (unsuccessfully) to reproduce these crashes offline using the following scripts [1], [2].
For the record, I am attaching the full stack traces from F3 mon for the runs in question: […]
[1] Script to check run 380115 […]
[2] Script to check run 380466 […]
Cc: @cms-sw/hlt-l2 @trtomei @mzarucki @trocino