HLT crashes in HLTMuonL1TFilter::hltFilter #44940

Closed
mmusich opened this issue May 9, 2024 · 47 comments

@mmusich
Contributor

mmusich commented May 9, 2024

This issue documents several crashes related to HLTMuonL1TFilter::hltFilter that happened during runs 380115 and 380466.

In all occurrences, the crash is a segmentation fault whose stack trace includes HLTMuonL1TFilter::hltFilter, e.g.:

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Tue May 7 12:54:35 CEST 2024
Thread 10 (Thread 0x7f8cbabfd700 (LWP 674105) "cmsRun"):
#0 0x00007f8d3e57b0e1 in poll () from /lib64/libc.so.6
#1 0x00007f8d348fd2ff in full_read.constprop () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2 0x00007f8d348b0afc in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3 0x00007f8d348b1460 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4 
#5 0x00007f8c56b4608d in HLTMuonL1TFilter::hltFilter(edm::Event&, edm::EventSetup const&, trigger::TriggerFilterObjectWithRefs&) const () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginHLTriggerMuonAuto.so
#6 0x00007f8cbc6fde4c in HLTFilter::filter(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libHLTriggerHLTcore.so
#7 0x00007f8d40fc5040 in edm::global::EDFilterBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#8 0x00007f8d40fbd83c in edm::WorkerT::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#9 0x00007f8d40f4bf59 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#12 0x00007f8d3f6ee95b in tbb::detail::r1::task_dispatcher::local_wait_for_all (t=0x7f8be3d5ef00, waiter=..., this=0x7f8d39e92700) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#13 tbb::detail::r1::task_dispatcher::local_wait_for_all (t=0x0, waiter=..., this=0x7f8d39e92700) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#14 tbb::detail::r1::arena::process (tls=..., this=) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:137
#15 tbb::detail::r1::market::process (this=, j=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/market.cpp:599
#16 0x00007f8d3f6f0b0e in tbb::detail::r1::rml::private_worker::run (this=0x7f8d39e87000) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#17 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7f8d39e87000) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#18 0x00007f8d3e8241ca in start_thread () from /lib64/libpthread.so.0
#19 0x00007f8d3e48fe73 in clone () from /lib64/libc.so.6
[ message truncated - showing only crashed thread ]

We have tried, unsuccessfully, to reproduce these crashes offline using the scripts [1] and [2] below.
For the record, I am attaching the full stack traces from F3 mon for the runs in question:

[1]

Script to check 380115
#!/bin/bash -ex

scram p CMSSW CMSSW_14_0_5_patch1
cd CMSSW_14_0_5_patch1/src
eval `scramv1 runtime -sh`

https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 380115 > hlt_run380115.py
cat <<@EOF >> hlt_run380115.py
from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
process.EvFDaqDirector = _EvFDaqDirector.clone(
  buBaseDir = '/eos/cms/store/group/tsg/FOG/error_stream/',
  runNumber = 380115
)
from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
process.source = _source.clone(
  fileListMode = True,
  fileNames = (
  '/eos/cms/store/group/tsg/FOG/error_stream/run380115/run380115_ls0338_index000079_fu-c2b03-28-01_pid1451372.raw',
  '/eos/cms/store/group/tsg/FOG/error_stream/run380115/run380115_ls0338_index000104_fu-c2b03-28-01_pid1451372.raw'
  )
)
process.options.wantSummary = True

process.options.numberOfThreads = 32
process.options.numberOfStreams = 24
@EOF

mkdir run380115
cmsRun hlt_run380115.py &> crash_run380115.log

[2]

Script to check 380466
#!/bin/bash -ex

scram p CMSSW CMSSW_14_0_6_MULTIARCHS
cd CMSSW_14_0_6_MULTIARCHS/src
eval `scramv1 runtime -sh`

https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 380466 > hlt_run380466.py
cat <<@EOF >> hlt_run380466.py
from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
process.EvFDaqDirector = _EvFDaqDirector.clone(
   buBaseDir = '/eos/cms/store/group/tsg/FOG/error_stream/',
   runNumber = 380466
)
from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
process.source = _source.clone(
   fileListMode = True,
   fileNames = (
   '/eos/cms/store/group/tsg/FOG/error_stream/run380466/run380466_ls0276_index000212_fu-c2b03-09-01_pid672001.raw',
   '/eos/cms/store/group/tsg/FOG/error_stream/run380466/run380466_ls0276_index000232_fu-c2b03-09-01_pid672001.raw',
   '/eos/cms/store/group/tsg/FOG/error_stream/run380466/run380466_ls0276_index000246_fu-c2b03-09-01_pid672001.raw'
   )
)
process.options.wantSummary = True

process.options.numberOfThreads = 32
process.options.numberOfStreams = 24
@EOF

mkdir run380466
cmsRun hlt_run380466.py &> crash_run380466.log

Cc: @cms-sw/hlt-l2 @trtomei @mzarucki @trocino

@cmsbuild
Contributor

cmsbuild commented May 9, 2024

A new Issue was created by @mmusich.

@smuzaffar, @rappoccio, @makortel, @Dr15Jones, @sextonkennedy, @antoniovilela can you please review it and eventually sign/assign? Thanks.


@makortel
Contributor

makortel commented May 9, 2024

assign hlt

@cmsbuild
Contributor

cmsbuild commented May 9, 2024

New categories assigned: hlt

@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich mmusich changed the title HLT crashes in Run 380115 and 380466 (HLTMuonL1TFilter::hltFilter) HLT crashes in HLTMuonL1TFilter::hltFilter May 10, 2024
@mmusich
Contributor Author

mmusich commented May 10, 2024

One more instance in run380531:

@VinInn
Contributor

VinInn commented May 10, 2024

I ran 380466 on an HLT machine with a GPU and various thread/stream configurations, without any crash.

@mmusich
Contributor Author

mmusich commented May 10, 2024

Quoting @VinInn: "I ran 380466 on an HLT machine with a GPU and various thread/stream configurations, without any crash."

Indeed, and quoting myself:

"We have tried (unsuccessfully) to reproduce offline these crashes using the following scripts [1], [2]."

@VinInn
Contributor

VinInn commented May 10, 2024

@mmusich OK, sorry. Reading "eos" and "offline", I thought you ran on an lxplus-like machine without a GPU.

@mmusich
Contributor Author

mmusich commented May 10, 2024

Quoting @VinInn: "Reading eos and offline I thought you ran on an lxplus-like machine without a GPU."

I did run on lxplus-gpu, using FRD files copied from the error stream (from the SM people). This is the standard procedure from the FOG instructions.

@makortel
Contributor

Would running valgrind be feasible?

@VinInn
Contributor

VinInn commented May 10, 2024

Running the multi-arch release with a GPU, I got:

==756999== valgrind: Unrecognised instruction at address 0x57f5fd19.
==756999==    at 0x57F5FD19: void riemannFit::transformToPerigeePlane<Eigen::Matrix<double, 5, 1, 0, 5, 1>, Eigen::Matrix<double, 5, 5, 0, 5, 5>, Eigen::Matrix<double, 5, 1, 0, 5, 1>, Eigen::Matrix<double, 5, 5, 0, 5, 5> >(Eigen::Matrix<double, 5, 1, 0, 5, 1> const&, Eigen::Matrix<double, 5, 5, 0, 5, 5> const&, Eigen::Matrix<double, 5, 1, 0, 5, 1>&, Eigen::Matrix<double, 5, 5, 0, 5, 5>&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02835/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_MULTIARCHS_X_2024-05-02-2300/lib/el9_amd64_gcc12/scram_x86-64-v3/pluginRecoPixelVertexingPixelTrackFittingPlugins.so)
==756999==    by 0x57F6E8F2: PixelTrackProducerFromSoAAlpaka<pixelTopology::Phase1>::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02835/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_MULTIARCHS_X_2024-05-02-2300/lib/el9_amd64_gcc12/scram_x86-64-v3/pluginRecoPixelVertexingPixelTrackFittingPlugins.so)
==756999==    by 0x4A96411: edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02835/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_MULTIARCHS_X_2024-05-02-2300/lib/el9_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so)
==756999==    by 0x4A8FABB: edm::WorkerT<edm::global::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02835/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_MULTIARCHS_X_2024-05-02-2300/lib/el9_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so)
==756999==    by 0x4A19F48: std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02835/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_MULTIARCHS_X_2024-05-02-2300/lib/el9_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so)
==756999==    by 0x4A247B7: edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02835/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_MULTIARCHS_X_2024-05-02-2300/lib/el9_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so)
==756999==    by 0x4E89F77: tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02835/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_MULTIARCHS_X_2024-05-02-2300/lib/el9_amd64_gcc12/scram_x86-64-v3/libFWCoreConcurrency.so)
==756999==    by 0x641C91A: UnknownInlinedFun (task_dispatcher.h:322)
==756999==    by 0x641C91A: UnknownInlinedFun (task_dispatcher.h:458)
==756999==    by 0x641C91A: UnknownInlinedFun (arena.cpp:137)
==756999==    by 0x641C91A: tbb::detail::r1::market::process(rml::job&) (market.cpp:599)
==756999==    by 0x641EACD: UnknownInlinedFun (private_server.cpp:271)
==756999==    by 0x641EACD: tbb::detail::r1::rml::private_worker::thread_routine(void*) (private_server.cpp:221)
==756999==    by 0x68D9801: start_thread (in /usr/lib64/libc.so.6)
==756999== Your program just tried to execute an instruction that Valgrind
==756999== did not recognise.  There are two possible reasons for this.
==756999== 1. Your program has a bug and erroneously jumped to a non-code
==756999==    location.  If you are running Memcheck and you just saw a
==756999==    warning about a bad jump, it's probably your program's fault.
==756999== 2. The instruction is legitimate but Valgrind doesn't handle it,
==756999==    i.e. it's Valgrind's fault.  If you think this is the case or
==756999==    you are not sure, please let us know and we'll try to fix it.
==756999== Either way, Valgrind will now raise a SIGILL signal which will
==756999== probably kill your program.
==756999== Warning: ignored attempt to set SIGRT32 handler in sigaction();
==756999==          the SIGRT32 signal is used internally by Valgrind


A fatal system signal has occurred: illegal instruction
The following is the call stack containing the origin of the signal.

==756999== Unsupported clone() flags: 0x311
==756999==
==756999== The only supported clone() uses are:
==756999==  - via a threads library (LinuxThreads or NPTL)
==756999==  - via the implementation of fork or vfork
==756999==
==756999== Valgrind detected that your program requires
==756999== the following unimplemented functionality:
==756999==    Valgrind does not support general clone().
==756999== This may be because the functionality is hard to implement,
==756999== or because no reasonable program would behave this way,
==756999== or because nobody has yet needed it.  In any case, let us know at
==756999== www.valgrind.org and/or try to work around the problem, if you can.
==756999==
==756999== Valgrind has to exit now.  Sorry.  Bye!
==756999==

@makortel
Contributor

Hmm, according to https://valgrind.org/info/platforms.html, the amd64/linux target should support instructions "up to and including AVX2". OK, I found this in the release notes of 3.23 (we use 3.22):

AMD64 better supports code build with -march=x86-64-v3.
fused-multiple-add instructions (fma) are now emulated more
accurately. And memcheck now handles __builtin_strcmp using 128/256
bit vectors with sse4.1, avx/avx2.

https://valgrind.org/docs/manual/dist.news.html

@smuzaffar Could we update valgrind to 3.23 (at least in 14_1_X; 14_0_X could be useful too)?

@smuzaffar
Contributor

@makortel, cms-sw/cmsdist#9185 updates valgrind to 3.23.0 for 14.1.X.

@VinInn
Contributor

VinInn commented May 11, 2024

valgrind managed to run with the standard release and 1 GPU.
I found this:

==790020== Thread 13:
==790020== Invalid read of size 8
==790020==    at 0xA58A3FB2: HLTMuonL1TFilter::hltFilter(edm::Event&, edm::EventSetup const&, trigger::TriggerFilterObjectWithRefs&) const (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginHLTriggerMuonAuto.so)
==790020==    by 0x9210DBCA: HLTFilter::filter(edm::StreamID, edm::Event&, edm::EventSetup const&) const (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libHLTriggerHLTcore.so)
==790020==    by 0x4A8AE6D: edm::global::EDFilterBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==790020==    by 0x4A84D3B: edm::WorkerT<edm::global::EDFilterBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==790020==    by 0x4A11528: std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==790020==    by 0x4A1B997: edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==790020==    by 0x498877D: tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==790020==    by 0x640991A: UnknownInlinedFun (task_dispatcher.h:322)
==790020==    by 0x640991A: UnknownInlinedFun (task_dispatcher.h:458)
==790020==    by 0x640991A: UnknownInlinedFun (arena.cpp:137)
==790020==    by 0x640991A: tbb::detail::r1::market::process(rml::job&) (market.cpp:599)
==790020==    by 0x640BACD: UnknownInlinedFun (private_server.cpp:271)
==790020==    by 0x640BACD: tbb::detail::r1::rml::private_worker::thread_routine(void*) (private_server.cpp:221)
==790020== Invalid read of size 8
==790020==    at 0xA58A3FC7: HLTMuonL1TFilter::hltFilter(edm::Event&, edm::EventSetup const&, trigger::TriggerFilterObjectWithRefs&) const (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginHLTriggerMuonAuto.so)
==790020==    by 0x9210DBCA: HLTFilter::filter(edm::StreamID, edm::Event&, edm::EventSetup const&) const (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libHLTriggerHLTcore.so)
==790020==    by 0x4A8AE6D: edm::global::EDFilterBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==790020==    by 0x4A84D3B: edm::WorkerT<edm::global::EDFilterBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==790020==    by 0x4A11528: std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==790020==    by 0x4A1B997: edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==790020==    by 0x498877D: tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==790020==    by 0x640991A: UnknownInlinedFun (task_dispatcher.h:322)
==790020==    by 0x640991A: UnknownInlinedFun (task_dispatcher.h:458)
==790020==    by 0x640991A: UnknownInlinedFun (arena.cpp:137)
==790020==    by 0x640991A: tbb::detail::r1::market::process(rml::job&) (market.cpp:599)
==790020==    by 0x640BACD: UnknownInlinedFun (private_server.cpp:271)
==790020==    by 0x640BACD: tbb::detail::r1::rml::private_worker::thread_routine(void*) (private_server.cpp:221)
==790020==    by 0x68C6801: start_thread (in /usr/lib64/libc.so.6)

Plenty of those, actually.

Many of these as well:

==790020== Thread 9:
==790020== Conditional jump or move depends on uninitialised value(s)
==790020==    at 0xB9DFBC31: muonisolation::CaloExtractorByAssociator::deposits(edm::Event const&, edm::EventSetup const&, reco::Track const&) const (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginRecoMuonMuonIsolationPlugins.so)
==790020==    by 0xB94C22AA: MuonIdProducer::fillMuonIsolation(edm::Event&, edm::EventSetup const&, reco::Muon&, reco::IsoDeposit&, reco::IsoDeposit&, reco::IsoDeposit&, reco::IsoDeposit&, reco::IsoDeposit&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginRecoMuonMuonIdentificationPlugins.so)
==790020==    by 0xB94C7CCC: MuonIdProducer::produce(edm::Event&, edm::EventSetup const&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginRecoMuonMuonIdentificationPlugins.so)
==790020==    by 0x4AA65C2: edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==790020==    by 0x4A853EB: edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==790020==    by 0x4A11528: std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==790020==    by 0x4A1B997: edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==790020==    by 0x4E74F27: tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreConcurrency.so)
==790020==    by 0x640991A: UnknownInlinedFun (task_dispatcher.h:322)
==790020==    by 0x640991A: UnknownInlinedFun (task_dispatcher.h:458)
==790020==    by 0x640991A: UnknownInlinedFun (arena.cpp:137)
==790020==    by 0x640991A: tbb::detail::r1::market::process(rml::job&) (market.cpp:599)
==790020==    by 0x640BACD: UnknownInlinedFun (private_server.cpp:271)
==790020==    by 0x640BACD: tbb::detail::r1::rml::private_worker::thread_routine(void*) (private_server.cpp:221)
==790020==    by 0x68C6801: start_thread (in /usr/lib64/libc.so.6)

@VinInn
Contributor

VinInn commented May 11, 2024

==796505== Thread 16:
==796505== Invalid read of size 8
==796505==    at 0xA5533FB2: UnknownInlinedFun (PtEtaPhiM4D.h:142)
==796505==    by 0xA5533FB2: UnknownInlinedFun (LorentzVector.h:644)
==796505==    by 0xA5533FB2: UnknownInlinedFun (ParticleState.h:139)
==796505==    by 0xA5533FB2: UnknownInlinedFun (LeafCandidate.h:148)
==796505==    by 0xA5533FB2: HLTMuonL1TFilter::hltFilter(edm::Event&, edm::EventSetup const&, trigger::TriggerFilterObjectWithRefs&) const (HLTMuonL1TFilter.cc:139)
==796505==    by 0x91F48BCA: HLTFilter::filter(edm::StreamID, edm::Event&, edm::EventSetup const&) const (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libHLTriggerHLTcore.so)

Not obvious at first glance.

@VinInn
Contributor

VinInn commented May 11, 2024

if (deltaR2(muon->eta(), muon->phi(), prevMuons[it2]->eta(), prevMuons[it2]->phi()) < maxDR2_)

Given that muon->eta() is already accessed above (with no report from valgrind), the invalid read should come from prevMuons[it2].
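To make the suspicion concrete, here is a minimal standalone sketch of the matching step on that line. It assumes deltaR2 follows the usual definition (Δη² + Δφ², with Δφ wrapped into [-π, π]); the struct L1MuonLike and the function matchesPrevious are hypothetical stand-ins for l1t::Muon and the loop in HLTMuonL1TFilter::hltFilter, and plain non-owning pointers play the role of MuonRef. It is only meant to show where the reads through prevMuons[it2] happen, not to reproduce the CMSSW code.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for an L1 muon: only the fields the loop reads.
struct L1MuonLike {
  double eta;
  double phi;
};

constexpr double kTwoPi = 6.283185307179586;

// Wrap the phi difference into [-pi, pi]. std::remainder is used instead of
// a while loop so that garbage inputs (e.g. 6e+25) cannot make it spin.
inline double deltaPhi(double phi1, double phi2) {
  return std::remainder(phi1 - phi2, kTwoPi);
}

// Squared angular distance, compared against maxDR^2 to avoid a sqrt.
inline double deltaR2(double eta1, double phi1, double eta2, double phi2) {
  const double deta = eta1 - eta2;
  const double dphi = deltaPhi(phi1, phi2);
  return deta * deta + dphi * dphi;
}

// Same shape as the loop around HLTMuonL1TFilter.cc:139. Each prev->eta and
// prev->phi read goes through memory owned by the referenced collection, so
// if that memory is no longer valid, this is where the bad read happens.
bool matchesPrevious(const L1MuonLike& muon,
                     const std::vector<const L1MuonLike*>& prevMuons,
                     double maxDR2) {
  assert(maxDR2 >= 0.0);
  for (const L1MuonLike* prev : prevMuons) {
    if (deltaR2(muon.eta, muon.phi, prev->eta, prev->phi) < maxDR2) {
      return true;  // first match is enough, like matchPrevL1 + break
    }
  }
  return false;
}
```

With MaxDeltaR = 0.3 from the hltL1fL1sCDCL1Filtered0 configuration in this thread, and assuming maxDR2_ is simply MaxDeltaR squared, the threshold passed here would be 0.09.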

@VinInn
Contributor

VinInn commented May 12, 2024

Following https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Find-a-memory-corruption-bug,
I set export MALLOC_CONF=junk:true and got a crash somewhere else!
Multiple messages like:

%MSG-e EcalRecHitError:  EcalRecHitProducer:hltEcalRecHit  12-May-2024 08:41:48 CEST Run: 380466 Event: 490512903
No intercalib const found for xtal 2779096485! something wrong with EcalIntercalibConstants in your DB?
%MSG
%MSG-e EcalLaserDbService:  EcalRecHitProducer:hltEcalRecHit  12-May-2024 08:41:48 CEST Run: 380466 Event: 490512903
 DetId is NOT in ECAL

and then a segfault in

Module: HLTEcalRecHitInAllL1RegionsProducer:hltRechitInRegionsECAL (crashed)

@VinInn
Contributor

VinInn commented May 12, 2024

So I set export MALLOC_CONF=zero:true.
I get no crash, just different junk at the location where valgrind reports the issue.

@VinInn
Contributor

VinInn commented May 12, 2024

BTW, I added this:

diff --git a/HLTrigger/Muon/plugins/HLTMuonL1TFilter.cc b/HLTrigger/Muon/plugins/HLTMuonL1TFilter.cc
index 3b8f3334bef..b2da2a351e5 100644
--- a/HLTrigger/Muon/plugins/HLTMuonL1TFilter.cc
+++ b/HLTrigger/Muon/plugins/HLTMuonL1TFilter.cc
@@ -136,6 +136,10 @@ bool HLTMuonL1TFilter::hltFilter(edm::Event& iEvent,
       bool matchPrevL1 = false;
       int prevSize = prevMuons.size();
       for (int it2 = 0; it2 < prevSize; it2++) {
+        if (prevMuons[it2].isNull()) std::cout << ">>> not valid ref " << it2 << std::endl;
+        auto const &  m = prevMuons[it2];
+        if (m->pt() < 0.01 || std::abs(m->eta())>7 || std::abs(m->phi())>6.3) std::cout << ">>> ??"
+           << m.index() << ' ' << m.get() << ' ' << m->pt() << ' ' << m->eta() << ' ' << m->phi() << std::endl;
         if (deltaR2(muon->eta(), muon->phi(), prevMuons[it2]->eta(), prevMuons[it2]->phi()) < maxDR2_) {
           matchPrevL1 = true;
           break;

It prints out at the place where valgrind reports the issue; the content is clearly junk, and it is not reproducible (it differs from run to run):

[innocent@gputest-genoa-01 (gpu-c2e35-08-01) hltBug]$ grep ">>> ??"  *.log
bug.log:>>> ??4 0x7fa75a0afd60 6.9347e-310 2.98429e-315 0
bug.log:>>> ??4 0x7fa75a0afd60 6.9347e-310 2.98429e-315 0
bug.log:>>> ??4 0x7fa75a0afd60 6.9347e-310 2.98429e-315 0
bug2.log:>>> ??4 0x7fe65e713460 6.94792e-310 6.91692e-323 6.38221e+25
bug2.log:>>> ??4 0x7fe65e713460 6.94792e-310 6.91692e-323 6.38221e+25
bug2.log:>>> ??4 0x7fe65e713460 6.94792e-310 6.91692e-323 6.38221e+25
bug3.log:>>> ??4 0x7f517eba7d60 7.45854e+82 -3.91644e+217 -nan
bug3.log:>>> ??4 0x7f517eba7d60 7.45854e+82 -3.91644e+217 -nan
bug3.log:>>> ??4 0x7f517eba7d60 7.45854e+82 -3.91644e+217 -nan
valg2.log:>>> ??4 0x19fa8be10 7.11455e-322 0 1.44712e-320
valg2.log:>>> ??4 0x19fa8be10 7.11455e-322 0 1.44712e-320
valg2.log:>>> ??4 0x19fa8be10 7.11455e-322 0 1.44712e-320
zeroMem.log:>>> ??4 0x7f21d76ec660 6.90621e-310 6.91692e-323 6.38221e+25
zeroMem.log:>>> ??4 0x7f21d76ec660 6.90621e-310 6.91692e-323 6.38221e+25
zeroMem.log:>>> ??4 0x7f21d76ec660 6.90621e-310 6.91692e-323 6.38221e+25
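The ad-hoc check in the diff can be factored into a small standalone predicate. This is a hypothetical sketch (looksLikeJunk is an illustrative name, not a CMSSW function) that uses the same thresholds as the diff: pt < 0.01, |eta| > 7, |phi| > 6.3. It adds one thing the original condition lacks: NaN compares false against every threshold, so a record whose only bad field is NaN would pass the original test silently; the sketch therefore checks std::isfinite first.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not part of CMSSW: flags kinematics that cannot come
// from a real L1 muon, using the thresholds of the debug printout above.
inline bool looksLikeJunk(double pt, double eta, double phi) {
  // NaN compares false with every threshold below, so check it explicitly.
  if (!std::isfinite(pt) || !std::isfinite(eta) || !std::isfinite(phi)) {
    return true;
  }
  return pt < 0.01 || std::abs(eta) > 7.0 || std::abs(phi) > 6.3;
}
```

With these thresholds, every entry in the printout above would be flagged; all but the bug3.log one are caught by the pt cut alone.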

@mmusich
Contributor Author

mmusich commented May 12, 2024

For reference, prevMuons is defined as:

Handle<TriggerFilterObjectWithRefs> previousLevelTFOWR;
iEvent.getByToken(previousCandToken_, previousLevelTFOWR);
vector<MuonRef> prevMuons;
previousLevelTFOWR->getObjects(TriggerL1Mu, prevMuons);

where:

previousCandTag_(iConfig.getParameter<edm::InputTag>("PreviousCandTag")),
previousCandToken_(consumes<trigger::TriggerFilterObjectWithRefs>(previousCandTag_)),

and the configuration of the crashing module, hltL1fL1sCDCL1Filtered0, together with its seed module hltL1sCDC, is:

process.hltL1sCDC = cms.EDFilter( "HLTL1TSeed",
    saveTags = cms.bool( True ),
    L1SeedsLogicalExpression = cms.string( "L1_CDC_SingleMu_3_er1p2_TOP120_DPHI2p618_3p142" ),
    L1ObjectMapInputTag = cms.InputTag( "hltGtStage2ObjectMap" ),
    L1GlobalInputTag = cms.InputTag( "hltGtStage2Digis" ),
    L1MuonInputTag = cms.InputTag( 'hltGtStage2Digis','Muon' ),
    L1MuonShowerInputTag = cms.InputTag( 'hltGtStage2Digis','MuonShower' ),
    L1EGammaInputTag = cms.InputTag( 'hltGtStage2Digis','EGamma' ),
    L1JetInputTag = cms.InputTag( 'hltGtStage2Digis','Jet' ),
    L1TauInputTag = cms.InputTag( 'hltGtStage2Digis','Tau' ),
    L1EtSumInputTag = cms.InputTag( 'hltGtStage2Digis','EtSum' ),
    L1EtSumZdcInputTag = cms.InputTag( 'hltGtStage2Digis','EtSumZDC' )
)

process.hltL1fL1sCDCL1Filtered0 = cms.EDFilter( "HLTMuonL1TFilter",
    saveTags = cms.bool( True ),
    CandTag = cms.InputTag( 'hltGtStage2Digis','Muon' ),
    PreviousCandTag = cms.InputTag( "hltL1sCDC" ),
    MaxEta = cms.double( 2.5 ),
    MinPt = cms.double( 0.0 ),
    MaxDeltaR = cms.double( 0.3 ),
    MinN = cms.int32( 1 ),
    CentralBxOnly = cms.bool( False ),
    SelectQualities = cms.vint32(  )
)

@cms-sw/l1-l2 FYI

@VinInn
Contributor

VinInn commented May 17, 2024

Running with UBSAN found this:

src/HLTrigger/Muon/plugins/HLTMuonL1TFilter.cc:139:89: runtime error: member call on address 0x7f6d1879fb60 which does not point to an object of type 'LeafCandidate'
0x7f6d1879fb60: note: object has a possibly invalid vptr: abs(offset to top) too big
 6d 7f 00 00  40 58 ae 1b 6d 7f 00 00  c1 33 3c 40 03 00 00 00  24 01 00 00 00 00 00 00  00 00 00 00
              ^~~~~~~~~~~~~~~~~~~~~~~
              possibly invalid vptr
    #0 0x7f6db97b7c7e in HLTMuonL1TFilter::hltFilter(edm::Event&, edm::EventSetup const&, trigger::TriggerFilterObjectWithRefs&) const src/HLTrigger/Muon/plugins/HLTMuonL1TFilter.cc:139
    #1 0x7f6dd386d849 in HLTFilter::filter(edm::StreamID, edm::Event&, edm::EventSetup const&) const src/HLTrigger/HLTcore/src/HLTFilter.cc:34
    #2 0x7f6ec15a8ee6 in edm::global::EDFilterBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) src/FWCore/Framework/src/global/EDFilterBase.cc:67
    #3 0x7f6ec15768f8 in edm::WorkerT<edm::global::EDFilterBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) src/FWCore/Framework/src/WorkerT.cc:202
    #4 0x7f6ec0d4e03f in edm::workerhelper::CallImpl<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::call(edm::Worker*, edm::StreamID, edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*, edm::StreamContext const*) src/FWCore/Framework/interface/maker/Worker.h:700
    #5 0x7f6ec0d4e03f in edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}::operator()() const src/FWCore/Framework/interface/maker/Worker.h:1259
    #6 0x7f6ec0d4e03f in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) src/FWCore/Utilities/interface/ConvertException.h:21
    #7 0x7f6ec0d4ea44 in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) src/FWCore/Framework/interface/maker/Worker.h:1258
    #8 0x7f6ec0d4ea44 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) src/FWCore/Framework/interface/maker/Worker.h:1172
    #9 0x7f6ec0d672bf in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() src/FWCore/Framework/interface/maker/Worker.h:499
    #10 0x7f6ec04eb14c in edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}::operator()() const src/FWCore/Concurrency/interface/WaitingTaskHolder.h:107
    #11 0x7f6ec04eb14c in task_ptr_or_nullptr_impl<const edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::<lambda()>&> /data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/include/oneapi/tbb/task_group.h:115
    #12 0x7f6ec04eb14c in task_ptr_or_nullptr<const edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::<lambda()>&> /data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/include/oneapi/tbb/task_group.h:125
    #13 0x7f6ec04eb14c in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) /data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/include/oneapi/tbb/task_group.h:452
    #14 0x7f6eb9bda95a in tbb::detail::d1::task* tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::d1::task*, tbb::detail::r1::outermost_worker_waiter&) /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
    #15 0x7f6eb9bda95a in tbb::detail::d1::task* tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter>(tbb::detail::d1::task*, tbb::detail::r1::outermost_worker_waiter&) /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
    #16 0x7f6eb9bda95a in tbb::detail::r1::arena::process(tbb::detail::r1::thread_data&) /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/tbb-v2021.9.0/src/tbb/arena.cpp:137
    #17 0x7f6eb9bda95a in tbb::detail::r1::market::process(rml::job&) /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/tbb-v2021.9.0/src/tbb/market.cpp:599
    #18 0x7f6eb9bdcb0d in tbb::detail::r1::rml::private_worker::run() /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/tbb-v2021.9.0/src/tbb/private_server.cpp:271
    #19 0x7f6eb9bdcb0d in tbb::detail::r1::rml::private_worker::thread_routine(void*) /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/tbb-v2021.9.0/src/tbb/private_server.cpp:221
    #20 0x7f6eb86de1c9 in start_thread (/lib64/libpthread.so.0+0x81c9)
    #21 0x7f6eb8349e72 in clone (/lib64/libc.so.6+0x39e72)

@aloeliger
Contributor

The only recent change I know of on the L1 side for muons is the OMTF->GMT unconstrained pT update. I think that involved an unpacker update, however. I assume that, for HLT, hltGtStage2Digis is unpacked?

@VinInn
Contributor

VinInn commented May 17, 2024

And this is ASAN, which aborts after finding the error:

=================================================================
==1067253==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6180043d7810 at pc 0x7effcafbad57 bp 0x7effda01d5b0 sp 0x7effda01d5a8
READ of size 8 at 0x6180043d7810 thread T12
    #0 0x7effcafbad56 in HLTMuonL1TFilter::hltFilter(edm::Event&, edm::EventSetup const&, trigger::TriggerFilterObjectWithRefs&) const (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/pluginHLTriggerMuonAuto.so+0x220d56)
    #1 0x7effd3c15568 in HLTFilter::filter(edm::StreamID, edm::Event&, edm::EventSetup const&) const (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/libHLTriggerHLTcore.so+0xf6568)
    #2 0x7f004f17bcf7 in edm::global::EDFilterBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so+0x917cf7)
    #3 0x7f004f166d58 in edm::WorkerT<edm::global::EDFilterBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so+0x902d58)
    #4 0x7f004ee34457 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so+0x5d0457)
    #5 0x7f004ee34b2f in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so+0x5d0b2f)
    #6 0x7f004ee3fddd in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so+0x5dbddd)
    #7 0x7f004ea95bf1 in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so+0x231bf1)
    #8 0x7f004c91195a in tbb::detail::d1::task* tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::d1::task*, tbb::detail::r1::outermost_worker_waiter&) /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
    #9 0x7f004c91195a in tbb::detail::d1::task* tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter>(tbb::detail::d1::task*, tbb::detail::r1::outermost_worker_waiter&) /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
    #10 0x7f004c91195a in tbb::detail::r1::arena::process(tbb::detail::r1::thread_data&) /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/tbb-v2021.9.0/src/tbb/arena.cpp:137
    #11 0x7f004c91195a in tbb::detail::r1::market::process(rml::job&) /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/tbb-v2021.9.0/src/tbb/market.cpp:599
    #12 0x7f004c913b0d in tbb::detail::r1::rml::private_worker::run() /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/tbb-v2021.9.0/src/tbb/private_server.cpp:271
    #13 0x7f004c913b0d in tbb::detail::r1::rml::private_worker::thread_routine(void*) /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/tbb-v2021.9.0/src/tbb/private_server.cpp:221
    #14 0x7f004ba4a1c9 in start_thread (/lib64/libpthread.so.0+0x81c9)
    #15 0x7f004b6b5e72 in clone (/lib64/libc.so.6+0x39e72)

0x6180043d7810 is located 48 bytes to the right of 864-byte region [0x6180043d7480,0x6180043d77e0)
allocated by thread T12 here:
    #0 0x7f004f4496d8 in operator new(unsigned long) ../../../../libsanitizer/asan/asan_new_delete.cpp:95
    #1 0x7f0036cd2fcf in void std::vector<l1t::Muon, std::allocator<l1t::Muon> >::_M_realloc_insert<l1t::Muon const&>(__gnu_cxx::__normal_iterator<l1t::Muon*, std::vector<l1t::Muon, std::allocator<l1t::Muon> > >, l1t::Muon const&) (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/libDataFormatsL1Trigger.so+0x11ffcf)
    #2 0x7effe4d8c5bc in std::vector<l1t::Muon, std::allocator<l1t::Muon> >::insert(__gnu_cxx::__normal_iterator<l1t::Muon const*, std::vector<l1t::Muon, std::allocator<l1t::Muon> > >, l1t::Muon const&) (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/pluginL1TriggerL1TGlobalPlugins.so+0x1245bc)
    #3 0x7effe4d8d357 in BXVector<l1t::Muon>::push_back(int, l1t::Muon) (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/pluginL1TriggerL1TGlobalPlugins.so+0x125357)
    #4 0x7effd2192d47 in l1t::stage2::MuonUnpacker::unpackBx(int, std::vector<unsigned int, std::allocator<unsigned int> > const&, unsigned int, unsigned int) (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/pluginEventFilterL1TRawToDigiAuto.so+0x502d47)
    #5 0x7effd2197b1a in l1t::stage2::MuonUnpacker::unpack(l1t::Block const&, l1t::UnpackerCollections*) (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/pluginEventFilterL1TRawToDigiAuto.so+0x507b1a)
    #6 0x7effd1daf763 in l1t::L1TRawToDigi::produce(edm::Event&, edm::EventSetup const&) (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/pluginEventFilterL1TRawToDigiAuto.so+0x11f763)
    #7 0x7f004f1e5602 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so+0x981602)
    #8 0x7f004f166338 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so+0x902338)
    #9 0x7f004ee34457 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so+0x5d0457)
    #10 0x7f004ee34b2f in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so+0x5d0b2f)
    #11 0x7f004ee3fddd in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so+0x5dbddd)
    #12 0x7f004ea95bf1 in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02837/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_ASAN_X_2024-05-15-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so+0x231bf1)
    #13 0x7f004c91195a in tbb::detail::d1::task* tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::d1::task*, tbb::detail::r1::outermost_worker_waiter&) /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
    #14 0x7f004c91195a in tbb::detail::d1::task* tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter>(tbb::detail::d1::task*, tbb::detail::r1::outermost_worker_waiter&) /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
    #15 0x7f004c91195a in tbb::detail::r1::arena::process(tbb::detail::r1::thread_data&) /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/tbb-v2021.9.0/src/tbb/arena.cpp:137
    #16 0x7f004c91195a in tbb::detail::r1::market::process(rml::job&) /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/tbb-v2021.9.0/src/tbb/market.cpp:599
    #17 0x7f004c913b0d in tbb::detail::r1::rml::private_worker::run() /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/tbb-v2021.9.0/src/tbb/private_server.cpp:271
    #18 0x7f004c913b0d in tbb::detail::r1::rml::private_worker::thread_routine(void*) /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-a7089dd5ec356e9a0bc222e109b15cef/tbb-v2021.9.0/src/tbb/private_server.cpp:221
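
The offsets quoted at the end of the ASAN report can be checked with a bit of hexadecimal arithmetic. The 8-byte read lands 48 bytes past the end of the heap block that `std::vector<l1t::Muon>` allocated while `MuonUnpacker` filled the muon `BXVector`, which is consistent with an index running past the unpacked muons. A minimal C++ check, using only the three addresses copied from the report (the constant and function names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Addresses copied verbatim from the ASAN report.
constexpr std::uint64_t kRegionBegin = 0x6180043d7480;  // first byte of the heap block
constexpr std::uint64_t kRegionEnd = 0x6180043d77e0;    // one past the last byte of the block
constexpr std::uint64_t kBadRead = 0x6180043d7810;      // address of the faulty 8-byte READ

// Size of the block owned by std::vector<l1t::Muon> (its whole capacity, not just size()).
constexpr std::uint64_t regionSize() { return kRegionEnd - kRegionBegin; }

// How far beyond the end of that block the read happened.
constexpr std::uint64_t bytesPastEnd() { return kBadRead - kRegionEnd; }

static_assert(regionSize() == 864, "matches '864-byte region'");
static_assert(bytesPastEnd() == 48, "matches '48 bytes to the right'");
```

Since the read lies outside the whole allocation, not just beyond `size()`, this is a genuine out-of-bounds access rather than a read of spare vector capacity.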

@mmusich
Contributor Author

mmusich commented May 25, 2024

just run in a ASAN release on any raw and it will almost immediately crash...

From the stack trace, I see that you used CMSSW_14_1_ASAN_X_2024-05-15-2300 for testing. Which menu and which input data did you use?
Please post a recipe, for the record, so that one does not have to start from scratch. Thank you.

@missirol
Contributor

missirol commented May 25, 2024

https://its.cern.ch/jira/browse/CMSLITDPG-1221?focusedId=6247237&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-6247237

It seems there are known mismatches between the L1T firmware and the emulator for CDC seeds (I don't know whether this is related to these crashes, but it might be). I don't know anything about these mismatches (when they started, what is causing them, what the plan to fix them is). @elfontan, could you please clarify?

@missirol
Contributor

I'm using this recipe:

#!/bin/bash

hltGetConfiguration run:381147 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --no-prescale \
  --no-output \
  --max-events 1 \
  --paths HLT_CDC_L2cosmic_10_er1p0_v* \
  --input root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381147/run381147_ls0202_index000187_fu-c2b05-29-01_pid2159904.root \
  > hlt.py

cat <<@EOF >> hlt.py
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0

process.source.skipEvents = cms.untracked.uint32( 56 )

del process.MessageLogger
process.load("FWCore.MessageLogger.MessageLogger_cfi")
@EOF

cmsRun hlt.py &> hlt.log

@missirol
Contributor

missirol commented May 25, 2024

This dodges the issue. I am not sure the warning message is accurate, nor whether this check should be implemented regardless of the root cause of the problem (if so, the same check would probably have to be added for the other L1T objects in the same EDFilter).

diff --git a/HLTrigger/HLTfilters/plugins/HLTL1TSeed.cc b/HLTrigger/HLTfilters/plugins/HLTL1TSeed.cc
index 699a170d60d..6fae44e83bb 100644
--- a/HLTrigger/HLTfilters/plugins/HLTL1TSeed.cc
+++ b/HLTrigger/HLTfilters/plugins/HLTL1TSeed.cc
@@ -950,6 +950,13 @@ bool HLTL1TSeed::seedsL1TriggerObjectMaps(edm::Event& iEvent, trigger::TriggerFi
                                     << "\nNo muons added to filterproduct." << endl;
     } else {
       for (std::list<int>::const_iterator itObj = listMuon.begin(); itObj != listMuon.end(); ++itObj) {
+        if (*itObj < 0 or unsigned(*itObj) >= muons->size(0)) {
+          edm::LogWarning("HLTL1TSeed")
+              << "Invalid index from the L1ObjectMap (L1uGT emulator), will be ignored (l1t::MuonBxCollection):"
+              << " index=" << *itObj << " (size of unpacked L1T objects in BX0 = " << muons->size(0) << ")";
+          continue;
+        }
+
         // Transform to index for Bx = 0 to begin of BxVector
         unsigned int index = muons->begin(0) - muons->begin() + *itObj;
 
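
The effect of the check added by this patch can be illustrated outside CMSSW with a toy model. The sketch below assumes a hypothetical `ToyBxVector` that only mimics the two parts of `BXVector` used in the diff, `size(bx)` and `begin(bx) - begin()`, under the assumption that all BXs share one flat, BX-ordered buffer; it is not the real class. An object-map index is accepted only if it addresses an existing BX=0 object, and anything else is skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>

// Hypothetical stand-in for BXVector<T>: it only records how many objects each BX
// holds, assuming all BXs are stored contiguously in increasing BX order.
struct ToyBxVector {
  std::map<int, std::size_t> countPerBx;  // BX -> number of objects; std::map keeps BXs sorted

  // Number of objects in 'bx' (0 if absent), analogous to BXVector::size(bx).
  std::size_t size(int bx) const {
    const auto it = countPerBx.find(bx);
    return it == countPerBx.end() ? 0 : it->second;
  }

  // Flat position of the first object of 'bx', analogous to begin(bx) - begin().
  std::size_t offset(int bx) const {
    std::size_t n = 0;
    for (const auto& [b, count] : countPerBx) {
      if (b >= bx) break;
      n += count;
    }
    return n;
  }
};

// Same acceptance test as the patch: the object-map index must address an existing
// BX=0 object. Returns the flat position, or std::nullopt when the index is skipped
// (the point where the patch emits its LogWarning and continues).
std::optional<std::size_t> flatIndexForBx0(const ToyBxVector& v, int objMapIndex) {
  if (objMapIndex < 0 || static_cast<std::size_t>(objMapIndex) >= v.size(0)) {
    return std::nullopt;
  }
  return v.offset(0) + static_cast<std::size_t>(objMapIndex);
}
```

With the per-BX counts of the run 381147 event discussed in this thread (one muon in BX=-2, two in BX=-1, one in BX=0), index 0 maps to flat position 3, while index 1 is rejected instead of being read past the end of the buffer.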

@missirol
Contributor

missirol commented May 26, 2024

Here's my rough understanding of the underlying issue. Some of this might be inaccurate; an L1T expert should comment.

  • L1T algorithm: it selects events based on two muons in different BXs (see here; look for L1_CDC_SingleMu_3_er1p2_TOP120_DPHI2p618_3p142 and <bx_offset>-1).

  • HLT runs the L1T emulator to produce the "object map" (hltGtStage2ObjectMap in the current HLT menu): for every L1T algorithm, the "object map" gives the list of indices of the L1T objects which participated in the L1T decision of that algo. hltGtStage2ObjectMap returns the L1T decisions for BX=0, but it uses the unpacked L1T objects of all 5 BXs (see hltGtStage2ObjectMap.L1DataBxInEvent). As in the case of L1_CDC_SingleMu_3_er1p2_TOP120_DPHI2p618_3p142, the L1T decision for BX=0 can depend on L1T objects not in BX=0.

  • HLT uses HLTL1TSeed (an EDFilter provided by L1T-sw, New HLT L1T seed filter for 2016 run (80x) #13166) to (a) select on the L1T decisions, and (b) add to the event the L1T objects which participated in that L1T decision (to do this, HLTL1TSeed needs the "object map" as input).

    • The "object map" is ultimately a collection of indices, and it does not provide BX information (example: the "object map" for a given algo can return "muon index 0, and muon index 1", but it does not say to which BXs the two belong).
    • HLTL1TSeed implicitly assumes that the indices returned by the "object map" refer to objects in BX=0. I say this based on how the objects are added to the event, e.g. here.
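
To make the loss of BX information concrete, here is a toy C++ model of an "object map" entry for a cross-BX condition. Everything in it is an assumption for illustration: `ToyMuon`, `passesToySingleMuCuts` and `toyObjectMap` are hypothetical, and the cuts (pT >= 3 and |eta| < 1.2 on both muons, 2.618 <= |dphi| <= 3.142 between a BX=-1 muon and a BX=0 muon) are only read off the algo name L1_CDC_SingleMu_3_er1p2_TOP120_DPHI2p618_3p142, ignoring the TOP120 part. They are not the actual uGT algorithm definition. The only point is that the returned indices are local to each BX and carry no BX label.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical, simplified L1T muon: only the quantities used by the toy condition.
struct ToyMuon {
  double pt;
  double eta;
  double phi;
};

constexpr double kPi = 3.14159265358979323846;

// Assumed single-muon cuts, read off the algo name (not the real L1T menu).
bool passesToySingleMuCuts(const ToyMuon& m) { return m.pt >= 3.0 && std::fabs(m.eta) < 1.2; }

// |delta phi| folded into [0, pi].
double absDeltaPhi(double phi1, double phi2) {
  const double d = std::fabs(phi1 - phi2);
  return d > kPi ? 2.0 * kPi - d : d;
}

// Toy "object map" entry: indices of the first (BX=-1, BX=0) muon pair passing the
// toy condition. Each index is local to its own BX, and the BX itself is NOT stored.
std::vector<int> toyObjectMap(const std::vector<ToyMuon>& bxMinus1, const std::vector<ToyMuon>& bx0) {
  for (std::size_t i = 0; i < bxMinus1.size(); ++i) {
    if (!passesToySingleMuCuts(bxMinus1[i])) continue;
    for (std::size_t j = 0; j < bx0.size(); ++j) {
      if (!passesToySingleMuCuts(bx0[j])) continue;
      const double dphi = absDeltaPhi(bxMinus1[i].phi, bx0[j].phi);
      if (dphi >= 2.618 && dphi <= 3.142) {
        return {static_cast<int>(i), static_cast<int>(j)};
      }
    }
  }
  return {};  // condition not satisfied: no objects recorded
}
```

Fed with the BX=-1 and BX=0 muons of the run 381147 event shown below, this toy returns {1, 0}, the same pair of indices quoted there, even though the two indices refer to different BXs.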

The example from #44940 (comment) shows (patch)

Begin processing the 1st record. Run 381147, Event 351398133, LumiSection 202 on stream 0 at 25-May-2024 23:02:48.997 CEST

 bx=-2  pt=1 eta=2.28375 phi=-1.48367 hwPt=3 hwEtaAtVtx=204 hwPhiAtVtx=380 hwQual=7
 bx=-1  pt=6.5 eta=2.22937 phi=2.49793 hwPt=14 hwEtaAtVtx=204 hwPhiAtVtx=217 hwQual=12
 bx=-1  pt=4 eta=0.36975 phi=2.87971 hwPt=9 hwEtaAtVtx=32 hwPhiAtVtx=204 hwQual=12
 bx=0  pt=4 eta=0.815625 phi=-0.632841 hwPt=9 hwEtaAtVtx=74 hwPhiAtVtx=458 hwQual=12

The "object map" returns indices 1 and 0 (the first one refers to the 2nd muon in BX=-1, the second one refers to the only muon in BX=0). Then, HLTL1TSeed interprets index 1 as a 2nd muon in BX=0 (which does not exist), and that leads to the problem.
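
The index arithmetic behind this can be reproduced with the numbers above. The sketch assumes, as `BXVector` does, that the muons of all BXs sit in one flat, BX-ordered buffer, and that there are no muons in BX=+1 and BX=+2 (none appear in the printout); `kBxOfFlatMuon` and the helper functions are hypothetical names. `assumedBx0FlatIndex` applies the same conversion as the line `index = muons->begin(0) - muons->begin() + *itObj` in HLTL1TSeed, quoted in the diff earlier in this thread, so object-map index 1 becomes flat position 4, one past the last of the four unpacked muons.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// BX of each unpacked muon, in flat storage order, for the event printed above.
const std::vector<int> kBxOfFlatMuon = {-2, -1, -1, 0};

// Flat position of the first muon of 'bx' (analogous to begin(bx) - begin()).
std::size_t firstFlatIndexOfBx(int bx) {
  std::size_t i = 0;
  while (i < kBxOfFlatMuon.size() && kBxOfFlatMuon[i] < bx) ++i;
  return i;
}

// What HLTL1TSeed effectively does: every object-map index is taken relative to BX=0.
std::size_t assumedBx0FlatIndex(int objMapIndex) {
  return firstFlatIndexOfBx(0) + static_cast<std::size_t>(objMapIndex);
}

// What the index would resolve to if its BX were known (the object map does not store it).
std::size_t trueFlatIndex(int bx, int objMapIndex) {
  return firstFlatIndexOfBx(bx) + static_cast<std::size_t>(objMapIndex);
}
```

Had the BX been stored alongside the index, `trueFlatIndex(-1, 1)` would have pointed at flat position 2, the second BX=-1 muon, instead of past the end of the buffer.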

Based on the above, I don't see how HLTL1TSeed can add to the event the two muons that fired L1_CDC_SingleMu_3_er1p2_TOP120_DPHI2p618_3p142 (so far, the second muon was probably either another muon that happened to be in BX=0, or unphysical values read from an invalid memory location (?)). The problem is that the Paths HLT_CDC_L2cosmic_10_er1p0_v and HLT_CDC_L2cosmic_5p5_er1p0_v appear to use hltL1fL1sCDCL1Filtered0 to seed the HLT muon reconstruction, which seems conceptually wrong, given that the correct L1T objects cannot be retrieved in this case with the current L1T software ("object map" + HLTL1TSeed).

@VinInn
Contributor

VinInn commented May 26, 2024

I think the patch proposed by @missirol should be implemented ASAP, at least to monitor the frequency of this misbehavior.
I leave it to the Trigger coordinators to evaluate how urgently the L1T muon crew should be pressed to solve the issue upstream.

@VinInn
Contributor

VinInn commented May 26, 2024

BTW: should a new more specific issue be opened against L1TSeed or the L1TMuon unpacker? (or the cosmic HLT?)

@mmusich
Contributor Author

mmusich commented May 26, 2024

BTW: should a new more specific issue be opened against L1TSeed or the L1TMuon unpacker? (or the cosmic HLT?)

this is tracked at https://its.cern.ch/jira/browse/CMSHLT-3216

@missirol
Contributor

missirol commented May 26, 2024

The patch in #44940 (comment) is implemented in #45047 (14_1_X) and #45048 (14_0_X).

In the near future, maybe a better patch would ensure that HLTL1TSeed adds no objects to the Event (and emits a warning) if the L1T algo in question is using objects from different BXs (which is something that HLTL1TSeed cannot really handle).

@fwyzard
Contributor

fwyzard commented May 26, 2024

In the near future, maybe a better patch would ensure that HLTL1TSeed adds no objects to the Event (and emits a warning) if the L1T algo in question is using objects from different BXs (which is something that HLTL1TSeed cannot really handle).

Can HLTL1TSeed add the in-time objects and skip the out-of-time ones?

For the out-of-time muons, would it be useful to be able to tag them? Or anyway the HLT reconstruction would not be able to use them?

@missirol
Contributor

Can HLTL1TSeed add the in-time objects and skip the out-of-time ones?

As far as I understand, not in the current implementation, because HLTL1TSeed just uses what the "object map" provides, and it seems the "object map" contains indices, but no info on the BX of the objects behind those indices. It seems to me that deeper changes would be needed to correctly identify the in-time ones (e.g. an improvement of the "object map" format). Alternatively, I was wondering if it would make sense to restrict the objects used by the "object map" to BX=0 (with hltGtStage2ObjectMap.L1DataBxInEvent = 1, iiuc): in that case the indices in the "object map" would just be the in-time ones (but unpacked and emulated L1T decisions would disagree for any L1T algo using objects from BXs different from 0).

For the out-of-time muons, would it be useful to be able to tag them? Or anyway the HLT reconstruction would not be able to use them?

This, I don't really know (I would guess the HLT reconstruction would not be able to use them, but I might be wrong).

@Martin-Grunewald
Contributor

As far as I understand, not in the current implementation, because HLTL1TSeed just uses what the "object map" provides, and it seems the "object map" contains indices, but no info on the BX of the objects behind those indices. It seems to me that deeper changes would be needed to correctly identify the in-time ones (e.g. an improvement of the "object map" format). Alternatively, I was wondering if it would make sense to restrict the objects used by the "object map" to BX=0 (with hltGtStage2ObjectMap.L1DataBxInEvent = 1, iiuc): in that case the indices in the "object map" would just be the in-time ones (but unpacked and emulated L1T decisions would disagree for any L1T algo using objects from BXs different from 0).

I agree: this would be a consistency fix and would cover most use cases. Triggers looking at BX≠0 should be rare special cases, which need specific treatment anyway.

@mmusich
Contributor Author

mmusich commented May 28, 2024

Alternatively, I was wondering if it would make sense to restrict the objects used by the "object map" to BX=0 (with hltGtStage2ObjectMap.L1DataBxInEvent = 1, iiuc): in that case the indices in the "object map" would just be the in-time ones (but unpacked and emulated L1T decisions would disagree for any L1T algo using objects from BXs different from 0).

this is tracked at https://its.cern.ch/jira/browse/CMSHLT-3218

@fwyzard
Contributor

fwyzard commented May 28, 2024

Can we stick to GitHub for issues, instead of splitting them between GitHub and JIRA (which unlike GH has a horrible user interface)?

@mmusich
Contributor Author

mmusich commented May 28, 2024

Can we stick to GitHub for issues, instead of splitting them between GitHub and JIRA (which unlike GH has a horrible user interface)?

My understanding is that we use GitHub for discussing software issues and JIRA for HLT configuration changes. So, in short: no.

@fwyzard
Contributor

fwyzard commented May 28, 2024

OK, then. Feel free to enjoy the crappy user interface and the lack of feedback.

@mmusich
Contributor Author

mmusich commented May 28, 2024

Feel free to enjoy the crappy user interface and the lack of feedback.

To be honest, I am not enjoying it at all, but moving everything to GitHub instead of JIRA (at least for the HLT-related items that directly rely on cmssw, e.g. menus, tests, etc.; broadly speaking, the "HLT configurations" and "STORM tasks" components) is a decision that should be taken at the coordination level (which is above my pay grade). It could be discussed elsewhere.

@mmusich
Contributor Author

mmusich commented Sep 12, 2024

Solutions proposed (technically avoiding the crash online):

A CMSHLT JIRA ticket to discuss the next steps is open at https://its.cern.ch/jira/browse/CMSHLT-3216.

@mmusich
Contributor Author

mmusich commented Sep 12, 2024

+hlt

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@makortel
Contributor

@cmsbuild, please close


10 participants