HLT crash in run 383631 - 'VertexException' and HcalDigisProducerGPU:hltHcalDigisGPU error #45555
cms-bot internal usage |
A new Issue was created by @jalimena. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
For the record, also using repacked data with the following script:
#!/bin/bash -ex
# CMSSW_14_0_11_MULTIARCHS
hltGetConfiguration run:383631 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input '/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0444_index000423_fu-c2b14-43-01_pid675389.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0445_index000025_fu-c2b14-43-01_pid675389.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0666_index000047_fu-c2b14-43-01_pid675067.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0445_index000002_fu-c2b14-43-01_pid675389.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0666_index000027_fu-c2b14-43-01_pid675067.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0666_index000065_fu-c2b14-43-01_pid675067.root' > hlt.py
cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
cmsRun hlt.py &> hlt.log
This does not reproduce the crash. |
@jalimena where can we find the full stack trace for all threads? |
Sorry, I have uploaded the complete stack trace from the DAQ shifter in the issue description. Let me know if it looks incomplete. |
As the "regular" reproducing script (see #45555 (comment)) doesn't lead to a crash I have explored the option of running filling with junk the memory on the host and device allocators (cf #44923 (comment)).
when doing the same in
Not sure if that's expected. [1]
#!/bin/bash -ex
# CMSSW_14_0_11_MULTIARCHS
hltGetConfiguration run:383631 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input '/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0444_index000423_fu-c2b14-43-01_pid675389.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0445_index000025_fu-c2b14-43-01_pid675389.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0666_index000047_fu-c2b14-43-01_pid675067.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0445_index000002_fu-c2b14-43-01_pid675389.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0666_index000027_fu-c2b14-43-01_pid675067.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0666_index000065_fu-c2b14-43-01_pid675067.root' > hlt.py
cat <<@EOF >> hlt.py
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')
process.MessageLogger.CUDAService = {}
process.MessageLogger.AlpakaService = {}
process.load('HeterogeneousCore.CUDAServices.CUDAService_cfi')
from HeterogeneousCore.AlpakaServices.AlpakaServiceCudaAsync_cfi import AlpakaServiceCudaAsync as _AlpakaServiceCudaAsync
process.AlpakaServiceCudaAsync = _AlpakaServiceCudaAsync.clone(
verbose = True,
hostAllocator = dict(
binGrowth = 2,
minBin = 8, # 256 bytes
maxBin = 30, # 1 GB
maxCachedBytes = 64*1024*1024*1024, # 64 GB
maxCachedFraction = 0.8, # or 80%, whatever is less
fillAllocations = True,
fillAllocationValue = 0xA5,
fillReallocations = True,
fillReallocationValue = 0x69,
fillDeallocations = True,
fillDeallocationValue = 0x5A,
fillCaches = True,
fillCacheValue = 0x96
),
deviceAllocator = dict(
binGrowth = 2,
minBin = 8, # 256 bytes
maxBin = 30, # 1 GB
maxCachedBytes = 8*1024*1024*1024, # 8 GB
maxCachedFraction = 0.8, # or 80%, whatever is less
fillAllocations = True,
fillAllocationValue = 0xA5,
fillReallocations = True,
fillReallocationValue = 0x69,
fillDeallocations = True,
fillDeallocationValue = 0x5A,
fillCaches = True,
fillCacheValue = 0x96
)
)
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
cmsRun hlt.py &> hlt.log |
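For context on what the fill* options above are for, here is a toy standalone sketch of allocation poisoning (it is not the CMSSW/Alpaka caching allocator itself): writing known byte patterns such as 0xA5 on allocation and 0x5A on deallocation makes reads of uninitialized or stale memory show up as recognizable values instead of silent garbage.
// Toy illustration of allocation poisoning; not the actual CMSSW/Alpaka caching allocator.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

constexpr std::uint8_t kAllocFill = 0xA5;  // pattern written on allocation
constexpr std::uint8_t kFreeFill = 0x5A;   // pattern written on deallocation

std::vector<std::uint8_t> allocate(std::size_t n) {
  std::vector<std::uint8_t> buf(n);
  std::memset(buf.data(), kAllocFill, n);  // anything read before being written shows up as 0xA5
  return buf;
}

void release(std::vector<std::uint8_t>& buf) {
  std::memset(buf.data(), kFreeFill, buf.size());  // stale reads after "free" show up as 0x5A
}

int main() {
  auto buf = allocate(8);
  std::cout << std::hex << +buf[0] << '\n';  // prints "a5": an uninitialized read would be recognizable
  release(buf);
  std::cout << std::hex << +buf[0] << '\n';  // prints "5a": a use-after-free-style read would be recognizable
}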
Thanks. There's a possibly relevant error from the PrimaryVertexProducer, and I guess a question for physics is whether having the BasicSingleVertexState throw on invalid errors is the desired behavior:
The detailed stack trace shows that its interpretation is complicated by the async GPU calls--the |
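For illustration, a minimal standalone sketch of why a beamspot with a null or singular covariance ends in a "could not invert error matrix" exception; this is a plain 3x3 determinant check with a std::runtime_error stand-in, not the actual BasicSingleVertexState code.
#include <array>
#include <iostream>
#include <stdexcept>

using Sym33 = std::array<std::array<double, 3>, 3>;

double det3x3(const Sym33& m) {
  return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1]) -
         m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0]) +
         m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

void requireInvertible(const Sym33& cov) {
  // stand-in for the VertexException thrown when the error matrix cannot be inverted
  if (det3x3(cov) == 0.0)
    throw std::runtime_error("could not invert error matrix");
}

int main() {
  Sym33 nullCov{};  // all-zero beamspot covariance, as delivered by a "fake" beamspot
  try {
    requireInvertible(nullCov);
  } catch (const std::exception& ex) {
    std::cout << ex.what() << '\n';  // the beam-constrained vertex fit cannot proceed
  }
}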
I didn't notice at first glance:
So the plot thickens, and things look to point in the direction of the already observed #41914, in which we suspected beamspot issues. |
assign hlt, reconstruction, alca |
type tracking |
@cms-sw/tracking-pog-l2 FYI |
New categories assigned: hlt,reconstruction,alca @Martin-Grunewald,@mmusich,@jfernan2,@mandrenguyen,@saumyaphor4252,@perrotta,@consuegs you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Spoon-feeding a fake BeamSpot with a null covariance matrix to the HLT with:
diff --git a/RecoVertex/BeamSpotProducer/plugins/BeamSpotOnlineProducer.cc b/RecoVertex/BeamSpotProducer/plugins/BeamSpotOnlineProducer.cc
index 83aa832cfa5..56f0c8948b6 100644
--- a/RecoVertex/BeamSpotProducer/plugins/BeamSpotOnlineProducer.cc
+++ b/RecoVertex/BeamSpotProducer/plugins/BeamSpotOnlineProducer.cc
@@ -108,7 +108,7 @@ void BeamSpotOnlineProducer::produce(Event& iEvent, const EventSetup& iSetup) {
edm::LogWarning("BeamSpotFromDB")
<< "Online Beam Spot producer falls back to DB value because the ESProducer returned a fake beamspot ";
}
- fallBackToDB = true;
+ //fallBackToDB = true;
} else {
// translate from BeamSpotObjects to reco::BeamSpot
// in case we need to switch to LHC reference frame
diff --git a/RecoVertex/BeamSpotProducer/plugins/OnlineBeamSpotESProducer.cc b/RecoVertex/BeamSpotProducer/plugins/OnlineBeamSpotESProducer.cc
index 0b8c4233c7c..2c42e2b6d63 100644
--- a/RecoVertex/BeamSpotProducer/plugins/OnlineBeamSpotESProducer.cc
+++ b/RecoVertex/BeamSpotProducer/plugins/OnlineBeamSpotESProducer.cc
@@ -55,13 +55,21 @@ OnlineBeamSpotESProducer::OnlineBeamSpotESProducer(const edm::ParameterSet& p)
fakeBS_.setPosition(0.0001, 0.0001, 0.0001);
fakeBS_.setType(-1);
// Set diagonal covariance, i.e. errors on the parameters
- fakeBS_.setCovariance(0, 0, 5e-10);
- fakeBS_.setCovariance(1, 1, 5e-10);
- fakeBS_.setCovariance(2, 2, 0.002);
- fakeBS_.setCovariance(3, 3, 0.002);
- fakeBS_.setCovariance(4, 4, 5e-11);
- fakeBS_.setCovariance(5, 5, 5e-11);
- fakeBS_.setCovariance(6, 6, 1e-09);
+ fakeBS_.setCovariance(0, 0, 0.);
+ fakeBS_.setCovariance(1, 1, 0.);
+ fakeBS_.setCovariance(2, 2, 0.);
+ fakeBS_.setCovariance(3, 3, 0.);
+ fakeBS_.setCovariance(4, 4, 0.);
+ fakeBS_.setCovariance(5, 5, 0.);
+ fakeBS_.setCovariance(6, 6, 0.);
bsHLTToken_ = cc.consumesFrom<BeamSpotOnlineObjects, BeamSpotOnlineHLTObjectsRcd>();
bsLegacyToken_ = cc.consumesFrom<BeamSpotOnlineObjects, BeamSpotOnlineLegacyObjectsRcd>();
@@ -179,7 +187,7 @@ std::shared_ptr<const BeamSpotObjects> OnlineBeamSpotESProducer::produce(const B
return std::shared_ptr<const BeamSpotObjects>(best, edm::do_nothing_deleter());
}
edm::LogWarning("OnlineBeamSpotESProducer")
- << "None of the Online BeamSpots in the ES is suitable, \n returning a fake one(fallback to PCL).";
+ << "None of the Online BeamSpots in the ES is suitable, \n returning a fake one (fallback to PCL).";
return std::shared_ptr<const BeamSpotObjects>(&fakeBS_, edm::do_nothing_deleter());
} I sort of reproduce the crash:
the questions are:
|
there's no such payload in the DB for run 383631: |
does it look like there was a glitch in the DB access or did the zeroes show up from another part of the logic in the BS producer? |
Possible, but it's not clear from where it would read it then (I would have expected a framework exception).
By looking at the code it's not clear to me from where it could come. |
It could indeed make sense to demote the exception to |
After a more careful look I convinced myself this case has to stay as an exception. Generally we want to stop processing as soon as possible when this situation happens (e.g. in this case to prevent calling
Modules in different streams throwing exceptions should be fine, and two independent modules processed by one stream throwing exceptions should also be fine. From the (core) framework perspective the
I would add tests for the "Alpaka framework" that an Alpaka module throwing an exception in |
Having written that, I wonder what the other exception
which I'd bet is the cause for the std::terminate() here 🤦. I'll fix that.
|
@cms-sw/db-l2 @PonIlya is it possible to have an audit of the P5 frontier squid at the time of the crash ( 25-Jul-2024 08:49:19 )? |
|
@PonIlya, thank you for the reply.
to be perfectly honest I am not sure, but there are indications that somehow the connection to the DB "glitched" during this period of time. As a DB expert you are perhaps in a better position to judge if there was any anomaly during that time. |
@mmusich Unfortunately, the logs are no longer available for audit. Let me know if a similar issue arises again so that I can respond more quickly. The logs are deleted once they reach a certain size. For example, as of today, only logs from August 7th are available. |
Logging here for the record: in run 386614 we observed a burst of HLT crashes, all around 02:44-02:45, on multiple FUs.
(the other crashes on the same FU come from the same PID). In total, 8 processes crashed (logs are attached below [1]). Interestingly, in the logs there's this:
(which matches the content of the DQM plots), which comes from here, but there is no warning from here, which I would have expected to see as well (given the other kind of message from the vertex producer). @cms-sw/db-l2 @PonIlya FYI [1] [2]
#!/bin/bash -ex
#in CMSSW_14_0_15_patch1
hltGetConfiguration run:386614 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input /store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0055_index000003_fu-c2b02-33-01_pid4022333.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0055_index000005_fu-c2b04-26-01_pid18256.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0055_index000019_fu-c2b04-26-01_pid18256.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0055_index000027_fu-c2b02-33-01_pid4022333.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0055_index000033_fu-c2b04-26-01_pid18256.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0055_index000062_fu-c2b02-12-01_pid399919.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0055_index000078_fu-c2b02-12-01_pid399919.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0055_index000156_fu-c2b02-33-01_pid4022422.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0056_index000023_fu-c2b03-34-01_pid3391489.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0056_index000046_fu-c2b03-34-01_pid3391489.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0056_index000101_fu-c2b14-11-01_pid2066514.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0056_index000138_fu-c2b01-36-01_pid984863.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0056_index000152_fu-c2b01-36-01_pid984863.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0057_index000266_fu-c2b01-12-01_pid3143597.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0057_index000277_fu-c2b01-12-01_pid3143597.root > hlt_386614.py
cat <<@EOF >> hlt_386614.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
cmsRun hlt_386614.py &> hlt_386614.log |
As far as I could dig, the only place in which a default-constructed
cmssw/DataFormats/BeamSpot/src/BeamSpot.cc Lines 31 to 35 in d4aca16
is here:
On the other hand, it's not clear to me why, in the following instructions in the routine
cmssw/RecoVertex/PrimaryVertexProducer/plugins/PrimaryVertexProducer.cc Lines 160 to 166 in 1d0afd0
if the
At any rate, skipping the fetching of the
diff --git a/RecoVertex/PrimaryVertexProducer/interface/SequentialPrimaryVertexFitterAdapter.h b/RecoVertex/PrimaryVertexProducer/interface/SequentialPrimaryVertexFitterAdapter.h
index bd5f866e2f2..370786b4529 100644
--- a/RecoVertex/PrimaryVertexProducer/interface/SequentialPrimaryVertexFitterAdapter.h
+++ b/RecoVertex/PrimaryVertexProducer/interface/SequentialPrimaryVertexFitterAdapter.h
@@ -26,7 +26,7 @@ public:
for (auto& cluster : clusters) {
const std::vector<reco::TransientTrack>& tracklist = cluster.originalTracks();
TransientVertex v;
- if (useBeamConstraint && (tracklist.size() > 1)) {
+ if (useBeamConstraint && (tracklist.size() > 1) && beamspot.type() != reco::BeamSpot::Unknown) {
v = fitter->vertex(tracklist, beamspot);
} else if (!(useBeamConstraint) && (tracklist.size() > 1)) {
v = fitter->vertex(tracklist);
Having said that, while this seems relatively safe to do in any case, I am not sure if we really want to keep processing the event in such cases (also probably the processing would crash elsewhere). |
For bookkeeping, I'm attaching the log files from run 386703. There were 14 instances (42 total alerts in F3Mon). |
If the CondDBESSource is fetching new IOVs while running, could we get a printout in here?
cmssw/CondCore/CondDB/src/IOVProxy.cc Lines 295 to 318 in 5dd85e8
This appears to be where new IOVs would be fetched. Knowing |
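A rough sketch of the kind of printout being asked for; the helper function, the tag name, and the "since" value below are placeholders for illustration, not the actual IOVProxy internals:
// Hypothetical diagnostic for illustration only; "tagName" and "newSince" are
// placeholder names, not actual IOVProxy data members.
#include "FWCore/MessageLogger/interface/MessageLogger.h"
#include <cstdint>
#include <string>

void reportIovFetch(const std::string& tagName, std::uint64_t newSince) {
  edm::LogInfo("CondDBESSource") << "Fetched a new IOV group for tag '" << tagName
                                 << "', first since = " << newSince;
}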
are you suggesting to apply this patch in production and debugging live? We haven't been able to reproduce the issue offline so far. |
Just for the record, I tried to reproduce (unsuccessfully) also with the error stream files, with:
#!/bin/bash -ex
#in CMSSW_14_0_15_patch1
hltGetConfiguration run:386872 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input /store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0175_index000013_fu-c2b04-26-01_pid3160231.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0175_index000033_fu-c2b04-26-01_pid3160231.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0175_index000147_fu-c2b01-11-01_pid832479.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0175_index000244_fu-c2b01-39-01_pid2435484.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0175_index000282_fu-c2b03-37-01_pid4070284.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0175_index000284_fu-c2b03-37-01_pid4070284.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000195_fu-c2b01-23-01_pid1248291.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000205_fu-c2b01-23-01_pid1248291.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000207_fu-c2b02-13-01_pid3339298.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000238_fu-c2b03-03-01_pid2506133.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000245_fu-c2b04-32-01_pid262314.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000267_fu-c2b04-33-01_pid186185.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000301_fu-c2b04-33-01_pid186185.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000306_fu-c2b01-40-01_pid1590507.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000333_fu-c2b04-33-01_pid186133.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000362_fu-c2b03-10-01_pid2698052.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000386_fu-c2b03-10-01_pid2698052.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0177_index000181_fu-c2b03-16-01_pid195016.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0177_index000194_fu-c2b03-16-01_pid195016.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0177_index000275_fu-c2b02-34-01_pid618556.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0177_index000295_fu-c2b02-34-01_pid618556.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0177_index000388_fu-c2b14-39-01_pid2868193.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000021_fu-c2b05-33-01_pid2821581.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000032_fu-c2b05-33-01_pid2821581.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000057_fu-c2b03-18-01_pid482065.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000092_fu-c2b03-18-01_pid482065.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000153_fu-c2b05-18-01_pid875422.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000161_fu-c2b02-27-01_pid2912926.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000181_fu-c2b01-16-01_pid185808.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000184_fu-c2b02-27-01_pid2912926.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000200_fu-c2b05-18-01_pid875422.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000350_fu-c2b03-10-01_pid2698167.root > hlt_386872.py
cat <<@EOF >> hlt_386872.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
cmsRun hlt_386872.py &> hlt_386872.log |
Not quite. I think we should create a release with this and only this change and then run it in HLT (and all future standard production as well). |
I fail to see the difference, but by all means if you think it will help, please provide a PR for that. |
See #46393 |
More crashes observed in run 386951: 43 total HLT alerts in F3Mon (corresponding elog). EDIT: also in this case the crash does not reproduce offline.
#!/bin/bash -ex
#in CMSSW_14_0_15_patch1
# Define the directory
DIR="/store/group/tsg/FOG/error_stream_root/run386951/"
# Generate a comma-separated list of the full file paths
file_list=$(ls "/eos/cms$DIR" | awk -v dir="$DIR" '{print dir $0}' | paste -sd "," -)
# Print the result
echo "$file_list"
hltGetConfiguration run:386951 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input $file_list > hlt_386951.py
cat <<@EOF >> hlt_386951.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
cmsRun hlt_386951.py &> hlt_386951.log |
I asked Frontier Support for help, but no obvious errors were recorded on the Frontier side on October 6th. I am reviewing the squid and tomcat logs to see what data was requested and transferred, and whether any changes were made to the data after the transfer. |
We had another instance of this issue in run 387713: old_hlt_run387713_pid2171967.log. While the pattern of the crash is mostly the usual one:
%MSG-w BeamSpotOnlineProducer: BeamSpotOnlineProducer:hltOnlineBeamSpot 03-Nov-2024 18:01:11 CET Run: 387713 Event: 16658149
Online Beam Spot producer falls back to DB value because the ESProducer returned a fake beamspot
%MSG
...
%MSG-e UnusableBeamSpot: PrimaryVertexProducer:hltVerticesPF 03-Nov-2024 18:01:14 CET Run: 387713 Event: 17032673
Beamspot with invalid errors [ 0 0 -4.84531e-151
0 0 0
-4.84531e-151 0 1.06053e-07 ]
%MSG
----- Begin Fatal Exception 03-Nov-2024 18:01:14 CET-----------------------
An exception of category 'VertexException' occurred while
[0] Processing Event run: 387713 lumi: 9 event: 17032673 stream: 22
[1] Running path 'HLT_ZeroBias_Beamspot_v16'
[2] Calling method for module PrimaryVertexProducer/'hltVerticesPF'
Exception Message:
BasicSingleVertexState::could not invert error matrix
----- End Fatal Exception -------------------------------------------------
unfortunately I don't see any of the useful information that I would have expected from #46395. @Dr15Jones FYI |
It looks like the job died processing the first 24 events, which, I assume, come from the first LuminosityBlock. The code change I made does not print information from the first request to the database, as that would flood the logs with a huge amount of information. As the next possible chance for an updated value from the database can't happen until we reach the second LuminosityBlock (since we only check for IOV changes on LuminosityBlock boundaries), I would not expect any printouts. So one piece of information we have gained is that, if the problem is due to a bad value from the DB, that bad value can already be present in the first value read. |
I want to share the Frontier logs from the most recent case on 03-Nov-2024 at 18:01:14 CET Run: 387713, as this might help identify the issue. I didn’t find any unusual behavior based on the queries, but in the worst-case scenario, an outdated payload with an old IOV could be provided. Could this lead to an error? My understanding is that the payload would have to be fully incorrect to cause one. It’s difficult to precisely match logs from HLT (old_hlt_run387713_pid2171967.log) and Frontier, but based on the approximate time, frontierKey = e8d37523-bd0b-4fe2-b9d5-f9990663645e, the CMSSW version CMSSW_14_1_4_patch3, and the query content, we can make an approximate correlation. Since this request should fetch the latest payloads recently added, it should query the Frontier server because it won’t find the required data on the Squid nodes. The Frontier server tomcat logs should also show that the data was re-cached from the database. For BeamSpotOnlineHLT, cached status was used at 17:53:05 (see Tomcat logs) and a check was then made to verify that the data was still current in the database. In the cms_orcon_prod database, the first relevant IOV 1665206065299506 was inserted on 2024-11-03 at 13:13:09 CET. According to the server logs, the query passed through:
Below is the chain of hosts and the query completion time:
The logs were shortened and decoded using ~dbfrontier/bin/decodesquidlog on the Frontier server. |
At this point, I'd suggest instrumenting PrimaryVertexProducer with more diagnostics when the exception happens. We could add a |
What about something like this:
diff --git a/RecoVertex/PrimaryVertexProducer/interface/SequentialPrimaryVertexFitterAdapter.h b/RecoVertex/PrimaryVertexProducer/interface/SequentialPrimaryVertexFitterAdapter.h
index 3c75fbab200..70efbdf3b66 100644
--- a/RecoVertex/PrimaryVertexProducer/interface/SequentialPrimaryVertexFitterAdapter.h
+++ b/RecoVertex/PrimaryVertexProducer/interface/SequentialPrimaryVertexFitterAdapter.h
@@ -7,6 +7,8 @@
*/
+#include <sstream>
+
#include "RecoVertex/VertexPrimitives/interface/TransientVertex.h"
#include "TrackingTools/TransientTrack/interface/TransientTrack.h"
#include "RecoVertex/PrimaryVertexProducer/interface/PrimaryVertexFitterBase.h"
@@ -27,7 +29,17 @@ public:
const std::vector<reco::TransientTrack>& tracklist = cluster.originalTracks();
TransientVertex v;
if (useBeamConstraint && (tracklist.size() > 1)) {
- v = fitter->vertex(tracklist, beamspot);
+ try {
+ v = fitter->vertex(tracklist, beamspot);
+ } catch (VertexException& ex) {
+ edm::Exception newex(edm::errors::StdException);
+ std::ostringstream beamspotInfo;
+ beamspotInfo << beamspot; // Stream the beamspot information to a stringstream
+ newex << "An exception was thrown when processing SequentialPrimaryVertexFitterAdapter::fit() : "
+ << ex.what();
+ newex.addContext("Input BeamsSpot parameters: \n" + beamspotInfo.str());
+ throw newex;
+ }
} else if (!(useBeamConstraint) && (tracklist.size() > 1)) {
v = fitter->vertex(tracklist);
} // else: no fit ==> v.isValid()=False
When forcing the input beam spot to be default-constructed, I get this kind of annotated message:
|
There is no reason to create a new exception; you can do
try {
v = fitter->vertex(tracklist, beamspot);
} catch (VertexException& ex) {
std::ostringstream beamspotInfo;
beamspotInfo << "while processing SequentialPrimaryVertexFitterAdapter::fit() with BeamSpot parameters: \n" << beamspot;
ex.addContext("Input BeamsSpot parameters: \n" + beamspotInfo.str());
throw;
}
This preserves the original exception type and still gives the new information. |
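A small standalone illustration of the point, using a plain std::runtime_error-based stand-in (not the actual CMSSW VertexException/cms::Exception API): a bare throw; rethrows the very same exception object, so its dynamic type and any context added in the catch block are preserved for outer handlers.
// Standalone sketch: not the CMSSW exception classes, just an illustration of
// why "annotate and rethrow" beats "wrap in a new exception".
#include <iostream>
#include <stdexcept>
#include <string>

struct FitError : std::runtime_error {  // stand-in for VertexException
  using std::runtime_error::runtime_error;
  std::string context;
  void addContext(const std::string& c) { context += c + '\n'; }
};

void fit() { throw FitError("could not invert error matrix"); }

int main() {
  try {
    try {
      fit();
    } catch (FitError& ex) {
      ex.addContext("Input BeamSpot parameters: <streamed beamspot would go here>");
      throw;  // rethrows the very same object: dynamic type and message are preserved
    }
  } catch (const FitError& ex) {  // outer code that expects FitError still catches it
    std::cout << ex.what() << '\n' << ex.context;
  }
}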
thanks, I followed up at #46893 |
In run 383631 (pp collisions, release CMSSW_14_0_11_MULTIARCHS), we got this error:
The complete stack trace should be here:
old_hlt_run383631_pid675389.log
I tried to reproduce it, but no success:
Possibly related to #41914
@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI
Juliette, for FOG