HLT crash in run 383631 - 'VertexException' and HcalDigisProducerGPU:hltHcalDigisGPU error #45555
cms-bot internal usage |
A new Issue was created by @jalimena. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
For the record, also using repacked data with the following script:
#!/bin/bash -ex
# CMSSW_14_0_11_MULTIARCHS
hltGetConfiguration run:383631 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input '/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0444_index000423_fu-c2b14-43-01_pid675389.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0445_index000025_fu-c2b14-43-01_pid675389.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0666_index000047_fu-c2b14-43-01_pid675067.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0445_index000002_fu-c2b14-43-01_pid675389.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0666_index000027_fu-c2b14-43-01_pid675067.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0666_index000065_fu-c2b14-43-01_pid675067.root' > hlt.py
cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
cmsRun hlt.py &> hlt.log
This does not reproduce the crash. |
@jalimena where can we find the full stack trace for all threads? |
Sorry, I have uploaded the complete stack trace from the DAQ shifter in the issue description. Let me know if it looks incomplete. |
As the "regular" reproducing script (see #45555 (comment)) doesn't lead to a crash I have explored the option of running filling with junk the memory on the host and device allocators (cf #44923 (comment)).
when doing the same in
Not sure if that's expected. [1]
#!/bin/bash -ex
# CMSSW_14_0_11_MULTIARCHS
hltGetConfiguration run:383631 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input '/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0444_index000423_fu-c2b14-43-01_pid675389.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0445_index000025_fu-c2b14-43-01_pid675389.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0666_index000047_fu-c2b14-43-01_pid675067.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0445_index000002_fu-c2b14-43-01_pid675389.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0666_index000027_fu-c2b14-43-01_pid675067.root,/store/group/tsg/FOG/error_stream_root/run383631/run383631_ls0666_index000065_fu-c2b14-43-01_pid675067.root' > hlt.py
cat <<@EOF >> hlt.py
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')
process.MessageLogger.CUDAService = {}
process.MessageLogger.AlpakaService = {}
process.load('HeterogeneousCore.CUDAServices.CUDAService_cfi')
from HeterogeneousCore.AlpakaServices.AlpakaServiceCudaAsync_cfi import AlpakaServiceCudaAsync as _AlpakaServiceCudaAsync
process.AlpakaServiceCudaAsync = _AlpakaServiceCudaAsync.clone(
verbose = True,
hostAllocator = dict(
binGrowth = 2,
minBin = 8, # 256 bytes
maxBin = 30, # 1 GB
maxCachedBytes = 64*1024*1024*1024, # 64 GB
maxCachedFraction = 0.8, # or 80%, whatever is less
fillAllocations = True,
fillAllocationValue = 0xA5,
fillReallocations = True,
fillReallocationValue = 0x69,
fillDeallocations = True,
fillDeallocationValue = 0x5A,
fillCaches = True,
fillCacheValue = 0x96
),
deviceAllocator = dict(
binGrowth = 2,
minBin = 8, # 256 bytes
maxBin = 30, # 1 GB
maxCachedBytes = 8*1024*1024*1024, # 8 GB
maxCachedFraction = 0.8, # or 80%, whatever is less
fillAllocations = True,
fillAllocationValue = 0xA5,
fillReallocations = True,
fillReallocationValue = 0x69,
fillDeallocations = True,
fillDeallocationValue = 0x5A,
fillCaches = True,
fillCacheValue = 0x96
)
)
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
cmsRun hlt.py &> hlt.log |
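For context on what the fill* options above are for, here is a toy standalone sketch of allocation poisoning (it is not the CMSSW/Alpaka caching allocator itself): writing known byte patterns such as 0xA5 on allocation and 0x5A on deallocation makes reads of uninitialized or stale memory show up as recognizable values instead of silent garbage.
// Toy illustration of allocation poisoning; not the actual CMSSW/Alpaka caching allocator.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

constexpr std::uint8_t kAllocFill = 0xA5;  // pattern written on allocation
constexpr std::uint8_t kFreeFill = 0x5A;   // pattern written on deallocation

std::vector<std::uint8_t> allocate(std::size_t n) {
  std::vector<std::uint8_t> buf(n);
  std::memset(buf.data(), kAllocFill, n);  // anything read before being written shows up as 0xA5
  return buf;
}

void release(std::vector<std::uint8_t>& buf) {
  std::memset(buf.data(), kFreeFill, buf.size());  // stale reads after "free" show up as 0x5A
}

int main() {
  auto buf = allocate(8);
  std::cout << std::hex << +buf[0] << '\n';  // prints "a5": an uninitialized read would be recognizable
  release(buf);
  std::cout << std::hex << +buf[0] << '\n';  // prints "5a": a use-after-free-style read would be recognizable
}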
Thanks. There's a possibly relevant error from the PrimaryVertexProducer, and I guess a question for physics is whether having the BasicSingleVertexState throw on invalid errors is the desired behavior:
The detailed stack trace shows that its interpretation is complicated by the async GPU calls--the |
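For illustration, a minimal standalone sketch of why a beamspot with a null or singular covariance ends in a "could not invert error matrix" exception; this is a plain 3x3 determinant check with a std::runtime_error stand-in, not the actual BasicSingleVertexState code.
#include <array>
#include <iostream>
#include <stdexcept>

using Sym33 = std::array<std::array<double, 3>, 3>;

double det3x3(const Sym33& m) {
  return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1]) -
         m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0]) +
         m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

void requireInvertible(const Sym33& cov) {
  // stand-in for the VertexException thrown when the error matrix cannot be inverted
  if (det3x3(cov) == 0.0)
    throw std::runtime_error("could not invert error matrix");
}

int main() {
  Sym33 nullCov{};  // all-zero beamspot covariance, as delivered by a "fake" beamspot
  try {
    requireInvertible(nullCov);
  } catch (const std::exception& ex) {
    std::cout << ex.what() << '\n';  // the beam-constrained vertex fit cannot proceed
  }
}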
I didn't notice at first glance:
So the plot thickens, and things look to point in the direction of the already observed #41914, in which we suspected beamspot issues. |
assign hlt, reconstruction, alca |
type tracking |
@cms-sw/tracking-pog-l2 FYI |
New categories assigned: hlt,reconstruction,alca @Martin-Grunewald,@mmusich,@jfernan2,@mandrenguyen,@saumyaphor4252,@perrotta,@consuegs you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Spoon-feeding a fake BeamSpot with a null covariance matrix to the HLT with:
diff --git a/RecoVertex/BeamSpotProducer/plugins/BeamSpotOnlineProducer.cc b/RecoVertex/BeamSpotProducer/plugins/BeamSpotOnlineProducer.cc
index 83aa832cfa5..56f0c8948b6 100644
--- a/RecoVertex/BeamSpotProducer/plugins/BeamSpotOnlineProducer.cc
+++ b/RecoVertex/BeamSpotProducer/plugins/BeamSpotOnlineProducer.cc
@@ -108,7 +108,7 @@ void BeamSpotOnlineProducer::produce(Event& iEvent, const EventSetup& iSetup) {
edm::LogWarning("BeamSpotFromDB")
<< "Online Beam Spot producer falls back to DB value because the ESProducer returned a fake beamspot ";
}
- fallBackToDB = true;
+ //fallBackToDB = true;
} else {
// translate from BeamSpotObjects to reco::BeamSpot
// in case we need to switch to LHC reference frame
diff --git a/RecoVertex/BeamSpotProducer/plugins/OnlineBeamSpotESProducer.cc b/RecoVertex/BeamSpotProducer/plugins/OnlineBeamSpotESProducer.cc
index 0b8c4233c7c..2c42e2b6d63 100644
--- a/RecoVertex/BeamSpotProducer/plugins/OnlineBeamSpotESProducer.cc
+++ b/RecoVertex/BeamSpotProducer/plugins/OnlineBeamSpotESProducer.cc
@@ -55,13 +55,21 @@ OnlineBeamSpotESProducer::OnlineBeamSpotESProducer(const edm::ParameterSet& p)
fakeBS_.setPosition(0.0001, 0.0001, 0.0001);
fakeBS_.setType(-1);
// Set diagonal covariance, i.e. errors on the parameters
- fakeBS_.setCovariance(0, 0, 5e-10);
- fakeBS_.setCovariance(1, 1, 5e-10);
- fakeBS_.setCovariance(2, 2, 0.002);
- fakeBS_.setCovariance(3, 3, 0.002);
- fakeBS_.setCovariance(4, 4, 5e-11);
- fakeBS_.setCovariance(5, 5, 5e-11);
- fakeBS_.setCovariance(6, 6, 1e-09);
+ fakeBS_.setCovariance(0, 0, 0.);
+ fakeBS_.setCovariance(1, 1, 0.);
+ fakeBS_.setCovariance(2, 2, 0.);
+ fakeBS_.setCovariance(3, 3, 0.);
+ fakeBS_.setCovariance(4, 4, 0.);
+ fakeBS_.setCovariance(5, 5, 0.);
+ fakeBS_.setCovariance(6, 6, 0.);
bsHLTToken_ = cc.consumesFrom<BeamSpotOnlineObjects, BeamSpotOnlineHLTObjectsRcd>();
bsLegacyToken_ = cc.consumesFrom<BeamSpotOnlineObjects, BeamSpotOnlineLegacyObjectsRcd>();
@@ -179,7 +187,7 @@ std::shared_ptr<const BeamSpotObjects> OnlineBeamSpotESProducer::produce(const B
return std::shared_ptr<const BeamSpotObjects>(best, edm::do_nothing_deleter());
}
edm::LogWarning("OnlineBeamSpotESProducer")
- << "None of the Online BeamSpots in the ES is suitable, \n returning a fake one(fallback to PCL).";
+ << "None of the Online BeamSpots in the ES is suitable, \n returning a fake one (fallback to PCL).";
return std::shared_ptr<const BeamSpotObjects>(&fakeBS_, edm::do_nothing_deleter());
} I sort of reproduce the crash:
the questions are:
|
there's no such payload in the DB for run 383631: |
does it look like there was a glitch in the DB access or did the zeroes show up from another part of the logic in the BS producer? |
Possible, but it's not clear from where it would read it then (I would have expected a framework exception).
By looking at the code it's not clear to me from where it could come. |
It could indeed make sense to demote the exception to |
After a more careful look I convinced myself this case has to stay as an exception. Generally we want to stop processing as soon as possible when this situation happens (e.g. in this case to prevent calling
Modules in different streams throwing exceptions should be fine, and two independent modules processed by one stream throwing exceptions should also be fine. From the (core) framework perspective the
I would add tests for the "Alpaka framework" that an Alpaka module throwing an exception in |
Having written that, I wonder what the other exception
which I'd bet is the cause for the std::terminate() here 🤦. I'll fix that.
|
@cms-sw/db-l2 @PonIlya is it possible to have an audit of the P5 frontier squid at the time of the crash ( 25-Jul-2024 08:49:19 )? |
|
@PonIlya, thank you for the reply.
to be perfectly honest I am not sure, but there are indications that somehow the connection to the DB "glitched" during this period of time. As a DB expert you are perhaps in a better position to judge if there was any anomaly during that time. |
@mmusich Unfortunately, the logs are no longer available for audit. Let me know if a similar issue arises again so that I can respond more quickly. The logs are deleted once they reach a certain size. For example, as of today, only logs from August 7th are available. |
Logging here for the record: in run 386614 we observed a burst of HLT crashes, all around 02:44-02:45, on multiple FUs.
(the other crashes on the same FU come from the same PID). In total, 8 processes crashed (logs are attached below [1]). Interestingly, in the logs there's this:
(which matches the content of the DQM plots), which comes from here, but there is no warning from here, which I would have expected to see as well (given the other kind of message from the vertex producer). @cms-sw/db-l2 @PonIlya FYI [1] [2]
#!/bin/bash -ex
#in CMSSW_14_0_15_patch1
hltGetConfiguration run:386614 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input /store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0055_index000003_fu-c2b02-33-01_pid4022333.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0055_index000005_fu-c2b04-26-01_pid18256.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0055_index000019_fu-c2b04-26-01_pid18256.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0055_index000027_fu-c2b02-33-01_pid4022333.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0055_index000033_fu-c2b04-26-01_pid18256.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0055_index000062_fu-c2b02-12-01_pid399919.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0055_index000078_fu-c2b02-12-01_pid399919.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0055_index000156_fu-c2b02-33-01_pid4022422.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0056_index000023_fu-c2b03-34-01_pid3391489.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0056_index000046_fu-c2b03-34-01_pid3391489.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0056_index000101_fu-c2b14-11-01_pid2066514.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0056_index000138_fu-c2b01-36-01_pid984863.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0056_index000152_fu-c2b01-36-01_pid984863.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0057_index000266_fu-c2b01-12-01_pid3143597.root,/store/group/tsg/FOG/error_stream_root/run386614/run386614_ls0057_index000277_fu-c2b01-12-01_pid3143597.root > hlt_386614.py
cat <<@EOF >> hlt_386614.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
cmsRun hlt_386614.py &> hlt_386614.log |
As far as I could dig, the only place in which a default-constructed
cmssw/DataFormats/BeamSpot/src/BeamSpot.cc Lines 31 to 35 in d4aca16
is here:
On the other hand, it's not clear to me why, in the following instructions in the routine
cmssw/RecoVertex/PrimaryVertexProducer/plugins/PrimaryVertexProducer.cc Lines 160 to 166 in 1d0afd0
if the
At any rate, skipping the fetching of the
diff --git a/RecoVertex/PrimaryVertexProducer/interface/SequentialPrimaryVertexFitterAdapter.h b/RecoVertex/PrimaryVertexProducer/interface/SequentialPrimaryVertexFitterAdapter.h
index bd5f866e2f2..370786b4529 100644
--- a/RecoVertex/PrimaryVertexProducer/interface/SequentialPrimaryVertexFitterAdapter.h
+++ b/RecoVertex/PrimaryVertexProducer/interface/SequentialPrimaryVertexFitterAdapter.h
@@ -26,7 +26,7 @@ public:
for (auto& cluster : clusters) {
const std::vector<reco::TransientTrack>& tracklist = cluster.originalTracks();
TransientVertex v;
- if (useBeamConstraint && (tracklist.size() > 1)) {
+ if (useBeamConstraint && (tracklist.size() > 1) && beamspot.type() != reco::BeamSpot::Unknown) {
v = fitter->vertex(tracklist, beamspot);
} else if (!(useBeamConstraint) && (tracklist.size() > 1)) {
v = fitter->vertex(tracklist);
Having said that, while this seems relatively safe to do in any case, I am not sure if we really want to keep processing the event in such cases (also probably the processing would crash elsewhere). |
For bookkeeping, I'm attaching the log files from run 386703. There were 14 instances (42 total alerts in F3Mon). |
If the CondDBESSource is fetching new IOVs while running, could we get a printout in here?
cmssw/CondCore/CondDB/src/IOVProxy.cc Lines 295 to 318 in 5dd85e8
This appears to be where new IOVs would be fetched. Knowing |
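A rough sketch of the kind of printout being asked for; the helper function, the tag name, and the "since" value below are placeholders for illustration, not the actual IOVProxy internals:
// Hypothetical diagnostic for illustration only; "tagName" and "newSince" are
// placeholder names, not actual IOVProxy data members.
#include "FWCore/MessageLogger/interface/MessageLogger.h"
#include <cstdint>
#include <string>

void reportIovFetch(const std::string& tagName, std::uint64_t newSince) {
  edm::LogInfo("CondDBESSource") << "Fetched a new IOV group for tag '" << tagName
                                 << "', first since = " << newSince;
}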
are you suggesting to apply this patch in production and debugging live? We haven't been able to reproduce the issue offline so far. |
Just for the record, I tried to reproduce (unsuccessfully) also with the error stream files, with:
#!/bin/bash -ex
#in CMSSW_14_0_15_patch1
hltGetConfiguration run:386872 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input /store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0175_index000013_fu-c2b04-26-01_pid3160231.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0175_index000033_fu-c2b04-26-01_pid3160231.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0175_index000147_fu-c2b01-11-01_pid832479.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0175_index000244_fu-c2b01-39-01_pid2435484.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0175_index000282_fu-c2b03-37-01_pid4070284.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0175_index000284_fu-c2b03-37-01_pid4070284.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000195_fu-c2b01-23-01_pid1248291.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000205_fu-c2b01-23-01_pid1248291.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000207_fu-c2b02-13-01_pid3339298.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000238_fu-c2b03-03-01_pid2506133.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000245_fu-c2b04-32-01_pid262314.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000267_fu-c2b04-33-01_pid186185.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000301_fu-c2b04-33-01_pid186185.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000306_fu-c2b01-40-01_pid1590507.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000333_fu-c2b04-33-01_pid186133.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000362_fu-c2b03-10-01_pid2698052.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0176_index000386_fu-c2b03-10-01_pid2698052.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0177_index000181_fu-c2b03-16-01_pid195016.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0177_index000194_fu-c2b03-16-01_pid195016.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0177_index000275_fu-c2b02-34-01_pid618556.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0177_index000295_fu-c2b02-34-01_pid618556.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0177_index000388_fu-c2b14-39-01_pid2868193.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000021_fu-c2b05-33-01_pid2821581.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000032_fu-c2b05-33-01_pid2821581.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000057_fu-c2b03-18-01_pid482065.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000092_fu-c2b03-18-01_pid482065.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000153_fu-c2b05-18-01_pid875422.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000161_fu-c2b02-27-01_pid2912926.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000181_fu-c2b01-16-01_pid185808.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000184_fu-c2b02-27-01_pid2912926.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000200_fu-c2b05-18-01_pid875422.root,/store/group/tsg/FOG/error_stream_root/run386872/run386872_ls0178_index000350_fu-c2b03-10-01_pid2698167.root > hlt_386872.py
cat <<@EOF >> hlt_386872.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
cmsRun hlt_386872.py &> hlt_386872.log |
Not quite. I think we should create a release with this and only this change and then run it in HLT (and all future standard production as well). |
I fail to see the difference, but by all means if you think it will help, please provide a PR for that. |
See #46393 |
More crashes observed in run 386951: 43 total HLT alerts in F3Mon (corresponding elog). EDIT: also in this case the crash does not reproduce offline.
#!/bin/bash -ex
#in CMSSW_14_0_15_patch1
# Define the directory
DIR="/store/group/tsg/FOG/error_stream_root/run386951/"
# Generate a comma-separated list of the full file paths
file_list=$(ls "/eos/cms$DIR" | awk -v dir="$DIR" '{print dir $0}' | paste -sd "," -)
# Print the result
echo "$file_list"
hltGetConfiguration run:386951 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input $file_list > hlt_386951.py
cat <<@EOF >> hlt_386951.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
cmsRun hlt_386951.py &> hlt_386951.log |
I asked Frontier Support for help, but no obvious errors were recorded on the Frontier side on October 6th. I am reviewing the squid and tomcat logs to see what data was requested and transferred, and whether any changes were made to the data after the transfer. |
We had another instance of this issue in run 387713: old_hlt_run387713_pid2171967.log. While the pattern of the crash is mostly the usual one:
%MSG-w BeamSpotOnlineProducer: BeamSpotOnlineProducer:hltOnlineBeamSpot 03-Nov-2024 18:01:11 CET Run: 387713 Event: 16658149
Online Beam Spot producer falls back to DB value because the ESProducer returned a fake beamspot
%MSG
...
%MSG-e UnusableBeamSpot: PrimaryVertexProducer:hltVerticesPF 03-Nov-2024 18:01:14 CET Run: 387713 Event: 17032673
Beamspot with invalid errors [ 0 0 -4.84531e-151
0 0 0
-4.84531e-151 0 1.06053e-07 ]
%MSG
----- Begin Fatal Exception 03-Nov-2024 18:01:14 CET-----------------------
An exception of category 'VertexException' occurred while
[0] Processing Event run: 387713 lumi: 9 event: 17032673 stream: 22
[1] Running path 'HLT_ZeroBias_Beamspot_v16'
[2] Calling method for module PrimaryVertexProducer/'hltVerticesPF'
Exception Message:
BasicSingleVertexState::could not invert error matrix
----- End Fatal Exception -------------------------------------------------
unfortunately I don't see any of the useful information that I would have expected from #46395. @Dr15Jones FYI |
It looks like the job died processing the first 24 events, which, I assume, come from the first LuminosityBlock. The code change I made does not print information from the first request to the database, as that would flood the logs with a huge amount of information. As the next possible chance for an updated value from the database can't happen until we reach the second LuminosityBlock (since we only check for IOV changes on LuminosityBlock boundaries), I would not expect any printouts. So one piece of information we have gained is that, if the problem is due to a bad value from the DB, that bad value can already be present in the first value read. |
I want to share the Frontier logs from the most recent case on 03-Nov-2024 at 18:01:14 CET Run: 387713, as this might help identify the issue. I didn’t find any unusual behavior based on the queries, but in the worst-case scenario, an outdated payload with an old IOV could be provided. Could this lead to an error? My understanding is that the payload would have to be fully incorrect to cause one. It’s difficult to precisely match logs from HLT (old_hlt_run387713_pid2171967.log) and Frontier, but based on the approximate time, frontierKey = e8d37523-bd0b-4fe2-b9d5-f9990663645e, the CMSSW version CMSSW_14_1_4_patch3, and the query content, we can make an approximate correlation. Since this request should fetch the latest payloads recently added, it should query the Frontier server because it won’t find the required data on the Squid nodes. The Frontier server tomcat logs should also show that the data was re-cached from the database. For BeamSpotOnlineHLT, cached status was used at 17:53:05 (see Tomcat logs) and a check was then made to verify that the data was still current in the database. In the cms_orcon_prod database, the first relevant IOV 1665206065299506 was inserted on 2024-11-03 at 13:13:09 CET. According to the server logs, the query passed through:
Below is the chain of hosts and the query completion time:
The logs were shortened and decoded using ~dbfrontier/bin/decodesquidlog on the Frontier server. |
At this point, I'd suggest instrumenting PrimaryVertexProducer with more diagnostics when the exception happens. We could add a |
What about something like this:
diff --git a/RecoVertex/PrimaryVertexProducer/interface/SequentialPrimaryVertexFitterAdapter.h b/RecoVertex/PrimaryVertexProducer/interface/SequentialPrimaryVertexFitterAdapter.h
index 3c75fbab200..70efbdf3b66 100644
--- a/RecoVertex/PrimaryVertexProducer/interface/SequentialPrimaryVertexFitterAdapter.h
+++ b/RecoVertex/PrimaryVertexProducer/interface/SequentialPrimaryVertexFitterAdapter.h
@@ -7,6 +7,8 @@
*/
+#include <sstream>
+
#include "RecoVertex/VertexPrimitives/interface/TransientVertex.h"
#include "TrackingTools/TransientTrack/interface/TransientTrack.h"
#include "RecoVertex/PrimaryVertexProducer/interface/PrimaryVertexFitterBase.h"
@@ -27,7 +29,17 @@ public:
const std::vector<reco::TransientTrack>& tracklist = cluster.originalTracks();
TransientVertex v;
if (useBeamConstraint && (tracklist.size() > 1)) {
- v = fitter->vertex(tracklist, beamspot);
+ try {
+ v = fitter->vertex(tracklist, beamspot);
+ } catch (VertexException& ex) {
+ edm::Exception newex(edm::errors::StdException);
+ std::ostringstream beamspotInfo;
+ beamspotInfo << beamspot; // Stream the beamspot information to a stringstream
+ newex << "An exception was thrown when processing SequentialPrimaryVertexFitterAdapter::fit() : "
+ << ex.what();
+ newex.addContext("Input BeamsSpot parameters: \n" + beamspotInfo.str());
+ throw newex;
+ }
} else if (!(useBeamConstraint) && (tracklist.size() > 1)) {
v = fitter->vertex(tracklist);
} // else: no fit ==> v.isValid()=False
When forcing the input beam spot to be default-constructed, I get this kind of annotated message:
|
There is no reason to create a new exception; you can do
try {
v = fitter->vertex(tracklist, beamspot);
} catch (VertexException& ex) {
std::ostringstream beamspotInfo;
beamspotInfo << "while processing SequentialPrimaryVertexFitterAdapter::fit() with BeamSpot parameters: \n" << beamspot;
ex.addContext("Input BeamsSpot parameters: \n" + beamspotInfo.str());
throw;
}
This preserves the original exception type and still gives the new information. |
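A small standalone illustration of the point, using a plain std::runtime_error-based stand-in (not the actual CMSSW VertexException/cms::Exception API): a bare throw; rethrows the very same exception object, so its dynamic type and any context added in the catch block are preserved for outer handlers.
// Standalone sketch: not the CMSSW exception classes, just an illustration of
// why "annotate and rethrow" beats "wrap in a new exception".
#include <iostream>
#include <stdexcept>
#include <string>

struct FitError : std::runtime_error {  // stand-in for VertexException
  using std::runtime_error::runtime_error;
  std::string context;
  void addContext(const std::string& c) { context += c + '\n'; }
};

void fit() { throw FitError("could not invert error matrix"); }

int main() {
  try {
    try {
      fit();
    } catch (FitError& ex) {
      ex.addContext("Input BeamSpot parameters: <streamed beamspot would go here>");
      throw;  // rethrows the very same object: dynamic type and message are preserved
    }
  } catch (const FitError& ex) {  // outer code that expects FitError still catches it
    std::cout << ex.what() << '\n' << ex.context;
  }
}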
thanks, I followed up at #46893 |
In run 383631 (pp collisions, release CMSSW_14_0_11_MULTIARCHS), we got this error:
The complete stack trace should be here:
old_hlt_run383631_pid675389.log
I tried to reproduce it, but no success:
Possibly related to #41914
@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI
Juliette, for FOG