HLT farm crash in run 379617 (part-2) #44786
Comments
@cms-sw/tracking-pog-l2 FYI |
A new Issue was created by @mmusich. @rappoccio, @antoniovilela, @smuzaffar, @makortel, @Dr15Jones, @sextonkennedy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign hlt, heterogeneous, reconstruction |
FYI @AdrianoDee |
New categories assigned: hlt,heterogeneous,reconstruction @Martin-Grunewald,@mmusich,@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks |
The failing assertion is a NaN check cmssw/RecoTracker/PixelSeeding/plugins/alpaka/BrokenLineFit.dev.cc Lines 163 to 167 in 653fed5
|
Curious. Unrelated to the crash, but I wonder if
ALPAKA_ASSERT_ACC(not isnan(fast_fit(0)));
ALPAKA_ASSERT_ACC(not isnan(fast_fit(1)));
ALPAKA_ASSERT_ACC(not isnan(fast_fit(2)));
ALPAKA_ASSERT_ACC(not isnan(fast_fit(3)));
wouldn't be easier to understand when reading the code? |
Added the following printf in
Output on GPU.
Output on CPU.
On GPU, both |
The CUDA implementation (tested by just using a different HLT menu [1]) also produces the
How to proceed ? Remove the
[1]
#!/bin/bash
jobLabel=test_cmssw44786_cuda
if [ ! -f "${jobLabel}"_cfg.py ]; then
  https_proxy=http://cmsproxy.cms:3128/ \
  hltGetConfiguration run:379660 \
    --globaltag 140X_dataRun3_HLT_v3 \
    --data \
    --no-prescale \
    --no-output \
    --paths AlCa_PFJet40_v* \
    --max-events 1 \
    --input root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/debug/240417_run379617/run379617_ls0329_index000242_fu-c2b02-12-01_pid3327112.root \
    > "${jobLabel}"_cfg.py
  cat <<@EOF >> "${jobLabel}"_cfg.py
del process.hltL1sZeroBias
if hasattr(process, 'HLTAnalyzerEndpath'):
    del process.HLTAnalyzerEndpath
try:
    del process.MessageLogger
    process.load('FWCore.MessageLogger.MessageLogger_cfi')
    process.MessageLogger.cerr.enableStatistics = False
except:
    pass
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.source.skipEvents = cms.untracked.uint32( 86 )
process.options.wantSummary = True
@EOF
fi
CUDA_LAUNCH_BLOCKING=1 \
cmsRun "${jobLabel}"_cfg.py &> "${jobLabel}".log
|
type tracking |
Naively I don't think that just removing the assert is a good approach (dust, carpet, you get the idea).
So the three points are basically aligned. |
While agreeing with the approach, we need to make an assessment soon, because every day we spend discussing this is one less day of data taking with the menu V1.1 (and all its physics trigger updates).
|
I do not see any reason for
|
I am loath to adopt "temporary" solutions, because they have the undesirable tendency to stick around much longer than intended. |
|
Technically, I think you did :-p
Probably yes.
Technically a device-side assert does just that: it makes the kernel fail, which is caught by an asynchronous exception on the host.
OK - but is there already some code on the host that would catch it and ...
... do this ? |
Yes, the track would be discarded later in the chain cmssw/RecoTracker/PixelSeeding/plugins/alpaka/CAHitNtupletGeneratorKernelsImpl.h Lines 524 to 532 in 3fb8b0e
I checked that it is the case for the event there (when disabling the NaN checks). Note that this is not only a single event, it's actually a single triplet in the whole run. |
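To make the "discarded later in the chain" idea concrete, here is a minimal sketch in Python of a downstream filter that drops any track whose fit parameters contain a NaN. It is not the code in `CAHitNtupletGeneratorKernelsImpl.h`, which is not reproduced in this thread; the names `has_nan` and `discard_bad_tracks` and the plain-tuple track representation are hypothetical, and the assumption that the rejection is NaN-based is an illustration only.

```python
import math

def has_nan(params):
    """Return True if any fit parameter is NaN (hypothetical helper)."""
    return any(math.isnan(p) for p in params)

def discard_bad_tracks(tracks):
    """Keep only tracks whose parameters are all non-NaN.

    `tracks` is assumed to be a list of tuples of floats; the real
    CMSSW data structures are different.
    """
    return [t for t in tracks if not has_nan(t)]

good = (0.1, 0.2, 1.5, 0.9)
bad = (0.1, math.nan, 1.5, 0.9)
print(discard_bad_tracks([good, bad]))
```

With a filter of this kind in place, a single degenerate triplet costs one candidate track rather than the whole job.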
Curious - |
by the way - why ? 3 (almost) aligned hits would point to a very high pT track. Why should we invalidate it ? |
could be printed with full precision using "%a" instead of "%f"? |
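As an illustration of why hexadecimal floating-point output helps here: C's `"%a"` prints every bit of the mantissa, while `"%f"` rounds to six decimals and can hide a difference in the last bits. The sketch below uses Python, where `float.hex()` plays the role of C's `"%a"` (Python's own `%a` means something else); the values `0.1 + 0.2` and `0.3` are arbitrary stand-ins, not the hit coordinates from this event.

```python
def fixed_repr(x: float) -> str:
    """Six-decimal formatting, like printf("%f") in C."""
    return "%f" % x

def exact_repr(x: float) -> str:
    """Hexadecimal formatting, the Python analogue of printf("%a") in C."""
    return x.hex()

# Two doubles that differ only in the last bit of the mantissa.
a = 0.1 + 0.2
b = 0.3

print(fixed_repr(a), fixed_repr(b))  # the two strings are identical
print(exact_repr(a), exact_repr(b))  # the two strings differ
```

The hexadecimal string round-trips exactly through `float.fromhex`, so values printed this way on GPU and CPU can be compared bit for bit.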
I added the printf and ran on CPU: got no output! |
edited the file in interface, not in interface/alpaka.... |
found
this is on cpu I suppose
is not a subtle precision issue: the cross product of those three double precision numbers (any combination) is zero (on cpu as well) no matter how one computes them. On cpu the inputs have a few bits different |
A track with zero curvature will sooner or later produce a NaN and somebody will reject it. |
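The chain from aligned hits to a NaN can be sketched in a few lines. This is a generic circle-through-three-points calculation in Python, not the `BrokenLineFit` algorithm; the function names and the sample points are assumptions made for illustration. For exactly aligned points the cross product is zero, the curvature is zero, the radius is infinite, and the first expression that subtracts two infinities yields NaN.

```python
import math

def cross2d(p0, p1, p2):
    """z-component of (p1 - p0) x (p2 - p1)."""
    return (p1[0] - p0[0]) * (p2[1] - p1[1]) - (p1[1] - p0[1]) * (p2[0] - p1[0])

def curvature(p0, p1, p2):
    """Signed curvature of the circle through three points: 2*cross/(a*b*c)."""
    a, b, c = math.dist(p0, p1), math.dist(p1, p2), math.dist(p0, p2)
    return 2.0 * cross2d(p0, p1, p2) / (a * b * c)

def radius(k):
    """1/|k| with IEEE semantics: Python raises on 1/0.0, C++ returns inf."""
    return math.inf if k == 0.0 else 1.0 / abs(k)

def sagitta(p0, p1, p2):
    """Sagitta of the arc over the chord p0-p2: r - sqrt(r^2 - (chord/2)^2)."""
    r = radius(curvature(p0, p1, p2))
    half_chord = 0.5 * math.dist(p0, p2)
    # max() only guards against tiny negative rounding errors for finite r.
    return r - math.sqrt(max(0.0, r * r - half_chord * half_chord))

aligned = ((0.0, 0.0), (1.0, 1.0), (2.0, 2.0))
print(cross2d(*aligned))    # zero cross product
print(curvature(*aligned))  # zero curvature
print(sagitta(*aligned))    # inf - inf, i.e. NaN
```

Whether such a triplet should be rejected or treated as a straight (very high pT) track is the open question in the comments above; the sketch only shows where the NaN comes from.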
for the record, we checked all the available error stream files for that run with the following script [1] in
This particular issue should be dealt with by #44808 (if accepted).
[1]
#!/bin/bash -ex
# CMSSW_14_0_5_patch2
hltGetConfiguration run:379613 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input file:converted.root > hlt.py
cat <<@EOF >> hlt.py
process.options.numberOfThreads = 32
process.options.numberOfStreams = 32
@EOF
# Define a function to execute each iteration of the loop
process_file() {
    inputfile="$1"
    outputfile="${inputfile%.root}"
    cp hlt.py hlt_${outputfile}.py
    sed -i "s/file:converted\.root/\/store\/group\/tsg\/FOG\/debug\/240417_run379613\/${inputfile}/g" hlt_${outputfile}.py
    cmsRun hlt_${outputfile}.py &> "${outputfile}.log"
}
# Export the function so it can be used by parallel
export -f process_file
# Find the root files and run the function in parallel using GNU Parallel
eos ls /eos/cms/store/group/tsg/FOG/debug/240417_run379613/ | grep '\.root$' | parallel -j 8 process_file
|
+hlt
|
+heterogeneous |
While reviewing the whole list of error streamer files from run 379617 (related issue #44769) stored in /eos/cms/store/group/tsg/FOG/debug/240417_run379617/ to ascertain whether CMSSW_14_0_5_patch2 fixed all of them, using the following script [1], I found a single instance that still crashes when taking as input the file /eos/cms/store/group/tsg/FOG/debug/240417_run379617/run379617_ls0329_index000242_fu-c2b02-12-01_pid3327112.root.
To reproduce:
and then running:
On lxplus-gpu the following assertion is hit:
while on lxplus (so on CPU) no crash is observed.
@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI
[1]