HLT Farm crashes in run 378940 #44634
A new Issue was created by @wonpoint4. @makortel, @sextonkennedy, @antoniovilela, @Dr15Jones, @smuzaffar, @rappoccio can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here.

assign hlt, heterogeneous

New categories assigned: hlt, heterogeneous. @Martin-Grunewald, @mmusich, @fwyzard, @makortel you have been requested to review this Pull request/Issue and eventually sign. Thanks.
Executing the reproducer with … I see the following stack:

@cms-sw/pf-l2 FYI
From the stack trace, it seems that an exception was thrown while another exception was being handled:
@mmusich, if you have time to look into this further, could you try running with a single stream / single thread, and post the full stack trace?
Sure. Adding that to the configuration file, I get the following stack, attached: crash_run378940.log
Thanks. So, "cudaErrorIllegalAddress" is basically the GPU equivalent of a segmentation violation :-(

What happens with the stack trace is that once we hit a CUDA error, we raise an exception and start unwinding the stack. While doing that we try to free some CUDA memory, but that call fails as well (once a CUDA context hits an illegal-address error, all subsequent CUDA calls in that context also fail), and throwing a second exception during stack unwinding terminates the program.

Of course this doesn't explain the reason for the error that we hit in the first place... that will need to be debugged.
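The failure mode described above (a second exception thrown while the first is still unwinding, which calls std::terminate()) can be sketched in a few lines of standalone C++. The DeviceBuffer type and the simulated failing "free" below are entirely hypothetical, not the actual CMSSW or Alpaka code; checking std::uncaught_exceptions() in the destructor is one way to let the original exception survive:

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical RAII wrapper: its destructor would normally throw if the
// (simulated) device free fails. Throwing from a destructor while another
// exception is unwinding the stack calls std::terminate() -- the crash
// pattern described above. Checking std::uncaught_exceptions() lets the
// destructor swallow the secondary error instead of throwing.
struct DeviceBuffer {
  ~DeviceBuffer() noexcept(false) {
    bool freeFailed = true;  // pretend the CUDA free returned an error
    if (freeFailed) {
      if (std::uncaught_exceptions() > 0) {
        // already unwinding: record/log the error, but do not throw
      } else {
        throw std::runtime_error("free failed");
      }
    }
  }
};

std::string runKernel() {
  try {
    DeviceBuffer buf;
    // simulate the primary device error surfacing as an exception
    throw std::runtime_error("cudaErrorIllegalAddress");
  } catch (const std::exception& e) {
    return e.what();  // the primary exception arrives intact
  }
}
```

With the guard in place, runKernel() returns the original "cudaErrorIllegalAddress" message; removing the std::uncaught_exceptions() check would turn the same scenario into a std::terminate() abort instead of a catchable exception.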
Here's a second reproducer (same input events). I see the seg-fault when running on CPU only, too.

#!/bin/bash -ex
# CMSSW_14_0_4
hltGetConfiguration run:378940 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input /store/group/tsg/FOG/debug/240405_run378940/files/run378940_ls0021_index000036_fu-c2b02-31-01_pid1363776.root \
> hlt.py
cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.options.accelerators = ["*"]
@EOF
CUDA_LAUNCH_BLOCKING=1 \
cmsRun hlt.py &> hlt.log

Stack trace here: hlt.log
type pf
would running in … ?
The trace was more informative when recompiled with …
Just to note that (see #44634 (comment)) I get a crash even on CPU, so I suspect the issue is unrelated to CUDA or GPUs (but it should be double-checked, of course). In that case, the title of the issue should be updated. @wonpoint4
I was wondering if the warning I reported above, generated here:

cmssw/RecoParticleFlow/PFClusterProducer/plugins/alpaka/PFClusterSoAProducerKernel.dev.cc, lines 1308 to 1311 in f5861db

might give hints.
It sort of makes sense to me that with this …

I am still investigating the PF Alpaka kernel, since this number of rechit fractions seems strangely large while the preceding events look more reasonable.
I'm guessing that …
@jsamudio could you check what the actual SoA size is in the event where the crash happens? If this overflow is the cause of the crash, what can be done to avoid it?
In the event where we see the crash we have …

As for adding an error and skipping the event: I understand the idea, but I don't know if I've seen an example of something similar before. Perhaps someone else has and could point me to an implementation?
As a quick workaround, would it work to increase the 120 to something like 250 in the HLT menu? Not as a long-term solution, but to eliminate, or at least reduce, the online crashes while a better solution is being investigated.
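For a rough sense of the numbers involved: the fraction buffer is sized proportionally to the allocation parameter, so the overflow condition is simple arithmetic. The sketch below uses hypothetical names (capacity, overflows, nRecHits) and a made-up rechit count of 16000; only the allocation values 120/250 and the pcrhFracSize value 2220194 quoted in the log later in the thread are taken from the discussion.

```cpp
#include <cassert>
#include <cstdint>

// Sketch only -- names are hypothetical, not the actual CMSSW ones.
// The rechit-fraction SoA is sized as nRecHits * allocPerHit, where
// allocPerHit is the pfRecHitFractionAllocation parameter (120 in the
// deployed menu). If the number of fractions actually produced exceeds
// that product, the kernel writes past the end of the buffer.
constexpr std::int64_t capacity(std::int64_t nRecHits, std::int64_t allocPerHit) {
  return nRecHits * allocPerHit;
}

constexpr bool overflows(std::int64_t neededFractions,
                         std::int64_t nRecHits,
                         std::int64_t allocPerHit) {
  return neededFractions > capacity(nRecHits, allocPerHit);
}
```

For illustration, assuming ~16000 rechits in the problematic event (a made-up number), an allocation of 120 gives room for 1,920,000 fractions, below the 2,220,194 reported in the log, while 250 gives 4,000,000, which would fit.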
Would this entail a configuration change, or a change in the code (i.e. a new online release)?
I think it's a configuration parameter.
Answering myself:

process.hltParticleFlowClusterHBHESoA = cms.EDProducer( "PFClusterSoAProducer@alpaka",
pfRecHits = cms.InputTag( "hltParticleFlowRecHitHBHESoA" ),
pfClusterParams = cms.ESInputTag( "hltESPPFClusterParams","" ),
topology = cms.ESInputTag( "hltESPPFRecHitHCALTopology","" ),
synchronise = cms.bool( False ),
- pfRecHitFractionAllocation = cms.int32( 120 ),
+ pfRecHitFractionAllocation = cms.int32( 250 ),
alpaka = cms.untracked.PSet( backend = cms.untracked.string( "" ) )
)
FTR, I double-checked that #44634 (comment) avoids the crash in the reproducer, and the HLT throughput is not affected, so it looks like a good short-term solution. Two extra notes: …
I took a stab at having the error(s) reported properly via exceptions rather than crashes (caused by exceptions being thrown during the stack unwinding triggered by an earlier exception). #44730 should improve the situation (especially when running with …).

While developing the PR I started to wonder whether an Alpaka-specific (or GPU-runtime-specific?) exception type would be useful.
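Such a dedicated exception type could look roughly like the sketch below. This is entirely hypothetical and not the PR's actual design; the point is that carrying the originating API and error code lets framework-level code distinguish device-runtime failures from a generic std::exception and react differently (e.g. disable the GPU backend or skip the event):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical sketch (not CMSSW/Alpaka code): a dedicated exception
// type for device-runtime failures. The API name and numeric error code
// are preserved alongside the human-readable message, so a catch site
// can dispatch on them instead of string-matching what().
class DeviceRuntimeError : public std::runtime_error {
public:
  DeviceRuntimeError(const std::string& api, int code, const std::string& what)
      : std::runtime_error(api + " error " + std::to_string(code) + ": " + what),
        code_(code) {}
  int code() const noexcept { return code_; }

private:
  int code_;
};
```

A catch clause for DeviceRuntimeError would then fire before any generic catch (std::exception const&), which is exactly the distinction a framework needs to treat device failures specially.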
For the record, this was also tracked at https://its.cern.ch/jira/browse/CMSHLT-3144.
Proposed solutions:
In a …, the following reproducer

#!/bin/bash -ex
#in CMSSW_14_0_15_patch1
hltGetConfiguration run:378940 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input /store/group/tsg/FOG/error_stream_root/run378940/run378940_ls0021_index000036_fu-c2b02-31-01_pid1363776.root > hlt_378940.py
cat <<@EOF >> hlt_378940.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
cmsRun hlt_378940.py &> hlt_378940.log

was still failing with the following messages:

At the end of topoClusterContraction, found large *pcrhFracSize = 2220194
At the end of topoClusterContraction, found large *pcrhFracSize = 2213019
Out of range index in ViewTemplateFreeParams::operator[]
[...]
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_17-el8_amd64_gcc12/build/CMSSW_14_0_17-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_17-el8_amd64_gcc12/build/CMSSW_14_0_17-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_17-el8_amd64_gcc12/build/CMSSW_14_0_17-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_17-el8_amd64_gcc12/build/CMSSW_14_0_17-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
----- Begin Fatal Exception 07-Oct-2024 10:58:20 CEST-----------------------
An exception of category 'StdException' occurred while
[0] Processing Event run: 378940 lumi: 21 event: 5339574 stream: 0
[1] Running path 'DQM_HcalReconstruction_v7'
[2] Calling method for module alpaka_serial_sync::PFClusterSoAProducer/'hltParticleFlowClusterHBHESoACPUSerial'
Exception Message:
A std::exception was thrown.
Out of range index in ViewTemplateFreeParams::operator[]
----- End Fatal Exception -------------------------------------------------

whereas cherry-picking the commits from PR #46136 the job successfully finishes. [1]

[1]
commit e119a60a1e01b4fe2f6444f43787ea92cc4f1911
Author: mmusich <[email protected]>
Date: Mon Oct 7 11:01:42 2024 +0200
re-introduce customizeHLTfor44591
diff --git a/HLTrigger/Configuration/python/customizeHLTforCMSSW.py b/HLTrigger/Configuration/python/customizeHLTforCMSSW.py
index f44657dfa5f..83e2966d8e0 100644
--- a/HLTrigger/Configuration/python/customizeHLTforCMSSW.py
+++ b/HLTrigger/Configuration/python/customizeHLTforCMSSW.py
@@ -261,6 +261,17 @@ def checkHLTfor43774(process):
     return process
 
+
+def customizeHLTfor44591(process):
+    """
+    Customisation for running HLT with the updated btag info producers from the PR 44591
+    """
+    for type in ["DeepFlavourTagInfoProducer", "ParticleTransformerAK4TagInfoProducer", "DeepBoostedJetTagInfoProducer"]:
+        for producer in producers_by_type(process, type):
+            if hasattr(producer, 'unsubjet_map'):
+                delattr(producer, 'unsubjet_map')
+    return process
+
 # CMSSW version specific customizations
 def customizeHLTforCMSSW(process, menuType="GRun"):
@@ -270,5 +281,6 @@ def customizeHLTforCMSSW(process, menuType="GRun"):
     # process = customiseFor12718(process)
     process = checkHLTfor43774(process)
-
+    process = customizeHLTfor44591(process)
+
     return process
+1
This issue is fully signed and ready to be closed. |
@cmsbuild, please close |
Report of the large number of GPU-related HLT crashes last night (elog).

Here's the recipe to reproduce the crashes (tested with CMSSW_14_0_4 on lxplus8-gpu).

@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI