HLT Farm crashes in runs 378366-378369 #44541
Comments
cms-bot internal usage |
A new Issue was created by @wonpoint4. @antoniovilela, @smuzaffar, @rappoccio, @Dr15Jones, @sextonkennedy, @makortel can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign hlt, heterogeneous |
New categories assigned: hlt, heterogeneous. @Martin-Grunewald, @mmusich, @fwyzard, @makortel you have been requested to review this Issue and eventually sign. Thanks |
Running the reproducer with
FYI @cms-sw/hcal-dpg-l2 |
The problem was caused by a change in the HCAL HB/HE raw data, namely the position of the trigger TS (the SOI, "Sample Of Interest") within the 8-TS digi array. The change was made on Sunday night (March 24) and was originally intended only for local LED runs, but unintentionally stayed in place for subsequent global runs. It has now been reverted to the nominal configuration. Thanks to the clarification from @mariadalfonso (who is in the US for a workshop): HCAL@GPU does assume both a fixed number of TS (8) and a fixed SOI (4th TS). An additional protection/warning will therefore be added to HCAL@GPU upon Maria's return from the US. |
For the record, in neighboring runs there have also been crashes in the online DQM, see e.g.: 378366. |
@mmusich yes, the origin of DQM crashes is the same. |
... if and when we have a full Alpaka implementation of the HCAL reconstruction, we will have a single code base to maintain :) |
I'll make sure the Alpaka implementation has some protection against different SOI/TS configurations |
Hi @abdoulline @lwang046 Eiko for DQM-DC |
Hi @syuvivida |
Hi @abdoulline Eiko |
@syuvivida |
@abdoulline We are now also seeing some failures in T0 Prompt processing jobs with similar symptoms. See |
@saumyaphor4252 @igv4321 FYI |
Just to add explicitly @mariadalfonso |
@cms-sw/hcal-dpg-l2
Will this be done for the CUDA implementation? |
Hi @missirol, thanks for bringing this up; it's not included in #44910 yet. The issue seems to be a misconfiguration that MAHI does not currently support. I need more information about
Maybe @abdoulline or @mariadalfonso will have some idea about these questions? Then we can discuss whether to include these changes in #44910 |
@kakwok @mariadalfonso The relevant digi methods are `bool soi()` (per time sample) and `int presamples()`. Normally presamples == 3. Otherwise this is bad data originating from a misconfigured HCAL (as it was back on March 24-25), which shouldn't happen. |
@abdoulline thanks for the comments and suggestions. IMHO there are various options that would work better than the current failure mode:
The LogError is fine - even though likely nobody will see it. |
@fwyzard But the source of the problem was a general HCAL misconfiguration, and (I'd think) it had better be spotted and fixed asap rather than be mitigated somehow on the fly? An empty collection of RecHits would mean a large part of HCAL is out; it would severely alter most of the triggers, I suppose. M0 is a very poor replacement for MAHI in HE (not only the absence of PU mitigation, but it could also induce an energy scale difference), and it uses TS window limits from the DB (so they would need to be re-adjusted on the fly...). Now, if it's not just about stopping the jobs, this issue may need to be discussed in HCAL DPG. |
I agree, but crashing the whole HLT farm is not the right way to detect the problem. I'm happy with any solution that makes it clear the data is bad, but does not require cleaning up about 200 HLT nodes. |
Was this again a phase scan? MAHI on CPU can cope with a shift and also with an extended number of time slices (i.e. from 8 to 10), but in the GPU-CUDA implementation everything is essentially frozen. |
Hi Maria, no, there have been no new instances of the issue since March 24-25 (the HCAL misconfig). So the goal is (1) to not stop the HLT farm, and (2) to detect the problem (make it known) asap if it happens, so HCAL can be reconfigured asap. |
just would like to draw your attention to Maria's suggestion:
|
@abdoulline The current PR is already very big. I would prefer to implement functional changes after integrating the current PR. This will make the validation and integration much easier. But let's keep this improvement in mind for the (near) future. |
just for the record, mahi @ alpaka still crashes:

#!/bin/bash -ex
# List of run numbers
runs=(378366 378369)
# Base directory for input files on EOS
base_dir="/store/group/tsg/FOG/error_stream_root/run"
# Global tag for the HLT configuration
global_tag="140X_dataRun3_HLT_v3"
# EOS command (adjust this if necessary for your environment)
eos_cmd="eos"
# Loop over each run number
for run in "${runs[@]}"; do
# Set the MALLOC_CONF environment variable
# export MALLOC_CONF=junk:true
# Construct the input directory path
input_dir="${base_dir}${run}"
# Find all root files in the input directory on EOS
root_files=$(${eos_cmd} find -f "/eos/cms${input_dir}" -name "*.root" | awk '{print "root://eoscms.cern.ch/" $0}' | paste -sd, -)
# Check if there are any root files found
if [ -z "${root_files}" ]; then
echo "No root files found for run ${run} in directory ${input_dir}."
continue
fi
# Create filenames for the HLT configuration and log file
hlt_config_file="hlt_run${run}.py"
hlt_log_file="hlt_run${run}.log"
# Generate the HLT configuration file
hltGetConfiguration /online/collisions/2024/2e34/v1.4/HLT/V2 \
--globaltag ${global_tag} \
--data \
--eras Run3 \
--l1-emulator uGT \
--l1 L1Menu_Collisions2024_v1_3_0_xml \
--no-prescale \
--no-output \
--max-events -1 \
--input ${root_files} > ${hlt_config_file}
# Append additional options to the configuration file
cat <<@EOF >> ${hlt_config_file}
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
# Run the HLT configuration with cmsRun and redirect output to log file
cmsRun ${hlt_config_file} &> ${hlt_log_file}
done

results in the stack trace shown below.
@kakwok any plans about this? |
Has there been any change of the HCAL configuration for the number of TS in the digi recently?
For reference, the reproducer quoted above results in:
Thread 1 (Thread 0x7fe5a92e5640 (LWP 1447698) "cmsRun"):
#0 0x00007fe5a9ec0ac1 in poll () from /lib64/libc.so.6
#1 0x00007fe5a20660cf in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2 0x00007fe5a201a1ec in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3 0x00007fe5a201a370 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007fe4f1e0072e in alpaka::TaskKernelCpuSerial<std::integral_constant<unsigned long, 3ul>, unsigned int, alpaka_serial_sync::hcal::reconstruction::mahi::Kernel_prep_pulseMatrices_sameNumberOfSamples, float*, float*, float*, hcal::HcalMahiPulseOffsetsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, float*, hcal::HcalPhase1DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalPhase0DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalPhase1DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, signed char*, hcal::HcalMahiConditionsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalRecoParamWithPulseShapeT<alpaka::DevCpu>::ConstView const&, float const&, float const&, float const&, bool const&, float const&, float const&, float const&>::operator()() const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoLocalCaloHcalRecProducersPluginsPortableSerialSync.so
#6 0x00007fe4f1e09388 in alpaka_serial_sync::hcal::reconstruction::runMahiAsync(alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>&, hcal::HcalPhase1DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalPhase0DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalPhase1DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalRecHitSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, true>, hcal::HcalMahiConditionsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalSiPMCharacteristicsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalRecoParamWithPulseShapeT<alpaka::DevCpu>::ConstView const&, hcal::HcalMahiPulseOffsetsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, alpaka_serial_sync::hcal::reconstruction::ConfigParameters const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoLocalCaloHcalRecProducersPluginsPortableSerialSync.so
#7 0x00007fe4f1ddde29 in alpaka_serial_sync::HBHERecHitProducerPortable::produce(alpaka_serial_sync::device::Event&, alpaka_serial_sync::device::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoLocalCaloHcalRecProducersPluginsPortableSerialSync.so
#8 0x00007fe4f1de03d3 in alpaka_serial_sync::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) [clone .lto_priv.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoLocalCaloHcalRecProducersPluginsPortableSerialSync.so
#9 0x00007fe5ac93b4cf in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#10 0x00007fe5ac91fc6c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#11 0x00007fe5ac8a7f69 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#12 0x00007fe5ac8a84d5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#13 0x00007fe5aca5a1d8 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreConcurrency.so
#14 0x00007fe5ab051281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fe5a7cdbe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fe5a7cdbe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#16 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#17 0x00007fe5ac82942b in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#18 0x00007fe5ac83325d in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#19 0x00007fe5ac8337c1 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#20 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#21 0x00007fe5ab03d9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#22 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#23 0x000000000040517c in main ()
Current Modules:
Module: alpaka_serial_sync::HBHERecHitProducerPortable:hltHbheRecoSoASerialSync (crashed)
A fatal system signal has occurred: segmentation violation
I don't know, but just to be clear: this is using old data (from runs 378366-378369) back in March. I think the agreement was to try to protect against it once we have mahi @ alpaka in a release. |
Ah ok, then it's expected. We concluded that was a configuration error, and agreed that protection will be added in the next iteration.
The question is about the plan (timeline) for the next iteration. |
Report of the large number of GPU-related HLT crashes yesterday (elog).
Here's the recipe to reproduce the crashes (tested with CMSSW_14_0_3 on lxplus8-gpu).
Here's the other way to reproduce the crashes:
vi after_menu.py
@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI