HLT crashes in Run 382461 #45312

Closed
trtomei opened this issue Jun 26, 2024 · 14 comments

trtomei commented Jun 26, 2024

Crashes observed in collisions Run 382461. Error message:

----- Begin Fatal Exception 26-Jun-2024 14:33:41 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 382461 lumi: 2 event: 4698821 stream: 0
   [1] Running path 'DQM_EcalReconstruction_v10'
   [2] Calling method for module EcalUncalibRecHitProducerPortable@alpaka/'hltEcalUncalibRecHitSoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_9_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_9_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp(259) 'TApi::setDevice(queue.m_spQueueImpl->m_dev.getNativeHandle())' A previous API call (not this one) set the error  : 'cudaErrorInvalidConfiguration': 'invalid configuration argument'!
----- End Fatal Exception -------------------------------------------------

Reproducer:

#!/bin/bash -ex

# CMSSW_14_0_9_patch1_MULTIARCHS

hltGetConfiguration run:382461 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input \
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000928.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000929.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000930.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000931.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000932.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000933.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000934.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000935.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000936.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000937.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000938.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000939.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000940.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000941.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000942.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000943.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000944.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000945.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000946.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000947.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000948.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000949.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000950.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000951.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000952.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000953.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000954.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000955.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000956.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000957.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000958.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000959.root > hlt.py
  
cat <<@EOF >> hlt.py
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

Note that this run had no ECAL barrel, only part of the endcap. @fwyzard has pointed out that this is probably related: the protection we implemented for empty ECAL events checks the total size, but one kernel is barrel-only.
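
The suspected failure mode can be illustrated with a minimal Python sketch. It is not the CMSSW or alpaka code: the function names (`blocks_needed`, `launch`, `run_total_size_guard`, `run_per_kernel_guard`), the kernel labels, and the block size of 32 are all hypothetical. The only assumption carried over from the GPU side is that a launch with zero blocks is rejected, which CUDA reports as 'invalid configuration argument'.

```python
# Illustrative sketch only; every name here is hypothetical, not CMSSW/alpaka API.

def blocks_needed(n_items: int, threads_per_block: int) -> int:
    """Ceiling division: number of blocks needed to cover n_items."""
    return (n_items + threads_per_block - 1) // threads_per_block


def launch(name: str, n_items: int, threads_per_block: int = 32) -> str:
    """Stand-in for a GPU kernel launch; an empty grid is rejected."""
    blocks = blocks_needed(n_items, threads_per_block)
    if blocks == 0:
        raise ValueError(f"{name}: invalid configuration argument (0 blocks)")
    return f"{name}: {blocks} block(s)"


def run_total_size_guard(n_barrel: int, n_endcap: int) -> list:
    """Guard on the total size only: the barrel-only launch is unprotected."""
    if n_barrel + n_endcap == 0:
        return []  # fully empty event: skip everything
    return [
        launch("barrel_only", n_barrel),  # fails when n_barrel == 0
        launch("all_channels", n_barrel + n_endcap),
    ]


def run_per_kernel_guard(n_barrel: int, n_endcap: int) -> list:
    """Guard each launch on the size that launch actually uses."""
    launched = []
    if n_barrel > 0:
        launched.append(launch("barrel_only", n_barrel))
    if n_barrel + n_endcap > 0:
        launched.append(launch("all_channels", n_barrel + n_endcap))
    return launched
```

With no barrel channels and some endcap channels, `run_total_size_guard(0, 100)` passes the total-size check and then raises on the barrel-only launch, whereas `run_per_kernel_guard(0, 100)` skips that launch and runs the remaining one.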

Best regards,
Thiago (for FOG)

cmsbuild commented Jun 26, 2024

cms-bot internal usage

@cmsbuild

A new Issue was created by @trtomei.

@Dr15Jones, @antoniovilela, @makortel, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mmusich commented Jun 26, 2024

assign hlt, reconstruction, heterogeneous

mmusich commented Jun 26, 2024

type ecal

mmusich commented Jun 26, 2024

@cms-sw/ecal-dpg-l2 FYI

@cmsbuild

New categories assigned: hlt,reconstruction,heterogeneous

@Martin-Grunewald,@mmusich,@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

fwyzard commented Jun 26, 2024

Should be fixed by #45311 (14.1.x) / #45313 (14.0.x) / #45314 (14.0.9-patchX).

mmusich commented Jun 26, 2024

FWIW I confirm that:

cmsrel CMSSW_14_0_9_patch1_MULTIARCHS
cd CMSSW_14_0_9_patch1_MULTIARCHS/src/
git cms-init
cmsenv
git cms-addpkg RecoLocalCalo/EcalRecProducers
git remote add fwyzard git@github.com:fwyzard/cmssw.git; git fetch fwyzard
git cherry-pick d0f844fb548ac5bd7f8ee6b5daa6476809cb4033
scram b -j 20

tested with the reproducer at #45312 (comment) leads to no crashes.

mmusich commented Jun 27, 2024

mmusich commented Jun 27, 2024

+hlt

jfernan2 commented Jul 2, 2024

+1

makortel commented Jul 9, 2024

+heterogeneous

makortel commented Jul 9, 2024

@cmsbuild, please close

cmsbuild commented Jul 9, 2024

This issue is fully signed and ready to be closed.
