HLT farm crash in run 381543 #45136
A new Issue was created by @mmusich. @rappoccio, @smuzaffar, @antoniovilela, @Dr15Jones, @makortel, @sextonkennedy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
@brallmond FYI |
type tau |
assign package RecoTauTag/HLTProducers |
New categories assigned: hlt @Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Similar error was seen earlier in #44333 (comment) |
assign ml |
New categories assigned: ml @valsdav,@wpmccormack you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Thanks for the reproducer @mmusich, I can have a look at the TF inputs. |
I checked and this is indeed the case: at this point, https://github.com/cms-sw/cmssw/blob/master/RecoTauTag/HLTProducers/src/L2TauTagNNProducerAlpaka.cc#L735, there is a call to the inference without checking the input. This change patches the problem:
Should I open a PR for this @mmusich ? |
@valsdav, thanks for looking into this.
If your more general fix to the TF interface protects against this as well, then we should probably use that instead of patching client by client. |
I still think that the TF patch should be a safety net to avoid crashes, but that the clients should check for and avoid processing empty inputs. I can open a separate issue to track the "empty input protection" problem and list the packages that may be affected. In the meantime, the TF PR is coming. |
Hello, commenting from the Tau side as advised in the TSG meeting. I would be in favor of having both the general protection (TF patch) that valsdav has opened another issue to implement, as well as the specific guards that were implemented previously for the DeepTau module. I think it makes sense to add the guards to the L2NN since they have worked well in the DeepTau module. If I understand correctly, neither of those sets of guards will be necessary once the TF patch is merged, but they won't hurt to have in place. Thanks all for addressing the issue quickly. |
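To make the two layers discussed above concrete, here is a minimal Python sketch. It is an illustration only and not the actual CMSSW patch (the real code is C++): `fragile_backend`, `safe_run` and `produce_scores` are hypothetical names, and the backend is a stand-in that fails on empty input the way the inference call reportedly does.

```python
from typing import Callable, List, Sequence

Batch = Sequence[Sequence[float]]


def fragile_backend(batch: Batch) -> List[float]:
    """Hypothetical stand-in for an inference call that fails on empty input."""
    if len(batch) == 0:
        raise RuntimeError("empty input tensor")
    return [sum(row) for row in batch]


def safe_run(run_inference: Callable[[Batch], List[float]], batch: Batch) -> List[float]:
    """General safety net: never hand an empty batch to the backend."""
    if len(batch) == 0:
        return []
    return run_inference(batch)


def produce_scores(candidates: Batch,
                   run_inference: Callable[[Batch], List[float]]) -> List[float]:
    """Client-side guard: skip the inference step when there are no candidates."""
    if not candidates:
        return []
    return safe_run(run_inference, candidates)


if __name__ == "__main__":
    print(produce_scores([], fragile_backend))
    print(produce_scores([[1.0, 2.0], [3.0]], fragile_backend))
```

With both checks in place, either one alone is enough to avoid the failure; the client-side check additionally avoids preparing inputs that would never be used.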
@brallmond @valsdav |
For the record, this issue led to 10 HLT crashes in run-381543 and 29 HLT crashes in run-381544. With the corresponding error files, we verified that using #45145 there are no crashes in these events [*]. I understand both protections will be implemented. Certainly, HLT needs to deploy online a new release with at least one of these protections before the end of the current LHC stop (so, before ~Jun 15).
[*]
#!/bin/bash -ex
# CMSSW_14_0_7_patch2_MULTIARCHS
hltGetConfiguration run:381543 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input \
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0024_index000154_fu-c2b14-19-01_pid630325.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0056_index000226_fu-c2b14-19-01_pid630264.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0066_index000018_fu-c2b14-05-01_pid306574.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0152_index000306_fu-c2b14-39-01_pid629490.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0152_index000326_fu-c2b14-39-01_pid629490.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0152_index000345_fu-c2b14-39-01_pid629490.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0229_index000038_fu-c2b14-19-01_pid630079.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0229_index000055_fu-c2b14-19-01_pid630079.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0269_index000056_fu-c2b14-17-01_pid587667.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0269_index000108_fu-c2b14-17-01_pid587667.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0274_index000072_fu-c2b14-21-01_pid586556.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0274_index000097_fu-c2b14-21-01_pid586556.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0313_index000199_fu-c2b05-22-01_pid3462225.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0313_index000305_fu-c2b05-22-01_pid3462225.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0313_index000322_fu-c2b05-22-01_pid3462225.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0383_index000005_fu-c2b14-17-01_pid587644.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0383_index000006_fu-c2b14-17-01_pid587644.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0383_index000096_fu-c2b14-17-01_pid587644.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0437_index000170_fu-c2b14-43-01_pid628778.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0437_index000177_fu-c2b14-43-01_pid628778.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0502_index000042_fu-c2b14-33-01_pid627667.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0073_index000147_fu-c2b14-07-01_pid723678.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0073_index000396_fu-c2b14-25-01_pid624532.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0115_index000043_fu-c2b14-07-01_pid723823.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0115_index000064_fu-c2b14-07-01_pid723823.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0115_index000078_fu-c2b14-07-01_pid723823.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0178_index000310_fu-c2b14-17-01_pid626159.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0180_index000211_fu-c2b14-35-01_pid665686.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0187_index000409_fu-c2b14-15-01_pid667599.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0216_index000061_fu-c2b14-39-01_pid668710.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0216_index000109_fu-c2b14-39-01_pid668710.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0216_index000110_fu-c2b14-39-01_pid668710.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0272_index000144_fu-c2b14-43-01_pid675712.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0272_index000149_fu-c2b14-43-01_pid675712.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0273_index000030_fu-c2b14-37-01_pid667292.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0298_index000217_fu-c2b14-13-01_pid671154.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0298_index000221_fu-c2b14-13-01_pid671154.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0303_index000287_fu-c2b14-13-01_pid670560.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0303_index000318_fu-c2b14-13-01_pid670560.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0339_index000217_fu-c2b14-43-01_pid675735.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0339_index000237_fu-c2b14-43-01_pid675735.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0520_index000139_fu-c2b14-13-01_pid670950.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0744_index000034_fu-c2b14-43-01_pid676152.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0744_index000093_fu-c2b14-43-01_pid676152.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0799_index000298_fu-c2b14-19-01_pid669452.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0837_index000123_fu-c2b14-37-01_pid667329.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0837_index000133_fu-c2b14-37-01_pid667329.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0842_index000113_fu-c2b14-17-01_pid625748.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0842_index000124_fu-c2b14-17-01_pid625748.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0865_index000035_fu-c2b14-09-01_pid742325.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0957_index000254_fu-c2b14-41-01_pid624662.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1059_index000063_fu-c2b14-23-01_pid666512.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1059_index000067_fu-c2b14-23-01_pid666512.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1124_index000173_fu-c2b14-23-01_pid666558.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1371_index000089_fu-c2b14-11-01_pid736600.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1431_index000139_fu-c2b14-11-01_pid736723.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1459_index000206_fu-c2b14-15-01_pid667534.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1459_index000238_fu-c2b14-15-01_pid667534.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1559_index000104_fu-c2b14-07-01_pid732989.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1559_index000111_fu-c2b14-07-01_pid732989.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1584_index000066_fu-c2b14-23-01_pid730878.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1700_index000149_fu-c2b14-19-01_pid669200.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1910_index000060_fu-c2b14-17-01_pid626082.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1910_index000073_fu-c2b14-17-01_pid626082.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1916_index000196_fu-c2b14-19-01_pid669161.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls2174_index000141_fu-c2b14-11-01_pid737084.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls2174_index000145_fu-c2b14-11-01_pid737084.root \
> hlt.py
cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
cmsRun hlt.py &> hlt.log
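For bookkeeping over a list like the one above, a small Python sketch that splits each error-stream file name into its parts may help. It assumes the names follow the pattern run<run>_ls<n>_index<n>_<host>_pid<pid>.root that the list appears to use; the meaning of each field is inferred from the names and is not confirmed in this thread, and `parse_error_file` and `files_per_run` are hypothetical helpers.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Union

# Assumed naming scheme, inferred from the file list in this thread.
PATTERN = re.compile(
    r"run(?P<run>\d+)_ls(?P<ls>\d+)_index(?P<index>\d+)"
    r"_(?P<host>fu-[\w-]+)_pid(?P<pid>\d+)\.root$"
)


def parse_error_file(path: str) -> Dict[str, Union[int, str]]:
    """Split one error-stream path into run, ls, index, host and pid."""
    # Drop the shell continuation characters (space, comma, backslash, newline).
    cleaned = path.strip(" ,\\\n")
    match = PATTERN.search(cleaned)
    if match is None:
        raise ValueError(f"unexpected file name: {path!r}")
    fields = match.groupdict()
    return {
        "run": int(fields["run"]),
        "ls": int(fields["ls"]),
        "index": int(fields["index"]),
        "host": fields["host"],
        "pid": int(fields["pid"]),
    }


def files_per_run(paths: Iterable[str]) -> Counter:
    """Count how many error-stream files belong to each run."""
    return Counter(parse_error_file(p)["run"] for p in paths)


if __name__ == "__main__":
    base = "root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root"
    sample = [
        base + "/run381543/run381543_ls0024_index000154_fu-c2b14-19-01_pid630325.root,\\",
        base + "/run381544/run381544_ls0073_index000147_fu-c2b14-07-01_pid723678.root,\\",
    ]
    for path in sample:
        print(parse_error_file(path))
    print(files_per_run(sample))
```

Note that the number of files per run need not equal the number of crashes quoted above, since one process can leave several files.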
To speed things up (even if IMHO they're not really necessary), I created:
and tested explicitly that the setup at #45136 (comment) doesn't crash for any of the error stream files for run-381543 and run-381544. |
The following fixes were implemented:
all of them are merged and will be available in the next CMSSW_14_0_X release. |
+ml |
This issue is fully signed and ready to be closed. |
@cmsbuild, please close |
Reporting the HLT farm crashes in run 381543.
To reproduce:
(To reproduce offline, it is important to run on lxplus901, as the CPU micro-architecture matters.)
Results in:
This looks reminiscent of #44333.
As additional information, it looks like the crashes are happening only on the new HLT nodes, which have a different CPU micro-architecture where the AVX512F and AVX512_VNNI instructions are present.
I tested that:
on lxplus8-gpu, with AMD EPYC 7313 16-Core Processor, it doesn't crash
on lxplus901, with Intel Xeon Processor (Icelake), it does crash
FYI: @cms-sw/hlt-l2 @trocino @mzarucki @trtomei
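To check whether a given node belongs to the affected class before trying the reproducer, something like the following Python sketch can be used. It assumes an x86 Linux machine where /proc/cpuinfo has a "flags" line listing the features as `avx512f` and `avx512_vnni`; `cpu_flags` and `report` are hypothetical helper names.

```python
from pathlib import Path
from typing import Dict, Set

# Lower-case names as they appear on the "flags" line of /proc/cpuinfo on x86 Linux.
WANTED = ("avx512f", "avx512_vnni")


def cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the feature flags of the first CPU listed in the given text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def report(cpuinfo_text: str) -> Dict[str, bool]:
    """Map each instruction set of interest to whether the CPU advertises it."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in WANTED}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(report(cpuinfo.read_text()))
    else:
        print("no /proc/cpuinfo on this system")
```

A node where both entries are reported as present would match the description of the machines on which the crash was seen; this only checks the advertised flags and says nothing about the crash itself.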