HLT farm crash in run 381543 #45136

Closed
mmusich opened this issue Jun 4, 2024 · 23 comments

@mmusich
Contributor

mmusich commented Jun 4, 2024

Reporting the HLT farm crashes in run 381543.

To reproduce:

(to reproduce offline it is important to run on lxplus901, as the CPU micro-architecture matters)

cmsrel CMSSW_14_0_7_patch1_MULTIARCHS
cd CMSSW_14_0_7_patch1_MULTIARCHS/src
cmsenv
#!/bin/bash -ex

# CMSSW_14_0_7_patch1

hltGetConfiguration run:381543 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input \
/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0024_index000154_fu-c2b14-19-01_pid630325.root,\
/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0056_index000226_fu-c2b14-19-01_pid630264.root,\
/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0066_index000018_fu-c2b14-05-01_pid306574.root,\
/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0152_index000306_fu-c2b14-39-01_pid629490.root,\
/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0152_index000326_fu-c2b14-39-01_pid629490.root,\
/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0152_index000345_fu-c2b14-39-01_pid629490.root \
  > hlt.py
  
cat <<@EOF >> hlt.py
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

results in:

2024-06-04 18:19:19.053636: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: scale must have the same number of elements as the channels of x, got 80 and 31
	 [[{{node cnn_model/StatefulPartitionedCall/StatefulPartitionedCall/batch_normalization_CNN1x1_0/FusedBatchNormV3}}]]
----- Begin Fatal Exception 04-Jun-2024 18:19:19 CEST-----------------------
An exception of category 'InvalidRun' occurred while
   [0] Processing  Event run: 381543 lumi: 24 event: 22910599 stream: 0
   [1] Running path 'HLT_VBF_DiPFJet45_Mjj750_PNetTauhPFJet45_L2NN_eta2p3_v3'
   [2] Calling method for module L2TauNNProducerAlpaka/'hltL2TauTagNNProducer'
Exception Message:
error while running session: INVALID_ARGUMENT: scale must have the same number of elements as the channels of x, got 80 and 31
	 [[{{node cnn_model/StatefulPartitionedCall/StatefulPartitionedCall/batch_normalization_CNN1x1_0/FusedBatchNormV3}}]]
----- End Fatal Exception -------------------------------------------------

This looks reminiscent of #44333.
As additional information, it looks like the crashes are happening only on the new HLT nodes, which have a different CPU micro-architecture where the AVX512F and AVX512_VNNI instructions are present.
I tested that:

  • on lxplus8-gpu with AMD EPYC 7313 16-Core Processor it doesn't crash
  • on lxplus901 with Intel Xeon Processor (Icelake) it does crash
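
To check quickly whether a given host exposes these instruction sets, one option is to look at the `flags` line of `/proc/cpuinfo`. The sketch below is only an illustration: the function names are made up for this example, it assumes a Linux host, and it uses the kernel's flag spellings `avx512f` and `avx512_vnni`. It reports only whether the flags are present; it says nothing about whether they cause the crash.

```cpp
#include <fstream>
#include <set>
#include <sstream>
#include <string>

// Split a /proc/cpuinfo "flags : ..." line into its individual flag tokens.
// If the line has no colon, the whole line is treated as the flag list.
std::set<std::string> parseCpuFlags(const std::string& flagsLine) {
  std::set<std::string> flags;
  const auto colon = flagsLine.find(':');
  std::istringstream tokens(colon == std::string::npos ? flagsLine : flagsLine.substr(colon + 1));
  for (std::string flag; tokens >> flag;) {
    flags.insert(flag);
  }
  return flags;
}

// True only if both flags mentioned in this report are present.
bool hasAvx512Vnni(const std::set<std::string>& flags) {
  return flags.count("avx512f") > 0 && flags.count("avx512_vnni") > 0;
}

// Read the first "flags" line of /proc/cpuinfo (Linux only).
// Returns an empty set if the file or the line is not available.
std::set<std::string> readHostCpuFlags() {
  std::ifstream cpuinfo("/proc/cpuinfo");
  for (std::string line; std::getline(cpuinfo, line);) {
    if (line.rfind("flags", 0) == 0) {
      return parseCpuFlags(line);
    }
  }
  return {};
}
```

Calling `hasAvx512Vnni(readHostCpuFlags())` on each of the two machines above would show whether they differ in these flags.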

FYI: @cms-sw/hlt-l2 @trocino @mzarucki @trtomei

@cmsbuild
Contributor

cmsbuild commented Jun 4, 2024

cms-bot internal usage

@cmsbuild
Contributor

cmsbuild commented Jun 4, 2024

A new Issue was created by @mmusich.

@rappoccio, @smuzaffar, @antoniovilela, @Dr15Jones, @makortel, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@mmusich
Contributor Author

mmusich commented Jun 4, 2024

@brallmond FYI

@mmusich
Contributor Author

mmusich commented Jun 4, 2024

type tau

@cmsbuild cmsbuild added the tau label Jun 4, 2024
@makortel
Contributor

makortel commented Jun 4, 2024

assign package RecoTauTag/HLTProducers

@cmsbuild
Contributor

cmsbuild commented Jun 4, 2024

New categories assigned: hlt

@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

makortel commented Jun 4, 2024

Similar error was seen earlier in #44333 (comment)

@mmusich
Contributor Author

mmusich commented Jun 4, 2024

assign ml

@cmsbuild
Contributor

cmsbuild commented Jun 4, 2024

New categories assigned: ml

@valsdav,@wpmccormack you have been requested to review this Pull request/Issue and eventually sign? Thanks

@valsdav
Contributor

valsdav commented Jun 4, 2024

Thanks for the reproducer @mmusich, I can have a look at the TF inputs.

@mmusich
Contributor Author

mmusich commented Jun 4, 2024

Do we need the same protections as #44455 in RecoTauTag/HLTProducers/src/L2TauTagNNProducerAlpaka.cc? (Suggestion from @missirol.)

@valsdav
Contributor

valsdav commented Jun 4, 2024

I checked and this is indeed the case: at https://github.com/cms-sw/cmssw/blob/master/RecoTauTag/HLTProducers/src/L2TauTagNNProducerAlpaka.cc#L735, the inference is called without checking whether the input is empty (nTau == 0).
I have a general fix to TensorFlow code here: I can prepare a PR tomorrow morning, as it helps protect us against this kind of problem.

This change patches the problem:

diff --git a/RecoTauTag/HLTProducers/src/L2TauTagNNProducerAlpaka.cc b/RecoTauTag/HLTProducers/src/L2TauTagNNProducerAlpaka.cc
index 9772366c6b2..91c5ceea6be 100644
--- a/RecoTauTag/HLTProducers/src/L2TauTagNNProducerAlpaka.cc
+++ b/RecoTauTag/HLTProducers/src/L2TauTagNNProducerAlpaka.cc
@@ -732,14 +732,18 @@ void L2TauNNProducerAlpaka::fillPatatracks(tensorflow::Tensor& cellGridMatrix,
 
 std::vector<float> L2TauNNProducerAlpaka::getTauScore(const tensorflow::Tensor& cellGridMatrix) {
-  std::vector<tensorflow::Tensor> pred_tensor;
-  tensorflow::run(L2cacheData_->session, {{inputTensorName_, cellGridMatrix}}, {outputTensorName_}, &pred_tensor);
   const int nTau = cellGridMatrix.shape().dim_size(0);
-  std::vector<float> pred_vector(nTau);
-  for (int tau_idx = 0; tau_idx < nTau; ++tau_idx) {
-    pred_vector[tau_idx] = pred_tensor[0].matrix<float>()(tau_idx, 0);
+  if (nTau == 0) {
+      return std::vector<float>();
+  }else{
+    std::vector<tensorflow::Tensor> pred_tensor;
+    tensorflow::run(L2cacheData_->session, {{inputTensorName_, cellGridMatrix}}, {outputTensorName_}, &pred_tensor);
+    std::vector<float> pred_vector(nTau);
+    for (int tau_idx = 0; tau_idx < nTau; ++tau_idx) {
+      pred_vector[tau_idx] = pred_tensor[0].matrix<float>()(tau_idx, 0);
+    }
+    
+    return pred_vector;
   }
-
-  return pred_vector;
 }

Should I open a PR for this @mmusich ?
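
The guard in the diff above can be illustrated without TensorFlow. The following is a minimal sketch under stated assumptions: `Batch`, `InferenceFn`, and `getScoresGuarded` are hypothetical names invented for this example and are not part of CMSSW or TensorFlow. The point is only the early return, so that the model is never invoked on a batch with zero rows.

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-in for the network input: one row per tau candidate.
using Batch = std::vector<std::vector<float>>;

// Hypothetical stand-in for the inference call: returns one score per row.
using InferenceFn = std::function<std::vector<float>(const Batch&)>;

// Return one score per candidate, skipping the inference call entirely
// when there are no candidates (the nTau == 0 case guarded in the diff).
std::vector<float> getScoresGuarded(const Batch& batch, const InferenceFn& runInference) {
  if (batch.empty()) {
    return {};  // nothing to evaluate, so do not touch the model
  }
  return runInference(batch);
}
```

With this shape the caller needs no special case: an empty batch simply yields an empty score vector.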

@mmusich
Contributor Author

mmusich commented Jun 4, 2024

@valsdav, thanks for looking into this.

Should I open a PR ?

If your more general fix to the TF interface protects against this as well, then we should probably use that instead of patching client by client.
Let me note that the L2TauTagNNProducerAlpaka could be a derived class from a template, to avoid code duplication with L2TauTagNNProducer.
That's something for @cms-sw/tau-pog-l2 to consider.
Finally, let me add that while a fix is highly desirable, we're entering a technical stop, so we don't need to push a hasty patch to avoid crashes online; we have a bit of time for a better solution.

@valsdav
Contributor

valsdav commented Jun 5, 2024

I still think that the TF patch should be a safety net to avoid crashes, but that the clients should check for and avoid processing empty inputs. I can open a separate issue to track the "empty input protection" problem and list the packages that may be affected. In the meantime, the TF PR is coming.

@brallmond
Contributor

Hello, commenting from the Tau side as advised in the TSG meeting.

I would be in favor of having both the general protection (TF patch) that valsdav has opened another issue to implement, as well as the specific guards that were implemented previously for the DeepTau module. I think it makes sense to add the guards to the L2NN since they have worked well in the DeepTau module. If I understand correctly, neither of those sets of guards will be necessary once the TF patch is merged, but they won't hurt to have in place.

Thanks all for addressing the issue quickly.

@Martin-Grunewald
Contributor

@brallmond
Indeed. Please provide L2NN PRs for 14_1 and 14_0.

@valsdav
We'd also need a TF backport to 14_0.

@missirol
Contributor

missirol commented Jun 5, 2024

For the record, this issue led to 10 HLT crashes in run-381543 and 29 HLT crashes in run-381544. With the corresponding error files, we verified that using #45145 there are no crashes in these events [*].

I understand both protections will be implemented. Certainly, HLT needs to deploy online a new release with at least one of these protections before the end of the current LHC stop (so, before Jun ~15).

[*]

#!/bin/bash -ex

# CMSSW_14_0_7_patch2_MULTIARCHS

hltGetConfiguration run:381543 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input \
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0024_index000154_fu-c2b14-19-01_pid630325.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0056_index000226_fu-c2b14-19-01_pid630264.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0066_index000018_fu-c2b14-05-01_pid306574.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0152_index000306_fu-c2b14-39-01_pid629490.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0152_index000326_fu-c2b14-39-01_pid629490.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0152_index000345_fu-c2b14-39-01_pid629490.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0229_index000038_fu-c2b14-19-01_pid630079.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0229_index000055_fu-c2b14-19-01_pid630079.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0269_index000056_fu-c2b14-17-01_pid587667.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0269_index000108_fu-c2b14-17-01_pid587667.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0274_index000072_fu-c2b14-21-01_pid586556.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0274_index000097_fu-c2b14-21-01_pid586556.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0313_index000199_fu-c2b05-22-01_pid3462225.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0313_index000305_fu-c2b05-22-01_pid3462225.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0313_index000322_fu-c2b05-22-01_pid3462225.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0383_index000005_fu-c2b14-17-01_pid587644.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0383_index000006_fu-c2b14-17-01_pid587644.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0383_index000096_fu-c2b14-17-01_pid587644.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0437_index000170_fu-c2b14-43-01_pid628778.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0437_index000177_fu-c2b14-43-01_pid628778.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0502_index000042_fu-c2b14-33-01_pid627667.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0073_index000147_fu-c2b14-07-01_pid723678.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0073_index000396_fu-c2b14-25-01_pid624532.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0115_index000043_fu-c2b14-07-01_pid723823.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0115_index000064_fu-c2b14-07-01_pid723823.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0115_index000078_fu-c2b14-07-01_pid723823.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0178_index000310_fu-c2b14-17-01_pid626159.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0180_index000211_fu-c2b14-35-01_pid665686.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0187_index000409_fu-c2b14-15-01_pid667599.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0216_index000061_fu-c2b14-39-01_pid668710.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0216_index000109_fu-c2b14-39-01_pid668710.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0216_index000110_fu-c2b14-39-01_pid668710.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0272_index000144_fu-c2b14-43-01_pid675712.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0272_index000149_fu-c2b14-43-01_pid675712.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0273_index000030_fu-c2b14-37-01_pid667292.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0298_index000217_fu-c2b14-13-01_pid671154.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0298_index000221_fu-c2b14-13-01_pid671154.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0303_index000287_fu-c2b14-13-01_pid670560.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0303_index000318_fu-c2b14-13-01_pid670560.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0339_index000217_fu-c2b14-43-01_pid675735.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0339_index000237_fu-c2b14-43-01_pid675735.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0520_index000139_fu-c2b14-13-01_pid670950.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0744_index000034_fu-c2b14-43-01_pid676152.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0744_index000093_fu-c2b14-43-01_pid676152.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0799_index000298_fu-c2b14-19-01_pid669452.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0837_index000123_fu-c2b14-37-01_pid667329.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0837_index000133_fu-c2b14-37-01_pid667329.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0842_index000113_fu-c2b14-17-01_pid625748.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0842_index000124_fu-c2b14-17-01_pid625748.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0865_index000035_fu-c2b14-09-01_pid742325.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0957_index000254_fu-c2b14-41-01_pid624662.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1059_index000063_fu-c2b14-23-01_pid666512.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1059_index000067_fu-c2b14-23-01_pid666512.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1124_index000173_fu-c2b14-23-01_pid666558.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1371_index000089_fu-c2b14-11-01_pid736600.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1431_index000139_fu-c2b14-11-01_pid736723.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1459_index000206_fu-c2b14-15-01_pid667534.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1459_index000238_fu-c2b14-15-01_pid667534.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1559_index000104_fu-c2b14-07-01_pid732989.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1559_index000111_fu-c2b14-07-01_pid732989.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1584_index000066_fu-c2b14-23-01_pid730878.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1700_index000149_fu-c2b14-19-01_pid669200.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1910_index000060_fu-c2b14-17-01_pid626082.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1910_index000073_fu-c2b14-17-01_pid626082.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1916_index000196_fu-c2b14-19-01_pid669161.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls2174_index000141_fu-c2b14-11-01_pid737084.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls2174_index000145_fu-c2b14-11-01_pid737084.root \
  > hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

@mmusich
Contributor Author

mmusich commented Jun 5, 2024

Indeed. Please provide L2NN PRs for 14_1 and 14_0.

to speed up things (even if IMHO they're not really so necessary) I created:

and tested explicitly that the setup at #45136 (comment) doesn't crash for any of the error stream files for run-381543 and run-381544.

@mmusich
Contributor Author

mmusich commented Jun 8, 2024

The following fixes were implemented:

all of them are merged and will be available in the next CMSSW_14_0_X release.

@mmusich
Contributor Author

mmusich commented Jun 8, 2024

+hlt

@valsdav
Contributor

valsdav commented Jun 8, 2024

+ml

@cmsbuild
Contributor

cmsbuild commented Jun 8, 2024

This issue is fully signed and ready to be closed.

@mmusich
Contributor Author

mmusich commented Jun 8, 2024

@cmsbuild, please close

@cmsbuild cmsbuild closed this as completed Jun 8, 2024