PhotonXGBoostProducer related crashes in TSG IB tests #45235

Closed
mmusich opened this issue Jun 16, 2024 · 12 comments · Fixed by #45232

Comments

@mmusich
Contributor

mmusich commented Jun 16, 2024

This issue is meant to better track the discussion at #45085 (comment) and the comments that follow it.

PR #45085 and its backport to CMSSW_14_0_X, #45158, caused instabilities in the TSG IB integration tests.
We have already had several failures concerning HLT_DiphotonMVA14p25_Tight_Mass90_v1:

  • in CMSSW_14_1_X_2024-06-08-1100: log, with a segmentation fault:
Thread 1 (Thread 0x151142b06680 (LWP 3442432) "cmsRun"):
#0  0x0000151140bbaac1 in poll () from /lib64/libc.so.6
#1  0x000015113cbc0657 in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-06-09-0000/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x000015113cbc0854 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-06-09-0000/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00001510e252886b in PhotonXGBoostProducer::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-06-09-0000/lib/el8_amd64_gcc12/pluginRecoEgammaPhotonIdentificationPlugins.so
  • in CMSSW_14_1_X_2024-06-12-1900: log, the trigger path fired on a different number of events when run standalone than when run in the whole menu;
  • in CMSSW_14_1_X_2024-06-13-1100: log, there is a crash with:
----- Begin Fatal Exception 14-Jun-2024 03:24:48 CEST-----------------------
An exception of category 'StdException' occurred while
[0] Processing  Event run: 1 lumi: 49 event: 4820 stream: 1
[1] Running path 'HLT_DiphotonMVA14p25_Tight_Mass90_v1'
[2] Calling method for module PhotonXGBoostProducer/'hltPhotonXGBoostProducer'
Exception Message:
A std::exception was thrown.
Feature is not set: rawEnergy
----- End Fatal Exception -------------------------------------------------   
  • in CMSSW_14_1_X_2024-06-15-1100: log with a crash:
----- Begin Fatal Exception 15-Jun-2024 14:35:19 CEST-----------------------
An exception of category 'StdException' occurred while
[0] Processing  Event run: 1 lumi: 49 event: 4811 stream: 0
[1] Running path 'HLT_DiphotonMVA14p25_Tight_Mass90_v1'
[2] Calling method for module PhotonXGBoostProducer/'hltPhotonXGBoostProducer'
Exception Message:
A std::exception was thrown.
Feature is not set: rawEnergy
----- End Fatal Exception -------------------------------------------------

The pattern is erratic: PR #45085 was merged in CMSSW_14_1_X_2024-06-05-2300, and after that the HLT integration tests succeeded several times before starting to fail (CMSSW_14_1_X_2024-06-05-2300, CMSSW_14_1_X_2024-06-06-2300, CMSSW_14_1_X_2024-06-07-1100, CMSSW_14_1_X_2024-06-07-2300, CMSSW_14_1_X_2024-06-08-0600 and CMSSW_14_1_X_2024-06-08-1100). So far (as of June 16th) there have been no crashes in the CMSSW_14_0_X IB tests.
The following script reproduces the problem in both master (CMSSW_14_1_X) and CMSSW_14_0_X:

#!/bin/bash -ex

jobTag=threads4
hltMenu=/dev/CMSSW_14_0_0/GRun/V141

# Print the TrigReport lines in log $1 that match '0 HLT_DiphotonMVA14p25_Tight_Mass90_v'.
check_log () {
  grep '0 HLT_DiphotonMVA14p25_Tight_Mass90_v' "$1" | grep TrigReport
}

# Copy config $1 to $2.py, append the multithreading settings, run it and check the log.
run(){
  echo "$2"
  cp "$1" "$2".py
  cat <<EOF >> "$2".py

process.options.numberOfThreads = 4
process.options.numberOfStreams = 4

process.hltOutputMinimal.fileName = '${2}.root'
EOF
  cmsRun "${2}".py &> "${2}".log
  check_log "${2}".log
}

# Build the hltGetConfiguration command for the MC GRun menu.
hltGetCmd="hltGetConfiguration ${hltMenu}"
hltGetCmd+=" --globaltag auto:run3_mc_GRun --mc --unprescale --output minimal --max-events -1"
hltGetCmd+=" --input /store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/STORM/debug/150724_xgboost/RelVal_Raw_GRun_MC.root"

# Dump a config with only the affected path, then run it repeatedly (job indices 0 through 30).
configLabel=hlt_"${jobTag}"_onlyDiphotonMVA14p25_Tight_Mass90
${hltGetCmd} --paths HLT_DiphotonMVA14p25_Tight_Mass90_v1 > "${configLabel}".py
for job_i in {0..30}; do run "${configLabel}".py "${configLabel}"_"${job_i}"; done; unset job_i;

This setup crashes about 10% of the time (e.g. 3 times out of 30 attempts).
Possible solutions are discussed at #45085 (comment) and #45085 (comment).
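
For counting how many of the reproducer's runs crashed, a small log scanner can help. The following is only a sketch: the function names are hypothetical, the file pattern assumes the `hlt_threads4_onlyDiphotonMVA14p25_Tight_Mass90_<N>.log` names produced by the script above, and the two crash markers are simply the strings visible in the logs quoted in this issue, so other failure modes may need additional patterns.

```python
from pathlib import Path

# Markers copied from the logs quoted in this issue (an assumption:
# other kinds of failure may not contain either string).
CRASH_MARKERS = ("Begin Fatal Exception", "sig_dostack_then_abort")


def log_crashed(text: str) -> bool:
    """Return True if the log text contains any known crash marker."""
    return any(marker in text for marker in CRASH_MARKERS)


def tally_crashes(log_dir, pattern="hlt_threads4_onlyDiphotonMVA14p25_Tight_Mass90_*.log"):
    """Return (number of logs found, sorted names of logs that show a crash)."""
    logs = sorted(Path(log_dir).glob(pattern))
    crashed = [p.name for p in logs if log_crashed(p.read_text(errors="replace"))]
    return len(logs), crashed


if __name__ == "__main__":
    total, crashed = tally_crashes(".")
    print(f"{len(crashed)} crashed out of {total} logs")
    for name in crashed:
        print("  ", name)
```

Run it from the directory that holds the `.log` files written by the reproducer.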

@mmusich
Contributor Author

mmusich commented Jun 16, 2024

assign hlt, ml

@cmsbuild
Contributor

New categories assigned: hlt,ml

@Martin-Grunewald,@mmusich,@valsdav,@wpmccormack you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

cmsbuild commented Jun 16, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @mmusich.

@Dr15Jones, @smuzaffar, @makortel, @rappoccio, @antoniovilela, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@missirol
Contributor

https://github.com/missirol/cmssw/commits/devel_cmssw45235

8fc7a00 tries to build on #45085 (comment) reducing duplication of code inside XGBooster.

c33681d improves const correctness in PhotonXGBoostEstimator.

With this the reproducer gives 0 crashes in 20 tries.

@mmusich
Contributor Author

mmusich commented Jun 16, 2024

With this the reproducer gives 0 crashes in 20 tries.

OK. I have taken a slightly different approach (that involves more changes to other packages though): mmusich@2abde0d

With this approach I get 0 crashes out of 30 tries, and it also passes scram b runtests:

Pass    1s ... RecoEgamma/PhotonIdentification/RecoEgammaPhotonIdentificationTest
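
As a rough sanity check on how strong "0 crashes in 20 tries" and "0 crashes out of 30 tries" are as evidence, one can ask how often an unfixed build would pass that many runs by luck. The sketch below assumes independent runs with a constant per-run crash probability of 10%, the approximate rate quoted for the reproducer; both the independence assumption and the helper name are illustrative, not something established in this thread.

```python
def prob_zero_crashes(p_crash: float, n_runs: int) -> float:
    """Probability of seeing no crash in n_runs independent runs,
    each crashing with probability p_crash."""
    return (1.0 - p_crash) ** n_runs


# Assumed crash rate of ~10%, as quoted for the reproducer above.
for n in (20, 30):
    print(f"{n} runs: {prob_zero_crashes(0.10, n):.3f}")
# prints:
# 20 runs: 0.122
# 30 runs: 0.042
```

Under these assumptions, a clean set of 30 runs is somewhat stronger evidence than a clean set of 20, but neither excludes a residual low-rate failure.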

@missirol
Contributor

Okay. Just to say that any of these solutions is okay with me. I leave it to experts to open PRs.

@mmusich
Contributor Author

mmusich commented Jun 16, 2024

Okay. Just to say that any of these solutions is okay with me. I leave it to experts to open PRs.

I have repurposed #45232 with the commits at #45235 (comment)

@mmusich
Contributor Author

mmusich commented Jun 17, 2024

I have repurposed #45232 with the commits at #45235 (comment)

#45237 is a backport of #45232 for data-taking purposes.

@mmusich
Contributor Author

mmusich commented Jun 21, 2024

+hlt

@valsdav
Contributor

valsdav commented Jun 21, 2024

+ml

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.
