Restore TF and MXNet-based inference for DeepJet, DeepDoubleX and DeepAK8 #29172

Closed
wants to merge 1 commit into from

hqucms commented Mar 10, 2020

PR description:

This PR addresses #28959. The TF- and MXNet-based inference for DeepJet, DeepDoubleX and DeepAK8 is restored and is intended for architectures that ONNXRuntime does not support (e.g., PowerPC). The ONNXRuntime-based inference introduced in #28112 is kept for x86 and ARM for its better speed and lower memory cost. The switch between the two types of producers is implemented with SwitchProducer, by detecting the SCRAM_ARCH.

Needs cms-data/RecoBTag-Combined#27 (updated TF models to remove the training-only nodes; speed up the TF-based inference by 5-10%).

PR validation:

Tested in a JetHT2017D NanoAOD workflow and obtained consistent outputs between the TF/MXNet and the ONNXRuntime backends for affected taggers.

And enable ONNXRuntime for x86/arm only.
@cmsbuild

The code-checks are being triggered in jenkins.

@cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-29172/14144

  • This PR adds an extra 56KB to repository

  • There are other open Pull requests which might conflict with changes you have proposed:

@cmsbuild

A new Pull Request was created by @hqucms (Huilin Qu) for master.

It involves the following packages:

PhysicsTools/ONNXRuntime
PhysicsTools/PatAlgos
PhysicsTools/TensorFlow
RecoBTag/FeatureTools
RecoBTag/ONNXRuntime
RecoBTag/TensorFlow

@perrotta, @cmsbuild, @santocch, @slava77 can you please review it and eventually sign? Thanks.
@rappoccio, @gouskos, @hatakeyamak, @emilbols, @peruzzim, @seemasharmafnal, @mmarionncern, @JyothsnaKomaragiri, @makortel, @smoortga, @jdolen, @ferencek, @jdamgov, @nhanvtran, @gkasieczka, @schoef, @andrzejnovak, @clelange, @riga, @ahinzmann, @mverzett, @gpetruc, @mariadalfonso this is something you requested to watch as well.
@davidlange6, @silviodonato, @fabiocos you are the release manager for this.

cms-bot commands are listed here

slava77 commented Mar 10, 2020

> Needs cms-data/RecoBTag-Combined#27 (updated TF models to remove the training-only nodes; speed up the TF-based inference by 5-10%).

is this "nice to have" or really required at run time?

hqucms commented Mar 10, 2020

> Needs cms-data/RecoBTag-Combined#27 (updated TF models to remove the training-only nodes; speed up the TF-based inference by 5-10%).

> is this "nice to have" or really required at run time?

It is required (to test the TF-based part, e.g., for PPC). For x86/ARM we use ONNX so it's not needed.

slava77 commented Mar 10, 2020

@cmsbuild please test for slc7_ppc64le_gcc820

cmsbuild commented Mar 10, 2020

The tests are being triggered in jenkins.
Tested with other pull request(s) cms-data/RecoBTag-Combined#27
Test Parameters:

hqucms commented Mar 10, 2020

One thing w/ the current implementation is that the SwitchProducer mechanism actually initializes both the TF and the ONNX producers and this creates some overhead (i.e., both the TF and the ONNX DNN models will be loaded as this is done in initializeGlobalCache). Is there a way to avoid this?

slava77 commented Mar 10, 2020

> One thing w/ the current implementation is that the SwitchProducer mechanism actually initializes both the TF and the ONNX producers and this creates some overhead (i.e., both the TF and the ONNX DNN models will be loaded as this is done in initializeGlobalCache). Is there a way to avoid this?

@makortel @Dr15Jones
do you know?

@cmsbuild

-1

Tested at: d8951bb

CMSSW: CMSSW_11_1_X_2020-03-09-2300
SCRAM_ARCH: slc7_ppc64le_gcc820

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-323b31/5117/git-log-recent-commits
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-323b31/5117/git-merge-result

You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-323b31/5117/summary.html

I found the following errors while testing this PR

Failed tests: RelVals AddOn

  • RelVals:

When I ran the RelVals I found an error in the following workflows:
140.53 step2
runTheMatrix-results/140.53_RunHI2011+RunHI2011+RECOHID11+HARVESTDHI/step2_RunHI2011+RunHI2011+RECOHID11+HARVESTDHI.log

5.1 step1
runTheMatrix-results/5.1_TTbar+TTbarFS+HARVESTFS/step1_TTbar+TTbarFS+HARVESTFS.log

135.4 step1
runTheMatrix-results/135.4_ZEE_13+ZEEFS_13+HARVESTUP15FS+MINIAODMCUP15FS/step1_ZEE_13+ZEEFS_13+HARVESTUP15FS+MINIAODMCUP15FS.log

1001.0 step2
runTheMatrix-results/1001.0_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVDSIPIXELCALRUN1+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5/step2_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVDSIPIXELCALRUN1+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5.log

1000.0 step2
runTheMatrix-results/1000.0_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT/step2_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT.log

140.56 step2
runTheMatrix-results/140.56_RunHI2018+RunHI2018+RECOHID18+HARVESTDHI18/step2_RunHI2018+RunHI2018+RECOHID18+HARVESTDHI18.log

4.53 step3
runTheMatrix-results/4.53_RunPhoton2012B+RunPhoton2012B+HLTD+RECODR1reHLT+HARVESTDR1reHLT/step3_RunPhoton2012B+RunPhoton2012B+HLTD+RECODR1reHLT+HARVESTDR1reHLT.log

136.731 step3
runTheMatrix-results/136.731_RunSinglePh2016B+RunSinglePh2016B+HLTDR2_2016+RECODR2_2016reHLT_skimSinglePh_HIPM+HARVESTDR2/step3_RunSinglePh2016B+RunSinglePh2016B+HLTDR2_2016+RECODR2_2016reHLT_skimSinglePh_HIPM+HARVESTDR2.log

158.0 step2
runTheMatrix-results/158.0_HydjetQ_B12_5020GeV_2018_ppReco+HydjetQ_B12_5020GeV_2018_ppReco+DIGIHI2018PPRECO+RECOHI2018PPRECO+ALCARECOHI2018PPRECO+HARVESTHI2018PPRECO/step2_HydjetQ_B12_5020GeV_2018_ppReco+HydjetQ_B12_5020GeV_2018_ppReco+DIGIHI2018PPRECO+RECOHI2018PPRECO+ALCARECOHI2018PPRECO+HARVESTHI2018PPRECO.log

136.793 step3
runTheMatrix-results/136.793_RunDoubleEG2017C+RunDoubleEG2017C+HLTDR2_2017+RECODR2_2017reHLT_skimDoubleEG_Prompt+HARVEST2017/step3_RunDoubleEG2017C+RunDoubleEG2017C+HLTDR2_2017+RECODR2_2017reHLT_skimDoubleEG_Prompt+HARVEST2017.log

136.874 step3
runTheMatrix-results/136.874_RunEGamma2018C+RunEGamma2018C+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Offline_L1TEgDQM+HARVEST2018_L1TEgDQM/step3_RunEGamma2018C+RunEGamma2018C+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Offline_L1TEgDQM+HARVEST2018_L1TEgDQM.log

1330.0 step3
runTheMatrix-results/1330.0_ZMM_13+ZMM_13+DIGIUP15+RECOUP15_L1TMuDQM+HARVESTUP15_L1TMuDQM+NANOUP15/step3_ZMM_13+ZMM_13+DIGIUP15+RECOUP15_L1TMuDQM+HARVESTUP15_L1TMuDQM+NANOUP15.log

1306.0 step3
runTheMatrix-results/1306.0_SingleMuPt1_UP15+SingleMuPt1_UP15+DIGIUP15+RECOUP15+HARVESTUP15/step3_SingleMuPt1_UP15+SingleMuPt1_UP15+DIGIUP15+RECOUP15+HARVESTUP15.log

10042.0 step3
runTheMatrix-results/10042.0_ZMM_13+ZMM_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017/step3_ZMM_13+ZMM_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017.log

9.0 step3
runTheMatrix-results/9.0_Higgs200ChargedTaus+Higgs200ChargedTaus+DIGI+RECO+HARVEST/step3_Higgs200ChargedTaus+Higgs200ChargedTaus+DIGI+RECO+HARVEST.log

11634.0 step2
runTheMatrix-results/11634.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2021_GenSimFull+DigiFull_2021+RecoFull_2021+HARVESTFull_2021+ALCAFull_2021/step2_TTbar_14TeV+TTbar_14TeV_TuneCP5_2021_GenSimFull+DigiFull_2021+RecoFull_2021+HARVESTFull_2021+ALCAFull_2021.log

12434.0 step2
runTheMatrix-results/12434.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2023_GenSimFull+DigiFull_2023+RecoFull_2023+HARVESTFull_2023+ALCAFull_2023/step2_TTbar_14TeV+TTbar_14TeV_TuneCP5_2023_GenSimFull+DigiFull_2023+RecoFull_2023+HARVESTFull_2023+ALCAFull_2023.log

25.0 step3
runTheMatrix-results/25.0_TTbar+TTbar+DIGI+RECOAlCaCalo+HARVEST+ALCATT/step3_TTbar+TTbar+DIGI+RECOAlCaCalo+HARVEST+ALCATT.log

10824.0 step3
runTheMatrix-results/10824.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2018_GenSimFull+DigiFull_2018+RecoFull_2018+HARVESTFull_2018+ALCAFull_2018+NanoFull_2018/step3_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2018_GenSimFull+DigiFull_2018+RecoFull_2018+HARVESTFull_2018+ALCAFull_2018+NanoFull_2018.log

10024.0 step3
runTheMatrix-results/10024.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017/step3_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017.log

25202.0 step3
runTheMatrix-results/25202.0_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25+NANOUP15_PU25/step3_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25+NANOUP15_PU25.log

10224.0 step3
runTheMatrix-results/10224.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFull+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU+NanoFull_2017PU/step3_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFull+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU+NanoFull_2017PU.log

20034.0 step3
runTheMatrix-results/20034.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D35_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D35+RecoFullGlobal_2026D35+HARVESTFullGlobal_2026D35/step3_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D35_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D35+RecoFullGlobal_2026D35+HARVESTFullGlobal_2026D35.log

20434.0 step3
runTheMatrix-results/20434.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D41_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D41+RecoFullGlobal_2026D41+HARVESTFullGlobal_2026D41/step3_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D41_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D41+RecoFullGlobal_2026D41+HARVESTFullGlobal_2026D41.log

21234.0 step3
runTheMatrix-results/21234.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D44_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D44+RecoFullGlobal_2026D44+HARVESTFullGlobal_2026D44/step3_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D44_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D44+RecoFullGlobal_2026D44+HARVESTFullGlobal_2026D44.log

23234.0 step3
runTheMatrix-results/23234.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D49_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D49+RecoFullGlobal_2026D49+HARVESTFullGlobal_2026D49/step3_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D49_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D49+RecoFullGlobal_2026D49+HARVESTFullGlobal_2026D49.log

250202.181 step4
runTheMatrix-results/250202.181_TTbar_13UP18+TTbar_13UP18+PREMIXUP18_PU25+DIGIPRMXLOCALUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25/step4_TTbar_13UP18+TTbar_13UP18+PREMIXUP18_PU25+DIGIPRMXLOCALUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25.log

  • AddOn:

I found errors in the following addon tests:

cmsDriver.py TTbar_8TeV_TuneCUETP8M1_cfi --conditions auto:run1_mc --fast -n 100 --eventcontent AODSIM,DQM --relval 100000,1000 -s GEN,SIM,RECOBEFMIX,DIGI:pdigi_valid,L1,DIGI2RAW,L1Reco,RECO,EI,VALIDATION --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --datatier GEN-SIM-DIGI-RECO,DQMIO --beamspot Realistic8TeVCollision : FAILED - time: date Tue Mar 10 23:26:50 2020-date Tue Mar 10 23:24:22 2020 s - exit: 16896
cmsRun /cvmfs/cms-ib.cern.ch/week1/slc7_ppc64le_gcc820/cms/cmssw/CMSSW_11_1_X_2020-03-09-2300/src/HLTrigger/Configuration/test/OnLine_HLT_Fake2.py realData=True globalTag=@ inputFiles=@ : FAILED - time: date Tue Mar 10 23:27:36 2020-date Tue Mar 10 23:24:27 2020 s - exit: 16896
cmsDriver.py RelVal -s HLT:Fake2,RAW2DIGI,L1Reco,RECO --data --scenario=pp -n 10 --conditions auto:run2_data_Fake2 --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --era Run2_2016 --processName=HLTRECO --filein file:RelVal_Raw_Fake2_DATA.root --fileout file:RelVal_Raw_Fake2_DATA_HLT_RECO.root : FAILED - time: date Tue Mar 10 23:27:36 2020-date Tue Mar 10 23:24:27 2020 s - exit: 16896
cmsRun /cvmfs/cms-ib.cern.ch/week1/slc7_ppc64le_gcc820/cms/cmssw/CMSSW_11_1_X_2020-03-09-2300/src/HLTrigger/Configuration/test/OnLine_HLT_Fake1.py realData=False globalTag=@ inputFiles=@ : FAILED - time: date Tue Mar 10 23:31:19 2020-date Tue Mar 10 23:24:28 2020 s - exit: 16896
cmsDriver.py RelVal -s HLT:Fake1,RAW2DIGI,L1Reco,RECO --mc --scenario=pp -n 10 --conditions auto:run2_mc_Fake1 --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --era Run2_25ns --processName=HLTRECO --filein file:RelVal_Raw_Fake1_MC.root --fileout file:RelVal_Raw_Fake1_MC_HLT_RECO.root : FAILED - time: date Tue Mar 10 23:31:19 2020-date Tue Mar 10 23:24:28 2020 s - exit: 16896
cmsRun /cvmfs/cms-ib.cern.ch/week1/slc7_ppc64le_gcc820/cms/cmssw/CMSSW_11_1_X_2020-03-09-2300/src/HLTrigger/Configuration/test/OnLine_HLT_2018.py realData=False globalTag=@ inputFiles=@ : FAILED - time: date Tue Mar 10 23:36:28 2020-date Tue Mar 10 23:24:31 2020 s - exit: 16896
cmsDriver.py RelVal -s HLT:2018,RAW2DIGI,L1Reco,RECO --mc --scenario=pp -n 10 --conditions auto:run2_mc_2018 --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --era Run2_2018 --processName=HLTRECO --filein file:RelVal_Raw_2018_MC.root --fileout file:RelVal_Raw_2018_MC_HLT_RECO.root : FAILED - time: date Tue Mar 10 23:36:28 2020-date Tue Mar 10 23:24:31 2020 s - exit: 16896
cmsRun /cvmfs/cms-ib.cern.ch/week1/slc7_ppc64le_gcc820/cms/cmssw/CMSSW_11_1_X_2020-03-09-2300/src/HLTrigger/Configuration/test/OnLine_HLT_HIon.py realData=True globalTag=@ inputFiles=@ : FAILED - time: date Tue Mar 10 23:30:07 2020-date Tue Mar 10 23:24:36 2020 s - exit: 16896
cmsDriver.py RelVal -s HLT:HIon,RAW2DIGI,L1Reco,RECO --data --scenario=pp -n 10 --conditions auto:run2_data_HIon --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --era Run2_2018_pp_on_AA --processName=HLTRECO --filein file:RelVal_Raw_HIon_DATA.root --fileout file:RelVal_Raw_HIon_DATA_HLT_RECO.root : FAILED - time: date Tue Mar 10 23:30:07 2020-date Tue Mar 10 23:24:36 2020 s - exit: 16896
cmsRun /cvmfs/cms-ib.cern.ch/week1/slc7_ppc64le_gcc820/cms/cmssw/CMSSW_11_1_X_2020-03-09-2300/src/HLTrigger/Configuration/test/OnLine_HLT_GRun.py realData=False globalTag=@ inputFiles=@ : FAILED - time: date Tue Mar 10 23:35:25 2020-date Tue Mar 10 23:24:39 2020 s - exit: 16896
cmsDriver.py RelVal -s HLT:GRun,RAW2DIGI,L1Reco,RECO --mc --scenario=pp -n 10 --conditions auto:run2_mc_GRun --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --era Run3 --processName=HLTRECO --filein file:RelVal_Raw_GRun_MC.root --fileout file:RelVal_Raw_GRun_MC_HLT_RECO.root : FAILED - time: date Tue Mar 10 23:35:25 2020-date Tue Mar 10 23:24:39 2020 s - exit: 16896
cmsDriver.py TTbar_13TeV_TuneCUETP8M1_cfi --conditions auto:run2_mc_l1stage1 --fast -n 100 --eventcontent AODSIM,DQM --relval 100000,1000 -s GEN,SIM,RECOBEFMIX,DIGI:pdigi_valid,L1,DIGI2RAW,L1Reco,RECO,EI,VALIDATION --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --datatier GEN-SIM-DIGI-RECO,DQMIO --beamspot NominalCollision2015 --era Run2_25ns : FAILED - time: date Tue Mar 10 23:27:33 2020-date Tue Mar 10 23:24:44 2020 s - exit: 16896
cmsRun /cvmfs/cms-ib.cern.ch/week1/slc7_ppc64le_gcc820/cms/cmssw/CMSSW_11_1_X_2020-03-09-2300/src/HLTrigger/Configuration/test/OnLine_HLT_2018.py realData=True globalTag=@ inputFiles=@ : FAILED - time: date Tue Mar 10 23:33:47 2020-date Tue Mar 10 23:24:49 2020 s - exit: 16896
cmsDriver.py RelVal -s HLT:2018,RAW2DIGI,L1Reco,RECO --data --scenario=pp -n 10 --conditions auto:run2_data_2018 --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --era Run2_2018 --processName=HLTRECO --filein file:RelVal_Raw_2018_DATA.root --fileout file:RelVal_Raw_2018_DATA_HLT_RECO.root : FAILED - time: date Tue Mar 10 23:33:47 2020-date Tue Mar 10 23:24:49 2020 s - exit: 16896
cmsRun /cvmfs/cms-ib.cern.ch/week1/slc7_ppc64le_gcc820/cms/cmssw/CMSSW_11_1_X_2020-03-09-2300/src/HLTrigger/Configuration/test/OnLine_HLT_PRef.py realData=False globalTag=@ inputFiles=@ : FAILED - time: date Tue Mar 10 23:31:21 2020-date Tue Mar 10 23:24:50 2020 s - exit: 16896
cmsDriver.py RelVal -s HLT:PRef,RAW2DIGI,L1Reco,RECO --mc --scenario=pp -n 10 --conditions auto:run2_mc_PRef --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --era Run3 --processName=HLTRECO --filein file:RelVal_Raw_PRef_MC.root --fileout file:RelVal_Raw_PRef_MC_HLT_RECO.root : FAILED - time: date Tue Mar 10 23:31:21 2020-date Tue Mar 10 23:24:50 2020 s - exit: 16896
cmsDriver.py TTbar_13TeV_TuneCUETP8M1_cfi --conditions auto:run2_mc --fast -n 100 --eventcontent AODSIM,DQM --relval 100000,1000 -s GEN,SIM,RECOBEFMIX,DIGI:pdigi_valid,L1,DIGI2RAW,L1Reco,RECO,EI,VALIDATION --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --datatier GEN-SIM-DIGI-RECO,DQMIO --beamspot NominalCollision2015 --era Run2_2016 : FAILED - time: date Tue Mar 10 23:27:26 2020-date Tue Mar 10 23:24:53 2020 s - exit: 16896
cmsRun /cvmfs/cms-ib.cern.ch/week1/slc7_ppc64le_gcc820/cms/cmssw/CMSSW_11_1_X_2020-03-09-2300/src/HLTrigger/Configuration/test/OnLine_HLT_PIon.py realData=True globalTag=@ inputFiles=@ : FAILED - time: date Tue Mar 10 23:27:40 2020-date Tue Mar 10 23:24:57 2020 s - exit: 16896
cmsDriver.py RelVal -s HLT:PIon,RAW2DIGI,L1Reco,RECO --data --scenario=pp -n 10 --conditions auto:run2_data_PIon --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --era Run2_2018 --processName=HLTRECO --filein file:RelVal_Raw_PIon_DATA.root --fileout file:RelVal_Raw_PIon_DATA_HLT_RECO.root : FAILED - time: date Tue Mar 10 23:27:40 2020-date Tue Mar 10 23:24:57 2020 s - exit: 16896
cmsRun /cvmfs/cms-ib.cern.ch/week1/slc7_ppc64le_gcc820/cms/cmssw/CMSSW_11_1_X_2020-03-09-2300/src/HLTrigger/Configuration/test/OnLine_HLT_Fake.py realData=False globalTag=@ inputFiles=@ : FAILED - time: date Tue Mar 10 23:31:13 2020-date Tue Mar 10 23:24:59 2020 s - exit: 16896
cmsDriver.py RelVal -s HLT:Fake,RAW2DIGI,L1Reco,RECO --mc --scenario=pp -n 10 --conditions auto:run1_mc_Fake --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --processName=HLTRECO --filein file:RelVal_Raw_Fake_MC.root --fileout file:RelVal_Raw_Fake_MC_HLT_RECO.root : FAILED - time: date Tue Mar 10 23:31:13 2020-date Tue Mar 10 23:24:59 2020 s - exit: 16896
cmsRun /cvmfs/cms-ib.cern.ch/week1/slc7_ppc64le_gcc820/cms/cmssw/CMSSW_11_1_X_2020-03-09-2300/src/HLTrigger/Configuration/test/OnLine_HLT_HIon.py realData=False globalTag=@ inputFiles=@ : FAILED - time: date Tue Mar 10 23:32:53 2020-date Tue Mar 10 23:25:03 2020 s - exit: 16896
cmsDriver.py RelVal -s HLT:HIon,RAW2DIGI,L1Reco,RECO --mc --scenario=pp -n 10 --conditions auto:run2_mc_HIon --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --era Run2_2018_pp_on_AA --processName=HLTRECO --filein file:RelVal_Raw_HIon_MC.root --fileout file:RelVal_Raw_HIon_MC_HLT_RECO.root : FAILED - time: date Tue Mar 10 23:32:53 2020-date Tue Mar 10 23:25:03 2020 s - exit: 16896
cmsRun /cvmfs/cms-ib.cern.ch/week1/slc7_ppc64le_gcc820/cms/cmssw/CMSSW_11_1_X_2020-03-09-2300/src/HLTrigger/Configuration/test/OnLine_HLT_Fake2.py realData=False globalTag=@ inputFiles=@ : FAILED - time: date Tue Mar 10 23:31:20 2020-date Tue Mar 10 23:25:04 2020 s - exit: 16896
cmsDriver.py RelVal -s HLT:Fake2,RAW2DIGI,L1Reco,RECO --mc --scenario=pp -n 10 --conditions auto:run2_mc_Fake2 --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --era Run2_2016 --processName=HLTRECO --filein file:RelVal_Raw_Fake2_MC.root --fileout file:RelVal_Raw_Fake2_MC_HLT_RECO.root : FAILED - time: date Tue Mar 10 23:31:20 2020-date Tue Mar 10 23:25:04 2020 s - exit: 16896
cmsRun /cvmfs/cms-ib.cern.ch/week1/slc7_ppc64le_gcc820/cms/cmssw/CMSSW_11_1_X_2020-03-09-2300/src/HLTrigger/Configuration/test/OnLine_HLT_PRef.py realData=True globalTag=@ inputFiles=@ : FAILED - time: date Tue Mar 10 23:28:31 2020-date Tue Mar 10 23:25:07 2020 s - exit: 16896
cmsDriver.py RelVal -s HLT:PRef,RAW2DIGI,L1Reco,RECO --data --scenario=pp -n 10 --conditions auto:run2_data_PRef --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --era Run2_2018 --processName=HLTRECO --filein file:RelVal_Raw_PRef_DATA.root --fileout file:RelVal_Raw_PRef_DATA_HLT_RECO.root : FAILED - time: date Tue Mar 10 23:28:31 2020-date Tue Mar 10 23:25:07 2020 s - exit: 16896
cmsRun /cvmfs/cms-ib.cern.ch/week1/slc7_ppc64le_gcc820/cms/cmssw/CMSSW_11_1_X_2020-03-09-2300/src/HLTrigger/Configuration/test/OnLine_HLT_Fake.py realData=True globalTag=@ inputFiles=@ : FAILED - time: date Tue Mar 10 23:27:59 2020-date Tue Mar 10 23:25:12 2020 s - exit: 16896
cmsDriver.py RelVal -s HLT:Fake,RAW2DIGI,L1Reco,RECO --data --scenario=pp -n 10 --conditions auto:run1_data_Fake --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --processName=HLTRECO --filein file:RelVal_Raw_Fake_DATA.root --fileout file:RelVal_Raw_Fake_DATA_HLT_RECO.root : FAILED - time: date Tue Mar 10 23:27:59 2020-date Tue Mar 10 23:25:12 2020 s - exit: 16896
cmsRun /cvmfs/cms-ib.cern.ch/week1/slc7_ppc64le_gcc820/cms/cmssw/CMSSW_11_1_X_2020-03-09-2300/src/HLTrigger/Configuration/test/OnLine_HLT_GRun.py realData=True globalTag=@ inputFiles=@ : FAILED - time: date Tue Mar 10 23:34:05 2020-date Tue Mar 10 23:25:14 2020 s - exit: 16896
cmsDriver.py RelVal -s HLT:GRun,RAW2DIGI,L1Reco,RECO --data --scenario=pp -n 10 --conditions auto:run2_data_GRun --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --era Run2_2018 --processName=HLTRECO --filein file:RelVal_Raw_GRun_DATA.root --fileout file:RelVal_Raw_GRun_DATA_HLT_RECO.root : FAILED - time: date Tue Mar 10 23:34:05 2020-date Tue Mar 10 23:25:14 2020 s - exit: 16896
cmsRun /cvmfs/cms-ib.cern.ch/week1/slc7_ppc64le_gcc820/cms/cmssw/CMSSW_11_1_X_2020-03-09-2300/src/HLTrigger/Configuration/test/OnLine_HLT_Fake1.py realData=True globalTag=@ inputFiles=@ : FAILED - time: date Tue Mar 10 23:27:41 2020-date Tue Mar 10 23:25:16 2020 s - exit: 16896
cmsDriver.py RelVal -s HLT:Fake1,RAW2DIGI,L1Reco,RECO --data --scenario=pp -n 10 --conditions auto:run2_data_Fake1 --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --era Run2_25ns --processName=HLTRECO --filein file:RelVal_Raw_Fake1_DATA.root --fileout file:RelVal_Raw_Fake1_DATA_HLT_RECO.root : FAILED - time: date Tue Mar 10 23:27:41 2020-date Tue Mar 10 23:25:16 2020 s - exit: 16896
cmsRun /cvmfs/cms-ib.cern.ch/week1/slc7_ppc64le_gcc820/cms/cmssw/CMSSW_11_1_X_2020-03-09-2300/src/HLTrigger/Configuration/test/OnLine_HLT_PIon.py realData=False globalTag=@ inputFiles=@ : FAILED - time: date Tue Mar 10 23:31:59 2020-date Tue Mar 10 23:25:20 2020 s - exit: 16896
cmsDriver.py RelVal -s HLT:PIon,RAW2DIGI,L1Reco,RECO --mc --scenario=pp -n 10 --conditions auto:run2_mc_PIon --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --era Run3 --processName=HLTRECO --filein file:RelVal_Raw_PIon_MC.root --fileout file:RelVal_Raw_PIon_MC_HLT_RECO.root : FAILED - time: date Tue Mar 10 23:31:59 2020-date Tue Mar 10 23:25:20 2020 s - exit: 16896

@cmsbuild

Comparison not run due to runTheMatrix errors (RelVals and Igprof tests were also skipped)

**kwargs
)

def cloneAll(self, **params):

can't this be called clone?

> can't this be called clone?

That would override the base class SwitchProducer.clone(), which does the clone but requires explicitly passing the dictionaries for all cases, just like the implementation does. On the other hand, I see that for use cases where the case configurations are (nearly) identical, the base SwitchProducer approach is a bit cumbersome. I could think of adding a special parameter to the base class clone() that the modifications would be applied to all cases, e.g. something along switchProducer.clone(..., applyToAllCases_ = True).

> I could think of adding a special parameter to the base class clone() that the modifications would be applied to all cases, e.g. something along switchProducer.clone(..., applyToAllCases_ = True).

@makortel I think it would be nice to have this:)
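
To make the proposal in this thread concrete, here is a minimal toy sketch of what an `applyToAllCases_` flag could look like. `ToySwitch` is a hypothetical stand-in written for illustration; it is not the FWCore SwitchProducer class, and the real clone() may handle its arguments differently.

```python
import copy


class ToySwitch:
    """Toy stand-in for a switch producer: maps case name -> parameter dict."""

    def __init__(self, **cases):
        self.cases = {name: dict(params) for name, params in cases.items()}

    def clone(self, applyToAllCases_=False, **kwargs):
        new_cases = copy.deepcopy(self.cases)
        if applyToAllCases_:
            # Every keyword is a parameter override applied to each case.
            for params in new_cases.values():
                params.update(kwargs)
        else:
            # Each keyword names a case and carries a dict of overrides for it.
            for name, overrides in kwargs.items():
                new_cases[name].update(overrides)
        return ToySwitch(**new_cases)
```

With this shape, `switch.clone(applyToAllCases_=True, src='newJets')` would change `src` in every case, while `switch.clone(native=dict(model_path='...'))` would keep the per-case behaviour described above.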

    global _onnxrt_enabled_cached
    if _onnxrt_enabled_cached is None:
        import os
        _onnxrt_enabled_cached = ('amd64' in os.environ['SCRAM_ARCH'] or 'aarch64' in os.environ['SCRAM_ARCH']) and ('CMS_DISABLE_ONNXRUNTIME' not in os.environ)

please add linebreaks for readability (to fit in under 100 chars)

@@ -720,6 +721,15 @@ def setupBTagging(process, jetSource, pfCandidates, explicitJTA, pvSource, svSou
process,
task
)
elif isinstance(getattr(btag, btagDiscr), SwitchProducerONNX):

IIUC, this branch is needed only because of cloneAll in the name.
It would be better to avoid a special method case

Comment on lines +7 to +9
_flav_names = ['probTbcq', 'probTbqq', 'probTbc', 'probTbq', 'probWcq', 'probWqq',
'probZbb', 'probZcc', 'probZqq', 'probHbb', 'probHcc', 'probHqqqq',
'probQCDbb', 'probQCDcc', 'probQCDb', 'probQCDc', 'probQCDothers']

this is apparently a copy from deepBoostedJetONNXJetTagsProducer.flav_names

desc.add<std::vector<std::string>>("flav_names",
std::vector<std::string>{
"probTbcq",
"probTbqq",
"probTbc",
"probTbq",
"probWcq",
"probWqq",
"probZbb",
"probZcc",
"probZqq",
"probHbb",
"probHcc",
"probHqqqq",
"probQCDbb",
"probQCDcc",
"probQCDb",
"probQCDc",
"probQCDothers",

why is an explicit copy needed here?

param_path = 'RecoBTag/Combined/data/DeepBoostedJet/V02/full/resnet-0000.params',
),
onnx = deepBoostedJetONNXJetTagsProducer.clone(
flav_names = _flav_names,

here please use what's already defined in fillDescriptions (IIUC, _flav_names here is identical to that)

)

# mass-decorrelated DeepAK8
pfMassDecorrelatedDeepBoostedJetTags = SwitchProducerONNX(

better clone from pfDeepBoostedJetTags to minimize repeated copies of the same parameters


pfMassIndependentDeepDoubleBvLJetTags = SwitchProducerONNX(
native = pfDeepDoubleBvLTFJetTags.clone(
model_path = 'RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDB_mass_independent.pb'

is it possible to apply the regular clone pattern here as in other modules?
pfMassIndependentDeepDoubleBvLJetTags = pfDeepDoubleBvLJetTags.clone(native = dict(model_path = ...


#include <algorithm>

class DeepDoubleXTFJetTagsProducer : public edm::stream::EDProducer<edm::GlobalCache<tensorflow::GraphDef>> {

given the overlap with DeepDoubleXONNXJetTagsProducer, I think that some common base or template is needed.

slava77 commented Apr 7, 2020

> One thing w/ the current implementation is that the SwitchProducer mechanism actually initializes both the TF and the ONNX producers and this creates some overhead (i.e., both the TF and the ONNX DNN models will be loaded as this is done in initializeGlobalCache). Is there a way to avoid this?

What is the cost of loading both models in memory for the switch producer case in MB for the full setup?
Is it limited to the GlobalCache initialization and would scale only per process?
I can think of deferring the actual model initialization to the first call: the global cache will be lightweight, but the first call (in the ::produce) will do a call_once and load the full model.
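
The deferred-initialization idea can be illustrated with a small Python analogue of the call_once pattern. This is only a sketch: the producers discussed here are C++ modules, and `LazyModel` and its `loader` callable are hypothetical names introduced for the example.

```python
import threading


class LazyModel:
    """Holds only a path until first use, then loads the model exactly once."""

    def __init__(self, path, loader):
        self._path = path
        self._loader = loader  # hypothetical callable: path -> model object
        self._model = None
        self._lock = threading.Lock()

    def get(self):
        # Double-checked locking: the lock is taken only until the model exists.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._loader(self._path)
        return self._model
```

Constructing the object is cheap, so a cache built this way stays lightweight; the cost of loading is paid on the first `get()` call and only for the backend that is actually run.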

cmsbuild commented Apr 7, 2020

+1
Tested at: d8951bb
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-323b31/5578/summary.html
CMSSW: CMSSW_11_1_X_2020-04-07-1100
SCRAM_ARCH: slc7_amd64_gcc820

cmsbuild commented Apr 7, 2020

Comparison job queued.

cmsbuild commented Apr 8, 2020

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-323b31/5578/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 41 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2692110
  • DQMHistoTests: Total failures: 46
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2691745
  • DQMHistoTests: Total skipped: 319
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 33 files compared)
  • Checked 147 log files, 16 edm output root files, 34 DQM output files

Comment on lines +33 to +36
return super(SwitchProducerONNX, self).clone(
native = self.native.clone(**params),
onnx = self.onnx.clone(**params),
)

Suggested change
-        return super(SwitchProducerONNX, self).clone(
-            native = self.native.clone(**params),
-            onnx = self.onnx.clone(**params),
-        )
+        return super(SwitchProducerONNX, self).clone(
+            native = params,
+            onnx = params,
+        )

just to note that this should work as well, and be a little bit more efficient (less cloning).

hqucms commented Apr 8, 2020

@slava77 A general question is how we would like to move forward w/ this PR given that ppc support will probably come for ONNXRuntime (though not clear when). Shall we wait a bit more for that, or do you prefer to proceed with this PR now?

slava77 commented Apr 8, 2020

> @slava77 A general question is how we would like to move forward w/ this PR given that ppc support will probably come for ONNXRuntime (though not clear when). Shall we wait a bit more for that, or do you prefer to proceed with this PR now?

"in a few days" was mentioned in the ONNX issue at the end of February. I can imagine that the current priorities are elsewhere. Even with the expectation that this PR is useful only somewhat temporarily, it also has some R&D feature/motivation.
I'm not going to push for this at this point; perhaps the importance is to be revisited on about 4-week scale, when we are near closing the 11_1_X.

hqucms commented Apr 8, 2020

> "in a few days" was mentioned in the ONNX issue at the end of February. I can imagine that the current priorities are elsewhere. Even with the expectation that this PR is useful only somewhat temporarily, it also has some R&D feature/motivation.
> I'm not going to push for this at this point; perhaps the importance is to be revisited on about 4-week scale, when we are near closing the 11_1_X.

Sounds like a good plan to me! I am a bit busy these days with some other things, so then I will come back to this after a week or two.

@cmsbuild cmsbuild mentioned this pull request Apr 16, 2020
@slava77 slava77 mentioned this pull request Apr 18, 2020
41 tasks
@silviodonato

Kind reminder for @hqucms

@cmsbuild cmsbuild mentioned this pull request Apr 24, 2020
@silviodonato

Kind reminder for @hqucms

hqucms commented Apr 28, 2020

With ppc support added to ONNXRuntime after cms-sw/cmsdist#5743, I guess this PR is no longer needed and can be closed?

slava77 commented Apr 29, 2020

> With ppc support added to ONNXRuntime after cms-sw/cmsdist#5743, I guess this PR is no longer needed and can be closed?

right, quite likely.
let's give it a few days before closing.

slava77 commented Apr 30, 2020

-1
PPC IBs looked OK in the last two days.
So, it should be fine to close this PR for now.
