Restore TF and MXNet-based inference for DeepJet, DeepDoubleX and DeepAK8 #29172
Conversation
And enable ONNXRuntime for x86/arm only.
The code-checks are being triggered in jenkins.
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-29172/14144
A new Pull Request was created by @hqucms (Huilin Qu) for master.
It involves the following packages: PhysicsTools/ONNXRuntime
@perrotta, @cmsbuild, @santocch, @slava77 can you please review it and eventually sign? Thanks.
cms-bot commands are listed here
is this "nice to have" or really required at run time?
It is required (to test the TF-based part, e.g., for PPC). For x86/ARM we use ONNX so it's not needed.
test parameters
@cmsbuild please test for slc7_ppc64le_gcc820 |
The tests are being triggered in jenkins.
One thing w/ the current implementation is that the
@makortel @Dr15Jones |
-1
Tested at: d8951bb
CMSSW: CMSSW_11_1_X_2020-03-09-2300
The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:
You can see more details here:
You can see the results of the tests here:
I found the following errors while testing this PR.
Failed tests: RelVals AddOn
When I ran the RelVals I found an error in the following workflows:
140.53 step2: runTheMatrix-results/140.53_RunHI2011+RunHI2011+RECOHID11+HARVESTDHI/step2_RunHI2011+RunHI2011+RECOHID11+HARVESTDHI.log
5.1 step1: runTheMatrix-results/5.1_TTbar+TTbarFS+HARVESTFS/step1_TTbar+TTbarFS+HARVESTFS.log
135.4 step1: runTheMatrix-results/135.4_ZEE_13+ZEEFS_13+HARVESTUP15FS+MINIAODMCUP15FS/step1_ZEE_13+ZEEFS_13+HARVESTUP15FS+MINIAODMCUP15FS.log
1001.0 step2: runTheMatrix-results/1001.0_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVDSIPIXELCALRUN1+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5/step2_RunMinBias2011A+RunMinBias2011A+TIER0EXP+ALCAEXP+ALCAHARVDSIPIXELCALRUN1+ALCAHARVD1+ALCAHARVD2+ALCAHARVD3+ALCAHARVD4+ALCAHARVD5.log
1000.0 step2: runTheMatrix-results/1000.0_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT/step2_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT.log
140.56 step2: runTheMatrix-results/140.56_RunHI2018+RunHI2018+RECOHID18+HARVESTDHI18/step2_RunHI2018+RunHI2018+RECOHID18+HARVESTDHI18.log
4.53 step3: runTheMatrix-results/4.53_RunPhoton2012B+RunPhoton2012B+HLTD+RECODR1reHLT+HARVESTDR1reHLT/step3_RunPhoton2012B+RunPhoton2012B+HLTD+RECODR1reHLT+HARVESTDR1reHLT.log
136.731 step3: runTheMatrix-results/136.731_RunSinglePh2016B+RunSinglePh2016B+HLTDR2_2016+RECODR2_2016reHLT_skimSinglePh_HIPM+HARVESTDR2/step3_RunSinglePh2016B+RunSinglePh2016B+HLTDR2_2016+RECODR2_2016reHLT_skimSinglePh_HIPM+HARVESTDR2.log
158.0 step2: runTheMatrix-results/158.0_HydjetQ_B12_5020GeV_2018_ppReco+HydjetQ_B12_5020GeV_2018_ppReco+DIGIHI2018PPRECO+RECOHI2018PPRECO+ALCARECOHI2018PPRECO+HARVESTHI2018PPRECO/step2_HydjetQ_B12_5020GeV_2018_ppReco+HydjetQ_B12_5020GeV_2018_ppReco+DIGIHI2018PPRECO+RECOHI2018PPRECO+ALCARECOHI2018PPRECO+HARVESTHI2018PPRECO.log
136.793 step3: runTheMatrix-results/136.793_RunDoubleEG2017C+RunDoubleEG2017C+HLTDR2_2017+RECODR2_2017reHLT_skimDoubleEG_Prompt+HARVEST2017/step3_RunDoubleEG2017C+RunDoubleEG2017C+HLTDR2_2017+RECODR2_2017reHLT_skimDoubleEG_Prompt+HARVEST2017.log
136.874 step3: runTheMatrix-results/136.874_RunEGamma2018C+RunEGamma2018C+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Offline_L1TEgDQM+HARVEST2018_L1TEgDQM/step3_RunEGamma2018C+RunEGamma2018C+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Offline_L1TEgDQM+HARVEST2018_L1TEgDQM.log
1330.0 step3: runTheMatrix-results/1330.0_ZMM_13+ZMM_13+DIGIUP15+RECOUP15_L1TMuDQM+HARVESTUP15_L1TMuDQM+NANOUP15/step3_ZMM_13+ZMM_13+DIGIUP15+RECOUP15_L1TMuDQM+HARVESTUP15_L1TMuDQM+NANOUP15.log
1306.0 step3: runTheMatrix-results/1306.0_SingleMuPt1_UP15+SingleMuPt1_UP15+DIGIUP15+RECOUP15+HARVESTUP15/step3_SingleMuPt1_UP15+SingleMuPt1_UP15+DIGIUP15+RECOUP15+HARVESTUP15.log
10042.0 step3: runTheMatrix-results/10042.0_ZMM_13+ZMM_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017/step3_ZMM_13+ZMM_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017.log
9.0 step3: runTheMatrix-results/9.0_Higgs200ChargedTaus+Higgs200ChargedTaus+DIGI+RECO+HARVEST/step3_Higgs200ChargedTaus+Higgs200ChargedTaus+DIGI+RECO+HARVEST.log
11634.0 step2: runTheMatrix-results/11634.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2021_GenSimFull+DigiFull_2021+RecoFull_2021+HARVESTFull_2021+ALCAFull_2021/step2_TTbar_14TeV+TTbar_14TeV_TuneCP5_2021_GenSimFull+DigiFull_2021+RecoFull_2021+HARVESTFull_2021+ALCAFull_2021.log
12434.0 step2: runTheMatrix-results/12434.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2023_GenSimFull+DigiFull_2023+RecoFull_2023+HARVESTFull_2023+ALCAFull_2023/step2_TTbar_14TeV+TTbar_14TeV_TuneCP5_2023_GenSimFull+DigiFull_2023+RecoFull_2023+HARVESTFull_2023+ALCAFull_2023.log
25.0 step3: runTheMatrix-results/25.0_TTbar+TTbar+DIGI+RECOAlCaCalo+HARVEST+ALCATT/step3_TTbar+TTbar+DIGI+RECOAlCaCalo+HARVEST+ALCATT.log
10824.0 step3: runTheMatrix-results/10824.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2018_GenSimFull+DigiFull_2018+RecoFull_2018+HARVESTFull_2018+ALCAFull_2018+NanoFull_2018/step3_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2018_GenSimFull+DigiFull_2018+RecoFull_2018+HARVESTFull_2018+ALCAFull_2018+NanoFull_2018.log
10024.0 step3: runTheMatrix-results/10024.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017/step3_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017.log
25202.0 step3: runTheMatrix-results/25202.0_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25+NANOUP15_PU25/step3_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25+NANOUP15_PU25.log
10224.0 step3: runTheMatrix-results/10224.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFull+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU+NanoFull_2017PU/step3_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFull+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU+NanoFull_2017PU.log
20034.0 step3: runTheMatrix-results/20034.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D35_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D35+RecoFullGlobal_2026D35+HARVESTFullGlobal_2026D35/step3_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D35_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D35+RecoFullGlobal_2026D35+HARVESTFullGlobal_2026D35.log
20434.0 step3: runTheMatrix-results/20434.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D41_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D41+RecoFullGlobal_2026D41+HARVESTFullGlobal_2026D41/step3_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D41_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D41+RecoFullGlobal_2026D41+HARVESTFullGlobal_2026D41.log
21234.0 step3: runTheMatrix-results/21234.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D44_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D44+RecoFullGlobal_2026D44+HARVESTFullGlobal_2026D44/step3_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D44_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D44+RecoFullGlobal_2026D44+HARVESTFullGlobal_2026D44.log
23234.0 step3: runTheMatrix-results/23234.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D49_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D49+RecoFullGlobal_2026D49+HARVESTFullGlobal_2026D49/step3_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D49_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D49+RecoFullGlobal_2026D49+HARVESTFullGlobal_2026D49.log
250202.181 step4: runTheMatrix-results/250202.181_TTbar_13UP18+TTbar_13UP18+PREMIXUP18_PU25+DIGIPRMXLOCALUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25/step4_TTbar_13UP18+TTbar_13UP18+PREMIXUP18_PU25+DIGIPRMXLOCALUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25.log
I found errors in the following addon tests:
cmsDriver.py TTbar_8TeV_TuneCUETP8M1_cfi --conditions auto:run1_mc --fast -n 100 --eventcontent AODSIM,DQM --relval 100000,1000 -s GEN,SIM,RECOBEFMIX,DIGI:pdigi_valid,L1,DIGI2RAW,L1Reco,RECO,EI,VALIDATION --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --datatier GEN-SIM-DIGI-RECO,DQMIO --beamspot Realistic8TeVCollision : FAILED - time: date Tue Mar 10 23:26:50 2020-date Tue Mar 10 23:24:22 2020 s - exit: 16896
The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:
You can see more details here:
Comparison not run due to runTheMatrix errors (RelVals and Igprof tests were also skipped)
            **kwargs
        )

    def cloneAll(self, **params):
can't this be called clone?
> can't this be called clone?
That would override the base class SwitchProducer.clone(), which does the clone but requires explicitly passing the dictionaries for all cases, just like this implementation does. On the other hand, I see that for use cases where the case configurations are (nearly) identical, the base SwitchProducer approach is a bit cumbersome. I could think of adding a special parameter to the base class clone() such that the modifications would be applied to all cases, e.g. something along the lines of switchProducer.clone(..., applyToAllCases_ = True).
> I could think of adding a special parameter to the base class clone() such that the modifications would be applied to all cases, e.g. something along the lines of switchProducer.clone(..., applyToAllCases_ = True).
@makortel I think it would be nice to have this :)
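For illustration, the two styles under discussion might compare as follows; this is a sketch assuming pfDeepBoostedJetTags from this PR, an illustrative nThreads parameter, and the proposed (currently non-existent) applyToAllCases_ flag:

```python
# Status quo: the base SwitchProducer.clone() needs the modifications
# spelled out once per case.
tags_a = pfDeepBoostedJetTags.clone(
    native = dict(nThreads = 1),   # nThreads is an illustrative parameter
    onnx   = dict(nThreads = 1),
)

# Proposal from this thread: broadcast one set of modifications to all
# cases via a special flag (applyToAllCases_ is hypothetical).
tags_b = pfDeepBoostedJetTags.clone(nThreads = 1, applyToAllCases_ = True)
```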
global _onnxrt_enabled_cached
if _onnxrt_enabled_cached is None:
    import os
    _onnxrt_enabled_cached = ('amd64' in os.environ['SCRAM_ARCH'] or 'aarch64' in os.environ['SCRAM_ARCH']) and ('CMS_DISABLE_ONNXRUNTIME' not in os.environ)
please add linebreaks for readability (to fit in under 100 chars)
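One way the snippet above could be wrapped to stay under 100 characters (a sketch; the PR may have settled on a different layout):

```python
global _onnxrt_enabled_cached
if _onnxrt_enabled_cached is None:
    import os
    arch = os.environ['SCRAM_ARCH']
    # ONNXRuntime is only available for x86_64/aarch64 builds, and can be
    # disabled explicitly through the environment.
    _onnxrt_enabled_cached = (('amd64' in arch or 'aarch64' in arch)
                              and 'CMS_DISABLE_ONNXRUNTIME' not in os.environ)
```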
@@ -720,6 +721,15 @@ def setupBTagging(process, jetSource, pfCandidates, explicitJTA, pvSource, svSou
                process,
                task
            )
        elif isinstance(getattr(btag, btagDiscr), SwitchProducerONNX):
IIUC, this branch is needed only because of cloneAll in the name. It would be better to avoid special-casing the method.
_flav_names = ['probTbcq', 'probTbqq', 'probTbc', 'probTbq', 'probWcq', 'probWqq',
               'probZbb', 'probZcc', 'probZqq', 'probHbb', 'probHcc', 'probHqqqq',
               'probQCDbb', 'probQCDcc', 'probQCDb', 'probQCDc', 'probQCDothers']
this is apparently a copy from deepBoostedJetONNXJetTagsProducer.flav_names
(cmssw/RecoBTag/ONNXRuntime/plugins/DeepBoostedJetONNXJetTagsProducer.cc, lines 143 to 161 in d8951bb):
  desc.add<std::vector<std::string>>("flav_names",
                                     std::vector<std::string>{
                                         "probTbcq",
                                         "probTbqq",
                                         "probTbc",
                                         "probTbq",
                                         "probWcq",
                                         "probWqq",
                                         "probZbb",
                                         "probZcc",
                                         "probZqq",
                                         "probHbb",
                                         "probHcc",
                                         "probHqqqq",
                                         "probQCDbb",
                                         "probQCDcc",
                                         "probQCDb",
                                         "probQCDc",
                                         "probQCDothers",
why is an explicit copy needed here?
        param_path = 'RecoBTag/Combined/data/DeepBoostedJet/V02/full/resnet-0000.params',
    ),
    onnx = deepBoostedJetONNXJetTagsProducer.clone(
        flav_names = _flav_names,
here please use what's already defined in the fillDescriptions (IIUC, this _flav_names is identical to that)
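In practice that could mean simply not overriding flav_names for the onnx case, so the default from the producer's fillDescriptions() applies (a sketch; whether the TF/MXNet case can also pick the list up from there is a separate question):

```python
# Sketch: leave flav_names at the plugin default instead of re-listing it.
onnx = deepBoostedJetONNXJetTagsProducer.clone()
# flav_names deliberately not set: the 17 default names quoted above from
# DeepBoostedJetONNXJetTagsProducer::fillDescriptions() are used as-is.
```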
)

# mass-decorrelated DeepAK8
pfMassDecorrelatedDeepBoostedJetTags = SwitchProducerONNX(
better clone from pfDeepBoostedJetTags to minimize repeated copies of the same parameters
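A sketch of what that could look like, assuming pfDeepBoostedJetTags is defined just above; the file names for the mass-decorrelated models are placeholders, not the PR's actual paths:

```python
# Clone the nominal tagger and override only the per-case model files.
pfMassDecorrelatedDeepBoostedJetTags = pfDeepBoostedJetTags.clone(
    native = dict(param_path = 'RecoBTag/.../decorrelated/resnet-0000.params'),
    onnx   = dict(model_path = 'RecoBTag/.../decorrelated/resnet.onnx'),
)
```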
pfMassIndependentDeepDoubleBvLJetTags = SwitchProducerONNX(
    native = pfDeepDoubleBvLTFJetTags.clone(
        model_path = 'RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDB_mass_independent.pb'
is it possible to apply the regular clone pattern here as in other modules?
pfMassIndependentDeepDoubleBvLJetTags = pfDeepDoubleBvLJetTags.clone(native = dict(model_path = ...
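Spelled out with the model path from the snippet above, the suggested pattern would read as follows (assuming pfDeepDoubleBvLJetTags is the corresponding SwitchProducerONNX module; whether the onnx case needs a matching override is not shown here):

```python
pfMassIndependentDeepDoubleBvLJetTags = pfDeepDoubleBvLJetTags.clone(
    native = dict(model_path =
        'RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDB_mass_independent.pb'),
)
```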
#include <algorithm>

class DeepDoubleXTFJetTagsProducer : public edm::stream::EDProducer<edm::GlobalCache<tensorflow::GraphDef>> {
given the overlap with DeepDoubleXONNXJetTagsProducer, I think that some common base or template is needed.
What is the cost, in MB, of loading both models in memory for the switch-producer case in the full setup?
+1
Comparison job queued.
Comparison is ready. Comparison Summary:
        return super(SwitchProducerONNX, self).clone(
            native = self.native.clone(**params),
            onnx = self.onnx.clone(**params),
        )
Suggested change:
-        return super(SwitchProducerONNX, self).clone(
-            native = self.native.clone(**params),
-            onnx = self.onnx.clone(**params),
-        )
+        return super(SwitchProducerONNX, self).clone(
+            native = params,
+            onnx = params,
+        )
just to note that this should work as well, and be a little bit more efficient (less cloning).
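This works because SwitchProducer.clone() accepts a plain dict per case as a set of parameter modifications applied to that case's existing module, e.g. (a sketch with an illustrative parameter name):

```python
# Equivalent result, but the dict form lets clone() modify each case in
# place instead of building an intermediate clone of each case first.
sp_new = sp.clone(native = dict(threads = 1),
                  onnx   = dict(threads = 1))
```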
@slava77 A general question is how we would like to move forward with this PR, given that ppc support will probably come for ONNXRuntime (though it is not clear when). Shall we wait a bit more for that, or do you prefer to proceed with this PR now?
"in a few days" was mentioned in the ONNX issue at the end of February. I can imagine that the current priorities are elsewhere. Even with the expectation that this PR is useful only somewhat temporarily, it also has some R&D feature/motivation. |
Sounds like a good plan to me! I am a bit busy these days with some other things, so I will come back to this after a week or two.
Kind reminder for @hqucms |
Kind reminder for @hqucms |
With ppc support added to ONNXRuntime after cms-sw/cmsdist#5743, I guess this PR is no longer needed and can be closed?
right, quite likely.
-1
PR description:
This PR is to address #28959. The TF and MXNet-based inference for DeepJet, DeepDoubleX and DeepAK8 is restored and intended to be used on architectures that ONNXRuntime does not support (e.g., PowerPC). The new ONNXRuntime-based inference introduced in #28112 is kept for x86 and ARM for better speed and lower memory cost. The switch between the two types of producers is implemented using SwitchProducer and by detecting the SCRAM_ARCH.
Needs cms-data/RecoBTag-Combined#27 (updated TF models with the training-only nodes removed, which speeds up the TF-based inference by 5-10%).
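For orientation, a rough sketch of the switching mechanism the description refers to, assuming the (enabled, priority) case-function convention of cms.SwitchProducer; the names and the priority value are illustrative rather than the PR's exact code:

```python
import os
import FWCore.ParameterSet.Config as cms

def _onnxEnabled():
    # ONNXRuntime is built only for x86_64/aarch64 SCRAM_ARCHes and can be
    # vetoed via an environment variable (cf. the snippet reviewed above).
    arch = os.environ.get('SCRAM_ARCH', '')
    return (('amd64' in arch or 'aarch64' in arch)
            and 'CMS_DISABLE_ONNXRUNTIME' not in os.environ)

class SwitchProducerONNX(cms.SwitchProducer):
    def __init__(self, **kwargs):
        super(SwitchProducerONNX, self).__init__(
            dict(native = cms.SwitchProducer.getCpu(),
                 onnx = lambda: (_onnxEnabled(), 2)),  # prefer ONNX when usable
            **kwargs)
```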
PR validation:
Tested in a JetHT2017D NanoAOD workflow and obtained consistent outputs between the TF/MXNet and the ONNXRuntime backends for the affected taggers.