Long startup time for simplest workflows #43062

Open

felicepantaleo opened this issue Oct 19, 2023 · 42 comments

Comments

@felicepantaleo commented Oct 19, 2023

Developing reconstruction algorithms for Phase-2 is becoming more and more difficult and frustrating as CMSSW startup times have increased in the last few years.
This is what happens when we want to reconstruct 10 events of a single charged pion shot in front of HGCAL with a single thread.
The workflow is 24896.0_CloseByPGun_CE_E_Front_120um+2026D98, using CMSSW_13_2_0_pre3.
It takes 4.5 minutes to launch the job and 30 seconds to reconstruct the 10 events with a single thread.
Reducing the startup time would drastically reduce the number of coffees we have every day while increasing morale.

$ date && cmsRun step3_RAW2DIGI_RECO_RECOSIM_PAT_VALIDATION_DQM.py 
**Thu Oct 19 10:55:37 CEST 2023**
19-Oct-2023 10:57:51 CEST  Initiating request to open file file:step2_16065463_0.root
19-Oct-2023 10:58:06 CEST  Successfully opened file file:step2_16065463_0.root
2023-10-19 10:58:48.203903: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 0 at 19-Oct-2023 11:00:50.708 CEST
%MSG-w UndefinedPartonMode:  HadronAndPartonSelector:selectedHadronsAndPartons  19-Oct-2023 11:00:52 CEST Run: 1 Event: 1
Could not automatically determine the hadronizer type and set the correct parton selection mode. Parton-based jet flavour will not be defined.
%MSG
#--------------------------------------------------------------------------
#                         FastJet release 3.4.1
#                 M. Cacciari, G.P. Salam and G. Soyez                  
#     A software package for jet finding and analysis at colliders      
#                           http://fastjet.fr                           
#                                                                             
# Please cite EPJC72(2012)1896 [arXiv:1111.6097] if you use this package
# for scientific work and optionally PLB641(2006)57 [hep-ph/0512210].   
#                                                                       
# FastJet is provided without warranty under the GNU GPL v2 or higher.  
# It uses T. Chan's closest pair algorithm, S. Fortune's Voronoi code
# and 3rd party plugin jet algorithms. See COPYING file for details.
#--------------------------------------------------------------------------
%MSG-w UndefinedPartonMode:  HadronAndPartonSelector:patJetPartons  19-Oct-2023 11:00:52 CEST Run: 1 Event: 1
Could not automatically determine the hadronizer type and set the correct parton selection mode. Parton-based jet flavour will not be defined.
%MSG
%MSG-w UndefinedPartonMode:   HadronAndPartonSelector:selectedHadronsAndPartonsForGenJetsFlavourInfos  19-Oct-2023 11:01:05 CEST Run: 1 Event: 1
Could not automatically determine the hadronizer type and set the correct parton selection mode. Parton-based jet flavour will not be defined.
%MSG
Begin processing the 2nd record. Run 1, Event 2, LumiSection 1 on stream 0 at 19-Oct-2023 11:01:05.542 CEST
Begin processing the 3rd record. Run 1, Event 3, LumiSection 1 on stream 0 at 19-Oct-2023 11:01:08.168 CEST
Begin processing the 4th record. Run 1, Event 4, LumiSection 1 on stream 0 at 19-Oct-2023 11:01:10.733 CEST
Begin processing the 5th record. Run 1, Event 5, LumiSection 1 on stream 0 at 19-Oct-2023 11:01:13.286 CEST
Begin processing the 6th record. Run 1, Event 6, LumiSection 1 on stream 0 at 19-Oct-2023 11:01:15.435 CEST
Begin processing the 7th record. Run 1, Event 8, LumiSection 1 on stream 0 at 19-Oct-2023 11:01:17.857 CEST
Begin processing the 8th record. Run 1, Event 7, LumiSection 1 on stream 0 at 19-Oct-2023 11:01:20.059 CEST
Begin processing the 9th record. Run 1, Event 9, LumiSection 1 on stream 0 at 19-Oct-2023 11:01:22.279 CEST
Begin processing the 10th record. Run 1, Event 11, LumiSection 1 on stream 0 at 19-Oct-2023 11:01:25.115 CEST

@waredjeb @makortel fyi

@cmsbuild

A new Issue was created by @felicepantaleo Felice Pantaleo.

@antoniovilela, @makortel, @Dr15Jones, @rappoccio, @smuzaffar, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel

assign core,upgrade

@cmsbuild

New categories assigned: core,upgrade

@Dr15Jones,@AdrianoDee,@makortel,@smuzaffar,@srimanob you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dr15Jones

So I used my recent changes to the Tracer service to gather statistics on your job. What I found was the following breakdown:

 51s parsing python
  8.4s setting up source
 19.7s constructing the 2200 modules
  0.05s in begin job
  0.05s in begin stream
143s in CondDBESSource to find the conditions for the first Run
  0.0003s source reading begin run
 24s in global Begin Run
    1.8s in dqmCSCClient
    2.1s in XMLIdealGeometryESSource
    0.9s in hcalParameters
    1.6s in EcalBarrelGeometryEP
    0.9s in hgcalEEParametersInitialize
    1.1s in hgcalHESiParametersInitialize
    1.2s in gemGeometry
    2.7s in TrackerAlignmentRcd
    1.0s in ecalMonitorTask
    4.2s in VolumeBasedMagneticFieldESProducer
    1.0s in trackerGeometry

  0.2s in CondDBESSource to find conditions for first lumi

50s in stream begin Run
    48.3s in mtdNumberingGeometry
 0.002s source reading begin lumi
 0.02s in global begin lumi
 0.2s in stream begin lumi
 0.004s source reading first event

317s from start to start processing first event

So the obvious speed-ups would be to attack the largest contributors above: the 143s conditions lookup in CondDBESSource, the 51s of Python parsing, and the 48s in mtdNumberingGeometry.
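
For reference, a minimal sketch of how such a trace can be enabled in the job configuration (this assumes the standard FWCore Tracer service; the updated version used here may take additional parameters):

# hedged sketch: enable the framework Tracer service in the cmsRun configuration
process.add_(cms.Service("Tracer"))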

@srimanob

Thanks very much @Dr15Jones for the analysis.
FYI @cms-sw/mtd-dpg-l2 @fabiocos on mtdNumberingGeometry

@makortel

I investigated the first 3 steps of the workflow with VTune (as @felicepantaleo told me, it felt like all the steps have long startup times). On step 1 (GEN-SIM) the job divides clearly into three phases:
[image: VTune timeline of the step 1 GEN-SIM job]

The first phase is the overall job startup (shown in the top thread), taking about 102 seconds wall clock time, and 85 seconds of CPU time. The CPU time divides into

  • 2 seconds of reading the configuration
  • 19 seconds of constructing the EventProcessor (i.e. module construction and all that)
  • 63 seconds spent in EventSetup
    • 40 seconds spent in MuonOffsetESProducer
    • 5 seconds in HcalSimParametersESModule
    • 4 seconds in VolumeBasedMagneticFieldESProducerFromDB
    • 4 seconds in XMLIdealGeometryESSource
    • 4 seconds in HGCalParametersESModule
    • 3 seconds in EcalSimParametersESModule
    • rest is 1 second or below

Here is a flame graph of what happens in MuonOffsetESProducer:
[image: flame graph of MuonOffsetESProducer]

The question is then, does it really need to be this expensive? @cms-sw/geometry-l2


The second phase (middle thread in the timeline) is the Geant4 initialization in the OscarMTMasterThread, taking about 50 seconds. I don't know if there is anything that could easily be done there, but here is the flame graph of that anyway @cms-sw/simulation-l2
[image: flame graph of the Geant4 initialization in OscarMTMasterThread]


The third phase (mostly in the bottom thread of the timeline) is the actual event processing (most of the time is spent in Geant4, so it shows up in a separate non-TBB thread), taking about 45 seconds for 10 events.

@makortel

assign geometry

@cmsbuild

New categories assigned: geometry

@Dr15Jones,@civanch,@bsunanda,@makortel,@mdhildreth you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel

The step 2 (DIGI-HLT) took in total 240 seconds wall clock time and 220 seconds CPU time.

The configuration defines 901 EDModules, 208 ESModules, 57 (End)Paths, 116 Sequences, and 157 Tasks.

The startup phase (72 seconds CPU time) divides into

  • 5 seconds of reading configuration
  • 46 seconds in EventProcessor constructor
    • 27 seconds in StreamSchedule constructor, i.e. loading libraries, creating modules, registering data products (of which a lot of time is spent in cling), and all that
      • 12 seconds spent in Maker::makeModule, of which 8 seconds in creating edm::stream modules. This 8 seconds increases with the number of streams, but is run serially!
        • 3 seconds in L1TPFCaloProducer constructor
          • Most of the time seems to be spent in reading data from a ROOT file via l1tpf::corrector::init_()
        • 2 seconds in DTTrigPhase2Prod constructor
          • Most of the time seems to be spent in GlobalCoordsObtainer::generate_luts()
        • 2 seconds in l1tpf::PFClusterProducerFromHGC3DClusters constructor
          • Most of the time seems to be spent in creating cut parser, and creating expression parser in l1tpf::HGC3DClusterEgID constructor
        • 0.3 seconds in MixingModule constructor, rest is below that
    • 1 second constructing the source

@cms-sw/l1-l2 Is there anything that could be done to speed up the constructors of the aforementioned 3 L1T modules?
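
One possible mitigation, as a rough sketch with hypothetical class and cache names (not the actual modules): the expensive, immutable file reading could move into an edm::stream GlobalCache, so it runs once per job instead of once per stream:

// Hedged sketch (hypothetical names): expensive, immutable initialization
// moves from the per-stream constructor into a GlobalCache shared by all streams.
#include "FWCore/Framework/interface/stream/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include <memory>

struct CorrectorCache {
  // parsed calibration data, e.g. the TGraphs read from the ROOT file
};

class L1TPFCaloProducerCached : public edm::stream::EDProducer<edm::GlobalCache<CorrectorCache>> {
public:
  L1TPFCaloProducerCached(edm::ParameterSet const& iConfig, CorrectorCache const* cache) {
    // the per-stream constructor now only stores a pointer to the shared cache
  }
  static std::unique_ptr<CorrectorCache> initializeGlobalCache(edm::ParameterSet const& iConfig) {
    // runs once per job: open the calibration ROOT file and parse it here
    return std::make_unique<CorrectorCache>();
  }
  static void globalEndJob(CorrectorCache*) {}
  void produce(edm::Event&, edm::EventSetup const&) override {}
};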


Next is beginRun transition (including most, if not all, of EventSetup), about 100 seconds of CPU time, divided into

  • EventSetup
    • 58 seconds in MTDGeometricTimingDetESModule::produce() (which @Dr15Jones also noted above)
    • 7.0 seconds in getting the CondDB data
      • notably 4 seconds for SiPixelGainCalibrationOffline, rest are below 1 second
    • 4.3 seconds in magneticfield::VolumeBasedMagneticFieldESProducerFromDB::produce()
    • 4.3 seconds in XMLIdealGeometryESSource::produceGeom()
    • 4.0 seconds in HGCalParametersESModule::produce()
    • 3.3 seconds in TrackerDigiGeometryESModule::produce()
    • 2.7 seconds in CaloGeometryEP<EcalBarrelGeometry, DDCompactView>::produceAligned()
    • 2.0 seconds in DTGeometryESModule::produce()
    • 1.7 seconds in CSCGeometryESModule::produce()
    • 1.7 seconds in GEMGeometryESModule::produce()
    • 1.5 seconds in HcalParametersESModule::produce()
    • 1.2 seconds in HGCalGeometryESProducer::produce()
    • 1.1 seconds in HcalTPGCoderULUT::produce()
    • 1.0 seconds in RPCGeometryESModule::produce()
    • 0.6 seconds in MuonGeometryConstantsESModule::produce()
    • rest is smaller
  • EDModule beginRun
    • 3 seconds in trklet::TrackletEventProcessor::init()

Then comes the event data processing, 63 seconds for 10 events

  • 4.6 seconds in OutputModule writing the data
  • 56 seconds in EDModules
    • 45 seconds in MixingModule
      • 41 seconds in HGCDigiProducer::finalizeEvent()
      • 2.7 seconds in HcalDigiProducer::finalizeEvent()
      • 0.5 seconds in EcalDigiProducer::finalizeEvent()
    • rest:
[image: profile of the remaining EDModules]

@makortel

assign l1

@cmsbuild

New categories assigned: l1

@epalencia,@aloeliger you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel

The step 3 (RECO+PAT+VALIDATION+DQM) took 290 seconds wall clock time with 260 seconds of CPU time.

The configuration defines 5431(!) EDModules, 483 ESModules, 72 (End)Paths, 1291 Sequences, and 758 Tasks.


The startup phase took about 170 seconds wall clock time (150 seconds CPU time), divided into (in CPU time)

  • 70 seconds in reading the configuration
    • 50 seconds is spent in miniAOD_customizeAllMC() customization function
      • 32 seconds is spent in toModify()
        • 19 seconds in _add_jetsPuppi()
          • 19 seconds in applyDeepBtagging()
            • 17 seconds in setupBTagging()
              • 16 seconds in addToProcessAndTask()
        • 12 seconds in _add_metPuppi()
      • 18 seconds is spent in runMetCorAndUncForMiniAODProduction()
      • Based on the profile, most of the time seems to be spent in assignments like process.foo = cms.EDProducer(...), where process.foo already exists. In this case the framework goes and replaces the module in all Tasks/Sequences/Schedules (see the sketch after this list)
        • 35 seconds spent in _replaceInTasks()
        • 12 seconds spent in _replaceInSequences()
        • 1 second is spent in _replaceInSchedule()
        • 1 second is spent in _delattrFromSetattr()
  • 79 seconds in constructing EventProcessor
    • 39 seconds in constructing Schedule
      • 25 seconds in edm::Factory::makeModule()
        • 15 seconds in constructing EDModules
          • 5 seconds in constructing edm::stream EDModules. This 5 seconds increases with the number of streams, but is run serially!
            • 3 seconds is spent in MuonIdProducer constructor, in reading ROOT file via MuonCaloCompatibility::configure()
        • 9 seconds in loading shared libraries
    • 15 seconds in creating EventSetup modules
      • 13 seconds in loading the shared libraries
    • 13 seconds in creating source
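
A minimal sketch of the configuration pattern that triggers this slow path (hypothetical module names; the real cost scales with how many Tasks/Sequences reference the module):

import FWCore.ParameterSet.Config as cms

process = cms.Process("DEMO")
process.foo = cms.EDProducer("FooProducer")  # first assignment: cheap
process.p = cms.Path(process.foo)
# re-assignment to an already-set attribute: the configuration walks every
# Task/Sequence/Schedule (_replaceInTasks() etc.) to swap the old module for the new
process.foo = cms.EDProducer("FooProducer", threshold = cms.double(1.0))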

@cms-sw/xpog-l2 I found that the miniAOD_customizeAllMC() being expensive was reported already in #20220; maybe it is finally time to speed it up?


The beginRun transition took about 93 seconds of wall clock time (87 seconds CPU time), divided into (in CPU time)

  • EventSetup
    • 52 seconds in MTDGeometricTimingDetESModule::produce() (also noted earlier)
    • 4.5 seconds in XMLIdealGeometryESSource::produceGeom()
    • 4.4 seconds in magneticfield::VolumeBasedMagneticFieldESProducerFromDB::produce()
    • 3.6 seconds in HGCalParametersESModule::produce()
    • 2.7 seconds in CaloGeometryEP<EcalBarrelGeometry, DDCompactView>::produceAligned()
    • 1.7 seconds in GEMGeometryESModule::produce()
    • 1.6 seconds in TrackerDigiGeometryESModule::produce()
    • 1.3 seconds in HcalParametersESModule::produce()
    • 1.2 seconds in HGCalGeometryESProducer::produce()
    • 1.0 seconds in DTGeometryESModule::produce()
    • rest are below 1 second
  • EDModules
    • 4.2 seconds in edm::one EDModules
      • 2.3 seconds in cscdqm::Dispatcher::book()
      • 1.5 seconds in EcalDQMonitorTask::bookHistograms()
    • 1.5 seconds in edm::stream EDModules
    • 0.7 seconds in edm::global EDModules

Event data processing and shutdown took about 23 seconds in wall clock time (22.5 seconds in CPU time), divided into (in CPU time)

  • EventSetup
    • 6.7 seconds in getting the CondDB data
      • 3.7 seconds for SiPixelGainCalibrationOffline in SiPixelGainCalibrationOfflineRcd
      • 0.9 seconds for GBRForestD in GBRDWrapperRcd
      • 0.7 seconds for GBRForest in GBRWrapperRcd
      • rest is less
  • 5.5 seconds in EDModules
    • 0.7 seconds in edm::one modules (looks like DQM)
    • 4.7 seconds in edm::global and edm::stream modules (i.e. reconstruction etc, nothing seems to really smell here)
  • 3.6 seconds in DelayedReader (i.e. reading input data)
  • 2.5 seconds in output module writing data
  • 2.5 seconds in closing the output file

@makortel

assign xpog

@cmsbuild

New categories assigned: xpog

@vlimant,@simonepigazzini you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel commented Oct 20, 2023

assign db

For the cost of

143s in CondDBESSource to find the conditions for the first Run

in #43062 (comment)

@cmsbuild

New categories assigned: db

@francescobrivio,@saumyaphor4252,@perrotta,@consuegs you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel

+core

I think our part (in the analysis) is done

@davidlange6

re MuonOffsetESProducer and mtdNumberingGeometry - are these not just the unlikely module that builds the entire geometry? [which might be much faster if it were in the db]

@Dr15Jones

As a test I did

> time python3 -c 'import step3_RAW2DIGI_RECO_RECOSIM_PAT_VALIDATION_DQM'
45.100u 1.567s 0:49.83 93.6%    0+0k 24+0io 0pf+0w

Then in python I did

> python3 -i step3_RAW2DIGI_RECO_RECOSIM_PAT_VALIDATION_DQM.py
>>> import pickle
>>> pickle.dump(process, open('step3.pkl', 'wb'))

and then

[cdj@cmslpc-el8-heavy01 24896.0_CloseByPGun_CE_E_Front_120um+2026D98]$ time python3 -c 'import pickle; process = pickle.load(open("step3.pkl", "rb"));'
1.833u 0.153s 0:02.16 91.6%     0+0k 0+0io 0pf+0w

so using a pickle file is 25x faster
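
Assuming cmsRun only needs a top-level process object from the configuration file it executes, a minimal wrapper (hypothetical file name) would let the pickled configuration be used directly:

# step3_pickled.py -- hedged sketch: cmsRun picks up the top-level `process`
# object defined by the configuration file it is given
import pickle
process = pickle.load(open('step3.pkl', 'rb'))

and then run with cmsRun step3_pickled.py.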

@Dr15Jones

@davidlange6

re MuonOffsetESProducer and mtdNumberingGeometry - are these not just the unlikely module that builds the entire geometry?

it also seems that Phase 2 is still using the old DD instead of DD4Hep.

@davidlange6

I believe that is true - (so startup will get slower with DD4Hep I guess)

@makortel

@Dr15Jones asked for the flame graph for MTDGeometricTimingDetESModule::produce()

[image: flame graph of MTDGeometricTimingDetESModule::produce()]

@davidlange6

Looking at some of the others: putting a

theSiPixelGainCalibrationOffline.reserve(detUnitDimensions.first * (detUnitDimensions.second + detUnitDimensions.first / 80));

here

std::vector<char> theSiPixelGainCalibrationOffline;

probably helps a lot. (I don't really understand the resize(...) followed by a memcpy going on, so there might be another factor to gain in this code.)
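
As a self-contained illustration of why the reserve() matters (hypothetical dimensions, not the actual payload-filling code): without it, repeated push_back calls reallocate and copy the vector many times as it grows.

#include <cstddef>
#include <vector>

int main() {
  const std::size_t nDet = 1000, nCols = 416;  // hypothetical dimensions
  std::vector<char> payload;
  payload.reserve(nDet * nCols);  // one allocation up front
  for (std::size_t i = 0; i < nDet * nCols; ++i)
    payload.push_back(static_cast<char>(i & 0xff));  // no reallocations now
}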

@srimanob

@davidlange6

re MuonOffsetESProducer and mtdNumberingGeometry - are these not just the unlikely module that builds the entire geometry?

it also seems that Phase 2 is still using the old DD instead of DD4Hep.

Do we really gain anything in performance with DD4hep? That was also a question I got at CHEP, but I have never seen a clear result showing a gain.

By the way, we are trying to migrate. The last validation still showed a few issues, and we plan to redo it with the upcoming 13_3_0_pre4.

@Dr15Jones

@srimanob

Do we really gain something in performance with dd4hep?

My take is we'd rather spend time optimizing code using DD4Hep than spend the time on the obsolete DD.

@mmusich commented Oct 20, 2023

here
cmssw/CalibTracker/SiPixelESProducers/plugins/SiPixelFakeGainOfflineESSource.cc

This should not even be executed in phase2

@makortel

here
cmssw/CalibTracker/SiPixelESProducers/plugins/SiPixelFakeGainOfflineESSource.cc

This should not even be executed in phase2

I confirm that in this workflow the SiPixelGainCalibrationOffline comes from the CondDB via cond::persistency::PayloadProxy<SiPixelGainCalibrationOffline>::loadPayload() etc.

@mmusich commented Oct 20, 2023

I confirm that in this workflow the SiPixelGainCalibrationOffline comes from the CondDB

I thought we cleaned that here #42794

@makortel

I confirm that in this workflow the SiPixelGainCalibrationOffline comes from the CondDB

I thought we cleaned that here #42794

The reported behavior was in CMSSW_13_2_0_pre3 (the job used the 131X_mcRun4_realistic_v6 Global Tag), i.e. before that cleanup PR. So one message to @felicepantaleo and @waredjeb would be to move to a more recent CMSSW.

@aloeliger

The step 2 (DIGI-HLT) took in total 240 seconds wall clock time and 220 seconds CPU time. […]

@cms-sw/l1-l2 Is there anything that could be done to speed up the constructors of the aforementioned 3 L1T modules?

@gpetruc

Yours is the name I recall in connection with the correlator team and Phase2L1ParticleFlow (that is correct, right? My apologies if not). From my look at the L1TPFCaloProducer code, 3 correctors get produced from a string, a number, and a debug bool:

emCorrector_(iConfig.getParameter<std::string>("emCorrector"), -1, debug_),
hcCorrector_(iConfig.getParameter<std::string>("hcCorrector"), -1, debug_),
hadCorrector_(iConfig.getParameter<std::string>("hadCorrector"),
              iConfig.getParameter<double>("hadCorrectorEmfMax"),
              debug_),

Which calls this constructor here:

l1tpf::corrector::corrector(const std::string &filename, float emfMax, bool debug, bool emulate)
    : emfMax_(emfMax), emulate_(emulate) {
  if (!filename.empty())
    init_(filename, "", debug, emulate);
}
This calls init_ with two strings and two bools here:

void l1tpf::corrector::init_(const std::string &filename, const std::string &directory, bool debug, bool emulate) {
  std::string resolvedFileName = filename;
  if (filename[0] != '/') {
#ifdef CMSSW_GIT_HASH
    resolvedFileName = edm::FileInPath(filename).fullPath();
#else
    resolvedFileName = std::filesystem::absolute(filename);
#endif
  }
  TFile *lFile = TFile::Open(resolvedFileName.c_str());
  if (!lFile || lFile->IsZombie()) {
#ifdef CMSSW_GIT_HASH
    throw cms::Exception("Configuration", "cannot read file " + filename);
#else
    throw std::runtime_error("Cannot read file " + filename);
#endif
  }
  TDirectory *dir = directory.empty() ? lFile : lFile->GetDirectory(directory.c_str());
  if (!dir) {
#ifdef CMSSW_GIT_HASH
    throw cms::Exception("Configuration", "cannot find directory '" + directory + "' in file " + filename);
#else
    throw std::runtime_error("Cannot find directory '" + directory + "' in file " + filename);
#endif
  }
  init_(dir, debug);
  if (emulate)
    initEmulation_(dir, debug);
  lFile->Close();
}

This is mostly file checks which eventually calls another init_ here:
void l1tpf::corrector::init_(TDirectory *lFile, bool debug) {
  TH1 *index = (TH1 *)lFile->Get("INDEX");
  if (!index) {
#ifdef CMSSW_GIT_HASH
    throw cms::Exception("Configuration")
        << "invalid input file " << lFile->GetPath() << ": INDEX histogram not found.\n";
#else
    std::stringstream ss;
    ss << "invalid input file " << lFile->GetPath() << ": INDEX histogram nit found.\n";
    throw std::runtime_error(ss.str());
#endif
  }
  index_.reset((TH1 *)index->Clone());
  index_->SetDirectory(nullptr);
  is2d_ = index_->InheritsFrom("TH2");
  std::unordered_map<std::string, TGraph *> graphs;
  TKey *key;
  TIter nextkey(lFile->GetListOfKeys());
  while ((key = (TKey *)nextkey())) {
    if (strncmp(key->GetName(), "eta_", 4) == 0) {
      TGraph *gr = (TGraph *)key->ReadObj();
      if (!gr->TestBit(TGraph::kIsSortedX))
        gr->Sort();
      graphs[key->GetName()] = gr;
    }
  }
  neta_ = index_->GetNbinsX();
  nemf_ = (is2d_ ? index_->GetNbinsY() : 1);
  corrections_.resize(neta_ * nemf_);
  std::fill(corrections_.begin(), corrections_.end(), nullptr);
  char buff[32];
  for (unsigned int ieta = 0; ieta < neta_; ++ieta) {
    for (unsigned int iemf = 0; iemf < nemf_; ++iemf) {
      if (is2d_) {
        snprintf(buff, 31, "eta_bin%d_emf_bin%d", ieta + 1, iemf + 1);
      } else {
        snprintf(buff, 31, "eta_bin%d", ieta + 1);
      }
      TGraph *graph = graphs[buff];
      if (debug) {
#ifdef CMSSW_GIT_HASH
        edm::LogPrint("corrector") << " eta bin " << ieta << " emf bin " << iemf << " graph " << buff
                                   << (graph ? " <valid>" : " <nil>") << "\n";
#else
        std::cout << " eta bin " << ieta << " emf bin " << iemf << " graph " << buff
                  << (graph ? " <valid>" : " <nil>") << "\n";
#endif
      }
      if (graph) {
        corrections_[ieta * nemf_ + iemf] = (TGraph *)graph->Clone();
      }
      if (std::abs(index_->GetXaxis()->GetBinCenter(ieta + 1)) > 3.0) {
        break;  // no EMF bins beyond eta = 3
      }
    }
  }
}
and if the emulation flag is set, it calls another initEmulation_ here:

void l1tpf::corrector::initEmulation_(TDirectory *lFile, bool debug) {
  std::unordered_map<std::string, TH1 *> hists;
  TKey *key;
  TIter nextkey(lFile->GetListOfKeys());
  while ((key = (TKey *)nextkey())) {
    if (strncmp(key->GetName(), "emul_eta", 8) == 0) {
      TH1 *hist = (TH1 *)key->ReadObj();
      hists[key->GetName()] = hist;
    }
  }
  neta_ = index_->GetNbinsX();
  correctionsEmulated_.resize(neta_);
  std::fill(correctionsEmulated_.begin(), correctionsEmulated_.end(), nullptr);
  char buff[32];
  for (unsigned int ieta = 0; ieta < neta_; ++ieta) {
    snprintf(buff, 31, "emul_eta_bin%d", ieta + 1);
    TH1 *hist = hists[buff];
    if (debug)
#ifdef CMSSW_GIT_HASH
      edm::LogPrint("corrector") << " eta bin " << ieta << " hist " << buff << (hist ? " <valid>" : " <nil>") << "\n";
#else
      std::cout << " eta bin " << ieta << " hist " << buff << (hist ? " <valid>" : " <nil>") << "\n";
#endif
    if (hist) {
      correctionsEmulated_[ieta] = (TH1 *)hist->Clone();
      correctionsEmulated_[ieta]->SetDirectory(nullptr);
    }
  }
}

Notably, each of the init functions does some loop through the TFile keys, which would be my first instinct to check for the slowdown in this constructor.

Would it be possible to ask you, or one of the people responsible for Phase2L1ParticleFlow, to take a look at this and see where the slowdown occurs and whether it could be reworked?

@NTrevisani

You are listed as having most recently added this line:

globalcoordsobtainer_->generate_luts();
to DTTrigPhase2Prod

Which in turn calls a for loop doing a long list of mathematical functions here (listed with the atan calculation):

void GlobalCoordsObtainer::generate_luts() {
  for (auto& global_constant : global_constants) {
    int sgn = 1;
    DTChamberId ChId(global_constant.chid);
    // typical hasPosRF function
    if (ChId.wheel() > 0 || (ChId.wheel() == 0 && ChId.sector() % 4 > 1)) {
      sgn = -1;
    }
    auto phi1 = calc_atan_lut(12, 6, (1. / 16) / (global_constant.sl1.perp * 10),
                              global_constant.sl1.x_phi0 / global_constant.sl1.perp,
                              1. / std::pow(2, 17), 10, 3, 12, 20, sgn);
    auto phi3 = calc_atan_lut(12, 6, (1. / 16) / (global_constant.sl3.perp * 10),
                              global_constant.sl3.x_phi0 / global_constant.sl3.perp,
                              1. / std::pow(2, 17), 10, 3, 12, 20, sgn);
    double max_x_phi0 = global_constant.sl1.x_phi0;
    if (global_constant.sl3.x_phi0 > max_x_phi0) {
      max_x_phi0 = global_constant.sl3.x_phi0;
    }
    auto phic = calc_atan_lut(12, 6, (1. / 16) / ((global_constant.sl1.perp + global_constant.sl3.perp) / .2),
                              max_x_phi0 / ((global_constant.sl1.perp + global_constant.sl3.perp) / 2),
                              1. / std::pow(2, 17), 10, 3, 12, 20, sgn);
    auto phib = calc_atan_lut(9, 6, 1. / 4096, 0., 4. / std::pow(2, 13), 10, 3, 10, 16, sgn);
    luts[global_constant.chid] = {phic, phi1, phi3, phib};
  }
}
std::map<int, lut_value> GlobalCoordsObtainer::calc_atan_lut(int msb_num, int lsb_num, double in_res,
                                                             double abscissa_0, double out_res, int a_extra_bits,
                                                             int b_extra_bits, int a_size, int b_size, int sgn) {
  // Calculates the coefficients needed to calculate the arc-tan function in fw
  // by doing a piece-wise linear approximation.
  // In fw, input (x) and output (y) are integers, these conversions are needed
  // t = x*in_res - abscissa_0
  // phi = arctan(t)
  // y = phi/out_res
  // => y = arctan(x*in_res - abcissa_0)/out_res
  // The linear function is approximated as
  // y = a*x_lsb + b
  // Where a, b = func(x_msb) are the coefficients stored in the lut
  // a is stored as unsigned, b as signed, with their respective sizes a_size, b_size,
  // previously shifted left by a_extra_bits and b_extra_bits, respectively
  long int a_min = -std::pow(2, (a_size - 1));
  long int a_max = std::pow(2, (a_size - 1)) - 1;
  long int b_min = -std::pow(2, (b_size - 1));
  long int b_max = std::pow(2, (b_size - 1)) - 1;
  std::map<int, lut_value> lut;
  for (long int x_msb = -(long int)std::pow(2, msb_num - 1); x_msb < (long int)std::pow(2, msb_num - 1); x_msb++) {
    int x1 = ((x_msb) << lsb_num);
    int x2 = ((x_msb + 1) << lsb_num) - 1;
    double t1 = x1 * in_res - abscissa_0;
    double t2 = x2 * in_res - abscissa_0;
    double phi1 = sgn * atan(t1);
    double phi2 = sgn * atan(t2);
    double y1 = phi1 / out_res;
    double y2 = phi2 / out_res;
    // we want to find a, b so that the error in the extremes is the same as the error in the center
    // so the error in the extremes will be the same, so the "a" is determined by those points
    double a = (y2 - y1) / (x2 - x1);
    // by derivating the error function and equaling to 0, you get this is the point
    // towards the interval center with the highest error
    // err_f = y - (a*x+b) = sgn*arctan(x*in_res - abcissa_0)/out_res - (a*x+b)
    // d(err_f)/dx = sgn*(1/(1+(x*in_res - abcissa_0)^2))*in_res/out_res - a
    // d(err_f)/dx = 0 => x_max_err = (sqrt(in_res/out_res/a-1) + abscissa_0)/in_res
    // There is sign ambiguity in the sqrt operation. The sqrt has units of t (adimensional).
    // It is resolved by setting the sqrt to have the same sign as (t1+t2)/2
    double t_max_err = sqrt(sgn * in_res / out_res / a - 1);
    if ((t1 + t2) < 0) {
      t_max_err *= -1;
    }
    double x_max_err = (t_max_err + abscissa_0) / in_res;
    double phi_max_err = sgn * atan(t_max_err);
    double y_max_err = phi_max_err / out_res;
    // once you have the point of max error, the "b" parameter is chosen as the average between
    // those two numbers, which makes the error at the center be equal in absolute value
    // to the error in the extremes
    // units: rad
    double b = (y1 + y_max_err - a * (x_max_err - x1)) / 2;
    // increase b in 1/2 of y_lsb, so that fw truncate operation on the of the result
    // is equivalent to a round function instead of a floor function
    b += 0.5;
    // shift left and round
    long int a_int = (long int)(round(a * (pow(2, a_extra_bits))));
    long int b_int = (long int)(round(b * (pow(2, b_extra_bits))));
    // tuck a, b constants into the bit size of the output (un)signed integer
    std::vector<long int> as = {a_min, a_int, a_max};
    std::vector<long int> bs = {b_min, b_int, b_max};
    std::sort(as.begin(), as.end());
    std::sort(bs.begin(), bs.end());
    a_int = as.at(1);
    b_int = bs.at(1);
    // // convert a, b to two's complement
    // auto a_signed = a_int % (long int) (pow(2, a_size));
    // auto b_signed = b_int % (long int) (pow(2, b_size));
    // convert x_msb to two's complement signed
    int index = to_two_comp(x_msb, msb_num);
    lut[index] = {a_int, b_int};
  }
  return lut;
}

which seems a primary candidate for the slowdown in this function. Could you, or someone responsible for this code, look into its performance and see if it could be reworked? If not, would it maybe be possible to cache or look up the results of these math functions/loops? At a quick pass they seem to be calculated against some constants; a rough sketch of the caching idea follows.
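
// Hedged sketch (hypothetical helper, not existing CMSSW code): memoize
// calc_atan_lut results so identical argument sets are computed only once;
// lut_value and calc_atan_lut are the ones from the code above.
#include <map>
#include <tuple>

std::map<int, lut_value> GlobalCoordsObtainer::calc_atan_lut_cached(
    int msb_num, int lsb_num, double in_res, double abscissa_0, double out_res,
    int a_extra_bits, int b_extra_bits, int a_size, int b_size, int sgn) {
  using Key = std::tuple<int, int, double, double, double, int, int, int, int, int>;
  // a static cache is tolerable here only because module construction is
  // serialized (as noted above); concurrent use would need a mutex
  static std::map<Key, std::map<int, lut_value>> cache;
  Key key{msb_num, lsb_num, in_res, abscissa_0, out_res,
          a_extra_bits, b_extra_bits, a_size, b_size, sgn};
  auto it = cache.find(key);
  if (it == cache.end())
    it = cache.emplace(key, calc_atan_lut(msb_num, lsb_num, in_res, abscissa_0, out_res,
                                          a_extra_bits, b_extra_bits, a_size, b_size, sgn)).first;
  return it->second;
}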

@rekovic & @cerminar

You are listed as authors on PFClusterProducerFromHGC3DClusters and HGC3DClusterEgID. For this one I cannot find a likely candidate for the slow-down, other than perhaps the instantiation of the TMVA and the creation of another corrector (see the first section).

Could you, or someone responsible for this code, look into the performance of this and see if it could be reworked?

@gpetruc commented Oct 20, 2023 via email

@vlimant commented Oct 24, 2023

it is on our to-do list to review how PAT is constructed and simplify it together with dropping miniAOD_customizeAllMC

@Dr15Jones

As an additional test, I dumped the data associated to the global tag used by the job into an HDF5 file by doing

conddb2hdf5.py --compressor zlib --output 133X_mcRun4_realistic_v1.h5conf 133X_mcRun4_realistic_v1

As the tool only handles full global tags and not individual tags, I changed the cmsRun configuration to read the global tag from the HDF5 file but had the individual tags still come from CondDBESSource. The change to the configuration was the addition of the following lines

process.GlobalTag.globaltag = ""
process.H5Tag = cms.ESSource("CondHDF5ESSource", filename = cms.untracked.string("133X_mcRun4_realistic_v1.h5conf"), globalTag = cms.string("133X_mcRun4_realistic_v1"),
                             excludeRecords = cms.vstring("SiPixelGenErrorDBObjectRcd", "SiPixelLorentzAngleRcd",
                                                                    "SiPixelLorentzAngleSimRcd","SiPixelTemplateDBObjectRcd",
                                                                    "TrackerAlignmentErrorExtendedRcd",
                                                                    "TrackerAlignmentRcd",
                                                                    "TrackerSurfaceDeformationRcd"))

Running the original job at FNAL, it took 140s for the job to retrieve all the IOVs during the begin Run phase. With this change, it now takes 1.5s.

@VinInn commented Oct 25, 2023

So for local tests one can first dump the global tag and then run.
Since which version is H5Tag supported?

@Dr15Jones

The CondCore/CondHDF5ESSource package was first introduced in CMSSW_13_3_0_pre2. See #42431.
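
So a local two-step flow (hedged sketch, assuming a release with that package and a configuration modified as in the earlier comment) would be:

# dump the global tag to HDF5 once
conddb2hdf5.py --compressor zlib --output 133X_mcRun4_realistic_v1.h5conf 133X_mcRun4_realistic_v1
# then run jobs with the CondHDF5ESSource lines added to the configuration
cmsRun step3_RAW2DIGI_RECO_RECOSIM_PAT_VALIDATION_DQM.py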

@davidlange6

@Dr15Jones - what is the role of excludeRecords here? Just the ability to skip records known not to be needed? Is that responsible for much of the (very interesting..) speedup?

@Dr15Jones

what is the role of excludeRecords here? Just an ability to skip known to be not needed records? Is that responsible for much of the (very interesting..) speedup

For the configuration given, the CondDBESSource overrides the GlobalTag for those records and gives explicit tags to be used (see below). The CondHDF5ESSource (and its accompanying scripts) only handles using a GlobalTag (it could be extended to handle explicit tags as well), so I needed to still get these explicit tags (and their associated Records) from CondDBESSource. In order to avoid having HDF5 and the DB fight each other over which one would deliver the Record (one could use an ESAlias to explicitly set which one wins), I decided to just have the CondHDF5ESSource not deliver those Records at all.

    toGet = cms.VPSet(
        cms.PSet(
            connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
            record = cms.string('SiPixelGenErrorDBObjectRcd'),
            snapshotTime = cms.string('2021-04-17 20:00:00'),
            tag = cms.string('SiPixelGenErrorDBObject_phase2_IT_v7.0.2_25x100_v2_mc')
        ),
        cms.PSet(
            connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
            record = cms.string('SiPixelLorentzAngleRcd'),
            snapshotTime = cms.string('2021-03-16 20:00:00.000'),
            tag = cms.string('SiPixelLorentzAngle_phase2_T25_v0_mc')
        ),
        cms.PSet(
            connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
            label = cms.untracked.string('forWidth'),
            record = cms.string('SiPixelLorentzAngleRcd'),
            snapshotTime = cms.string('2020-02-23 14:00:00.000'),
            tag = cms.string('SiPixelLorentzAngle_phase2_T19_mc_forWidthEmpty')
        ),
        cms.PSet(
            connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
            label = cms.untracked.string('fromAlignment'),
            record = cms.string('SiPixelLorentzAngleRcd'),
            snapshotTime = cms.string('2020-02-23 14:00:00.000'),
            tag = cms.string('SiPixelLorentzAngle_phase2_T19_mc_forWidthEmpty')
        ),
        cms.PSet(
            connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
            record = cms.string('SiPixelLorentzAngleSimRcd'),
            snapshotTime = cms.string('2021-03-16 20:00:00.000'),
            tag = cms.string('SiPixelSimLorentzAngle_phase2_T25_v0_mc')
        ),
        cms.PSet(
            connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
            record = cms.string('SiPixelTemplateDBObjectRcd'),
            snapshotTime = cms.string('2021-04-17 20:00:00'),
            tag = cms.string('SiPixelTemplateDBObject_phase2_IT_v7.0.2_25x100_v2_mc')
        ),
        cms.PSet(
            connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
            record = cms.string('TrackerAlignmentErrorExtendedRcd'),
            snapshotTime = cms.string('2023-03-16 15:30:00'),
            tag = cms.string('TrackerAlignmentErrorsExtended_Upgrade2026_T25_design_v0')
        ),
        cms.PSet(
            connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
            record = cms.string('TrackerAlignmentRcd'),
            snapshotTime = cms.string('2023-03-16 15:30:00'),
            tag = cms.string('TrackerAlignment_Upgrade2026_T25_design_v0')
        ),
        cms.PSet(
            connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
            record = cms.string('TrackerSurfaceDeformationRcd'),
            snapshotTime = cms.string('2023-03-16 15:30:00'),
            tag = cms.string('TrackerSurfaceDeformations_Upgrade2026_Zero')
        )
    )

@fabiocos

for the MTD geometry construction part, please have a look at #43124. The DD4hep version is definitely less problematic as far as MTD is concerned, but it is not yet the production default. Profiling 24807.911 with IgProf:

    5.7      20.92  MTDGeometricTimingDetESModule::produce(IdealGeometryRecord const&) [122]
    5.7      20.92  DDCmsMTDConstruction::construct(cms::DDCompactView const&) [123]

@fwyzard commented Oct 26, 2023

so using a pickle file is 25x faster

@Dr15Jones I wasn't able to understand... is the pickled file faster than the original configuration built by cmsDriver.py, or faster than the fully expanded one from edmConfigDump?

@Dr15Jones

I wasn't able to understand... is the pickled file faster than the original configuration built by cmsDriver.py, or faster than the fully expanded one from edmConfigDump?

Faster than the original one from cmsDriver.py.

@civanch commented Nov 6, 2023

+geometry

It seems that the main geometry problem, connected with MTD, is fixed.

@Dr15Jones commented Nov 8, 2023

If I use CMSSW_13_3_X_2023-11-08-1100, a pickled configuration file, and the HDF5 conditions code, then for an 8-thread job (which can do some initialization in parallel) it takes 68 seconds to go from starting the job to the first event. This job processed 100 events.

[image: job timing view]

[The image is from the new tracer log viewer application]
