Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

igprof pp profiling outputs no longer work #33297

Closed
jpata opened this issue Mar 30, 2021 · 48 comments
Closed

igprof pp profiling outputs no longer work #33297

jpata opened this issue Mar 30, 2021 · 48 comments

Comments

@jpata
Copy link
Contributor

jpata commented Mar 30, 2021

There seems to be an issue with igprof pp (CPU profiling) in 11_3_X between pre4 and pre5, starting at least from March 14. The stack in 11_3_X is no longer capturing main properly and the cumulative number is not meaningful. @slava77 pointed it out to me yesterday. Things look fine in 11_2_X.

11_3_X: https://cmssdt.cern.ch/SDT/cgi-bin/igprof-navigator/CMSSW_11_3_X_2021-03-14-0000/slc7_amd64_gcc900/pp/23434.21_TTbar_14TeV+2026D49PU_ProdLike+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTriggerPU+RecoGlobalPU+MiniAODPU/step3___580bd4d7c9a1fca4d329fb1738ee817d___10_EndOfJob
11_2_X: https://cmssdt.cern.ch/SDT/cgi-bin/igprof-navigator/CMSSW_11_2_X_2021-03-14-0000/slc7_amd64_gcc900/pp/23434.21_TTbar_14TeV+2026D49PU_ProdLike+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTriggerPU+RecoGlobalPU+MiniAODPU/step3___580bd4d7c9a1fca4d329fb1738ee817d___10_EndOfJob

In mattermost, Daniel Sherman Riley pointed to this as a possible culprit: #32804

Perhaps something needs to be adjusted on the igprof side, but it'd be nice to get the profiles working again.

@cmsbuild
Copy link
Contributor

A new Issue was created by @jpata Joosep Pata.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor

assign core

@cmsbuild
Copy link
Contributor

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor

I agree with @dan131riley that #32804 is the likely culprit (given that we saw TBB to "move stack between threads", I'd easily believe IgProf to get confused by that, among others).

@makortel
Copy link
Contributor

@jpata Is this specific to pp? Or in other words, do mp profiles work properly?

@Dr15Jones
Copy link
Contributor

The “moving between threads” should also have been stopped when I later attend the use of a tbb::task_arena call.

@makortel
Copy link
Contributor

makortel commented Mar 31, 2021

I poked around between 11_2_X and 11_3_X IgProf reports and noticed something

In CMSSW_11_2_X_2021-03-30-2300 23434.21

But in CMSSW_11_3_X_2021-03-30-2300 23434.21

Could it be that in 11_3_X the 10_EndOfJob files somehow get overridden by the 1 files? @gartung Could you take a look?

In the GEN-SIM job (TTbar_14TeV_TuneCP5...) the 1 and 10_EndOfJob results look different, in step4 they look identical.

@makortel
Copy link
Contributor

Thanks Joosep. Perhaps the symptom should be interpreted then as time accounting effectively being lost some time before the capture point of 1 profiles, and nothing being accumulated after that.

It's still curious that in the GEN-SIM job time does get added after capture point of 1 profile.

@gartung
Copy link
Member

gartung commented Mar 31, 2021

The output is completely controlled by the the runTheMatrix.py switch --profile prof which enables igprof as a service.
https://github.com/cms-sw/cms-bot/blob/9cab720ce0ad62820694846d1153b03f48d21ebe/run-ib-igprof#L19
For the memory profiling the commands are generated with --dryRun and then cmsRunGlibC is used instead.

@gartung
Copy link
Member

gartung commented Mar 31, 2021

The output name of the sqlite files is derived from the igprof service output name here
https://github.com/cms-sw/cms-bot/blob/9cab720ce0ad62820694846d1153b03f48d21ebe/run-ib-igprof#L41

@gartung
Copy link
Member

gartung commented Mar 31, 2021

Luckily I kept the original gzipped tarfile output from igprof. I you click on Logs from the igprof top page you get a directory listing like this where you can download the original igprof output.
https://cmssdt.cern.ch/SDT/jenkins-artifacts/igprof/CMSSW_11_3_X_2021-03-30-2300/slc7_amd64_gcc900/pp/23434.21_TTbar_14TeV+2026D49PU_ProdLike+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTriggerPU+RecoGlobalPU+MiniAODPU/

@gartung
Copy link
Member

gartung commented Mar 31, 2021

If Igprof is doing something strange you should be able to investigate with the original output.

@slava77
Copy link
Contributor

slava77 commented Mar 31, 2021

@gartung
Copy link
Member

gartung commented Mar 31, 2021

The run-ib-profining job run on cms-oc-gpu-01 while the run-ib-igprof job is run on cmsprofile-01 or cmsprofile-02 which are 4 core VM's. The release-run-reco-profling jobs are run on vocms11 which is a 32 core bare metal machine.
Since Igprof is using sampling some of the calls might be missed on cms-oc-gpu-01 because of single thread frequency scaling.

@gartung
Copy link
Member

gartung commented Mar 31, 2021

I would like to move the run-ib-profiling job and the run-ib-igprof jobs to vocms11 so that the hardware is the same when making comparisons.

@jpata
Copy link
Contributor Author

jpata commented Apr 1, 2021

I would like to move the run-ib-profiling job and the run-ib-igprof jobs to vocms11 so that the hardware is the same when making comparisons.

@gartung No problem for reco side, I think! Feel free to do what you need in jenkins. Just not to side-track this issue here, we can discuss in mattermost in case of need.

@makortel
Copy link
Contributor

makortel commented Apr 5, 2021

Apparently we still observe stacks being moved between threads #33289 (comment), I would guess that to confuse IgProf.

@dan131riley
Copy link

I'm running some tests now that get the thread-id at the top of main() and assert at then end that it hasn't changed. No hits so far, so if this is happening I don't think it is common enough to account for the igrpof changes.

@gartung
Copy link
Member

gartung commented Apr 7, 2021

I noticed that the runTheMatrix.py --profile generates the command
igprof -t cmsRun
which is to contrain which process is profiled.
The scripts in the cms-cmpwg/profiling repo do not use this option.
If I compare the profile produced with the -t option
to the profile produced without the -t option
The percentage for main is higher although the tbb threads still have a higher percentage.

@gartung
Copy link
Member

gartung commented Apr 9, 2021

I think I might know what is causing the time difference.
Could this call to rdtsc
https://github.com/oneapi-src/oneTBB/blob/9e15720bc7744f85dff611d34d65e9099e077da4/src/tbb/scheduler_common.h#L188
be interfering with this call to rdtsc
https://github.com/igprof/igprof/blob/03d16c15bd970256affcfde9e510a8404856b233/src/macros.h#L70
The assembly instruction rdtsc loads the current value of the processor's time-stamp counter into the EDX:EAX registers.
https://en.wikipedia.org/wiki/Time_Stamp_Counter
This is used by igprof to get the number of tics
https://github.com/igprof/igprof/blob/03d16c15bd970256affcfde9e510a8404856b233/src/profile-calls.cc#L70

@gartung
Copy link
Member

gartung commented Apr 9, 2021

@gartung
Copy link
Member

gartung commented Apr 9, 2021

@gartung
Copy link
Member

gartung commented Apr 9, 2021

In igprof the ticks are returned by the add function which is used to add the time in various memory allocation calls
https://github.com/igprof/igprof/blob/03d16c15bd970256affcfde9e510a8404856b233/src/profile-calls.cc#L58

@gartung
Copy link
Member

gartung commented Apr 9, 2021

I think changing the rdtsc calls to libc's clock calls could avoid the problem
https://www.gnu.org/software/libc/manual/html_node/CPU-Time.html

@gartung
Copy link
Member

gartung commented Apr 9, 2021

I tried this patch with igprof and generated the text output
https://gartung.web.cern.ch/gartung/igprof.cmsRun.24441.1618006530.571135.txt
The main function still only shows 60% cumulative. Looking down the file to “Call tree profile (cumulative)” I see

Call tree profile (cumulative)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Rank    % total       Self       Self / Children   Function
[1]       100.0      87.76       0.00 / 87.76      <spontaneous>
           66.0  .........      57.91 / 57.91        _start [4]
           33.4  .........      29.29 / 29.29        __start_context [20]
            0.5  .........       0.41 / 0.41         clone [723]
            0.1  .........       0.10 / 0.13         PFDisplacedVertexCandidateFinder::associate(std::_List_iterator<edm::RefToBase<reco::Track> >, std::_List_iterator<edm::RefToBase<reco::Track> >, reco::PFDisplacedVertexCandidate&) [clone .localalias] [1362]
            0.0  .........       0.01 / 0.01         @{libz.so.1.2.8+78012} [4193]
            0.0  .........       0.00 / 0.00         @?0x33706574{<dynamically-generated>} [5825]
            0.0  .........       0.00 / 0.00         @{libz.so.1.2.8+77964} [5826]
            0.0  .........       0.00 / 0.00         @{libz.so.1.2.8+77995} [5827]
            0.0  .........       0.00 / 0.00         @{libz.so.1.2.8+77999} [5828]
            0.0  .........       0.00 / 0.00         @{libz.so.1.2.8+78007} [5829]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
            0.0  .........       0.00 / 0.25         tbb::internal::generic_scheduler::local_spawn_root_and_wait(tbb::task*, tbb::task*&) [988]
            0.0  .........       0.00 / 0.36         edm::eventsetup::synchronousEventSetupForInstance(edm::IOVSyncValue const&, tbb::task_group&, edm::eventsetup::EventSetupsController&) [806]
            0.0  .........       0.03 / 0.03         tbb::internal::task_group_base::wait() [3225]
           33.4  .........      29.29 / 29.29        tbb::internal::co_local_wait_for_all(void*) [21]
           37.2  .........      32.66 / 32.66        edm::EventProcessor::processLumis(std::shared_ptr<void> const&) [11]
[2]        70.6      62.00       0.00 / 62.00      tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*)
           70.6  .........      62.00 / 62.00        tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop(tbb::internal::context_guard_helper<false>&, tbb::task*, long) [3]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
           70.6  .........      62.00 / 62.00        tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) [2]
[3]        70.6      62.00       0.00 / 62.00      tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop(tbb::internal::context_guard_helper<false>&, tbb::task*, long)
           35.7  .........      31.36 / 31.36        tbb::internal::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute() [16]
           34.7  .........      30.45 / 30.45        tbb::internal::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute() [19]
            0.2  .........       0.16 / 0.16         tbb::internal::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute() [1221]
            0.0  .........       0.01 / 0.01         tbb::internal::function_task<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() [4779]
            0.0  .........       0.01 / 0.01         tbb::internal::function_task<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() [4789]
            0.0  .........       0.00 / 0.00         tbb::interface9::internal::start_for<tbb::blocked_range<unsigned int>, tbb::internal::parallel_for_body<std::function<void (unsigned int)>, unsigned int>, tbb::auto_partitioner const>::execute() [7092]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
           66.0  .........      57.91 / 87.76        <spontaneous> [1]
[4]        66.0      57.91       0.00 / 57.91      _start
           66.0  .........      57.91 / 57.91        __libc_start_main [5]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
           66.0  .........      57.91 / 57.91        _start [4]
[5]        66.0      57.91       0.00 / 57.91      __libc_start_main
           66.0  .........      57.91 / 57.91        main [6]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
           66.0  .........      57.91 / 57.91        __libc_start_main [5]
[6]        66.0      57.91       0.00 / 57.91      main
           66.0  .........      57.91 / 57.91        main::{lambda()#1}::operator()() const [7]

and lower down at [20] I see the __start_context where the “missing” cumulative total from spontaneous goes and the main::lambda eventually appears

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
           33.4  .........      29.29 / 87.76        <spontaneous> [1]
[20]       33.4      29.29       0.00 / 29.29      __start_context
           33.4  .........      29.29 / 29.29        tbb::internal::co_local_wait_for_all(void*) [21]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Rank    % total       Self       Self / Children   Function
           33.4  .........      29.29 / 29.29        __start_context [20]
[21]       33.4      29.29       0.00 / 29.29      tbb::internal::co_local_wait_for_all(void*)
           33.4  .........      29.29 / 62.00        tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) [2]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
           24.8  .........      21.78 / 55.16        tbb::interface7::internal::delegated_function<main::{lambda()#1}::operator()() const::{lambda()#1} const, void>::operator()() const [9]
[22]       24.8      21.78       0.00 / 21.78      edm::EventProcessor::EventProcessor(std::shared_ptr<edm::ProcessDesc>, edm::ServiceToken const&, edm::serviceregistry::ServiceLegacy)
           24.8  .........      21.78 / 21.78        edm::EventProcessor::init(std::shared_ptr<edm::ProcessDesc>&, edm::ServiceToken const&, edm::serviceregistry::ServiceLegacy) [23]

@gartung
Copy link
Member

gartung commented Apr 9, 2021

A similar breakdown of cumulative percentages can be seen in this text output
https://cmssdt.cern.ch/SDT/jenkins-artifacts/igprof/CMSSW_11_3_X_2021-04-09-1100/slc7_amd64_gcc900/profiling/23434.21/RES_CPU_step3.txt

The difference seems to be that igprof-analyze or igprof-navigator to not show spontaneous as a valid base node.

@gartung
Copy link
Member

gartung commented Apr 10, 2021

The function tbb::internal::co_local_wait_for_all is not connected to main in call stacks and needs to be added to main to get the total.

@gartung
Copy link
Member

gartung commented Apr 10, 2021

Searching co_local_wait_for_all in the cmssw repo shows 4 issues where co_local_wait_for_all appears 2 or third from the bottom of stack traces. It appears after __start_context which is why it is not linked to main,
https://github.com/cms-sw/cmssw/search?q=co_local_wait_for_all&type=issues

@gartung
Copy link
Member

gartung commented Apr 12, 2021

Besides __start_context there is a second glibc function that connects to co_local_wait_for_all
__correctly_grouped_prefixwc

@gartung
Copy link
Member

gartung commented Apr 12, 2021

For instance _start_context calls __correctly_grouped_prefixwc which calls co_local_wait_for_all
http://cms-reco-profiling.web.cern.ch/cms-reco-profiling/cgi-bin/igprof-navigator/releases/CMSSW_11_3_0_pre5/23434.21/step3/cpu_endjob/12
which calls tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*)
http://cms-reco-profiling.web.cern.ch/cms-reco-profiling/cgi-bin/igprof-navigator/releases/CMSSW_11_3_0_pre5/23434.21/step3/cpu_endjob/3

@gartung
Copy link
Member

gartung commented Apr 12, 2021

I found a second instance of local_wait_for_all in the text report as indicated by '2 added to the end

            0.0  .........       0.00 / 0.00         tbb::interface7::internal::delegated_function<ROOT::TThreadExecutor::ParallelFor(unsigned int, unsigned int, unsigned int, std::function<void (unsigned int)> const&)::{lambda()#1}::operator()() const::{lambda()#1} const, void>::operator()() const [7472]
            0.3  .........       0.25 / 0.25         tbb::interface7::internal::delegated_function<HGCalCLUEAlgoT<HGCalLayerTilesT<HGCalTilesConstants> >::makeClusters()::{lambda()#1} const, void>::operator()() const [998]
[988]       0.3       0.25       0.00 / 0.25       tbb::internal::generic_scheduler::local_spawn_root_and_wait(tbb::task*, tbb::task*&)
            0.3  .........       0.25 / 0.25         tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*)'2 [1000]
            0.0  .........       0.00 / 62.32        tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) [2]

which does not appear in the sql3 database used by igprof0-navigator.

@gartung
Copy link
Member

gartung commented Apr 15, 2021

Perf report also shows less than 100%

image

@gartung
Copy link
Member

gartung commented Apr 15, 2021

perf text report

+   61.18%     0.00%  cmsRun           libtbb_preview.so.2                                             [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop
+   60.35%     0.00%  cmsRun           libtbb_preview.so.2                                             [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all
+   53.37%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::convertException::wrap<bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTra
+   53.32%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >
+   53.15%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >
+   52.45%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::EventProcessor::processLumis
+   52.31%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::EventProcessor::runToCompletion
+   50.64%     0.00%  cmsRun           libtbb_preview.so.2                                             [.] tbb::interface7::internal::task_arena_base::internal_execute
+   50.54%     0.00%  cmsRun           cmsRun                                                          [.] tbb::interface7::internal::delegated_function<main::{lambda()#1}::operator()() const::{lambda()#1} const, void>::operator()
+   48.71%     0.00%  cmsRun           cmsRun                                                          [.] main::{lambda()#1}::operator()
+   34.86%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute
+   34.34%     0.00%  cmsRun           libFWCoreConcurrency.so                                         [.] tbb::internal::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute
+   34.26%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::stream::EDProducerAdaptorBase::doEvent
+   34.01%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo
+   26.60%     0.00%  cmsRun           libFWCoreConcurrency.so                                         [.] tbb::internal::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute
+   21.57%     0.00%  cmsRun           cmsRun                                                          [.] main
+   19.05%     0.00%  cmsRun           libc-2.17.so                                                    [.] __libc_start_main
+   18.15%     0.00%  cmsRun           libIOPoolOutput.so                                              [.] edm::RootOutputFile::fillBranches
+   18.15%     0.00%  cmsRun           libTree.so                                                      [.] TTree::Fill
+   18.15%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::WorkerT<edm::one::OutputModuleBase>::implDo
+   18.15%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::one::OutputModuleBase::doEvent
+   18.15%     0.00%  cmsRun           libIOPoolOutput.so                                              [.] edm::PoolOutputModule::write
+   18.15%     0.00%  cmsRun           libIOPoolOutput.so                                              [.] edm::RootOutputFile::writeOne
+   18.15%     0.00%  cmsRun           libtbb_preview.so.2                                             [.] tbb::interface7::internal::isolate_within_arena
+   18.15%     0.00%  cmsRun           libIOPoolOutput.so                                              [.] edm::RootOutputTree::fillTree
+   18.15%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::SerialTaskQueue::QueuedTask<void edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm:
+   18.15%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute
+   18.00%     0.00%  cmsRun           libCore.so                                                      [.] R__zipMultipleAlgorithm
+   18.00%     0.00%  cmsRun           libCore.so                                                      [.] R__zipLZMA
+   17.98%     0.00%  cmsRun           liblzma.so.5.2.4                                                [.] lzma_code
+   17.98%     0.00%  cmsRun           liblzma.so.5.2.4                                                [.] 0x00007fded255d9c4
+   17.92%     0.00%  cmsRun           liblzma.so.5.2.4                                                [.] 0x00007fded255beda
+   17.92%     0.00%  cmsRun           libTree.so                                                      [.] TBasket::WriteBuffer
+   17.90%     0.00%  cmsRun           liblzma.so.5.2.4                                                [.] 0x00007fded2564a11
+   17.89%     0.00%  cmsRun           liblzma.so.5.2.4                                                [.] 0x00007fded256fb6a
+   17.55%     0.00%  cmsRun           libImt.so                                                       [.] ROOT::Experimental::TTaskGroup::Wait
+   17.55%     0.00%  cmsRun           libImt.so                                                       [.] tbb::internal::function_task<std::function<void ()> >::execute
+   17.55%     0.00%  cmsRun           libTree.so                                                      [.] std::_Function_handler<void (), void ROOT::Internal::TBranchIMTHelper::Run<TBranch::WriteBasketImpl(TBasket*, int, ROOT::Internal::TBranchIMTHelp
+   15.88%     0.00%  cmsRun           liblzma.so.5.2.4                                                [.] 0x00007fded25675ed
+   15.68%     0.00%  cmsRun           cmsRun                                                          [.] _start
+    8.44%     0.00%  cmsRun           libRecoTrackerCkfPattern.so                                     [.] cms::CkfTrackCandidateMakerBase::produceBase
+    8.28%     0.01%  cmsRun           libRecoTrackerCkfPattern.so                                     [.] cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()
+    7.28%     0.00%  cmsRun           libtbb_preview.so.2                                             [.] tbb::internal::co_local_wait_for_all
+    7.27%     0.00%  cmsRun           libc-2.17.so                                                    [.] 0x00007fded11e21df
+    5.78%     0.00%  cmsRun           liblzma.so.5.2.4                                                [.] 0x00007fded25658d2
+    5.21%     0.03%  cmsRun           libDetectorDescriptionCore.so                                   [.] DDFilteredView::next
+    5.06%     0.01%  cmsRun           pluginRecoTrackerCkfPatternPlugins.so                           [.] GroupedCkfTrajectoryBuilder::groupedLimitedCandidates
+    5.03%     0.07%  cmsRun           pluginRecoTrackerCkfPatternPlugins.so                           [.] GroupedCkfTrajectoryBuilder::advanceOneLayer
+    5.02%     0.00%  cmsRun           liblzma.so.5.2.4                                                [.] 0x00007fded2569dbc
+    4.59%     0.09%  cmsRun           libRecoPixelVertexingPixelTriplets.so                           [.] CAHitQuadrupletGenerator::hitNtuplets
+    4.57%     0.00%  cmsRun           pluginRecoPixelVertexingPixelTripletsPlugins.so                 [.] CAHitNtupletEDProducerT<CAHitQuadrupletGenerator>::produce
+    4.29%     3.57%  cmsRun           libRecoPixelVertexingPixelTriplets.so                           [.] CellularAutomaton::createAndConnectCells
+    4.16%     0.05%  cmsRun           pluginRecoTrackerCkfPatternPlugins.so                           [.] TrajectorySegmentBuilder::segments
+    4.01%     0.00%  cmsRun           pluginRecoTrackerCkfPatternPlugins.so                           [.] GroupedCkfTrajectoryBuilder::buildTrajectories
+    3.98%     0.00%  cmsRun           liblzma.so.5.2.4                                                [.] 0x00007fded2566701
+    3.97%     1.21%  cmsRun           libRecoEgammaEgammaElectronAlgos.so                             [.] PixelHitMatcher::operator()
+    3.87%     0.01%  cmsRun           pluginTrackingToolsTrackFittersPlugins.so                       [.] (anonymous namespace)::KFFittingSmoother::fitOne
+    3.77%     0.00%  cmsRun           pluginGeometryMTDNumberingBuilderPlugins.so                     [.] edm::SerialTaskQueue::QueuedTask<void edm::SerialTaskQueueChain::push<edm::eventsetup::Callback<MTDGeometricTimingDetESModule, std::unique_ptr<Ge
+    3.77%     0.00%  cmsRun           pluginGeometryMTDNumberingBuilderPlugins.so                     [.] edm::SerialTaskQueueChain::actionToRun<edm::eventsetup::Callback<MTDGeometricTimingDetESModule, std::unique_ptr<GeometricTimingDet, std::default_
+    3.77%     0.00%  cmsRun           pluginGeometryMTDNumberingBuilderPlugins.so                     [.] edm::eventsetup::Callback<MTDGeometricTimingDetESModule, std::unique_ptr<GeometricTimingDet, std::default_delete<GeometricTimingDet> >, IdealGeom
+    3.77%     0.00%  cmsRun           pluginGeometryMTDNumberingBuilderPlugins.so                     [.] edm::convertException::wrap<edm::eventsetup::Callback<MTDGeometricTimingDetESModule, std::unique_ptr<GeometricTimingDet, std::default_delete<Geom
+    3.77%     0.00%  cmsRun           pluginGeometryMTDNumberingBuilderPlugins.so                     [.] MTDGeometricTimingDetESModule::produce
+    3.77%     0.00%  cmsRun           pluginGeometryMTDNumberingBuilderPlugins.so                     [.] DDCmsMTDConstruction::construct
+    3.75%     0.39%  cmsRun           pluginGeometryMTDNumberingBuilderPlugins.so                     [.] DDNameFilter::accept
+    3.66%     0.12%  cmsRun           libTrackingToolsMaterialEffects.so                              [.] PropagatorWithMaterial::propagateWithPath
+    3.47%     0.00%  cmsRun           pluginRecoTrackerTrackProducerPlugins.so                        [.] GsfTrackProducer::produce
+    3.33%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::Factory::makeModule
+    3.33%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::ModuleRegistry::getModule

@gartung
Copy link
Member

gartung commented Apr 15, 2021

The fun fact is that using the TBB test executable test_resumable_tasks.exe I get similar Igprof data

and the perf data seems to be missing 30%

+   67.35%     0.12%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all
+   56.81%    30.57%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task
+   53.80%     0.22%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop
+   47.91%     0.06%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::co_local_wait_for_all
+   41.77%     0.21%  test_resumable_  test_resumable_tasks.exe  [.] tbb::internal::function_task<TestCleanupMaster()::{lambda(int)#1}::operator()(int) const::{lambda()#1}>::execute
+   39.77%     0.05%  test_resumable_  libtbb_preview.so.2       [.] sched_yield@plt
+   32.23%     0.01%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::market::process
+   28.30%     0.04%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::arena::process
+   24.77%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f794f92
+   20.00%    20.00%  test_resumable_  [kernel.kallsyms]         [k] sysret_check
+   18.87%     0.02%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::rml::private_worker::run
+   11.67%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f795095
+   10.35%     0.34%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::internal_suspend
+    9.25%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f0d9d9c
+    9.24%     0.16%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::concurrent_queue_base_v3::internal_pop
+    8.35%     0.00%  test_resumable_  test_resumable_tasks.exe  [.] AsyncActivity::asyncLoop
+    8.33%     0.01%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::generic_scheduler::create_worker
+    8.14%     0.26%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::micro_queue::pop
+    8.11%     8.11%  test_resumable_  [kernel.kallsyms]         [k] system_call_after_swapgs
+    8.03%     0.08%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::concurrent_queue_base_v3::internal_insert_item
+    6.94%     0.01%  test_resumable_  test_resumable_tasks.exe  [.] tbb::internal::suspend_callback<SuspendBody>
+    6.71%     0.02%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::allocate_scheduler
+    6.30%     0.03%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::generic_scheduler::generic_scheduler
+    6.24%     0.01%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::create_coroutine
+    6.13%     0.07%  test_resumable_  test_resumable_tasks.exe  [.] tbb::interface9::internal::start_for<tbb::blocked_range<int>, tbb::internal::parallel_for_body<ParForSuspendBody, int>, tbb::affinity_partitioner>::execute
+    6.07%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f786aaa
+    5.89%     0.15%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::micro_queue::push
+    5.78%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f794fd8
+    5.69%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f794eb9
+    4.86%     0.02%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::arena::free_arena
+    4.78%     0.02%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::~custom_scheduler
+    4.76%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f1fcbf2
+    4.75%     0.00%  test_resumable_  libtbb_preview.so.2       [.] munmap@plt
+    4.64%     0.00%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::generic_scheduler::destroy
+    4.00%     0.00%  test_resumable_  libtbb_preview.so.2       [.] syscall@plt
+    3.90%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f78655d
+    3.90%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f397597
+    3.67%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f0e9942
+    3.59%     0.00%  test_resumable_  libtbb_preview.so.2       [.] swapcontext@plt
+    3.58%     0.03%  test_resumable_  test_resumable_tasks.exe  [.] tbb::interface9::internal::start_for<tbb::blocked_range<int>, tbb::internal::parallel_for_body<TestSuspendResume()::{lambda(int)#1}::operator()(int) const::{lambda(int)#1}, int>, tbb::auto_partitioner const>::exe
+    3.50%     0.00%  test_resumable_  libpthread-2.17.so        [.] start_thread
+    3.32%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f0daefb
+    3.04%     0.00%  test_resumable_  [unknown]                 [k] 0xffffffff8f115270
+    2.68%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f1fb955
+    2.52%     0.00%  test_resumable_  libtbb_preview.so.2       [.] mmap@plt
+    2.51%     0.00%  test_resumable_  libc-2.17.so              [.] __clone
+    2.50%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f031f12
+    2.47%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f1fa8c6
+    2.43%     0.00%  test_resumable_  libtbb_preview.so.2       [.] mprotect@plt
+    2.43%     0.00%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::generic_scheduler::local_spawn_root_and_wait
+    2.42%     0.00%  test_resumable_  [unknown]                 [k] 0xffffffff8f1fe933
+    2.39%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f1fb715
+    2.36%     0.18%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::concurrent_monitor::notify_relaxed<tbb::internal::predicate_leq>
+    2.30%     1.09%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::generic_scheduler::steal_task
+    2.22%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f787229
+    2.16%     0.00%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::generic_scheduler::do_post_resume_action
+    2.10%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f795176
+    2.09%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f788805
     2.04%     2.04%  test_resumable_  [kernel.kallsyms]         [k] osq_lock
+    1.95%     0.01%  test_resumable_  test_resumable_tasks.exe  [.] tbb::interface9::internal::start_for<tbb::blocked_range<int>, tbb::internal::parallel_for_body<TestSuspendResume()::{lambda(int)#1}, int>, tbb::auto_partitioner const>::execute
+    1.93%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f1179e5
+    1.93%     0.01%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::market::arena_in_need
+    1.90%     0.02%  test_resumable_  libtbb_preview.so.2       [.] tbb::spin_rw_mutex_v3::internal_acquire_reader
+    1.89%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f1ef933
+    1.88%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f3975cb
+    1.85%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f1ef80f
+    1.85%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f1f9124
+    1.84%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f0cad35
+    1.83%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f1edd46

@gartung
Copy link
Member

gartung commented Apr 15, 2021

The [k] in the perf output indicates a kernel call. As far as I know igprof does not show kernel calls.

@gartung
Copy link
Member

gartung commented Apr 19, 2021

Currently rebuilding CMSSW_11_3_0_pre6 with oneTBB 2021.2. Tests of igprof with the TBB test executable test_resumable_tasks show that main still does not account for 100% cumulative time, but summing _clone and _start does.

@gartung
Copy link
Member

gartung commented Apr 23, 2021

We'll see what the TBB developers say
uxlfoundation/oneTBB#384

@gartung
Copy link
Member

gartung commented Apr 28, 2021

Using Google perftools profiler library I am able to get a profile which seems to account for all activity. The pprof tools interprets the data collected and can produce text output or data suitable for producing a flamegraph.

@gartung
Copy link
Member

gartung commented Apr 28, 2021

TBB resumable tasks uses makecontext and swapcontext with coroutines. Igprof does not seem to be able to handle correctly. Rather than try to make Igprof work with these coroutines it might be easier to make pprof output Igprof compatible data.

@gartung
Copy link
Member

gartung commented Apr 28, 2021

Text output with template arguments not stripped:
https://gartung.web.cern.ch/gartung/cmsRun.prof.cumul.templargs.txt.gz

@gartung
Copy link
Member

gartung commented Apr 30, 2021

When the coroutine is create the parent link is set to 0
https://github.com/oneapi-src/oneTBB/blob/9e15720bc7744f85dff611d34d65e9099e077da4/src/tbb/co_context.h#L176
When the coroutine is swapped back the thread exits after running the task. (edited)
The IgProf threadWrapper sees the task return and disables the profiling timer.
https://github.com/igprof/igprof/blob/03d16c15bd970256affcfde9e510a8404856b233/src/profile.cc#L835
Gperftools does not ever disables the profiling timer on the main thread.

@gartung
Copy link
Member

gartung commented Apr 30, 2021

This patch for igprof seems to resolve the undercounting problem.
cms-externals/igprof#2

@gartung
Copy link
Member

gartung commented May 5, 2021

@gartung
Copy link
Member

gartung commented May 5, 2021

Memory profiling looks more accurate after the patch
https://gartung.web.cern.ch/gartung/cgi-bin/igprof-navigator/igprofMEM_step3/cumulative

@jpata
Copy link
Contributor Author

jpata commented Oct 25, 2021

solved thanks to @gartung!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants