igprof pp profiling outputs no longer work #33297
Comments
A new Issue was created by @jpata Joosep Pata. @Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign core |
New categories assigned: core @Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
I agree with @dan131riley that #32804 is the likely culprit (given that we saw TBB to "move stack between threads", I'd easily believe IgProf to get confused by that, among others). |
@jpata Is this specific to pp? Or in other words, do mp profiles work properly? |
The “moving between threads” should also have been stopped when I later added the use of a tbb::task_arena call. |
I poked around between 11_2_X and 11_3_X IgProf reports and noticed something In CMSSW_11_2_X_2021-03-30-2300 23434.21
But in CMSSW_11_3_X_2021-03-30-2300 23434.21
Could it be that in 11_3_X the In the GEN-SIM job ( |
We observe the same issue with igprof pp in the reco profiling, which is run independently with different scripts. From pre5, they don't look meaningful.
|
Thanks Joosep. Perhaps the symptom should be interpreted then as time accounting effectively being lost some time before the capture point of It's still curious that in the GEN-SIM job time does get added after capture point of |
The output is completely controlled by the runTheMatrix.py switch --profile prof, which enables igprof as a service. |
The output name of the sqlite files is derived from the igprof service output name here |
Luckily I kept the original gzipped tarfile output from igprof. If you click on Logs from the igprof top page you get a directory listing like this where you can download the original igprof output. |
If Igprof is doing something strange you should be able to investigate with the original output. |
curiously, pp for run3 makes sense in pre5 ( although I'm less sure about the test in the IB |
The run-ib-profiling job runs on cms-oc-gpu-01 while the run-ib-igprof job runs on cmsprofile-01 or cmsprofile-02, which are 4-core VMs. The release-run-reco-profiling jobs run on vocms11, which is a 32-core bare metal machine. |
The main sum from the run-ib-igprof jobs run on cmsprofile-01 is even lower -- 89% |
I would like to move the run-ib-profiling job and the run-ib-igprof jobs to vocms11 so that the hardware is the same when making comparisons. |
@gartung No problem for reco side, I think! Feel free to do what you need in jenkins. Just not to side-track this issue here, we can discuss in mattermost in case of need. |
Apparently we still observe stacks being moved between threads #33289 (comment), I would guess that to confuse IgProf. |
I'm running some tests now that get the thread-id at the top of |
I noticed that the runTheMatrix.py --profile generates the command |
I think I might know what is causing the time difference. |
rdtsc is used by TBB in prolonged_pause_impl which calls machine_time_stamp which calls rdtsc. |
Definition of machine_time_stamp here |
In igprof the ticks are returned by the add function which is used to add the time in various memory allocation calls |
I think changing the rdtsc calls to libc's clock calls could avoid the problem |
I tried this patch with igprof and generated the text output
and lower down at [20] I see the __start_context where the “missing” cumulative total from spontaneous goes and the main::lambda eventually appears
|
A similar breakdown of cumulative percentages can be seen in this text output. The difference seems to be that igprof-analyze or igprof-navigator do not show spontaneous as a valid base node. |
The function |
Searching co_local_wait_for_all in the cmssw repo shows 4 issues where co_local_wait_for_all appears second or third from the bottom of stack traces. It appears after __start_context which is why it is not linked to main, |
Besides |
For instance |
I found a second instance of local_wait_for_all in the text report as indicated by '2 added to the end
which does not appear in the sql3 database used by igprof-navigator. |
perf text report
|
The fun fact is that using the TBB test executable test_resumable_tasks.exe I get similar Igprof data and the perf data seems to be missing 30%
|
The [k] in the perf output indicates a kernel call. As far as I know igprof does not show kernel calls. |
Currently rebuilding CMSSW_11_3_0_pre6 with oneTBB 2021.2. Tests of igprof with the TBB test executable test_resumable_tasks show that main still does not account for 100% cumulative time, but summing _clone and _start does. |
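A toy model of the accounting described above, as a minimal sketch: it assumes cumulative time for a frame is the share of ticks whose stack contains that frame. The sample counts and the frames `run_tasks`, `worker` and `start_thread` are invented for illustration; this is not how igprof stores its data.

```python
from collections import Counter

# Hypothetical stack samples, outermost frame first, with tick counts.
# Only the root names (_start, clone) and main echo the discussion above;
# everything else is made up for the example.
samples = [
    (("_start", "__libc_start_main", "main", "run_tasks"), 40),
    (("clone", "start_thread", "worker", "run_tasks"), 35),
    (("clone", "start_thread", "worker", "local_wait_for_all"), 25),
]

def cumulative_share(samples):
    """Return {frame: fraction of all ticks whose stack contains frame}."""
    total = sum(count for _, count in samples)
    cumulative = Counter()
    for stack, count in samples:
        for frame in set(stack):  # count each frame once per sample
            cumulative[frame] += count
    return {frame: ticks / total for frame, ticks in cumulative.items()}

share = cumulative_share(samples)
print(f"main: {share['main']:.0%}")
print(f"_start + clone: {share['_start'] + share['clone']:.0%}")
```

In this model main only sees the ticks from stacks that pass through it, while the roots together cover every sample, which is the pattern reported for the TBB test executable.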
We'll see what the TBB developers say |
Using the Google perftools profiler library I am able to get a profile which seems to account for all activity. The pprof tool interprets the data collected and can produce text output or data suitable for producing a flamegraph. |
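For readers unfamiliar with flamegraph input, here is a minimal sketch of the "folded stacks" text form that flamegraph scripts commonly consume: one line per unique stack, frames joined by `;`, followed by a sample count. The thread does not say which pprof output format was actually used, and the frame names below are hypothetical.

```python
from collections import Counter

def to_folded(stacks):
    """Collapse stack samples (outermost frame first) into folded lines."""
    counts = Counter(";".join(stack) for stack in stacks)
    return [f"{path} {n}" for path, n in sorted(counts.items())]

# Hypothetical samples: each list is one sampled call stack.
stacks = [
    ["main", "processEvent", "produce"],
    ["main", "processEvent", "produce"],
    ["main", "processEvent", "analyze"],
]

folded = to_folded(stacks)
for line in folded:
    print(line)
```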
TBB resumable tasks use makecontext and swapcontext with coroutines. Igprof does not seem to be able to handle these correctly. Rather than try to make Igprof work with these coroutines it might be easier to make pprof output Igprof compatible data. |
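Why a coroutine stack can end up detached from main can be sketched with a toy caller table. This is only a model under the assumption that a stack walk follows caller links until it finds none; `spawn_tasks`, `coroutine_body` and `do_work` are hypothetical names, and nothing here reflects the real TBB or igprof code.

```python
# Caller links for a toy process: frame -> caller, None where a walk stops.
# Only main, _start and __start_context are names taken from the discussion.
caller = {
    "_start": None,
    "main": "_start",
    "spawn_tasks": "main",
    "__start_context": None,  # coroutine stack: no link back to a parent
    "coroutine_body": "__start_context",
    "do_work": "coroutine_body",
}

def walk(frame):
    """Follow caller links from frame and return the stack, innermost first."""
    stack = []
    while frame is not None:
        stack.append(frame)
        frame = caller[frame]
    return stack

print(walk("spawn_tasks"))  # reaches main and _start
print(walk("do_work"))      # stops at __start_context, never reaches main
```

Any time sampled in `do_work` is then attributed to a stack whose root is `__start_context`, so it never shows up under main in a cumulative view.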
Text output with template arguments not stripped: |
When the coroutine is created the parent link is set to 0 |
This patch for igprof seems to resolve the undercounting problem. |
Memory profiling looks more accurate after the patch |
solved thanks to @gartung! |
There seems to be an issue with igprof pp (CPU profiling) in 11_3_X between pre4 and pre5, starting at least from March 14. The stack in 11_3_X is no longer capturing main properly and the cumulative number is not meaningful. @slava77 pointed it out to me yesterday. Things look fine in 11_2_X.
11_3_X: https://cmssdt.cern.ch/SDT/cgi-bin/igprof-navigator/CMSSW_11_3_X_2021-03-14-0000/slc7_amd64_gcc900/pp/23434.21_TTbar_14TeV+2026D49PU_ProdLike+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTriggerPU+RecoGlobalPU+MiniAODPU/step3___580bd4d7c9a1fca4d329fb1738ee817d___10_EndOfJob
11_2_X: https://cmssdt.cern.ch/SDT/cgi-bin/igprof-navigator/CMSSW_11_2_X_2021-03-14-0000/slc7_amd64_gcc900/pp/23434.21_TTbar_14TeV+2026D49PU_ProdLike+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTriggerPU+RecoGlobalPU+MiniAODPU/step3___580bd4d7c9a1fca4d329fb1738ee817d___10_EndOfJob
In mattermost, Daniel Sherman Riley pointed to this as a possible culprit: #32804
Perhaps something needs to be adjusted on the igprof side, but it'd be nice to get the profiles working again.