# Profiling
ASGarD has built-in build support for a few different methods of profiling. Each method is explained in its own section below.
## GNU gprof

GNU gprof is a sampling-based profiler that is widely available on many systems, usually as part of binutils. It depends on the GCC compiler.

Check out the GNU gprof documentation for the details of its usage. A simple but useful case is explained below.
For many cases, this simple recipe is sufficient to get good information about where time is spent in the executable. There are three main steps, shown here using the main `asgard` executable:
1. Build and link the code using the `-pg` option. This is done for you during the make process when the cmake option `-DASGARD_PROFILE_GPROF=1` is set (a configure sketch is shown after this list).

2. Run the executable to be profiled. This produces a `gmon.out` file in the working directory where the executable was run.

   ```
   $ ./asgard -p continuity_6 -l 8 -d 3
   ```

3. Process the output file using `gprof`:

   ```
   $ gprof asgard gmon.out > [output-file.txt]
   ```
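For reference, here is a minimal configure-and-run sketch. The out-of-source `build` directory layout is an assumption (adjust to your own setup); only the `-DASGARD_PROFILE_GPROF=1` option and the run/`gprof` commands come from the steps above.

```bash
# hypothetical out-of-source build; only the ASGARD_PROFILE_GPROF option is ASGarD-specific
mkdir -p build && cd build
cmake -DASGARD_PROFILE_GPROF=1 ..
make

# run the instrumented executable; gmon.out is written to the current directory
./asgard -p continuity_6 -l 8 -d 3

# summarize, writing the flat profile and call graph to a text file
gprof ./asgard gmon.out > gprof-report.txt
```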
We can produce a weighted call graph using `gprof2dot`. This is a python script that depends on `dot` from graphviz. Both are automatically installed into the `contrib` directory if not already found.

```
$ gprof asgard gmon.out | ../contrib/gprof2dot/bin/gprof2dot.py | ../contrib/graphviz/bin/dot -Tpdf -o asgard-profile.pdf
```

The graph can be trimmed of smaller contributors using some options to `gprof2dot.py`:

```
$ gprof asgard gmon.out | ../contrib/gprof2dot/bin/gprof2dot.py -n 2 -e 1 -w | ../contrib/graphviz/bin/dot -Tpdf -o asgard-profile.pdf
```
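Besides piping into `gprof2dot`, `gprof` itself accepts options to limit the report. The flags below are standard gprof flags, not anything ASGarD-specific; the output file names are made up for illustration.

```bash
# flat profile only, with the explanatory blurbs suppressed (-b)
gprof -p -b asgard gmon.out > flat-profile.txt

# call graph only
gprof -q -b asgard gmon.out > call-graph.txt
```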
## LLVM XRay

LLVM XRay is a profiling tool from LLVM that uses the compiler's internal AST to automatically instrument the code and enable tracing support. This is like putting start/stop timers around the code's functions, as opposed to getting the timings via samples as GNU gprof does.

In addition to the LLVM XRay documentation, there is also a "Debugging with XRay" page that explains some of the options/features.

To get a trace of the entire application with every function that has at least 200 instructions:
1. Build and link the code with `-fxray-instrument`. This is done for you during the make process when the cmake option `-DASGARD_PROFILE_XRAY=1` is used (a configure sketch is shown after this list).

2. Run the executable to be profiled:

   ```
   $ XRAY_OPTIONS="patch_premain=true xray_mode=xray-basic verbosity=1" ./asgard -p continuity_6 -l 8 -d 3
   ```

   This produces a uniquely-hashed `xray-log.asgard.[hash]` file with the profile data.

3. Process the output using `llvm-xray`:

   ```
   $ llvm-xray account xray-log.asgard.[hash] -sort=sum -sortorder=dsc -instr_map ./asgard
   ```
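As with gprof, the instrumentation flag is added for you by the build system. Below is a minimal configure sketch, assuming an out-of-source build directory; selecting clang explicitly is an assumption on our part (XRay is a clang/LLVM feature, but ASGarD's CMake may handle the compiler choice differently), and only the `-DASGARD_PROFILE_XRAY=1` option comes from step 1.

```bash
# hypothetical configure step; only ASGARD_PROFILE_XRAY comes from the recipe above
mkdir -p build && cd build
cmake -DCMAKE_CXX_COMPILER=clang++ -DASGARD_PROFILE_XRAY=1 ..
make
```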
Mainly quoting/following the docs listed above:

> By default, functions that have at least 200 instructions will get XRay instrumentation points. You can tweak that number through the `-fxray-instruction-threshold=` flag:

```
clang -fxray-instrument -fxray-instruction-threshold=1 ...
```
You can also specifically instrument functions in your binary to either always or never be instrumented using source-level attributes. You can do this using GCC-style or C++11-style attributes.

```cpp
[[clang::xray_always_instrument]] void always_instrumented();
[[clang::xray_never_instrument]] void never_instrumented();

void alt_always_instrumented() __attribute__((xray_always_instrument));
void alt_never_instrumented() __attribute__((xray_never_instrument));
```
We can also specify which functions to include/exclude via an external file. Again, from the docs:

> Attributes can be imbued through the use of special case files instead of adding them to the original source files. You can use this to mark certain functions and classes to be never, always, or instrumented with first-argument logging from a file. The file's format is described below:

```
# always instrument functions that match the following filters
[always]
fun:always_instrument
fun:log_arg1=arg1 # Log the first argument for the function

# never instrument functions that match the following filters
[never]
fun:never_instrument
fun:__cxx_*
```

These files can be provided through the `-fxray-attr-list=` flag to clang. You may have multiple files loaded through multiple instances of the flag.
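For example, if the filters above were saved to a file, the compile line could look roughly like the sketch below. The file names (`xray-attrs.txt`, `some_source.cpp`) are made up for illustration; the flags themselves are the ones named above.

```bash
# combine instrumentation, a lower threshold, and an external attribute list
clang++ -fxray-instrument -fxray-instruction-threshold=1 \
        -fxray-attr-list=xray-attrs.txt -c some_source.cpp
```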
Although the output from XRay is human-readable and usable as-is, sometimes we want to see pictures.

As of this writing, the docs mention a couple of ways of visualizing the output data. Refer to them for more details, but here are two simple examples.

### FlameGraph

FlameGraph is a small utility for generating a flamegraph of the output. From the utility author's page:
- Each box represents a function in the stack (a "stack frame").
- The y-axis shows stack depth (number of frames on the stack). The top box shows the function that was on-CPU. Everything beneath that is ancestry. The function beneath a function is its parent, just like the stack traces shown earlier. (Some flame graph implementations prefer to invert the order and use an "icicle layout", so flames look upside down.)
- The x-axis spans the sample population. It does not show the passing of time from left to right, as most graphs do. The left to right ordering has no meaning (it's sorted alphabetically to maximize frame merging).
- The width of the box shows the total time it was on-CPU or part of an ancestry that was on-CPU (based on sample count). Functions with wide boxes may consume more CPU per execution than those with narrow boxes, or, they may simply be called more often. The call count is not shown (or known via sampling).
- The sample count can exceed elapsed time if multiple threads were running and sampled concurrently.
Note that XRay is still giving instrumented trace output, not sampling, but the comments about time aggregation apply nonetheless.
To have the LLVM XRay tool output the necessary data in the format FlameGraph wants to see, do something like the following:

```
$ llvm-xray stack xray-log.asgard.[hash] -instr_map ./asgard -stack-format=flame -aggregation-type=time -all-stacks | ../contrib/FlameGraph/flamegraph.pl > flamegraph.svg
```
### Chrome Trace Viewer

Convert the XRay log to the Chrome trace-event format:

```
$ llvm-xray convert -symbolize -instr_map=./asgard -output-format=trace_event xray-log.asgard.[hash] | gzip > asgard-trace.txt.gz
```

Now, in a Chrome browser, navigate to `chrome://tracing` and load the `asgard-trace.txt.gz` produced above.
## gperftools

gperftools (formerly "Google Performance Tools") includes tcmalloc, a heap profiler, and a CPU profiler.
### CPU profiler

For a quick start on the CPU profiling, as per the documentation, we need to:

1. Install the gperftools package. This is done for you when you use the `-DASGARD_PROFILE_GPERF=` cmake option.

2. Link your executable with `-lprofiler`. This is also done for you with the cmake option from step 1.

3. Run your executable with the `CPUPROFILE` environment variable set:

   ```
   $ CPUPROFILE=/tmp/prof.out <path/to/binary> [binary args]
   ```

4. Run `pprof` to analyze the CPU usage:

   ```
   $ pprof <path/to/binary> /tmp/prof.out        # -pg-like text output
   $ pprof --gv <path/to/binary> /tmp/prof.out   # really cool graphical output
   ```
There is much more control available over the CPU profiling. Again, check out the docs.
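One example of that extra control is the sampling rate, read from the `CPUPROFILE_FREQUENCY` environment variable. This is a standard gperftools setting rather than anything ASGarD-specific; see the gperftools docs for the full list. A sketch:

```bash
# take 250 samples per second instead of the gperftools default (100)
CPUPROFILE=/tmp/prof.out CPUPROFILE_FREQUENCY=250 ./asgard -p continuity_6 -l 8 -d 3
```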
### Heap profiler

Refer to the gperftools docs for more details.

- Link the code with the `-ltcmalloc` option (done for you during 'make').

- Run the executable to be profiled (a tuned variant is sketched after this list):

  ```
  $ HEAPPROFILE=some-name.hprof ./asgard -p continuity_6 -l 8 -d 3
  ```

  This produces a series of heap profile files named `some-name.hprof.XXXXX.heap`.

  ```
  $ PPROF_PATH=/path/to/pprof HEAPCHECK=normal ./asgard -p continuity_6 -l 8 -d 3
  ```

  This performs a basic memory leak check.

- Analyze the results with:

  ```
  $ pprof --text ./asgard some-name.hprof.XXXXX.heap
  $ pprof --gv ./asgard some-name.hprof.XXXXX.heap
  ```
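The heap profiler's dump cadence can likewise be tuned with standard gperftools environment variables such as `HEAP_PROFILE_ALLOCATION_INTERVAL` (again, not ASGarD-specific; check the gperftools docs for the full set). A sketch:

```bash
# dump a heap profile roughly every 100 MB of cumulative allocation
HEAPPROFILE=some-name.hprof HEAP_PROFILE_ALLOCATION_INTERVAL=104857600 \
    ./asgard -p continuity_6 -l 8 -d 3
```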
## Linux perf

More details can be found at the Linux perf wiki.

- Download and build Linux perf (done for you during 'make'). perf is installed in `asgard/contrib/lperftools/bin` (flex and bison are dependencies and must be installed by the user).

- Run the executable to be profiled (system privileges must be given; a couple of optional variations are sketched after this list).

  To get a runtime distribution:

  ```
  $ perf record ./asgard -p continuity_6 -l 8 -d 3
  ```

  To get a runtime call-graph:

  ```
  $ perf record -g ./asgard -p continuity_6 -l 8 -d 3
  ```

- Display the results with:

  ```
  $ perf report
  ```
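A couple of hedged variations on the commands above, using standard perf options rather than anything ASGarD-specific:

```bash
# record call graphs at a fixed ~99 samples/sec to keep the data file manageable
perf record -F 99 -g ./asgard -p continuity_6 -l 8 -d 3

# text report of the recorded samples, then drill into hot functions
perf report
perf annotate
```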
## Valgrind

See the valgrind user manual for full details.

- Build/link the code with the `-g` option (done for you during 'make').

- Memory check, e.g. (see the sketch after this list for some commonly used extra options):

  ```
  $ valgrind --tool=memcheck --log-file=<filename> ./asgard -p continuity_6 -l 8 -d 3
  ```

  The memcheck report is written to the file given by `--log-file`.

- Cache profiling, e.g.:

  Run the profiler:

  ```
  $ valgrind --tool=cachegrind ./asgard -p continuity_6 -l 8 -d 3
  ```

  By default the output file is named `cachegrind.out.<pid>`. Run the report:

  ```
  $ cg_annotate --auto=yes cachegrind.out.<pid>
  ```

  This generates a call-by-call cache performance report.

- Call graph:

  ```
  $ valgrind --tool=callgrind [callgrind options] ./asgard -p continuity_6 -l 8 -d 3
  $ callgrind_annotate [options] callgrind.out.<pid>
  ```

  For more details about call-graph generation, see the valgrind documentation for callgrind (chapter 6).

- Heap profiler:

  ```
  $ valgrind --tool=massif [massif options] ./asgard -p continuity_6 -l 8 -d 3
  $ ms_print massif.out.<pid>
  ```

  "Massif Visualizer" (e.g. available in apt) is a nice tool for visualizing this data. For more details about heap profiling, see the valgrind documentation for massif (chapter 9).
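A few hedged variations on the commands above. The extra memcheck flags and the graphical viewers are standard valgrind/KDE tooling, not ASGarD-specific, and the viewers must be installed separately.

```bash
# more detailed memcheck run: full leak report plus the origin of uninitialized values
valgrind --tool=memcheck --leak-check=full --track-origins=yes \
    --log-file=memcheck.log ./asgard -p continuity_6 -l 8 -d 3

# browse callgrind / massif output graphically
kcachegrind callgrind.out.<pid>
massif-visualizer massif.out.<pid>
```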
## Example problem sizes and runtimes

Each run command below is followed by the size of the entire problem in GPU DRAM.

(Note that we can still run problems where the allocations below exceed the GPU DRAM size. The problem is broken into pieces based on the workspace limit in main, currently hardcoded to 10GB. However, we don't yet have a means of splitting coefficients, so those have to fit.)

Approximate time per explicit timestep is given; this is for a single Titan V on our workstation.

Provided runtimes are total wallclock time for these runs on Cori: five explicit timesteps, both without GPU acceleration and with one GPU.
`./asgard -p continuity_2 -l 2 -d 2`

DOF=32. Small 2d problem; can run on a laptop. ~1 sec timesteps.

- coefficient size (MB): <1
- solution size (MB): <1
- workspace size (MB): <1
- runtime (with GPU): 0m 0.822s
- runtime (without GPU): 0m 0.451s

`./asgard -p continuity_2 -l 10 -d 4`

DOF=98,304. Large 2d problem. ~3.5 sec timesteps.

- coefficient size (MB): 537
- solution size (MB): <1
- workspace size (MB): 19,327
- runtime (with GPU): 0m 26.043s
- runtime (without GPU): 15m 59.284s

`./asgard -p continuity_2 -l 12 -d 4`

DOF=458,752. Very large 2d problem; could benefit from multiple GPUs. ~50% of DRAM on our Titan V cards is consumed by coefficients, which have to be resident with our current design. 1-2 minute timesteps.

- coefficient size (MB): 8,590
- solution size (MB): 4
- workspace size (MB): 420,907
- runtime (with GPU): 9m 19.823s
- runtime (without GPU):

`./asgard -p continuity_3 -l 3 -d 2`

DOF=304. Small 3d problem; can run on a laptop. 1-2 sec timesteps.

- coefficient size (MB): <1
- solution size (MB): <1
- workspace size (MB): <1
- runtime (with GPU): 0m 2.094s
- runtime (without GPU): 0m 0.795s

`./asgard -p continuity_3 -l 8 -d 4`

DOF=262,144. Large 3d problem. 7-8 sec timesteps.

- coefficient size (MB): 75
- solution size (MB): 2
- workspace size (MB): 51,539
- runtime (with GPU): 0m 48.750s
- runtime (without GPU): 31m 18.417s

`./asgard -p continuity_3 -l 11 -d 4`

DOF=1,458,176. Very large 3d problem; could use multiple GPUs. ~5 min per timestep.

- coefficient size (MB): 1,208
- solution size (MB): 12
- workspace size (MB): 1,594,710
- runtime (with GPU): 119m 9.511s
- runtime (without GPU):

`./asgard -p continuity_6 -l 2 -d 4`

DOF=139,264. Small 6d problem; can be run on a laptop. ~2 sec per timestep.

- coefficient size (MB): <1
- solution size (MB): 1
- workspace size (MB): 455
- runtime (with GPU): 0m 3.288s
- runtime (without GPU): 0m 23.532s

`./asgard -p continuity_6 -l 4 -d 4`

DOF=2,052,096. Large 6d problem; multiple GPUs will help. ~30 sec per timestep.

- coefficient size (MB): 2
- solution size (MB): 16
- workspace size (MB): 98,698
- runtime (with GPU): 3m 9.969s
- runtime (without GPU): 83m 33.835s

`./asgard -p continuity_6 -l 6 -d 4`

DOF=21,856,256. Very large 6d problem; use multiple nodes/GPUs.

- coefficient size (MB): 19
- solution size (MB): 175
- workspace size (MB): 112,960,000
- runtime (with GPU):
- runtime (without GPU):