
profiling


ASGarD has built-in build support for a few different methods of profiling. Each method is explained in its own section below.

GNU gprof

cmake .. -DCMAKE_CXX_COMPILER=g++ -DASGARD_PROFILE_GPROF=1

GNU gprof is a sampling-based profiler that is widely available on many systems, usually coming as part of binutils. It depends on the GCC compiler.

Check out the GNU gprof documentation for all of the details about its usage. A simple but useful case is explained below.

For many cases, this simple recipe is sufficient to get some good info about where time is spent in the executable. There are three main steps, shown here using the main asgard executable:

  1. Build and link the code using the -pg option. This is done for you during the make process when the cmake option -DASGARD_PROFILE_GPROF=1 is set.

  2. Run the executable to be profiled. This produces a gmon.out file in the working directory from which the executable was run.

    $ ./asgard -p continuity_6 -l 8 -d 3

  3. Process the output file using gprof

    $ gprof asgard gmon.out > [output-file.txt]
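
If the full combined report is more than you need, gprof's own flags can narrow the output; for example (the output file names here are just placeholders):

    $ gprof -p -b asgard gmon.out > flat-profile.txt   # flat profile only, brief format
    $ gprof -q -b asgard gmon.out > call-graph.txt     # call graph only, brief format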

graphical, weighted call-graph from gprof

We can produce a weighted call graph using gprof2dot. This is a python script that depends on dot from graphviz. Both are automatically installed into the contrib directory if not already found.

$ gprof asgard gmon.out | ../contrib/gprof2dot/bin/gprof2dot.py | ../contrib/graphviz/bin/dot -Tpdf -o asgard-profile.pdf

The graph can be trimmed of smaller contributors using some options to gprof2dot.py (-n and -e set the minimum node and edge percentages to keep, and -w wraps long function names):

$ gprof asgard gmon.out | ../contrib/gprof2dot/bin/gprof2dot.py -n 2 -e 1 -w | ../contrib/graphviz/bin/dot -Tpdf -o asgard-profile.pdf

LLVM XRay

cmake .. -DCMAKE_CXX_COMPILER=clang++ -DASGARD_PROFILE_XRAY=1

LLVM XRay is a profiling tool from LLVM which uses the compiler's internal AST to automatically instrument the code to enable tracing support. This is like putting start/stop timers around the code's functions, as opposed to getting the timings via samples as in the case of GNU gprof.
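
Conceptually, this is like wrapping each function in a scoped timer. The minimal sketch below is for illustration only (XRay generates its entry/exit hooks in the compiler and writes to its own log format; it does not use std::chrono or printf like this):

#include <chrono>
#include <cstdio>

// Hypothetical hand-written version of what XRay's automatic function
// entry/exit instrumentation measures: wall time from entering to leaving.
struct scoped_timer
{
  explicit scoped_timer(char const *name)
      : name(name), start(std::chrono::steady_clock::now()) {}
  ~scoped_timer()
  {
    auto const ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                        std::chrono::steady_clock::now() - start).count();
    std::printf("%s: %lld ns\n", name, static_cast<long long>(ns));
  }
  char const *name;
  std::chrono::steady_clock::time_point start;
};

void some_function() // hypothetical function in the instrumented code
{
  scoped_timer timer("some_function"); // XRay inserts the equivalent automatically
  // ... function body ...
}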

In addition to the LLVM XRay documentation, there is also a "Debugging with XRay" page that explains some of the options/features.

To get a trace of the entire application with every function that has at least 200 instructions:

  1. Build and link the code with -fxray-instrument. This is done for you during the make process when the -DASGARD_PROFILE_XRAY=1 option is used.

  2. Run the executable to be profiled with

    $ XRAY_OPTIONS="patch_premain=true xray_mode=xray-basic verbosity=1" ./asgard -p continuity_6 -l 8 -d 3

    This produces a uniquely-hashed xray-log.asgard.[hash] file with the profile data.

  3. Process the output using llvm-xray

    $ llvm-xray account xray-log.asgard.[hash] -sort=sum -sortorder=dsc -instr_map ./asgard

controlling what XRay instruments

Mainly quoting/following the docs listed above:

By default, functions that have at least 200 instructions will get XRay instrumentation points. You can tweak that number through the -fxray-instruction-threshold= flag:

clang -fxray-instrument -fxray-instruction-threshold=1 ...

You can also specifically instrument functions in your binary to either always or never be instrumented using source-level attributes. You can do it using the GCC-style attributes or C++11-style attributes.

[[clang::xray_always_instrument]] void always_instrumented();
[[clang::xray_never_instrument]] void never_instrumented();

void alt_always_instrumented() __attribute__((xray_always_instrument));
void alt_never_instrumented() __attribute__((xray_never_instrument));

We can also specify which functions to include/exclude via an external file. Again, from the docs:

Attributes can be imbued through the use of special case files instead of adding them to the original source files. You can use this to mark certain functions and classes to be never, always, or instrumented with first-argument logging from a file. The file’s format is described below:

# always instrument functions that match the following filters
[always]
fun:always_instrument
fun:log_arg1=arg1 # Log the first argument for the function

# never instrument functions that match the following filters
[never] 
fun:never_instrument
fun:__cxx_*

These files can be provided through the -fxray-attr-list= flag to clang. You may have multiple files loaded through multiple instances of the flag.
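
For example, assuming the special case file above is saved as xray-attrs.txt (the file name here is just an illustration):

clang++ -fxray-instrument -fxray-attr-list=xray-attrs.txt ...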

visualizing the data

Although the output from XRay is human-readable and usable as-is, sometimes we want to see pictures.

As of this writing, there are a couple of ways mentioned in the docs of visualizing the output data. Refer to them for more details, but here are a couple of simple examples.

FlameGraph

This is a small utility to generate a flamegraph of the output. From the utility author's page:

  • Each box represents a function in the stack (a "stack frame").
  • The y-axis shows stack depth (number of frames on the stack). The top box shows the function that was on-CPU. Everything beneath that is ancestry. The function beneath a function is its parent, just like the stack traces shown earlier. (Some flame graph implementations prefer to invert the order and use an "icicle layout", so flames look upside down.)
  • The x-axis spans the sample population. It does not show the passing of time from left to right, as most graphs do. The left to right ordering has no meaning (it's sorted alphabetically to maximize frame merging).
  • The width of the box shows the total time it was on-CPU or part of an ancestry that was on-CPU (based on sample count). Functions with wide boxes may consume more CPU per execution than those with narrow boxes, or, they may simply be called more often. The call count is not shown (or known via sampling).
  • The sample count can exceed elapsed time if multiple threads were running and sampled concurrently.

Note that XRay is still giving instrumented trace output, not sampling, but the comments about time aggregation apply nonetheless.

To have the LLVM XRay tool output the data in the format FlameGraph expects, do something like the following:

$ llvm-xray stack xray-log.asgard.[hash] -instr_map ./asgard -stack-format=flame -aggregation-type=time -all-stacks | ../contrib/FlameGraph/flamegraph.pl > flamegraph.svg

chrome trace viewer

llvm-xray convert -symbolize -instr_map=./asgard -output-format=trace_event xray-log.asgard.[hash] | gzip > asgard-trace.txt.gz

Now, in a Chrome browser, navigate to chrome://tracing and load the asgard-trace.txt.gz produced above.

gperftools

cmake .. -DASGARD_PROFILE_GPERF_CPU=1

gperftools, formerly known as "Google Performance Tools", includes tcmalloc, a heap profiler, and a CPU profiler.

CPU profiling

For a quick-start on the CPU profiling, as per the documentation, we need to:

  1. Install the gperftools package. This is done for you when you use one of the -DASGARD_PROFILE_GPERF_* cmake options.

  2. Link your executable with -lprofiler. This is also done for you with the cmake option from step 1.

  3. Run your executable with the CPUPROFILE environment var set:

    $ CPUPROFILE=/tmp/prof.out <path/to/binary> [binary args]
    
  4. Run pprof to analyze the CPU usage

    $ pprof <path/to/binary> /tmp/prof.out      # -pg-like text output
    $ pprof --gv <path/to/binary> /tmp/prof.out # really cool graphical output
    

There is much more control available over the CPU profiling. Again, check out the docs.
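
For example, profiling can be limited to a region of interest using the gperftools API rather than the CPUPROFILE environment variable. A minimal sketch, assuming the executable is linked against -lprofiler as above (the region and output file name are hypothetical):

#include <gperftools/profiler.h>

int main(int argc, char **argv)
{
  // ... setup that should not appear in the profile ...

  ProfilerStart("asgard-region.prof"); // begin sampling; samples go to this file
  // ... the region to be profiled ...
  ProfilerStop();                      // stop sampling and flush the profile

  return 0;
}

The resulting asgard-region.prof file is then analyzed with pprof exactly as in step 4.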

Heap profiling

cmake .. -DASGARD_PROFILE_GPERF_MEM=1

Refer to gperftools docs for more details.

  1. link the code with the -ltcmalloc option (done for you during 'make')

  2. run the executable to be profiled with:

    $ HEAPPROFILE=some-name.hprof ./asgard -p continuity_6 -l 8 -d 3
    

    this produces a series of heap profile files named 'some-name.hprof.XXXXX.heap'

    $ PPROF_PATH=/path/to/pprof HEAPCHECK=normal ./asgard -p continuity_6 -l 8 -d 3
    

    this performs a basic memory leak check

  3. analyze the results with

    $ pprof --text ./asgard some-name.hprof.XXXXX.heap
    $ pprof --gv ./asgard some-name.hprof.XXXXX.heap
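
The heap profiler can also be driven programmatically instead of via the HEAPPROFILE environment variable. A minimal sketch using the gperftools heap profiler API (the dump point and prefix are hypothetical):

#include <gperftools/heap-profiler.h>

int main(int argc, char **argv)
{
  HeapProfilerStart("some-name.hprof"); // same prefix as the HEAPPROFILE example above
  // ... allocations to be profiled ...
  HeapProfilerDump("after setup");      // write an intermediate some-name.hprof.XXXXX.heap file
  // ... more work ...
  HeapProfilerStop();                   // stop heap profiling

  return 0;
}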
    

Linux perf

cmake .. -DASGARD_PROFILE_PERF=1

More details can be found at the Linux perf wiki

  1. Download and build Linux perf (done for you during 'make'). perf is installed in asgard/contrib/lperftools/bin.

    (flex and bison are dependencies and must be installed by the user)

  2. run the executable to be profiled (system privileges may be required):

    To get a runtime distribution:

    $ perf record ./asgard -p continuity_6 -l 8 -d 3
    

    To get a runtime call-graph

    $ perf record -g ./asgard -p continuity_6 -l 8 -d 3
    
  3. display the results with

    $ perf report
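
    The report can also be written as plain text, e.g.

    $ perf report --stdio > perf-report.txt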

Valgrind

cmake .. -DASGARD_PROFILE_VALGRIND=1

See the valgrind user manual

  1. build/link the code with the -g option (will be done for you during 'make')

  2. To perform a memory check, e.g. (a variation with leak checking enabled is sketched after this list)

    $ valgrind --tool=memcheck --log-file=<filename> ./asgard -p continuity_6 -l 8 -d 3
    

    this writes the memcheck report to <filename> in the current directory

  3. Cache profiling, e.g.
    Run the profiling tool

    $ valgrind --tool=cachegrind ./asgard -p continuity_6 -l 8 -d 3
    

    By default the output file is named cachegrind.out.<pid>.

    Run report

    $ cg_annotate --auto=yes cachegrind.out.<pid>
    

    This generates a call-by-call cache performance report.

  4. Call graph

    $ valgrind --tool=callgrind [callgrind options] ./asgard -p continuity_6 -l 8 -d 3
    $ callgrind_annotate [options] callgrind.out.<pid>
    

    for more details about call-graph generation, see the callgrind chapter (ch. 6) of the valgrind manual

  5. Heap profiler

    $ valgrind --tool=massif [massif options] ./asgard -p continuity_6 -l 8 -d 3
    $ ms_print massif.out.<pid>
    

    "Massif Visualizer" (e.g. available in apt) is a nice tool for visualizing this data.

    for more details about heap profiling, see the massif chapter (ch. 9) of the valgrind manual
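
As mentioned in step 2, a memcheck run with leak checking enabled might look like the following (the log file name is just a placeholder):

    $ valgrind --tool=memcheck --leak-check=full --track-origins=yes --log-file=memcheck.txt ./asgard -p continuity_6 -l 8 -d 3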

Sample problem sizes

Each run command is followed by the size of the entire problem in GPU DRAM.

(Note we can still run problems where the allocations below exceed the GPU DRAM size. The problem is broken into pieces based on the workspace limit in main, currently hardcoded to 10GB. However, we don't have a means of splitting coefficients yet, so those have to fit.)

Approximate time per explicit timestep is given; this is for a single Titan V on our workstation.

Provided runtimes are total wallclock time for these runs on Cori, five explicit timesteps, both without GPU acceleration and with one GPU.

2D

./asgard -p continuity_2 -l 2 -d 2

DOF=32. small 2d problem. can run on laptop. ~1 sec timesteps.

coefficient size (MB): <1

solution size (MB): <1

workspace size (MB): <1

runtime (with GPU): 0m 0.822s

runtime (without GPU): 0m 0.451s

./asgard -p continuity_2 -l 10 -d 4

DOF=98,304, large 2d problem. ~3.5 sec timesteps.

coefficient size (MB): 537

solution size (MB): <1

workspace size (MB): 19,327

runtime (with GPU): 0m 26.043s

runtime (without GPU): 15m 59.284s

./asgard -p continuity_2 -l 12 -d 4

DOF=458,752, very large 2d problem. could benefit from multiple GPUs. ~50% DRAM on our Titan V cards is consumed by coefficients, which have to be resident with our current design. 1-2 minute timesteps.

coefficient size (MB): 8,590

solution size (MB): 4

workspace size (MB): 420,907

runtime (with GPU): 9m 19.823s

runtime (without GPU):

3D

./asgard -p continuity_3 -l 3 -d 2

DOF=304, small 3d problem. can run on laptop; 1-2 sec timesteps.

coefficient size (MB): <1

solution size (MB): <1

workspace size (MB): <1

runtime (with GPU): 0m 2.094s

runtime (without GPU): 0m 0.795s

./asgard -p continuity_3 -l 8 -d 4

DOF=262,144, large 3d problem. 7-8 sec timesteps.

coefficient size (MB): 75

solution size (MB): 2

workspace size (MB): 51,539

runtime (with GPU): 0m 48.750s

runtime (without GPU): 31m 18.417s

./asgard -p continuity_3 -l 11 -d 4

DOF=1,458,176, very large 3d problem, could use multiple GPUs. ~5 min per timestep.

coefficient size (MB): 1,208

solution size (MB): 12

workspace size (MB): 1,594,710

runtime (with GPU): 119m 9.511s

runtime (without GPU):

6D

./asgard -p continuity_6 -l 2 -d 4

DOF=139,264, small 6d problem. can be run on a laptop. ~2 sec per timestep.

coefficient size (MB): <1

solution size (MB): 1

workspace size (MB): 455

runtime (with GPU): 0m 3.288s

runtime (without GPU): 0m 23.532s

./asgard -p continuity_6 -l 4 -d 4

DOF=2,052,096, large 6d problem. multiple GPUs will help; ~30 sec per timestep.

coefficient size (MB): 2

solution size (MB): 16

workspace size (MB): 98,698

runtime (with GPU): 3m 9.969s

runtime (without GPU): 83m 33.835s

./asgard -p continuity_6 -l 6 -d 4

DOF=21,856,256, very large 6d problem. use multiple nodes/GPUs.

coefficient size (MB): 19

solution size (MB): 175

workspace size (MB): 112,960,000

runtime (with GPU):

runtime (without GPU):