diff --git a/chapters/6-Performance-Analysis-Approaches/6-9 Chapter Summary.md b/chapters/6-Performance-Analysis-Approaches/6-10 Chapter Summary.md
similarity index 100%
rename from chapters/6-Performance-Analysis-Approaches/6-9 Chapter Summary.md
rename to chapters/6-Performance-Analysis-Approaches/6-10 Chapter Summary.md
diff --git a/chapters/6-Performance-Analysis-Approaches/6-4 Marker APIs.md b/chapters/6-Performance-Analysis-Approaches/6-4 Marker APIs.md
new file mode 100644
index 0000000000..331db5d26d
--- /dev/null
+++ b/chapters/6-Performance-Analysis-Approaches/6-4 Marker APIs.md
@@ -0,0 +1,117 @@
+### Using Marker APIs
+
+In certain scenarios, we might be interested in analyzing the performance of a specific code region rather than an entire application. For example, you may be developing a new piece of code and want to focus just on that code. Naturally, you would like to track optimization progress and capture additional performance data that will help you along the way. Most performance analysis tools provide specific *marker APIs* that let you do that. Here are a few examples:
+
+* Likwid has `LIKWID_MARKER_START / LIKWID_MARKER_STOP` macros.
+* Intel VTune has `__itt_task_begin / __itt_task_end` functions.
+* AMD uProf has `amdProfileResume / amdProfilePause` functions.
+
+Such a hybrid approach combines the benefits of instrumentation and performance event counting. Instead of measuring the whole program, marker APIs allow us to attribute performance statistics to code regions (loops, functions) or functional pieces (remote procedure calls (RPCs), input events, etc.). The quality of the data you get back can easily justify the effort. For example, while chasing a performance bug that happens only with a specific type of RPC, you can enable monitoring just for that type of RPC.
+
+Below we provide a very basic example of using [libpfm4](https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/)[^1], one of the popular Linux libraries for collecting performance monitoring events. It is built on top of the Linux `perf_events` subsystem, which lets you access performance event counters directly. The `perf_events` subsystem is rather low-level, so the `libpfm4` package is useful here, as it adds both a discovery tool for identifying available events on your CPU and a wrapper library around the raw `perf_event_open` system call. [@lst:LibpfmMarkerAPI] shows how one can use `libpfm4` to instrument the `render` function of the [C-Ray](https://openbenchmarking.org/test/pts/c-ray)[^2] benchmark.
+
+Listing: Using libpfm4 marker API on the C-Ray benchmark
+
+~~~~ {#lst:LibpfmMarkerAPI .cpp}
++#include <perfmon/pfmlib.h>
++#include <perfmon/pfmlib_perf_event.h>
+...
+/* render a frame of xsz/ysz dimensions into the provided framebuffer */
+void render(int xsz, int ysz, uint32_t *fb, int samples) {
+   ...
++  pfm_initialize();
++  struct perf_event_attr perf_attr;
++  memset(&perf_attr, 0, sizeof(perf_attr));
++  perf_attr.size = sizeof(struct perf_event_attr);
++  perf_attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
++                          PERF_FORMAT_TOTAL_TIME_RUNNING | PERF_FORMAT_GROUP;
++
++  pfm_perf_encode_arg_t arg;
++  memset(&arg, 0, sizeof(pfm_perf_encode_arg_t));
++  arg.size = sizeof(pfm_perf_encode_arg_t);
++  arg.attr = &perf_attr;
++
++  pfm_get_os_event_encoding("instructions", PFM_PLM3, PFM_OS_PERF_EVENT_EXT, &arg);
++  int leader_fd = perf_event_open(&perf_attr, 0, -1, -1, 0); // group leader
++  pfm_get_os_event_encoding("cycles", PFM_PLM3, PFM_OS_PERF_EVENT_EXT, &arg);
++  int event_fd = perf_event_open(&perf_attr, 0, -1, leader_fd, 0); // joins the leader's group
++  pfm_get_os_event_encoding("branches", PFM_PLM3, PFM_OS_PERF_EVENT_EXT, &arg);
++  event_fd = perf_event_open(&perf_attr, 0, -1, leader_fd, 0);
++  pfm_get_os_event_encoding("branch-misses", PFM_PLM3, PFM_OS_PERF_EVENT_EXT, &arg);
++  event_fd = perf_event_open(&perf_attr, 0, -1, leader_fd, 0);
++
++  struct read_format { uint64_t nr, time_enabled, time_running, values[4]; };
++  struct read_format before, after;
+
+  for(j=0; j