From 4e94d63152ff6740c28015a6dba89100e9cb0fac Mon Sep 17 00:00:00 2001 From: dbakhval Date: Mon, 15 May 2023 10:13:56 -0400 Subject: [PATCH] Fixed last comments from Lally --- chapters/6-Performance-Analysis-Approaches/6-4 Marker APIs.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/6-Performance-Analysis-Approaches/6-4 Marker APIs.md b/chapters/6-Performance-Analysis-Approaches/6-4 Marker APIs.md index b8e856c6f8..331db5d26d 100644 --- a/chapters/6-Performance-Analysis-Approaches/6-4 Marker APIs.md +++ b/chapters/6-Performance-Analysis-Approaches/6-4 Marker APIs.md @@ -74,7 +74,7 @@ void render(int xsz, int ysz, uint32_t *fb, int samples) { } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -In this code example, we first initialize `libpfm` library and configure performance events and the format that we will use to read them. Then, we choose the code region we want to analyze, in our case it is a loop with a `trace` function call inside. We surround this code region with two `read` system calls that will capture values of performance counters before and after the loop. Next, we save the deltas for later processing, in this case, we aggregated (code is not shown) it by calculating average, 90th percentile and maximum values. Running it on an Intel Alderlake-based machine, we've got the output shown below. Root priviledges are not required, but `/proc/sys/kernel/perf_event_paranoid` should be set to less than 1. When reading counters from inside a thread, the values are for that thread alone. It can optionally include kernel code that ran and was attributed to the thread. +In this code example, we first initialize `libpfm` library and configure performance events and the format that we will use to read them. In the C-Ray benchmark, the `render` function is only called once. In your own code, be carefull about not doing `libpfm` initialization multiple times. Then, we choose the code region we want to analyze, in our case it is a loop with a `trace` function call inside. We surround this code region with two `read` system calls that will capture values of performance counters before and after the loop. Next, we save the deltas for later processing, in this case, we aggregated (code is not shown) it by calculating average, 90th percentile and maximum values. Running it on an Intel Alderlake-based machine, we've got the output shown below. Root priviledges are not required, but `/proc/sys/kernel/perf_event_paranoid` should be set to less than 1. When reading counters from inside a thread, the values are for that thread alone. It can optionally include kernel code that ran and was attributed to the thread. ```bash $ ./c-ray-f -s 1024x768 -r 2 -i sphfract -o output.ppm @@ -104,7 +104,7 @@ For long routines, you can collect counters at the beginning, end, and some part In the [@lst:LibpfmMarkerAPI], we collected 4 events simultaneously, though the CPU has 6 programmable counters. You can open up additional groups with different sets of events enabled. The kernel will select different groups to run at a time. The `time_enabled` and `time_running` fields indicate the multiplexing. They are both durations in nanoseconds. The `time_enabled` field indicates how many nanoseconds the event group has been enabled. The `time_running` indicates how much of that enabled time the events were actually collecting. If you had two event groups enabled simultaneously that couldn't fit together on the HW counters, you might see them both converge to `time_running = 0.5 * time_enabled`. Scheduling in general is complicated so verify before depending on your exact scenario. -Capturing multiple events simultaneously allows to calculate various metrics that we discussed in Chapter 4. For example, capturing `INSTRUCTIONS_RETIRED` and `UNHALTED_CLOCK_CYCLES` enables us to measure IPC. We can observe the effects of frequency scaling by comparing CPU cycles (`UNHALTED_CORE_CYCLES`) vs the fixed-frequency reference clock (`UNHALTED_REFERENCE_CYCLES`). It is possible to detect when the thread wasn't running by requesting CPU cycles consumed (`UNHALTED_CORE_CYCLES`, only counts when the thread is running) and comparing against wall-clock. Also, we can normalize the numbers to get the event rate per second/clock/instruction. For instance, measuring `MEM_LOAD_RETIRED.L3_MISS` and `INSTRUCTIONS_RETIRED` we can get the `L3MPKI` metric. As you can see, the setup is very flexible. +Capturing multiple events simultaneously allows to calculate various metrics that we discussed in Chapter 4. For example, capturing `INSTRUCTIONS_RETIRED` and `UNHALTED_CLOCK_CYCLES` enables us to measure IPC. We can observe the effects of frequency scaling by comparing CPU cycles (`UNHALTED_CORE_CYCLES`) vs the fixed-frequency reference clock (`UNHALTED_REFERENCE_CYCLES`). It is possible to detect when the thread wasn't running by requesting CPU cycles consumed (`UNHALTED_CORE_CYCLES`, only counts when the thread is running) and comparing against wall-clock time. Also, we can normalize the numbers to get the event rate per second/clock/instruction. For instance, measuring `MEM_LOAD_RETIRED.L3_MISS` and `INSTRUCTIONS_RETIRED` we can get the `L3MPKI` metric. As you can see, the setup is very flexible. The important property of grouping events is that the counters will be available atomically under the same `read` system call. These atomic bundles are very useful. First, it allows us to correlate events within each group. Say we measure IPC for a region of code, and found that it is very low. In this case, we can pair two events (instructions and cycles) with a third one, say L3 cache misses, to check if it contributes to a low IPC that we're dealing with. If it doesn't, we continue factor analysis using other events. Second, event grouping helps to mitigate bias in case a workload has different phases. Since all the events within a group are measured at the same time, they always capture the same phase.