More edits on Marker API section
dbakhval authored and dbakhval committed May 15, 2023
1 parent 66d23f1 commit e7990b8
Showing 1 changed file with 7 additions and 22 deletions.
29 changes: 7 additions & 22 deletions chapters/6-Performance-Analysis-Approaches/6-4 Marker APIs.md
@@ -92,38 +92,23 @@ Remember that the instrumentation we added measures the per-pixel ray tracing...
The C-ray benchmark primarily stresses the floating-point performance of a CPU core, which generally should not cause high variance in the measurements; in other words, we expect all the measurements to be very close to each other. However, we see that this is not the case: p90 values are 1.33x the average, and the maximum is sometimes 5x slower than the average case. The most likely explanation is that for some pixels the algorithm hits a corner case, executes more instructions, and consequently runs longer. But it's always good to confirm the hypothesis by studying the source code or extending the instrumentation to capture more data for the "slow" pixels.
The additional code shown in [@lst:LibpfmMarkerAPI] causes 17% overhead, which is OK for local experiments but quite high for production. Most large distributed systems aim for less than 1% overhead, and for some up to 5% can be tolerable, but it's unlikely that users would be happy with a 17% slowdown. To bring the overhead down, we could capture counters only once instead of twice inside the loop, which halves the overhead but makes the measurements a little less accurate. This inaccuracy can be compensated for, since the code that runs after the instrumentation inside the outer loop has a fixed cost.
Overhead is usefully calculated as an occurrence rate per unit of time or work (an RPC, a database query, a loop iteration, etc.). If a system call on our system costs roughly 1.6 microseconds of CPU time and we make two of them for each pixel (each iteration of the outer loop), the overhead is 3.2 microseconds of CPU time per pixel.
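
Below is a rough sketch of the once-per-iteration idea; the helper names (`read_counters`, `render_pixel`) are assumptions for illustration, not the code from [@lst:LibpfmMarkerAPI].

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the real instrumentation and rendering code.
static uint64_t read_counters() { return 0; }   // one read(2) on the group fd
static void render_pixel(int /*i*/) {}          // the work we want to measure

int main() {
  const int num_pixels = 1024 * 768;
  std::vector<uint64_t> per_pixel(num_pixels);
  uint64_t prev = read_counters();              // a single syscall before the loop
  for (int i = 0; i < num_pixels; i++) {
    render_pixel(i);
    uint64_t curr = read_counters();            // one syscall per iteration
    // The delta also includes the loop bookkeeping, whose fixed cost can be
    // estimated once and subtracted if more accuracy is needed.
    per_pixel[i] = curr - prev;
    prev = curr;
  }
  return 0;
}
```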
There are many strategies to bring the overhead down. As a general rule, your instrumentation should always have a fixed cost (e.g., a deterministic syscall, but not a list traversal or dynamic memory allocation), otherwise it will interfere with the measurements. The instrumentation code has three logical parts: collecting the information, storing it, and reporting it. To lower the overhead of the first part (collection), we can decrease the sampling rate, e.g., sample every 10th RPC and skip the rest. For a long-running application, performance can be monitored with relatively cheap random sampling: randomly select which events to observe. These methods sacrifice collection accuracy but still provide a good estimate of the overall performance characteristics while incurring very low overhead.
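
Here is a minimal sketch of both sampling schemes; `read_counters`, `process_rpc`, and `record_sample` are hypothetical placeholders.

```cpp
#include <cstdint>
#include <random>

// Hypothetical stand-ins; stubbed so the sketch is self-contained.
static uint64_t read_counters() { return 0; }
static void process_rpc() {}
static void record_sample(uint64_t /*delta*/) {}

// Deterministic sampling: instrument every 10th RPC, skip the rest.
void handle_rpc(uint64_t rpc_index) {
  const bool sampled = (rpc_index % 10 == 0);
  const uint64_t begin = sampled ? read_counters() : 0;
  process_rpc();
  if (sampled)
    record_sample(read_counters() - begin);
}

// Random sampling: observe roughly 1% of RPCs, chosen at random.
void handle_rpc_random() {
  static thread_local std::mt19937 gen{std::random_device{}()};
  static thread_local std::bernoulli_distribution pick(0.01);
  const bool sampled = pick(gen);
  const uint64_t begin = sampled ? read_counters() : 0;
  process_rpc();
  if (sampled)
    record_sample(read_counters() - begin);
}
```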
For the second and third parts (storing and aggregating), the recommendation is to collect, process, and retain only as much data as you need to understand the performance of the system. You can avoid storing every sample in memory by using "online" algorithms for calculating the mean, variance, min, max, and other metrics. This will drastically reduce the memory footprint of the instrumentation. For instance, variance and standard deviation can be calculated using Knuth's online-variance algorithm. A good implementation[^3] uses less than 50 bytes of memory.
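
As an illustration, here is one possible implementation of such an online accumulator (Welford's formulation of the algorithm described by Knuth); an instance of this struct occupies 40 bytes.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <limits>

// Running mean/variance/min/max without storing individual samples.
struct RunningStats {
  uint64_t n = 0;
  double mean = 0.0;
  double m2 = 0.0;   // sum of squared deviations from the running mean
  double min = std::numeric_limits<double>::max();
  double max = std::numeric_limits<double>::lowest();

  void add(double x) {
    ++n;
    const double delta = x - mean;
    mean += delta / static_cast<double>(n);
    m2 += delta * (x - mean);   // second factor uses the updated mean
    min = std::min(min, x);
    max = std::max(max, x);
  }
  double variance() const { return n > 1 ? m2 / static_cast<double>(n - 1) : 0.0; }
  double stddev()   const { return std::sqrt(variance()); }
};
```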
For long routines, you can collect counters at the beginning, at the end, and at a few points in the middle. Over consecutive runs, you can binary-search for the part of the routine that performs worst and optimize it. Repeat this until all the poorly-performing spots are removed. If tail latency is a primary concern, emitting a log message on a particularly slow run can provide useful insights.
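
A sketch of this narrowing procedure, with hypothetical probe and phase names:

```cpp
#include <cstdint>

// Hypothetical stand-ins for the counter read and the routine's phases.
static uint64_t read_cycles() { return 0; }
static void first_half() {}
static void second_half() {}
static void record_split(uint64_t /*front*/, uint64_t /*back*/) {}

void long_routine() {
  const uint64_t t0 = read_cycles();
  first_half();
  const uint64_t t1 = read_cycles();   // probe roughly in the middle
  second_half();
  const uint64_t t2 = read_cycles();
  // Over many runs, compare (t1 - t0) against (t2 - t1); then move the middle
  // probe into whichever half dominates and repeat, binary-searching for the
  // slow spot. Remove the probes once they are no longer needed.
  record_split(t1 - t0, t2 - t1);
}
```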
In [@lst:LibpfmMarkerAPI], we collected 4 events simultaneously, though the CPU has 6 programmable counters. You can open additional groups with different sets of events enabled; the kernel will then multiplex the groups, selecting which one runs at any given time. The `time_enabled` and `time_running` fields indicate this multiplexing. They are both durations in nanoseconds. The `time_enabled` field indicates how many nanoseconds the event group has been enabled. The `time_running` field indicates for how much of that enabled time the events were actually being collected. If you had two event groups enabled simultaneously that couldn't fit together on the HW counters, you might see them both converge to `time_running = 0.5 * time_enabled`. Counter scheduling is complicated in general, so verify the behavior before depending on it in your exact scenario.
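
As a sketch, a group read can be parsed and scaled roughly as shown below. The struct layout follows the format documented in `perf_event_open(2)` for `PERF_FORMAT_GROUP | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING` (with `PERF_FORMAT_ID` omitted for brevity); the scaling is the usual estimate `value * time_enabled / time_running`.

```cpp
#include <cstdint>
#include <unistd.h>

constexpr int kMaxEvents = 6;  // programmable counters on this CPU

// Layout of one read(2) on the group leader fd, per perf_event_open(2),
// assuming PERF_FORMAT_GROUP | PERF_FORMAT_TOTAL_TIME_ENABLED |
// PERF_FORMAT_TOTAL_TIME_RUNNING and no PERF_FORMAT_ID.
struct GroupReading {
  uint64_t nr;            // number of events in the group
  uint64_t time_enabled;  // ns the group was enabled
  uint64_t time_running;  // ns the group actually occupied the counters
  uint64_t values[kMaxEvents];
};

// Read the whole group in one syscall; returns false on error.
bool read_group(int group_fd, GroupReading* out) {
  return read(group_fd, out, sizeof(*out)) > 0;
}

// Estimate what a count would have been without multiplexing.
uint64_t scaled(uint64_t value, const GroupReading& r) {
  if (r.time_running == 0) return 0;  // the group never got scheduled
  const double ratio = static_cast<double>(r.time_enabled) /
                       static_cast<double>(r.time_running);
  return static_cast<uint64_t>(static_cast<double>(value) * ratio);
}
```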
Capturing multiple events simultaneously allows us to calculate various metrics that we discussed in Chapter 4. For example, capturing `INSTRUCTIONS_RETIRED` and `UNHALTED_CLOCK_CYCLES` enables us to measure IPC. We can observe the effects of frequency scaling by comparing CPU cycles (`UNHALTED_CORE_CYCLES`) against the fixed-frequency reference clock (`UNHALTED_REFERENCE_CYCLES`). It is possible to detect when the thread wasn't running by requesting the CPU cycles consumed (`UNHALTED_CORE_CYCLES` only counts when the thread is running) and comparing against the wall clock. Also, we can normalize the numbers to get the event rate per second, per clock, or per instruction. For instance, by measuring `MEM_LOAD_RETIRED.L3_MISS` and `INSTRUCTIONS_RETIRED`, we can compute the `L3MPKI` metric. As you can see, the setup is very flexible.
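
A sketch of deriving these metrics from the captured deltas; the struct and function names are illustrative only.

```cpp
#include <cstdint>

// Deltas of the events captured for a region of code.
struct RegionCounts {
  uint64_t instructions;  // INSTRUCTIONS_RETIRED
  uint64_t core_cycles;   // UNHALTED_CORE_CYCLES (counts only while running)
  uint64_t ref_cycles;    // UNHALTED_REFERENCE_CYCLES (fixed-frequency clock)
  uint64_t l3_misses;     // MEM_LOAD_RETIRED.L3_MISS
};

double ipc(const RegionCounts& c) {
  return static_cast<double>(c.instructions) / static_cast<double>(c.core_cycles);
}
// >1.0 indicates the core ran above its base frequency (turbo), <1.0 throttling.
double frequency_ratio(const RegionCounts& c) {
  return static_cast<double>(c.core_cycles) / static_cast<double>(c.ref_cycles);
}
// L3 misses per thousand instructions.
double l3_mpki(const RegionCounts& c) {
  return 1000.0 * static_cast<double>(c.l3_misses) /
         static_cast<double>(c.instructions);
}
```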
An important property of grouping events is that the counter values are available atomically, under the same `read` system call. These atomic bundles are very useful. First, they allow us to correlate events within each group. Say we measure IPC for a region of code and find that it is very low. In this case, we can pair the two events (instructions and cycles) with a third one, say L3 cache misses, to check whether it contributes to the low IPC we're dealing with. If it doesn't, we continue the factor analysis using other events. Second, event grouping helps mitigate bias when a workload has distinct phases. Since all the events within a group are measured at the same time, they always capture the same phase.
In some scenarios, instrumentation may become part of a functionality or a feature. For example, a developer can implement instrumentation logic that detects a decrease in IPC (e.g., when a busy sibling HW thread is running) or a drop in CPU frequency (e.g., the system throttles due to heavy load). When such an event occurs, the application automatically defers low-priority work to compensate for the temporarily increased load.
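
A sketch of what such logic could look like; the thresholds and helper names below are assumptions for illustration.

```cpp
#include <cstdint>

// Hypothetical helpers: a periodic read of the counter group and a hook
// that postpones background work.
struct Snapshot { uint64_t instructions, core_cycles, ref_cycles; };
static Snapshot read_group_counters() { return {}; }
static void defer_low_priority_work() {}

void periodic_health_check() {
  static Snapshot prev = read_group_counters();
  const Snapshot curr = read_group_counters();
  const uint64_t cycles = curr.core_cycles - prev.core_cycles;
  const uint64_t refs   = curr.ref_cycles - prev.ref_cycles;
  if (cycles == 0 || refs == 0) return;  // nothing ran since the last check
  const double ipc  = static_cast<double>(curr.instructions - prev.instructions) /
                      static_cast<double>(cycles);
  const double freq = static_cast<double>(cycles) / static_cast<double>(refs);
  // Illustrative thresholds: react to a low IPC (e.g., a busy sibling HW
  // thread) or to frequency throttling under heavy load.
  if (ipc < 0.8 || freq < 0.9)
    defer_low_priority_work();
  prev = curr;
}
```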
[^1]: libpfm4 - [https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/](https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/)