Finished TODOs.
Lally Singh authored and dbakhval committed May 8, 2023
1 parent fe79cd4 commit 594d682
Showing 1 changed file with 8 additions and 11 deletions.
19 changes: 8 additions & 11 deletions chapters/6-Performance-Analysis-Approaches/6-4 Marker APIs.md
@@ -94,23 +94,18 @@ The C-ray benchmark primarily stresses the floating-point performance of a CPU c
Managing the overhead of your instrumentation is critical, especially if you choose to enable it in a production environment. The instrumentation code has three logical parts: collecting the information, storing it, and reporting it. Overhead is usefully calculated as a rate per unit of time or work (per RPC, per loop iteration, etc.). Each collection should have a fixed cost (e.g., a syscall, but not a list traversal), and its overhead is that cost times the rate. For example, if a system call on our system takes roughly 1.6 microseconds of CPU time, and we make two of them for each pixel (one iteration of the outer loop), our overhead is 3.2 microseconds of CPU time per pixel.
TODO: @Lally: do you want to elaborate on three parts of the instrumentation (collect, store, report)?
TODO: @Lally: do you want to elaborate on the fixed cost of the instrumentation? Any guidelines on how to achieve that?
The additional code shown in [@lst:LibpfmMarkerAPI] causes 17% overhead, which is acceptable for local experiments but quite high for production. Most large distributed systems aim for less than 1% overhead, and for some, up to 5% can be tolerable, but it's unlikely that users would be happy with a 17% slowdown. To bring the overhead down, we could capture counters only once inside the loop instead of twice, which halves the overhead but makes the measurement a little less accurate. This inaccuracy can be compensated for by subtracting the fixed cost of the code that runs after the instrumentation inside the outer loop.
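As an illustration, here is a minimal sketch of the single-read variant. It is not the book's listing: `readCounters`, `renderRow`, and `recordDelta` are hypothetical placeholders for the counter read (e.g., a `read(2)` on the event file descriptor) and the per-iteration work.

```cpp
#include <cstdint>

uint64_t readCounters();                  // placeholder: read(2) on the perf event fd
void renderRow(int y);                    // placeholder: the work being measured
void recordDelta(int y, uint64_t delta);  // placeholder: store the per-row sample

void renderAllRows(int height) {
  // One counter read per iteration: each row is charged the delta since the
  // previous read, which also includes the per-iteration bookkeeping code.
  // That bookkeeping has a roughly fixed cost, so it can be subtracted later.
  uint64_t prev = readCounters();
  for (int y = 0; y < height; ++y) {
    renderRow(y);
    uint64_t now = readCounters();
    recordDelta(y, now - prev);
    prev = now;
  }
}
```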
To control overhead and relevance, you can choose when to enable data collection. For a long-running application, random sampling captures the overall performance characteristics while incurring very low overhead. If tail latency is the primary concern, emitting log messages only for particularly slow runs can provide useful insights. In general, developers are free to implement any kind of adaptive instrumentation logic that keeps the overhead low while still capturing useful data.
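For instance, a tail-latency-focused variant might time every run but report only the slow ones. The sketch below is illustrative: the 5-millisecond threshold is arbitrary and `processRequest` is a hypothetical stand-in for the real work.

```cpp
#include <chrono>
#include <cstdio>

void processRequest();  // placeholder for the real work

void handleRequest() {
  auto start = std::chrono::steady_clock::now();
  processRequest();
  auto elapsedUs = std::chrono::duration_cast<std::chrono::microseconds>(
                       std::chrono::steady_clock::now() - start).count();
  // Pay the reporting cost only for unusually slow runs.
  if (elapsedUs > 5000)
    fprintf(stderr, "slow request: %lld us\n", (long long)elapsedUs);
}
```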
With `perf_counters` you get fantastic precision and control over what you measure, and it's tempting to do too much. Treat the tool like a scalpel: a small number of key uses works best. The instrumentation overhead for `perf_counter` is the cost of a system call -- easily 1.5 microseconds each. The highest-overhead scenario is instrumenting almost everything in the code, saving it all to disk, and then calculating basic statistics that you could have estimated with 0.01% of what you collected. Collect, process, and retain only as much as you need to understand the performance of the system; how much understanding you want will drive that.
TODO: @Lally: do you want to elaborate on the storage overhead? We can say about online algorithms for calculating mean and stddev, which require no additional memory consumption.
If you need to save data for later use, you will need a low-overhead I/O solution. A custom `struct` holding the values you care about, perhaps with an `enum` identifying the sample site, stays efficient and can be parsed with Python's `struct` library. You can keep a buffer of these structs in memory and fill them in as you get instrumentation data, then `fwrite(3)` them to disk when the application is less busy, or from another thread. Alternatively, you can `lseek(2)` past EOF to create a "hole" in your output file, `mmap(2)` it to use directly as your buffer, and fill it in as you go.
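A minimal sketch of the buffer-and-`fwrite` approach follows; the record layout and field names are illustrative, and the matching Python format string for this layout would be `'<IIqq'`.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative sample-site tags and record layout. Fixed-width fields keep the
// file trivially parseable, e.g. struct.unpack('<IIqq', data) in Python.
enum SampleSite : uint32_t { OuterLoop = 0, InnerLoop = 1 };

struct Sample {
  uint32_t site;          // which instrumentation point produced this record
  uint32_t pad;           // explicit padding to keep 8-byte alignment
  int64_t  instructions;  // counter delta for the measured region
  int64_t  cycles;        // counter delta for the measured region
};

static std::vector<Sample> gBuffer;  // filled on the hot path

void recordSample(SampleSite site, int64_t instructions, int64_t cycles) {
  gBuffer.push_back({site, 0, instructions, cycles});
}

// Flush when the application is less busy, or from another thread.
void flushSamples(FILE* out) {
  if (!gBuffer.empty()) {
    fwrite(gBuffer.data(), sizeof(Sample), gBuffer.size(), out);
    gBuffer.clear();
  }
}
```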
Storage overhead is either linear or logarithmic in the collection rate times the retention period.
If you want to know your efficiency per event type (e.g., different RPCs, input message types, etc.), you can keep a running mean and variance per type. Collect the counters at the beginning and end of the event's processing and take the differences. Add those differences to per-event sums. You can calculate the variance and standard deviation using Knuth's online variance algorithm; a good implementation[^3] needs only a few doubles. You can then look at event distribution, means, and variances to prioritize optimization work: optimize the handlers for the most frequent events, or for the ones responsible for most of your overall variance. For long routines, you can collect counters at the beginning, at the end, and at a few points in the middle. Over consecutive runs, you can binary-search for the part of the routine that performs worst and optimize it. Repeat until all the poorly performing spots are removed. Remember to remove counter-collection code once you no longer need it.
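A sketch of such an accumulator, following Welford's formulation of the online algorithm, needs only a count and two doubles per event type (names are illustrative):

```cpp
#include <cmath>
#include <cstdint>

// Running mean/variance (Knuth/Welford): constant memory per event type.
struct RunningStats {
  uint64_t n = 0;
  double mean = 0.0;
  double m2 = 0.0;  // sum of squared deviations from the current mean

  void add(double x) {
    ++n;
    double delta = x - mean;
    mean += delta / n;
    m2 += delta * (x - mean);
  }
  double variance() const { return n > 1 ? m2 / (n - 1) : 0.0; }
  double stddev() const { return std::sqrt(variance()); }
};

// Usage sketch: one accumulator per event type.
//   stats[eventType].add(double(cyclesEnd - cyclesBegin));
```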
While capturing data, be careful not to flood the I/O system with performance data; otherwise application performance will suffer and the instrumentation will interfere too much with normal execution.
Use statistical methods to reduce your data-collection needs. For a long-running application, random sampling captures the overall performance characteristics while incurring very low overhead. A quick `if (selected_for_sampling) {}` branch around each `read(2)` costs almost nothing thanks to branch prediction, and even branch mispredicts here will still be small compared to the cost of the `read(2)`. Alternatively, you can collect a lot but report only the useful information: if tail latency is the primary concern, emitting log messages on a particularly slow run can provide useful insights.
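A sketch of such a sampling gate might look like the following; `g_counterFd` stands for a perf event file descriptor opened elsewhere (e.g., via libpfm as in the listing), and the 1-in-128 sampling rate is an arbitrary illustration.

```cpp
#include <cstdint>
#include <cstdlib>
#include <unistd.h>

extern int g_counterFd;  // assumed: perf event fd opened elsewhere

inline bool selectedForSampling() {
  // Any cheap pseudo-random gate works; a thread-local PRNG avoids contention.
  return (rand() & 127) == 0;  // sample roughly 1 in 128 iterations
}

void maybeReadCounters(uint64_t* value) {
  if (selectedForSampling())  // nearly free when the branch is not taken
    read(g_counterFd, value, sizeof(*value));
}
```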
In [@lst:LibpfmMarkerAPI], we collected 4 events simultaneously, though the CPU has 6 programmable counters. You can open additional groups with different sets of counters enabled; the kernel will then multiplex the groups, running one at a time. The `time_enabled` and `time_running` fields indicate the multiplexing. Both are durations in nanoseconds: `time_enabled` reports how long the counter group has been enabled, and `time_running` reports how much of that enabled time the counters were actually collecting. If you had two counter groups enabled simultaneously that couldn't fit together on the counters, you might see both converge to `time_running == 0.5 * time_enabled`. Counter scheduling in general is complicated, so verify the behavior in your exact scenario before depending on it.
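A common way to use these fields is to scale the raw counts by `time_enabled / time_running`, estimating what a counter would have read had it run the whole time. The sketch below assumes a group of exactly four counters opened with `PERF_FORMAT_GROUP | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING`:

```cpp
#include <cstdint>

// Layout returned by read(2) for a counter group opened with
// PERF_FORMAT_GROUP | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING.
struct GroupReading {
  uint64_t nr;            // number of counters in the group (4 here)
  uint64_t time_enabled;  // nanoseconds the group was enabled
  uint64_t time_running;  // nanoseconds the group was actually on the PMU
  uint64_t values[4];     // raw counts, one per counter
};

// Estimate the "full-time" count for counter i; when groups were multiplexed
// this is an extrapolation, not an exact value.
inline double scaledCount(const GroupReading& r, int i) {
  if (r.time_running == 0) return 0.0;  // group never got scheduled
  return double(r.values[i]) * double(r.time_enabled) / double(r.time_running);
}
```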
TODO: @Lally describe multiplexing in case we specify more events than physical counters. If I specify more events than physical PMCs, will libpfm start automatically multiplexing? How do you then scale the counters?
Capturing multiple events simultaneously allows us to calculate the various metrics that we discussed in Chapter 4. For example, capturing `INSTRUCTIONS_RETIRED` and `UNHALTED_CLOCK_CYCLES` enables us to measure IPC. We can observe the effects of frequency scaling by comparing CPU cycles (`UNHALTED_CORE_CYCLES`) against the fixed-frequency reference clock (`UNHALTED_REFERENCE_CYCLES`). We can detect when the thread wasn't running by requesting CPU cycles consumed (`UNHALTED_CORE_CYCLES` counts only while the thread is running) and comparing against wall-clock time. Also, we can normalize the numbers to get event rates per second, per clock, or per instruction. For instance, by measuring `MEM_LOAD_RETIRED.L3_MISS` and `INSTRUCTIONS_RETIRED`, we can compute the `L3MPKI` metric. As you can see, the setup is very flexible.
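As a small sketch, such derived metrics can be computed directly from the counter deltas (variable names are illustrative):

```cpp
#include <cstdint>

struct Metrics {
  double ipc;     // instructions per cycle
  double l3mpki;  // L3 misses per 1000 retired instructions
};

Metrics computeMetrics(uint64_t instructionsRetired,
                       uint64_t unhaltedCoreCycles,
                       uint64_t l3Misses) {
  Metrics m{0.0, 0.0};
  if (unhaltedCoreCycles)
    m.ipc = double(instructionsRetired) / double(unhaltedCoreCycles);
  if (instructionsRetired)
    m.l3mpki = 1000.0 * double(l3Misses) / double(instructionsRetired);
  return m;
}
```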
@@ -132,4 +127,6 @@ In some scenarios, instrumentation may become a part of a functionality or a fea
[^1]: libpfm4 - [https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/](https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/)
[^2]: C-Ray benchmark - [https://openbenchmarking.org/test/pts/c-ray](https://openbenchmarking.org/test/pts/c-ray)
[^3]: Accurately computing running variance - [https://www.johndcook.com/blog/standard_deviation/](https://www.johndcook.com/blog/standard_deviation/)
