ppt Production Performance Telemetry

AKA the Portable Performance Tool. But really, the performance tool of last resort.

Too cool to find on Google! We live in the shadows of powerpoint!

Purpose

ppt is a performance tracing tool. You define different types of individual frames and ppt will generate code to call and link into your executable. Then ppt will let you attach to your running executable to collect these traces.

This is quite useful when you want to know why your code performs a certain way in real-world scenarios.

A frame is essentially a C struct that you fill out and save to a circular buffer in shared memory, called a buffer. You can add any members you like to a frame to record relevant details about the event. ppt also has built-in support for using timers and CPU performance counters.

ppt prioritizes low overhead first.

When ppt isn’t attached to a running process, there is no buffer and saving a frame is a no-op.

How do I build it?

You need stack, the Haskell build tool: https://docs.haskellstack.org/en/stable/README/. ppt has only been tested on Linux. I’m quite sure it won’t work on macOS without serious changes (e.g., the ELF dependency).

With Nix

With nix, it’s just stack install. The build output will indicate the directory (~/.local/bin) where the binaries are installed.

Without Nix

nix installs some dependencies, but not many. You just need to install two libraries:

| Library | Ubuntu Names         |
|---------+----------------------|
| zlib    | zlib1g, zlib1g-dev   |
| libpfm4 | libpfm4, libpfm4-dev |

Once installed just run stack install --no-nix. The build output will indicate the directory (~/.local/bin) where the binaries are installed. You can ignore elf, only ppt matters.

Why another performance tool?

Why not dtrace or gprof?

Because they solve different problems. In a dreaded car analogy, gprof is a dynamometer check, dtrace is a PMU logger, and ppt is custom telemetry. All useful, all separate.

dtrace is an excellent tool. It instruments unaltered binaries using several built-in “providers”, and you can alter the program to give dtrace even more data. But it has two drawbacks:

  • Attachment time: the more instrumentation you want on the execution, the longer it takes to attach to the process. This can leave the process unresponsive for longer than you may like.
  • Statistics and histograms only: no access to raw data. Instead, you can get histograms of collected data (or simple statistics). But the underlying raw data is captured only (AFAIK) with a relatively expensive print statement.

gprof is a good tool. It provides a simple way to get a rough idea of where your CPU time is taken. But it sacrifices a few things to get there:

  • It chooses where the overhead goes – gprof instruments on a function level, which biases towards overreporting small function overhead (because as a percentage, gprof-induced overhead is higher).
  • You don’t get as much detail. gprof gives you basic statistics on a program execution, but much detail can be hidden behind those statistics. For example, the program performance could be dominated by some infrequent - but very expensive - situation.

Why ppt? Get everything you want, how you want it.

In contrast, ppt’s data gives you the whole distribution (instead of statistics) for your data, and you choose what and when to instrument in the program’s execution.

The transfer protocol for sending a frame from the process under observation to ppt for recording is a pair of memory barriers, a copy of the data, and some additional writes to the sequence number. There’s no flow-control or other synchronization between ppt and the instrumented app. ppt can’t slow down your app, even if you ^Z ppt.
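The writer side of that protocol can be sketched roughly as follows. This is not ppt's generated code (read that for the real thing): the Slot layout, the publish() name, the slot-selection rule, and the use of 0 as an "in progress" marker are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical frame payload and slot layout (not ppt's real layout).
struct Payload { int a, b, c; };
struct Slot {
    std::atomic<uint64_t> seqno{0};  // assumption: 0 means "empty or being rewritten"
    Payload data{};
};

// Writer side only. A null buffer models "no ppt attached", where saving
// is a no-op. Returns true if a frame was published. seq must be nonzero.
inline bool publish(Slot* buffer, std::size_t nslots, uint64_t seq, const Payload& p) {
    if (buffer == nullptr || nslots == 0) return false;
    Slot& s = buffer[seq % nslots];                        // circular: old frames get overwritten
    s.seqno.store(0, std::memory_order_relaxed);           // mark the slot as in progress
    std::atomic_thread_fence(std::memory_order_release);   // barrier 1: invalidation stays ahead of the copy
    std::memcpy(&s.data, &p, sizeof p);                    // copy the frame
    std::atomic_thread_fence(std::memory_order_release);   // barrier 2: copy stays ahead of the publish
    s.seqno.store(seq, std::memory_order_relaxed);         // publish the new sequence number
    return true;
}
```

Nothing in publish() waits on a reader, which is the point. A reader following this scheme would presumably re-check seqno after copying a frame out and discard the copy if the number changed underneath it.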

Use ppt for regression analysis.

The best use-case for ppt is for regression analysis of state or inputs to performance values. Embed input values, loop counts, and whatever else you want (add more than you think you’ll need, the instrumentation’s cheap) and measure what matters. The resulting records will directly contain cause and effect.

Can it lose data?

Yes. This is a flow-control policy. ppt does not let the writer (the process being instrumented) wait for the reader (ppt) to catch up. Instead, some data will be overwritten in the shared buffer before it’s read by ppt and saved to disk.

Each frame in a ppt buffer has a monotonically increasing sequence number. If ppt doesn’t capture some frames in time, the captured frames will have jumps in their sequence numbers.
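Counting the loss from those jumps is simple arithmetic. Here is a minimal sketch, assuming sequence numbers go up by exactly one per saved frame and never wrap (neither is promised above); count_lost_frames is a made-up helper, not part of ppt.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts frames missing from a capture, given the sequence numbers of the
// captured frames in capture order.
// Assumption: the writer increments the sequence number by 1 per frame.
inline uint64_t count_lost_frames(const std::vector<uint64_t>& seqnos) {
    uint64_t lost = 0;
    for (std::size_t i = 1; i < seqnos.size(); ++i) {
        // A jump larger than 1 means the frames in between were overwritten
        // before the reader got to them.
        if (seqnos[i] > seqnos[i - 1] + 1) {
            lost += seqnos[i] - seqnos[i - 1] - 1;
        }
    }
    return lost;
}
```

For the captured sequence 1, 2, 5, 6, 9 this reports 4 lost frames (3, 4, 7 and 8).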

How do I use it?

Roughly:

  1. Describe the frames you want to collect in a .spec file.
  2. Generate source for those frames.
  3. Add instrumentation to your program to fill in frames and save them to the buffer.
  4. Attach to the running program to collect your data.
  5. Convert the collected data to CSV for analysis.

It’s a lot of steps, but it has a few advantages:

  • During your program’s run, ppt isn’t doing anything more than saving a shared memory buffer to disk periodically.
  • You get lots of control in your instrumentation.
  • You get raw data at the end, for deeper analysis.

Describing the Frames

emit C++;

buffer Minimal 512;

option time timespec realtime;

frame first {
   int a, b, c;
   interval time duration;
   interval counter events;
}

frame second {
   interval counter foos;
}

See the Buffer Syntax Reference for details.

Performance Counters

Notice above that you can use the counter type. Its recorded value is a uint64_t, but the actual counter used is unspecified. When attaching, use the -c flag to specify which actual counters to use.

The counter names are as-specified by libpfm4. You can use the showevtinfo command from that distribution to list the (many) counters available on your machine. Some highlights:

| Counter | Description |
|---------+-------------|
| INSTRUCTION_RETIRED | Count the number of instructions at retirement. For instructions that consist of multiple micro-ops, this event counts the retirement of the last micro-op of the instruction. |
| LLC_MISSES | Count each cache miss condition for references to the last level cache. The event count may include speculation, but excludes cache line fills due to hardware prefetch. |
| DTLB_LOAD_MISSES | DTLB load misses. See all modifiers in showevtinfo for details. |
| ITLB_MISSES | Instruction TLB misses. |
| L1-DCACHE-LOAD-MISSES | L1 data cache load misses. |
| L1-ICACHE-LOAD-MISSES | L1 instruction cache misses. |

Many, many more are available, but showevtinfo does a much better job of explaining which counters you can use than we can.

ppt allocates enough space for 3 counters’ worth of data in each counter member of a frame. Twice that for interval counter. You specify which counters you want on the command line in ppt attach. You can attach to the same process several times (sequentially) with different counters to get more than 3.

The current implementation will use a system call to read the counters in the generated code. Please beware of the performance impact of measurement.

Generating Source for Frames

$ ppt generate ./minimal.spec

Will generate source with this public API:

namespace ppt { namespace Minimal {
class first {
public:
    struct timespec duration_start;
    struct timespec duration_end;
    uint64_t events_0_start= 0;
    uint64_t events_0_end= 0;
    uint64_t events_1_start= 0;
    uint64_t events_1_end= 0;
    uint64_t events_2_start= 0;
    uint64_t events_2_end= 0;
    int a= 0;
    int b= 0;
    int c= 0;

    void save();
    void snapshot_duration_start();
    void snapshot_duration_end();
    void snapshot_events_start();
    void snapshot_events_end();
};
class second {
public:
    uint64_t foos_0_start= 0;
    uint64_t foos_0_end= 0;
    uint64_t foos_1_start= 0;
    uint64_t foos_1_end= 0;
    uint64_t foos_2_start= 0;
    uint64_t foos_2_end= 0;

    void save();
    void snapshot_foos_start();
    void snapshot_foos_end();
};
}} // namespace ppt::Minimal

There are additional members saved in each class that ppt uses internally. Check out the generated code for details.

Generally:

  • For simple scalar frame members, you should see a matching class member of the same name and type. Directly assign to it.
  • For time members, a method is provided to save the current time, called snapshot_MEMBER().
  • Same for counter members, except that each snapshot saves to 3 members at a time.
  • For interval types, ppt adds a _start and _end suffix to distinguish members for the start and end of the interval.
  • When it’s done filling in the frame, your code should call save() to make it available for an attached ppt process to capture.
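Once both ends of an interval are filled in, the measurement is just end minus start. Here is a small sketch of that arithmetic, using a stand-in struct (FirstFrame, a hypothetical name) that copies a few of the public members shown above so it compiles without the generated header.

```cpp
#include <cstdint>
#include <time.h>

// Stand-in for the generated ppt::Minimal::first: a subset of its public
// data members only. Hypothetical; use the generated class in real code.
struct FirstFrame {
    struct timespec duration_start{};
    struct timespec duration_end{};
    uint64_t events_0_start = 0;
    uint64_t events_0_end = 0;
    int a = 0;
};

// Length of the "duration" interval in nanoseconds.
inline int64_t duration_ns(const FirstFrame& f) {
    const int64_t sec = static_cast<int64_t>(f.duration_end.tv_sec) - f.duration_start.tv_sec;
    const int64_t nsec = static_cast<int64_t>(f.duration_end.tv_nsec) - f.duration_start.tv_nsec;
    return sec * 1000000000LL + nsec;  // nsec may be negative; the sum is still right
}

// How far the first attached counter advanced over the "events" interval.
inline uint64_t events_0_delta(const FirstFrame& f) {
    return f.events_0_end - f.events_0_start;
}
```

You would normally do this subtraction at analysis time rather than in the instrumented program, since the raw start and end values are what get saved.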

Instrumenting your Program

minimal-client: ppt-Minimal.hh ppt-Minimal.cc minimal-client.cc
	g++ -o minimal-client minimal-client.cc ppt-Minimal.cc

#include "ppt-Minimal.hh"

int main() {
   int acount = 0;
   while (1) {
       ppt::Minimal::first record;
       // collect timestamp of when this starts.
       record.snapshot_duration_start();
       // snapshot performance counters
       record.snapshot_events_start();
       // Do anything you want here.  For example, saving relevant parts of
       // input, loop counts, etc.
       record.a = 0xaaaa0000 + acount++;
       record.b = acount - record.a;
       record.c = 0xcccccccc;
       // snapshot performance counters.
       record.snapshot_events_end();
       // snapshot timestamp.
       record.snapshot_duration_end();
       // save to buffer.
       record.save();
   }
   return 0;
}

Roughly: make an instance of the frame type you want, fill it in, and then save() it. You can reuse the instance across save() calls. The members are directly accessible.

For example, if you’re processing a batch of the events, you may not want to pay the overhead of repeating snapshot_duration_start() and snapshot_duration_end() calls. You can simply call one of them, and copy the member over to the other:

ppt::Minimal::first record;
bool first = true;
while (1) {
   if (first) {
      record.snapshot_duration_start();
      first = false;
   } else {
      record.duration_start = record.duration_end;
   }
   // .. same innards as above.
   record.snapshot_duration_end();
   record.save();
}

You can do the same for the counters, but you may end up with a frame of distorted data if you detach and reattach (with different counters).

Attaching to your Program

$ ./minimal-client &
$ ppt attach -p $(pgrep minimal-client) -o output.bin

Converting Data for Analysis

$ ppt convert output.bin 
$ ls output.bin_output
minimal.csv

How do I use it effectively?

ppt is quite handy in two phases of your program’s lifecycle:

  1. During development, it provides good feedback on the implementation’s performance characteristics. This is very useful for:
    • Determining performance trade-offs.
    • Improving performance of the system
  2. During operations, it provides a good way to monitor the application.
    • Significantly faster to log than text
    • Easier to analyze/plot

    Note that more operational support is planned, once I get around to learning (n)curses.

When optimizing my program

First, you’ll clearly have to figure out what you want to optimize. The latency in response to an event? The time through the main loop? Time to complete N items of work?

  • Sort that out and come back. We’ll wait.
  • Got it? Good. Here we go:

What to instrument

First figure out what you’re measuring:

  • The performance metric: the number you want to make better. This can be, for example: frame rate (go higher!), latency (go lower), or throughput (higher again!). You don’t have to make it time-based. If you want to measure how many I/Os you do instead, you can do that.
  • The unit of work. What does a single measurement look like? For frame rates, this would be the time for a single frame. For latency, a single time interval. For throughput, the time for a single item (or if batching, two numbers: number in batch and time taken for batch).

Now, here’s what you want to instrument:

  • The load on your program for this unit. The event you processed, some characteristic of how much data you processed in that iteration of your main loop, or the type/parameters of the item processed.
  • The effort expended on this unit. This is where you do most of your instrumentation. Things to record:
    • How many iterations of each loop you run
    • Which major branches (if conditions) you take
    • Key performance counters
      • Mostly we’re talking about cache misses
    • Synchronization overhead
      • How long you spent waiting for a mutex, for example.

Generally, you can start off with a rough breakdown of where you’re spending your effort, and drill down with more instrumentation once you see where the effort’s really going.
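As a rough illustration of recording load and effort for one unit of work, here is a sketch with a plain struct standing in for a generated frame. WorkFrame, its members, and process_batch are invented for this example; with ppt the struct would be a generated class and the function would end by calling save().

```cpp
#include <chrono>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical frame stand-in, not generated by ppt.
struct WorkFrame {
    int items = 0;             // load: how many items were in this unit of work
    int loop_iters = 0;        // effort: iterations of the main loop
    int took_slow_path = 0;    // effort: whether the expensive branch was hit
    int64_t lock_wait_ns = 0;  // effort: time spent waiting for the mutex
};

inline WorkFrame process_batch(const std::vector<int>& batch, std::mutex& m, long& shared_total) {
    using clock = std::chrono::steady_clock;
    WorkFrame frame;
    frame.items = static_cast<int>(batch.size());

    // Time only the wait for the lock, not the work done while holding it.
    const auto t0 = clock::now();
    std::lock_guard<std::mutex> guard(m);
    const auto t1 = clock::now();
    frame.lock_wait_ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();

    for (int v : batch) {
        ++frame.loop_iters;
        if (v < 0) {           // pretend negative inputs take a slow path
            frame.took_slow_path = 1;
            continue;
        }
        shared_total += v;
    }
    return frame;
}
```

Each record then carries the cause (items, branch taken) next to the effect (lock wait), which is what makes the later regression analysis possible.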

Setting up a benchmark

When optimizing a program, you can’t be sure that you’ve actually improved the performance without a benchmark for comparison. This doesn’t have to be hard; it can be set up like a unit test:

  • Take some input that’s characteristic of reality.
  • Run it through your code.
  • Collect results.

Run before/after each change you want to compare, and you can tell if you’re doing better. As to how many times you want to run it: generally run it a bunch of times to get clean data, then investigate why it varies.
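A benchmark harness in that spirit can be very small. The sketch below only measures wall time per run with std::chrono; run_benchmark is a hypothetical helper, and in practice the interesting numbers would come from ppt frames saved inside the code under test.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Runs `work` `runs` times and returns the per-run wall time in
// nanoseconds, sorted ascending so min, median and max are easy to read.
inline std::vector<int64_t> run_benchmark(const std::function<void()>& work, int runs) {
    using clock = std::chrono::steady_clock;
    std::vector<int64_t> samples;
    if (runs > 0) samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto t0 = clock::now();
        work();
        const auto t1 = clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples;
}
```

Keep the whole vector rather than just an average: the spread between the fastest and slowest runs is the variation you then go and investigate.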

Then repeatedly go back and change your instrumentation, and re-benchmark until you can predict the performance of slightly different code or input load. Now that you have a real mapping between your load, your implementation, and its performance, you can start to alter the code to perform more like how you want it.

Data Analysis

The conversion step (ppt convert in the walkthrough above) emits CSV files, one per frame type. You can use Jupyter, R, or Excel to great success with that data. Other output formats would be good too; I just haven’t had a use case that justified writing one. CSV keeps working, however much it kinda sucks.
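Having every sample means you can look at the tail of the distribution and not just the mean. If you would rather stay in C++ than reach for a notebook, a nearest-rank percentile over one column of values is a few lines. This is a generic sketch; percentile is not part of ppt, and loading the CSV column into the vector is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least p percent
// of the samples are <= it. Requires a non-empty input and 0 < p <= 100.
// Takes the vector by value because it sorts its own copy.
inline int64_t percentile(std::vector<int64_t> samples, double p) {
    assert(!samples.empty() && p > 0.0 && p <= 100.0);
    std::sort(samples.begin(), samples.end());
    const double n = static_cast<double>(samples.size());
    std::size_t rank = static_cast<std::size_t>(std::ceil(p / 100.0 * n));
    // Guard against floating-point rounding at the edges.
    rank = std::min(std::max<std::size_t>(rank, 1), samples.size());
    return samples[rank - 1];
}
```

With samples 10, 10, 10, 10, 1000 the 50th percentile is 10 and the 100th is 1000: exactly the kind of infrequent but expensive case that a single average hides.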

When monitoring a program

ppt doesn’t yet have the support this use case needs. For now, you can define additional buffers for monitoring; in the future, ppt should have a monitor mode that presents a live-decoded view of the data in that buffer.

The essential issue with this is that we need to present the monitor data in a way that scales up easily. This may just be ppt emitting JSON on stdout.

Limitations

  • ppt is only maintained for x86_64 Linux. Not very portable, I know. It used to be used mostly on Solaris, which doesn’t really count. If anyone asks why I still call it portable, I do almost all the development for it while on public transportation.
  • ppt attaches to a running process via ptrace(2), which means that you can’t be debugging the process at the same time. As ppt only uses ptrace(2) when attaching and detaching, you can start the process, have ppt attach to it, and then have gdb attach to it. It should work, but isn’t exactly convenient.
  • C++ code gen only. It’s what I use, so it’s what I developed this for. As long as the generated code follows the same format and protocol, and emits the same symbols into the executable, other languages should be straightforward.
    • Back when this was used on Solaris, the generator was C based. It wouldn’t be hard to bring it back if it were useful.
    • VM-based languages are probably not as straightforward.
    • python3 support is work-in-progress.
  • save() checks to see if a ppt process is attached or has just detached. In either case, it makes a few system calls to set up or tear down the internal buffering and performance counter measurement apparatus. See the generated code (it’s readable!) for details.
  • The data transfer between ppt and the instrumented process is lossy.
    • You can detect loss by watching sequence numbers in the frames.
    • You can increase the buffer size to reduce loss, but you incur more cache misses that way. On the other hand, larger buffers mean ppt doesn’t have to wake up as often to get data. If you have a lot of cache churn anyway, you may prefer to keep that CPU core idle more often.
      • Perhaps non-temporal stores could be used in the future to mitigate this.
    • This is the price to pay to prevent having the process-under-observation block when the reader falls behind.
  • Performance counters are limited to 3 at a time, and require a system call per use. There is experimental code to avoid the system call interface if the counters are architectural, and could be read with rdpmc directly.
    • By experimental, I mean that it does what I think it’s supposed to, but still seems to crash the process a lot. Suggestions, pull requests, etc., more than welcome.