
improve memory trace performance #2001

Open
derekbruening opened this issue Sep 7, 2016 · 8 comments

derekbruening commented Sep 7, 2016

This is something of a broad issue covering analyzing and improving the performance of the following:

  • drcachesim's traces, both for online and offline trace analysis
  • the memtrace and instrace samples

Xref #1929: memtrace slowness due to unbuffered printf
Xref #790: online trace compression

I wanted a place to dump some of my notes on this. The #790 notes are somewhat duplicated in that issue:

memtrace_binary sample perf: 70x (SSD) to 180x (HDD); 4x-25x (ave 18x) w/o disk; w/ no PC 36x (SSD)

mcf test:

% cd /extsw/spec2006/mcf-test
% /usr/bin/time ./mcf_base.gcc-64bit inp.in
1.63user 0.03system 0:01.67elapsed 99%CPU (0avgtext+0avgdata 159668maxresident)k
% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_binary.so -- ./mcf_base.gcc-64bit inp.in
18.41user 10.45system 5:05.96elapsed 9%CPU (0avgtext+0avgdata 164712maxresident)k

=>
clearly i/o bound: 9% CPU. produces a 41GB file with 1.3 billion memrefs.
slowdown: 183x

More like 70x on laptop, and higher %CPU:

% /usr/bin/time  ./mcf_base.clang-64bit inp.in
2.00user 0.09system 0:02.09elapsed 99%CPU (0avgtext+0avgdata 159648maxresident)k
% /usr/bin/time ~/dr/git/build_x64_rel/bin64/drrun -c ~/dr/git/build_x64_rel/api/bin/libmemtrace_x86_binary.so -- ./mcf_base.clang-64bit inp.in
15.18user 21.44system 2:23.12elapsed 25%CPU (0avgtext+0avgdata 165128maxresident)k

Because it's got an SSD? Or also b/c CPU is slower (so higher CPU-to-disk
ratio; also slower native)?

Disabling dr_write_file:

% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_binary.so -- ./mcf_base.gcc-64bit inp.in
9.28user 0.04system 0:09.41elapsed 99%CPU (0avgtext+0avgdata 164908maxresident)k

=> 5.6x
That's PC, read or write, size, and address.
It should be easy to improve by 2x by removing read/write and size
(statically recoverable) and only including PC once per bb or even less.

But it's much worse on other SPEC benchmarks. A ref run of everything was
taking too long; these are the numbers for the benchmarks at the point I killed the run, 9 hours in:

% spec2k6cmp result/CINT2006.045.ref.txt result/res-native-allc-int
400.perlbench         0.00 (     0 /    207)
401.bzip2            16.28 (  5029 /    309)
403.gcc              22.69 (  4447 /    196)
429.mcf               4.27 (   994 /    233)
445.gobmk            18.95 (  5761 /    304)
456.hmmer            25.61 (  7095 /    277)
458.sjeng            18.37 (  5788 /    315)
462.libquantum        0.00 (     0 /    299)
464.h264ref           0.00 (     0 /    327)
471.omnetpp           0.00 (     0 /    271)
473.astar             0.00 (     0 /    267)
483.xalancbmk         0.00 (     0 /    150)
average              17.70

Qin: "if memtrace is 100x, if you can make the profile 1/5 the size, can hit 20x"

Can shrink some fields, but not to 1/5. Online gzip compression should easily give 1/5.
Simple test: I see >20x gzip compression (though w/ naive starting format):

% /work/dr/git/exports/bin64/drrun -c /work/dr/git/build_x64_dbg/api/bin/libmemtrace_x86_binary.so -- ls
% ls -sh /work/dr/git/build_x64_dbg/api/bin/*.log
12M /work/dr/git/build_x64_dbg/api/bin/memtrace.ls.15577.0000.log
% gzip /work/dr/git/build_x64_dbg/api/bin/*.log
% ls -sh /work/dr/git/build_x64_dbg/api/bin/*.log*
528K /work/dr/git/build_x64_dbg/api/bin/memtrace.ls.15577.0000.log.gz

Removing the PC field:

% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_binary.so -- ./mcf_base.gcc-64bit inp.in
13.26user 11.07system 3:31.28elapsed 11%CPU (0avgtext+0avgdata 164684maxresident)k

=>
still i/o bound: 11% CPU. produces a 31GB file.
slowdown: 126x

On laptop:

% /usr/bin/time ~/dr/git/build_x64_rel/bin64/drrun -c ~/dr/git/build_x64_rel/api/bin/libmemtrace_x86_binary.so -- ./mcf_base.clang-64bit inp.in
13.14user 15.50system 1:15.94elapsed 37%CPU (0avgtext+0avgdata 164736maxresident)k

Up to 37%CPU, and 36x slowdown.

drcachesim tracer performance => 2x slower b/c of icache entries

Switching from mcf test to bzip2 test b/c it's a little closer to the 18x
average performance for the memtrace sample not writing to disk and so
is more representative:

native:

% /usr/bin/time ./bzip2_base.gcc-64bit dryer.jpg 2
1.95user 0.00system 0:01.96elapsed 100%CPU (0avgtext+0avgdata 17176maxresident)k

No disk writes at all:

% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_binary.so -- ./bzip2_base.gcc-64bit dryer.jpg 2
30.23user 0.01system 0:30.49elapsed 99%CPU (0avgtext+0avgdata 24300maxresident)k

That's 15.6x.

% /usr/bin/time /work/dr/git/build_x64_rel/bin64/drrun -c /work/dr/git/build_x64_rel/clients/lib64/release/libdrmemtrace.so -- ./bzip2_base.gcc-64bit dryer.jpg 2
60.32user 0.01system 1:00.34elapsed 99%CPU (0avgtext+0avgdata 25564maxresident)k

30.8x! 2x vs memtrace, b/c it's including icache info, presumably.

Currently trace_entry_t is 4+8 => 16 bytes b/c of alignment (we didn't pack
it; was that b/c we only care about 32-bit?).
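
For context, a rough sketch of the kind of packing measured below; the struct and field names here are illustrative, not the real trace_entry_t definition:

    #include <stdint.h>

    /* Illustrative layout only: a 2-byte type + 2-byte size followed by an
     * 8-byte address pads out to 16 bytes on 64-bit unless the struct is
     * packed, which drops it to 12. */
    #pragma pack(push, 1)
    typedef struct _packed_entry_t {
        uint16_t type; /* instr fetch, load, store, ... */
        uint16_t size; /* memref size or instr length */
        uint64_t addr; /* PC or effective address */
    } packed_entry_t;
    #pragma pack(pop)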

Packing trace_entry_t w/o any other changes to the struct:

55.70user 0.01system 0:55.72elapsed 99%CPU (0avgtext+0avgdata 25560maxresident)k

Also compressing size+type from 4 bytes into 2 bytes:
(Might need extra escape entry for memsz > 256)
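
As an illustration of that sort of 2-byte encoding (the field split and the escape value are assumptions, not the actual drcachesim format):

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative only: entry type in the high byte, memref/instr size in
     * the low byte.  A size too large for a byte would be flagged with a
     * hypothetical escape value and carried in a separate entry (not shown). */
    enum { HYPOTHETICAL_SIZE_ESCAPE = 0xff };

    static inline uint16_t
    encode_type_size(uint8_t type, size_t size)
    {
        if (size >= HYPOTHETICAL_SIZE_ESCAPE)
            return (uint16_t)((type << 8) | HYPOTHETICAL_SIZE_ESCAPE);
        return (uint16_t)((type << 8) | (uint8_t)size);
    }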

54.11user 0.00system 0:54.12elapsed 100%CPU (0avgtext+0avgdata 25264maxresident)k

Also shrinking pc/addr field to 4 bytes:

53.57user 0.01system 0:53.59elapsed 99%CPU (0avgtext+0avgdata 25060maxresident)k

Also removing INSTR_BUNDLE (always has preceding abs pc so redundant):

48.63user 0.01system 0:48.99elapsed 99%CPU (0avgtext+0avgdata 24952maxresident)k

10.7x = Also removing all instr entries (thus there's no PC at all):

20.88user 0.01system 0:21.05elapsed 99%CPU (0avgtext+0avgdata 24896maxresident)k

Having the instr bundles and all the instr boundary info coming from the
tracer seems worth it for online simulation, where having the simulator go
dig it up from disassembly of a giant binary is going to be slower than the
tracer providing it. But for offline, it does seem
like we want to really optimize the tracing -- thus we need a split tracer!

14.3x = Adding back one instr entry per bb (1st instr in bb):

27.94user 0.00system 0:27.95elapsed 100%CPU (0avgtext+0avgdata 24768maxresident)k

Significant cost for instr-entry-per-bb: 33% more expensive.
Maybe we can leverage traces to bring it down, having one instr entry per
trace + a bit per cbr + an extra entry per mbr crossed?!?

#790: try online compression with zlib

With the private loader, should be able to just use zlib library directly.

% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_gz.so -- ./mcf_base.gcc-64bit inp.in
484.16user 1.91system 8:13.96elapsed 98%CPU (0avgtext+0avgdata 165552maxresident)k

It produces a 4GB file (vs 41GB uncompressed binary) but it is much slower!
295x vs native, 1.6x vs uncompressed.
98% CPU, too.

Try zlib format instead of gz format, where we can set high speed => Z_BEST_SPEED is faster than uncompressed for HDD, but still not SSD.

To use Z_BEST_SPEED we have to use the deflate interface directly and the
zlib compression format. The gz interface uses the gzip compression format
and apparently has no interface to set the speed vs size.
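
A minimal sketch of that deflate path, assuming the sample already has an open file_t and a filled trace buffer (the per-buffer stream setup and missing error checks are simplifications; a real client would keep one z_stream per thread for the whole run):

    #include <stddef.h>
    #include <string.h>
    #include <zlib.h>
    #include "dr_api.h" /* for file_t and dr_write_file() */

    /* Compress one full trace buffer with the fastest zlib setting and
     * append the compressed bytes to the already-open output file. */
    static void
    write_compressed(file_t f, void *buf, size_t len)
    {
        z_stream zs;
        unsigned char out[64 * 1024];
        memset(&zs, 0, sizeof(zs));
        deflateInit(&zs, Z_BEST_SPEED); /* zlib/deflate format, speed over ratio */
        zs.next_in = (Bytef *)buf;
        zs.avail_in = (uInt)len;
        do {
            zs.next_out = out;
            zs.avail_out = sizeof(out);
            deflate(&zs, Z_FINISH);
            dr_write_file(f, out, sizeof(out) - zs.avail_out);
        } while (zs.avail_out == 0);
        deflateEnd(&zs);
    }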

% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_gz.so -- ./mcf_base.gcc-64bit inp.in
181.29user 1.91system 3:10.70elapsed 96%CPU (0avgtext+0avgdata 165220maxresident)k

It produces a 4.5GB file and is significantly faster than uncompressed,
but it's still 114x vs native.

On laptop it makes a 4.3GB file (should have saved to see if really
different) and:

229.51user 2.56system 3:53.45elapsed 99%CPU (0avgtext+0avgdata 167372maxresident)k

So even Z_BEST_SPEED is slower than uncompressed on an SSD!

@derekbruening (Contributor, Author):

Xref #1738: optimize cache simulator

For drcachesim, the tracer's trace_entry_t has padding for 64-bit. We can eliminate the padding if having 4-byte-aligned 8-byte accesses is not a bigger perf loss than the gain from shrinking the memory and pipe footprint. This is almost certainly true for x86, but we should measure for ARM and AArch64.


derekbruening commented Mar 30, 2017

Optimized trace format for offline traces with post-processing: 67x SSD, 42x SSD just data; 13.7x w/o disk; 10.5x just data

See full numbers in #1729 at
#1729 (comment) and #1729 (comment)

@derekbruening (Contributor, Author):

Xref #2299


zhaoqin commented May 1, 2017

One idea to reduce memtrace overhead is to use asynchronous writes so that profile collection and the trace dump can be performed in parallel. The basic implementation is to create a sideline thread pool and a producer-consumer queue: the application threads produce traces and put them into the queue, while the sideline threads consume them and write them to disk (a minimal sketch of the hand-off follows the list below).

There are several factors that may affect the performance, for example:

  1. the overall workload: sideline threads and passing traces via the queue add work compared to the synchronous write implementation. There would be little performance improvement if the original workload already exceeds what the system can handle. Thus, it is less likely to improve memtrace for multi-threaded applications.
  2. the lock contention on the queue: all application threads and sideline threads need exclusive access to the queue to produce and consume traces, which could become the performance bottleneck.
  3. the working set: when application threads and sideline threads generate and write out profiles, more memory buffers are in use, and the larger working set could cause more cache misses.
  4. the disk bandwidth: the sideline threads' performance depends on the disk bandwidth; having too many sideline threads may even hurt performance due to resource contention.
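
A minimal sketch of the producer-consumer hand-off described above, using pthreads; the queue layout, capacity, and all names are illustrative assumptions rather than the actual implementation:

    #include <pthread.h>
    #include <stddef.h>

    /* Sketch of the queue hand-off only; buffer recycling, thread exit, and
     * mutex/condvar initialization are elided. */
    #define QUEUE_CAPACITY 64

    typedef struct {
        void *bufs[QUEUE_CAPACITY]; /* full trace buffers awaiting write-out */
        size_t head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t not_empty, not_full;
    } trace_queue_t;

    /* Application thread: called when its trace buffer fills up. */
    static void
    enqueue_buffer(trace_queue_t *q, void *buf)
    {
        pthread_mutex_lock(&q->lock); /* factor 2 above: contention is here */
        while (q->count == QUEUE_CAPACITY)
            pthread_cond_wait(&q->not_full, &q->lock);
        q->bufs[q->tail] = buf;
        q->tail = (q->tail + 1) % QUEUE_CAPACITY;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    /* Sideline thread: pops full buffers and writes them to disk. */
    static void *
    writer_thread(void *arg)
    {
        trace_queue_t *q = (trace_queue_t *)arg;
        for (;;) {
            pthread_mutex_lock(&q->lock);
            while (q->count == 0)
                pthread_cond_wait(&q->not_empty, &q->lock);
            void *buf = q->bufs[q->head];
            q->head = (q->head + 1) % QUEUE_CAPACITY;
            q->count--;
            pthread_cond_signal(&q->not_full);
            pthread_mutex_unlock(&q->lock);
            /* write buf to disk and recycle it (not shown) */
        }
        return NULL;
    }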


zhaoqin commented May 1, 2017

I created a micro-benchmark for the experiment, which basically creates a few threads and performs the same task in each thread:

    for (i = 0; i < num_threads; i++) {
        pthread_create(&thread[i], NULL, thread_func, NULL);
    }
    for (i = 0; i < num_threads; i++) {
        pthread_join(thread[i], NULL);
    }

Two hardware platforms were tested:

  1. Desktop: Xeon(R) CPU E5-1650 0 @ 3.20GHz, 6 cores with hyper-threading, cache size 12288 KB,
    hard drive,
    /dev/sda:
    Timing cached reads: 23184 MB in 2.00 seconds = 11619.28 MB/sec
    Timing buffered disk reads: 520 MB in 3.01 seconds = 172.81 MB/sec

  2. Laptop: Core(TM) i7-4712HQ CPU @ 2.30GHz, 4 cores with hyper-threading, cache size 6144 KB,
    SSD,
    Timing cached reads: 20210 MB in 2.00 seconds = 10128.99 MB/sec
    Timing buffered disk reads: 802 MB in 3.00 seconds = 266.92 MB/sec

Experimental results on Desktop, execution time in seconds
(rows = application threads "a", columns = sideline threads "s"):

a\s   native   0 (w/o write)   0 (w/ write)   1        2        4        8        size
1     0.316    1.005           11.086         10.865   9.614    10.562   10.182   2.2GB
2     0.325    1.033           31.843         31.663   31.469   30.895   31.362   4.4GB
4     0.341    1.091           66.276         64.038   62.763   68.738   71.858   8.7GB

Experimental results on Laptop, execution time in seconds
(rows = application threads "a", columns = sideline threads "s"):

a\s   native   0 (w/o write)   0 (w/ write)   1        2        4        8        size
1     0.325    1.005           4.765          4.610    4.393    4.863    7.789    2.2GB
2     0.336    1.024           9.018          13.671   9.929    11.146   13.484   4.4GB
4     0.380    1.091           22.057         23.689   22.821   24.968   29.630   8.7GB
8     0.473    1.667           49.481         60.944   62.887   60.068   59.672   18GB

The data suggest that disk write bandwidth is the limiting factor in my experiment.
It takes 10 sec to write 2.2GB and 30 sec to write 4.4GB on my desktop, i.e., ~200MB/sec, and 4 sec to write 2.2GB and 9 sec to write 4.4GB on my laptop, i.e., ~500MB/sec.
A single-threaded profile write is already reaching the bandwidth limit, so more sideline threads do not really speed anything up, but they make the performance vary a lot.


zhaoqin commented May 3, 2017

Pull request #2319 contains the thread-pool implementation.

@derekbruening (Contributor, Author):

Here are some simple opt ideas from prior discussions:

** TODO opt: lean proc for buffer-full clean call
like memtrace_x86.c does

** TODO opt: re-measure fault buffer-full handling

** TODO opt: avoid lea and just store base reg

Instead of:

mov %ecx -> 0x44(%eax)
=>
spill %edx
lea 0x44(%eax) -> %edx
store %edx -> buffer
restore %edx

Just do:

store %eax -> buffer

And reconstruct +0x44 in post-proc.

Keep lea for index reg:
mov %ecx -> 0x44(%eax,%ebx)

** TODO opt: single entry for consecutive same-base memrefs

mov %ecx -> 0x44(%eax)
mov %edx -> 0x48(%eax)
mov %ebx -> 0x4c(%eax)
=>
store %eax -> buffer
<reconstruct 2nd and 3rd entries>
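
A rough sketch of the post-processing side of both of these ideas, assuming the post-processor has decoded the block and recovered the static displacements; all names here are hypothetical:

    #include <stdint.h>
    #include <stddef.h>

    /* One recorded base-register value plus the displacements recovered from
     * decoding the block yields the full set of effective addresses. */
    typedef struct {
        uint64_t base_val; /* value of %eax recorded at run time */
        int32_t disp[3];   /* e.g., 0x44, 0x48, 0x4c from static decode */
        size_t num_refs;
    } elided_group_t;

    static size_t
    expand_group(const elided_group_t *g, uint64_t *addrs_out)
    {
        for (size_t i = 0; i < g->num_refs; i++) {
            /* Sign-extend negative displacements before the add. */
            addrs_out[i] = g->base_val + (uint64_t)(int64_t)g->disp[i];
        }
        return g->num_refs;
    }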

derekbruening self-assigned this Sep 5, 2017
@toshipiazza (Contributor):

I might add that I'd be interested in seeing a re-evaluation of the faulting-buffer performance. IIRC, we opted not to land drx_buf in the clients because at least one drcachesim test timed out with the faulting buffer implementation.

At the time I didn't really do any benchmarks, and what tests I did run were on a crummy VM. The single test case I evaluated was also heavily multithreaded, and I'm wondering whether potential over-locking like in #2114 could have indirectly contributed to the problem.

I have an implementation of the trace samples that uses drx_buf here, if anyone's interested.

derekbruening added a commit that referenced this issue Jan 5, 2018
Adds drutil_insert_get_mem_addr_ex() which returns whether the passed-in
extra scratch register was used.  Adds a test.

Leverages the new function to avoid redundant loads in drcachesim.

Issue: #2001
derekbruening added a commit that referenced this issue Jan 5, 2018
Adds drutil_insert_get_mem_addr_ex() which returns whether the passed-in
extra scratch register was used.  Adds a test.

Leverages the new function to avoid redundant loads in drcachesim.

Issue: #2001
derekbruening added a commit that referenced this issue Jan 5, 2018
For offline traces for a "disp(base)" memref, only stores the base and adds
the disp in raw2trace post-processing, as it's statically known.  The base
can be directly written as it's already in a register, reducing scratch
register pressure.

Moves the second scratch register reservation into the instru_t routines so
we can skip it for this optimization of just writing the base reg for
"disp(base)" memrefs.

Issue: #2001
derekbruening added a commit that referenced this issue Jan 6, 2018
For offline traces for a "disp(base)" memref, only stores the base and adds
the disp in raw2trace post-processing, as it's statically known.  The base
can be directly written as it's already in a register, reducing scratch
register pressure.  This is only done on x86 for now for simplicity.

Moves the second scratch register reservation into the instru_t routines so
we can skip it for this optimization of just writing the base reg for
"disp(base)" memrefs.

Issue: #2001
derekbruening added a commit that referenced this issue Feb 13, 2018
Fixes a bug in the new drutil_insert_get_mem_addr_ex() feature: it wasn't
initializing the output parameter on all paths.

Issue: #2001
derekbruening added a commit that referenced this issue Feb 14, 2018
Fixes a bug in the new drutil_insert_get_mem_addr_ex() feature: it wasn't
initializing the output parameter on all paths.

Issue: #2001
derekbruening pushed a commit that referenced this issue Apr 18, 2018
In the drcachesim tracer drmemtrace, we avoid looping over the trace entries prior to writing them out if there is no processing required (for offline virtually-addressed traces).

Issue: #2001
derekbruening added a commit that referenced this issue Dec 9, 2019
Reduces drcachesim offline tracing overhead by eliding rip-relative
and same-unmodified-base addresses from the recorded trace,
reconstructing their values during post-processing.  In measurements
this is pretty significant, removing 12%-17% of entries that need to
be written out during tracing.

Adds identification of elidable memory operands to instru_offline_t,
exported both to the runtime tracer and the raw2trace post-processor.

Changes raw2trace's instruction cache key to a pair <tag,pc> to handle
tail-duplicated blocks.  Adds elision identification through a
constructed instrlist when first encountering a block tag.  Adds a new
struct memref_summary_t to store elision information with each cached
memory operand.

Increases the offline file version and adds versioning support for
backward compatibility with no-elision trace, as well as to make it
easier to keep compatibility when more elision cases are added in the
future.

Adds a file type to the offline file header to identify filtered
traces as a sanity check and to avoid extra work when there are no
elided addresses at all.

Adds a statistics interface to retrieve raw2trace metrics.  The
initial metric is the number of elided addresses.

Issue: #2001
derekbruening added a commit that referenced this issue Dec 15, 2019
Reduces drcachesim offline tracing overhead by eliding rip-relative
and same-unmodified-base addresses from the recorded trace,
reconstructing their values during post-processing.  In measurements
this is pretty significant, removing 12%-17% of entries that need to
be written out during tracing.

Adds identification of elidable memory operands to instru_offline_t,
exported both to the runtime tracer and the raw2trace post-processor.

Changes raw2trace's instruction cache key to a pair <tag,pc> to handle
tail-duplicated blocks.  Adds elision identification through a
constructed instrlist when first encountering a block tag.  Adds a new
struct memref_summary_t to store elision information with each cached
memory operand.

Increases the offline file version and adds versioning support for
backward compatibility with no-elision traces, as well as to make it
easier to keep compatibility when more elision cases are added in the
future.

Adds a file type to the offline file header to identify filtered
traces as a sanity check and to avoid extra work when there are
no elided addresses at all.  Another file type flag identifies
whether any optimizations (this and the existing displacement
elision) are present, making it possible to disable them for
testing purposes.  Adds a -disable_optimizations flag for this.

Adds a new test burst_traceopts which runs assembly code
sequences covering corner cases twice, once with and once without
optimizations.  It then postprocesses each, and compares the
final trace entries using the external analyzer iterator
interface.  This found bugs during development and provides
confidence that these optimizations are safe.

Improves the pre-existing displacement elision optimization by sharing
code between the tracer and raw2trace via
offline_instru_t::opnd_disp_is_elidable() and by adding test cases to
the new test.  Also implements displacement elision for ARM and
AArch64, which is required for proper address elision without also
recording displacements.

The new test includes AArch64 and ARM assembly code.  The AArch64
was tested by temporarily enabling these static-DR
tests (unfortunately i#2007 prevents us from enabling them on
Travis for now).  The ARM assembly builds but is not testable due
to missing start/stop features on ARM.

Adds a statistics interface to retrieve raw2trace metrics.  The
initial metric is the number of elided addresses.

Includes a part of PR #3120 (the memset in d_r_config_exit())
plus a '#' option prefix to work around #2661.

Fixes a bug revealed by the tighter post-processing constraints
with elision: Do not count an artificial jump from a trampoline
return as an instruction in the recorded block tag entry.
Counting it resulted in a duplicated instruction during
post-processing.

Issue: #2001, #2661