
improve memory trace performance #2001

Open
derekbruening opened this issue Sep 7, 2016 · 8 comments

derekbruening commented Sep 7, 2016

This is something of a broad issue covering analyzing and improving the performance of the following:

  • drcachesim's traces, both for online and offline trace analysis
  • the memtrace and instrace samples

Xref #1929: memtrace slowness due to unbuffered printf
Xref #790: online trace compression

I wanted a place to dump some of my notes on this. The #790 notes are somewhat duplicated in that issue:

memtrace_binary sample perf: 70x (SSD) to 180x (HDD); 4x-25x (ave 18x) w/o disk; w/ no PC 36x (SSD)

mcf test:

% cd /extsw/spec2006/mcf-test
% /usr/bin/time ./mcf_base.gcc-64bit inp.in
1.63user 0.03system 0:01.67elapsed 99%CPU (0avgtext+0avgdata 159668maxresident)k
% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_binary.so -- ./mcf_base.gcc-64bit inp.in
18.41user 10.45system 5:05.96elapsed 9%CPU (0avgtext+0avgdata 164712maxresident)k

=>
clearly i/o bound: 9% CPU. produces a 41GB file with 1.3 billion memrefs.
slowdown: 183x

More like 70x on laptop, and higher %CPU:

% /usr/bin/time  ./mcf_base.clang-64bit inp.in
2.00user 0.09system 0:02.09elapsed 99%CPU (0avgtext+0avgdata 159648maxresident)k
% /usr/bin/time ~/dr/git/build_x64_rel/bin64/drrun -c ~/dr/git/build_x64_rel/api/bin/libmemtrace_x86_binary.so -- ./mcf_base.clang-64bit inp.in
15.18user 21.44system 2:23.12elapsed 25%CPU (0avgtext+0avgdata 165128maxresident)k

Because it's got an SSD? Or also b/c CPU is slower (so higher CPU-to-disk
ratio; also slower native)?

Disabling dr_write_file:

% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_binary.so -- ./mcf_base.gcc-64bit inp.in
9.28user 0.04system 0:09.41elapsed 99%CPU (0avgtext+0avgdata 164908maxresident)k

=> 5.6x
That's PC, read or write, size, and address.
It should be easy to improve by 2x by removing read/write and size
(statically recoverable) and only including PC once per bb or even less.

But it's much worse on other SPEC benchmarks. A ref run of everything was
taking too long; these are the numbers for the benchmarks at the point I killed the run, 9 hours in:

% spec2k6cmp result/CINT2006.045.ref.txt result/res-native-allc-int
400.perlbench         0.00 (     0 /    207)
401.bzip2            16.28 (  5029 /    309)
403.gcc              22.69 (  4447 /    196)
429.mcf               4.27 (   994 /    233)
445.gobmk            18.95 (  5761 /    304)
456.hmmer            25.61 (  7095 /    277)
458.sjeng            18.37 (  5788 /    315)
462.libquantum        0.00 (     0 /    299)
464.h264ref           0.00 (     0 /    327)
471.omnetpp           0.00 (     0 /    271)
473.astar             0.00 (     0 /    267)
483.xalancbmk         0.00 (     0 /    150)
average              17.70

Qin: "if memtrace is 100x, if you can make the profile 1/5 the size, can hit 20x"

Can shrink some fields, but not to 1/5. Online gzip compression should easily give 1/5.
Simple test: I see >20x gzip compression (though w/ naive starting format):

% /work/dr/git/exports/bin64/drrun -c /work/dr/git/build_x64_dbg/api/bin/libmemtrace_x86_binary.so -- ls
% ls -sh /work/dr/git/build_x64_dbg/api/bin/*.log
12M /work/dr/git/build_x64_dbg/api/bin/memtrace.ls.15577.0000.log
% gzip /work/dr/git/build_x64_dbg/api/bin/*.log
% ls -sh /work/dr/git/build_x64_dbg/api/bin/*.log*
528K /work/dr/git/build_x64_dbg/api/bin/memtrace.ls.15577.0000.log.gz

Removing the PC field:

% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_binary.so -- ./mcf_base.gcc-64bit inp.in
13.26user 11.07system 3:31.28elapsed 11%CPU (0avgtext+0avgdata 164684maxresident)k

=>
still i/o bound: 11% CPU. produces a 31GB file.
slowdown: 126x

On laptop:

% /usr/bin/time ~/dr/git/build_x64_rel/bin64/drrun -c ~/dr/git/build_x64_rel/api/bin/libmemtrace_x86_binary.so -- ./mcf_base.clang-64bit inp.in
13.14user 15.50system 1:15.94elapsed 37%CPU (0avgtext+0avgdata 164736maxresident)k

Up to 37%CPU, and 36x slowdown.

drcachesim tracer performance => 2x slower b/c of icache entries

Switching from mcf test to bzip2 test b/c it's a little closer to the 18x
average performance for the memtrace sample not writing to disk and so
is more representative:

native:

% /usr/bin/time ./bzip2_base.gcc-64bit dryer.jpg 2
1.95user 0.00system 0:01.96elapsed 100%CPU (0avgtext+0avgdata 17176maxresident)k

No disk writes at all:

% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_binary.so -- ./bzip2_base.gcc-64bit dryer.jpg 2
30.23user 0.01system 0:30.49elapsed 99%CPU (0avgtext+0avgdata 24300maxresident)k

That's 15.6x.

% /usr/bin/time /work/dr/git/build_x64_rel/bin64/drrun -c /work/dr/git/build_x64_rel/clients/lib64/release/libdrmemtrace.so -- ./bzip2_base.gcc-64bit dryer.jpg 2
60.32user 0.01system 1:00.34elapsed 99%CPU (0avgtext+0avgdata 25564maxresident)k

30.8x! 2x vs memtrace, b/c it's including icache info, presumably.

Currently trace_entry_t is 4+8 => 16 bytes b/c of alignment (we didn't pack
it; was that b/c we only care about 32-bit?).
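
For context, a rough sketch of the kind of packing measured below; the struct and field names here are illustrative, not the real trace_entry_t definition:

    #include <stdint.h>

    /* Illustrative layout only: a 2-byte type + 2-byte size followed by an
     * 8-byte address pads out to 16 bytes on 64-bit unless the struct is
     * packed, which drops it to 12. */
    #pragma pack(push, 1)
    typedef struct _packed_entry_t {
        uint16_t type; /* instr fetch, load, store, ... */
        uint16_t size; /* memref size or instr length */
        uint64_t addr; /* PC or effective address */
    } packed_entry_t;
    #pragma pack(pop)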

Packing trace_entry_t w/o any other changes to the struct:

55.70user 0.01system 0:55.72elapsed 99%CPU (0avgtext+0avgdata 25560maxresident)k

Also compressing size+type from 4 bytes into 2 bytes:
(Might need extra escape entry for memsz > 256)
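
As an illustration of that sort of 2-byte encoding (the field split and the escape value are assumptions, not the actual drcachesim format):

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative only: entry type in the high byte, memref/instr size in
     * the low byte.  A size too large for a byte would be flagged with a
     * hypothetical escape value and carried in a separate entry (not shown). */
    enum { HYPOTHETICAL_SIZE_ESCAPE = 0xff };

    static inline uint16_t
    encode_type_size(uint8_t type, size_t size)
    {
        if (size >= HYPOTHETICAL_SIZE_ESCAPE)
            return (uint16_t)((type << 8) | HYPOTHETICAL_SIZE_ESCAPE);
        return (uint16_t)((type << 8) | (uint8_t)size);
    }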

54.11user 0.00system 0:54.12elapsed 100%CPU (0avgtext+0avgdata 25264maxresident)k

Also shrinking pc/addr field to 4 bytes:

53.57user 0.01system 0:53.59elapsed 99%CPU (0avgtext+0avgdata 25060maxresident)k

Also removing INSTR_BUNDLE (always has preceding abs pc so redundant):

48.63user 0.01system 0:48.99elapsed 99%CPU (0avgtext+0avgdata 24952maxresident)k

10.7x = Also removing all instr entries (thus there's no PC at all):

20.88user 0.01system 0:21.05elapsed 99%CPU (0avgtext+0avgdata 24896maxresident)k

Having the instr bundles and all the instr boundary info coming from the
tracer seems worth it for online simulation, where having the simulator go
dig it up from disassembly of a giant binary is going to be slower than the
tracer providing it. But for offline, it does seem
like we want to really optimize the tracing -- thus we need a split tracer!

14.3x = Adding back one instr entry per bb (1st instr in bb):

27.94user 0.00system 0:27.95elapsed 100%CPU (0avgtext+0avgdata 24768maxresident)k

Significant cost for instr-entry-per-bb: 33% more expensive.
Maybe we can leverage traces to bring it down, having one instr entry per
trace + a bit per cbr + an extra entry per mbr crossed?!?

#790: try online compression with zlib

With the private loader, should be able to just use zlib library directly.

% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_gz.so -- ./mcf_base.gcc-64bit inp.in
484.16user 1.91system 8:13.96elapsed 98%CPU (0avgtext+0avgdata 165552maxresident)k

It produces a 4GB file (vs 41GB uncompressed binary) but it is much slower!
295x vs native, 1.6x vs uncompressed.
98% CPU, too.

Try zlib format instead of gz format, where we can set high speed => Z_BEST_SPEED is faster than uncompressed for HDD, but still not SSD.

To use Z_BEST_SPEED we have to use the deflate interface directly and the
zlib compression format. The gz interface uses the gzip compression format
and apparently has no interface to set the speed vs size.
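
A minimal sketch of that deflate path, assuming the sample already has an open file_t and a filled trace buffer (the per-buffer stream setup and missing error checks are simplifications; a real client would keep one z_stream per thread for the whole run):

    #include <stddef.h>
    #include <string.h>
    #include <zlib.h>
    #include "dr_api.h" /* for file_t and dr_write_file() */

    /* Compress one full trace buffer with the fastest zlib setting and
     * append the compressed bytes to the already-open output file. */
    static void
    write_compressed(file_t f, void *buf, size_t len)
    {
        z_stream zs;
        unsigned char out[64 * 1024];
        memset(&zs, 0, sizeof(zs));
        deflateInit(&zs, Z_BEST_SPEED); /* zlib/deflate format, speed over ratio */
        zs.next_in = (Bytef *)buf;
        zs.avail_in = (uInt)len;
        do {
            zs.next_out = out;
            zs.avail_out = sizeof(out);
            deflate(&zs, Z_FINISH);
            dr_write_file(f, out, sizeof(out) - zs.avail_out);
        } while (zs.avail_out == 0);
        deflateEnd(&zs);
    }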

% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_gz.so -- ./mcf_base.gcc-64bit inp.in
181.29user 1.91system 3:10.70elapsed 96%CPU (0avgtext+0avgdata 165220maxresident)k

It produces a 4.5GB file and is significantly faster than uncompressed,
but it's still 114x vs native.

On laptop it makes a 4.3GB file (should have saved to see if really
different) and:

229.51user 2.56system 3:53.45elapsed 99%CPU (0avgtext+0avgdata 167372maxresident)k

So even Z_BEST_SPEED is slower than uncompressed on an SSD!

@derekbruening (Contributor, Author):

Xref #1738: optimize cache simulator

For drcachesim, the tracer's trace_entry_t has padding for 64-bit. We can eliminate the padding if having 4-byte-aligned 8-byte accesses is not a bigger perf loss than the gain from shrinking the memory and pipe footprint. This is almost certainly true for x86, but we should measure for ARM and AArch64.


derekbruening commented Mar 30, 2017

Optimized trace format for offline traces with post-processing: 67x SSD, 42x SSD just data; 13.7x w/o disk; 10.5x just data

See full numbers in #1729 at
#1729 (comment) and #1729 (comment)

@derekbruening (Contributor, Author):

Xref #2299


zhaoqin commented May 1, 2017

One idea to reduce memtrace overhead is to use asynchronous writes so that profile collection and the trace dump can be performed in parallel. The basic implementation is to create a sideline thread pool and a producer-consumer queue: the application threads produce traces and put them into the queue, while the sideline threads consume them and write them to disk (a minimal sketch of the hand-off follows the list below).

There are several factors that may affect the performance, for example:

  1. the overall workload: sideline threads and passing traces via the queue add work compared to the synchronous write implementation. There would be little performance improvement if the original workload already exceeds what the system can handle. Thus, it is less likely to improve memtrace for multi-threaded applications.
  2. the lock contention on the queue: all application threads and sideline threads need exclusive access to the queue to produce and consume traces, which could become the performance bottleneck.
  3. the working set: when application threads and sideline threads generate and write out profiles, more memory buffers are in use, and the larger working set could cause more cache misses.
  4. the disk bandwidth: the sideline threads' performance depends on the disk bandwidth; having too many sideline threads may even hurt performance due to resource contention.
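
A minimal sketch of the producer-consumer hand-off described above, using pthreads; the queue layout, capacity, and all names are illustrative assumptions rather than the actual implementation:

    #include <pthread.h>
    #include <stddef.h>

    /* Sketch of the queue hand-off only; buffer recycling, thread exit, and
     * mutex/condvar initialization are elided. */
    #define QUEUE_CAPACITY 64

    typedef struct {
        void *bufs[QUEUE_CAPACITY]; /* full trace buffers awaiting write-out */
        size_t head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t not_empty, not_full;
    } trace_queue_t;

    /* Application thread: called when its trace buffer fills up. */
    static void
    enqueue_buffer(trace_queue_t *q, void *buf)
    {
        pthread_mutex_lock(&q->lock); /* factor 2 above: contention is here */
        while (q->count == QUEUE_CAPACITY)
            pthread_cond_wait(&q->not_full, &q->lock);
        q->bufs[q->tail] = buf;
        q->tail = (q->tail + 1) % QUEUE_CAPACITY;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    /* Sideline thread: pops full buffers and writes them to disk. */
    static void *
    writer_thread(void *arg)
    {
        trace_queue_t *q = (trace_queue_t *)arg;
        for (;;) {
            pthread_mutex_lock(&q->lock);
            while (q->count == 0)
                pthread_cond_wait(&q->not_empty, &q->lock);
            void *buf = q->bufs[q->head];
            q->head = (q->head + 1) % QUEUE_CAPACITY;
            q->count--;
            pthread_cond_signal(&q->not_full);
            pthread_mutex_unlock(&q->lock);
            /* write buf to disk and recycle it (not shown) */
        }
        return NULL;
    }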


zhaoqin commented May 1, 2017

I created a micro-benchmark for the experiment, which basically creates a few threads and performs the same task in each thread:

    for (i = 0; i < num_threads; i++) {
        pthread_create(&thread[i], NULL, thread_func, NULL);
    }
    for (i = 0; i < num_threads; i++) {
        pthread_join(thread[i], NULL);
    }

Two hardware platforms were tested:

  1. Desktop: Xeon(R) CPU E5-1650 0 @ 3.20GHz, 6 cores with hyper-threading, cache size 12288 KB,
    hard drive,
    /dev/sda:
    Timing cached reads: 23184 MB in 2.00 seconds = 11619.28 MB/sec
    Timing buffered disk reads: 520 MB in 3.01 seconds = 172.81 MB/sec

  2. Laptop: Core(TM) i7-4712HQ CPU @ 2.30GHz, 4 cores with hyper-threading, cache size 6144 KB,
    SSD,
    Timing cached reads: 20210 MB in 2.00 seconds = 10128.99 MB/sec
    Timing buffered disk reads: 802 MB in 3.00 seconds = 266.92 MB/sec

Experimental results on Desktop, execution time in seconds
(rows = application threads "a", columns = sideline threads "s"):

a\s   native   0 (w/o write)   0 (w/ write)   1        2        4        8        size
1     0.316    1.005           11.086         10.865   9.614    10.562   10.182   2.2GB
2     0.325    1.033           31.843         31.663   31.469   30.895   31.362   4.4GB
4     0.341    1.091           66.276         64.038   62.763   68.738   71.858   8.7GB

Experimental results on Laptop, execution time in seconds
(rows = application threads "a", columns = sideline threads "s"):

a\s   native   0 (w/o write)   0 (w/ write)   1        2        4        8        size
1     0.325    1.005           4.765          4.610    4.393    4.863    7.789    2.2GB
2     0.336    1.024           9.018          13.671   9.929    11.146   13.484   4.4GB
4     0.380    1.091           22.057         23.689   22.821   24.968   29.630   8.7GB
8     0.473    1.667           49.481         60.944   62.887   60.068   59.672   18GB

The data suggest that disk write bandwidth is the limiting factor in my experiment.
It takes 10 sec to write 2.2GB and 30 sec to write 4.4GB on my desktop, i.e., ~200MB/sec, and 4 sec to write 2.2GB and 9 sec to write 4.4GB on my laptop, i.e., ~500MB/sec.
A single-threaded profile write is already reaching the bandwidth limit, so more sideline threads do not really speed anything up, but they make the performance vary a lot.


zhaoqin commented May 3, 2017

Pull request #2319 contains the thread-pool implementation.

@derekbruening (Contributor, Author):

Here are some simple opt ideas from prior discussions:

** TODO opt: lean proc for buffer-full clean call
like memtrace_x86.c does

** TODO opt: re-measure fault buffer-full handling

** TODO opt: avoid lea and just store base reg

Instead of:

mov %ecx -> 0x44(%eax)
=>
spill %edx
lea 0x44(%eax) -> %edx
store %edx -> buffer
restore %edx

Just do:

store %eax -> buffer

And reconstruct +0x44 in post-proc.

Keep lea for index reg:
mov %ecx -> 0x44(%eax,%ebx)

** TODO opt: single entry for consecutive same-base memrefs

mov %ecx -> 0x44(%eax)
mov %edx -> 0x48(%eax)
mov %ebx -> 0x4c(%eax)
=>
store %eax -> buffer
<reconstruct 2nd and 3rd entries>
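
A rough sketch of the post-processing side of both of these ideas, assuming the post-processor has decoded the block and recovered the static displacements; all names here are hypothetical:

    #include <stdint.h>
    #include <stddef.h>

    /* One recorded base-register value plus the displacements recovered from
     * decoding the block yields the full set of effective addresses. */
    typedef struct {
        uint64_t base_val; /* value of %eax recorded at run time */
        int32_t disp[3];   /* e.g., 0x44, 0x48, 0x4c from static decode */
        size_t num_refs;
    } elided_group_t;

    static size_t
    expand_group(const elided_group_t *g, uint64_t *addrs_out)
    {
        for (size_t i = 0; i < g->num_refs; i++) {
            /* Sign-extend negative displacements before the add. */
            addrs_out[i] = g->base_val + (uint64_t)(int64_t)g->disp[i];
        }
        return g->num_refs;
    }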

derekbruening self-assigned this Sep 5, 2017
@toshipiazza (Contributor):

I might add that I'd be interested in seeing a re-evaluation of the faulting-buffer performance. IIRC, we opted not to land drx_buf in the clients because at least one drcachesim test timed out with the faulting buffer implementation.

At the time I didn't really do any benchmarks, and what tests I did run were on a crummy VM. The single test case I evaluated was also heavily multithreaded, and I'm wondering whether potential over-locking like in #2114 could have indirectly contributed to the problem.

I have an implementation of the trace samples that uses drx_buf here, if anyone's interested.

derekbruening added a commit that referenced this issue Jan 5, 2018
Adds drutil_insert_get_mem_addr_ex() which returns whether the passed-in
extra scratch register was used.  Adds a test.

Leverages the new function to avoid redundant loads in drcachesim.

Issue: #2001
derekbruening added a commit that referenced this issue Jan 5, 2018
Adds drutil_insert_get_mem_addr_ex() which returns whether the passed-in
extra scratch register was used.  Adds a test.

Leverages the new function to avoid redundant loads in drcachesim.

Issue: #2001
derekbruening added a commit that referenced this issue Jan 5, 2018
For offline traces for a "disp(base)" memref, only stores the base and adds
the disp in raw2trace post-processing, as it's statically known.  The base
can be directly written as it's already in a register, reducing scratch
register pressure.

Moves the second scratch register reservation into the instru_t routines so
we can skip it for this optimization of just writing the base reg for
"disp(base)" memrefs.

Issue: #2001
derekbruening added a commit that referenced this issue Jan 6, 2018
For offline traces for a "disp(base)" memref, only stores the base and adds
the disp in raw2trace post-processing, as it's statically known.  The base
can be directly written as it's already in a register, reducing scratch
register pressure.  This is only done on x86 for now for simplicity.

Moves the second scratch register reservation into the instru_t routines so
we can skip it for this optimization of just writing the base reg for
"disp(base)" memrefs.

Issue: #2001
derekbruening added a commit that referenced this issue Feb 13, 2018
Fixes a bug in the new drutil_insert_get_mem_addr_ex() feature: it wasn't
initializing the output parameter on all paths.

Issue: #2001
derekbruening added a commit that referenced this issue Feb 14, 2018
Fixes a bug in the new drutil_insert_get_mem_addr_ex() feature: it wasn't
initializing the output parameter on all paths.

Issue: #2001
derekbruening pushed a commit that referenced this issue Apr 18, 2018
In the drcachesim tracer drmemtrace, we avoid looping over the trace entries prior to writing them out if there is no processing required (for offline virtually-addressed traces).

Issue: #2001
derekbruening added a commit that referenced this issue Dec 9, 2019
Reduces drcachesim offline tracing overhead by eliding rip-relative
and same-unmodified-base addresses from the recorded trace,
reconstructing their values during post-processing.  In measurements
this is pretty significant, removing 12%-17% of entries that need to
be written out during tracing.

Adds identification of elidable memory operands to instru_offline_t,
exported both to the runtime tracer and the raw2trace post-processor.

Changes raw2trace's instruction cache key to a pair <tag,pc> to handle
tail-duplicated blocks.  Adds elision identification through a
constructed instrlist when first encountering a block tag.  Adds a new
struct memref_summary_t to store elision information with each cached
memory operand.

Increases the offline file version and adds versioning support for
backward compatibility with no-elision trace, as well as to make it
easier to keep compatibility when more elision cases are added in the
future.

Adds a file type to the offline file header to identify filtered
traces as a sanity check and to avoid extra work when there are no
elided addresses at all.

Adds a statistics interface to retrieve raw2trace metrics.  The
initial metric is the number of elided addresses.

Issue: #2001
derekbruening added a commit that referenced this issue Dec 15, 2019
Reduces drcachesim offline tracing overhead by eliding rip-relative
and same-unmodified-base addresses from the recorded trace,
reconstructing their values during post-processing.  In measurements
this is pretty significant, removing 12%-17% of entries that need to
be written out during tracing.

Adds identification of elidable memory operands to instru_offline_t,
exported both to the runtime tracer and the raw2trace post-processor.

Changes raw2trace's instruction cache key to a pair <tag,pc> to handle
tail-duplicated blocks.  Adds elision identification through a
constructed instrlist when first encountering a block tag.  Adds a new
struct memref_summary_t to store elision information with each cached
memory operand.

Increases the offline file version and adds versioning support for
backward compatibility with no-elision traces, as well as to make it
easier to keep compatibility when more elision cases are added in the
future.

Adds a file type to the offline file header to identify filtered
traces as a sanity check and to avoid extra work when there are
no elided addresses at all.  Another file type flag identifies
whether any optimizations (this and the existing displacement
elision) are present, making it possible to disable them for
testing purposes.  Adds a -disable_optimizations flag for this.

Adds a new test burst_traceopts which runs assembly code
sequences covering corner cases twice, once with and once without
optimizations.  It then postprocesses each, and compares the
final trace entries using the external analyzer iterator
interface.  This found bugs during development and provides
confidence that these optimizations are safe.

Improves the pre-existing displacement elision optimization by sharing
code between the tracer and raw2trace via
offline_instru_t::opnd_disp_is_elidable() and by adding test cases to
the new test.  Also implements displacement elision for ARM and
AArch64, which is required for proper address elision without also
recording displacements.

The new test includes AArch64 and ARM assembly code.  The AArch64
was tested by temporarily enabling these static-DR
tests (unfortunately i#2007 prevents us from enabling them on
Travis for now).  The ARM assembly builds but is not testable due
to missing start/stop features on ARM.

Adds a statistics interface to retrieve raw2trace metrics.  The
initial metric is the number of elided addresses.

Includes a part of PR #3120 (the memset in d_r_config_exit())
plus a '#' option prefix to work around #2661.

Fixes a bug revealed by the tighter post-processing constraints
with elision: Do not count an artificial jump from a trampoline
return as an instruction in the recorded block tag entry.
Counting it resulted in a duplicated instruction during
post-processing.

Issue: #2001, #2661