improve memory trace performance #2001
Xref #1738: optimize cache simulator. For drcachesim, the tracer's trace_entry_t has padding for 64-bit. We can eliminate the padding if having 4-byte-aligned 8-byte accesses is not a bigger perf loss than the gain from shrinking the memory and pipe footprint. This is almost certainly true for x86, but we should measure on ARM and AArch64.
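A hypothetical stand-alone micro-benchmark for that measurement (not from the issue; buffer size and iteration count are arbitrary):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

static volatile uint64_t g_sink; // Prevents the stores from being optimized away.

// Time 8-byte stores at a fixed alignment relative to 'base'.
static double
time_stores(char *base, size_t iters)
{
    auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < iters; ++i) {
        uint64_t val = i;
        // Stride in 8-byte units so every store keeps base's alignment.
        // memcpy avoids undefined behavior for the misaligned case.
        std::memcpy(base + (i % 512) * 8, &val, sizeof(val));
    }
    auto end = std::chrono::steady_clock::now();
    uint64_t check;
    std::memcpy(&check, base, sizeof(check));
    g_sink = check;
    return std::chrono::duration<double>(end - start).count();
}

int
main()
{
    std::vector<char> buf(8192);     // Heap allocation: at least 8-byte-aligned.
    size_t iters = 100000000;
    double aligned = time_stores(buf.data(), iters);        // 8-byte-aligned.
    double misaligned = time_stores(buf.data() + 4, iters); // Only 4-byte-aligned.
    std::printf("8-aligned %.3fs vs 4-aligned %.3fs\n", aligned, misaligned);
    return 0;
}
```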
Optimized trace format for offline traces with post-processing: 67x SSD, 42x SSD just data; 13.7x w/o disk; 10.5x just data. See full numbers in #1729.
Xref #2299
One idea to reduce memtrace overhead is to use asynchronous writes so that profile collection and trace dumping can be performed in parallel. The basic implementation is to create a sideline thread pool and a producer-consumer queue: the application threads produce the traces and put them into the queue, while the sideline threads consume them and write them to disk. Several factors may affect performance.
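A minimal sketch of that producer-consumer shape (names and buffer handling are hypothetical, not taken from the actual PR #2319 implementation):

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

// A filled trace buffer handed off from an application thread.
struct trace_chunk_t {
    std::vector<char> data;
};

class trace_write_queue_t {
public:
    // Producer side: called by application threads when their buffer fills.
    void push(trace_chunk_t chunk) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(std::move(chunk));
        cond_.notify_one();
    }
    // Consumer side: sideline writer threads block until a chunk is ready.
    bool pop(trace_chunk_t &out) {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return !queue_.empty() || done_; });
        if (queue_.empty())
            return false; // Shutting down.
        out = std::move(queue_.front());
        queue_.pop();
        return true;
    }
    void shutdown() {
        std::lock_guard<std::mutex> lock(mutex_);
        done_ = true;
        cond_.notify_all();
    }
private:
    std::mutex mutex_;
    std::condition_variable cond_;
    std::queue<trace_chunk_t> queue_;
    bool done_ = false;
};

// Each sideline thread drains the queue and performs the (slow) disk write,
// overlapping I/O with the application threads' trace production.
void writer_thread(trace_write_queue_t *queue, std::FILE *out) {
    trace_chunk_t chunk;
    while (queue->pop(chunk))
        std::fwrite(chunk.data.data(), 1, chunk.data.size(), out);
}
```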
I created a micro-benchmark for the experiment, which creates a few threads and performs the same task in each thread.
Two hardware platforms were tested:
2. Laptop: Core(TM) i7-4712HQ CPU @ 2.30GHz, 4 cores with hyper-threading, cache size 6144 KB
Experimental results on Desktop:
Experimental results on Laptop:
The data suggest that disk write bandwidth is the limiting factor in my experiment.
Pull request #2319 contains the thread pool implementation.
Here are some simple optimization ideas from prior discussions:
I might add that I'd be interested in seeing a re-evaluation of the faulting-buffer performance. IIRC, we opted not to land drx_buf into the clients because at least one drcachesim test timed out with the faulting-buffer implementation. At the time I didn't really do any benchmarks, and what tests I did run were on a crummy VM. The single test case I evaluated was also heavily multithreaded, and I'm wondering whether potential over-locking like that in #2114 could have indirectly contributed to the problem. I have an implementation of the trace samples using drx_buf here if anyone's interested.
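For context, a rough sketch of the drx_buf faulting-buffer usage pattern (based on the drx_buf API as I recall it; details are approximate and the callback/file names are illustrative):

```cpp
#include "dr_api.h"
#include "drx.h"

static drx_buf_t *trace_buf;
static file_t log_file; // Assumed to be opened elsewhere.

// Invoked when a store hits the guard page at the end of the buffer, i.e.,
// when the buffer is full; no explicit inline overflow check is needed.
static void
flush_cb(void *drcontext, void *buf_base, size_t size)
{
    dr_write_file(log_file, buf_base, size);
}

static void
tracer_init(void)
{
    drx_init();
    trace_buf = drx_buf_create_trace_buffer(64 * 1024, flush_cb);
}

// Per memref in the bb instrumentation event, with reg_ptr/reg_tmp as
// reserved scratch registers and reg_addr holding the memref address:
//   drx_buf_insert_load_buf_ptr(drcontext, trace_buf, bb, where, reg_ptr);
//   drx_buf_insert_buf_store(drcontext, trace_buf, bb, where, reg_ptr, reg_tmp,
//                            opnd_create_reg(reg_addr), OPSZ_PTR, 0);
//   drx_buf_insert_update_buf_ptr(drcontext, trace_buf, bb, where, reg_ptr,
//                                 DR_REG_NULL, sizeof(app_pc));
```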
Adds drutil_insert_get_mem_addr_ex() which returns whether the passed-in extra scratch register was used. Adds a test. Leverages the new function to avoid redundant loads in drcachesim. Issue: #2001
For offline traces for a "disp(base)" memref, only stores the base and adds the disp in raw2trace post-processing, as it's statically known. The base can be directly written as it's already in a register, reducing scratch register pressure. This is only done on x86 for now for simplicity. Moves the second scratch register reservation into the instru_t routines so we can skip it for this optimization of just writing the base reg for "disp(base)" memrefs. Issue: #2001
Fixes a bug in the new drutil_insert_get_mem_addr_ex() feature: it wasn't initializing the output parameter on all paths. Issue: #2001
In the drcachesim tracer drmemtrace, we avoid looping over the trace entries prior to writing them out if there is no processing required (for offline virtually-addressed traces). Issue: #2001
Reduces drcachesim offline tracing overhead by eliding rip-relative and same-unmodified-base addresses from the recorded trace, reconstructing their values during post-processing. In measurements this is pretty significant, removing 12%-17% of the entries that need to be written out during tracing.

Adds identification of elidable memory operands to instru_offline_t, exported both to the runtime tracer and the raw2trace post-processor. Changes raw2trace's instruction cache key to a pair <tag,pc> to handle tail-duplicated blocks. Adds elision identification through a constructed instrlist when first encountering a block tag. Adds a new struct memref_summary_t to store elision information with each cached memory operand.

Increases the offline file version and adds versioning support for backward compatibility with no-elision traces, as well as to make it easier to keep compatibility when more elision cases are added in the future. Adds a file type to the offline file header to identify filtered traces as a sanity check and to avoid extra work when there are no elided addresses at all. Another file type flag identifies whether any optimizations (this and the existing displacement elision) are present, making it possible to disable them for testing purposes. Adds a -disable_optimizations flag for this.

Adds a new test, burst_traceopts, which runs assembly code sequences covering corner cases twice, once with and once without optimizations. It then post-processes each and compares the final trace entries using the external analyzer iterator interface. This found bugs during development and provides confidence that these optimizations are safe.

Improves the pre-existing displacement elision optimization by sharing code between the tracer and raw2trace via offline_instru_t::opnd_disp_is_elidable() and by adding test cases to the new test. Also implements displacement elision for ARM and AArch64, which is required for proper address elision without also recording displacements. The new test includes AArch64 and ARM assembly code. The AArch64 code was tested by temporarily enabling these static-DR tests (unfortunately i#2007 prevents us from enabling them on Travis for now). The ARM assembly builds but is not testable due to missing start/stop features on ARM.

Adds a statistics interface to retrieve raw2trace metrics. The initial metric is the number of elided addresses.

Includes a part of PR #3120 (the memset in d_r_config_exit()) plus a '#' option prefix to work around #2661. Fixes a bug revealed by the tighter post-processing constraints with elision: do not count an artificial jump from a trampoline return as an instruction in the recorded block tag entry; counting it resulted in a duplicated instruction during post-processing.

Issue: #2001, #2661
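To make the reconstruction side concrete, here is a simplified sketch of recovering an elided address during post-processing (the memref_summary_t fields shown are hypothetical; the real struct differs):

```cpp
#include <cstdint>

typedef uint8_t *app_pc;

// Hypothetical per-operand summary; the real memref_summary_t stores
// different fields.
struct memref_summary_t {
    bool is_rip_relative;   // Address is the instruction PC + a static offset.
    int64_t rip_rel_offset; // Static offset encoded in the instruction.
    int64_t disp;           // Static displacement off the base register.
};

// Both elision cases rely on statically-known operand components:
//  - rip-relative: fully recoverable from the instruction's PC alone;
//  - same-unmodified-base: reuse the base value recorded by an earlier
//    memref in the same block and add this operand's displacement.
app_pc
reconstruct_elided_addr(const memref_summary_t &summary, app_pc instr_pc,
                        app_pc prior_base_value)
{
    if (summary.is_rip_relative)
        return instr_pc + summary.rip_rel_offset;
    return prior_base_value + summary.disp;
}
```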
This is something of a broad issue covering analysis and improvement of the performance of the following:
Xref #1929: memtrace slowness due to unbuffered printf
Xref #790: online trace compression
I wanted a place to dump some of my notes on this. The #790 notes are somewhat duplicated in that issue:
memtrace_binary sample perf: 70x (SSD) to 180x (HDD); 4x-25x (ave 18x) w/o disk; w/ no PC 36x (SSD)
mcf test:
=>
Clearly I/O-bound: 9% CPU. Produces a 41GB file with 1.3 billion memrefs.
slowdown: 183x
More like 70x on laptop, and higher %CPU: because it's got an SSD? Or also b/c the CPU is slower (so a higher CPU-to-disk ratio; also slower native)?
Disabling dr_write_file:
=> 5.6x
That's PC, read or write, size, and address.
It should be easy to improve by 2x by removing read/write and size (statically recoverable) and only including the PC once per bb or even less.
But it's much worse on other SPEC benchmarks. A ref run of everything was taking too long; these are the benchmarks done at the point I killed the run, 9 hours in:
Qin: "if memtrace is 100x, if you can make the profile 1/5 the size, can hit 20x"
Can shrink some fields, but not to 1/5. Online gzip compression should easily give 1/5.
Simple test: I see >20x gzip compression (though w/ naive starting format):
Removing the PC field:
=>
Still I/O-bound: 11% CPU. Produces a 31GB file.
slowdown: 126x
On laptop:
Up to 37% CPU, and a 36x slowdown.
drcachesim tracer performance => 2x slower b/c of icache entries
Switching from the mcf test to the bzip2 test b/c it's a little closer to the 18x average performance for the memtrace sample not writing to disk, and so is more representative:
native:
No disk writes at all:
That's 15.6x.
30.8x! 2x vs memtrace, b/c it's including icache info, presumably.
Currently trace_entry_t is 4+8 => 16 bytes b/c of alignment (we didn't pack it, b/c we only care about 32-bit?).
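For concreteness, a sketch of the unpacked vs. packed layouts (field names are illustrative, not the real struct's):

```cpp
#include <cstdint>

// Unpacked: 2+2 bytes of type+size followed by an 8-byte address gets padded
// out to 16 bytes on 64-bit due to the 8-byte alignment of 'addr'.
struct trace_entry_unpacked_t {
    uint16_t type;
    uint16_t size;
    uint64_t addr; // 4 bytes of padding precede this field.
};
static_assert(sizeof(trace_entry_unpacked_t) == 16, "padded to 16");

// Packed: dropping the padding shrinks each entry to 12 bytes, at the cost
// of 4-byte-aligned 8-byte loads/stores of 'addr' (cheap on x86; worth
// measuring on ARM/AArch64).
#pragma pack(push, 1)
struct trace_entry_packed_t {
    uint16_t type;
    uint16_t size;
    uint64_t addr;
};
#pragma pack(pop)
static_assert(sizeof(trace_entry_packed_t) == 12, "packed to 12");
```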
Packing trace_entry_t w/o any other changes to the struct:
Also compressing size+type from 4 bytes into 2 bytes:
(Might need extra escape entry for memsz > 256)
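A sketch of one possible 2-byte encoding with an escape for large sizes (this exact bit layout is hypothetical):

```cpp
#include <cstdint>

// Hypothetical 2-byte header: low 8 bits for the entry type, high 8 bits for
// the access size, with 0xFF reserved as an escape meaning "the real size
// follows in an extra entry" (needed for large memsz values).
constexpr uint8_t SIZE_ESCAPE = 0xFF;

inline uint16_t
encode_header(uint8_t type, uint32_t memsz)
{
    uint8_t size_field = (memsz < SIZE_ESCAPE) ? (uint8_t)memsz : SIZE_ESCAPE;
    return (uint16_t)(((uint16_t)size_field << 8) | type);
}

inline bool
needs_size_escape(uint32_t memsz)
{
    return memsz >= SIZE_ESCAPE; // Caller emits an extra size entry.
}
```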
Also shrinking pc/addr field to 4 bytes:
Also removing INSTR_BUNDLE (always has preceding abs pc so redundant):
10.7x = Also removing all instr entries (thus there's no PC at all):
Having the instr bundles and all the instr boundary info coming from the tracer seems worth it for online simulation, where having the simulator go dig it up from disassembly of a giant binary is going to be slower than the tracer providing it. But for offline, it does seem like we want to really optimize the tracing -- thus we need a split tracer!
14.3x = Adding back one instr entry per bb (1st instr in bb):
Significant cost for instr-entry-per-bb: 33% more expensive.
Maybe we can leverage traces to bring it down, having one instr entry per trace + a bit per cbr + an extra entry per mbr crossed?!?
#790: try online compression with zlib
With the private loader, we should be able to just use the zlib library directly.
It produces a 4GB file (vs 41GB uncompressed binary) but it is much slower!
295x vs native, 1.6x vs uncompressed.
98% CPU, too.
Try the zlib format instead of the gz format, where we can set high speed => Z_BEST_SPEED is faster than uncompressed for HDD, but still not for SSD.
Z_BEST_SPEED:
We have to use the deflate interface directly and the zlib compression format. The gz interface uses the gzip compression format and apparently has no interface to set the speed vs. size.
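A minimal sketch of the deflate-interface usage with Z_BEST_SPEED (standard zlib calls; error handling elided):

```cpp
#include <cstdio>
#include <zlib.h>

// Compress one filled trace buffer with zlib's deflate interface, which,
// unlike the gz* interface, lets us choose Z_BEST_SPEED.
static void
write_compressed(std::FILE *out, const unsigned char *buf, size_t len)
{
    z_stream strm = {};                // zalloc/zfree/opaque zeroed => defaults.
    deflateInit(&strm, Z_BEST_SPEED);  // zlib format, fastest setting.
    unsigned char chunk[65536];
    strm.next_in = const_cast<unsigned char *>(buf);
    strm.avail_in = (uInt)len;
    int ret;
    do {
        strm.next_out = chunk;
        strm.avail_out = sizeof(chunk);
        ret = deflate(&strm, Z_FINISH); // Flush everything for this buffer.
        std::fwrite(chunk, 1, sizeof(chunk) - strm.avail_out, out);
    } while (ret != Z_STREAM_END);
    deflateEnd(&strm);
}
```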
It produces a 4.5GB file and is significantly faster than uncompressed, but it's still 114x vs native.
On laptop it makes a 4.3GB file (should have saved it to see if it's really different) and:
So even Z_BEST_SPEED is slower than uncompressed on an SSD!