-
Notifications
You must be signed in to change notification settings - Fork 566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Redo rep string expansion to avoid the instruction fetch on subsequent iterations #4948
Comments
Adds a new emulation instrumentation interface with simpler convenience routines for identifying the original application instruction fetch and operands. Uses the new interface in memtrace_simple. Separate commits will update memval_simple, memtrace_x86, and drmemtrace. Adds an x86 test that runs the memtrace_simple sample on the allasm_repstr app and ensures we see 2-byte "rep movs" instructions instead of the 1-byte "movs" that the existing sample client records due to not being emulation-aware. Adds a -log_to_stderr flag to the sample to make the test simpler and not have to deal with log files. Adding new events is separated into #4947 which we may attempt in the future. Removing rep string non-fetched instructions is #4948. -------------------------------------------------- Testing: memtrace_simple before: $ bin64/drrun -c api/bin/libmemtrace_simple.so -- suite/tests/bin/allasm_repstr && cat `ls -1td api/bin/*.log|head -1` Format: <data address>: <data size>, <(r)ead/(w)rite/opcode> 0x401018: 1, movs 0x40200e: 1, r 0x402000: 1, w 0x401018: 1, movs 0x40200f: 1, r 0x402001: 1, w 0x401018: 1, movs 0x402010: 1, r 0x402002: 1, w 0x401018: 1, movs 0x402011: 1, r 0x402003: 1, w 0x401018: 1, movs 0x402012: 1, r 0x402004: 1, w After: Format: <data address>: <data size>, <(r)ead/(w)rite/opcode> 0x401018: 2, rep movs 0x40200e: 1, r 0x402000: 1, w 0x401018: 2, rep movs 0x40200f: 1, r 0x402001: 1, w 0x401018: 2, rep movs 0x402010: 1, r 0x402002: 1, w 0x401018: 2, rep movs 0x402011: 1, r 0x402003: 1, w 0x401018: 2, rep movs 0x402012: 1, r 0x402004: 1, w memtrace_simple on a gather expansion: $ ninja && bin64/drrun -c api/bin/libmemtrace_simple.so -log_to_stderr -- suite/tests/bin/allasm_scattergather Format: <data address>: <data size>, <(r)ead/(w)rite/opcode> 0x40105c: 11, mov 0x402022: 4, w 0x401067: 11, mov 0x402026: 4, w 0x401072: 11, mov 0x40202a: 4, w 0x40108a: 7, mov 0x402026: 4, r 0x40108a: 7, mov 0x40202e: 4, r 0x40108a: 7, mov 0x40202a: 4, r => Format: <data address>: <data size>, <(r)ead/(w)rite/opcode> 0x40105c: 11, mov 0x402022: 4, w 0x401067: 11, mov 0x402026: 4, w 0x401072: 11, mov 0x40202a: 4, w 0x40108a: 10, vpgatherdd 0x402026: 4, r 0x40202e: 4, r 0x40202a: 4, r -------------------------------------------------- Issue: #4865
Adds a new emulation instrumentation interface with simpler convenience routines for identifying the original application instruction fetch and operands. Renames DR_EMULATE_FIRST_INSTR to DR_EMULATE_IS_FIRST_INSTR. Uses the new interface in memtrace_simple. Separate commits will update memval_simple, memtrace_x86, and drmemtrace. Adds an x86 test that runs the memtrace_simple sample on the allasm_repstr app and ensures we see 2-byte "rep movs" instructions instead of the 1-byte "movs" that the existing sample client records due to not being emulation-aware. Adds a -log_to_stderr flag to the sample to make the test simpler and not have to deal with log files. Adding new events is separated into #4947 which we may attempt in the future. Removing rep string non-fetched instructions is #4948. -------------------------------------------------- Testing: memtrace_simple before: $ bin64/drrun -c api/bin/libmemtrace_simple.so -- suite/tests/bin/allasm_repstr && cat `ls -1td api/bin/*.log|head -1` Format: <data address>: <data size>, <(r)ead/(w)rite/opcode> 0x401018: 1, movs 0x40200e: 1, r 0x402000: 1, w 0x401018: 1, movs 0x40200f: 1, r 0x402001: 1, w 0x401018: 1, movs 0x402010: 1, r 0x402002: 1, w 0x401018: 1, movs 0x402011: 1, r 0x402003: 1, w 0x401018: 1, movs 0x402012: 1, r 0x402004: 1, w After: Format: <data address>: <data size>, <(r)ead/(w)rite/opcode> 0x401018: 2, rep movs 0x40200e: 1, r 0x402000: 1, w 0x401018: 2, rep movs 0x40200f: 1, r 0x402001: 1, w 0x401018: 2, rep movs 0x402010: 1, r 0x402002: 1, w 0x401018: 2, rep movs 0x402011: 1, r 0x402003: 1, w 0x401018: 2, rep movs 0x402012: 1, r 0x402004: 1, w memtrace_simple on a gather expansion: $ ninja && bin64/drrun -c api/bin/libmemtrace_simple.so -log_to_stderr -- suite/tests/bin/allasm_scattergather Format: <data address>: <data size>, <(r)ead/(w)rite/opcode> 0x40105c: 11, mov 0x402022: 4, w 0x401067: 11, mov 0x402026: 4, w 0x401072: 11, mov 0x40202a: 4, w 0x40108a: 7, mov 0x402026: 4, r 0x40108a: 7, mov 0x40202e: 4, r 0x40108a: 7, mov 0x40202a: 4, r => Format: <data address>: <data size>, <(r)ead/(w)rite/opcode> 0x40105c: 11, mov 0x402022: 4, w 0x401067: 11, mov 0x402026: 4, w 0x401072: 11, mov 0x40202a: 4, w 0x40108a: 10, vpgatherdd 0x402026: 4, r 0x40202e: 4, r 0x40202a: 4, r -------------------------------------------------- Issue: #4865
Re-visiting since we want this for #3995 to ensure the instruction counts used line up with PMU data: I don't understand B) unless it makes the whole loop internal to the block (after the unrolled 1st iter). That seems dangerous: DR relies on loops going between blocks to bound signal delivery times, perform performant "unlink" flushes, build traces, and other operations. |
For C), I assume it's thinking of making the PC of the 2nd block after the rep byte. But how do we know the app doesn't |
Trying to re-write the choices here: A) On entry if TLS flag not set, record instr fetch and set flag; on final B) Keep entire loop inside block and unroll 1st iter to do fetch. Has all C) Try to split across two blocks, one for 1st iter and 2nd for rest. Use D) #4915: Remove all but 1st fetch in raw2trace, and add analysis to do the same |
For drmemtrace, instruction counting for delaying traces, gaps between trace windows (xref #3995), and trace window durations are all inflated by these non-fetched instructions. |
For the AArch64 scatter/gather expansion we decided to unroll the loop to avoid this issue. If we implement a better solution for repstr it might be worth revisiting scatter/gather expansion too. |
This is related to #4915 where we may remove the instr fetches for subsequent rep string iterations, currently marked as special "non-fetched" instructions, from the offline drcachesim trace post-processing. Here though the goal is remove the instruction fetch for all clients by changing the expansion.
Today the single-block-external-loop expansion results in a string operation instruction in each iteration:
There are 3 ways we could remove it (aside from an offline post-processing model as in #4915):
A) Use runtime state:
This is expensive.
B) Unroll the 1st iter in the expansion. This adds a bunch more branches and we'll want to try to ensure it won't mess up drreg or other linear-flow libraries.
C) Unroll the 1st iter across 2 blocks: 1st iter block plus rest-of-loop block. This may be harder than B due to needing unique PC entries.
Probably B would be the way to go.
The text was updated successfully, but these errors were encountered: