Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redo rep string expansion to avoid the instruction fetch on subsequent iterations #4948

Open
derekbruening opened this issue Jun 16, 2021 · 5 comments

Comments

@derekbruening
Copy link
Contributor

This is related to #4915 where we may remove the instr fetches for subsequent rep string iterations, currently marked as special "non-fetched" instructions, from the offline drcachesim trace post-processing. Here though the goal is remove the instruction fetch for all clients by changing the expansion.

Today the single-block-external-loop expansion results in a string operation instruction in each iteration:

$ bin64/drrun -c api/bin/libmemtrace_simple.so -- suite/tests/bin/allasm_repstr && cat `ls -1td api/bin/*.log|head -1`
Format: <data address>: <data size>, <(r)ead/(w)rite/opcode>
0x401018:  2, rep movs
0x40200e:  1, r
0x402000:  1, w
0x401018:  2, rep movs
0x40200f:  1, r
0x402001:  1, w
0x401018:  2, rep movs
0x402010:  1, r
0x402002:  1, w
0x401018:  2, rep movs
0x402011:  1, r
0x402003:  1, w
0x401018:  2, rep movs
0x402012:  1, r
0x402004:  1, w

There are 3 ways we could remove it (aside from an offline post-processing model as in #4915):

A) Use runtime state:

if TLS.flag == 0
  record instr fetch
  set TLS.flag to 1
<today's expansion but without emulated instr fetch>
if loop is going to be not taken (check ecx)
  set TLS.flag to 0
loop

This is expensive.

B) Unroll the 1st iter in the expansion. This adds a bunch more branches and we'll want to try to ensure it won't mess up drreg or other linear-flow libraries.

C) Unroll the 1st iter across 2 blocks: 1st iter block plus rest-of-loop block. This may be harder than B due to needing unique PC entries.

Probably B would be the way to go.

derekbruening added a commit that referenced this issue Jun 18, 2021
Adds a new emulation instrumentation interface with simpler
convenience routines for identifying the original application
instruction fetch and operands.

Uses the new interface in memtrace_simple.  Separate commits will
update memval_simple, memtrace_x86, and drmemtrace.

Adds an x86 test that runs the memtrace_simple sample on the
allasm_repstr app and ensures we see 2-byte "rep movs" instructions
instead of the 1-byte "movs" that the existing sample client records
due to not being emulation-aware.  Adds a -log_to_stderr flag to the
sample to make the test simpler and not have to deal with log files.

Adding new events is separated into #4947 which we may attempt in the
future.  Removing rep string non-fetched instructions is #4948.

--------------------------------------------------
Testing:

memtrace_simple before:
    $ bin64/drrun -c api/bin/libmemtrace_simple.so -- suite/tests/bin/allasm_repstr && cat `ls -1td api/bin/*.log|head -1`
    Format: <data address>: <data size>, <(r)ead/(w)rite/opcode>
    0x401018:  1, movs
    0x40200e:  1, r
    0x402000:  1, w
    0x401018:  1, movs
    0x40200f:  1, r
    0x402001:  1, w
    0x401018:  1, movs
    0x402010:  1, r
    0x402002:  1, w
    0x401018:  1, movs
    0x402011:  1, r
    0x402003:  1, w
    0x401018:  1, movs
    0x402012:  1, r
    0x402004:  1, w
After:
    Format: <data address>: <data size>, <(r)ead/(w)rite/opcode>
    0x401018:  2, rep movs
    0x40200e:  1, r
    0x402000:  1, w
    0x401018:  2, rep movs
    0x40200f:  1, r
    0x402001:  1, w
    0x401018:  2, rep movs
    0x402010:  1, r
    0x402002:  1, w
    0x401018:  2, rep movs
    0x402011:  1, r
    0x402003:  1, w
    0x401018:  2, rep movs
    0x402012:  1, r
    0x402004:  1, w

memtrace_simple on a gather expansion:
    $ ninja && bin64/drrun -c api/bin/libmemtrace_simple.so -log_to_stderr -- suite/tests/bin/allasm_scattergather
    Format: <data address>: <data size>, <(r)ead/(w)rite/opcode>
    0x40105c: 11, mov
    0x402022:  4, w
    0x401067: 11, mov
    0x402026:  4, w
    0x401072: 11, mov
    0x40202a:  4, w
    0x40108a:  7, mov
    0x402026:  4, r
    0x40108a:  7, mov
    0x40202e:  4, r
    0x40108a:  7, mov
    0x40202a:  4, r
  =>
    Format: <data address>: <data size>, <(r)ead/(w)rite/opcode>
    0x40105c: 11, mov
    0x402022:  4, w
    0x401067: 11, mov
    0x402026:  4, w
    0x401072: 11, mov
    0x40202a:  4, w
    0x40108a: 10, vpgatherdd
    0x402026:  4, r
    0x40202e:  4, r
    0x40202a:  4, r
--------------------------------------------------

Issue: #4865
derekbruening added a commit that referenced this issue Jun 18, 2021
Adds a new emulation instrumentation interface with simpler
convenience routines for identifying the original application
instruction fetch and operands.

Renames DR_EMULATE_FIRST_INSTR to DR_EMULATE_IS_FIRST_INSTR.

Uses the new interface in memtrace_simple.  Separate commits will
update memval_simple, memtrace_x86, and drmemtrace.

Adds an x86 test that runs the memtrace_simple sample on the
allasm_repstr app and ensures we see 2-byte "rep movs" instructions
instead of the 1-byte "movs" that the existing sample client records
due to not being emulation-aware.  Adds a -log_to_stderr flag to the
sample to make the test simpler and not have to deal with log files.

Adding new events is separated into #4947 which we may attempt in the
future.  Removing rep string non-fetched instructions is #4948.

--------------------------------------------------
Testing:

memtrace_simple before:
    $ bin64/drrun -c api/bin/libmemtrace_simple.so -- suite/tests/bin/allasm_repstr && cat `ls -1td api/bin/*.log|head -1`
    Format: <data address>: <data size>, <(r)ead/(w)rite/opcode>
    0x401018:  1, movs
    0x40200e:  1, r
    0x402000:  1, w
    0x401018:  1, movs
    0x40200f:  1, r
    0x402001:  1, w
    0x401018:  1, movs
    0x402010:  1, r
    0x402002:  1, w
    0x401018:  1, movs
    0x402011:  1, r
    0x402003:  1, w
    0x401018:  1, movs
    0x402012:  1, r
    0x402004:  1, w
After:
    Format: <data address>: <data size>, <(r)ead/(w)rite/opcode>
    0x401018:  2, rep movs
    0x40200e:  1, r
    0x402000:  1, w
    0x401018:  2, rep movs
    0x40200f:  1, r
    0x402001:  1, w
    0x401018:  2, rep movs
    0x402010:  1, r
    0x402002:  1, w
    0x401018:  2, rep movs
    0x402011:  1, r
    0x402003:  1, w
    0x401018:  2, rep movs
    0x402012:  1, r
    0x402004:  1, w

memtrace_simple on a gather expansion:
    $ ninja && bin64/drrun -c api/bin/libmemtrace_simple.so -log_to_stderr -- suite/tests/bin/allasm_scattergather
    Format: <data address>: <data size>, <(r)ead/(w)rite/opcode>
    0x40105c: 11, mov
    0x402022:  4, w
    0x401067: 11, mov
    0x402026:  4, w
    0x401072: 11, mov
    0x40202a:  4, w
    0x40108a:  7, mov
    0x402026:  4, r
    0x40108a:  7, mov
    0x40202e:  4, r
    0x40108a:  7, mov
    0x40202a:  4, r
  =>
    Format: <data address>: <data size>, <(r)ead/(w)rite/opcode>
    0x40105c: 11, mov
    0x402022:  4, w
    0x401067: 11, mov
    0x402026:  4, w
    0x401072: 11, mov
    0x40202a:  4, w
    0x40108a: 10, vpgatherdd
    0x402026:  4, r
    0x40202e:  4, r
    0x40202a:  4, r
--------------------------------------------------

Issue: #4865
@derekbruening
Copy link
Contributor Author

Re-visiting since we want this for #3995 to ensure the instruction counts used line up with PMU data: I don't understand B) unless it makes the whole loop internal to the block (after the unrolled 1st iter). That seems dangerous: DR relies on loops going between blocks to bound signal delivery times, perform performant "unlink" flushes, build traces, and other operations.

@derekbruening
Copy link
Contributor Author

For C), I assume it's thinking of making the PC of the 2nd block after the rep byte. But how do we know the app doesn't
itself jump to that point? It's unlikely but possible.

@derekbruening
Copy link
Contributor Author

Trying to re-write the choices here:

A) On entry if TLS flag not set, record instr fetch and set flag; on final
iter, clear flag. I assume have to try to clear flag on abnormal loop
break too: a signal. So adds overhead and extra branches.

B) Keep entire loop inside block and unroll 1st iter to do fetch. Has all
the downsides of a loop inside a block: messes up DR's trace building,
signal delivery delay bounds, unlink flushing, shared fragment deletion
promptness, etc.

C) Try to split across two blocks, one for 1st iter and 2nd for rest. Use
PC+1 (skipping rep byte) for tag for 2nd and hope app never targets that
PC. When see a plain string op have to decode backward I guess or have
some recorded state so know it's this artificial loop body.

D) #4915: Remove all but 1st fetch in raw2trace, and add analysis to do the same
when dumping a trace buffer to get the live instr counts to match too --
which also fixes online. But then instead of a single contiguous buffer
being written/transmitted you have to do it in many pieces to skip the
extra fetches, or waste time memcpying it on top of itself back to
contiguous.

@derekbruening
Copy link
Contributor Author

For drmemtrace, instruction counting for delaying traces, gaps between trace windows (xref #3995), and trace window durations are all inflated by these non-fetched instructions.

@jackgallagher-arm
Copy link
Collaborator

For the AArch64 scatter/gather expansion we decided to unroll the loop to avoid this issue. If we implement a better solution for repstr it might be worth revisiting scatter/gather expansion too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants