Collect multiple trace samples from a single benchmark run #3995

Open
prasun3 opened this issue Dec 16, 2019 · 26 comments
@prasun3
Contributor

prasun3 commented Dec 16, 2019

Add a pointer to any prior users list discussion.
Currently we can collect one trace sample during a benchmark run. We use '-trace_after_instrs' and '-exit_after_tracing' to select a trace point.

Is your feature request related to a problem? Please describe.
We need to run the benchmark multiple times to collect traces throughout the benchmark execution. For a long-running benchmark, this may take a very long time.

Describe the solution you'd like
Have a way to ‘trace x million insts every y million insts’. A more advanced method would be to have a config file listing the sampling windows.

Do you have any implementation in mind for this feature?
No

Describe alternatives you've considered
Currently we plan to run the benchmark multiple times, each with a different sampling point.

Additional context
None.

@johnfxgalea
Contributor

johnfxgalea commented Dec 16, 2019

Thank you for the request. I guess one quick solution is to use a script that runs the benchmark multiple times with updated param values for -trace_after_instrs. However, I do understand this is not ideal for long benchmarks. Would be happy to look at a PR if you wish to contribute the functionality.

@derekbruening
Contributor

Presumably the approach of recording a single long trace covering all desired execution windows and then splitting it up into pieces offline will not work due to such a trace simply being too large to easily store?

@prasun3
Contributor Author

prasun3 commented Dec 17, 2019

That is right. We see pretty large trace sizes and run into disk space issues.

@derekbruening
Contributor

derekbruening commented Feb 10, 2020

Xref an existing feature request to use annotations added to the application to delineate phase regions and have the tracer recognize the annotations and enable/disable recording at the boundaries: #2478.

Also note that another method of creating multiple traces from one execution is to insert start/stop commands into the application. This is well-supported today, in particular with static linkage of the tracer into the application, and we have a number of regression tests of this. E.g., see https://github.com/DynamoRIO/dynamorio/blob/master/clients/drcachesim/tests/burst_static.cpp
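
For illustration, the start/stop burst shape looks roughly like the following. This is only a sketch modeled loosely on that test, assuming the dr_app_setup_and_start()/dr_app_stop_and_cleanup() entry points and omitting the build setup (static linking of the tracer and passing its options, e.g. via the environment):

```
// Sketch of the statically-linked start/stop burst approach (modeled loosely on
// clients/drcachesim/tests/burst_static.cpp; build/link and option-passing details
// are omitted and the repeated attach/detach pattern here is an assumption).
#include "dr_api.h" // Assumed to declare the dr_app_*() start/stop entry points.

static void
compute_phase()
{
    // Application work we want captured in one trace burst.
    volatile double sum = 0.;
    for (int i = 0; i < 1000000; ++i)
        sum += i * 0.5;
}

int
main()
{
    for (int burst = 0; burst < 3; ++burst) {
        dr_app_setup_and_start();  // Take over and begin tracing this burst.
        compute_phase();
        dr_app_stop_and_cleanup(); // Stop tracing and return to native execution.
    }
    return 0;
}
```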

Those two approaches both require modifying the application. This issue here covers specifying boundaries on an unmodified application.

@derekbruening
Contributor

@prasun3 -- would your use case prefer to modify the application with annotations in the source code to delineate precise tracing regions? Or would you prefer this feature as filed, to trace a certain number of instructions without regard to any corresponding application phases or code boundaries?

@prasun3
Contributor Author

prasun3 commented Feb 11, 2020

Or would you prefer this feature as filed, to trace a certain number of instructions without regard to any corresponding application phases or code boundaries?

We would prefer this approach -- based on instruction count.

@derekbruening derekbruening self-assigned this Feb 25, 2020
@derekbruening
Contributor

My proposal is to implement this using #4134. The existing -trace_after_instrs does a flush, which is expensive. If we want to swap a bunch of times it may be more efficient to use the multi-version support being added for #4134.

@derekbruening
Contributor

Xref #3107 as another proposal for delimiting tracing regions

derekbruening added a commit that referenced this issue Apr 20, 2020
Converts the existing -trace_after_instrs delayed tracing feature to
use the drbbdup multi-instrumentation library with two cases: counting
instructions, and full tracing.

The drbbdup case encoding is a global std::atomic, written to using
language features which are lock-free and safe for client use.

This will lay the groundwork for the full i#3995 feature of repeatedly
swapping between the two cases.

TODO: For function tracing, we need to invoke drwrap only for the full
tracing case and not for the instruction counting case.  The plan is
to add a drwrap mode where drwrap does not use its own insertion event
and instead the user invokes drwrap from its insertion event.

TODO: For AArch64, drbbdup needs to handle reachability and encoding
issues with loading the global case value into a register.

Issue: #3995
derekbruening added a commit that referenced this issue Apr 20, 2020
Adds new options -trace_for_instrs and -retrace_every_instrs to
drcachesim for periodic trace bursts of an unmodified application.

TODO: Implement these using the new drbbdup framework by repeatedly
alternating among the cases.

Issue: #3995
@5surim

5surim commented Apr 19, 2021

@derekbruening
Hello Derek,

I found this thread while looking for a way to do periodic tracing in a single run.
Your new options -trace_for_instrs and -retrace_every_instrs to drcachesim will be very useful in my case for periodic trace bursts of an unmodified application.
The options have been added, but their behavior does not seem to be implemented yet. Are you still planning on adding an implementation for this? (TODO: Implement these using the new drbbdup framework by repeatedly alternating among the cases.)

Best,
Surim

@derekbruening
Contributor

Right, there were still a number of issues here and it ended up de-prioritized and so was not finished. First I'm going to dump my notes from a year ago:

TODO use i#4134 drbbdup to swap between instrumentation modes [6/11]

DONE app2app used to pass user_data w/ info on repstr to analysis => storing in TLS

CLOSED: [2020-04-19 Sun 22:40]

DONE drmgr_is_first_nonlabel_instr => John added drbbdup_is_first_nonlabel_instr()

CLOSED: [2020-04-19 Sun 22:40]

Could deduce it in orig analysis cb and pass it through.
=> John added drbbdup_is_first_nonlabel_instr()

TODO drwrap interactions: priority, and how to do drwrap for one case but not another?

Another issue: drcachesim needs its insert instrumentation callback to go
after drwrap's (see memtrace_pri comments). It looks like drbbdup
hardcodes the insert event to be at DRMGR_PRIORITY_DRBBDUP = -1500, which
is too early since DRMGR_PRIORITY_INSERT_DRWRAP is 500.

I'm not sure I can work around this one. For the app2app issue I could move the user
data into TLS, and the first_nonlabel I could store myself. But the
priority does not seem solvable from the outside.

I guess this raises a larger issue: we want the drwrap clean call to only
be inserted for one of our drbbdup cases. But with no control over drwrap,
we can't arrange that, and drwrap is going to go insert its clean call at
the start before the drbbdup case dispatch I would guess.

How to solve?

If drbbdup were integrated inside drmgr: would non-bbdup-aware users of
drmgr like drwrap be invoked for the default case insertion and not the
other cases? So the user can either have wrapping for just one case (has
to be default)? Provide an option where user can pick one case, or all
cases?

Or, drwrap has to turn its control model around and have the user call
"do_drwrap_instru" from its insertion event, with drwrap not registering
instru events (but still registering modload, etc.)?

Leaning toward the latter: add a drwrap global flag.

DONE let user provide TLS memop for encoding to avoid mem2mem move

CLOSED: [2020-04-19 Sun 23:20]

Looks like this today (here coming from non-TLS but I would change that):

load case encoding:
 +37   m4 @0x00007f4384a5ab20  48 bf e0 8c a1 04 44 mov    $0x00007f4404a18ce0 -> %rdi
                               7f 00 00
 +47   m4 @0x00007f4384a5b880  8b 3f                mov    (%rdi)[4byte] -> %edi
 +49   m4 @0x00007f4384a5a920  65 48 89 3c 25 00 01 mov    %rdi -> %gs:0x00000100[8byte]
                               00 00
 +58   m4 @0x00007f4384a5b600                       <label>
 +58   m4 @0x00007f4384a5ae00  48 b8 01 00 00 00 00 mov    $0x0000000000000001 -> %rax
                               00 00 00
 +68   m4 @0x00007f4384a5b6d0  65 48 39 04 25 00 01 cmp    %gs:0x00000100[8byte] %rax
                               00 00

Either I provide TLS opnd, or I can query the TLS offset drbbdup is using
and I can write to it.

Wait, actually I can just write directly to it: assuming the slot isn't
cleared or otherwise touched by drbbdup.

Maybe drbbdup can document that it guarantees to not write to the slot
itself, so users know they can have insert_encode be a nop (why not let it
be NULL?)

DONE barrier on load for cmp in every bb? or send signal to every thread: is only API for that dr_suspend_all_other_threads?? => atomic load, though NYI inside drbbdup for non-x86

CLOSED: [2020-04-19 Sun 23:20]

The issue for a globally changing case is memory visibility. If I make my encoding address a single global, I would need a barrier here for aarchxx. The alternative is to have the encoding be TLS and when the value changes go force all the threads to update their values -- although I'm not sure how to do that. If I could run code as each thread I could have each go do an acquire load. I can do that on UNIX by sending a signal. I suppose I could do NtSetContextThread on Windows twice to run my code sequence. I would want to add DR API support for those. dr_suspend_all_other_threads() could do it I guess by setting the mcontext and then resuming -- but there's no per-thread control point to restore w/o having a wait point and another suspend-the-world.

I don't know how the performance of a barrier at the top of every block compares to a heavyweight interrupt-all-threads-when-change approach. Certainly the barrier is much, much easier to implement (esp. when #4215 is done).

How do you envision a typical use case changing the encoding? Are all the use cases you're thinking of wanting local changes where only the current thread changes its own encoding independently of all others?

Should there then be an option for whether to use an acquire load or a regular load? (This is why I was proposing a register interface before.)

DONE calling drbbdup_register_case_encoding and passing the default encoding: it lets me make a duplicate, and calls both => John fixed

CLOSED: [2020-04-19 Sun 23:21]

DONE labels not preserved from analysis to insert phase => now preserved

CLOSED: [2020-04-19 Sun 22:45]

Used for elision, and same func used in raw2trace so don't want special iterator

TODO add for_trace, translating, and dr_emit_flags_t to cb's to avoid needing _ex versions later for complex users?
TODO runtime_case_opnd problems
  1. For a global: can I use opnd_create_rel_addr() on all platforms?
    Client lib is reachable by default (for x86 2G reachability; how far can
    A64 reach? LDR only reaches +-1MB!!)
    We need drbbdup's XINST_CREATE_load to auto-convert to a pc-rel load on AArch64.

  2. The size of drbbdup_options_t.runtime_case_opnd is not specified.
    For me I'd like to make it std::atomic but that won't work if my code
    writes just one byte but drbbdup goes and reads 4 or 8 bytes. I also get
    runtime errors if I use a non-pointer-size:
    ERROR: Could not find encoding for: mov 0x00007f006f43fce0[4byte] -> %rax

TODO redundant spill-restore code

After dispatching to BBDUP_MODE_COUNT, the flags are restored and
immediately re-spilled before the drx_insert_counter_update():

 +44   m4 @0x00007fe75f0a6880  48 39 05 91 79 0d 80 cmp    0x00007fe7df17a298[8byte] %rax
 +51   m4 @0x00007fe75f0a5920  0f 85 48 00 00 00    jnz    @0x00007fe75f0a6348[8byte]
 +57   m4 @0x00007fe75f0a5e00  65 48 a1 10 01 00 00 mov    %gs:0x00000110[8byte] -> %rax
                               00 00 00 00
 +68   m4 @0x00007fe75f0a66d0  04 7f                add    $0x7f %al -> %al
 +70   m4 @0x00007fe75f0a5ca0  9e                   sahf   %ah
 +71   m4 @0x00007fe75f0a5e68  65 48 a1 08 01 00 00 mov    %gs:0x00000108[8byte] -> %rax
                               00 00 00 00
 +82   m4 @0x00007fe75f0a6750  65 48 a3 e8 00 00 00 mov    %rax -> %gs:0x000000e8[8byte]
                               00 00 00 00
 +93   m4 @0x00007fe75f0a5ee8  9f                   lahf    -> %ah
 +94   m4 @0x00007fe75f0a6050  0f 90 c0             seto    -> %al
 +97   m4 @0x00007fe75f0a61c8  48 83 05 e0 13 fc 7f add    $0x0000000000000002 <rel> 0x00007fe7df063ce8[8byte] -> <rel> 0x00007fe7df063ce8[8byte]
                               02

It's because drbbdup doesn't use drreg for that flags restore.

TODO can't use for existing code w/o aarch support!

Need barrier to read encoding for aarch

TODO use drbbdup for regular instru too? is there overhead w/ only 1 case?

Can tell it not to duplicate at all on a per-bb basis.

TODO i#4226: problem: drbbdup has fragment_deleted event!

ext/drbbdup/drbbdup.c: dr_register_delete_event(deleted_frag);

TODO re-implement delayed tracing using drbbdup too

TODO function tracing: needs drwrap changes discussed above

For function tracing, we need to invoke drwrap only for the full tracing
case and not for the instruction counting case. The plan is to add a
drwrap mode where drwrap does not use its own insertion event and instead
the user invokes drwrap from its insertion event.

TODO AArch64: drbbdup needs to handle reachability and encoding issues with loading the global case value into a register

TODO Set opts.dup_limit to 1 as noted by John

TODO Set event_bb_analyze_orig and event_bb_analyze_orig_cleanup to NULL because they do nothing. Same goes for event_bb_retrieve_mode. (Again as noted by John.)

TODO change -max_trace_size to swap to instr-count instead of continuing to trace when limit is reached?

TODO support a config file for variable lengths of each burst?

TODO measure perf: ensure no overhead w/ only 1 case

TODO measure perf vs flush for delay

@derekbruening
Contributor

derekbruening commented Apr 19, 2021

So I started on this a year ago and, as the two commit messages show, I began with the first step of refactoring the existing delayed tracing to use drbbdup. At the time, drbbdup was just being developed, and I shared the branch https://github.com/DynamoRIO/dynamorio/tree/i3995-multi-burst for discussions with the drbbdup author @johnfxgalea, who FTR had these comments, which I do not think have been acted upon yet on my side:

Thanks, I had a look and overall the implementation seems to be good.

Just some minor issues:

  1. The dup limit is the number of additional cases, excluding the default case. Therefore, you could have set this to 1. Essentially, you defined an additional slot for nothing. However, this is not a big deal as drbbdup does not produce a wasted duplication of the basic block because it only acts on defined cases.

opts.dup_limit = 2;

  2. You could have set event_bb_analyze_orig and event_bb_analyze_orig_cleanup to NULL because they do nothing. Same goes for event_bb_retrieve_mode.

  3. I see a lot of changes from using "instr" to "where" during the insertion stage, due to no fault of your own but as a requirement stemming from drbbdup. I looked at the docs and I don't seem to have motivated the reasons behind this requirement. Essentially, drbbdup cannot duplicate syscall/cti instructions but must leave such instructions at the end of the basic block. In order to provide different case instrumentation for these instructions, instrumentation must be inserted with respect to "where". I'll update the docs.

After this the feature was de-prioritized and other work took precedence.

For the refactoring step of having the existing delay use drbbdup:

And then:

  • Implement the new options using the new drbbdup framework by repeatedly alternating among the cases.

Un-assigning myself for now as I am not sure when I would have time for it. If someone else wants to pick it up, that would be appreciated.

@derekbruening derekbruening removed their assignment Apr 19, 2021
@prasun3
Contributor Author

prasun3 commented Apr 23, 2021

Reading some of this had me wondering: with the bbdup mechanism, would it be possible to implement switching between different tracers too? For example, we could collect an "L0_filter" trace (which could be used for cache warmup) and then switch to a full instruction trace?

@johnfxgalea
Contributor

johnfxgalea commented Apr 23, 2021

@prasun3 That is a pretty good idea! In theory, it should be possible but not sure about the technical effort required. Personally, I created drbbdup for research on taint analysis, so much of my investigation on the approach focuses on that application.

I'd be happy to take the initiative in taking on this PR, but all my time available for DynamoRIO maintenance is being spent on drreg atm.

@L-Chambers

Is there any sort of timeline for the DrCacheSim flags -trace_for_instrs and -retrace_every_instrs to be implemented?

Thank you.

@derekbruening
Contributor

Is there any sort of timeline for the DrCacheSim flags -trace_for_instrs and -retrace_every_instrs to be implemented?

Please go ahead and pick up the branch where it was left off if you are interested in this feature -- as the comments above note, it is not clear someone else will have time to take this on.

@derekbruening
Contributor

derekbruening commented Jan 6, 2022

We have renewed interest in this and may revive it: first checking whether anyone else has put work into this that was not pushed to a branch? I didn't see any pushes beyond my initial work from before.

We're looking at two features:

  1. Trace for N instructions every M instructions

  2. Specify many windows via precise start and length points. The first feature could use this mechanism, but this likely requires a config file, and it might be nice to have convenience parameters that don't need a separate file.

One final idea: for long periods of no tracing, detach to native and use the PMU to count instructions for re-attach, for very low overhead when not tracing.

@derekbruening derekbruening self-assigned this Jan 6, 2022
derekbruening added a commit that referenced this issue Mar 7, 2022
Converts the existing -trace_after_instrs delayed tracing feature to
use the drbbdup multi-instrumentation library with two cases: counting
instructions, and full tracing.  It also uses the drwrap support for
drbbdup via its "control inversion".

Removes the #4893 workaround where function tracing via drwrap could
not be delayed and the tracer instead discarded the data.  Now we have
proper delaying.

The drbbdup case encoding is a global std::atomic, written to using
language features which are lock-free and safe for client use.

This will lay the groundwork for the full i#3995 feature of repeatedly
swapping between the two cases.

Tested the #4893 removal on a small app that calls "malloc":
    $ bin64/drrun -t drcachesim -record_heap -offline -- ~/dr/test/mprot && bin64/drrun -t drcachesim -indir $(ls -1td drmem*.dir | head -1) -simulator_type basic_counts
    drmemtrace exiting process 1161557; traced 62158 references.
    Total counts:
          135566 total (fetched) instructions
              ...
               4 total function id markers
               2 total function return address markers
               2 total function argument markers
               2 total function return value markers

    $ bin64/drrun -t drcachesim -trace_after_instrs 10M -record_heap -offline -- ~/dr/test/mprot && bin64/drrun -t drcachesim -indir $(ls -1td drmem*.dir | head -1) -simulator_type basic_counts
    drmemtrace exiting process 1161726; traced 1 references.
    Basic counts tool results:
    Total counts:
               0 total (fetched) instructions
              ...
               0 total function id markers
               0 total function return address markers
               0 total function argument markers
               0 total function return value markers

The drstatecmp-fuzz test was overly sensitive to the delay trigger: de-flaked it by removing the "Generate" output so the trigger point does not vary as much based on the build and trigger technique, to avoid failure with this change.

Issue: #3995, #4893
@derekbruening
Contributor

The branch https://github.com/DynamoRIO/dynamorio/tree/i3995-multi-burst has been subsumed by PR #5393 so I am deleting it now.

derekbruening added a commit that referenced this issue Mar 8, 2022
Adds new options -trace_for_instrs and -retrace_every_instrs to
drcachesim for periodic trace bursts of an unmodified application.
Implements them by adapting the existing drbbdup cases for switching
between -trace_after_instrs and full tracing.

Adds documentation on the new options.

Adds instru_t::get_instr_count to count instructions while tracing, to
know when a tracing burst window is finished.  Uses a local counter
only added to the global every 10K instructions to avoid
synchronization costs.

Adds a new marker with the ordinal of the trace window.  This marker
is added to each buffer header.  This, combined with a new check for
the window having changed to ensure a buffer dump at the end of each
block, limits the possible window drift to one block's worth of data.

Augments raw2trace to avoid delaying a branch across a window change.

Augments the view tool to mark window changes and delay timestamp
output to group with the proper window (it is difficult to actually
reorder timestamp and window entries).

Augments the basic_counts tool to track and display per-window global
statistics.

Augments the invariant_checker tool to not complain on a control-flow
gap across a window.  Adds a test of this: but disables it for Windows
temporarily due to more emulation interoperability issues which #5390
covers.

Adds a simple online test and a simple offline test that just confirm
multiple windows are hit on simple_app.  Adds an assembly test with
precise values for the windows.

Issue: #3995, #5390
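
The "local counter only added to the global every 10K instructions" idea above can be sketched as follows (illustrative names and threshold, not the tracer's actual symbols):

```
// Sketch of flushing a thread-local instruction count into a global total only
// every ~10K instructions, so the common path pays no synchronization cost.
// All names here are illustrative.
#include <atomic>
#include <cstdint>

static constexpr uint64_t kFlushThreshold = 10000;
static std::atomic<uint64_t> global_instr_count{ 0 };

struct per_thread_t {
    uint64_t local_instr_count = 0;
};

// Called from instrumentation with the instruction count of the block just executed.
inline void
count_block_instrs(per_thread_t *pt, uint64_t block_instr_count)
{
    pt->local_instr_count += block_instr_count;
    if (pt->local_instr_count >= kFlushThreshold) {
        global_instr_count.fetch_add(pt->local_instr_count, std::memory_order_relaxed);
        pt->local_instr_count = 0;
    }
}

// The window-end check reads a slightly stale (bounded by the threshold) total.
inline bool
window_finished(uint64_t window_limit)
{
    return global_instr_count.load(std::memory_order_relaxed) >= window_limit;
}
```
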
@derekbruening
Contributor

derekbruening commented Mar 9, 2022

We have a number of follow-up cleanup/extension items. Perhaps some should be split into their own issues:

 * XXX i#3995: To implement -max_trace_size with drbbdup cases (with thread-private
 * encodings), or support nudges enabling tracing, or have a single -trace_for_instrs
 * transition to something lower-cost than counting, we will likely add a 3rd mode that
 * has zero instrumentation.  We also would use the 3rd mode for just -trace_for_instrs
 * with no -retrace_every_instrs.  For now we have just 2 as the case dispatch is more
 * efficient that way.

@derekbruening
Contributor

Pasting some design notes for this feature. Maybe this could turn into a design doc on the web page:

Design Point: Separate Traces v. Merged-with-Markers

Focusing on a use case of a series of 50 10-billion-instruction traces for a SPEC benchmark, there are two main ways to store them. We could create 50 independent sets of trace files, each with its own metadata and separate set of sharded data files. A simulator could either simulate all 50 separately and aggregate just the resulting statistics, or a single instance of a simulator could fast-forward between each sample to maintain architectural state and simulate the full execution that way.

The alternative is to store all the data in one set of data files, with metadata markers inserted to indicate the division points between the samples. This doesn’t support the separate simulation model up front, though we could provide an iterator interface that skips ahead to a target window and stops at the end of that window (or the simulator could be modified to stop when it sees a sample separation marker). However, this will not be as efficient for parallel simulation with separate simulator instances for each window, since the skipping ahead will take some time. This arrangement does more easily support the fast-forward single-simulator-instance approach, and more readily fits with upstream online simulation.

In terms of implementation, there are several factors to consider here.

Separate raw files

If we want separate final traces, the simplest approach at first glance is to produce a separate set of raw files for each tracing window. These would be post-processed separately and independently.

However, implementing this split does not fit well in the current interfaces. To work with other filesystems, we have separated out the i/o and in particular directory creation.

For upstream use with files on the local disk, we could add creation of a new directory (and duplication of the module file) for each window by the tracing thread that hits the end-of-window trigger. The other threads would each create a new output raw file each time they transitioned to a new window (see also the Proposal A discussion below).

Splitting during raw2trace

Alternatively, we could keep a single raw file for each thread and split it up into per-window final trace files during postprocessing by the raw2trace tool. We would use markers inserted at the window transition points to identify where to separate.

raw2trace would need to create a new output dir and duplicate the trace headers and module file. Like for separate raw files, this goes against the current i/o separation where today we pass in a list of all the input and output files up front and raw2trace never opens a file on its own, to better support proprietary filesystems with upstream code.

Another concern here is hitting file size limits with a single raw file across many sample traces. For the example above of 50 10-billion-instruction traces, if we assume an average of 2 dynamic instructions per raw entry, each window might contain 5GB of data, reaching 250GB for all 50. Furthermore, the final trace is even larger.

The file size problem gets worse if we use a constant sampling interval across SPEC2017. Some SPEC2017 benchmarks have many more instructions than others. The bwaves_s benchmark has 382 trillion instructions, so a constant interval might result in it having 50x more data than other benchmarks, exceeding the file size limit. A constant number of samples is preferred for this reason.

Splitting during analysis

Given the complexities of splitting in earlier steps, and given that we may want to use a single simulator instance to process all of the sample traces, and given that for upstream online analysis we will likely also have a single simulator instance: perhaps we should not try to split the samples and instead treat the 50 samples as a single trace with internal markers indicating the window division.

Online and offline analyzers can use window ID markers to fast-forward and align each thread to the next window. Maybe the existing serial iterator can have built-in support for aligning the windows.
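
For example, an analysis tool could fast-forward to one window by watching the window ordinal marker; the record and marker types below are simplified placeholders rather than the actual drmemtrace entries:

```
// Sketch of fast-forwarding an analysis to one window using the per-buffer
// window ordinal marker.  The record/marker types here are simplified
// placeholders, not the actual drmemtrace entries.
#include <cstdint>
#include <vector>

enum class record_kind { kInstr, kMemref, kMarkerWindowId, kOther };

struct trace_record {
    record_kind kind;
    uint64_t value; // For kMarkerWindowId this is the window ordinal.
};

// Invoke process() only on records belonging to target_window; skip earlier
// windows and stop as soon as a later window's marker is seen.
template <typename Callback>
void
analyze_window(const std::vector<trace_record> &trace, uint64_t target_window,
               Callback &&process)
{
    uint64_t current_window = 0;
    for (const trace_record &rec : trace) {
        if (rec.kind == record_kind::kMarkerWindowId) {
            current_window = rec.value;
            if (current_window > target_window)
                break; // Past the window of interest: done.
            continue;
        }
        if (current_window == target_window)
            process(rec);
    }
}
```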

If single-file final traces will exist, we would need to update all our existing analyzers to handle the gaps in the traces: reset state for function and callstack trackers; keep per-window numbers for statistics gatherers.

We can also create an analyzer that splits a final trace up if we do want separate traces.

Decision: Split during analysis

Separate files seem to be the most flexible and useful setup for our expected use cases, in particular parallel simulation. But given that separating early in the pipeline is complex, we'll split in the analysis phase, initially with a manual tool, since we do not plan to have automatically-gathered multi-window traces.

We’ll update some simple inspection and sanity tools (view, basic_counts, and invariant_checker) to handle and visualize windows, but we’ll assume that trace windows will be split before being analyzed by any more complex analysis tools. For online traces we will probably stick with multi-window-at-once.

We’ll create a tool to manually split up multi-window trace files.

Design Point: Continuous Control v. Re-Attach

One method of obtaining multiple traces is to repeat today’s bursts over and over, with a full detach from the application after each trace. However, each attach point is expensive, with long periods of profiling and code cache pre-population. While a scheme of sharing the profiling and perhaps code cache could be developed while keeping a full detach, a simpler approach is to remain in control but switch from tracing to instruction counting in between tracing windows. Instruction counting is needed to determine where to start the next window in any case.

Instruction counting through instrumentation is not cheap, incurring perhaps a 1.5x slowdown. Compared to the 50x overhead while tracing, however, it is acceptable. If lower overhead is desired in the future, a scheme using a full detach and using hardware performance counters to count instructions can be investigated. The decision for the initial implementation, however, is to use the simpler alternating tracing and counting instrumentation windows.

Design Point: Instrumentation Dispatch v. Flushing

As the prior section concluded, we plan to alternate between tracing and instruction counting.
There are two main approaches to varying instrumentation during execution: inserting all cases up front with a dispatch to the desired current scheme, and replacing instrumentation by flushing the system’s software code cache when changing schemes.

Flushing is an expensive process, and can be fragile as the lower-overhead forms of flushing open up race conditions between threads executing the old and new code cache contents. Its complexity is one reason we are deciding to instead use a dispatch approach for our initial implementation.

With dispatch, we insert both tracing and counting instrumentation for each block in the software code cache. Dispatch code at the top of the block selects which scheme to use. The current mode, either tracing or counting, is stored in memory and needs to be synchronized across all threads.

The simplest method of synchronizing the instrumentation mode is to store it in a global variable, have the dispatch code use a load-acquire to read it, and modify it with a store-release. There is overhead to a load-acquire at the top of every block, but experimentation shows that it is reasonable compared to the overhead of the instrumentation itself even for instruction counting mode, and its simplicity makes it our choice for the initial implementation.
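
In plain C++ terms the intent is the following (a sketch of the semantics only; in the real tracer the load-acquire is generated inline at the top of each block by the dispatch, not performed via a function call):

```
// Sketch of the global-mode synchronization described above.  This only shows
// the intended memory-ordering semantics of the dispatch and the mode switch.
#include <atomic>
#include <cstdint>

enum trace_mode_t : uintptr_t { MODE_COUNT = 0, MODE_TRACE = 1 };

static std::atomic<uintptr_t> tracing_mode{ MODE_COUNT };

// Conceptually executed at the top of each block: pick the instrumentation case.
inline uintptr_t
dispatch_mode()
{
    return tracing_mode.load(std::memory_order_acquire);
}

// Executed by whichever thread hits the window boundary: publish the new mode.
inline void
switch_mode(uintptr_t new_mode)
{
    tracing_mode.store(new_mode, std::memory_order_release);
}
```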

The mechanisms for creating the dispatch and the separate copies for the modes are provided for us by the upstream drbbdup library. This library was, however, missing some key pieces we had to add.

@derekbruening
Contributor

Handling Phase Transitions

For a normal memtrace burst, we completely detach from the server at the end of our desired trace duration. This detach process synchronizes with every application thread.

For multi-window traces, we are using multi-case dispatched instrumentation where we change the instrumentation type for each window. We have no detach to go through, wake up all threads, and have them flush their trace buffers, and we're deliberately trying to avoid a global synchronization point. Yet we would prefer perfect transitions between windows, whether that's separate raw files or accurately-placed markers.

Key step: Add end-of-block phase change check

We do flush prior to a syscall, so a thread at a kernel wait point should have an empty buffer and not be a concern.

The main concern is a thread not in a wait state that happens to not be scheduled consistently for a long time and so does not fill up its buffer until well after the window ends.

We can augment the current end-of-block flush check which today looks for the buffer being full. We can add a check for the prior window having ended, by having a global window ordinal and storing its value per thread at the start of filling up a new buffer. (This is better than simply checking the drbbdup mode value for being in non-tracing mode as that will not catch a double mode change.) If the prior window has ended, we can flush the buffer, or simply add a marker, depending on the scheme (see below).
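
A sketch of that end-of-block check, with illustrative names rather than the tracer's actual fields:

```
// Sketch of the end-of-block check described above: flush if the buffer is full
// or if the window in which it started filling has since ended.  Names are
// illustrative, not the tracer's actual fields.
#include <atomic>
#include <cstdint>

static std::atomic<uint64_t> global_window_ordinal{ 0 };

struct tls_buffer_t {
    uint8_t *cur;                    // Next free slot in this thread's buffer.
    uint8_t *end;                    // End of the buffer.
    uint64_t window_at_buffer_start; // Recorded when the buffer starts filling.
};

inline bool
should_flush_at_block_end(const tls_buffer_t *buf)
{
    if (buf->cur >= buf->end)
        return true; // Buffer full: the pre-existing trigger.
    // New trigger: the window changed since this buffer started filling.
    // Comparing ordinals (rather than checking "are we in non-tracing mode?")
    // also catches a double mode change that lands back in tracing mode.
    return global_window_ordinal.load(std::memory_order_acquire) !=
        buf->window_at_buffer_start;
}
```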

A thread that receives a signal mid-block (it would have to be a synchronous signal as DR waits until the end of the block for asynchronous) will skip its end-of-block checks and redirect to run the app's signal handler: but it would hit the checks for the first block of the handler.

The worst case inaccuracy here is a thread that starts writing in window N but ends up unscheduled until a much later window M. But at most one basic block's worth of trace entries will be placed into window N even though it happened later. Thus we have "basic block accuracy", which is pretty good, as typically a basic block only contains a few instructions.

Proposal A: Separate raw files split at flush time

If we're splitting raw files (see above), we would use the end-of-block window-change flush to emit a thread exit and create a new file. In post-processing, we'd add any missing thread exits to threads that don't have them, to cover waiting threads that never reached a flush.

As discussed above, the trigger thread would create a new directory for each window. A just-finished buffer is written to the directory corresponding to the window for its start point.

A thread that is unscheduled for a long time could have a nearly-full buffer that is not written out until many windows later, but it would be written to the old directory for the old window. The next buffer would go to a new file in the new window, with no files in the in-between window directories.

(Originally we thought this scheme would have buffer-level inaccuracy (and talked about using timestamps at the start and end of each buffer to detect): but that would only be if it wrote out to the current window dir.)

Proposal B: Label buffers with owning window

If we add the window ordinal to every buffer header, we can identify which window they belong to, and avoid the need to separate raw files. A window-end flush ensures a buffer belongs solely to the window identified in its header; the next buffer will have the new window value.

This scheme can be used with file splitting during raw2trace, or waiting until analysis. Each thread has one raw file which contains all windows during the execution.

Proposal C: Trigger thread identifies buffer transition point of the other threads

For this proposal, the thread triggering the end of the window walks the other threads and identifies the phase transition point inside the buffer, telling the post-processor where to split them.

I considered having the triggerer also flush the buffers, but that is challenging due to a race with the owner also flushing. Plus, it still requires post-processing help to identify the precise point for splitting the buffer (without synchronization the triggerer can only get close).

To avoid barriers on common case trace buffer writes, we use a lazy scheme where the triggerer does not modify the trace buffers themselves, but instead marks which portion has been written using a separate variable never accessed in a fastpath.

Implementation:

  • The tracer maintains a global list of thread buffers using a global mutex on thread init and exit.

  • Each trace buffer has a corresponding externally_written variable holding a distance into the buffer that was written out by another thread.

  • On hitting the trace window endpoint threshold, the triggering thread grabs the mutex and walks the buffers.

    The triggerer doesn't have the current buffer position pointer. Instead it walks the buffer until it reaches zeroed memory (we zero the buffer after each flush). We have no synchronization with the owning thread: but observing writes out of order should be ok since we'll just miss one by stopping too early. We need to fix things up in post-processing in any case, because we need the phase transition to be at a clean point (we can't identify that point from the triggerer: if we end at an instr entry, we don't know if some memrefs are coming afterward or not). In post-processing we adjust that position to the end of the block, and we split the buffer contents around that point to the neighboring traces.

    The triggerer does a store-release of the furthest-written point into the externally_written variable.

  • Whenever a thread writes out its buffer, it does a load-acquire on the externally_written variable and if it is non-zero it writes out a marker in the buffer header. Post-processing reads the marker and uses it to split the buffer at the nearest block boundary after the marker value.

  • If windows are small enough that the triggerer doesn't complete its buffer walk before a new window starts: other thread buffers may completely go into the new window. That seems ok: if the windows are that small, in the absence of application synchronization the resulting window split should be a possible thread ordering.

This scheme ends up with block-level accuracy since the trigger thread's marked transition point must be adjusted to essentially a block boundary in post-processing. Thus, it does not seem any better than the other schemes, and it is more complex.

Online Traces

It makes sense for offline to treat each window trace as separate and simulate them separately (though aggregating the results to cover the whole application).

But for online: we have the same instance of the simulator or analysis tool throughout the whole application run. It will get confused if it throws away thread bookkeeping on a thread exit for a window.

Either we have a window-controller simulator that spins up and tears down a new instance of the real target simulator/tool on each window, or we introduce new "end phase/start phase" markers. If we have split offline traces, those markers would only be for online, though, which does not sound appealing. Simulators/tools would need special handling for them: reporting statistics for the phase while continuing to aggregate for a multi-phase report, or something along those lines.

We might want combined files for offline too, as discussed above. That would unify the two, which is appealing.

@derekbruening
Contributor

For x86 we also want to eliminate counting of non-fetched rep string instructions, which make the instruction counts used for the windows and gaps not match what the PMU will say: this is #4948.

derekbruening added a commit that referenced this issue Mar 10, 2022
…5408)

Adds a new design document discussing decisions and tradeoffs for the
multi-window trace feature.

Issue: #3995
derekbruening added a commit that referenced this issue Apr 6, 2022
Adds a drmemtrace feature under a new on-by-default -split_windows
option to create a separate subdirectory with a separate set of raw
files per traced window.  This avoids disk space issues with a single
file, and splitting at the raw stage is relatively simple for regular
drmemtrace usage (though not as simple for external users of the file
i/o redirection).

Files in raw/window.NNNN/ subdirectories are mirrored in
trace/window.NNNN/ subdirectories upon being post-processed.
Post-processing handles just the first window by default; the others
must be explicitly passed as input directories in separate
post-processing invocations.

This changes the non-window behavior to not create an output file
until tracing starts, which necessitated changing the
tool.drcacheoff.delay-func test to check for no output files as a
slightly different type of test.

Adds a test of split-file offline windows.

Fixes an infinite loop bug in raw2trace hit when a file is truncated:
hit while the windows were buggy and missing footers.

Issue: #3995
derekbruening added a commit that referenced this issue May 9, 2022
Fixes bugs where a huge number of files are opened in the first window
of a multi-window drmemtrace run, and extra headers are added as well.

Augments the windows-split test to run long enough to reproduce this
problem, and adds to its output to detect errors in post-processing.

Issue: #3995
edeiana added a commit that referenced this issue Aug 21, 2024
Currently we can only perform window-tracing at regular intervals
on pre-compiled binaries using the -trace_after_instrs, -trace_for_instrs,
-retrace_every_instrs options.
Sometimes it's useful to trace windows of different sizes at irregular
instruction intervals (e.g., for tracing simpoints).

We do so by introducing a new option:
```
-trace_instr_intervals_file path/to/instr/intervals.csv
```
which takes a CSV file where every line has <start,duration> pairs
representing intervals in terms of number of instructions.

The implementation relies on the same window-tracing mechanism used by
-trace_after_instrs, -trace_for_instrs, -retrace_every_instrs.
We add a level of indirection to obtain the values of these options
through the `get_trace_after_instrs_value()`, `get_trace_for_instrs_value()`,
and `get_retrace_every_instrs_value()` functions. This allows us to change
the returned value of these options depending on the window we are tracing
at that point. We do so using two global, read-only vectors containing
the to-trace/to-not-trace number of instructions, and an atomic (also
global) index that we increment every time we finish tracing a window to
go to the next one in the two vectors.

We add a new end-to-end test: tool.drcachesim.irregular-windows-simple.

Issue #3995
edeiana added a commit that referenced this issue Sep 17, 2024
Currently we can only perform window-tracing at regular intervals
on binaries using the -trace_after_instrs, -trace_for_instrs,
-retrace_every_instrs options.
Sometimes it's useful to trace windows of different sizes at irregular
instruction intervals (e.g., for tracing simpoints).

We do so by introducing a new option:
```
-trace_instr_intervals_file path/to/instr/intervals.csv
```
which takes a CSV file where every line has <start,duration> pairs
representing intervals in terms of number of instructions.

The implementation relies on the same window-tracing mechanism
used by -trace_after_instrs, -trace_for_instrs, -retrace_every_instrs.
We add a level of indirection to obtain the values of these options
through the `get_initial_no_trace_for_instrs_value()`,
`get_current_trace_for_instrs_value()`, and
`get_current_no_trace_for_instrs_value()` functions respectively.
This allows us to change the returned value of these options depending
on the window we are tracing at that point.
We do so using a global, read-only vector `irregular_windows_list`
containing the trace_for_instrs/no_trace_for_instrs window, and an
atomic (also global) index `irregular_window_idx` that we increment
every time we finish tracing a window to go to the next one in the vector.

We add a new end-to-end test: tool.drcachesim.irregular-windows-simple.

Issue #3995
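
A rough sketch of the bookkeeping described in these commit messages, using the vector/index names from the message; the parsing helper and struct layout are illustrative assumptions (it also assumes the CSV intervals are sorted and non-overlapping):

```
// Sketch of the irregular-window bookkeeping: parse "<start,duration>" CSV lines
// into per-window gap/trace instruction counts, kept in a read-only vector with an
// atomic index advanced at each window end.  The vector/index names come from the
// commit message above; everything else is illustrative.
#include <atomic>
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

struct irregular_window_t {
    uint64_t no_trace_for_instrs; // Instructions to skip before this window.
    uint64_t trace_for_instrs;    // Instructions to trace in this window.
};

static std::vector<irregular_window_t> irregular_windows_list; // Read-only after startup.
static std::atomic<size_t> irregular_window_idx{ 0 }; // Bumped when a window finishes.

// Assumes the intervals in the file are sorted and non-overlapping.
static bool
parse_intervals_file(const std::string &path)
{
    std::ifstream file(path);
    std::string line;
    uint64_t prev_end = 0;
    while (std::getline(file, line)) {
        std::istringstream ss(line);
        std::string start_str, duration_str;
        if (!std::getline(ss, start_str, ',') || !std::getline(ss, duration_str, ','))
            return false;
        uint64_t start = std::stoull(start_str);
        uint64_t duration = std::stoull(duration_str);
        irregular_windows_list.push_back({ start - prev_end, duration });
        prev_end = start + duration;
    }
    return !irregular_windows_list.empty();
}

// Option-query indirection: report the trace length for the current window.
static uint64_t
get_current_trace_for_instrs_value()
{
    size_t idx = irregular_window_idx.load(std::memory_order_acquire);
    return idx < irregular_windows_list.size()
        ? irregular_windows_list[idx].trace_for_instrs
        : 0;
}
```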