Collect multiple trace samples from a single benchmark run #3995
Thank you for the request. I guess one quick solution is to use a script that runs the benchmark multiple times with updated param values for -trace_after_instrs. However, I do understand this is not ideal for long benchmarks. Would be happy to look at a PR if you wish to contribute the functionality. |
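The multi-run workaround suggested above could be scripted roughly as follows. This is a minimal sketch, not part of the project: the helper name `build_trace_commands`, the application path, and the start offsets are hypothetical, and only options that appear elsewhere in this thread (`-t drcachesim`, `-offline`, `-trace_after_instrs`) are used.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Builds one drrun command line per desired start offset.
// Hypothetical helper: the drrun path, app path, and offsets are placeholders.
std::vector<std::string>
build_trace_commands(const std::string &app, const std::vector<uint64_t> &start_instrs)
{
    std::vector<std::string> cmds;
    for (uint64_t start : start_instrs) {
        // Each run delays tracing until 'start' instructions have executed.
        cmds.push_back("bin64/drrun -t drcachesim -offline -trace_after_instrs " +
                       std::to_string(start) + " -- " + app);
    }
    return cmds;
}
```

Each generated command re-executes the benchmark from the beginning, which is why this approach scales poorly for long benchmarks.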
Presumably the approach of recording a single long trace covering all desired execution windows and then splitting it up into pieces offline will not work due to such a trace simply being too large to easily store? |
That is right. We see pretty large trace sizes and run into disk space issues. |
Xref an existing feature request to use annotations added to the application to delineate phase regions and have the tracer recognize the annotations and enable/disable recording at the boundaries: #2478. Also note that another method of creating multiple traces from one execution is to insert start/stop commands into the application. This is well-supported today, in particular with static linkage of the tracer into the application, and we have a number of regression tests of this. E.g., see https://github.com/DynamoRIO/dynamorio/blob/master/clients/drcachesim/tests/burst_static.cpp Those two approaches both require modifying the application. This issue here covers specifying boundaries on an unmodified application. |
@prasun3 -- would your use case prefer to modify the application with annotations in the source code to delineate precise tracing regions? Or would you prefer this feature as filed, to trace a certain number of instructions without regard to any corresponding application phases or code boundaries? |
We would prefer this approach -- based on instruction count. |
Xref #3107 as another proposal for delimiting tracing regions |
Converts the existing -trace_after_instrs delayed tracing feature to use the drbbdup multi-instrumentation library with two cases: counting instructions, and full tracing. The drbbdup case encoding is a global std::atomic, written to using language features which are lock-free and safe for client use. This will lay the groundwork for the full i#3995 feature of repeatedly swapping between the two cases. TODO: For function tracing, we need to invoke drwrap only for the full tracing case and not for the instruction counting case. The plan is to add a drwrap mode where drwrap does not use its own insertion event and instead the user invokes drwrap from its insertion event. TODO: For AArch64, drbbdup needs to handle reachability and encoding issues with loading the global case value into a register. Issue: #3995
Adds new options -trace_for_instrs and -retrace_every_instrs to drcachesim for periodic trace bursts of an unmodified application. TODO: Implement these using the new drbbdup framework by repeatedly alternating among the cases. Issue: #3995
@derekbruening I found this thread while looking for a way to do periodic tracing in a single run. Best, |
Right, there were still a number of issues here and it ended up de-prioritized and so was not finished. First I'm going to dump my notes from a year ago:

TODO use i#4134 drbbdup to swap bet instru [6/11]

DONE app2app used to pass user_data w/ info on repstr to analysis => storing in TLS
CLOSED: [2020-04-19 Sun 22:40]

DONE drmgr_is_first_nonlabel_instr => John added drbbdup_is_first_nonlabel_instr()
CLOSED: [2020-04-19 Sun 22:40]
Could deduce it in orig analysis cb and pass it through.

TODO drwrap interactions: priority, and how do drwrap for one case but not another?
Another issue: drcachesim needs its insert instrumentation callback to go
I'm not sure I can work around this one. The app2app I could move the user
I guess this raises a larger issue: we want the drwrap clean call to only
How to solve? If drbbdup were integrated inside drmgr: would non-bbdup-aware users of
Or, drwrap has to turn its control model around and have the user call
Leaning toward the latter: add a drwrap global flag.

DONE let user provide TLS memop for encoding to avoid mem2mem move
CLOSED: [2020-04-19 Sun 23:20]
Looks like this today (here coming from non-TLS but I would change that):
Either I provide TLS opnd, or I can query the TLS offset drbbdup is using
Wait, actually I can just write directly to it: assuming the slot isn't
Maybe drbbdup can document that it guarantees to not write to the slot

DONE barrier on load for cmp in every bb? or send signal to every thread: is only API for that dr_suspend_all_other_threads?? => atomic load, though NYI inside drbbdup for non-x86
CLOSED: [2020-04-19 Sun 23:20]

The issue for a globally changing case is memory visibility. If I make my encoding address a single global, I would need a barrier here for aarchxx.

The alternative is to have the encoding be TLS and when the value changes go force all the threads to update their values -- although I'm not sure how to do that. If I could run code as each thread I could have each go do an acquire load. I can do that on UNIX by sending a signal. I suppose I could do NtSetContextThread on Windows twice to run my code sequence. I would want to add DR API support for those. dr_suspend_all_other_threads() could do it I guess by setting the mcontext and then resuming -- but there's no per-thread control point to restore w/o having a wait point and another suspend-the-world.

I don't know how performance of a barrier at the top of every block compares to a heavyweight interrupt-all-threads-when-change approach. Certainly the barrier is much much easier to implement (esp when #4215 is done).

How do you envision a typical use case changing the encoding? Are all the use cases you're thinking of wanting local changes where only the current thread changes its own encoding independently of all others? Should there then be an option for whether to use an acquire load or a regular load? (This is why I was proposing a register interface before.)
DONE calling drbbdup_register_case_encoding and passing the default encoding: it lets me make a duplicate, and calls both => John fixed
CLOSED: [2020-04-19 Sun 23:21]

DONE labels not preserved from analysis to insert phase => now preserved
CLOSED: [2020-04-19 Sun 22:45]
Used for elision, and same func used in raw2trace so don't want special iterator

TODO add for_trace, translating, and dr_emit_flags_t to cb's to avoid needing _ex versions later for complex users?

TODO runtime_case_opnd problems

TODO redundant spill-restore code
After dispatching to BBDUP_MODE_COUNT, the flags are restored and
It's b/c drbbdup doesn't use drreg for that flags restore.

TODO can't use for existing code w/o aarch support!
Need barrier to read encoding for aarch

TODO use drbbdup for regular instru too? is there overhead w/ only 1 case?
Can tell it not to duplicate at all on a per-bb basis.

TODO i#4226: problem: drbbdup has fragment_deleted event!
ext/drbbdup/drbbdup.c: dr_register_delete_event(deleted_frag);

TODO re-implement delayed tracing using drbbdup too

TODO function tracing: needs drwrap changes discussed above
For function tracing, we need to invoke drwrap only for the full tracing

TODO need AArch64 drbbdup reachability and encoding issues with loading the global case value into a register

TODO Set opts.dup_limit to 1 as noted by John

TODO Set event_bb_analyze_orig and event_bb_analyze_orig_cleanup to NULL because they do nothing. Same goes for event_bb_retrieve_mode. (Again as noted by John.)

TODO change -max_trace_size to swap to instr-count instead of continuing to trace when limit is reached?

TODO support a config file for variable lengths of each burst?

TODO measure perf: ensure no overhead w/ only 1 case

TODO measure perf vs flush for delay |
So I started on this a year ago and as the two commit messages show I started on the first step of refactoring the existing delayed tracing to use drbbdup. At the time, drbbdup was just being developed, and I shared the branch https://github.com/DynamoRIO/dynamorio/tree/i3995-multi-burst for discussions with the drbbdup author @johnfxgalea who FTR had these comments which I do not think were acted upon yet on my side:
After this the feature was de-prioritized and other work took precedence. For the refactoring step of having the existing delay use drbbdup:
And then:
Un-assigning to me for now as I am not sure when I would have time for it. If someone else wants to pick it up that would be appreciated. |
Reading some of the stuff had me wondering: with the bbdup mechanism would it be possible to implement switching between different tracers too? For example, we could collect an "L0_filter" trace (which could be used for cache warmup) and then switch to a full instruction trace? |
@prasun3 That is a pretty good idea! In theory, it should be possible but not sure about the technical effort required. Personally, I created drbbdup for research on taint analysis, so much of my investigation on the approach focuses on that application. I'd be happy to take the initiative in taking on this PR, but all my time available for DynamoRIO maintenance is being spent on drreg atm. |
Is there any sort of timeline for the DrCacheSim flags -trace_for_instrs and -retrace_every_instrs to be implemented? Thank you. |
Please go ahead and pick up the branch where it was left if you are interested in this feature -- as the comments above note it is not clear someone else will have time to take this on. |
We have renewed interest in this and may revive it: first checking whether anyone else has put work into this that was not pushed to a branch? I didn't see any pushes beyond my initial work from before. We're looking at two features:
One final idea: for long periods of no tracing: detach to native and use PMU to count instrs for re-attach, for very low overhead when not tracing. |
Converts the existing -trace_after_instrs delayed tracing feature to use the drbbdup multi-instrumentation library with two cases: counting instructions, and full tracing. It also uses the drwrap support for drbbdup via its "control inversion".

Removes the #4893 workaround where function tracing via drwrap could not be delayed and the tracer instead discarded the data. Now we have proper delaying.

The drbbdup case encoding is a global std::atomic, written to using language features which are lock-free and safe for client use.

This will lay the groundwork for the full i#3995 feature of repeatedly swapping between the two cases.

Tested the #4893 removal on a small app that calls "malloc":

    $ bin64/drrun -t drcachesim -record_heap -offline -- ~/dr/test/mprot && bin64/drrun -t drcachesim -indir $(ls -1td drmem*.dir | head -1) -simulator_type basic_counts
    drmemtrace exiting process 1161557; traced 62158 references.
    Total counts:
    135566 total (fetched) instructions
    ...
    4 total function id markers
    2 total function return address markers
    2 total function argument markers
    2 total function return value markers
    $ bin64/drrun -t drcachesim -trace_after_instrs 10M -record_heap -offline -- ~/dr/test/mprot && bin64/drrun -t drcachesim -indir $(ls -1td drmem*.dir | head -1) -simulator_type basic_counts
    drmemtrace exiting process 1161726; traced 1 references.
    Basic counts tool results:
    Total counts:
    0 total (fetched) instructions
    ...
    0 total function id markers
    0 total function return address markers
    0 total function argument markers
    0 total function return value markers

The drstatecmp-fuzz test was overly sensitive to the delay trigger: de-flaked it by removing the "Generate" output so the trigger point does not vary as much based on the build and trigger technique, to avoid failure with this change.

Issue: #3995, #4893
The branch https://github.com/DynamoRIO/dynamorio/tree/i3995-multi-burst has been subsumed by PR #5393 so I am deleting it now. |
Adds new options -trace_for_instrs and -retrace_every_instrs to drcachesim for periodic trace bursts of an unmodified application. Implements them by adapting the existing drbbdup cases for switching between -trace_after_instrs and full tracing. Adds documentation on the new options.

Adds instru_t::get_instr_count to count instructions while tracing, to know when a tracing burst window is finished. Uses a local counter only added to the global every 10K instructions to avoid synchronization costs.

Adds a new marker with the ordinal of the trace window. This marker is added to each buffer header. This, combined with a new check for the window having changed to ensure a buffer dump at the end of each block, limits the possible window drift to one block's worth of data.

Augments raw2trace to avoid delaying a branch across a window change.

Augments the view tool to mark window changes and delay timestamp output to group with the proper window (it is difficult to actually reorder timestamp and window entries).

Augments the basic_counts tool to track and display per-window global statistics.

Augments the invariant_checker tool to not complain on a control-flow gap across a window.

Adds a test of this: but disables it for Windows temporarily due to more emulation interoperability issues which #5390 covers.

Adds a simple online test and a simple offline test that just confirm multiple windows are hit on simple_app. Adds an assembly test with precise values for the windows.

Issue: #3995, #5390
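The batched counter described in the commit message can be illustrated with a small standalone sketch. It is not the drmemtrace code: the names `thread_counter_t`, `count_block`, and `global_instr_count` are hypothetical, and only the 10K threshold comes from the text above.

```cpp
#include <atomic>
#include <cstdint>

// Threshold taken from the commit message ("every 10K instructions").
constexpr uint64_t kFlushThreshold = 10000;

// Shared count, touched only when a thread publishes its batch.
std::atomic<uint64_t> global_instr_count{0};

// Hypothetical per-thread state; a real tracer would keep this in TLS.
struct thread_counter_t {
    uint64_t local_count = 0;
};

// Adds one block's instruction count to the thread-local counter and
// publishes the batch to the global once the threshold is reached.
// Returns true if the global counter was updated.
bool
count_block(thread_counter_t &tc, uint64_t instrs_in_block)
{
    tc.local_count += instrs_in_block;
    if (tc.local_count < kFlushThreshold)
        return false;
    global_instr_count.fetch_add(tc.local_count, std::memory_order_release);
    tc.local_count = 0;
    return true;
}
```

The trade-off is that the global value can lag each thread's true count by up to one threshold's worth of instructions, in exchange for touching the shared cache line far less often.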
We have a number of follow-up cleanup/extension items. Perhaps some should be split into their own issues:
|
Adds new options -trace_for_instrs and -retrace_every_instrs to drcachesim for periodic trace bursts of an unmodified application. Implements them by adapting the existing drbbdup cases for switching between -trace_after_instrs and full tracing. Adds documentation on the new options.

Adds instru_t::get_instr_count to count instructions while tracing, to know when a tracing burst window is finished. Uses a local counter only added to the global every 10K instructions to avoid synchronization costs.

Adds a new marker with the ordinal of the trace window. This marker is added to each buffer header. This, combined with a new check for the window having changed to ensure a buffer dump at the end of each block, limits the possible window drift to one block's worth of data.

Augments raw2trace to avoid delaying a branch across a window change.

Augments the view tool to mark window changes and delay timestamp output to group with the proper window (it is difficult to actually reorder timestamp and window entries).

Augments the basic_counts tool to track and display per-window global statistics.

Augments the invariant_checker tool to not complain on a control-flow gap across a window.

Adds a test of this: but disables it temporarily due to more emulation interoperability issues which #5390 covers.

Adds a simple online test and a simple offline test that just confirm multiple windows are hit on simple_app. Adds an assembly test with precise values for the windows.

Issue: #3995, #5390
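To make the per-window statistics idea concrete, here is a minimal sketch of how an analysis tool might bucket instruction counts by window ordinal. The record layout (`record_t`, `record_kind_t`) is a simplified stand-in invented for illustration and does not match the real drmemtrace trace entry format.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Simplified, hypothetical record kinds: an instruction fetch or a marker
// carrying the ordinal of the trace window that follows it.
enum class record_kind_t { INSTRUCTION, WINDOW_MARKER };

struct record_t {
    record_kind_t kind;
    uint64_t value; // Window ordinal for WINDOW_MARKER; unused for INSTRUCTION.
};

// Returns the number of instruction records seen in each window.
// Records before any marker are attributed to window 0.
std::map<uint64_t, uint64_t>
count_instrs_per_window(const std::vector<record_t> &records)
{
    std::map<uint64_t, uint64_t> counts;
    uint64_t cur_window = 0;
    for (const record_t &rec : records) {
        if (rec.kind == record_kind_t::WINDOW_MARKER)
            cur_window = rec.value;
        else
            ++counts[cur_window];
    }
    return counts;
}
```

A tool that also keeps stateful trackers (call stacks, branch targets) would reset them at the same point where `cur_window` changes, since execution is missing between windows.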
Pasting some design notes for this feature. Maybe this could turn into a design doc on the web page:

Design Point: Separate Traces v. Merged-with-Markers

Focusing on a use case of a series of 50 10-billion-instruction traces for a SPEC benchmark, there are two main ways to store them. We could create 50 independent sets of trace files, each with its own metadata and separate set of sharded data files. A simulator could either simulate all 50 separately and aggregate just the resulting statistics, or a single instance of a simulator could fast-forward between each sample to maintain architectural state and simulate the full execution that way.

The alternative is to store all the data in one set of data files, with metadata markers inserted to indicate the division points between the samples. This doesn’t support the separate simulation model up front, though we could provide an iterator interface that skips ahead to a target window and stops at the end of that window (or the simulator could be modified to stop when it sees a sample separation marker). However, this will not be as efficient for parallel simulation with separate simulator instances for each window, since the skipping ahead will take some time. This arrangement does more easily support the fast-forward single-simulator-instance approach, and more readily fits with upstream online simulation.

In terms of implementation, there are several factors to consider here.

Separate raw files

If we want separate final traces, at first the simplest approach is to produce a separate set of raw files for each tracing window. These would be post-processed separately and independently. However, implementing this split does not fit well in the current interfaces. To work with other filesystems, we have separated out the i/o and in particular directory creation.
For upstream use with files on the local disk, we could add creation of a new directory (and duplication of the module file) for each window by the tracing thread that hits the end-of-window trigger. The other threads would each create a new output raw file each time they transitioned to a new window (see also the Proposal A discussion below).

Splitting during raw2trace

Alternatively, we could keep a single raw file for each thread and split it up into per-window final trace files during postprocessing by the raw2trace tool. We would use markers inserted at the window transition points to identify where to separate. raw2trace would need to create a new output dir and duplicate the trace headers and module file. Like for separate raw files, this goes against the current i/o separation where today we pass in a list of all the input and output files up front and raw2trace never opens a file on its own, to better support proprietary filesystems with upstream code.

Another concern here is hitting file size limits with a single raw file across many sample traces. For the example above of 50 10-billion-instruction traces, if we assume an average of 2 dynamic instructions per raw entry, each window might contain 5GB of data, reaching 250GB for all 50. Furthermore, the final trace is even larger.

The file size problem gets worse if we use a constant sampling interval across SPEC2017. Some SPEC2017 benchmarks have many more instructions than others. The bwaves_s benchmark has 382 trillion instructions, so a constant interval might result in it having 50x more data than other benchmarks, exceeding the file size limit. A constant number of samples is preferred for this reason.
Splitting during analysis

Given the complexities of splitting in earlier steps, and given that we may want to use a single simulator instance to process all of the sample traces, and given that for upstream online analysis we will likely also have a single simulator instance: perhaps we should not try to split the samples and instead treat the 50 samples as a single trace with internal markers indicating the window division.

Online and offline analyzers can use window ID markers to fast-forward and align each thread to the next window. Maybe the existing serial iterator can have built-in support for aligning the windows.

If single-file final traces will exist, we would need to update all our existing analyzers to handle the gaps in the traces: reset state for function and callstack trackers; keep per-window numbers for statistics gatherers.

We can also create an analyzer that splits a final trace up if we do want separate traces.

Decision: Split during analysis

Separate files seems to be the most flexible and useful setup for our expected use cases, in particular parallel simulation. But given that separating early in the pipeline is complex, we’ll split in the analysis phase, initially with a manual tool since we do not plan to have automatically-gathered multi-window traces.

We’ll update some simple inspection and sanity tools (view, basic_counts, and invariant_checker) to handle and visualize windows, but we’ll assume that trace windows will be split before being analyzed by any more complex analysis tools. For online traces we will probably stick with multi-window-at-once.

We’ll create a tool to manually split up multi-window trace files.

Design Point: Continuous Control v. Re-Attach

One method of obtaining multiple traces is to repeat today’s bursts over and over, with a full detach from the application after each trace. However, each attach point is expensive, with long periods of profiling and code cache pre-population.
While a scheme of sharing the profiling and perhaps code cache could be developed while keeping a full detach, a simpler approach is to remain in control but switch from tracing to instruction counting in between tracing windows. Instruction counting is needed to determine where to start the next window in any case.

Instruction counting through instrumentation is not cheap, incurring perhaps a 1.5x slowdown. Compared to the 50x overhead while tracing, however, it is acceptable. If lower overhead is desired in the future, a scheme using a full detach and using hardware performance counters to count instructions can be investigated. The decision for the initial implementation, however, is to use the simpler alternating tracing and counting instrumentation windows.

Design Point: Instrumentation Dispatch v. Flushing

As the prior section concluded, we plan to alternate between tracing and instruction counting.

Flushing is an expensive process, and can be fragile as the lower-overhead forms of flushing open up race conditions between threads executing the old and new code cache contents. Its complexity is one reason we are deciding to instead use a dispatch approach for our initial implementation.

With dispatch, we insert both tracing and counting instrumentation for each block in the software code cache. Dispatch code at the top of the block selects which scheme to use. The current mode, either tracing or counting, is stored in memory and needs to be synchronized across all threads.

The simplest method of synchronizing the instrumentation mode is to store it in a global variable, have the dispatch code use a load-acquire to read it, and modify it with a store-release. There is overhead to a load-acquire at the top of every block, but experimentation shows that it is reasonable compared to the overhead of the instrumentation itself even for instruction counting mode, and its simplicity makes it our choice for the initial implementation.
The mechanisms for creating the dispatch and separate copies for the modes are provided for us by the upstream drbbdup library. This library was, however, missing some key pieces we had to add. |
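The global-mode synchronization described in these notes (a store-release by the thread that changes mode, a load-acquire in the dispatch at the top of every block) can be modeled in plain C++. This is only an analogy: in the real tool the dispatch is inlined machine code generated through drbbdup, and every name below (`tracing_mode`, `set_mode`, `run_block`, `thread_stats_t`) is hypothetical.

```cpp
#include <atomic>
#include <cstdint>

// The two instrumentation cases discussed above.
enum instru_mode_t : uintptr_t { MODE_COUNT = 0, MODE_TRACE = 1 };

// Single global holding the current case, shared by all threads.
std::atomic<uintptr_t> tracing_mode{ MODE_COUNT };

// Hypothetical per-thread tallies standing in for the real instrumentation.
struct thread_stats_t {
    uint64_t counted;
    uint64_t traced;
};

// Called by whichever thread hits a window boundary.
void
set_mode(instru_mode_t mode)
{
    tracing_mode.store(mode, std::memory_order_release);
}

// Models the dispatch at the top of a block: read the mode once with
// acquire semantics, then run exactly one copy of the instrumentation.
void
run_block(thread_stats_t &stats, uint64_t instrs_in_block)
{
    if (tracing_mode.load(std::memory_order_acquire) == MODE_TRACE)
        stats.traced += instrs_in_block;
    else
        stats.counted += instrs_in_block;
}
```

Because the mode is read once per block, a thread that is mid-block when the mode changes finishes that block under the old mode, which matches the block-level granularity discussed elsewhere in these notes.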
Handling Phase Transitions

For a normal memtrace burst, we completely detach from the server at the end of our desired trace duration. This detach process synchronizes with every application thread.

For multi-window traces, we are using multi-case dispatched instrumentation where we change the instrumentation type for each window. We have no detach to go through and wake up all threads and have them flush their trace buffers and we're deliberately trying to avoid a global synchronization point. Yet we would prefer perfect transitions between windows, whether that's separate raw files or accurately-placed markers.

Key step: Add end-of-block phase change check

We do flush prior to a syscall, so a thread at a kernel wait point should have an empty buffer and not be a concern.

The main concern is a thread not in a wait state that happens to not be scheduled consistently for a long time and so does not fill up its buffer until well after the window ends.

We can augment the current end-of-block flush check which today looks for the buffer being full. We can add a check for the prior window having ended, by having a global window ordinal and storing its value per thread at the start of filling up a new buffer. (This is better than simply checking the drbbdup mode value for being in non-tracing mode as that will not catch a double mode change.) If the prior window has ended, we can flush the buffer, or simply add a marker, depending on the scheme (see below).

A thread that receives a signal mid-block (it would have to be a synchronous signal as DR waits until the end of the block for asynchronous) will skip its end-of-block checks and redirect to run the app's signal handler: but it would hit the checks for the first block of the handler.

The worst case inaccuracy here is a thread who starts writing in window N but ends up unscheduled until a much later window M. But at most one basic block's worth of trace entries will be placed into window N even though it happened later.
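The augmented end-of-block check described above can be sketched as follows. This is a simplified model under stated assumptions, not the tracer's code: `thread_buffer_t`, `begin_buffer`, `should_flush_at_block_end`, and `global_window` are hypothetical names, and the real check is emitted as inline instrumentation.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Global window ordinal, incremented by the thread that ends a window.
std::atomic<uint64_t> global_window{ 0 };

// Hypothetical per-thread buffer bookkeeping.
struct thread_buffer_t {
    uint64_t buffer_window = 0; // Ordinal recorded when this buffer was started.
    size_t used = 0;            // Bytes written so far.
    size_t capacity = 0;        // Buffer size in bytes.
};

// Called when a thread starts filling a fresh buffer: remember which
// window the buffer belongs to.
void
begin_buffer(thread_buffer_t &buf)
{
    buf.buffer_window = global_window.load(std::memory_order_acquire);
    buf.used = 0;
}

// End-of-block check: flush if the buffer is full (the existing check)
// or if the window ordinal has moved on since the buffer was started.
bool
should_flush_at_block_end(const thread_buffer_t &buf)
{
    bool full = buf.used >= buf.capacity;
    bool window_changed =
        global_window.load(std::memory_order_acquire) != buf.buffer_window;
    return full || window_changed;
}
```

Comparing ordinals rather than the current mode is what catches the double mode change: after tracing, counting, and tracing again the mode looks unchanged, but the ordinal has advanced.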
Thus we have "basic block accuracy", which is pretty good, as typically a basic block only contains a few instructions.

Proposal A: Separate raw files split at flush time

If we're splitting raw files (see above), we would use the end-of-block window-change flush to emit a thread exit and create a new file. In post-processing, we'd add any missing thread exits to threads that don't have them, to cover waiting threads who never reached a flush.

As discussed above, the trigger thread would create a new directory for each window. A just-finished buffer is written to the directory corresponding to the window for its start point. A thread that is unscheduled for a long time could have a nearly-full buffer that is not written out until many windows later, but it would be written to the old directory for the old window. The next buffer would go to a new file in the new window, with no files in the in-between window directories.

(Originally we thought this scheme would have buffer-level inaccuracy (and talked about using timestamps at the start and end of each buffer to detect): but that would only be if it wrote out to the current window dir.)

Proposal B: Label buffers with owning window

If we add the window ordinal to every buffer header, we can identify which window they belong to, and avoid the need to separate raw files. A window-end flush ensures a buffer belongs solely to the window identified in its header; the next buffer will have the new window value.

This scheme can be used with file splitting during raw2trace, or waiting until analysis. Each thread has one raw file which contains all windows during the execution.

Proposal C: Trigger thread identifies buffer transition point of the other threads

For this proposal, the thread triggering the end of the window walks the other threads and identifies the phase transition point inside the buffer, telling the post-processor where to split them.
I considered having the triggerer also flush the buffers, but that is challenging with a race with the owner also flushing. Plus, it still requires post-processing help to identify the precise point for splitting the buffer (without synchronization the triggerer can only get close). To avoid barriers on common case trace buffer writes, we use a lazy scheme where the triggerer does not modify the trace buffers themselves, but instead marks which portion has been written using a separate variable never accessed in a fastpath. Implementation:
This scheme ends up with block-level accuracy since the trigger thread's marked transition point must be adjusted to essentially a block boundary in post-processing. Thus, it does not seem any better than the other schemes, and it is more complex.

Online Traces

It makes sense for offline to treat each window trace as separate and simulate them separately (though aggregating the results to cover the whole application).

But for online: we have the same instance of the simulator or analysis tool throughout the whole application run. It will get confused if it throws away thread bookkeeping on a thread exit for a window. Either we have a window-controller simulator who spins up and down a new instance of the real target simulator/tool on each window, or we introduce new "end phase/start phase" markers. If we have split offline traces, those would only be for online though which does not sound appealing. Simulators/tools would need special handling for them: reporting statistics for the phase while continuing to aggregate for a multi-phase report or something.

We might want combined files for offline too, as discussed above. That would unify the two, which is appealing. |
Adds a new design document discussing decisions and tradeoffs for the multi-window trace feature. Issue: #3995
For x86 we also want to eliminate counting of non-fetched rep string instructions which make the instruction counts used for the windows and gaps not match what the PMU will say: this is #4948. |
Adds a drmemtrace feature under a new on-by-default -split_windows option to create a separate subdirectory with a separate set of raw files per traced window. This avoids disk space issues with a single file, and splitting at the raw stage is relatively simple for regular drmemtrace usage (though not as simple for external users of the file i/o redirection). Files in raw/window.NNNN/ subdirectories are mirrored in trace/window.NNNN/ subdirectories upon being post-processed. Post-processing handles just the first window by default; the others must be explicitly passed as input directories in separate post-processing invocations. This changes the non-window behavior to not create an output file until tracing starts, which necessitated changing the tool.drcacheoff.delay-func test to check for no output files as a slightly different type of test. Adds a test of split-file offline windows. Fixes an infinite loop bug in raw2trace hit when a file is truncated: hit while the windows were buggy and missing footers. Issue: #3995
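As an illustration of the per-window directory layout named in the commit message, the helper below builds a `window.NNNN` path. It is a sketch only: the function name is hypothetical, and reading `NNNN` as a zero-padded four-digit window ordinal is an assumption based on the placeholder in the text rather than something the message states.

```cpp
#include <cstdio>
#include <string>

// Builds "<base>/window.NNNN" for a given window ordinal.
// Assumption: NNNN is the ordinal zero-padded to four digits.
std::string
window_subdir(const std::string &base, unsigned window)
{
    char name[32];
    std::snprintf(name, sizeof(name), "window.%04u", window);
    return base + "/" + name;
}
```

Under that assumption, `window_subdir("raw", 3)` names the raw directory for one window, and the same ordinal under `trace` names its post-processed mirror, which would be the directory to pass explicitly when post-processing a window other than the first.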
Fixes bugs where a huge number of files are opened in the first window of a multi-window drmemtrace run, and extra headers are added as well. Augments the windows-split test to run long enough to reproduce this problem, and adds to its output to detect errors in post-processing. Issue: #3995
Currently we can only perform window-tracing at regular intervals on pre-compiled binaries using the -trace_after_instrs, -trace_for_instrs, and -retrace_every_instrs options. Sometimes it's useful to trace windows of different sizes at irregular instruction intervals (e.g., for tracing simpoints). We do so by introducing a new option:
```
-trace_instr_intervals_file path/to/instr/intervals.csv
```
which takes a CSV file where every line holds a <start,duration> pair representing an interval in terms of number of instructions. The implementation relies on the same window-tracing mechanism used by -trace_after_instrs, -trace_for_instrs, and -retrace_every_instrs. We add a level of indirection to obtain the values of these options through the `get_trace_after_instrs_value()`, `get_trace_for_instrs_value()`, and `get_retrace_every_instrs_value()` functions. This allows us to change the returned value of these options depending on the window we are tracing at that point. We do so using two global, read-only vectors containing the to-trace/to-not-trace number of instructions, and an atomic (also global) index that we increment every time we finish tracing a window to move to the next entry in the two vectors. We add a new end-to-end test: tool.drcachesim.irregular-windows-simple. Issue #3995
Currently we can only perform window-tracing at regular intervals on binaries using the -trace_after_instrs, -trace_for_instrs, and -retrace_every_instrs options. Sometimes it's useful to trace windows of different sizes at irregular instruction intervals (e.g., for tracing simpoints). We do so by introducing a new option:
```
-trace_instr_intervals_file path/to/instr/intervals.csv
```
which takes a CSV file where every line holds a <start,duration> pair representing an interval in terms of number of instructions. The implementation relies on the same window-tracing mechanism used by -trace_after_instrs, -trace_for_instrs, and -retrace_every_instrs. We add a level of indirection to obtain the values of these options through the `get_initial_no_trace_for_instrs_value()`, `get_current_trace_for_instrs_value()`, and `get_current_no_trace_for_instrs_value()` functions respectively. This allows us to change the returned value of these options depending on the window we are tracing at that point. We do so using a global, read-only vector `irregular_windows_list` containing the trace_for_instrs/no_trace_for_instrs windows, and an atomic (also global) index `irregular_window_idx` that we increment every time we finish tracing a window to move to the next one in the vector. We add a new end-to-end test: tool.drcachesim.irregular-windows-simple. Issue #3995
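As an illustration of how <start,duration> pairs relate to the trace/no-trace windows described above, here is a small Python sketch. It is not the tracer's implementation: it assumes that `start` is an absolute instruction count from the beginning of execution, that intervals must not overlap, and the function names `parse_intervals` and `to_windows` are made up for this example.

```python
import csv
import io


def parse_intervals(csv_text):
    """Parse lines of 'start,duration' into a list of int pairs sorted by start."""
    intervals = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row or not "".join(row).strip():
            continue  # Skip blank lines.
        intervals.append((int(row[0]), int(row[1])))
    intervals.sort()
    return intervals


def to_windows(intervals):
    """Convert absolute (start, duration) pairs into (no_trace, trace) instruction counts.

    Each output pair says how many instructions to skip before tracing,
    then how many instructions to trace.
    """
    windows = []
    cursor = 0  # Instruction count at which the previous traced window ended.
    for start, duration in intervals:
        if start < cursor:
            raise ValueError(f"interval starting at {start} overlaps the previous one")
        windows.append((start - cursor, duration))
        cursor = start + duration
    return windows
```

For example, the two lines `100,50` and `400,25` would become a 100-instruction gap, a 50-instruction window, a 250-instruction gap, and a 25-instruction window.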
Add a pointer to any prior users list discussion.
Currently we can collect one trace sample during a benchmark run. We use '-trace_after_instrs' and '-exit_after_tracing' to select a trace point.
Is your feature request related to a problem? Please describe.
We need to run the benchmark multiple times to collect traces throughout the benchmark execution. For a long-running benchmark, this may take a very long time.
Describe the solution you'd like
Have a way to ‘trace x million insts every y million insts’. A more advanced method would be to have a config file listing the sampling windows.
Do you have any implementation in mind for this feature?
No
Describe alternatives you've considered
Currently we plan to run the benchmark multiple times, each with a different sampling point.
Additional context
None.