From 33f464919c7407beb946c2560eb58e2a46d29be6 Mon Sep 17 00:00:00 2001
From: Derek Bruening
Date: Wed, 9 Mar 2022 18:39:51 -0500
Subject: [PATCH] i#3995 multi-window: Add design doc for multi-window-trace feature

Adds a new design document discussing decisions and tradeoffs for the
multi-window trace feature.

Issue: #3995
---
 api/docs/developers.dox         |   3 +-
 api/docs/multi_trace_window.dox | 396 ++++++++++++++++++++++++++++++++
 2 files changed, 398 insertions(+), 1 deletion(-)
 create mode 100644 api/docs/multi_trace_window.dox

diff --git a/api/docs/developers.dox b/api/docs/developers.dox
index 9b10006e69a..64639ff207c 100644
--- a/api/docs/developers.dox
+++ b/api/docs/developers.dox
@@ -1,5 +1,5 @@
 /* ******************************************************************************
- * Copyright (c) 2021 Google, Inc. All rights reserved.
+ * Copyright (c) 2021-2022 Google, Inc. All rights reserved.
 * ******************************************************************************/
 /*
@@ -54,6 +54,7 @@ Information for DynamoRIO developers/contributors:
 - \subpage page_ldstex
 - \subpage page_external_decoder
 - \subpage page_scatter_gather_emulation
+- \subpage page_multi_trace_window

 ****************************************************************************
 */
diff --git a/api/docs/multi_trace_window.dox b/api/docs/multi_trace_window.dox
new file mode 100644
index 00000000000..7dfdb84a30c
--- /dev/null
+++ b/api/docs/multi_trace_window.dox
@@ -0,0 +1,396 @@
/* ******************************************************************************
 * Copyright (c) 2022 Google, Inc. All rights reserved.
 * ******************************************************************************/

/*
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 * * Redistributions of source code must retain the above copyright notice,
 *   this list of conditions and the following disclaimer.
 *
 * * Redistributions in binary form must reproduce the above copyright notice,
 *   this list of conditions and the following disclaimer in the documentation
 *   and/or other materials provided with the distribution.
 *
 * * Neither the name of Google, Inc. nor the names of its contributors may be
 *   used to endorse or promote products derived from this software without
 *   specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL VMWARE, INC. OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
 * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
 * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
 * DAMAGE.
 */

/**
 ****************************************************************************
\page page_multi_trace_window Multi-Window Memtraces

\tableofcontents

# Overview

Memory address traces (or "memtraces") are critical tools for analyzing application
behavior and performance.
DynamoRIO's \ref page_drcachesim provides memtrace gathering and analyzing tools.
Today, these memtraces are gathered either during a full execution of a standalone
application, or during a short burst of execution on a server. For long-running
benchmarks with phased behavior, however, we would like to instead gather a series of
smaller memtrace samples to ease simulation while still representing the application
in the aggregate. This sampling may also be useful for servers with phased behavior.

# Initial Use Case: SPEC2017

An initial use case driving this work is the gathering of memtraces from SPEC2017.
These benchmarks execute tens of trillions of instructions each and include phased
behavior. Our goal is to gather uniformly sampled memtrace sequences across the whole
execution. To support a full warmup for each sample, our target is 10 billion
instructions in each. Something like 50 samples per benchmark should suffice to
capture major phases. With these parameters, for a 25-trillion-instruction benchmark,
that would look like a series of 10-billion-instruction traces, each followed by
490 billion instructions of untraced execution.

# Design Point: Separate Traces v. Merged-with-Markers

Focusing on a use case of a series of 50 10-billion-instruction traces for a SPEC
benchmark, there are two main ways to store them. We could create 50 independent
sets of trace files, each with its own metadata and separate set of sharded data
files. A simulator could either simulate all 50 separately and aggregate just the
resulting statistics, or a single instance of a simulator could fast-forward between
each sample to maintain architectural state and simulate the full execution that way.

The alternative is to store all the data in one set of data files, with metadata
markers inserted to indicate the division points between the samples. This does not
support the separate simulation model up front, though we could provide an iterator
interface that skips ahead to a target window and stops at the end of that window (or
the simulator could be modified to stop when it sees a sample separation marker).
However, this will not be as efficient for parallel simulation with separate
simulator instances for each window, since the skipping ahead will take some time.
This arrangement does more easily support the fast-forward single-simulator-instance
approach, and more readily fits with upstream online simulation.

In terms of implementation, there are several factors to consider here.

## Separate raw files

If we want separate final traces, the simplest approach at first glance is to produce
a separate set of raw files for each tracing window. These would be post-processed
separately and independently.

However, implementing this split does not fit well with the current interfaces. To
work with other filesystems, we have separated out the i/o, and in particular
directory creation.

For upstream use with files on the local disk, we could add creation of a new
directory (and duplication of the module file) for each window by the tracing thread
that hits the end-of-window trigger. The other threads would each create a new
output raw file each time they transition to a new window (see also the Proposal A
discussion below).

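To make that concrete, here is a minimal sketch of the directory handling the trigger
thread would need under this scheme. The subdirectory naming and the module file name
are illustrative assumptions, not the actual drmemtrace layout, and a real tracer
would route this through its replaceable file operations rather than direct
filesystem calls:

```cpp
// Hypothetical sketch only: the thread that triggers the end of a window
// creates the next window's raw directory and duplicates the module file so
// that each window's raw files can be post-processed independently.
#include <cstdio>
#include <filesystem>

namespace fs = std::filesystem;

static fs::path
create_window_raw_dir(const fs::path &top_dir, int window_ordinal)
{
    char name[32];
    std::snprintf(name, sizeof(name), "raw.window.%04d", window_ordinal);
    fs::path window_dir = top_dir / name;
    fs::create_directories(window_dir);
    // Duplicate the module list file (name assumed here) so raw2trace can
    // process this window without referencing any other window's directory.
    fs::copy_file(top_dir / "modules.log", window_dir / "modules.log",
                  fs::copy_options::overwrite_existing);
    return window_dir;
}
```

Each application thread would then open its next per-thread raw file inside the new
directory when it observes the window transition.
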
## Splitting during raw2trace

Alternatively, we could keep a single raw file for each thread and split it up into
per-window final trace files during post-processing by the raw2trace tool. We would
use markers inserted at the window transition points to identify where to separate.

raw2trace would need to create a new output directory and duplicate the trace headers
and module file. Like for separate raw files, this goes against the current i/o
separation where today we pass in a list of all the input and output files up front
and raw2trace never opens a file on its own, to better support proprietary
filesystems with upstream code.

Another concern here is hitting file size limits with a single raw file across many
sample traces. For the example above of 50 10-billion-instruction traces, if we
assume an average of 2 dynamic instructions per raw entry, each window might contain
5GB of data, reaching 250GB for all 50. Furthermore, the final trace is even larger.

The file size problem gets worse if we use a constant sampling interval across
SPEC2017. Some SPEC2017 benchmarks have many more instructions than others. The
bwaves_s benchmark has 382 trillion instructions, so a constant interval might result
in it having 50x more data than other benchmarks, exceeding the file size limit. A
constant number of samples is preferred for this reason.

## Splitting during analysis

Given the complexities of splitting in earlier steps, and given that we may want to
use a single simulator instance to process all of the sample traces, and given that
for upstream online analysis we will likely also have a single simulator instance:
perhaps we should not try to split the samples and instead treat the 50 samples as a
single trace with internal markers indicating the window divisions.

Online and offline analyzers can use window ID markers to fast-forward and align each
thread to the next window. Maybe the existing serial iterator can have built-in
support for aligning the windows.

If single-file final traces are going to exist, we will need to update all our
existing analyzers to handle the gaps in the traces: reset state for function and
callstack trackers; keep per-window numbers for statistics gatherers.

We can also create an analyzer that splits a final trace up if we do want separate
traces.

## Decision: Split during analysis

Separate files seem to be the most flexible and useful setup for our expected use
cases, in particular parallel simulation. But given that separating early in the
pipeline is complex, we'll split in the analysis phase, initially with a manual tool
since we do not plan to have automatically-gathered multi-window traces.

We'll update some simple inspection and sanity tools (view, basic_counts, and
invariant_checker) to handle and visualize windows, but we'll assume that trace
windows will be split before being analyzed by any more complex analysis tools. For
online traces we will probably stick with multi-window-at-once.

We'll create a tool to manually split up multi-window trace files.

# Design Point: Continuous Control v. Re-Attach

One method of obtaining multiple traces is to repeat today's bursts over and over,
with a full detach from the application after each trace. However, each attach point
is expensive, with long periods of profiling and code cache pre-population. While a
scheme of sharing the profiling and perhaps the code cache could be developed while
keeping a full detach, a simpler approach is to remain in control but switch from
tracing to instruction counting in between tracing windows. Instruction counting is
needed to determine where to start the next window in any case.

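As a rough sketch of what that counting needs to accomplish (the names and the shared
atomic counter are illustrative only; the real work is done by inserted
instrumentation, and the interval value is just the earlier SPEC example):

```cpp
// Illustrative sketch of the counting phase: accumulate executed instructions
// and have whichever thread crosses the untraced-interval threshold trigger
// the next tracing window.  This is conceptual C++, not the tracer's code.
#include <atomic>
#include <cstdint>

static const uint64_t kNoTraceInterval = 490000000000ULL; // 490 billion (example).
static std::atomic<uint64_t> instrs_since_last_window{ 0 };

static void
start_next_tracing_window()
{
    // Placeholder: switch the instrumentation mode to tracing (see the
    // dispatch discussion below) and reset the interval counter.
    instrs_since_last_window.store(0, std::memory_order_relaxed);
}

// Logically executed by the counting instrumentation for each block.
static void
on_counted_block(uint64_t block_instr_count)
{
    uint64_t prior = instrs_since_last_window.fetch_add(block_instr_count,
                                                        std::memory_order_relaxed);
    // Only the thread whose addition crosses the threshold becomes the trigger.
    if (prior < kNoTraceInterval && prior + block_instr_count >= kNoTraceInterval)
        start_next_tracing_window();
}
```

In practice cheaper per-thread or more relaxed counters would likely be used to keep
the common case fast; the point is simply that the counting phase both measures the
untraced gap and selects the thread that triggers the next window.
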
Instruction counting through instrumentation is not cheap, incurring perhaps a 1.5x
slowdown. Compared to the 50x overhead while tracing, however, it is acceptable. If
lower overhead is desired in the future, a scheme using a full detach plus hardware
performance counters to count instructions can be investigated. The decision for the
initial implementation, however, is to use the simpler alternating tracing and
counting instrumentation windows.

# Design Point: Instrumentation Dispatch v. Flushing

As the prior section concluded, we plan to alternate between tracing and instruction
counting. There are two main approaches to varying instrumentation during execution:
inserting all cases up front with a dispatch to the desired current scheme, and
replacing instrumentation by flushing the system's software code cache when changing
schemes.

Flushing is an expensive process, and it can be fragile as the lower-overhead forms
of flushing open up race conditions between threads executing the old and the new
code cache contents. Its complexity is one reason we are deciding to instead use a
dispatch approach for our initial implementation.

With dispatch, we insert both tracing and counting instrumentation for each block in
the software code cache. Dispatch code at the top of the block selects which scheme
to use. The current mode, either tracing or counting, is stored in memory and needs
to be synchronized across all threads.

The simplest method of synchronizing the instrumentation mode is to store it in a
global variable, have the dispatch code use a load-acquire to read it, and modify it
with a store-release. There is overhead to a load-acquire at the top of every block,
but experimentation shows that it is reasonable compared to the overhead of the
instrumentation itself even for instruction counting mode, and its simplicity makes
it our choice for the initial implementation.

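In plain C++ atomics, the intent of that scheme looks roughly like the following;
the real dispatch is a load-acquire and branch generated at the top of every block
rather than a function call, and the names here are placeholders:

```cpp
// Conceptual sketch of the mode synchronization: a single global mode word,
// read with acquire semantics by every block's dispatch and written with
// release semantics by the thread that triggers a window transition.
#include <atomic>
#include <cstdint>

enum class trace_mode_t : uint32_t { COUNTING, TRACING };

static std::atomic<uint32_t> mode_var{ static_cast<uint32_t>(trace_mode_t::COUNTING) };

// What the per-block dispatch logically does.
static inline trace_mode_t
dispatch_read_mode()
{
    return static_cast<trace_mode_t>(mode_var.load(std::memory_order_acquire));
}

// What the window-transition trigger logically does.
static inline void
trigger_set_mode(trace_mode_t new_mode)
{
    mode_var.store(static_cast<uint32_t>(new_mode), std::memory_order_release);
}
```

The store-release pairs with the dispatch's load-acquire so that a thread observing
the new mode also observes any window bookkeeping written before the switch.
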
The mechanisms for creating the dispatch and the separate instrumentation copies for
the modes are provided for us by the upstream drbbdup library. This library was,
however, missing some key pieces which we had to add.

## AArch64 support for drbbdup

The drbbdup library had no AArch64 support initially. It needed some updates to its
generated code sequences, as well as handling of the weak memory model on AArch64.
To support storing the mode variable in global memory with the short reach of AArch64
memory addressing modes, we added explicit support for using a scratch register to
hold the address. For the weak memory model, explicit support for using a
load-acquire to read the value was added.

## Function wrapping support for drbbdup

The function wrapping library used when tracing does not easily work with drbbdup out
of the box, due to the different control models. We added an inverted control mode
to the wrapping library to allow enabling wrapping only for the tracing mode and not
for the counting mode.

However, there is a complication here: a transition to counting mode while in the
middle of a wrapped function can cause problems if cleanup from exiting the function
is skipped. We ended up adding support for cleanup-only execution of function
wrapping actions, which is invoked while counting, with full wrapping actions enabled
while tracing.

A final minor change, to support DRWRAP_REPLACE_RETADDR, was to use the tag rather
than the first instruction's application PC, since that wrapping scheme creates some
blocks containing nothing but synthetic instructions with no corresponding
application PC.

## Write-xor-execute support for drbbdup

The drbbdup library uses its own generated code area for out-of-line callouts, but
only for dynamically discovered instrumentation cases. This does not play well with
write-xor-execute security features, where we need to separate writable and
executable code and have a special mechanism for DynamoRIO's code cache that does not
currently extend to tools using their own caches. To solve this for our use case,
which has only static cases, we simply disabled the drbbdup code cache when a new
flag is set disabling dynamic case support.

## Emulation support for drbbdup

The memtrace code uses emulation sequence markers for instrumenting the proper
application instructions in the face of various expansions for repeated string loops
and scatter-gather instructions. However, the drbbdup library presents its own
interface layer which hides the markers and in fact results in the memtrace client
missing some application instructions.

## Consider partial detach with PMU instruction counting for non-tracing windows?

We could reduce the overhead of the non-tracing windows, where we're counting
instructions, by using the PMU to count for us. We would do something like a detach
but keep our code cache (and assume no code changes by the application) and re-attach
when the PMU counting reaches the target.

# Handling Phase Transitions

For a normal memtrace burst, we completely detach from the server at the end of our
desired trace duration. This detach process synchronizes with every application
thread.

For multi-window traces, we are using multi-case dispatched instrumentation where we
change the instrumentation type for each window. We have no detach step that wakes
up all threads and has them flush their trace buffers, and we're deliberately trying
to avoid a global synchronization point. Yet we would prefer perfect transitions
between windows, whether that means separate raw files or accurately-placed markers.

## Key step: Add end-of-block phase change check

We do flush prior to a syscall, so a thread at a kernel wait point should have an
empty buffer and not be a concern.

The main concern is a thread not in a wait state that happens to not be scheduled
consistently for a long time and so does not fill up its buffer until well after the
window ends.

We can augment the current end-of-block flush check, which today looks for the buffer
being full. We can add a check for the prior window having ended, by having a global
window ordinal and storing its value per thread at the start of filling up a new
buffer. (This is better than simply checking the drbbdup mode value for being in
non-tracing mode, as that will not catch a double mode change.) If the prior window
has ended, we can flush the buffer, or simply add a marker, depending on the scheme
(see below).

A thread that receives a signal mid-block (it would have to be a synchronous signal,
as DR waits until the end of the block for asynchronous ones) will skip its
end-of-block checks and redirect to run the app's signal handler: but it would hit
the checks for the first block of the handler.

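Concretely, the augmented check amounts to something like the following sketch, with
placeholder names and plain C++ standing in for the inserted instrumentation:

```cpp
// Conceptual sketch of the augmented end-of-block flush check; in reality this
// is inlined instrumentation plus a clean call for the flush itself.
#include <atomic>
#include <cstddef>
#include <cstdint>

// Global window ordinal, advanced by the trigger thread at each transition.
static std::atomic<uint64_t> global_window_ordinal{ 0 };

struct per_thread_t {
    uint8_t *buf_base = nullptr;
    uint8_t *buf_ptr = nullptr;
    size_t buf_size = 0;
    // Window ordinal recorded when this thread started filling its buffer.
    uint64_t buf_window = 0;
};

static void
flush_buffer(per_thread_t *pt)
{
    // Write out [buf_base, buf_ptr) and reset (details elided).
    pt->buf_ptr = pt->buf_base;
}

// Logically executed at the end of each block while tracing.
static void
end_of_block_check(per_thread_t *pt)
{
    bool buffer_full = pt->buf_ptr >= pt->buf_base + pt->buf_size;
    // New check: has the window this buffer started in already ended?
    // Comparing ordinals, rather than checking the tracing/counting mode,
    // also catches a double mode change.
    uint64_t cur_window = global_window_ordinal.load(std::memory_order_acquire);
    bool window_changed = pt->buf_window != cur_window;
    if (buffer_full || window_changed) {
        flush_buffer(pt); // Or simply append a marker, depending on the scheme.
        pt->buf_window = cur_window;
    }
}
```
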
The worst case inaccuracy here is a thread that starts writing in window N but ends
up unscheduled until a much later window M. But at most one basic block's worth of
trace entries will be placed into window N even though they happened later. Thus we
have "basic block accuracy", which is pretty good, as typically a basic block
contains only a few instructions.

## Proposal A: Separate raw files split at flush time

If we're splitting raw files (see above), we would use the end-of-block window-change
flush to emit a thread exit and create a new file. In post-processing, we'd add any
missing thread exits to threads that don't have them, to cover waiting threads that
never reached a flush.

As discussed above, the trigger thread would create a new directory for each window.
A just-finished buffer is written to the directory corresponding to the window for
its start point.

A thread that is unscheduled for a long time could have a nearly-full buffer that is
not written out until many windows later, but it would be written to the old
directory for the old window. The next buffer would go to a new file in the new
window, with no files in the in-between window directories.

(Originally we thought this scheme would have buffer-level inaccuracy, and talked
about using timestamps at the start and end of each buffer to detect it: but that
would only be the case if the buffer were written out to the current window's
directory.)

## Proposal B: Label buffers with owning window

If we add the window ordinal to every buffer header, we can identify which window
each buffer belongs to, and avoid the need to separate raw files. A window-end flush
ensures a buffer belongs solely to the window identified in its header; the next
buffer will have the new window value.

This scheme can be used with file splitting during raw2trace, or waiting until
analysis. Each thread has one raw file which contains all windows during the
execution.

## Proposal C: Trigger thread identifies buffer transition point of the other threads

For this proposal, the thread triggering the end of the window walks the other
threads and identifies the phase transition point inside each buffer, telling the
post-processor where to split them.

I considered having the triggerer also flush the buffers, but that is challenging
due to a race with the owner also flushing. Plus, it still requires post-processing
help to identify the precise point for splitting the buffer (without synchronization
the triggerer can only get close).

To avoid barriers on common-case trace buffer writes, we use a lazy scheme where the
triggerer does not modify the trace buffers themselves, but instead marks which
portion has been written using a separate variable never accessed in a fastpath.

Implementation:

+ The tracer maintains a global list of thread buffers, using a global mutex on
  thread init and exit.

+ Each trace buffer has a corresponding externally_written variable holding a
  distance into the buffer that was written out by another thread.

+ On hitting the trace window endpoint threshold, the triggering thread grabs the
  mutex and walks the buffers.

  The triggerer doesn't have the current buffer position pointer. Instead it walks
  the buffer until it reaches zeroed memory (we zero the buffer after each flush).
  We have no synchronization with the owning thread: but observing writes out of
  order should be ok since we'll just miss one by stopping too early.
  We need to fix things up in post-processing in any case, because we need the phase
  transition to be at a clean point (we can't identify that point from the triggerer:
  if we end at an instr entry, we don't know whether some memrefs are coming
  afterward or not). In post-processing we adjust that position to the end of the
  block, and we split the buffer contents around that point between the neighboring
  traces.

  The triggerer does a store-release of the furthest-written point into the
  externally_written variable.

+ Whenever a thread writes out its trace buffer, it does a load-acquire on the
  externally_written variable and, if it is non-zero, it writes out a marker in the
  buffer header. Post-processing reads the marker and uses it to split the buffer at
  the nearest block boundary after the marker value.

+ If windows are small enough that the triggerer doesn't complete its buffer walk
  before a new window starts, other thread buffers may go entirely into the new
  window. That seems ok: if the windows are that small, in the absence of
  application synchronization the resulting window split should be a possible thread
  ordering.

This scheme ends up with block-level accuracy, since the trigger thread's marked
transition point must be adjusted to essentially a block boundary in post-processing.
Thus, it does not seem any better than the other schemes, and it is more complex.

# Online Traces

For offline traces, it makes sense to treat each window trace as separate and to
simulate each one separately (while aggregating the results to cover the whole
application).

But for online traces we have the same instance of the simulator or analysis tool
throughout the whole application run. It will get confused if it throws away thread
bookkeeping on a thread exit for a window.

Either we have a window-controller simulator which spins up and tears down a new
instance of the real target simulator/tool for each window, or we introduce new "end
phase/start phase" markers. If we have split offline traces, those markers would
only be for online use, though, which does not sound appealing. Simulators/tools
would need special handling for them: reporting statistics for the phase while
continuing to aggregate for a multi-phase report, or something along those lines.

We might want combined files for offline too, as discussed above. That would unify
the two, which is appealing.


 ****************************************************************************
 */