# SYCL Tracing - Part I
To understand the various language constructs available in the SYCL layer,
they must be instrumented so that end-user source code can be correlated
with the kernels that execute on a device.
A lightweight tracing framework (XPTI) was developed to facilitate this
through explicit instrumentation of all the language constructs. The goal
of this framework is to provide a low-overhead solution that tools can use
to build performance analytical models. Based on tests, simulations, and
projections, the framework API can capture telemetry information for
~60,000-70,000 events/sec with an overhead of less than 1% of the application runtime.
This document outlines the use of this framework API at various points in
the SYCL runtime to trace the language constructs. Each language construct
used/expressed by the developer is associated with a source location payload
information that includes the function name, source file name and line
number where the construct is expressed. Using the source location
information, a hash is created for the language construct along with a
corresponding unique ID. The framework provides the ability to propagate
this ID all the way to the driver layers for the device.
This document discusses where in the SYCL runtime instrumentation has been
added and the reasons behind adding this instrumentation.
> **NOTE:** This document is better viewed with [Markdown Reader](https://chrome.google.com/webstore/detail/markdown-reader/gpoigdifkoadgajcincpilkjmejcaanc?hl=en) plugin for chrome or the [Markdown Preview Extension](https://github.com/shd101wyy/vscode-markdown-preview-enhanced/releases) for Visual Studio Code.
## Instrumentation Trace Points
This section will document all the places in the SYCL runtime that have been
instrumented to capture the asynchronous task graphs created by the runtime.
The task graphs are captured as graph, nodes and edges:
> - The graph encapsulates all of the disjoint task graphs generated by the application.
> - The nodes capture operations that are performed, such as kernel
executions or memory transfers.
> - The edges represent dependence relationships; their representation
mimics control flow rather than a dependence graph. The source node
of an edge must complete before the target node can begin execution.
All code changes to enable this are guarded by the
`XPTI_ENABLE_INSTRUMENTATION` macro, and the CMake files have been updated
to expose this as an option, enabled by default, in
`llvm/sycl/CMakeLists.txt`.
```cmake
...
# Create a soft option for enabling or disabling the instrumentation
# of the SYCL runtime
option(SYCL_ENABLE_XPTI_TRACING "Enable tracing of SYCL constructs" ON)
```
### The Graph
Any SYCL/DPC++ application can submit command groups to any active queue
during the lifetime of the application. Each submission is handled by the
runtime and the asynchronous task graphs are updated to reflect the new
submission. This may be as simple as adding a single new node to the
task graph, or it may add multiple nodes, where one node represents the
computation and the others represent dependent memory transfers.
To model this, we create a global graph for every application instantiation
and all kernel executions in the applications are added as nodes in this
global graph. In the SYCL runtime, there is no obvious location where the
creation of the global graph can be inserted as many objects are
instantiated statically. Currently, we embed the graph creation in the
plugin interface (PI) layer `initialize()` call. In this call, we will
perform two operations:
1. Initialize all listeners and create a trace event to represent the graph.
This is done in `sycl/include/sycl/detail/pi.cpp`.
2. Send a `graph_create` event to all subscribers. This notification
will only be sent once.
### The Nodes
The command group lambdas are captured and encapsulated in a `Command`
object. This object is evaluated for dependencies on data/memory or external
OpenCL events and an asynchronous task graph is built by mapping all these
dependencies, before it is enqueued on the device. In order to capture the
command groups (nodes) and the dependencies (edges), the base class
`Command` and any derived classes that are of interest are instrumented.
In this section, we discuss the instrumentation of the Command object in two
parts: (1) The changes made to capture end-user source code details for
language constructs (2) The instrumentation that handles capturing the
relevant metadata.
1. In order to capture end-user source code information, we have implemented
the `sycl::detail::code_location` class, which uses the builtin functions
in the compiler. However, equivalent implementations are unavailable on
Windows, and a separate cross-platform implementation might be used in the
future. Until then, the Windows implementation will always report
`unknown_file`, `unknown_func` and a line number of 0 for the source
file, function name and line number. We handle this case while processing
this information.
The source information of a language construct, such as source file,
function name, line number and column number allow us to determine if a
Command that was previously created for a construct is being created
again. In such cases, we will not emit a `node_create` event, but we
will bump up the instance count recording the number of instances
created. Secondly, the source information allows us to associate a unique
ID with the source location and propagate it all the way to the driver,
if possible. This will allow us to associate a Kernel event with a source
location at all times. All instrumentation that identifies a command
object of a given type and emits the `node_create` event is located
in the `emitInstrumentationData()` and must be implemented by all
derived classes.
To enable this source location information, we start by extending the
public methods in the queue class, such as `queue.submit()`,
`queue.parallel_for()`, `queue.wait()`, etc., with a default
argument that captures the source location information. This default
argument records the location of the line in the caller that makes the
call to `queue.submit()`, `queue.parallel_for()`, etc.
These changes are present in `queue.hpp` and `ordered_queue.hpp`.
The default arguments for all public functions are guarded by
`#ifdef SYCL_INSTRUMENTATION_METADATA` that is currently enabled by
default.
The location information, when captured, is propagated all the way to the
`CommandGroup` object. So, for every `CommandGroup` object, we
will have the corresponding source location in end-user code where the
command group is submitted to the queue. This metadata is propagated by
the instrumentation to the subscribers of the stream.
2. The base `Command` class and all derived classes are instrumented to capture
the relevant information for each command object and a `node_create` event is
generated.
### The Node instance
Once a command object is created, it is enqueued on the device for
execution. To capture the execution of this node instance, we instrument the
`enqueue()` method to determine the cost of this computation or memory
related kernel. As the commands are enqueued, the enqueue method emits a
pair of events indicating the `task_begin` and `task_end` events that
capture the duration of the enqueued command. For commands that are
asynchronous, the pair of events capture just the kernel submission and the
actual execution of the command on the device is tracked through the
`cl_event` returned by the enqueue operation. In the case of host kernel
execution or commands that are synchronous, the cost is measured directly.
In the case of the command being submitted to an OpenCL device, we capture
the event of the submitted kernel and propagate it to the subscriber tool.
It is up to the tool to register a callback for this event completion and
close the task opened for the command object.
### The Edges
As discussed in the previous section, the command groups submitted to the
device queues form nodes in the asynchronous tasks graphs created by
the SYCL runtime. In addition to these nodes, based on the memory references
(through accessors or USM pointers), additional nodes to `allocate`,
`copy` and `release` are created and they are necessary for the
computation kernels to run. The computation kernel has dependencies on the
memory objects and these dependencies are recorded as `event`s and in
our model we represent them as edges between the dependent nodes.
Tools monitoring the event stream then can start capturing the asynchronous
task graph as it is being built. As dependencies are added to a command
object, the instrumentation emits these dependencies as `edge_create`
events. Each of these `edge_create` events encapsulates the two command
objects that have a dependency through this edge. The source object of this
edge event must complete execution first before the target object of the
edge can begin execution.
To instrument this part of the code, the `Command::addDep` methods of
the Command object are instrumented to create the trace points and notify
all subscribers.
The `Release` command, as implemented in the SYCL runtime, holds a
reference to the memory object, but no explicit dependencies are created
for it. To model the edges correctly, we instrument the
`waitForRecordToFinish` method in the `Scheduler`, where the release
operation waits for all dependent operations to complete, and capture the
edges there.
This concludes all the changes that were made to the SYCL runtime to support
tracing. The next section talks about the XPTI framework that allows
applications and runtimes to efficiently capture, record and emit trace
notifications for important events during the run.
# SYCL Tracing - Part II
At a rudimentary level, the architecture of XPTI allows you to instrument
any runtime or application and link it with a static stub library that
implements all of the framework's functions. If
tracing is enabled through the environment variable
`XPTI_TRACE_ENABLE=1`, the stub library checks to see if the framework
dispatcher is registered. This dispatcher is registered through an
environment variable `XPTI_FRAMEWORK_DISPATCHER=/path/to/libxptifw.so`.
If tracing is turned on and the dynamic loading of the framework dispatcher
is successful, then the proxy library creates a dispatch table for all the
trampoline functions used in the instrumentation of the application or
runtime.
The static library is the only dependency for building the SYCL runtime (or
any application/runtime that uses this instrumentation mechanism) and is
currently available under `llvm/xpti`. In the current organization at
the file system level, the API specification for the instrumentation
framework is available with the static library under `llvm/xpti/include`.
The framework is divided into two parts: (1) the implementation of the
proxy/stub library, which is compiled into a static library and linked
with the SYCL runtime, and (2) a dynamic library that can be registered as
the framework dispatcher; the dynamic library is not required to build the
SYCL runtime and is not a part of the `llvm` project. The dynamic library
depends on the API specification that is a part of the static proxy
library. Using the specification in `llvm/xpti`, an implementation of the
dynamic library can be built.
## The Proxy library
The proxy library implements all of the public functions that are part of
the XPTI tracing infrastructure. Each function, however, is a stub that
checks whether tracing has been enabled; if so, it forwards the call to
the same function in the framework dispatcher, the dynamic component of
this framework.
```c++
XPTI_EXPORT_API xpti::result_t xptiInitialize(const char *stream,
                                              uint32_t maj,
                                              uint32_t min,
                                              const char *version) {
// Static object g_loader will return true if
// XPTI_TRACE_ENABLE=1 and the dynamic dispatcher
// library in XPTI_FRAMEWORK_DISPATCHER is valid
// and successfully loaded.
//
if (xpti::g_loader.noErrors()) {
void *f = xpti::g_loader.functionByIndex(XPTI_INITIALIZE);
if (f) {
return (*(xpti_initialize_t)f)(stream, maj, min, version);
}
}
return xpti::result_t::XPTI_RESULT_FAIL;
}
```