diff --git a/sycl/doc/Compiler-HLD.svg b/sycl/doc/Compiler-HLD.svg
new file mode 100755
index 0000000000000..989dcdbf7304c
--- /dev/null
+++ b/sycl/doc/Compiler-HLD.svg
@@ -0,0 +1,2695 @@
[SVG markup omitted. Diagram labels: SourceFile.cpp, Compiler driver, SYCL device front-end compiler, C++ host compiler, LLVM IR, Target specific LLVM compiler, Target binary, Offload-bundler, Host object file, Offload-wrapper, Fat binary file, Fat object (Host code, Device code), Integration header]
diff --git a/sycl/doc/Multi-source-compilation-flow.png b/sycl/doc/Multi-source-compilation-flow.png
new file mode 100644
index 0000000000000..521a7425cd45c
Binary files /dev/null and b/sycl/doc/Multi-source-compilation-flow.png differ
diff --git a/sycl/doc/SYCL_compiler_and_runtime_design.md b/sycl/doc/SYCL_compiler_and_runtime_design.md
new file mode 100644
index 0000000000000..d7c3cc5e5ce8e
--- /dev/null
+++ b/sycl/doc/SYCL_compiler_and_runtime_design.md
@@ -0,0 +1,390 @@
+# SYCL\* Compiler and Runtime architecture design
+
+## Introduction
+
+This document describes the architecture of the SYCL compiler and runtime
+library. The base SYCL specification version is
+[1.2.1](https://www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf).
+
+## SYCL Compiler architecture
+
+SYCL application compilation flow:
+
+![High level component diagram for SYCL Compiler](Compiler-HLD.svg)
+
+The SYCL compiler can be logically split into the host compiler and a number
+of device compilers, one per supported target. The clang driver orchestrates
+the compilation process: it invokes the device compiler once per requested
+target, then invokes the host compiler to compile the host part of a SYCL
+source. The result of compilation is a set of so-called "fat objects", one fat
+object per SYCL source file. A fat object contains compiled host code and a
+number of compiled device code instances, one per target. Fat objects can be
+linked into a "fat binary".
+
+SYCL sources can also be compiled as regular C++ code. In this mode there is
+no "device part" of the code; everything is executed on the host.
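
The fat-object layout described above can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: `FatObject`, `LinkInputs` and `collectLinkInputs` are hypothetical names that do not appear in the compiler, and real fat objects are binary files rather than strings. It shows how a set of fat objects splits into host objects and per-target groups of device code before being linked into a fat binary.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory model of a fat object: one host object plus one
// compiled device code instance per target. Names are illustrative only.
struct FatObject {
  std::string HostObject;                           // e.g. "a.obj"
  std::map<std::string, std::string> DeviceObjects; // target -> device code
};

// Inputs for linking a fat binary: all host objects, plus the device objects
// grouped by target so that each target can be linked separately.
struct LinkInputs {
  std::vector<std::string> HostObjects;
  std::map<std::string, std::vector<std::string>> DeviceObjectsByTarget;
};

LinkInputs collectLinkInputs(const std::vector<FatObject> &Objects) {
  LinkInputs Inputs;
  for (const FatObject &Obj : Objects) {
    // Assumption of this sketch: every fat object carries host code.
    assert(!Obj.HostObject.empty() && "fat object must carry host code");
    Inputs.HostObjects.push_back(Obj.HostObject);
    for (const auto &Entry : Obj.DeviceObjects)
      Inputs.DeviceObjectsByTarget[Entry.first].push_back(Entry.second);
  }
  return Inputs;
}
```

Grouping by target keeps the sketch independent of whether every fat object was compiled for the same set of targets.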
+
+The device compiler is further split into the following major components:
+
+- **Front-end** - parses the input source, outlines the "device part" of the
+code, applies additional restrictions on the device code (e.g. no exceptions
+or virtual calls), and generates LLVM IR for the device code only plus an
+"integration header", which provides information such as kernel name,
+parameter order and data types to the runtime library.
+- **Middle-end** - transforms the initial LLVM IR* so that it can be consumed
+by the back-end. Today middle-end transformations include just a couple of
+passes:
+  - OpenCL C++* to SPIR-V* built-in function names mapper
+  - Address space handling pass
+  - TBD: potentially the middle-end optimizer can run any LLVM IR
+    transformation with only one limitation: the back-end compiler should be
+    able to handle the transformed LLVM IR.
+  - Optionally: LLVM IR → SPIR-V translator.
+- **Back-end** - produces native "device" code in ahead-of-time compilation
+mode.
+
+### SYCL support in the driver
+
+SYCL offload support in the driver is based on the clang driver concepts and
+defines:
+
+- a target triple and a native tool chain for each target (including "virtual"
+targets like SPIR-V)
+- a SYCL offload action based on the generic offload action
+
+#### Enable SYCL offload
+
+To enable compilation with SYCL specification conformance, a special option
+must be passed to the clang driver:
+
+`-fsycl`
+
+With this option specified, the driver invokes the host SYCL compiler and a
+number of device compilers for the targets specified in the `-fsycl-targets`
+option. If `-fsycl-targets` is not specified, a single SPIR-V target is
+assumed, and a single device compiler for this target is invoked.
+
+#### Ahead of time (AOT) compilation
+
+Ahead-of-time compilation is the process of invoking the back-end at compile
+time to produce the final binary, as opposed to just-in-time (JIT) compilation,
+where final code generation is deferred until application run time.
+
+AOT compilation reduces application execution time by skipping JIT compilation,
+and the final device code can be tested before deployment.
+
+JIT compilation provides portability of device code and target-specific
+optimizations.
+
+#### List of native targets
+
+The ahead-of-time compilation mode implies that there must be a way to specify
+a set of target architectures for which to compile device code. By default, the
+compiler generates SPIR-V, and the OpenCL device JIT compiler produces the
+native target binary.
+
+There is an existing option for OpenMP* offload:
+
+`-fopenmp-targets=triple1,triple2`
+
+which produces binaries for the target architectures identified by target
+triples `triple1` and `triple2`.
+
+A similar approach is used for SYCL:
+
+`-fsycl-targets=triple1,triple2`
+
+will produce binaries from SYCL kernels for devices identified by the two
+target triples. This tells the driver which device compilers must be invoked
+to compile the SYCL kernel code. By default, the JIT compilation approach is
+assumed and device code is compiled for a single target triple -
+`[spir,spir64]-*-*`.
+
+#### Device code formats
+
+Each target may support a number of code forms. Each device compiler defines
+and understands mnemonics designating a particular code form; for example,
+"`visa:3.3`" could designate virtual ISA version 3.3 for the Intel GPU target
+(Gen architecture). The user can specify the desired code format using the
+target-specific option mechanism, similar to OpenMP.
+
+`-Xsycl-target= `
+
+For example, to support offload to CPU, FPGA, Gen9/vISA3.3, Gen9/SPIR-V the
+following options would be used:
+
+`-fsycl -fsycl-targets=x86,fpga,gen9 -Xsycl-target=gen9 "-fmt:visa -fmt:spirv"`
+
+The `` parameter is passed by the driver directly to the SYCL device
+compiler for the corresponding target without parsing it. For each target there
+is some default code form which is generated in the absence of overriding via
+the `-Xsycl-target` option.
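
The target selection rule described under "Enable SYCL offload" and "List of native targets" can be sketched as follows. This is a minimal illustration, not the driver's actual option handling: `selectSyclTargets` is a hypothetical helper, the string `"spir64"` merely stands in for the default SPIR-V triple, and returning an empty list when `-fsycl` is absent is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the device targets to compile for, given the driver arguments.
// Assumption: "spir64" is a placeholder for the default SPIR-V target triple.
std::vector<std::string>
selectSyclTargets(const std::vector<std::string> &Args) {
  const std::string Prefix = "-fsycl-targets=";
  bool SyclEnabled = false;
  std::vector<std::string> Targets;
  for (const std::string &Arg : Args) {
    if (Arg == "-fsycl") {
      SyclEnabled = true;
    } else if (Arg.compare(0, Prefix.size(), Prefix) == 0) {
      // Split the comma-separated list of triples, skipping empty entries.
      const std::string List = Arg.substr(Prefix.size());
      std::size_t Start = 0;
      while (Start <= List.size()) {
        std::size_t End = List.find(',', Start);
        if (End == std::string::npos)
          End = List.size();
        if (End > Start)
          Targets.push_back(List.substr(Start, End - Start));
        Start = End + 1;
      }
    }
  }
  if (!SyclEnabled)
    return {}; // Assumption: without -fsycl there is no device compilation.
  if (Targets.empty())
    Targets.push_back("spir64"); // Default: a single SPIR-V (JIT) target.
  return Targets;
}
```

For instance, passing `-fsycl -fsycl-targets=triple1,triple2` to this helper yields two targets, while `-fsycl` alone yields the single default target.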
+
+**TBD:** Having multiple code forms for the same target in the fat binary might
+mean invoking the device compiler multiple times. Multiple invocations are not
+needed if these forms can be dumped at various compilation stages by a single
+device compilation, like SPIR-V → visa → ISA. But if e.g. `gen9:visa3.2` and
+`gen9:visa3.3` are needed at the same time, then some mechanism is needed.
+Should it be a dedicated target triple for each needed visa version or Gen
+generation?
+
+#### Separate Compilation and Linking
+
+The compiler supports linking of device code obtained from different source
+files before generating the final SPIR-V to be fed to the back-end. The basic
+mechanism is to produce "fat objects" as a result of compilation (object files
+containing both host and device code for all targets), then break the fat
+objects into their constituents before linking, link host code and device code
+(per target) separately, and finally produce a "fat binary": a host executable
+with embedded linked images for each target specified on the command line.
+
+![Multi source compilation flow](Multi-source-compilation-flow.png)
+
+The clang driver orchestrates the compilation and linking process based on a
+SYCL-specific offload action builder and invokes external tools as needed. On
+the diagram above, every dark-blue box is a tool invoked as a separate process
+by the clang driver.
+
+Compilation starts with compiling the input source `a.cpp` for one of the
+targets requested via the command line - `T2`. When doing this first
+compilation, the driver requests the device compiler to generate an
+"integration header" via a special option. Device compilations for the other
+targets - `T1` - don't need to generate the integration header, as it must be
+the same for all the targets.
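
The rule that only the first device compilation produces the integration header can be sketched like this. The names `DeviceCompileStep` and `planDeviceCompiles` are hypothetical and exist only for illustration; the `<stem>_<target>.bin` naming follows the `a_T2.bin` / `a_T1.bin` examples used in this document, and the real driver builds offload actions rather than a plain list.

```cpp
#include <string>
#include <vector>

// One planned device compiler invocation. All names are illustrative.
struct DeviceCompileStep {
  std::string Target;
  std::string Output;         // e.g. "a_T2.bin"
  bool EmitIntegrationHeader; // requested only for the first target
};

// Plans device compilations for one source file (given by its stem, e.g. "a")
// in the order in which the targets were requested.
std::vector<DeviceCompileStep>
planDeviceCompiles(const std::string &Stem,
                   const std::vector<std::string> &Targets) {
  std::vector<DeviceCompileStep> Steps;
  for (const std::string &Target : Targets) {
    DeviceCompileStep Step;
    Step.Target = Target;
    Step.Output = Stem + "_" + Target + ".bin";
    // The header is identical for all targets, so generate it only once.
    Step.EmitIntegrationHeader = Steps.empty();
    Steps.push_back(Step);
  }
  return Steps;
}
```

With targets requested in the order `T2`, `T1`, the first step (for `T2`) requests the integration header and the second does not.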
+
+*Design note: The current design does not use the host compiler to produce the
+integration header for two reasons: first, it must be possible to use any host
+compiler to produce a SYCL heterogeneous application, and second, even if the
+same clang is used for the host compilation, the information provided in the
+integration header is used (included) by the SYCL runtime implementation, so it
+must be ready before host compilation starts.*
+
+After all the device compilations are completed, resulting in `a_T2.bin` and
+`a_T1.bin`, and the integration header `a.h` is generated, the driver invokes
+the host compiler, passing it the integration header via the `-include` option,
+to produce the host object `a.obj`. Then the offload bundler tool is invoked to
+pack `a_T2.bin`, `a_T1.bin` and `a.obj` into `a_fat.obj` - the fat object file
+for the source `a.cpp`.
+
+The compilation process is repeated for all the sources in the application
+(possibly on different machines).
+
+Device linking starts with breaking the fat objects back into constituents with
+the unbundler tool (the bundler invoked with the `-unbundle` option). For each
+fat object the unbundler produces a target list file which contains pairs
+"`, `" each representing a device object extracted
+from the fat object and its target. Once all the fat objects are unbundled, the
+driver uses the target list files to construct a list of targets available for
+linking and a list of corresponding object files for each: "`,
+, `". Then the driver invokes linkers for
+each of the targets to produce device binary images for those targets.
+
+*The current implementation uses LLVM IR as the default device binary format
+for `fat objects` and translates the "linked LLVM IR" to SPIR-V. One of the
+reasons for this decision is that SPIR-V doesn't support linking template
+functions, which could be defined in multiple modules, so that the linker must
+resolve multiple definitions.
+LLVM IR uses function attributes to satisfy the "one definition rule", which
+have no counterparts in SPIR-V.*
+
+Host linking starts after all device images are produced, with the invocation
+of the offload wrapper tool. Its main function is to create a host object file
+wrapping all the device images and to provide the runtime with access to that
+information. When creating the host wrapper object, the offload wrapper tool
+does the following:
+
+- creates a `.sycl_offloading.descriptor` symbol which is a structure
+containing the number of device images and the array of the device images
+themselves
+
+```C++
+struct __tgt_device_image {
+  void *ImageStart;
+  void *ImageEnd;
+};
+struct __tgt_bin_desc {
+  int32_t NumDeviceImages;
+  __tgt_device_image *DeviceImages;
+};
+__tgt_bin_desc .sycl_offloading.descriptor;
+```
+
+- creates a `void .sycl_offloading.descriptor_reg()` function and registers it
+for execution at module loading; this function invokes the `__tgt_register_lib`
+registration function (the name can also be specified via an option), which
+must be implemented by the runtime and which registers the device images with
+the runtime:
+
+```C++
+void __tgt_register_lib(__tgt_bin_desc *desc);
+```
+
+- creates a `void .sycl_offloading.descriptor_unreg()` function and registers
+it for execution at module unloading; this function calls the
+`__tgt_unregister_lib` function (the name can also be specified via an option),
+which must be implemented by the runtime and which unregisters the device
+images from the runtime:
+
+```C++
+void __tgt_unregister_lib(__tgt_bin_desc *desc);
+```
+
+Once the offload wrapper object file is ready, the driver finally invokes the
+host linker, giving it the following input:
+
+- all the application host objects (result of compilation or unbundling)
+- the offload wrapper object file
+- all the host libraries needed by the application
+- the SYCL runtime library
+
+The result is a so-called "fat binary image" containing the host code, code for
+all the targets plus the registration/unregistration functions and the
+information about the device binary images.
+
+When compilation and linking are done in a single compiler driver invocation,
+the bundling and unbundling steps are skipped.
+
+*Design note: the described scheme differs from the current llvm.org
+implementation. The llvm.org implementation uses a Linux-specific linker script
+approach and requires that all the linked fat objects are compiled for the same
+set of targets. The described design uses the OS-neutral offload-wrapper tool
+and does not impose restrictions on fat objects.*
+
+### Integration with SPIR-V format
+
+This section explains how to generate SPIR-V specific types and operations from
+C++ classes and functions.
+
+Translation of SYCL C++ programs to code executable on heterogeneous systems
+can be considered a three-step process:
+
+1) translation of SYCL C++ programs into LLVM IR
+1) translation from LLVM IR to SPIR-V
+1) translation from SPIR-V to machine code
+
+LLVM IR to SPIR-V translation is performed by a dedicated tool -
+[translator](https://github.com/KhronosGroup/SPIRV-LLVM-Translator).
+This tool correctly translates most regular LLVM IR types/operations/etc. to
+SPIR-V.
+
+For example:
+
+- Type: `i32` → `OpTypeInt`
+- Operation: `load` → `OpLoad`
+- Calls: `call` → `OpFunctionCall`
+
+SPIR-V defines special built-in types and operations that do not have
+corresponding equivalents in LLVM IR. E.g.
+
+- Type: ??? → `OpTypeEvent`
+- Operation: ??? → `OpGroupAsyncCopy`
+
+Translation from LLVM IR to SPIR-V for special types is also supported, but
+such LLVM IR must comply with some special requirements. Unfortunately, there
+is no canonical form of special built-in types and operations in LLVM IR;
+moreover, we can't re-use the existing representation generated by the OpenCL C
+front-end compiler. For instance, here is how the `OpGroupAsyncCopy` operation
+looks in LLVM IR produced by the OpenCL C front-end compiler.
+
+```LLVM
+@_Z21async_work_group_copyPU3AS3fPU3AS1Kfjj(float addrspace(3)*, float addrspace(1)*, i32, i32)
+```
+
+It's a regular function, which can conflict with user code produced from C++
+source.
+
+The SYCL compiler uses the solution developed for the OpenCL C++ compiler
+prototype:
+
+- Compiler: https://github.com/KhronosGroup/SPIR/tree/spirv-1.1
+- Headers: https://github.com/KhronosGroup/libclcxx
+
+SPIR-V types and operations that do not have LLVM equivalents are **declared**
+(but not defined) in the headers and satisfy the following requirements:
+
+- the type must be pre-declared as a C++ class in the `cl::__spirv` namespace
+- the type must not have an actual definition in the C++ program
+- the operation is expressed in C++ as an `extern` function not throwing C++
+  exceptions
+- the operation must not have an actual definition in the C++ program
+
+For example, the following C++ code is successfully recognized and translated
+into the SPIR-V operation `OpGroupAsyncCopy`:
+
+```C++
+namespace cl {
+  namespace __spirv {
+    // This class does not have a definition; it is only predeclared here.
+    // Pointers to objects of this class can be passed to or returned from
+    // SPIR-V built-in functions. Only in such cases is the class recognized
+    // as the SPIR-V type OpTypeEvent.
+    class OpTypeEvent;
+
+    template <typename dataT>
+    extern OpTypeEvent *OpGroupAsyncCopy(int32_t Scope, __local dataT *Dest,
+                                         __global dataT *Src, size_t NumElements,
+                                         size_t Stride, OpTypeEvent *E) noexcept;
+  } // namespace __spirv
+} // namespace cl
+
+cl::__spirv::OpTypeEvent *e =
+  cl::__spirv::OpGroupAsyncCopy(cl::__spirv::Scope::Workgroup,
+                                dst, src, numElements, 1, 0);
+```
+
+The OpenCL C++ compiler uses a special module pass in clang that transforms the
+names of C++ classes, globals and functions from the namespace `cl::__spirv::`
+to
+["SPIR-V representation in LLVM IR"](https://github.com/KhronosGroup/SPIRV-LLVM-Translator/blob/master/docs/SPIRVRepresentationInLLVM.rst)
+which is recognized by the LLVM IR to SPIR-V translator.
+
+In the OpenCL C++ prototype project the pass is located in the directory
+`lib/CodeGen/OclCxxRewrite`. The file with the pass is
+`lib/CodeGen/OclCxxRewrite/BifNameReflower.cpp`. The other files in
+`lib/CodeGen/OclCxxRewrite` are utility files implementing an Itanium demangler
+and other helper functionality.
+
+That LLVM module pass has been ported from the OpenCL C++ prototype to the SYCL
+compiler as is. It makes it possible to use simple declarations of C++ classes
+and external functions as if they were the SPIR-V specific types and operations.
+
+#### Some details and agreements on using SPIR-V special types and operations
+
+The SPIR-V specific C++ enumerators and classes are declared in the file:
+`sycl/include/CL/__spirv/spirv_types.hpp`.
+
+The SPIR-V specific C++ function declarations are in the file:
+`sycl/include/CL/__spirv/spirv_ops.hpp`.
+
+The SPIR-V specific functions are implemented for the SYCL host device in:
+`sycl/source/spirv_ops.cpp`.
+
+### Address spaces handling
+
+The SYCL 1.2.1 language defines several address spaces where data can reside,
+the same as OpenCL: global, local, private and constant. From the spec: "In
+OpenCL C, these address spaces are manually specified using OpenCL-specific
+keywords.
+In SYCL, the device compiler is expected to auto-deduce the address space for
+pointers in common situations of pointer usage. However, there are situations
+where auto-deduction is not possible".
+
+We believe that the requirement for the compiler to automatically deduce
+address spaces is too strong. Instead, the following approach will be
+implemented in the compiler:
+
+*TBD*
+
+### Compiler/Runtime interface
+
+## SYCL Runtime architecture
+
+*TBD*
+
+## Supported extensions
+
+- [Intel subgroups](extensions/sub_group_ndrange/sub_group_ndrange.md)
+
+## Unsupported extensions/proposals
+
+- [Ordered queue](extensions/ordered_queue/ordered_queue.adoc)
+- [Unified shared memory](extensions/usm/usm.adoc)
+
+\*Other names and brands may be claimed as the property of others.