Skip to content

Commit

Permalink
[SYCL][Doc] Update docs to reflect PI removal.
Browse files Browse the repository at this point in the history
  • Loading branch information
aarongreig committed Aug 13, 2024
1 parent eabbb15 commit 628ef9e
Show file tree
Hide file tree
Showing 25 changed files with 1,122 additions and 2,020 deletions.
24 changes: 12 additions & 12 deletions sycl/doc/EnvironmentVariables.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ compiler and runtime.
| `SYCL_ENABLE_DEFAULT_CONTEXTS` | '1' or '0' | Enable ('1') or disable ('0') creation of default platform contexts in SYCL runtime. The default context for each platform contains all devices in the platform. Refer to [Platform Default Contexts](extensions/supported/sycl_ext_oneapi_default_context.asciidoc) extension to learn more. Enabled by default on Linux and disabled on Windows. |
| `SYCL_RT_WARNING_LEVEL` | Positive integer | The higher warning level is used the more warnings and performance hints the runtime library may print. Default value is '0', which means no warning/hint messages from the runtime library are allowed. The value '1' enables performance warnings from device runtime/codegen. The values greater than 1 are reserved for future use. |
| `SYCL_USM_HOSTPTR_IMPORT` | Integer | Enable by specifying non-zero value. Buffers created with a host pointer will result in host data promotion to USM, improving data transfer performance. To use this feature, also set SYCL_HOST_UNIFIED_MEMORY=1. |
| `SYCL_EAGER_INIT` | Integer | Enable by specifying non-zero value. Tells the SYCL runtime to do as much as possible initialization at objects construction as opposed to doing lazy initialization on the fly. This may mean doing some redundant work at warmup but ensures fastest possible execution on the following hot and reportable paths. It also instructs PI plugins to do the same. Default is "0". |
| `SYCL_EAGER_INIT` | Integer | Enable by specifying non-zero value. Tells the SYCL runtime to do as much as possible initialization at objects construction as opposed to doing lazy initialization on the fly. This may mean doing some redundant work at warmup but ensures fastest possible execution on the following hot and reportable paths. It also instructs UR adapters to do the same. Default is "0". |
| `SYCL_REDUCTION_PREFERRED_WORKGROUP_SIZE` | See [below](#sycl_reduction_preferred_workgroup_size) | Controls the preferred work-group size of reductions. |
| `SYCL_ENABLE_FUSION_CACHING` | '1' or '0' | Enable ('1') or disable ('0') caching of JIT compilations for kernel fusion. Caching avoids repeatedly running the JIT compilation pipeline if the same sequence of kernels is fused multiple times. Default value is '1'. |
| `SYCL_JIT_AMDGCN_PTX_KERNELS` | '1' or '0' | Enable ('1') or disable ('0') JIT compilation of kernels. Only supported for Nvidia and AMD backends. Note, that it is required to have a valid binary for the desired backend (AMD or CUDA), that was compiled with `-fsycl-embed-ir` in order to use JIT-ing. When JIT-ing is enabled SYCL runtime will try to cache and reuse JIT-compiled kernels, furthermore if a kernel uses specialization constants the compiler will attempt to materialize the values in place, turning them to de-facto compile time constants. Default is '0'. |
Expand Down Expand Up @@ -152,23 +152,23 @@ For a description of parallel for range rounding in DPC++ see
| | | `MinRangeX`: The minimum X dimension of the range such that range rounding is activated (Default 1024) |


## Controlling DPC++ Level Zero Plugin
## Controlling DPC++ Level Zero adapter

| Environment variable | Values | Description |
| -------------------- | ------ | ----------- |
| `SYCL_ENABLE_PCI` (Deprecated) | Integer | When set to 1, enables obtaining the GPU PCI address when using the Level Zero backend. The default is 1. This option is kept for compatibility reasons and is immediately deprecated. |
| `SYCL_PI_LEVEL_ZERO_DISABLE_USM_ALLOCATOR` | Any(\*) | Disable USM allocator in Level Zero plugin (each memory request will go directly to Level Zero runtime) |
| `SYCL_PI_LEVEL_ZERO_TRACK_INDIRECT_ACCESS_MEMORY` | Any(\*) | Enable support of the kernels with indirect access and corresponding deferred release of memory allocations in the Level Zero plugin. |
| `SYCL_PI_LEVEL_ZERO_DISABLE_USM_ALLOCATOR` | Any(\*) | Disable USM allocator in Level Zero adapter (each memory request will go directly to Level Zero runtime) |
| `SYCL_PI_LEVEL_ZERO_TRACK_INDIRECT_ACCESS_MEMORY` | Any(\*) | Enable support of the kernels with indirect access and corresponding deferred release of memory allocations in the Level Zero adapter. |

`(*) Note: Any means this environment variable is effective when set to any non-null value.`

## Controlling DPC++ CUDA Plugin
## Controlling DPC++ CUDA adapter

| Environment variable | Values | Description |
| -------------------- | ------ | ----------- |
| `SYCL_PI_CUDA_MAX_LOCAL_MEM_SIZE` | Integer | Specifies the maximum size of a local memory allocation in bytes. If the value exceeds the device's capabilities then a `sycl::runtime_error` is thrown. In order for the full error message to be printed, `SYCL_RT_WARNING_LEVEL=2` must be set. The default value for `SYCL_PI_CUDA_MAX_LOCAL_MEM_SIZE` is determined by the hardware. |

## Controlling DPC++ HIP Plugin
## Controlling DPC++ HIP adapter

| Environment variable | Values | Description |
| -------------------- | ------ | ----------- |
Expand Down Expand Up @@ -231,7 +231,7 @@ variables in production code.</span>
| after_addHostAcc | print graph after addHostAccessor method |
| always | print graph before and after each of the above methods |

## Debugging variables for Level Zero Plugin
## Debugging variables for Level Zero adapter

:warning: **Warning:** <span style="color:red">the environment variables
described below are used for development and debugging of DPC++ compiler
Expand All @@ -240,15 +240,15 @@ variables in production code.</span>

| Environment variable | Values | Description |
| -------------------- | ------ | ----------- |
| `SYCL_PI_LEVEL_ZERO_SINGLE_THREAD_MODE` | Integer | A single-threaded app has an opportunity to enable this mode to avoid overhead from mutex locking in the Level Zero plugin. A value greater than 0 enables single thread mode. A value of 0 disables single thread mode. The default is 0. |
| `SYCL_PI_LEVEL_ZERO_SINGLE_THREAD_MODE` | Integer | A single-threaded app has an opportunity to enable this mode to avoid overhead from mutex locking in the Level Zero adapter. A value greater than 0 enables single thread mode. A value of 0 disables single thread mode. The default is 0. |
| `SYCL_PI_LEVEL_ZERO_USM_ALLOCATOR` | [EnableBuffers][;[MaxPoolSize][;[host\|device\|shared:][MaxPoolableSize][,[Capacity][,SlabMinSize]]]...] | EnableBuffers enables pooling for SYCL buffers, default 1, set to 0 to disable. MaxPoolSize is the maximum size of the pool, by default there is no size limit. MemType is host, device, shared or read_only_shared. Other parameters are values specified as positive integers with optional K, M or G suffix. MaxPoolableSize is the maximum allocation size that may be pooled, default 0 for shared, 2MB for host, 4MB for device and read_only_shared. Capacity is the number of allocations in each size range freed by the program but retained in the pool for reallocation, default 4. Size ranges follow this pattern: 64, 96, 128, 192, and so on, i.e., powers of 2, with one range in between. SlabMinSize is the minimum allocation size, 64KB for host and device, 2MB for shared and read_only_shared. Example: SYCL_PI_LEVEL_ZERO_USM_ALLOCATOR=1;32M;host:1M,4,64K;device:1M,4,64K;shared:0,0,2M|
| `SYCL_PI_LEVEL_ZERO_BATCH_SIZE` | Integer | Sets a preferred number of compute commands to batch into a command list before executing the command list. A value of 0 causes the batch size to be adjusted dynamically. A value greater than 0 specifies fixed size batching, with the batch size set to the specified value. The default is 0. |
| `SYCL_PI_LEVEL_ZERO_COPY_BATCH_SIZE` | Integer | Sets a preferred number of copy commands to batch into a command list before executing the command list. A value of 0 causes the batch size to be adjusted dynamically. A value greater than 0 specifies fixed size batching, with the batch size set to the specified value. The default is 0. |
| `SYCL_PI_LEVEL_ZERO_FILTER_EVENT_WAIT_LIST` | Integer | When set to 0, disables filtering of signaled events from wait lists when using the Level Zero backend. The default is 0. |
| `SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE` | Any(\*) | This environment variable enables users to control use of copy engines for copy operations. If the value is an integer, it will allow the use of copy engines, if available in the device, in Level Zero plugin to transfer SYCL buffer or image data between the host and/or device(s) and to fill SYCL buffer or image data in device or shared memory. The value of this environment variable can also be a pair of the form "lower_index:upper_index" where the indices point to copy engines in a list of all available copy engines. The default is 0:0 when immediate command lists are being used on the device and 1 otherwise. (Also see description of SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS). |
| `SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE` | Any(\*) | This environment variable enables users to control use of copy engines for copy operations. If the value is an integer, it will allow the use of copy engines, if available in the device, in Level Zero adapter to transfer SYCL buffer or image data between the host and/or device(s) and to fill SYCL buffer or image data in device or shared memory. The value of this environment variable can also be a pair of the form "lower_index:upper_index" where the indices point to copy engines in a list of all available copy engines. The default is 0:0 when immediate command lists are being used on the device and 1 otherwise. (Also see description of SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS). |
| `SYCL_PI_LEVEL_ZERO_USE_COMPUTE_ENGINE` | Integer | It can be set to an integer (>=0) in which case all compute commands will be submitted to the command-queue with the given index in the compute command group. If it is instead set to a negative value then all available compute engines may be used. The default value is "0" |
| `SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE_FOR_D2D_COPY` (experimental) | Integer | Allows the use of copy engine, if available in the device, in Level Zero plugin for device to device copy operations. The default is 0. This option is experimental and will be removed once heuristics are added to make a decision about use of copy engine for device to device copy operations. |
| `SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS` | Any(\*) | Enable support of device-scope events whose state is not visible to the host. If enabled mode is SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS=1 the Level Zero plugin would create all events having device-scope only and create proxy host-visible events for them when their status is needed (wait/query) on the host. If enabled mode is SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS=2 the Level Zero plugin would create all events having device-scope and add proxy host-visible event at the end of each command-list submission. The default is 0, meaning all events have host visibility. SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS is ignored when using immediate command lists (SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS = 1) and all events use default scope of 0. |
| `SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE_FOR_D2D_COPY` (experimental) | Integer | Allows the use of copy engine, if available in the device, in Level Zero adapter for device to device copy operations. The default is 0. This option is experimental and will be removed once heuristics are added to make a decision about use of copy engine for device to device copy operations. |
| `SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS` | Any(\*) | Enable support of device-scope events whose state is not visible to the host. If enabled mode is SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS=1 the Level Zero adapter would create all events having device-scope only and create proxy host-visible events for them when their status is needed (wait/query) on the host. If enabled mode is SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS=2 the Level Zero adapter would create all events having device-scope and add proxy host-visible event at the end of each command-list submission. The default is 0, meaning all events have host visibility. SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS is ignored when using immediate command lists (SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS = 1) and all events use default scope of 0. |
| `SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS` | Integer | When set to a positive value enables use of Level Zero immediate commandlists, which means there is no batching and all commands are immediately submitted for execution. When set to 1, unique immediate commandlists are created for each SYCL queue. When set to 2, unique immediate commandlists are created per host thread per SYCL queue. Default is 1 on Intel® Data Center GPU Max Series running Linux and 0 elsewhere. |
| `SYCL_PI_LEVEL_ZERO_USE_MULTIPLE_COMMANDLIST_BARRIERS` | Integer | When set to a positive value enables use of multiple Level Zero commandlists when submitting barriers. Default is 1. |
| `SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE_FOR_FILL` | Integer | When set to a positive value enables use of a copy engine for memory fill operations. Default is 0. |
Expand All @@ -260,7 +260,7 @@ variables in production code.</span>
| `SYCL_PI_LEVEL_ZERO_USM_RESIDENT` | Integer | Bit-mask controls if/where to make USM allocations resident at the time of allocation. Input value is of the form 0xHSD, where 4-bits of D control device allocations, 4-bits of S control shared allocations, and 4-bits of H control host allocations. Each 4-bit component is holding one of the following values: "0" - then no special residency is forced, "1" - then allocation is made resident at the device of allocation, or "2" - then allocation is made resident on all devices in the context of allocation that have P2P access to the device of allocation. Default is 0x002, i.e. force full residency for device allocations only. |
| `SYCL_PI_LEVEL_ZERO_USE_NATIVE_USM_MEMCPY2D` | Integer | When set to a positive value enables the use of Level Zero USM 2D memory copy operations. Default is 0. |

## Debugging variables for CUDA Plugin
## Debugging variables for CUDA adapter

:warning: **Warning:** <span style="color:red">the environment variables
described below are used for development and debugging of DPC++ compiler
Expand Down
4 changes: 2 additions & 2 deletions sycl/doc/FAQ.md
Original file line number Diff line number Diff line change
Expand Up @@ -138,8 +138,8 @@ OpenCL 2.1, so any device, capable of OpenCL 2.1, should be supported.
Otherwise, your OpenCL device must support `cl_khr_il_program` extension.

Furthermore, developers can extend capabilities of the DPC++ Runtime to
non-OpenCL devices by writing correspondent plugins. To learn more, please
check out our [Plugin Interface Guide](design/PluginInterface.md).
non-OpenCL devices by writing correspondent adapters. To learn more, please
check out the [Unified Runtime project](https://github.com/oneapi-src/unified-runtime).

### Q: DPC++ applications hang on Intel GPUs while working well on other devices
**A:** One of the common reasons is Intel GPUs feature called "hang check".
Expand Down
22 changes: 10 additions & 12 deletions sycl/doc/GetStartedGuide.md
Original file line number Diff line number Diff line change
Expand Up @@ -255,24 +255,22 @@ instruction on how to install this refer to

The DPC++ build assumes that ROCm is installed in `/opt/rocm`, if it is
installed somewhere else, the directory must be provided through the CMake
variable `SYCL_BUILD_PI_HIP_ROCM_DIR` which can be passed using the
variable `UR_HIP_ROCM_DIR` which can be passed using the
`--cmake-opt` option of `configure.py` as follows:

```sh
python $DPCPP_HOME/llvm/buildbot/configure.py --hip \
--cmake-opt=-DSYCL_BUILD_PI_HIP_ROCM_DIR=/usr/local/rocm
--cmake-opt=-DUR_HIP_ROCM_DIR=/usr/local/rocm
```
If further customization is required — for instance when the layout of
individual directories can not be inferred from `SYCL_BUILD_PI_HIP_ROCM_DIR`
individual directories can not be inferred from `UR_HIP_ROCM_DIR`
it is possible to specify the location of HIP include, HSA include and HIP
library directories, using the following CMake variables:
* `SYCL_BUILD_PI_HIP_INCLUDE_DIR`,
* `SYCL_BUILD_PI_HIP_HSA_INCLUDE_DIR`,
* `SYCL_BUILD_PI_HIP_LIB_DIR`.
Please note that a similar customization would also be required for Unified
Runtime, see [the list of options provided by its
CMake](https://github.com/oneapi-src/unified-runtime#cmake-standard-options)
for details.
* `UR_HIP_INCLUDE_DIR`,
* `UR_HIP_HSA_INCLUDE_DIR`,
* `UR_HIP_LIB_DIR`.
These options are all passed through to Unified Runtime, more detail about them
can be found [here](https://github.com/oneapi-src/unified-runtime#cmake-standard-options).

[LLD](https://llvm.org/docs/AMDGPUUsage.html) is necessary for the AMDGPU
compilation chain. The AMDGPU backend generates a standard ELF relocatable code
Expand Down Expand Up @@ -342,7 +340,7 @@ under the hood to provide implementations of various SYCL backends. By default
the source code for the Unified Runtime will be acquired using CMake's
[FetchCotent](https://cmake.org/cmake/help/latest/module/FetchContent.html). The
specific repository URL and revision tag used can be found in the file
`sycl/plugins/unified_runtime/CMakeLists.txt` searching for the variables
`sycl/cmake/modules/FetchUnifiedRuntime.cmake` searching for the variables
`UNIFIED_RUNTIME_REPO` and `UNIFIED_RUNTIME_TAG`.

In order to enable developers, a number of CMake variables are available to
Expand All @@ -360,7 +358,7 @@ control which revision of Unified Runtime should be used when building DPC++:
* The path specified by variable `SYCL_UR_SOURCE_DIR` will be used with
`add_directory()`. This can be used to point at an adjacent directory
containing a clone of the Unified Runtime repository.
* The path `sycl/plugins/unified_runtime/unified-runtime` will be used, if it
* The path `sycl/unified-runtime` will be used, if it
exists. This can be used as-if an in-tree build.
* `SYCL_UR_SOURCE_DIR` is a variable used to specify the path to the Unified
Runtime repository when `SYCL_UR_USE_FETCH_CONTENT` is set of `OFF`.
Expand Down
Loading

0 comments on commit 628ef9e

Please sign in to comment.