diff --git a/openmp/docs/OpenMP.md b/openmp/docs/OpenMP.md index 769bac8f7bce5..3071f27f97492 100644 --- a/openmp/docs/OpenMP.md +++ b/openmp/docs/OpenMP.md @@ -21,7 +21,6 @@ this ROCm release. See the list of supported GPUs for {doc}`Linux + + +.. note:: + +The compiler also accepts the alternative offloading notation: + +.. code-block:: bash + + -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march= + + +Obtain the value of `gpu-arch` by running the following command: + +.. code-block:: bash + + % /opt/rocm-{version}/bin/rocminfo | grep gfx + + +[//]: # (dated link below, needs updating) + +See the complete list of compiler command-line references `here `_. + diff --git a/openmp/docs/how-to/use-rocprof.rst b/openmp/docs/how-to/use-rocprof.rst new file mode 100644 index 0000000000000..518d96e538647 --- /dev/null +++ b/openmp/docs/how-to/use-rocprof.rst @@ -0,0 +1,35 @@ +.. meta:: + :description: Install OpenMP + :keywords: install, openmp, llvm, aomp, AMD, ROCm + + +Using `rocprof` with OpenMP +----------------------------- + +The following steps describe a typical workflow for using `rocprof` with OpenMP code compiled with AOMP: + +1. Run `rocprof` with the program command line: + + .. code-block:: bash + + % rocprof + + + This produces a `results.csv` file in the user’s current directory that shows basic stats such as kernel names, grid size, number of registers used etc. The user can choose to specify the preferred output file name using the + o option. + +2. Add options for a detailed result: + + .. code-block:: bash + + --stats: % rocprof --stats + + + The stats option produces timestamps for the kernels. Look into the output CSV file for the field, `DurationNs`, which is useful in getting an understanding of the critical kernels in the code. + + Apart from `--stats`, the option `--timestamp` on produces a timestamp for the kernels. + +3. After learning about the required kernels, the user can take a detailed look at each one of them. `rocprof` has support for hardware counters: a set of basic and a set of derived ones. See the complete list of counters using + options --list-basic and --list-derived. `rocprof` accepts either a text or an XML file as an input. + +For more details on `rocprof`, refer to the {doc}`ROCProfilerV1 User Manual `. diff --git a/openmp/docs/how-to/use-tracing-options.md b/openmp/docs/how-to/use-tracing-options.md new file mode 100644 index 0000000000000..33c170520aeb5 --- /dev/null +++ b/openmp/docs/how-to/use-tracing-options.md @@ -0,0 +1,39 @@ + + +### Using tracing options + +#### Prerequisite + +When using the `--sys-trace` option, compile the OpenMP program with: + +```bash + + -Wl,-rpath,/opt/rocm-{version}/lib -lamdhip64 + +``` + +The following tracing options are widely used to generate useful information: + +* **`--hsa-trace`**: This option is used to get a JSON output file with the HSA API execution traces and a flat profile in a CSV file. + +* **`--sys-trace`**: This allows programmers to trace both HIP and HSA calls. Since this option results in loading ``libamdhip64.so``, follow the + prerequisite as mentioned above. + +A CSV and a JSON file are produced by the above trace options. The CSV file presents the data in a tabular format, and the JSON file can be visualized using +Google Chrome at chrome://tracing/ or [Perfetto](https://perfetto.dev/). Navigate to Chrome or Perfetto and load the JSON file to see the timeline of the +HSA calls. + +For more details on tracing, refer to the {doc}`ROCProfilerV1 User Manual `. + +### Environment variables + +| Environment Variable | Purpose | +| --------------------------- | ---------------------------- | +| `OMP_NUM_TEAMS` | To set the number of teams for kernel launch, which is otherwise chosen by the implementation by default. You can set this number (subject to implementation limits) for performance tuning. | +| `LIBOMPTARGET_KERNEL_TRACE` | To print useful statistics for device operations. Setting it to 1 and running the program emits the name of every kernel launched, the number of teams and threads used, and the corresponding register usage. Setting it to 2 additionally emits timing information for kernel launches and data transfer operations between the host and the device. | +| `LIBOMPTARGET_INFO` | To print informational messages from the device runtime as the program executes. Setting it to a value of 1 or higher, prints fine-grain information and setting it to -1 prints complete information. | +| `LIBOMPTARGET_DEBUG` | To get detailed debugging information about data transfer operations and kernel launch when using a debug version of the device library. Set this environment variable to 1 to get the detailed information from the library. | +| `GPU_MAX_HW_QUEUES` | To set the number of HSA queues in the OpenMP runtime. The HSA queues are created on demand up to the maximum value as supplied here. The queue creation starts with a single initialized queue to avoid unnecessary allocation of resources. The provided value is capped if it exceeds the recommended, device-specific value. | +| `LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES` | To set the threshold size up to which data transfers are initiated asynchronously. The default threshold size is 1*1024*1024 bytes (1MB). | +| `OMPX_FORCE_SYNC_REGIONS` | To force the runtime to execute all operations synchronously, i.e., wait for an operation to complete immediately. This affects data transfers and kernel execution. While it is mainly designed for debugging, it may have a minor positive effect on performance in certain situations. | +::: diff --git a/openmp/docs/index.rst b/openmp/docs/index.rst index 5d39a1c8f080b..f53ad38ed196a 100644 --- a/openmp/docs/index.rst +++ b/openmp/docs/index.rst @@ -1,139 +1,45 @@ -.. title:: Welcome to the documentation of OpenMP in LLVM! +.. meta:: + :description: OpenMP + :keywords: install, openmp, llvm, aomp, AMD, ROCm -.. note:: - This document is a work in progress and most of the expected content is not - yet available. While you can expect changes, we always welcome feedback and - additions. Please post on the `Discourse forums (Runtimes - - OpenMP) `__.. -.. toctree:: - :hidden: - :maxdepth: 1 +The ROCm™ installation includes an LLVM-based implementation that fully supports the OpenMP 4.5 standard and a subset of OpenMP 5.0, 5.1, and 5.2 standards. Fortran, C/C++ compilers, and corresponding runtime libraries are included. +Along with host APIs, the OpenMP compilers support offloading code and data onto GPU devices. This document briefly describes the installation location of the OpenMP toolchain, example usage of device offloading, and usage of `rocprof` with OpenMP applications. The GPUs supported are the same as those supported by this ROCm release. See the list of supported GPUs for {doc}`Linux` and {doc}`Windows`. - LLVM/OpenMP Documentation +The ROCm OpenMP compiler is implemented using LLVM compiler technology. The following image illustrates the internal steps taken to translate a user’s application into an executable that can offload computation to the AMDGPU. The compilation is a two-pass process. Pass 1 compiles the application to generate the CPU code and Pass 2 links the CPU code to the AMDGPU device code. +You can access code on the `GitHub repository `_. -LLVM/OpenMP Design & Overview -============================= +.. grid:: 2 + :gutter: 3 -OpenMP impacts various parts of the LLVM project, from the frontends (`Clang -`_ and Flang), through -middle-end :ref:`optimizations `, up to the -multitude of available :ref:`OpenMP runtimes `. + .. grid-item-card:: Install -A high-level overview of OpenMP in LLVM can be found :doc:`here `. + * :doc:`OpenMP installation <./install/install>` + * :doc:`Build OpenMP <./install/build>` -.. toctree:: - :hidden: - :maxdepth: 1 + .. grid-item-card:: Conceptual - design/Overview + * :doc:`OpenMP features <./conceptual/openmp-features>` -OpenACC Support -=============== + .. grid-item-card:: How to -:doc:`OpenACC support ` is under development for -both Flang and Clang. For this purpose, LLVM's OpenMP runtimes are -being extended to serve as OpenACC runtimes. In some cases, Clang -supports :doc:`OpenMP extensions ` to make -the additional functionality also available in OpenMP applications. + * :doc:`` + * :doc:`` + * :doc:`` -.. toctree:: - :hidden: - :maxdepth: 1 + .. grid-item-card:: Reference - openacc/Overview + * :doc:`OpenMP API specification for parallel programming <./reference/api>` + * :doc:`Command line argument reference <./reference/CommandLineArgumentReference>` + * :doc:`OpenMP FAQ <./reference/faq>` + + .. grid-item-card:: Tutorials -LLVM/OpenMP Optimizations -========================= + * `AOMP samples `_ -LLVM, since `version 11 `_ (12 Oct -2020), has an :doc:`OpenMP-Aware optimization pass ` -as well as the ability to :doc:`perform "scalar optimizations" across OpenMP region -boundaries `. +To contribute to the documentation, refer to +`Contributing to ROCm `_. -In-depth discussion of the topic can be found :doc:`here `. - -.. toctree:: - :hidden: - :maxdepth: 1 - - optimizations/Overview - -LLVM/OpenMP Optimization Remarks -================================ - -LLVM has an elaborate ecosystem around `analysis and optimization remarks -`_ issues during -compilation. The remarks can be enabled from the clang frontend `[1]`_ `[2]`_ -in various formats `[3]`_ `[4]`_ to be used by tools, i.a., `opt-viewer` or -`llvm-opt-report` (dated). - -The OpenMP optimizations in LLVM have been developed with remark support as a -priority. For a list of OpenMP specific remarks and more information on them, -please refer to :doc:`remarks/OptimizationRemarks`. - - -.. _`[1]`: https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports -.. _`[2]`: https://clang.llvm.org/docs/ClangCommandLineReference.html#diagnostic-flags -.. _`[3]`: https://clang.llvm.org/docs/ClangCommandLineReference.html#cmdoption-clang-foptimization-record-file -.. _`[4]`: https://clang.llvm.org/docs/ClangCommandLineReference.html#cmdoption-clang1-fsave-optimization-record - -+ `[1]`_ https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports -+ `[2]`_ https://clang.llvm.org/docs/ClangCommandLineReference.html#diagnostic-flags -+ `[3]`_ https://clang.llvm.org/docs/ClangCommandLineReference.html#cmdoption-clang-foptimization-record-file -+ `[4]`_ https://clang.llvm.org/docs/ClangCommandLineReference.html#cmdoption-clang1-fsave-optimization-record - - -.. toctree:: - :hidden: - :maxdepth: 1 - - remarks/OptimizationRemarks - -OpenMP Command-Line Argument Reference -====================================== -In addition to the -`Clang command-line argument reference `_ -we also recommend the OpenMP -:doc:`command-line argument reference ` -page that offers a detailed overview of options specific to OpenMP. It also -contains a list of OpenMP offloading related command-line arguments. - - -.. toctree:: - :hidden: - :maxdepth: 1 - - CommandLineArgumentReference - -Support, Getting Involved, and Frequently Asked Questions (FAQ) -=============================================================== - -Dealing with OpenMP can be complicated. For help with the setup of an OpenMP -(offload) capable compiler toolchain, its usage, and common problems, consult -the :doc:`Support and FAQ ` page. - -We also encourage everyone interested in OpenMP in LLVM to :doc:`get involved -`. - - -.. toctree:: - :hidden: - :maxdepth: 1 - - SupportAndFAQ - -Release Notes -============= - -The current (in-progress) release notes can be found :doc:`here ` while -release notes for releases, starting with LLVM 12, will be available on `the -Download Page `_. - - -.. toctree:: - :hidden: - :maxdepth: 1 - - In-Progress ReleaseNotes +You can find licensing information on the +`Licensing `_ page. diff --git a/openmp/docs/install/build.rst b/openmp/docs/install/build.rst new file mode 100644 index 0000000000000..0150c54838dc0 --- /dev/null +++ b/openmp/docs/install/build.rst @@ -0,0 +1,338 @@ +.. meta:: + :description: Install OpenMP + :keywords: install, openmp, llvm, aomp, AMD, ROCm + + +======================================== +How to Build the LLVM* OpenMP* Libraries +======================================== +This repository requires `CMake `_ v2.8.0 or later. LLVM and Clang need a more recent version which also applies for in-tree builds. For +more information than available in this document, see `LLVM's CMake documentation `_ and the +`official documentation `_. + +.. contents:: + :local: + +How to Call CMake Initially, then Repeatedly +============================================ +- When calling CMake for the first time, all needed compiler options must be specified on the command line. After this initial call to CMake, the compiler + definitions must not be included for further calls to CMake. Other options can be specified on the command line multiple times including all definitions + in the build options section below. + +- Example of configuring, building, reconfiguring, rebuilding: + + .. code-block:: console + + $ mkdir build + $ cd build + $ cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ .. # Initial configuration + $ make + ... + $ make clean + $ cmake -DCMAKE_BUILD_TYPE=Debug .. # Second configuration + $ make + ... + $ rm -rf * + $ cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ .. # Third configuration + $ make + +- Notice in the example how the compiler definitions are only specified for an empty build directory, but other build options are used at any time. + +- The file ``CMakeCache.txt`` which is created after the first call to CMake is a configuration file which holds all values for the build options. These + values can be changed using a text editor to modify ``CMakeCache.txt`` as opposed to using definitions on the command line. + +- To have CMake create a particular type of build generator file simply include the ``-G `` option: + + .. code-block:: console + + $ cmake -G "Unix Makefiles" ... + + You can see a list of generators CMake supports by executing the cmake command with no arguments. + +Instructions to Build +===================== +.. code-block:: console + + $ cd openmp_top_level/ [ this directory with libomptarget/, runtime/, etc. ] + $ mkdir build + $ cd build + + [ Unix* Libraries ] + $ cmake -DCMAKE_C_COMPILER= -DCMAKE_CXX_COMPILER= .. + + [ Windows* Libraries ] + $ cmake -G -DCMAKE_C_COMPILER= -DCMAKE_CXX_COMPILER= -DCMAKE_ASM_MASM_COMPILER=[ml | ml64] -DCMAKE_BUILD_TYPE=Release .. + + $ make + $ make install + +CMake Options +============= +Builds with CMake can be customized by means of options as already seen above. One possibility is to pass them via the command line: + +.. code-block:: console + + $ cmake -DOPTION= path/to/source + +.. note:: The first value listed is the respective default for that option. + +Generic Options +--------------- +For full documentation consult the CMake manual or execute ``cmake --help-variable VARIABLE_NAME`` to get information about a specific +variable. + +**CMAKE_BUILD_TYPE** = ``Release|Debug|RelWithDebInfo`` + + Build type can be ``Release``, ``Debug``, or ``RelWithDebInfo`` which chooses + the optimization level and presence of debugging symbols. + +**CMAKE_C_COMPILER** = + Specify the C compiler. + +**CMAKE_CXX_COMPILER** = + Specify the C++ compiler. + +**CMAKE_Fortran_COMPILER** = + Specify the Fortran compiler. This option is only needed when + **LIBOMP_FORTRAN_MODULES** is ``ON`` (see below). So typically, a Fortran + compiler is not needed during the build. + +**CMAKE_ASM_MASM_COMPILER** = ``ml|ml64`` + This option is only relevant for Windows*. + +Options for all Libraries +------------------------- + +**OPENMP_ENABLE_WERROR** = ``OFF|ON`` + Treat warnings as errors and fail, if a compiler warning is triggered. + +**OPENMP_LIBDIR_SUFFIX** = ``""`` + Extra suffix to append to the directory where libraries are to be installed. + +**OPENMP_TEST_C_COMPILER** = ``${CMAKE_C_COMPILER}`` + Compiler to use for testing. Defaults to the compiler that was also used for + building. + +**OPENMP_TEST_CXX_COMPILER** = ``${CMAKE_CXX_COMPILER}`` + Compiler to use for testing. Defaults to the compiler that was also used for + building. + +**OPENMP_TEST_Fortran_COMPILER** = ``${CMAKE_Fortran_COMPILER}`` + Compiler to use for testing. Defaults to the compiler that was also used for + building. Will default to flang if build is in-tree. + +**OPENMP_LLVM_TOOLS_DIR** = ``/path/to/built/llvm/tools`` + Additional path to search for LLVM tools needed by tests. + +**OPENMP_LLVM_LIT_EXECUTABLE** = ``/path/to/llvm-lit`` + Specify full path to ``llvm-lit`` executable for running tests. The default + is to search the ``PATH`` and the directory in **OPENMP_LLVM_TOOLS_DIR**. + +**OPENMP_FILECHECK_EXECUTABLE** = ``/path/to/FileCheck`` + Specify full path to ``FileCheck`` executable for running tests. The default + is to search the ``PATH`` and the directory in **OPENMP_LLVM_TOOLS_DIR**. + +**OPENMP_NOT_EXECUTABLE** = ``/path/to/not`` + Specify full path to ``not`` executable for running tests. The default + is to search the ``PATH`` and the directory in **OPENMP_LLVM_TOOLS_DIR**. + +Options for ``libomp`` +---------------------- + +**LIBOMP_ARCH** = ``aarch64|aarch64_32|arm|i386|loongarch64|mic|mips|mips64|ppc64|ppc64le|x86_64|riscv64|s390x`` + The default value for this option is chosen based on probing the compiler for + architecture macros (e.g., is ``__x86_64__`` predefined by compiler?). + +**LIBOMP_MIC_ARCH** = ``knc|knf`` + Intel(R) Many Integrated Core Architecture (Intel(R) MIC Architecture) to + build for. This value is ignored if **LIBOMP_ARCH** does not equal ``mic``. + +**LIBOMP_LIB_TYPE** = ``normal|profile|stubs`` + Library type can be ``normal``, ``profile``, or ``stubs``. + +**LIBOMP_USE_VERSION_SYMBOLS** = ``ON|OFF`` + Use versioned symbols for building the library. This option only makes sense + for ELF based libraries where version symbols are supported (Linux*, some BSD* + variants). It is ``OFF`` by default for Windows* and macOS*, but ``ON`` for + other Unix based operating systems. + +**LIBOMP_ENABLE_SHARED** = ``ON|OFF`` + Build a shared library. If this option is ``OFF``, static OpenMP libraries + will be built instead of dynamic ones. + + .. note:: + + Static libraries are not supported on Windows*. + +**LIBOMP_FORTRAN_MODULES** = ``OFF|ON`` + Create the Fortran modules (requires Fortran compiler). + +macOS* Fat Libraries +"""""""""""""""""""" +On macOS* machines, it is possible to build universal (or fat) libraries, which include both i386 and x86_64 architecture objects in a single archive. + +.. code-block:: console + + $ cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_OSX_ARCHITECTURES='i386;x86_64' .. + $ make + +There is also an option **LIBOMP_OSX_ARCHITECTURES** which can be set in case +this is an LLVM source tree build. It will only apply for the ``libomp`` library +avoids having the entire LLVM/Clang build produce universal binaries. + +Optional Features +""""""""""""""""" + +**LIBOMP_USE_ADAPTIVE_LOCKS** = ``ON|OFF`` + Include adaptive locks, based on Intel(R) Transactional Synchronization Extensions (Intel(R) TSX). This feature is x86 specific and turned ``ON`` + by default for IA-32 architecture and Intel(R) 64 architecture. + +**LIBOMP_USE_INTERNODE_ALIGNMENT** = ``OFF|ON`` + Align certain data structures on 4096-byte. This option is useful on multi-node systems where a small ``CACHE_LINE`` setting leads to false sharing. + +**LIBOMP_OMPT_SUPPORT** = ``ON|OFF`` + Include support for the OpenMP Tools Interface (OMPT). + This option is supported and ``ON`` by default for x86, x86_64, AArch64, PPC64, RISCV64, LoongArch64, and s390x on Linux* and macOS*. + This option is ``OFF`` if this feature is not supported for the platform. + +**LIBOMP_OMPT_OPTIONAL** = ``ON|OFF`` + Include support for optional OMPT functionality. This option is ignored if + **LIBOMP_OMPT_SUPPORT** is ``OFF``. + +**LIBOMP_STATS** = ``OFF|ON`` + Include stats-gathering code. + +**LIBOMP_USE_DEBUGGER** = ``OFF|ON`` + Include the friendly debugger interface. + +**LIBOMP_USE_HWLOC** = ``OFF|ON`` + Use `OpenMPI's hwloc library `_ for + topology detection and affinity. + +**LIBOMP_HWLOC_INSTALL_DIR** = ``/path/to/hwloc/install/dir`` + Specify install location of hwloc. The configuration system will look for ``hwloc.h`` in ``${LIBOMP_HWLOC_INSTALL_DIR}/include`` and the library in + ``${LIBOMP_HWLOC_INSTALL_DIR}/lib``. The default is ``/usr/local``. This option is only used if **LIBOMP_USE_HWLOC** is ``ON``. + +Additional Compiler Flags +""""""""""""""""""""""""" + +These flags are **appended**, they do not overwrite any of the preset flags. + +**LIBOMP_CPPFLAGS** = + Additional C preprocessor flags. + +**LIBOMP_CXXFLAGS** = + Additional C++ compiler flags. + +**LIBOMP_ASMFLAGS** = + Additional assembler flags. + +**LIBOMP_LDFLAGS** = + Additional linker flags. + +**LIBOMP_LIBFLAGS** = + Additional libraries to link. + +**LIBOMP_FFLAGS** = + Additional Fortran compiler flags. + +Options for ``libomptarget`` +---------------------------- + +An installed LLVM package is a prerequisite for building ``libomptarget`` +library. So ``libomptarget`` may only be built in two cases: + +- As a project of a regular LLVM build via **LLVM_ENABLE_PROJECTS**, + **LLVM_EXTERNAL_PROJECTS**, or **LLVM_ENABLE_RUNTIMES** or +- as a standalone project build that uses a pre-installed LLVM package. + In this mode one has to make sure that the default CMake + ``find_package(LLVM)`` call `succeeds `_. + +**LIBOMPTARGET_OPENMP_HEADER_FOLDER** = ``""`` + Path of the folder that contains ``omp.h``. This is required for testing + out-of-tree builds. + +**LIBOMPTARGET_OPENMP_HOST_RTL_FOLDER** = ``""`` + Path of the folder that contains ``libomp.so``, and ``libLLVMSupport.so`` + when profiling is enabled. This is required for testing. + +Options for ``NVPTX device RTL`` +-------------------------------- + +**LIBOMPTARGET_NVPTX_ENABLE_BCLIB** = ``ON|OFF`` + Enable CUDA LLVM bitcode offloading device RTL. This is used for link time optimization of the OMP runtime and application code. This option is enabled + by default if the build system determines that `CMAKE_C_COMPILER` is able to compile and link the library. + +**LIBOMPTARGET_NVPTX_CUDA_COMPILER** = ``""`` + Location of a CUDA compiler capable of emitting LLVM bitcode. Currently only the Clang compiler is supported. This is only used when building the CUDA LLVM + bitcode offloading device RTL. If unspecified, either the Clang from the build itself is used (i.e. an in-tree build with LLVM_ENABLE_PROJECTS including + clang), or the Clang compiler that the build uses as C compiler (CMAKE_C_COMPILER; only if it is Clang). The latter is common for a stage2-build or when using -DLLVM_ENABLE_RUNTIMES=openmp. + +**LIBOMPTARGET_NVPTX_BC_LINKER** = ``""`` + Location of a linker capable of linking LLVM bitcode objects. This is only used when building the CUDA LLVM bitcode offloading device RTL. If + unspecified, either the llvm-link in that same directory as LIBOMPTARGET_NVPTX_CUDA_COMPILER is used, or the llvm-link from the same build (available in an in-tree build). + +**LIBOMPTARGET_NVPTX_ALTERNATE_HOST_COMPILER** = ``""`` + Host compiler to use with NVCC. This compiler is not going to be used to produce any binary. Instead, this is used to overcome the input compiler + checks done by NVCC. E.g. if using a default host compiler that is not compatible with NVCC, this option can be use to pass to NVCC a valid compiler + to avoid the error. + + **LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES** = ``35`` + List of CUDA compute capabilities that should be supported by the NVPTX device RTL. E.g. for compute capabilities 6.0 and 7.0, the option "60;70" + should be used. Compute capability 3.5 is the minimum required. + + **LIBOMPTARGET_NVPTX_DEBUG** = ``OFF|ON`` + Enable printing of debug messages from the NVPTX device RTL. + +**LIBOMPTARGET_LIT_ARGS** = ``""`` + Arguments given to lit. ``make check-libomptarget`` and ``make check-libomptarget-*`` are affected. For example, use ``LIBOMPTARGET_LIT_ARGS="-j4"`` to force ``lit`` to start only four parallel jobs instead of by default the number of threads in the system. + +Example Usages of CMake +======================= + +Typical Invocations +------------------- + +.. code-block:: console + + $ cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ .. + $ cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ .. + $ cmake -DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc .. + +Advanced Builds with Various Options +------------------------------------ + +- Build the i386 Linux* library using GCC* + + .. code-block:: console + + $ cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DLIBOMP_ARCH=i386 .. + +- Build the x86_64 debug Mac library using Clang* + + .. code-block:: console + + $ cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DLIBOMP_ARCH=x86_64 -DCMAKE_BUILD_TYPE=Debug .. + +- Build the library (architecture determined by probing compiler) using the Intel(R) C Compiler and the Intel(R) C++ Compiler. Also, create Fortran + modules with the Intel(R) Fortran Compiler. + + .. code-block:: console + + $ cmake -DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc -DCMAKE_Fortran_COMPILER=ifort -DLIBOMP_FORTRAN_MODULES=on .. + +- Have CMake find the C/C++ compiler and specify additional flags for the preprocessor and C++ compiler. + + .. code-blocks:: console + + $ cmake -DLIBOMP_CPPFLAGS='-DNEW_FEATURE=1 -DOLD_FEATURE=0' -DLIBOMP_CXXFLAGS='--one-specific-flag --two-specific-flag' .. + +- Build the stubs library + + .. code-blocks:: console + + $ cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DLIBOMP_LIB_TYPE=stubs .. + +**Footnotes** + +* Other names and brands may be claimed as the property of others. diff --git a/openmp/docs/install/install.rst b/openmp/docs/install/install.rst new file mode 100644 index 0000000000000..ae73ffe0b0ef7 --- /dev/null +++ b/openmp/docs/install/install.rst @@ -0,0 +1,89 @@ +.. meta:: + :description: Install OpenMP + :keywords: install, openmp, llvm, aomp, AMD, ROCm + + +************** +Installation +************** +The OpenMP toolchain is automatically installed as part of the standard ROCm installation and is available under /opt/rocm-{version}/llvm. The sub-directories are: + +* bin: Compilers (flang and clang) and other binaries. +* examples: The usage section below shows how to compile and run these programs. +* include: Header files. +* lib: Libraries including those required for target offload. +* lib-debug: Debug versions of the above libraries. + +Prerequisites +---------------- + +* Linux Kernel versions above 5.14 +* Latest KFD driver packaged in ROCm stack +* Xnack, as USM support can only be tested with applications compiled with Xnack capability + +Xnack capability +================= + +When enabled, Xnack capability allows GPU threads to access CPU (system) memory, allocated with OS-allocators, such as malloc, new, and mmap. Xnack must be enabled both at compile- and run-time. To enable Xnack support at compile-time, use: + +`--offload-arch=gfx908:xnack+` +Or use another functionally equivalent option Xnack-any: + +`--offload-arch=gfx908` +To enable Xnack functionality at runtime on a per-application basis, use environment variable: + +`HSA_XNACK=1` +When Xnack support is not needed: + +Building OpenMP +================ + +Build the applications to maximize resource utilization using: +`--offload-arch=gfx908:xnack-` +At runtime, set the HSA_XNACK environment variable to 0. + +Unified shared memory pragma +============================== + +This OpenMP pragma is available on MI200 through xnack+ support. + +omp requires unified_shared_memory +==================================== +As stated in the OpenMP specifications, this pragma makes the map clause on target constructs optional. By default, on MI200, all memory allocated on the host is fine grain. Using the map clause on a target clause is allowed, which transforms the access semantics of the associated memory to coarse grain. + +A simple program demonstrating the use of this feature is: + +$ cat parallel_for.cpp +#include +#include + +#define N 64 +#pragma omp requires unified_shared_memory + +.. code-block:: bash + + int main() { + int n = N; + int *a = new int[n]; + int *b = new int[n]; + + for(int i = 0; i < n; i++) + b[i] = i; + + #pragma omp target parallel for map(to:b[:n]) + for(int i = 0; i < n; i++) + a[i] = b[i]; + + for(int i = 0; i < n; i++) + if(a[i] != i) + printf("error at %d: expected %d, got %d\n", i, i+1, a[i]); + + return 0; + } + $ clang++ -O2 -target x86_64-pc-linux-gnu -fopenmp --offload-arch=gfx90a:xnack+ parallel_for.cpp + $ HSA_XNACK=1 ./a.out + +In the above code example, pointer “a” is not mapped in the target region, while pointer “b” is. Both are valid pointers on the GPU device and passed by-value to the kernel implementing the target region. This means the pointer values on the host and the device are the same. + +The difference between the memory pages pointed to by these two variables is that the pages pointed by “a” are in fine-grain memory, while the pages pointed to by “b” are in coarse-grain memory during and after the execution of the target region. This is accomplished in the OpenMP runtime library with calls to the ROCr runtime to set the pages pointed by “b” as coarse grain. + diff --git a/openmp/docs/reference/CommandLineArgumentReference.rst b/openmp/docs/reference/CommandLineArgumentReference.rst new file mode 100644 index 0000000000000..8c50482ca8e08 --- /dev/null +++ b/openmp/docs/reference/CommandLineArgumentReference.rst @@ -0,0 +1,186 @@ +OpenMP Command-Line Argument Reference +====================================== +Welcome to the OpenMP in LLVM command line argument reference. The content is +not a complete list of arguments but includes the essential command-line +arguments you may need when compiling and linking OpenMP. +Section :ref:`general_command_line_arguments` lists OpenMP command line options +for multicore programming while :ref:`offload_command_line_arguments` lists +options relevant to OpenMP target offloading. + +.. _general_command_line_arguments: + +OpenMP Command-Line Arguments +----------------------------- + +``-fopenmp`` +^^^^^^^^^^^^ +Enable the OpenMP compilation toolchain. The compiler will parse OpenMP +compiler directives and generate parallel code. + +``-fopenmp-extensions`` +^^^^^^^^^^^^^^^^^^^^^^^ +Enable all ``Clang`` extensions for OpenMP directives and clauses. A list of +current extensions and their implementation status can be found on the +`support `_ +page. + +``-fopenmp-simd`` +^^^^^^^^^^^^^^^^^ +This option enables OpenMP only for single instruction, multiple data +(SIMD) constructs. + +``-static-openmp`` +^^^^^^^^^^^^^^^^^^ +Use the static OpenMP host runtime while linking. + +``-fopenmp-version=`` +^^^^^^^^^^^^^^^^^^^^^^^^^^ +Set the OpenMP version to a specific version ```` of the OpenMP standard. +For example, you may use ``-fopenmp-version=45`` to select version 4.5 of +the OpenMP standard. The default value is ``-fopenmp-version=51`` for ``Clang``. + +.. _offload_command_line_arguments: + +Offloading Specific Command-Line Arguments +------------------------------------------ + +.. _fopenmp-targets: + +``-fopenmp-targets`` +^^^^^^^^^^^^^^^^^^^^ +| Specify which OpenMP offloading targets should be supported. For example, you + may specify ``-fopenmp-targets=amdgcn-amd-amdhsa,nvptx64``. This option is + often optional when :ref:`offload_arch` is provided. +| It is also possible to offload to CPU architectures, for instance with + ``-fopenmp-targets=x86_64-pc-linux-gnu``. + +.. _offload_arch: + +``--offload-arch`` +^^^^^^^^^^^^^^^^^^ +| Specify the device architecture for OpenMP offloading. For instance + ``--offload-arch=sm_80`` to target an Nvidia Tesla A100, + ``--offload-arch=gfx90a`` to target an AMD Instinct MI250X, or + ``--offload-arch=sm_80,gfx90a`` to target both. +| It is also possible to specify :ref:`fopenmp-targets` without specifying + ``--offload-arch``. In that case, the executables ``amdgpu-arch`` or + ``nvptx-arch`` will be executed as part of the compiler driver to + detect the device architecture automatically. +| Finally, the device architecture will also be automatically inferred with + ``--offload-arch=native``. + +``--offload-device-only`` +^^^^^^^^^^^^^^^^^^^^^^^^^ +Compile only the code that goes on the device. This option is mainly for +debugging purposes. It is primarily used for inspecting the intermediate +representation (IR) output when compiling for the device. It may also be used +if device-only runtimes are created. + +``--offload-host-only`` +^^^^^^^^^^^^^^^^^^^^^^^ +Compile only the code that goes on the host. With this option enabled, the +``.llvm.offloading`` section with embedded device code will not be included in +the intermediate representation. + +``--offload-host-device`` +^^^^^^^^^^^^^^^^^^^^^^^^^ +Compile the target regions for both the host and the device. That is the +default option. + +``-Xopenmp-target `` +^^^^^^^^^^^^^^^^^^^^^^^^^ +Pass an argument ```` to the offloading toolchain, for instance +``-Xopenmp-target -march=sm_80``. + +``-Xopenmp-target= `` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Pass an argument ```` to the offloading toolchain for the target +````. That is especially useful when an argument must differ for each +triple. For instance ``-Xopenmp-target=nvptx64 --offload-arch=sm_80 +-Xopenmp-target=amdgcn --offload-arch=gfx90a`` to specify the device +architecture. Alternatively, :ref:`Xarch_host` and :ref:`Xarch_device` can +pass an argument to the host and device compilation toolchain. + +``-Xoffload-linker `` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Pass an argument ```` to the offloading linker for the target specified in +````. + +.. _Xarch_device: + +``-Xarch_device `` +^^^^^^^^^^^^^^^^^^^^^^^ +Pass an argument ```` to the device compilation toolchain. + +.. _Xarch_host: + +``-Xarch_host `` +^^^^^^^^^^^^^^^^^^^^^ +Pass an argument ```` to the host compilation toolchain. + +``-foffload-lto[=]`` +^^^^^^^^^^^^^^^^^^^^^^^^^ +Enable device link time optimization (LTO) and select the LTO mode ````. +Select either ``-foffload-lto=thin`` or ``-foffload-lto=full``. Thin LTO takes +less time while still achieving some performance gains. If no argument is set, +this option defaults to ``-foffload-lto=full``. + +``-fopenmp-offload-mandatory`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +| This option is set to avoid generating the host fallback code + executed when offloading to the device fails. That is + helpful when the target contains code that cannot be compiled for the host, for + instance, if it contains unguarded device intrinsics. +| This option can also be used to reduce compile time. +| This option should not be used when one wants to verify that the code is being + offloaded to the device. Instead, set the environment variable + ``OMP_TARGET_OFFLOAD='MANDATORY'`` to confirm that the code is being offloaded to + the device. + +``-fopenmp-target-debug[=]`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Enable debugging in the device runtime library (RTL). Note that it is both +necessary to configure the debugging in the device runtime at compile-time with +``-fopenmp-target-debug=`` and enable debugging at runtime with the +environment variable ``LIBOMPTARGET_DEVICE_RTL_DEBUG=``. Further, it is +currently only supported for Nvidia targets as of July 2023. Alternatively, the +environment variable ``LIBOMPTARGET_DEBUG`` can be set to debug both Nvidia and +AMD GPU targets. For more information, see the +`debugging instructions `_. +The debugging instructions list the supported debugging arguments. + +``-fopenmp-target-jit`` +^^^^^^^^^^^^^^^^^^^^^^^ +| Emit code that is Just-in-Time (JIT) compiled for OpenMP offloading. Embed + LLVM-IR for the device code in the object files rather than binary code for the + respective target. At runtime, the LLVM-IR is optimized again and compiled for + the target device. The optimization level can be set at runtime with + ``LIBOMPTARGET_JIT_OPT_LEVEL``, for instance, + ``LIBOMPTARGET_JIT_OPT_LEVEL=3`` corresponding to optimizations level ``-O3``. + See the + `OpenMP JIT details `_ + for instructions on extracting the embedded device code before or after the + JIT and more. +| We want to emphasize that JIT for OpenMP offloading is good for debugging as + the target IR can be extracted, modified, and injected at runtime. + +``--offload-new-driver`` +^^^^^^^^^^^^^^^^^^^^^^^^ +In upstream LLVM, OpenMP only uses the new driver. However, enabling this +option for experimental linking with CUDA or HIP files is necessary. + +``--offload-link`` +^^^^^^^^^^^^^^^^^^ +Use the new offloading linker `clang-linker-wrapper` to perform the link job. +`clang-linker-wrapper` is the default offloading linker for OpenMP. This option +can be used to use the new offloading linker in toolchains that do not automatically +use it. It is necessary to enable this option when linking with CUDA or HIP files. + +``-nogpulib`` +^^^^^^^^^^^^^ +Do not link the device library for CUDA or HIP device compilation. + +``-nogpuinc`` +^^^^^^^^^^^^^ +Do not include the default CUDA or HIP headers, and do not add CUDA or HIP +include paths. diff --git a/openmp/docs/reference/api.rst b/openmp/docs/reference/api.rst new file mode 100644 index 0000000000000..e4c094694e5ec --- /dev/null +++ b/openmp/docs/reference/api.rst @@ -0,0 +1,12 @@ +.. meta:: + :description: Install OpenMP + :keywords: install, openmp, llvm, aomp, AMD, ROCm + +*************************** +OpenMP API specification +*************************** + +Refer to OpenMP API specification `website `_ for details on using OpenMP APIs in ROCm. + + + diff --git a/openmp/docs/reference/faq.rst b/openmp/docs/reference/faq.rst new file mode 100644 index 0000000000000..8fc0f097f5f02 --- /dev/null +++ b/openmp/docs/reference/faq.rst @@ -0,0 +1,362 @@ + +.. meta:: + :description: Install OpenMP + :keywords: install, openmp, llvm, aomp, AMD, ROCm + + + +FAQ +--- + +.. note:: + The FAQ is a work in progress and most of the expected content is not yet available. While you can expect changes, we always welcome feedback and + additions. You may post on the `Discourse forums (Runtimes - OpenMP) `__. + + +Q: How to contribute a patch to the webpage or any other part? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +All patches go through the regular `LLVM review process at `_. + + +.. _build_offload_capable_compiler: + +Q: How to build an OpenMP GPU offload capable compiler? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +To build an *effective* OpenMP offload capable compiler, only one extra CMake option, ``LLVM_ENABLE_RUNTIMES="openmp"``, is needed when building LLVM (Generic +information about building LLVM is available `here `__.). Make sure all backends that are targeted by OpenMP are enabled. That can be done by adjusting the CMake +option ``LLVM_TARGETS_TO_BUILD``. The corresponding targets for offloading to AMD and NVIDIA GPUs are ``"AMDGPU"`` and ``"NVPTX"``, respectively. By default, + +Clang will be built with all backends enabled. When building with ``LLVM_ENABLE_RUNTIMES="openmp"` + +.. note:: + +OpenMP should not be enabled in ``LLVM_ENABLE_PROJECTS`` because it is enabled by default. + +For Nvidia offload, see :ref:`build_nvidia_offload_capable_compiler`. + +For AMDGPU offload, see :ref:`build_amdgpu_offload_capable_compiler`. + +.. note:: + The compiler that generates the offload code should be the same (version) as the compiler that builds the OpenMP device runtimes. The OpenMP host runtime + can be built by a different compiler. + +.. _advanced_builds: https://llvm.org//docs/AdvancedBuilds.html + +.. _build_nvidia_offload_capable_compiler: + +Q: How to build an OpenMP Nvidia offload capable compiler? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +The Cuda SDK is required on the machine that will execute the openmp application. + +If your build machine is not the target machine or automatic detection of the available GPUs failed, you should also set: + +- ``LIBOMPTARGET_DEVICE_ARCHITECTURES=sm_,...`` where ```` is the numeric + compute capability of your GPU. + +For example, set + +``LIBOMPTARGET_DEVICE_ARCHITECTURES=sm_70,sm_80`` to target the NVIDIA Volta and Ampere architectures. + + +.. _build_amdgpu_offload_capable_compiler: + +Q: How to build an OpenMP AMDGPU offload capable compiler? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +A subset of the `ROCm `_ toolchain is required to build the LLVM toolchain and to execute the openmp application. +Either install ROCm somewhere that cmake's find_package can locate it, or build the required subcomponents ROCt and ROCr from source. + +The two components used are ROCT-Thunk-Interface, roct, and ROCR-Runtime, ROCr. Roct is the userspace part of the linux driver. It calls into the driver which +ships with the linux kernel. It is an implementation detail of ROCr from OpenMP's perspective. Rocr is an implementation of `HSA `_. + +.. code-block:: text + + SOURCE_DIR=same-as-llvm-source # e.g. the checkout of llvm-project, next to openmp + BUILD_DIR=somewhere + INSTALL_PREFIX=same-as-llvm-install + + cd $SOURCE_DIR + git clone git@github.com:RadeonOpenCompute/ROCT-Thunk-Interface.git -b roc-4.2.x \ + --single-branch + git clone git@github.com:RadeonOpenCompute/ROCR-Runtime.git -b rocm-4.2.x \ + --single-branch + + cd $BUILD_DIR && mkdir roct && cd roct + cmake $SOURCE_DIR/ROCT-Thunk-Interface/ -DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX \ + -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF + make && make install + + cd $BUILD_DIR && mkdir rocr && cd rocr + cmake $SOURCE_DIR/ROCR-Runtime/src -DIMAGE_SUPPORT=OFF \ + -DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX -DCMAKE_BUILD_TYPE=Release \ + -DBUILD_SHARED_LIBS=ON + make && make install + +``IMAGE_SUPPORT`` requires building rocr with clang and is not used by openmp. + +Provided cmake's find_package can find the ROCR-Runtime package, LLVM will build a tool ``bin/amdgpu-arch`` which will print a string like ``gfx906`` when +run if it recognises a GPU on the local system. LLVM will also build a shared library, libomptarget.rtl.amdgpu.so, which is linked against rocr. + +With those libraries installed, then LLVM build and installed, try: + +.. code-block:: shell + + clang -O2 -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa example.c -o example && ./example + +If your build machine is not the target machine or automatic detection of the available GPUs failed, you should also set: + +- ``LIBOMPTARGET_DEVICE_ARCHITECTURES=gfx,...`` where ```` is the + shader core instruction set architecture. For instance, set + ``LIBOMPTARGET_DEVICE_ARCHITECTURES=gfx906,gfx90a`` to target AMD GCN5 + and CDNA2 devices. + +Q: What are the known limitations of OpenMP AMDGPU offload? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +LD_LIBRARY_PATH or rpath/runpath are required to find libomp.so and libomptarget.so + +There is no libc. That is, malloc and printf do not exist. Libm is implemented in terms of the rocm device library, which will be searched for if linking with '-lm'. + +Some versions of the driver for the radeon vii (gfx906) will error unless the environment variable 'export HSA_IGNORE_SRAMECC_MISREPORT=1' is set. + +It is a recent addition to LLVM and the implementation differs from what is shipped in ROCm and AOMP for some time. Early adopters will encounter +bugs. + +Q: What are the LLVM components used in offloading and how are they found? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +The libraries used by an executable compiled for target offloading are: + +- ``libomp.so`` (or similar), the host openmp runtime +- ``libomptarget.so``, the target-agnostic target offloading openmp runtime +- plugins loaded by libomptarget.so: + + - ``libomptarget.rtl.amdgpu.so`` + - ``libomptarget.rtl.cuda.so`` + - ``libomptarget.rtl.x86_64.so`` + - and others + +- dependencies of those plugins, e.g. cuda/rocr for nvptx/amdgpu + +The compiled executable is dynamically linked against a host runtime. For example, ``libomp.so``, and against the target offloading runtime, ``libomptarget.so``. These +are found like any other dynamic library, by setting rpath or runpath on the executable, by setting ``LD_LIBRARY_PATH``, or by adding them to the system search. + +``libomptarget.so`` is only supported to work with the associated ``clang`` compiler. On systems with globally installed ``libomptarget.so`` this can be +problematic. For this reason it is recommended to use a `Clang configuration file `__ to +automatically configure the environment. For example, store the following file as ``openmp.cfg`` next to your ``clang`` executable. + +.. code-block:: text + + # Library paths for OpenMP offloading. + -L '/../lib' + -Wl,-rpath='/../lib' + +The plugins will try to find their dependencies in plugin-dependent fashion. + +The CUDA plugin is dynamically linked against libcuda if cmake found it at the compiler build time. Otherwise it will attempt to dlopen ``libcuda.so``. It does +not have rpath set. + +The amdgpu plugin is linked against ROCr if cmake found it at compiler build time. Otherwise it will attempt to dlopen ``libhsa-runtime64.so.1``. It has rpath +set to ``$ORIGIN``, so installing ``libhsa-runtime64.so.1`` in the same directory is a way to locate it without environment variables. + +In addition to those, there is a compiler runtime library called deviceRTL. This is compiled from mostly common code into an architecture specific +bitcode library, e.g. ``libomptarget-nvptx-sm_70.bc``. + +Clang and the deviceRTL need to match closely as the interface between them changes frequently. Using both from the same monorepo checkout is strongly +recommended. + +Unlike the host side which lets environment variables select components, the deviceRTL that is located in the clang lib directory is preferred. Only if +it is absent, the ``LIBRARY_PATH`` environment variable is searched to find a bitcode file with the right name. This can be overridden by passing a clang +flag, ``--libomptarget-nvptx-bc-path`` or ``--libomptarget-amdgcn-bc-path``. That can specify a directory or an exact bitcode file to use. + + +Q: Does OpenMP offloading support work in pre-packaged LLVM releases? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +For now, the answer is most likely *no*. Please see :ref:`build_offload_capable_compiler`. + +Q: Does OpenMP offloading support work in packages distributed as part of my OS? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +For now, the answer is most likely *no*. Please see :ref:`build_offload_capable_compiler`. + + +.. _math_and_complex_in_target_regions: + +Q: Does Clang support `` and `` operations in OpenMP target on GPUs? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Yes, LLVM/Clang allows math functions and complex arithmetic inside of OpenMP target regions that are compiled for GPUs. + +Clang provides a set of wrapper headers that are found first when `math.h` and `complex.h`, for C, `cmath` and `complex`, for C++, or similar headers are +included by the application. These wrappers will eventually include the system version of the corresponding header file after setting up a target device +specific environment. The fact that the system header is included is important because they differ based on the architecture and operating system and may +contain preprocessor, variable, and function definitions that need to be available in the target region regardless of the targeted device architecture. +However, various functions may require specialized device versions. For example, `sin`, and others are only available on certain devices, e.g., `__umul64hi`. To +provide "native" support for math and complex on the respective architecture, Clang will wrap the "native" math functions, e.g., as provided by the device +vendor, in an OpenMP begin/end declare variant. These functions will then be picked up instead of the host versions while host only variables and function +definitions are still available. Complex arithmetic and functions are support through a similar mechanism. It is worth noting that this support requires +`extensions to the OpenMP begin/end declare variant context selector `__ +that are exposed through LLVM/Clang to the user as well. + +Q: What is a way to debug errors from mapping memory to a target device? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +An experimental way to debug these errors is to use :ref:`remote process offloading `. +By using ``libomptarget.rtl.rpc.so`` and ``openmp-offloading-server``, it is possible to explicitly perform memory transfers between processes on the host +CPU and run sanitizers while doing so in order to catch these errors. + +Q: Can I use dynamically linked libraries with OpenMP offloading? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Dynamically linked libraries can be only used if there is no device code split between the library and application. Anything declared on the device inside the +shared library will not be visible to the application when it's linked. + +Q: How to build an OpenMP offload capable compiler with an outdated host compiler? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Enabling the OpenMP runtime will perform a two-stage build for you. If your host compiler is different from your system-wide compiler, you may need +to set ``CMAKE_{C,CXX}_FLAGS`` like ``--gcc-install-dir=/usr/lib/gcc/x86_64-linux-gnu/12`` so that clang will be able to find the correct GCC toolchain in the second stage of the build. + +For example, if your system-wide GCC installation is too old to build LLVM and you would like to use a newer GCC, set ``--gcc-install-dir=`` +to inform clang of the GCC installation you would like to use in the second stage. + +Q: How can I include OpenMP offloading support in my CMake project? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Currently, there is an experimental CMake find module for OpenMP target offloading provided by LLVM. It will attempt to find OpenMP target offloading +support for your compiler. The flags necessary for OpenMP target offloading will be loaded into the ``OpenMPTarget::OpenMPTarget_`` target or the +``OpenMPTarget__FLAGS`` variable if successful. Currently supported devices are ``AMDGPU`` and ``NVPTX``. + +To use this module, simply add the path to CMake's current module path and call ``find_package``. The module will be installed with your OpenMP installation by +default. Including OpenMP offloading support in an application should now only require a few additions. + +.. code-block:: cmake + + cmake_minimum_required(VERSION 3.20.0) + project(offloadTest VERSION 1.0 LANGUAGES CXX) + + list(APPEND CMAKE_MODULE_PATH "${PATH_TO_OPENMP_INSTALL}/lib/cmake/openmp") + + find_package(OpenMPTarget REQUIRED NVPTX) + + add_executable(offload) + target_link_libraries(offload PRIVATE OpenMPTarget::OpenMPTarget_NVPTX) + target_sources(offload PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/src/Main.cpp) + +Using this module requires at least CMake version 3.20.0. Supported languages are C and C++ with Fortran support planned in the future. Compiler support is +best for Clang but this module should work for other compiler vendors such as IBM, GNU. + +Q: What does 'Stack size for entry function cannot be statically determined' mean? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +This is a warning that the Nvidia tools will sometimes emit if the offloading region is too complex. Normally, the CUDA tools attempt to statically determine +how much stack memory each thread. This way when the kernel is launched each thread will have as much memory as it needs. If the control flow of the kernel +is too complex, containing recursive calls or nested parallelism, this analysis can fail. If this warning is triggered it means that the kernel may run out of +stack memory during execution and crash. The environment variable ``LIBOMPTARGET_STACK_SIZE`` can be used to increase the stack size if this occurs. + +Q: Can OpenMP offloading compile for multiple architectures? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Since LLVM version 15.0, OpenMP offloading supports offloading to multiple architectures at once. This allows for executables to be run on different +targets, such as offloading to AMD and NVIDIA GPUs simultaneously, as well as multiple sub-architectures for the same target. Additionally, static libraries +will only extract archive members if an architecture is used, allowing users to create generic libraries. + +The architecture can either be specified manually using ``--offload-arch=``. If ``--offload-arch=`` is present no ``-fopenmp-targets=`` flag is present then the +targets will be inferred from the architectures. Conversely, if ``--fopenmp-targets=`` is present with no ``--offload-arch`` then the target +architecture will be set to a default value, usually the architecture supported by the system LLVM was built on. + +For example, an executable can be built that runs on AMDGPU and NVIDIA hardware given that the necessary build tools are installed for both. + +.. code-block:: shell + + clang example.c -fopenmp --offload-arch=gfx90a --offload-arch=sm_80 + +If just given the architectures we should be able to infer the triples, otherwise we can specify them manually. + +.. code-block:: shell + + clang example.c -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa,nvptx64-nvidia-cuda \ + -Xopenmp-target=amdgcn-amd-amdhsa --offload-arch=gfx90a \ + -Xopenmp-target=nvptx64-nvidia-cuda --offload-arch=sm_80 + +When linking against a static library that contains device code for multiple architectures, only the images used by the executable will be extracted. + +.. code-block:: shell + + clang example.c -fopenmp --offload-arch=gfx90a,gfx90a,sm_70,sm_80 -c + llvm-ar rcs libexample.a example.o + clang app.c -fopenmp --offload-arch=gfx90a -o app + +The supported device images can be viewed using the ``--offloading`` option with ``llvm-objdump``. + +.. code-block:: shell + + clang example.c -fopenmp --offload-arch=gfx90a --offload-arch=sm_80 -o example + llvm-objdump --offloading example + + a.out: file format elf64-x86-64 + + OFFLOADING IMAGE [0]: + kind elf + arch gfx90a + triple amdgcn-amd-amdhsa + producer openmp + + OFFLOADING IMAGE [1]: + kind elf + arch sm_80 + triple nvptx64-nvidia-cuda + producer openmp + +Q: Can I link OpenMP offloading with CUDA or HIP? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +OpenMP offloading files can currently be experimentally linked with CUDA and HIP files. This will allow OpenMP to call a CUDA device function or vice-versa. +However, the global state will be distinct between the two images at runtime. This means any global variables will potentially have different values when +queried from OpenMP or CUDA. + +Linking CUDA and HIP currently requires enabling a different compilation mode for CUDA / HIP with ``--offload-new-driver`` and to link using +``--offload-link``. Additionally, ``-fgpu-rdc`` must be used to create a linkable device image. + +.. code-block:: shell + + clang++ openmp.cpp -fopenmp --offload-arch=sm_80 -c + clang++ cuda.cu --offload-new-driver --offload-arch=sm_80 -fgpu-rdc -c + clang++ openmp.o cuda.o --offload-link -o app + +Q: Are libomptarget and plugins backward compatible? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +No. libomptarget and plugins are now built as LLVM libraries starting from LLVM 15. Because LLVM libraries are not backward compatible, libomptarget and plugins +are not as well. Given that fact, the interfaces between 1) the Clang compiler and libomptarget, 2) the Clang compiler and device runtime library, and +3) libomptarget and plugins are not guaranteed to be compatible with an earlier version. Users are responsible for ensuring compatibility when not using the +Clang compiler and runtime libraries from the same build. Nevertheless, in order to better support third-party libraries and toolchains that depend on existing +libomptarget entry points, contributors are discouraged from making modifications to them. + +Q: Can I use libc functions on the GPU? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +LLVM provides basic ``libc`` functionality through the LLVM C Library. For building instructions, refer to the associated `LLVM libc documentation +`_. Once built, this provides a static library called ``libcgpu.a``. See the documentation for a +list of `supported functions `_ as well. To utilize these functions, simply link this library as any other when building +with OpenMP. + +.. code-block:: shell + + clang++ openmp.cpp -fopenmp --offload-arch=gfx90a -lcgpu + +For more information on how this is implemented in LLVM/OpenMP's offloading runtime, refer to the `runtime documentation `_. + +Q: What command line options can I use for OpenMP? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +We recommend taking a look at the OpenMP + +:doc:`command line argument reference ` page. + +Q: Why is my build taking a long time? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +When installing OpenMP and other LLVM components, the build time on multicore systems can be significantly reduced with parallel build jobs. As suggested in +*LLVM Techniques, Tips, and Best Practices*, one could consider using ``ninja`` as the generator. This can be done with the CMake option ``cmake -G Ninja``. Afterward, +use ``ninja install`` and specify the number of parallel jobs with ``-j``. The build time can also be reduced by setting the build type to ``Release`` with the +``CMAKE_BUILD_TYPE`` option. Recompilation can also be sped up by caching previous compilations. Consider enabling ``Ccache`` with ``CMAKE_CXX_COMPILER_LAUNCHER=ccache``. + +Q: Did this FAQ not answer your question? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Feel free to post questions or browse old threads at `LLVM Discourse `__. diff --git a/openmp/docs/sphinx/_toc.yml.in b/openmp/docs/sphinx/_toc.yml.in index 543d59fbf66df..44e1ea6ee5be2 100644 --- a/openmp/docs/sphinx/_toc.yml.in +++ b/openmp/docs/sphinx/_toc.yml.in @@ -2,10 +2,44 @@ defaults: numbered: False root: index subtrees: +- caption: Install + entries: + - file: install/install.rst + title: OpenMP installation + - file: install/build.rst + title: Build OpenMP + +- caption: Conceptual + entries: + - file: conceptual/openmp-features.rst + title: OpenMP features + - caption: Reference entries: - - file: CommandLineArgumentReference + - file: reference/api.rst + title: OpenMP API specification for parallel programming + - file: reference/faq.rst + title: OpenMP FAQ + - file: reference/CommandLineArgumentReference.rst + title: Command line argument reference + +- caption: How to + entries: + - file: how-to/use-openmp.rst + title: Use OpenMP + - file: how-to/use-rocprof.rst + title: Use ROCProf in OpenMP + - file: how-to/use-tracing-options.md + title: Use tracing options in OpenMP + +- caption: Tutorials + entries: + - url: https://github.com/ROCm/aomp/tree/aomp-dev/examples + title: AOMP samples + - caption: About entries: - - file: ReleaseNotes - - file: SupportAndFAQ + - file: license.md + + +