From 22c980cf597a4204e836e538736342112243606f Mon Sep 17 00:00:00 2001
From: Gabriel Vainer <67750020+vai9er@users.noreply.github.com>
Date: Mon, 19 Dec 2022 05:39:28 -0500
Subject: [PATCH] [doc] Update performance.md (#6911)

Issue: #

### Brief Summary

Co-authored-by: Zhao Liang
---
 .../performance_tuning/performance.md | 102 +++++-------------
 1 file changed, 26 insertions(+), 76 deletions(-)

diff --git a/docs/lang/articles/performance_tuning/performance.md b/docs/lang/articles/performance_tuning/performance.md
index 01294856f53dc..526d4e567e082 100644
--- a/docs/lang/articles/performance_tuning/performance.md
+++ b/docs/lang/articles/performance_tuning/performance.md
@@ -6,13 +6,7 @@ sidebar_position: 2
 
 ## For-loop decorators
 
-In Taichi kernels, for-loop in the outermost scope is automatically
-parallelized. Our compiler automatically tunes the parameters to best explore
-the target architecture. Nevertheless, for Ninjas who strive for the last few %
-of performance, we also provide some APIs to allow developers fine-tune their
-applications. For example, specifying a suitable `block_dim` could yield an almost
-3x performance boost in
-[examples/mpm3d.py](https://github.com/taichi-dev/taichi/blob/master/python/taichi/examples/mpm3d.py).
+As discussed in previous topics, Taichi kernels automatically parallelize for-loops in the outermost scope. Our compiler tunes the relevant parameters automatically to make the best use of the target architecture. Nonetheless, for ninjas chasing the last few percent of performance, we provide several APIs that let developers fine-tune their programs. For example, specifying a proper `block_dim` can yield a nearly 3x speed gain in [examples/mpm3d.py](https://github.com/taichi-dev/taichi/blob/master/python/taichi/examples/mpm3d.py).
 
 You can use `ti.loop_config` to set the loop directives for the next for loop.
 Available directives are:
@@ -49,25 +43,17 @@ def fill():
 
 ### Background: Thread hierarchy of GPUs
 
-To better understand how the mentioned for-loop is parallelized, we briefly
-introduce the **thread hierarchy** on modern GPU architectures.
-
-From a fine-grained to a coarse-grained level, the computation units can be
-defined as: **iteration** < **thread** < **block** < **grid**.
-
-- **iteration**: An iteration is the **body of a for-loop**. Each
-  iteration corresponding to a specific `i` value in for-loop.
-- **thread**: Iterations are grouped into threads. A thread is the
-  minimal unit that is parallelized. All iterations within a thread
-  are executed in **serial**. We usually use 1 iteration per thread
-  for maximizing parallel performance.
-- **block**: Threads are grouped into blocks. All threads within a
-  block are executed in **parallel**. Threads within the same block
-  can share their **block local storage**.
-- **grid**: Blocks are grouped into grids. Grid is the minimal unit
-  that being **launched** from host. All blocks within a grid are
+It is worth briefly introducing the **thread hierarchy** of modern GPU architectures to help you understand how the previously mentioned for-loop is parallelized.
+
+From fine-grained to coarse-grained, the computation units are as follows: **iteration**, **thread**, **block**, **grid**.
+
+- **iteration**: The **body of a for-loop** is an iteration. Each iteration corresponds to a specific `i` value in the for-loop.
+- **thread**: Iterations are grouped into threads. A thread is the smallest unit of parallelization. All iterations inside a thread are executed in **serial**. To maximize parallel efficiency, we normally use one iteration per thread.
+- **block**: Threads are organized into blocks. All threads within a block are executed in **parallel**. Threads within the same block can share **block local storage**.
+- **grid**: Blocks are grouped into grids. A grid is the minimal unit
+  that is **launched** from the host. All blocks within a grid are
   executed in **parallel**. In Taichi, each **parallelized for-loop**
-  is a grid.
+  is represented as a grid.
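+In this hierarchy, `block_dim` is the number of threads in each block. The sketch below shows how you could hint it for the next outermost loop with `ti.loop_config`; the field size and the value `128` are illustrative only and are worth benchmarking on your own device:
+
+```python
+import taichi as ti
+
+ti.init(arch=ti.gpu)
+
+n = 1024 * 1024
+x = ti.field(ti.f32, shape=n)
+
+@ti.kernel
+def scale():
+    # Ask Taichi to launch the next parallel loop with 128 threads per block.
+    ti.loop_config(block_dim=128)
+    for i in range(n):
+        x[i] = i * 0.5
+
+scale()
+```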
 
 For more details, please see [the CUDA C programming guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy).
 
@@ -94,21 +80,11 @@ def func():
 
 ## Data layouts
 
-You might have been familiar with [Fields](../basic/field.md) in Taichi. Since
-Taichi decouples data structure from computation, developers have the
-flexibility to play with different data layouts. Like in other programming
-languages, selecting an efficient layout can drastically improve performance.
-For more information on advanced data layouts in Taichi, please
-see the [Fields (advanced)](../basic/layout.md) section.
+Because Taichi decouples data structures from computation, developers are free to experiment with different data layouts. As in other programming languages, choosing an efficient layout can significantly improve performance. Please consult the [Fields (advanced)](../basic/layout.md) section for further information on advanced data layouts in Taichi.
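+For instance, the same pair of particle attributes can be stored in either an SOA or an AOS fashion by changing only how the fields are placed, leaving the kernels that use them untouched. The sketch below is purely illustrative; the field names and size are not taken from any example in this article:
+
+```python
+import taichi as ti
+
+ti.init(arch=ti.gpu)
+
+n = 1024 * 1024
+pos_x = ti.field(ti.f32)
+pos_y = ti.field(ti.f32)
+
+# SOA layout: each attribute lives in its own contiguous array.
+ti.root.dense(ti.i, n).place(pos_x)
+ti.root.dense(ti.i, n).place(pos_y)
+
+# AOS alternative: interleave the two attributes element by element.
+# ti.root.dense(ti.i, n).place(pos_x, pos_y)
+```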
 
 ## Local Storage Optimizations
 
-Taichi comes with a few optimizations that leverage the *fast memory* (e.g. CUDA
-shared memory, L1 cache) for performance optimization. The idea is straightforward:
-Wherever possible, Taichi substitutes the access to the global memory (slow) with
-that to the local one (fast), and writes the data in the local memory (e.g., CUDA
-shared memory) back to the global memory in the end. Such transformations preserve
-the semantics of the original program (will be explained later).
+Taichi provides a few optimizations that make use of *fast memory* (e.g., CUDA shared memory, L1 cache). The idea is straightforward: wherever feasible, Taichi replaces access to global memory (slow) with access to local memory (fast), and writes the data held in local memory (e.g., CUDA shared memory) back to global memory at the end. Such transformations preserve the semantics of the original program (as explained later).
 
 ### Thread Local Storage (TLS)
 
@@ -131,19 +107,11 @@ def sum():
 sum()
 ```
 
-Internally, Taichi's parallel loop is implemented using
-[Grid-Stride Loops](https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/).
-What this means is that each physical CUDA thread could handle more than one item in `x`.
-That is, the number of threads launched for `sum` can be fewer than the shape of `x`.
+Taichi's parallel loop is implemented internally with [Grid-Stride Loops](https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/).
+This means that each physical CUDA thread may handle several items in `x`.
+In other words, the number of threads launched for `sum` can be smaller than the shape of `x`.
 
-One optimization enabled by this strategy is to substitute the global memory access
-with a *thread-local* one. Concretely, instead of directly and atomically adding
-`x[i]` into the destination `s[None]`, which resides in the global memory, Taichi
-preallocates a thread-local buffer upon entering the thread, accumulates
-(*non-atomically*) the value of `x` into this buffer, then adds the result of the
-buffer back to `s[None]` atomically before exiting the thread. Assuming each thread
-handles `N` items in `x`, the number of atomic adds is reduced to one-N-th its
-original size.
+One optimization enabled by this strategy is to replace the global memory access with a *thread-local* one. Instead of directly and atomically adding `x[i]` to the destination `s[None]`, which resides in global memory, Taichi preallocates a thread-local buffer upon entering the thread, accumulates (*non-atomically*) the values of `x` into this buffer, and then atomically adds the result of the buffer back to `s[None]` before exiting the thread. If each thread handles `N` items in `x`, the number of atomic additions is reduced to one-Nth of its original amount.
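+Taichi performs this rewrite automatically, so you never write it yourself. Conceptually, though, the transformed kernel behaves like the hand-written emulation below, in which every outer iteration plays the role of one thread; the sizes are illustrative and the real per-thread workload is chosen by the compiler:
+
+```python
+import taichi as ti
+
+ti.init(arch=ti.gpu)
+
+n = 1024 * 1024
+items_per_thread = 64  # illustrative; Taichi picks the actual value internally
+
+x = ti.field(ti.f32, shape=n)
+s = ti.field(ti.f32, shape=())
+
+@ti.kernel
+def sum_tls_sketch():
+    for t in range(n // items_per_thread):
+        local = 0.0  # thread-local buffer
+        for i in range(t * items_per_thread, (t + 1) * items_per_thread):
+            local += x[i]  # non-atomic accumulation into the local buffer
+        s[None] += local  # a single atomic add per emulated thread
+
+sum_tls_sketch()
+```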
 
 Additionally, the last atomic add to the global memory `s[None]` is optimized using CUDA's warp-level intrinsics, further reducing the number of required atomic adds.
 
@@ -157,10 +125,7 @@ of 8M floats on an Nvidia GeForce RTX 3090 card:
 * TLS disabled: 5.2 x 1e3 us
 * TLS enabled: 5.7 x 1e1 us
 
-TLS has led to an approximately 100x speedup. We also show that TLS reduction sum
-achieves comparable performance with CUDA implementations, see
-[benchmark](https://github.com/taichi-dev/taichi_benchmark/tree/main/reduce_sum) for
-details.
+TLS has resulted in an approximately 100x speedup. We also demonstrate that the TLS reduction sum achieves performance comparable to CUDA implementations; for more information, see the [benchmark](https://github.com/taichi-dev/taichi_benchmark/tree/main/reduce_sum).
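+If you would like to reproduce such a comparison on your own hardware, a simple wall-clock measurement is enough as a first approximation. The sketch below times a plain reduction; the field size and repetition count are illustrative, and `ti.sync()` makes sure all asynchronously launched kernels have finished before the clock is read:
+
+```python
+import time
+
+import taichi as ti
+
+ti.init(arch=ti.gpu)
+
+n = 8 * 1024 * 1024  # 8M floats, matching the setup quoted above
+x = ti.field(ti.f32, shape=n)
+s = ti.field(ti.f32, shape=())
+
+@ti.kernel
+def reduce():
+    for i in x:
+        s[None] += x[i]
+
+reduce()  # warm-up run so that compilation time is not measured
+ti.sync()
+
+start = time.perf_counter()
+for _ in range(100):
+    reduce()
+ti.sync()  # wait for all launched kernels to finish
+print((time.perf_counter() - start) / 100 * 1e6, "us per launch")
+```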
 
 ### Block Local Storage (BLS)
 
@@ -169,13 +134,7 @@ hierarchy matches `ti.root.(sparse SNode)+.dense`), Taichi will assign one CUDA
 thread block to each `dense` container (or `dense` block). BLS optimization works
 specifically for such kinds of fields.
 
-BLS aims to accelerate the stencil computation patterns by leveraging the CUDA
-shared memory. This optimization starts with the users annotating the set of fields
-they would like to cache via `ti.block_local`. Taichi then attempts to figure out
-the accessing range w.r.t the `dense` block of these annotated fields at
-*compile time*. If succeeded, Taichi generates code that first fetches all the
-accessed data in range into a *block local* buffer (CUDA's shared memory), then
-substitutes all the accesses to the corresponding slots into this buffer.
+BLS aims to accelerate stencil computation patterns by utilizing CUDA shared memory. This optimization begins with users annotating the set of fields they want to cache using `ti.block_local`. At *compile time*, Taichi tries to identify the accessing range of these annotated fields relative to the `dense` block. If Taichi succeeds, it generates code that first loads all of the accessed data in range into a *block local* buffer (CUDA's shared memory), then redirects all accesses to the corresponding slots in this buffer.
 
 Here is an example illustrating the usage of BLS. `a` is a sparse field with a block
 size of `4x4`.
 
@@ -203,30 +162,21 @@ buffer is shown below:
 
 ![](../static/assets/bls_indices_mapping.png)
 
-From a user's perspective, you do not need to worry about these underlying details.
-Taichi does all the inference and the global/block-local mapping automatically.
-That is, Taichi will preallocate a CUDA shared memory buffer of size `5x6`,
-pre-load `a`'s data into this buffer, and replace all the accesses to `a` (in the
-global memory) with the buffer in the loop body. While this simple example does
-not modify `a`, if a block-cached field does get written, Taichi would also generate
-code that writes the buffer back to the global memory.
+As a user, you do not need to worry about these underlying details.
+Taichi performs all the inference and the global/block-local mapping automatically.
+That is, Taichi preallocates a CUDA shared memory buffer of size `5x6`, preloads `a`'s contents into this buffer, and replaces all accesses to `a` (in global memory) with accesses to the buffer in the loop body. While this simple example does not modify `a`, if a block-cached field does get written, Taichi also generates code that writes the buffer back to global memory.
 
 :::note
 
-BLS does not come for free. Remember that BLS is designed for the stencil
-computation, where there are a large amount of overlapped accesses to the global
-memory. If this is not the case, the pre-loading/post-storing could actually
-hurt the performance.
+BLS does not come for free. Remember that BLS is intended for stencil computations with a large number of overlapping global memory accesses. If this is not the case, pre-loading and post-storing may actually degrade performance.
 
-On top of that, recent generations of Nvidia's GPU cards have been closing the gap
-on the read-only access between the global memory and the shared memory. Currently,
-we found BLS to be more effective for caching the destinations of the atomic operations.
+Furthermore, recent generations of Nvidia GPU cards have been closing the read-only access gap between global memory and shared memory. In practice, we have found BLS to be more effective for caching the destinations of atomic operations.
 
-As a rule of thumb, run benchmarks to decide whether to enable BLS or not.
+As a general rule of thumb, we recommend running benchmarks to decide whether to enable BLS.
 
 :::
 
 ## Offline Cache
 
-A Taichi kernel is implicitly compiled the first time it is called. The compilation results are kept in an *online* in-memory cache to reduce the overhead in the subsequent function calls. As long as the kernel function is unchanged, it can be directly loaded and launched. The cache, however, is no longer available when the program terminates. Then, if you run the program again, Taichi has to re-compile all kernel functions and reconstruct the *online* in-memory cache. And the first launch of a Taichi function is always slow due to the compilation overhead.
+The first time a Taichi kernel is called, it is implicitly compiled. To reduce the overhead of subsequent function calls, the compilation results are retained in an *online* in-memory cache. As long as the kernel function remains unchanged, the cached result can be loaded and launched directly. The cache, however, is no longer available once the program terminates. If you run the program again, Taichi has to recompile all kernel functions and rebuild the *online* in-memory cache, so the first launch of a Taichi function is always slow due to the compilation overhead.
 
 We address this problem by introducing the *offline* cache feature, which dumps and saves the compilation cache on disk for future runs. The first launch overhead can be drastically reduced in repeated runs.
 Taichi now constructs and maintains an offline cache by default, as well as providing several options in `ti.init()` for configuring the offline cache behavior.
 
 * `offline_cache: bool`: Enables or disables offline cache. Default: `True`.
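+For example, the offline cache can be configured when initializing Taichi. The snippet below only exercises the `offline_cache` option documented above; the second call is commented out and shown purely for illustration:
+
+```python
+import taichi as ti
+
+# Keep the default behavior: compilation results are cached on disk and reused
+# across runs, so repeated launches skip most of the compilation overhead.
+ti.init(arch=ti.gpu, offline_cache=True)
+
+# To force a full recompilation on every run instead, disable the cache:
+# ti.init(arch=ti.gpu, offline_cache=False)
+```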