[doc] Update performance.md #6911

Merged
merged 3 commits into from
Dec 19, 2022
102 changes: 26 additions & 76 deletions docs/lang/articles/performance_tuning/performance.md
@@ -6,13 +6,7 @@ sidebar_position: 2

## For-loop decorators

In Taichi kernels, for-loop in the outermost scope is automatically
parallelized. Our compiler automatically tunes the parameters to best explore
the target architecture. Nevertheless, for Ninjas who strive for the last few %
of performance, we also provide some APIs to allow developers fine-tune their
applications. For example, specifying a suitable `block_dim` could yield an almost
3x performance boost in
[examples/mpm3d.py](https://github.com/taichi-dev/taichi/blob/master/python/taichi/examples/mpm3d.py).
As discussed in previous topics, Taichi kernels automatically parallelize for-loops in the outermost scope. The compiler tunes the loop parameters automatically to make the best use of the target architecture. Nonetheless, for Ninjas seeking the last few percent of performance, we provide several APIs that let developers fine-tune their programs. For example, specifying a suitable `block_dim` can yield a nearly 3x speedup in [examples/mpm3d.py](https://github.com/taichi-dev/taichi/blob/master/python/taichi/examples/mpm3d.py).

You can use `ti.loop_config` to set the loop directives for the next for loop. Available directives are:

@@ -49,25 +43,17 @@ def fill():

### Background: Thread hierarchy of GPUs

To better understand how the mentioned for-loop is parallelized, we briefly
introduce the **thread hierarchy** on modern GPU architectures.

From a fine-grained to a coarse-grained level, the computation units can be
defined as: **iteration** < **thread** < **block** < **grid**.

- **iteration**: An iteration is the **body of a for-loop**. Each
iteration corresponding to a specific `i` value in for-loop.
- **thread**: Iterations are grouped into threads. A thread is the
minimal unit that is parallelized. All iterations within a thread
are executed in **serial**. We usually use 1 iteration per thread
for maximizing parallel performance.
- **block**: Threads are grouped into blocks. All threads within a
block are executed in **parallel**. Threads within the same block
can share their **block local storage**.
- **grid**: Blocks are grouped into grids. Grid is the minimal unit
that being **launched** from host. All blocks within a grid are
To help you understand how the for-loop mentioned above is parallelized, we briefly introduce the **thread hierarchy** of modern GPU architectures.

From fine-grained to coarse-grained, the computation units are as follows: **iteration**, **thread**, **block**, **grid**.

- **iteration**: An iteration is the **body of a for-loop**. Each iteration corresponds to a specific value of `i` in the for-loop.
- **thread**: Iterations are grouped into threads. A thread is the smallest unit that is parallelized. All iterations within a thread are executed in **serial**. To maximize parallel performance, we usually use one iteration per thread.
- **block**: Threads are grouped into blocks. All threads within a block are executed in **parallel**. Threads within the same block can share their **block local storage**.
- **grid**: Blocks are grouped into grids. A grid is the minimal unit
that is **launched** from the host. All blocks within a grid are
executed in **parallel**. In Taichi, each **parallelized for-loop**
is a grid.
is represented as a grid.

For more details, please see [the CUDA C programming
guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy).
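The hierarchy above can be sketched as plain index arithmetic. The following is a minimal pure-Python illustration, not Taichi code: the helper names `locate_iteration` and `blocks_in_grid` are hypothetical, and the contiguous assignment of iterations to threads is an assumption chosen for readability rather than a description of how Taichi actually schedules work.

```python
def locate_iteration(i, block_dim, iters_per_thread=1):
    """Map loop iteration i to (block index, thread index in block, serial slot).

    Assumes iterations are handed out contiguously: thread 0 gets the first
    `iters_per_thread` iterations, thread 1 the next ones, and so on.
    """
    thread_id = i // iters_per_thread    # global index of the thread running i
    slot = i % iters_per_thread          # position of i in that thread's serial work
    block_id = thread_id // block_dim    # block that owns the thread
    thread_in_block = thread_id % block_dim
    return block_id, thread_in_block, slot


def blocks_in_grid(n_iters, block_dim, iters_per_thread=1):
    """Number of blocks needed so that every iteration is covered."""
    n_threads = -(-n_iters // iters_per_thread)  # ceiling division
    return -(-n_threads // block_dim)


print(locate_iteration(300, block_dim=128))  # (2, 44, 0)
print(blocks_in_grid(1000, block_dim=128))   # 8
```

With one iteration per thread, a larger `block_dim` means fewer, larger blocks for the same loop, which is the trade-off that a `block_dim` setting adjusts.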
@@ -94,21 +80,11 @@ def func():

## Data layouts

You might have been familiar with [Fields](../basic/field.md) in Taichi. Since
Taichi decouples data structure from computation, developers have the
flexibility to play with different data layouts. Like in other programming
languages, selecting an efficient layout can drastically improve performance.
For more information on advanced data layouts in Taichi, please
see the [Fields (advanced)](../basic/layout.md) section.
Because Taichi decouples data structures from computation, developers are free to experiment with different data layouts. As in other programming languages, choosing an efficient layout can significantly improve performance. For more information on advanced data layouts in Taichi, see the [Fields (advanced)](../basic/layout.md) section.

## Local Storage Optimizations

Taichi comes with a few optimizations that leverage the *fast memory* (e.g. CUDA
shared memory, L1 cache) for performance optimization. The idea is straightforward:
Wherever possible, Taichi substitutes the access to the global memory (slow) with
that to the local one (fast), and writes the data in the local memory (e.g., CUDA
shared memory) back to the global memory in the end. Such transformations preserve
the semantics of the original program (will be explained later).
Taichi has a few optimizations that make use of *fast memory* (e.g., CUDA shared memory, L1 cache). The idea is simple: wherever possible, Taichi replaces accesses to global memory (slow) with accesses to local memory (fast), and writes the data in local memory (e.g., CUDA shared memory) back to global memory at the end. Such transformations preserve the semantics of the original program (explained later).

### Thread Local Storage (TLS)

@@ -131,19 +107,11 @@ def sum():
sum()
```

Internally, Taichi's parallel loop is implemented using
[Grid-Stride Loops](https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/).
What this means is that each physical CUDA thread could handle more than one item in `x`.
That is, the number of threads launched for `sum` can be fewer than the shape of `x`.
Taichi's parallel loop is implemented internally with [Grid-Stride Loops](https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/).
This means that each physical CUDA thread may handle more than one item in `x`.
In other words, the number of threads launched for `sum` can be smaller than the shape of `x`.

One optimization enabled by this strategy is to substitute the global memory access
with a *thread-local* one. Concretely, instead of directly and atomically adding
`x[i]` into the destination `s[None]`, which resides in the global memory, Taichi
preallocates a thread-local buffer upon entering the thread, accumulates
(*non-atomically*) the value of `x` into this buffer, then adds the result of the
buffer back to `s[None]` atomically before exiting the thread. Assuming each thread
handles `N` items in `x`, the number of atomic adds is reduced to one-N-th its
original size.
One optimization enabled by this strategy is to replace a global memory access with a *thread-local* one. Instead of directly and atomically adding `x[i]` to the destination `s[None]`, which resides in global memory, Taichi preallocates a thread-local buffer upon entering the thread, accumulates (*non-atomically*) the values of `x` into this buffer, and then atomically adds the result of the buffer back to `s[None]` before exiting the thread. If each thread handles `N` items in `x`, the number of atomic adds is reduced to one-Nth of its original amount.

Additionally, the last atomic add to the global memory `s[None]` is optimized using
CUDA's warp-level intrinsics, further reducing the number of required atomic adds.
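To make the saving concrete, here is a minimal pure-Python simulation of a grid-stride loop with and without a thread-local accumulator. It is a sketch under stated assumptions, not Taichi's implementation: the function `reduce_sum` and its parameters are hypothetical, the "threads" run one after another, an "atomic add" is simply counted, and the warp-level optimization mentioned above is not modeled.

```python
def reduce_sum(x, num_threads, use_tls):
    """Sum x with a simulated grid-stride loop and count global atomic adds."""
    s = 0            # stands in for s[None] in global memory
    atomic_adds = 0  # number of (simulated) atomic adds to s
    for tid in range(num_threads):
        local = 0        # thread-local buffer, used only when use_tls is True
        touched = False  # whether this thread processed any item
        # Grid-stride loop: thread tid handles items tid, tid + num_threads, ...
        for i in range(tid, len(x), num_threads):
            if use_tls:
                local += x[i]      # non-atomic, thread-local accumulation
                touched = True
            else:
                s += x[i]          # one atomic add per item
                atomic_adds += 1
        if use_tls and touched:
            s += local             # one atomic add per thread
            atomic_adds += 1
    return s, atomic_adds


x = list(range(1024))
print(reduce_sum(x, num_threads=64, use_tls=False))  # (523776, 1024)
print(reduce_sum(x, num_threads=64, use_tls=True))   # (523776, 64)
```

Here each simulated thread handles `N = 16` items, so the count of atomic adds drops from 1024 to 64, which is one-Nth of the original.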
@@ -157,10 +125,7 @@ of 8M floats on an Nvidia GeForce RTX 3090 card:
* TLS disabled: 5.2 x 1e3 us
* TLS enabled: 5.7 x 1e1 us

TLS has led to an approximately 100x speedup. We also show that TLS reduction sum
achieves comparable performance with CUDA implementations, see
[benchmark](https://github.com/taichi-dev/taichi_benchmark/tree/main/reduce_sum) for
details.
TLS yields an approximately 100x speedup (5.2 x 1e3 us versus 5.7 x 1e1 us, about 91x). We also show that the TLS reduction sum achieves performance comparable to CUDA implementations; see [benchmark](https://github.com/taichi-dev/taichi_benchmark/tree/main/reduce_sum) for details.

### Block Local Storage (BLS)

@@ -169,13 +134,7 @@ hierarchy matches `ti.root.(sparse SNode)+.dense`), Taichi will assign one CUDA
thread block to each `dense` container (or `dense` block). BLS optimization works
specifically for such kinds of fields.

BLS aims to accelerate the stencil computation patterns by leveraging the CUDA
shared memory. This optimization starts with the users annotating the set of fields
they would like to cache via `ti.block_local`. Taichi then attempts to figure out
the accessing range w.r.t the `dense` block of these annotated fields at
*compile time*. If succeeded, Taichi generates code that first fetches all the
accessed data in range into a *block local* buffer (CUDA's shared memory), then
substitutes all the accesses to the corresponding slots into this buffer.
BLS aims to accelerate stencil computation patterns by leveraging CUDA shared memory. This optimization starts with users annotating the set of fields they want to cache via `ti.block_local`. At *compile time*, Taichi then tries to determine the access range of these annotated fields relative to the `dense` block. If it succeeds, Taichi generates code that first fetches all the accessed data in range into a *block local* buffer (CUDA's shared memory), then redirects all accesses to the corresponding slots to this buffer.

Here is an example illustrating the usage of BLS. `a` is a sparse field with a
block size of `4x4`.
@@ -203,30 +162,21 @@ buffer is shown below:

![](../static/assets/bls_indices_mapping.png)

From a user's perspective, you do not need to worry about these underlying details.
Taichi does all the inference and the global/block-local mapping automatically.
That is, Taichi will preallocate a CUDA shared memory buffer of size `5x6`,
pre-load `a`'s data into this buffer, and replace all the accesses to `a` (in the
global memory) with the buffer in the loop body. While this simple example does
not modify `a`, if a block-cached field does get written, Taichi would also generate
code that writes the buffer back to the global memory.
As a user, you do not need to worry about these underlying details.
Taichi does all the inference and the global/block-local mapping automatically.
That is, Taichi will preallocate a CUDA shared memory buffer of size `5x6`, preload `a`'s data into this buffer, and replace all accesses to `a` (in global memory) with accesses to the buffer in the loop body. While this simple example does not modify `a`, if a block-cached field does get written, Taichi also generates code that writes the buffer back to global memory.
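A buffer size such as `5x6` comes from the block size plus the span of stencil offsets along each axis. The sketch below shows that arithmetic in pure Python. It is illustrative only: the helper names are hypothetical and are not Taichi APIs, and the offset ranges (-1 to 0 along the first axis, 0 to 2 along the second) are an assumed access pattern that happens to produce `5x6` for a `4x4` block.

```python
def bls_buffer_shape(block_shape, offset_ranges):
    """Buffer shape needed to cache every element a block's stencil can touch.

    offset_ranges holds one inclusive (lowest, highest) offset pair per axis.
    """
    return tuple(size + (hi - lo)
                 for size, (lo, hi) in zip(block_shape, offset_ranges))


def to_buffer_index(local_index, offset, offset_ranges):
    """Translate (index within the block) + (stencil offset) to a buffer index."""
    return tuple(idx + off - lo
                 for idx, off, (lo, _) in zip(local_index, offset, offset_ranges))


# Assumed access pattern: offsets -1..0 on the first axis, 0..2 on the second.
ranges = [(-1, 0), (0, 2)]
print(bls_buffer_shape((4, 4), ranges))          # (5, 6)
print(to_buffer_index((0, 0), (-1, 0), ranges))  # (0, 0)
print(to_buffer_index((3, 3), (0, 2), ranges))   # (4, 5)
```

The two corner cases map to the first and last slots of the buffer, so every access made by the block falls inside the preallocated region.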

:::note
BLS does not come for free. Remember that BLS is designed for the stencil
computation, where there are a large amount of overlapped accesses to the global
memory. If this is not the case, the pre-loading/post-storing could actually
hurt the performance.
BLS does not come for free. Remember that BLS is designed for stencil computations, where there is a large number of overlapping accesses to global memory. If this is not the case, the pre-loading and post-storing may actually hurt performance.

On top of that, recent generations of Nvidia's GPU cards have been closing the gap
on the read-only access between the global memory and the shared memory. Currently,
we found BLS to be more effective for caching the destinations of the atomic operations.
Furthermore, recent generations of Nvidia GPU cards have been closing the gap in read-only access between global memory and shared memory. Currently, we have found BLS to be more effective for caching the destinations of atomic operations.

As a rule of thumb, run benchmarks to decide whether to enable BLS or not.
As a rule of thumb, run benchmarks to decide whether to enable BLS.
:::

## Offline Cache

A Taichi kernel is implicitly compiled the first time it is called. The compilation results are kept in an *online* in-memory cache to reduce the overhead in the subsequent function calls. As long as the kernel function is unchanged, it can be directly loaded and launched. The cache, however, is no longer available when the program terminates. Then, if you run the program again, Taichi has to re-compile all kernel functions and reconstruct the *online* in-memory cache. And the first launch of a Taichi function is always slow due to the compilation overhead.
The first time a Taichi kernel is called, it is implicitly compiled. To reduce the overhead of subsequent function calls, the compilation results are kept in an *online* in-memory cache. As long as the kernel function remains unchanged, it can be loaded and launched directly. When the program terminates, however, the cache is no longer available. When you run the program again, Taichi must recompile all kernel functions and rebuild the *online* in-memory cache. Because of this compilation overhead, the first launch of a Taichi function is always slow.

We address this problem by introducing the *offline* cache feature, which dumps and saves the compilation cache on disk for future runs. The first launch overhead can be drastically reduced in repeated runs. Taichi now constructs and maintains an offline cache by default, as well as providing several options in `ti.init()` for configuring the offline cache behavior.
* `offline_cache: bool`: Enables or disables offline cache. Default: `True`.