[Doc] Add performance.rst introducing the 'ti.block_dim' API #1691

Merged
merged 11 commits on Sep 3, 2020
3 changes: 2 additions & 1 deletion docs/index.rst
@@ -39,11 +39,12 @@ The Taichi Programming Language
meta
layout
sparse
offset
differentiable_programming
performance
odop
compilation
syntax_sugars
offset


.. toctree::
84 changes: 84 additions & 0 deletions docs/performance.rst
@@ -0,0 +1,84 @@

Performance tuning
==================

For-loop decorators
-------------------

In Taichi kernels, for-loops in the outermost scope are automatically parallelized.

However, there are some implementation details that determine **how it is parallelized**.

Taichi provides a few APIs to adjust these parameters.
This allows advanced users to manually fine-tune the performance.

For example, specifying a suitable ``ti.block_dim`` could yield an almost 3x performance boost in `examples/mpm3d.py <https://github.com/taichi-dev/taichi/blob/master/examples/mpm3d.py>`_.

.. note::

For performance profiling utilities, see :ref:`profiler`.

Thread hierarchy of GPUs
************************

GPUs have a **thread hierarchy**.

From small to large, the computation units are:
**iteration** < **thread** < **block** < **grid**.

- **iteration**:
An iteration is the **body of a for-loop**.
Each iteration corresponds to a specific value of ``i`` in the for-loop.

- **thread**:
Iterations are grouped into threads.
Threads are the smallest units that are parallelized.
All iterations within a thread are executed in **serial**.
We usually use one iteration per thread to maximize parallel performance.

- **block**:
Threads are grouped into blocks.
All threads within a block are executed in **parallel**.
Threads within the same block can share their **block local storage**.

- **grid**:
Blocks are grouped into grids.
A grid is the smallest unit that gets **launched** from the host.
All blocks within a grid are executed in **parallel**.
In Taichi, each **parallelized for-loop** is a grid.

For more details, please see `the CUDA C programming guide <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy>`_. The OpenGL and Metal backends follow a similar thread hierarchy.
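
For intuition, here is a back-of-the-envelope sketch (plain Python arithmetic, not a Taichi API) of how a parallelized for-loop maps onto these units, assuming the default of one iteration per thread; the numbers are purely illustrative:

.. code-block:: python

    # Illustrative arithmetic only; these are ordinary Python variables.
    num_iterations = 8192     # loop range of the parallelized for-loop (one grid)
    block_dim = 128           # threads per block, as set by ti.block_dim(128)

    num_threads = num_iterations            # one iteration per thread -> 8192 threads
    num_blocks = num_threads // block_dim   # 8192 // 128 = 64 blocks in the grid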

API reference
*************

Programmers may **prepend** decorators to tweak the properties of a for-loop, e.g.:

.. code-block:: python

    @ti.kernel
    def func():
        for i in range(8192):  # no decorator, use default settings
            ...

        ti.block_dim(128)      # change the property of the next for-loop:
        for i in range(8192):  # will be parallelized with block_dim=128
            ...

        for i in range(8192):  # no decorator, use default settings
            ...


Here is a list of available decorators:


.. function:: ti.block_dim(n)

:parameter n: (int) threads per block / block dimension

Specify the **threads per block** of the next parallel for-loop.


.. note::

The argument ``n`` must be a power of two for now.
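
For instance, here is a minimal sketch of how one might measure the effect of a block size, assuming a GPU backend with the kernel profiler from :ref:`profiler` enabled; the field shape, block size, and profiler call below are illustrative:

.. code-block:: python

    import taichi as ti

    ti.init(arch=ti.gpu, kernel_profiler=True)  # assumes a GPU backend is available

    x = ti.field(dtype=ti.f32, shape=8192)

    @ti.kernel
    def scale():
        ti.block_dim(256)        # an illustrative power-of-two block size
        for i in range(8192):    # parallelized with 256 threads per block
            x[i] = x[i] * 2.0

    scale()
    ti.kernel_profiler_print()   # compare timings after re-running with other block sizes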
10 changes: 9 additions & 1 deletion docs/profiler.rst
@@ -1,3 +1,4 @@
.. _profiler:

Profiler
========
@@ -43,6 +44,13 @@ The outputs would be:
[ 77.27%] compute_c4_0_kernel_2_serial min 0.004 ms avg 0.004 ms max 0.004 ms total 0.000 s [ 1x]


.. note::

Currently, the result of ``KernelProfiler`` may be incorrect on the OpenGL backend
due to its lack of support for ``ti.sync()``.



ScopedProfiler
##############

@@ -70,6 +78,6 @@ ScopedProfiler

``ti.print_profile_info()`` prints profiling results in a hierarchical format.
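
For example (a minimal sketch; the field and kernel below are only for illustration):

.. code-block:: python

    import taichi as ti

    ti.init(arch=ti.cpu)

    x = ti.field(dtype=ti.f32, shape=1024)

    @ti.kernel
    def fill():
        for i in x:
            x[i] = i * 0.5

    fill()
    ti.print_profile_info()  # prints the hierarchical ScopedProfiler results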

.. Note::
.. note::

``ScopedProfiler`` is a C++ class in the core of Taichi. It is not exposed to Python users.