[Doc] Add performance.rst introducing the 'ti.block_dim' API #1691

Merged
merged 11 commits on Sep 3, 2020
3 changes: 2 additions & 1 deletion docs/index.rst
@@ -39,11 +39,12 @@ The Taichi Programming Language
meta
layout
sparse
offset
differentiable_programming
performance
odop
compilation
syntax_sugars
offset


.. toctree::
84 changes: 84 additions & 0 deletions docs/performance.rst
@@ -0,0 +1,84 @@

Performance tuning
==================

For-loop decorators
-------------------

In Taichi kernels, for-loops in the outermost scope are automatically parallelized.

However, there are some implementation details that determine **how it is parallelized**.

Taichi provides a few APIs to adjust these parameters.
This allows advanced users to manually fine-tune the performance.

For example, specifying a suitable ``ti.block_dim`` could yield an almost 3x performance boost in `examples/mpm3d.py <https://github.com/taichi-dev/taichi/blob/master/examples/mpm3d.py>`_.

.. note::

For performance profiling utilities, see :ref:`profiler`.

Thread hierarchy of GPUs
************************

GPUs have a **thread hierarchy**.

From small to large, the computation units are:
**iteration** < **thread** < **block** < **grid**.

- **iteration**:
An iteration is the **body of a for-loop**.
Each iteration corresponds to a specific value of ``i`` in the for-loop.

- **thread**:
Iterations are grouped into threads.
Threads are the smallest units that are parallelized.
All iterations within a thread are executed in **serial**.
We usually use one iteration per thread to maximize parallel performance.

- **block**:
Threads are grouped into blocks.
All threads within a block are executed in **parallel**.
Threads within the same block can share their **block local storage**.

- **grid**:
Blocks are grouped into grids.
A grid is the smallest unit that gets **launched** from the host.
All blocks within a grid are executed in **parallel**.
In Taichi, each **parallelized for-loop** is a grid.

For more details, please see `the CUDA C programming guide <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy>`_. The OpenGL and Metal backends follow a similar thread hierarchy.
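
For intuition, here is a back-of-the-envelope sketch (plain Python arithmetic, not a Taichi API) of how a parallelized for-loop maps onto these units, assuming the default of one iteration per thread; the numbers are purely illustrative:

.. code-block:: python

    # Illustrative arithmetic only; these are ordinary Python variables.
    num_iterations = 8192     # loop range of the parallelized for-loop (one grid)
    block_dim = 128           # threads per block, as set by ti.block_dim(128)

    num_threads = num_iterations            # one iteration per thread -> 8192 threads
    num_blocks = num_threads // block_dim   # 8192 // 128 = 64 blocks in the grid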

API reference
*************

Programmers may **prepend** decorators to tweak the properties of a for-loop, e.g.:

.. code-block:: python

    @ti.kernel
    def func():
        for i in range(8192):  # no decorator, use default settings
            ...

        ti.block_dim(128)      # change the property of the next for-loop:
        for i in range(8192):  # will be parallelized with block_dim=128
            ...

        for i in range(8192):  # no decorator, use default settings
            ...


Here is a list of available decorators:


.. function:: ti.block_dim(n)

:parameter n: (int) threads per block / block dimension

Specify the **threads per block** of the next parallel for-loop.


.. note::

The argument ``n`` must be a power of two for now.
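
For instance, here is a minimal sketch of how one might measure the effect of a block size, assuming a GPU backend with the kernel profiler from :ref:`profiler` enabled; the field shape, block size, and profiler call below are illustrative:

.. code-block:: python

    import taichi as ti

    ti.init(arch=ti.gpu, kernel_profiler=True)  # assumes a GPU backend is available

    x = ti.field(dtype=ti.f32, shape=8192)

    @ti.kernel
    def scale():
        ti.block_dim(256)        # an illustrative power-of-two block size
        for i in range(8192):    # parallelized with 256 threads per block
            x[i] = x[i] * 2.0

    scale()
    ti.kernel_profiler_print()   # compare timings after re-running with other block sizes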
10 changes: 9 additions & 1 deletion docs/profiler.rst
@@ -1,3 +1,4 @@
.. _profiler:

Profiler
========
@@ -43,6 +44,13 @@ The outputs would be:
[ 77.27%] compute_c4_0_kernel_2_serial min 0.004 ms avg 0.004 ms max 0.004 ms total 0.000 s [ 1x]


.. note::

Currently, the result of ``KernelProfiler`` may be incorrect on the OpenGL backend
due to its lack of support for ``ti.sync()``.



ScopedProfiler
##############

@@ -70,6 +78,6 @@ ScopedProfiler

``ti.print_profile_info()`` prints profiling results in a hierarchical format.
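
For example (a minimal sketch; the field and kernel below are only for illustration):

.. code-block:: python

    import taichi as ti

    ti.init(arch=ti.cpu)

    x = ti.field(dtype=ti.f32, shape=1024)

    @ti.kernel
    def fill():
        for i in x:
            x[i] = i * 0.5

    fill()
    ti.print_profile_info()  # prints the hierarchical ScopedProfiler results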

.. Note::
.. note::

``ScopedProfiler`` is a C++ class in the core of Taichi. It is not exposed to Python users.