[misc] Rc v1.1.0 patch4 #5725

Merged
merged 8 commits into from
Aug 10, 2022
249 changes: 249 additions & 0 deletions docs/lang/articles/advanced/quant.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,249 @@
---
sidebar_position: 3
---

# Using quantized data types

High-resolution simulations can deliver great visual quality, but they are often
limited by available memory, especially on GPUs. To save memory,
Taichi provides low-precision ("quantized") data types. You can define your own integers,
fixed-point numbers, or floating-point numbers with a non-standard number of bits, so
that you can choose the setting that uses the least memory your application can tolerate.
Taichi provides a suite of tailored domain-specific optimizations to keep the
runtime performance with quantized data types close to that with full-precision
data types.

:::note
Quantized data types are only supported on CPU and CUDA backends for now.
:::

## Quantized data types

### Quantized integers

Modern computers represent integers using the [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement)
format. *Quantized integers* in Taichi adopt the same format and can have
a non-standard number of bits:

```python
i10 = ti.types.quant.int(bits=10) # 10-bit signed (default) integer type
u5 = ti.types.quant.int(bits=5, signed=False) # 5-bit unsigned integer type
```
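
To make the effect of the bit width concrete, the following sketch computes the value range that an integer with a given number of bits can hold. It is plain Python, and `quant_int_range` is a hypothetical helper written for illustration; it is not part of Taichi's API.

```python
def quant_int_range(bits, signed=True):
    """Return (min, max) representable by an integer with the given bit width."""
    if signed:
        # Two's complement: one more bit pattern on the negative side.
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


print(quant_int_range(10))               # range of the 10-bit signed type
print(quant_int_range(5, signed=False))  # range of the 5-bit unsigned type
```

Values outside this range cannot be stored, so pick the bit width from the largest magnitude your data can reach.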

### Quantized fixed-point numbers

[Fixed-point numbers](https://en.wikipedia.org/wiki/Fixed-point_arithmetic) are
an old way to represent real numbers. The internal representation of a fixed-point number is simply an integer, and
its actual value equals the integer multiplied by a predefined scaling
factor. Building on the support for quantized integers, Taichi provides *quantized
fixed-point numbers* as follows:

```python
fixed_type_a = ti.types.quant.fixed(bits=10, max_value=20.0) # 10-bit signed (default) fixed-point type within [-20.0, 20.0]
fixed_type_b = ti.types.quant.fixed(bits=5, signed=False, max_value=100.0) # 5-bit unsigned fixed-point type within [0.0, 100.0]
fixed_type_c = ti.types.quant.fixed(bits=6, signed=False, scale=1.0) # 6-bit unsigned fixed-point type within [0, 64.0]
```

`scale` is the scaling factor mentioned above. Because fixed-point numbers are
especially useful when you know the actual value is guaranteed to be within a
range, Taichi allows you to simply set `max_value` and will calculate the
scaling factor accordingly.
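
The relationship between the stored integer, the scaling factor, and the actual value can be sketched in plain Python. The helpers below are hypothetical, and the formula used to derive the scale from `max_value` (dividing by the number of magnitude levels) is an assumption made for illustration; Taichi's internal calculation may differ in detail.

```python
def fixed_scale(bits, max_value, signed=True):
    """Scaling factor derived from max_value (assumed formula, for illustration)."""
    levels = 2 ** (bits - 1) if signed else 2 ** bits
    return max_value / levels


def fixed_encode(value, bits, scale, signed=True):
    """Round a real value to the nearest stored integer, clamping to the range."""
    lo = -(2 ** (bits - 1)) if signed else 0
    hi = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    return max(lo, min(hi, round(value / scale)))


def fixed_decode(stored, scale):
    """Recover the actual value: stored integer times scaling factor."""
    return stored * scale


scale = fixed_scale(bits=10, max_value=20.0)
stored = fixed_encode(1.0, bits=10, scale=scale)
print(scale, stored, fixed_decode(stored, scale))
```

The decoded value differs slightly from the input because only multiples of the scaling factor are representable; that spacing is the precision you trade for memory.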

### Quantized floating-point numbers

[Floating-point numbers](https://en.wikipedia.org/wiki/Floating-point_arithmetic)
are the standard way to represent real numbers on modern computers. A
floating-point number is composed of exponent bits, fraction bits, and a sign
bit. There are various floating-point formats:

![image](../static/assets/floating-point_formats.png)

In Taichi, you can define a *quantized floating-point number* with an arbitrary
combination of exponent bits and fraction bits (the sign bit counts as one of the
fraction bits):

```python
float_type_a = ti.types.quant.float(exp=5, frac=10) # 15-bit signed (default) floating-point type with 5 exponent bits
float_type_b = ti.types.quant.float(exp=6, frac=9, signed=False) # 15-bit unsigned floating-point type with 6 exponent bits
```

### Compute types

All the parameters you have seen above specify the *storage type* of a
quantized data type. However, most quantized data types have no native hardware
support, so a value of a quantized data type must be converted to
a primitive type (the "*compute type*") whenever it takes part in a computation.

The default compute type for quantized integers is `ti.i32`, while the default
compute type for quantized fixed-point/floating-point numbers is `ti.f32`. You
can change the compute type by specifying the `compute` parameter:

```python
i21 = ti.types.quant.int(bits=21, compute=ti.i64)
bfloat16 = ti.types.quant.float(exp=8, frac=8, compute=ti.f32)
```

## Data containers for quantized data types

Because the storage types are not primitive types, you may wonder how
quantized data types work together with the data containers that Taichi
provides. Taichi introduces some special constructs to bridge this gap.

### Bitpacked fields

`ti.BitpackedFields` packs a group of fields whose `dtype`s are
quantized data types so that together they are stored in one primitive type.
You can then place a `ti.BitpackedFields` instance under any SNode as if each member field
were placed individually.

```python
a = ti.field(float_type_a) # 15 bits
b = ti.field(fixed_type_b) # 5 bits
c = ti.field(fixed_type_c) # 6 bits
d = ti.field(u5) # 5 bits
bitpack = ti.BitpackedFields(max_num_bits=32)
bitpack.place(a, b, c, d) # 31 out of 32 bits occupied
ti.root.dense(ti.i, 10).place(bitpack)
```
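
Conceptually, bit packing concatenates the bit patterns of the member fields into one machine word. The following plain-Python sketch packs four unsigned integers with the widths used above (15, 5, 6, and 5 bits) into a 32-bit word. The `pack` and `unpack` helpers are hypothetical, and placing the first field in the lowest bits is an assumption; Taichi's actual bit layout may differ.

```python
# Field widths from the example above: a (15), b (5), c (6), d (5).
WIDTHS = (15, 5, 6, 5)


def pack(values, widths=WIDTHS, max_num_bits=32):
    """Pack unsigned integers into one word, first field in the lowest bits."""
    assert sum(widths) <= max_num_bits, "fields do not fit in the word"
    word, offset = 0, 0
    for value, width in zip(values, widths):
        assert 0 <= value < (1 << width), "value does not fit in its field"
        word |= value << offset
        offset += width
    return word


def unpack(word, widths=WIDTHS):
    """Extract every field from the packed word."""
    values, offset = [], 0
    for width in widths:
        values.append((word >> offset) & ((1 << width) - 1))
        offset += width
    return values


word = pack([12345, 17, 42, 9])
print(hex(word), unpack(word))
```

Reading or writing one member therefore costs a shift and a mask on the containing word, which is the work Taichi generates for you behind the field access.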

#### Shared exponent

When multiple fields with quantized floating-point types are packed together,
they can often share a common exponent. For example, in a 3D
velocity vector, if you know the x-component has a much larger absolute value
than the y- and z-components, then you probably do not care about the exact
values of the y- and z-components. In this case, using a shared exponent
leaves more bits for the components with larger absolute values. Use
`place(x, y, z, shared_exponent=True)` to make fields `x, y, z` share a common
exponent.
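
The trade-off can be illustrated with a small plain-Python sketch. It picks the exponent of the largest component and stores every component as a signed integer fraction relative to that exponent. The helper names and the exact rounding and clamping rules are assumptions made for illustration, not a description of Taichi's encoding.

```python
import math


def encode_shared_exponent(values, frac_bits):
    """Quantize values so that all of them share the exponent of the largest one."""
    largest = max(abs(v) for v in values)
    if largest == 0.0:
        return 0, [0] * len(values)
    # frexp returns (m, e) with largest == m * 2**e and 0.5 <= m < 1.
    _, exponent = math.frexp(largest)
    # One of the fraction bits is the sign, leaving frac_bits - 1 magnitude bits.
    step = 2.0 ** (exponent - (frac_bits - 1))
    limit = 2 ** (frac_bits - 1) - 1
    fractions = [max(-limit, min(limit, round(v / step))) for v in values]
    return exponent, fractions


def decode_shared_exponent(exponent, fractions, frac_bits):
    """Rebuild the real values from the shared exponent and the fractions."""
    step = 2.0 ** (exponent - (frac_bits - 1))
    return [f * step for f in fractions]


exponent, fractions = encode_shared_exponent([100.0, 0.5, 0.001], frac_bits=8)
print(exponent, fractions, decode_shared_exponent(exponent, fractions, 8))
```

The large component survives intact while the small ones collapse toward zero, which is acceptable only when the small components contribute little to the result.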

#### Your first program

You probably cannot wait to write your first Taichi program with quantized data
types. The easiest way is to modify the data definitions of an existing example.
Assume you want to save memory for
[examples/simulation/euler.py](https://github.com/taichi-dev/taichi/blob/master/python/taichi/examples/simulation/euler.py).
Because most data definitions in the example are similar, here only field `Q` is
used for illustration:

```python
Q = ti.Vector.field(4, dtype=ti.f32, shape=(N, N))
```

An element of `Q` currently occupies 4 x 32 = 128 bits. If you can fit it into
64 bits, the memory usage is halved. A straightforward first attempt is to
use quantized floating-point numbers with a shared exponent:

```python
float_type_c = ti.types.quant.float(exp=8, frac=14)
Q_old = ti.Vector.field(4, dtype=float_type_c)
bitpack = ti.BitpackedFields(max_num_bits=64)
bitpack.place(Q_old, shared_exponent=True)
ti.root.dense(ti.ij, (N, N)).place(bitpack)
```

Perhaps surprisingly, the change makes no obvious difference to the visual
result, and you have now written your first Taichi program with
quantized data types! Further experiments are left to you.

#### More complicated quantization schemes

Here is a more complicated scenario. In a 3D Eulerian fluid simulation, a
voxel may need to store a 3D vector for velocity and an integer value for the cell
category, which has three possible values: "source", "Dirichlet boundary", and
"Neumann boundary". You can store all of this information in a single 32-bit
`ti.BitpackedFields`:

```python
velocity_component_type = ti.types.quant.float(exp=6, frac=8, compute=ti.f32)
velocity = ti.Vector.field(3, dtype=velocity_component_type)

# Since there are only three cell categories, 2 bits are enough.
cell_category_type = ti.types.quant.int(bits=2, signed=False, compute=ti.i32)
cell_category = ti.field(dtype=cell_category_type)

voxel = ti.BitpackedFields(max_num_bits=32)
# Place three components of velocity into the voxel, and let them share the exponent.
voxel.place(velocity, shared_exponent=True)
# Place the 2-bit cell category.
voxel.place(cell_category)
# Create 512 x 512 x 256 voxels.
ti.root.dense(ti.ijk, (512, 512, 256)).place(voxel)
```

The compression scheme above allows you to store 13 bytes (4B x 3 + 1B) in
just 4 bytes. Note that you can still use `velocity` and `cell_category` in the
computation code as if they were `ti.f32` and `ti.u8`.

![image](../static/assets/bitpacked_fields_layout_example.png)
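
As a quick sanity check of the numbers in this example, the following plain-Python arithmetic adds up the bits per voxel and compares the total memory of the 512 x 512 x 256 grid with and without quantization. It uses only figures from the example above; the variable names are illustrative.

```python
num_voxels = 512 * 512 * 256

# Uncompressed: three ti.f32 velocity components plus one ti.u8 category.
bytes_full = num_voxels * (3 * 4 + 1)
# Quantized: 6 shared exponent bits + 3 x 8 fraction bits + 2 category bits.
bits_per_voxel = 6 + 3 * 8 + 2
bytes_quant = num_voxels * bits_per_voxel // 8

MIB = 1024 * 1024
print(f"full precision: {bytes_full / MIB:.0f} MiB")
print(f"quantized:      {bytes_quant / MIB:.0f} MiB")
```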

### Quant arrays

Bitpacked fields are laid out in array of structures (AOS) order.
However, there are also cases where values of a single quantized type need to be
laid out as an array. For example, you may want to store 8 x u4 values in a single
u32 to represent the bin values of a histogram:

![image](../static/assets/quant_array_layout_example.png)

A quant array is exactly what you need. A `quant_array` is an SNode that
reinterprets a primitive type as an array of a quantized type:

```python
bin_value_type = ti.types.quant.int(bits=4, signed=False)

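# A field of unsigned 4-bit bin values
bin_value = ti.field(dtype=bin_value_type)

# The quant array for 512 x 512 bin values: 8 values per 32-bit word along axis j
array = ti.root.dense(ti.ij, (512, 64)).quant_array(ti.j, 8, max_num_bits=32)
# Place the field (not the type) under the quant array
array.place(bin_value)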
```
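
What "reinterpreting a primitive type as an array" means can be sketched in plain Python: eight 4-bit bin values live side by side in one 32-bit word and are read or written with shifts and masks. The `load_bin` and `store_bin` helpers are hypothetical, and storing element 0 in the lowest bits is an assumption made for illustration.

```python
BITS = 4   # width of one bin value
COUNT = 8  # bin values per 32-bit word


def load_bin(word, index):
    """Read the 4-bit value stored at position `index` of a 32-bit word."""
    return (word >> (index * BITS)) & 0xF


def store_bin(word, index, value):
    """Return `word` with the 4-bit value at position `index` replaced."""
    assert 0 <= index < COUNT and 0 <= value < (1 << BITS)
    shift = index * BITS
    return (word & ~(0xF << shift)) | (value << shift)


word = 0
for i, v in enumerate([1, 0, 15, 7, 3, 3, 9, 2]):
    word = store_bin(word, i, v)
print([load_bin(word, i) for i in range(COUNT)])
```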

:::note
1. Only one field can be placed under a `quant_array`.
2. Only quantized integer types and quantized fixed-point types are supported as
the `dtype` of the field under a `quant_array`.
3. The size of the `dtype` of the field times the shape of the `quant_array`
must be less than or equal to the `max_num_bits` of the `quant_array`.
:::

#### Bit vectorization

For quant arrays of 1-bit quantized integer types ("boolean"), Taichi provides
an additional optimization: bit vectorization. It vectorizes
operations on such quant arrays in struct-for loops:

```python
u1 = ti.types.quant.int(1, False)
N = 512
M = 32
x = ti.field(dtype=u1)
y = ti.field(dtype=u1)
ti.root.dense(ti.i, N // M).quant_array(ti.i, M, max_num_bits=M).place(x)
ti.root.dense(ti.i, N // M).quant_array(ti.i, M, max_num_bits=M).place(y)

@ti.kernel
def assign_vectorized():
    ti.loop_config(bit_vectorize=True)
    # x and y are 1D fields of N bits, so the struct-for has a single index.
    for i in x:
        y[i] = x[i]  # 32 bits are handled at a time

assign_vectorized()
```
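
The idea behind bit vectorization can be shown without Taichi: copying `M` one-bit values individually gives the same result as copying the single `M`-bit word that contains them, but in one step instead of `M`. The sketch below is plain Python with made-up data and hypothetical helper names; it illustrates the principle and is not how Taichi implements the optimization.

```python
N, M = 512, 32

# x holds N one-bit values, stored as N // M words of M bits each.
x_words = [(0x9E3779B9 * (k + 1)) & 0xFFFFFFFF for k in range(N // M)]


def copy_bitwise(src):
    """Copy one bit at a time: N loop iterations."""
    dst = [0] * len(src)
    for i in range(N):
        word, bit = divmod(i, M)
        dst[word] |= ((src[word] >> bit) & 1) << bit
    return dst


def copy_wordwise(src):
    """Copy one word at a time: N // M iterations, M bits per step."""
    return [word for word in src]


print(copy_bitwise(x_words) == copy_wordwise(x_words))
```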

## Advanced examples

The following examples are picked from the
[QuanTaichi paper](https://yuanming.taichi.graphics/publication/2021-quantaichi/quantaichi.pdf),
so you can dig into details there.

### [Game of Life](https://github.com/taichi-dev/quantaichi/tree/main/gol)

![image](https://github.com/taichi-dev/quantaichi/raw/main/pics/teaser_gol.jpg)

### [Eulerian Fluid](https://github.com/taichi-dev/quantaichi/tree/main/eulerian_fluid)

![image](https://github.com/taichi-dev/quantaichi/raw/main/pics/smoke_result.png)

### [MLS-MPM](https://github.com/taichi-dev/taichi_elements/blob/master/demo/demo_quantized_simulation_letters.py)

![image](https://github.com/taichi-dev/quantaichi/raw/main/pics/mpm-235.jpg)
76 changes: 76 additions & 0 deletions docs/lang/articles/differentiable/differentiable_programming.md
Original file line number Diff line number Diff line change
Expand Up @@ -448,3 +448,79 @@ Check out [the DiffTaichi paper](https://arxiv.org/pdf/1910.00935.pdf)
and [video](https://www.youtube.com/watch?v=Z1xvAZve9aE) to learn more
about Taichi differentiable programming.
:::


## Forward-Mode Autodiff

There are two modes of automatic differentiation: forward mode and reverse mode. Forward mode computes a Jacobian-vector product (JVP), which yields one column of the Jacobian matrix at a time. Reverse mode computes a vector-Jacobian product (VJP), which yields one row of the Jacobian matrix at a time. Reverse mode is therefore more efficient for functions with more inputs than outputs, and forward mode is more efficient for functions with more outputs than inputs. `ti.ad.Tape` and `kernel.grad()` are built on reverse mode; Taichi autodiff also supports forward mode.

### Using `ti.ad.FwdMode`
The usage of `ti.ad.FwdMode` is very similar to that of `ti.ad.Tape`. The example below reuses the reverse-mode example above:

1. Set `needs_dual=True` when declaring the fields involved in the derivative chain.
2. Use the context manager `with ti.ad.FwdMode(loss=y, param=x):` to capture the kernel invocations that you want to differentiate automatically. `loss` and `param` are the output and the input of the function, respectively.
3. The value of dy/dx at the current `x` is then available in `y.dual[None]`.

The following code snippet illustrates the steps above:

```python
import taichi as ti
ti.init()

x = ti.field(dtype=ti.f32, shape=(), needs_dual=True)
y = ti.field(dtype=ti.f32, shape=(), needs_dual=True)


@ti.kernel
def compute_y():
    y[None] = ti.sin(x[None])


with ti.ad.FwdMode(loss=y, param=x):
    compute_y()

print('dy/dx =', y.dual[None], ' at x =', x[None])
```

:::note
The `dual` here refers to the *dual number* in mathematics. The name is used because forward-mode autodiff is equivalent to evaluating the function with dual numbers.
:::
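
The connection to dual numbers can be made concrete with a minimal plain-Python sketch. A dual number carries a value and a derivative, and every operation propagates both. The `Dual` class and `sin` wrapper below are written for illustration only; they are not part of Taichi and support just the operations needed here.

```python
import math


class Dual:
    """A dual number a + b*eps with eps**2 == 0; `dual` carries the derivative."""

    def __init__(self, primal, dual=0.0):
        self.primal = primal
        self.dual = dual

    def __add__(self, other):
        return Dual(self.primal + other.primal, self.dual + other.dual)

    def __mul__(self, other):
        # Product rule: (a + b eps)(c + d eps) = ac + (ad + bc) eps
        return Dual(self.primal * other.primal,
                    self.primal * other.dual + self.dual * other.primal)


def sin(x):
    # Chain rule: d sin(u) = cos(u) du
    return Dual(math.sin(x.primal), math.cos(x.primal) * x.dual)


x = Dual(0.0, 1.0)  # seed the input with dx/dx = 1
y = sin(x)
print("y =", y.primal, " dy/dx =", y.dual)
```

Seeding the input's dual part with 1 plays the same role as the default seed for a scalar `param`: the dual part of the output is then the derivative with respect to that input.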

:::note
The `ti.ad.FwdMode` automatically clears the dual field of `loss`.
:::

`ti.ad.FwdMode` supports multiple inputs and outputs. `param` can be an N-D field, and `loss` can be a single N-D field or a list of N-D fields. The `seed` argument is the "vector" in the Jacobian-vector product; it selects the parameter with respect to which the derivative is computed. The following example has two inputs and five outputs. With `seed=[1.0, 0.0]` or `seed=[0.0, 1.0]`, you compute the derivatives solely with respect to `x_0` or `x_1`, respectively.

```python
import taichi as ti
ti.init()
N_param = 2
N_loss = 5
x = ti.field(dtype=ti.f32, shape=N_param, needs_dual=True)
y = ti.field(dtype=ti.f32, shape=N_loss, needs_dual=True)


@ti.kernel
def compute_y():
    for i in range(N_loss):
        for j in range(N_param):
            y[i] += i * ti.sin(x[j])


# Compute derivatives with respect to x_0
with ti.ad.FwdMode(loss=y, param=x, seed=[1.0, 0.0]):
    compute_y()
print('dy/dx_0 =', y.dual, ' at x_0 =', x[0])

# Compute derivatives with respect to x_1
with ti.ad.FwdMode(loss=y, param=x, seed=[0.0, 1.0]):
    compute_y()
print('dy/dx_1 =', y.dual, ' at x_1 =', x[1])
```

:::note
The `seed` argument is required if the `param` is not a scalar field.
:::
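
To see what the seed selects, the Jacobian of the example function can be written out by hand. For `y[i] = sum_j i * sin(x[j])`, the entry in row `i` and column `j` is `i * cos(x[j])`, and multiplying the Jacobian by a one-hot seed picks out a single column. The plain-Python sketch below is a hand-derived check with hypothetical helper names; forward mode itself never builds the full Jacobian.

```python
import math

N_PARAM, N_LOSS = 2, 5


def jacobian(x):
    """J[i][j] = d y_i / d x_j for y_i = sum_j i * sin(x_j)."""
    return [[i * math.cos(x[j]) for j in range(N_PARAM)] for i in range(N_LOSS)]


def jvp(x, seed):
    """Jacobian-vector product J @ seed, one entry per output y_i."""
    return [sum(row[j] * seed[j] for j in range(N_PARAM)) for row in jacobian(x)]


x = [0.0, 0.0]
print(jvp(x, [1.0, 0.0]))  # column 0 of the Jacobian: dy/dx_0
print(jvp(x, [0.0, 1.0]))  # column 1 of the Jacobian: dy/dx_1
```

A seed that is not one-hot, such as `[1.0, 1.0]`, would yield a directional derivative, that is, the sum of both columns.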

:::tip
Similar to reverse mode autodiff, Taichi provides an API `ti.root.lazy_dual()` that automatically places the dual fields following the layout of their primal fields.
:::